Introduction to XML

XML stands for Extensible Markup Language. XML is a metalanguage: a set of rules for creating tag-based markup languages (e.g. XHTML, SVG or MathML).

This page is part of a series of reference blog entries on XML and related technologies.

What is XML?

XML does not define the tags of the language. Instead, the XML specification defines the basic rules of the language: a tag must start with "<" and end with ">", how tags may be named, where attributes are placed, and so on. A well-formed XML document is one that satisfies these basic rules. XML also provides the means to define XML applications.
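As a minimal sketch of what "well-formed" means in practice, the snippet below asks a parser whether it accepts a document. It assumes Python's standard xml.etree.ElementTree module; the helper name is_well_formed and the sample documents are hypothetical and chosen only for illustration.

```python
import xml.etree.ElementTree as ET


def is_well_formed(text: str) -> bool:
    """Return True if the parser accepts `text` as XML, False otherwise."""
    try:
        ET.fromstring(text)
        return True
    except ET.ParseError:
        return False


# Hypothetical sample documents, not taken from any real vocabulary.
good = "<person><name>Ada</name></person>"
bad = "<person><name>Ada</person></name>"  # end tags are in the wrong order

print(is_well_formed(good))
print(is_well_formed(bad))
```

A check like this only covers the basic syntax rules. It says nothing about whether the tags make sense for a particular vocabulary.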

An XML application is a markup language built with XML: it defines a vocabulary (elements and attributes) and a grammar (how they are combined). A schema describes this structure. The two most common schema languages are DTD (Document Type Definition) and XML Schema. A valid XML document is well-formed and fulfils the rules of its schema. An invalid XML document does not.

These are the benefits of XML:

  • Provides a clear separation between data and its presentation.
  • Contains self-describing data. It is easy to understand the data semantics inside an XML document.
  • Facilitates interoperability between applications. XML is non-proprietary and easy to read and write.

Note: In XML, markup refers to tags, entity references, comments, CDATA section delimiters, document type declarations and processing instructions. Everything that is not markup is character data.

XML Document

An XML document has a prolog and a root element.

The prolog contains metadata about the XML document. The root element sits at the top of a tree of elements. Each element can have child elements, attributes and text data.

XML Prolog

The prolog contains metadata about the XML document. The XML declaration (<?xml ... ?>) states the XML version and the character encoding of the document. All XML processors are required to support the UTF-8 and UTF-16 encodings. If no encoding is declared and the document has no byte order mark, the parser assumes UTF-8.

The document type declaration (<!DOCTYPE ...>) names the root element and either contains or points to the DTD of the document. The DTD specifies the structure and vocabulary of the XML document.

The prolog is optional, but when present it must come before the root element, and the XML declaration must be the very first thing in the document.
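To make the prolog concrete, here is a small sketch that parses a document with both an XML declaration and a document type declaration. It assumes Python's standard xml.etree.ElementTree module; the document itself is a made-up example. Note that this parser reads the DTD but does not validate against it.

```python
import xml.etree.ElementTree as ET

# Hypothetical document: an XML declaration, a document type declaration
# with a small internal DTD, and then the root element.
# The text is stored as ISO-8859-1 bytes, so "\xe9" is the letter e-acute.
document = (
    b'<?xml version="1.0" encoding="ISO-8859-1"?>\n'
    b'<!DOCTYPE name [\n'
    b'  <!ELEMENT name (#PCDATA)>\n'
    b']>\n'
    b'<name>Jos\xe9</name>'
)

# The parser reads the encoding from the XML declaration and decodes the bytes.
root = ET.fromstring(document)
print(root.tag)
print(root.text)
```

Without the encoding declaration, the same bytes would be read as UTF-8 and the parser would reject the document, because a lone 0xE9 byte is not valid UTF-8.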

Elements and Attributes

Elements and attributes are the building blocks of an XML document.

Elements contain information and define the hierarchical structure of the XML document. An element has a start tag (e.g. <name>) and an end tag (e.g. </name>). Empty elements have no content and can use the abbreviated form (e.g. <name/>).

There are two types of elements. Simple elements contain plain text and cannot have child elements. Complex elements can contain other child elements.

Attributes provide additional information about the element. Attributes are name-value pairs of the form name="value" or name='value'. An attribute name cannot be repeated within the same element. The order in which attributes appear within the element is not significant.
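The rules above can be sketched in a few lines. The example assumes Python's standard xml.etree.ElementTree module, and the person vocabulary is hypothetical.

```python
import xml.etree.ElementTree as ET

# Hypothetical vocabulary used only for illustration.
# Both quoting styles for attribute values appear on the root element.
document = """
<person id="42" lang='en'>
  <name>Ada Lovelace</name>
  <nickname/>
</person>
""".strip()

person = ET.fromstring(document)

print(person.attrib)                 # attributes as a dict of name-value pairs
print(person.find("name").text)      # text content of a simple element
print(person.find("nickname").text)  # an empty element has no text (None)

# Repeating an attribute name breaks well-formedness, so parsing fails.
try:
    ET.fromstring('<person id="1" id="2"/>')
    duplicate_accepted = True
except ET.ParseError:
    duplicate_accepted = False

print(duplicate_accepted)
```

Exposing attributes as a dictionary fits the rule that their order carries no meaning: code should look them up by name, not by position.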

Entity References

An entity reference is a named placeholder of the form &name; that the XML parser replaces with the corresponding text when it processes the XML document. There are five predefined entity references in XML:

Character    Entity Reference
&            &amp;
<            &lt;
>            &gt;
"            &quot;
'            &apos;
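A short sketch of the round trip: escaping special characters before writing them into a document, and letting the parser substitute them back. It assumes Python's standard library (xml.sax.saxutils.escape and xml.etree.ElementTree); the sample text is made up.

```python
import xml.etree.ElementTree as ET
from xml.sax.saxutils import escape

raw = 'Tom & Jerry said "1 < 2"'

# escape() always replaces &, < and >; the extra mapping also covers quotes.
escaped = escape(raw, {'"': "&quot;", "'": "&apos;"})
print(escaped)

# The parser replaces each entity reference with its character again.
element = ET.fromstring(f"<note>{escaped}</note>")
print(element.text == raw)
```

Escaping quotes is only strictly needed inside attribute values, but doing it everywhere is harmless.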

CDATA Sections

A CDATA section contains text data and avoids having to escape special XML characters. A CDATA section starts with <![CDATA[ and ends with ]]>. Everything in between is treated as literal text.

Inside a CDATA section, the '<' character can therefore be used directly instead of its escaped version (&lt;).
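As an illustration, the two hypothetical documents below carry the same text, once inside a CDATA section and once with entity references. The sketch assumes Python's standard xml.etree.ElementTree module.

```python
import xml.etree.ElementTree as ET

# Same content written two ways (made-up example).
with_cdata = "<expr><![CDATA[a < b && b < c]]></expr>"
with_escapes = "<expr>a &lt; b &amp;&amp; b &lt; c</expr>"

text_from_cdata = ET.fromstring(with_cdata).text
text_from_escapes = ET.fromstring(with_escapes).text

print(text_from_cdata)
print(text_from_cdata == text_from_escapes)
```

After parsing, an application normally cannot tell which form was used. CDATA is a convenience for the author of the document, not a different kind of data.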

XML Namespaces

XML namespaces are used to group elements and attributes. They allow elements from different vocabularies to be mixed in the same XML document, so that two tags with the same name but different semantics can be used together. A namespace also makes it easy to identify the vocabulary an element belongs to.

Tags from different vocabularies can clash. Suppose two vocabularies both define the <title> tag but with very different semantics: in one vocabulary it represents the title of a person, and in the other it represents the title of the HTML document inside the person's notes.
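A sketch of such a clash, using a made-up person document and Python's standard xml.etree.ElementTree module:

```python
import xml.etree.ElementTree as ET

# Hypothetical document mixing a "person" vocabulary with HTML-like notes.
# No namespaces are declared, so both <title> elements have the same name.
document = """
<person>
  <title>Dr</title>
  <name>Ada Lovelace</name>
  <notes>
    <html>
      <head><title>Meeting notes</title></head>
      <body>Discussed the engine.</body>
    </html>
  </notes>
</person>
""".strip()

person = ET.fromstring(document)

# A search by tag name cannot tell the two meanings apart.
titles = [t.text for t in person.iter("title")]
print(titles)
```

Code that looks for the person's title has to rely on the position of the element in the tree, which is fragile once documents get larger.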

XML namespaces resolve this conflict. There are two ways to declare a namespace: as a default namespace or as a prefixed namespace.

The default namespace is declared using xmlns="...". The element that carries the declaration and all of its descendant elements without a prefix belong to this namespace.

A prefixed namespace is declared using xmlns:prefix="...". Only the elements and attributes that use the prefix belong to this namespace.

With namespaces, the two <title> elements can coexist in the same document: the person's title belongs to one namespace and the title of the HTML document to another.
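One way this could look is sketched below, with a default namespace for the person vocabulary and a prefixed namespace for the HTML part. The person namespace URI and the prefixes are assumptions made for this example; the XHTML namespace URI is the real one. The code uses Python's standard xml.etree.ElementTree module.

```python
import xml.etree.ElementTree as ET

# Hypothetical URI for the person vocabulary; a namespace URI only needs to
# be unique, it does not have to point at anything.
PERSON_NS = "http://example.com/person"
XHTML_NS = "http://www.w3.org/1999/xhtml"

# Default namespace for the person vocabulary, prefix "h" for XHTML.
document = f"""
<person xmlns="{PERSON_NS}" xmlns:h="{XHTML_NS}">
  <title>Dr</title>
  <name>Ada Lovelace</name>
  <notes>
    <h:html>
      <h:head><h:title>Meeting notes</h:title></h:head>
      <h:body>Discussed the engine.</h:body>
    </h:html>
  </notes>
</person>
""".strip()

person = ET.fromstring(document)

# Prefixes used in queries are local to this code; only the URIs must match.
ns = {"p": PERSON_NS, "h": XHTML_NS}
person_title = person.find("p:title", ns).text
html_title = person.find(".//h:title", ns).text

print(person.tag)
print(person_title)
print(html_title)
```

ElementTree reports each tag together with its namespace URI in curly braces, which shows that the prefix is only shorthand for the URI.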

Example XML Message

To conclude this introduction to XML, we show a more complex XML document below:
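As an illustrative stand-in, the sketch below builds a hypothetical order message that combines the features covered in this introduction: an XML declaration, a comment, default and prefixed namespaces, attributes, an entity reference, a CDATA section and an empty element. The vocabulary, the namespace URIs and all values are invented for this example, and the code assumes Python's standard xml.etree.ElementTree module.

```python
import xml.etree.ElementTree as ET

# Hypothetical order message; every name and value here is made up.
document = """<?xml version="1.0" encoding="UTF-8"?>
<!-- Order message sent from the web shop to the warehouse -->
<order xmlns="http://example.com/order"
       xmlns:c="http://example.com/customer"
       id="A-1001" currency="EUR">
  <c:customer>
    <c:name>Smith &amp; Sons</c:name>
  </c:customer>
  <items>
    <item sku="X1" quantity="2">Widget</item>
    <item sku="X2" quantity="1"><![CDATA[Bolt <M8>]]></item>
  </items>
  <shipped/>
</order>
"""

# Parse from bytes so the parser applies the declared encoding itself.
order = ET.fromstring(document.encode("utf-8"))
ns = {"o": "http://example.com/order", "c": "http://example.com/customer"}

order_id = order.get("id")  # unprefixed attributes are in no namespace
customer = order.find("c:customer/c:name", ns).text
items = [
    (item.get("sku"), int(item.get("quantity")), item.text)
    for item in order.findall("o:items/o:item", ns)
]
has_shipped_flag = order.find("o:shipped", ns) is not None

print(order_id, customer, items, has_shipped_flag)
```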

Eduard Manas

Eduard is a senior IT consultant with over 15 years in the financial sector. He is an experienced developer in Java, C#, Python, WordPress, Tibco EMS/RV, Oracle, Sybase and MySQL. Outside of work, he likes spending time with family and friends, and watching football.
