This page is part of a series of reference blog entries to understand XML and related technologies:
What is XML?
XML stands for Extensible Markup Language. XML is a metalanguage: a language for creating other markup languages, each with its own set of tags. Markup languages built with XML include XHTML, SVG and MathML.
XML does not define the tags of the language. Instead, the XML specification defines the basic syntax rules: a tag must start with "<" and end with ">", how tags may be named, where attributes are placed, and so on. A well-formed XML document is one that satisfies these basic rules. XML also provides the means to define XML applications.
An XML application is a set of tags that defines a vocabulary (elements and attributes) and its grammar (how they are combined). A schema defines the structure of the XML documents that belong to the application. The two most common schema languages are DTD (Document Type Definition) and XML Schema (XSD). A valid XML document is well-formed and also fulfils the rules of its schema. An invalid XML document does not.
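As a rough illustration of well-formedness, the sketch below uses Python's standard xml.etree.ElementTree parser, which raises ParseError when a document breaks the basic syntax rules. The helper name is_well_formed is made up for this example. Note that this checks well-formedness only; it does not validate against a schema.

```python
import xml.etree.ElementTree as ET


def is_well_formed(text: str) -> bool:
    """Return True if `text` parses as XML, False on a syntax error.

    Hypothetical helper for illustration; it does not check validity
    against a DTD or XML Schema.
    """
    try:
        ET.fromstring(text)
        return True
    except ET.ParseError:
        return False


# Every start tag has a matching end tag.
print(is_well_formed("<person><name>Jim Carter</name></person>"))
# <name> is never closed.
print(is_well_formed("<person><name>Jim Carter</person>"))
# Tag names are case-sensitive, so </Person> does not close <person>.
print(is_well_formed("<person></Person>"))
```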
These are the benefits of XML:
- Provides a clear separation between data and its presentation.
- Contains self-describing data. It is easy to understand the data semantics inside an XML document.
- Facilitates interoperability between applications. XML is non-proprietary and easy to read and write.
Note: In XML, markup refers to tags, entity references, comments, CDATA section delimiters, document type declarations and processing instructions. Everything that is not markup is character data.
XML Document
An XML document has a prolog and a root element.
The prolog contains metadata of the XML document. The root element sits at the top of a tree hierarchy of elements. Each element can have other child elements, attributes or text data.
    <?xml version="1.0"?>
    <!DOCTYPE person SYSTEM "simple.dtd">
    <person id="123">
      <name>Jim Carter</name>
      <title>super cool</title>
      <age>28</age>
    </person>
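To make the tree structure concrete, here is a minimal Python sketch that parses the person document with the standard xml.etree.ElementTree module and walks from the root element to its children. The DOCTYPE line is left out on purpose so the example does not depend on the simple.dtd file.

```python
import xml.etree.ElementTree as ET

# Same document as above, minus the DOCTYPE (an assumption made to keep
# the example independent of simple.dtd).
document = """<?xml version="1.0"?>
<person id="123">
  <name>Jim Carter</name>
  <title>super cool</title>
  <age>28</age>
</person>"""

root = ET.fromstring(document)

# The root element sits at the top of the tree and carries the attribute.
print(root.tag, root.attrib)

# Iterating over an element yields its child elements in document order.
for child in root:
    print(child.tag, "=", child.text)
```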
XML Prolog
The prolog contains metadata of the XML document. The XML declaration ( <?xml ... ?> ) defines the XML version and the encoding of the XML document. All XML processors are required to understand the UTF-8 and UTF-16 encodings. If no encoding is declared, the default is UTF-8 (or UTF-16 when the document starts with a byte order mark).
    <?xml version="1.0" encoding="UTF-8"?>
The document type declaration ( <!DOCTYPE> ) names the root element and points to the DTD of the document. The DTD specifies the structure and vocabulary of the XML document.
    <!DOCTYPE root SYSTEM "simple.dtd">
The prolog is optional, but if provided it must come first.
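The following Python sketch, using the standard xml.etree.ElementTree module, illustrates the prolog rules: bytes with no declared encoding are read as UTF-8, a declared UTF-16 encoding is honoured, and an XML declaration that does not come first is rejected. The sample name is invented for the example.

```python
import xml.etree.ElementTree as ET

# No encoding declared: the parser assumes UTF-8.
utf8_bytes = "<name>Jos\u00e9</name>".encode("utf-8")

# Encoding declared in the prolog, matching the actual bytes
# (Python's "utf-16" codec also writes a byte order mark).
utf16_bytes = (
    '<?xml version="1.0" encoding="UTF-16"?><name>Jos\u00e9</name>'
).encode("utf-16")

utf8_name = ET.fromstring(utf8_bytes).text
utf16_name = ET.fromstring(utf16_bytes).text
print(utf8_name == utf16_name)

# The XML declaration must be the very first thing in the document;
# even a leading newline makes the parser reject it.
try:
    ET.fromstring('\n<?xml version="1.0"?><name/>')
    late_declaration_rejected = False
except ET.ParseError:
    late_declaration_rejected = True
print(late_declaration_rejected)
```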
Elements and Attributes
The building blocks of an XML document are elements and attributes.
    <person id="123">
      <name>Jim Carter</name>
      <title>super cool</title>
      <age>28</age>
    </person>
Elements contain information and define the hierarchical structure of the XML document. An element has a start tag (e.g. <name>) and an end tag (e.g. </name>). Empty elements have no content and can use the abbreviated form (e.g. <name/>).
There are two types of elements. Simple elements contain plain text and cannot have child elements. Complex elements can contain other child elements.
    <name>Jim Carter</name>
    <name/>
    <name id="123"/>
    <name></name>
    <complex><simple>text</simple></complex>
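A quick way to see that the abbreviated and the expanded empty forms are equivalent is to parse both. This is only a sketch using Python's standard xml.etree.ElementTree; other parsers expose the same information through different APIs.

```python
import xml.etree.ElementTree as ET

abbreviated = ET.fromstring("<name/>")
expanded = ET.fromstring("<name></name>")

# Both forms produce an element with no text and no children.
print(abbreviated.text, len(abbreviated))
print(expanded.text, len(expanded))

# A complex element contains child elements; len() counts them.
complex_element = ET.fromstring("<complex><simple>text</simple></complex>")
print(len(complex_element), complex_element[0].tag, complex_element[0].text)
```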
Attributes provide additional information about the element. Attributes are name-value pairs of the form name="value" or name='value'. Attribute names cannot be repeated within the same element. The order in which attributes appear within the element is not significant.
    <person id="123" name="John Smith" />
    <person id='123' name='John Smith' />
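The attribute rules can be checked with a short Python sketch based on the standard xml.etree.ElementTree module: quote style and attribute order make no difference, while a repeated attribute name makes the document not well-formed.

```python
import xml.etree.ElementTree as ET

double_quoted = ET.fromstring('<person id="123" name="John Smith" />')
# Single quotes and a different attribute order on purpose.
single_quoted = ET.fromstring("<person name='John Smith' id='123' />")

# Attributes are exposed as a dict, so quoting and order do not matter.
same_attributes = double_quoted.attrib == single_quoted.attrib
print(same_attributes)

# Repeating an attribute name within one element is a syntax error.
try:
    ET.fromstring('<person id="123" id="456" />')
    duplicate_rejected = False
except ET.ParseError:
    duplicate_rejected = True
print(duplicate_rejected)
```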
Entity References
An entity reference is an abbreviation that the XML parser substitutes when it processes the XML document. There are five pre-defined entity references in XML:
Character | Entity Reference |
& | &amp; |
< | &lt; |
> | &gt; |
" | &quot; |
' | &apos; |
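As a small illustration, the Python sketch below shows the substitution in both directions: the parser replaces the five predefined entity references with their characters, and the standard helper xml.sax.saxutils.escape produces entity references when writing text. The sample strings are invented for the example.

```python
import xml.etree.ElementTree as ET
from xml.sax.saxutils import escape

# Reading: the parser substitutes each entity reference with its character.
element = ET.fromstring("<t>&amp; &lt; &gt; &quot; &apos;</t>")
print(element.text)

formula = ET.fromstring("<formula>i + 1 &lt; k + 1</formula>")
print(formula.text)

# Writing: escape() replaces &, < and > so the text is safe as element content.
escaped = escape("Tom & Jerry <cartoon>")
print(escaped)
```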
CDATA Sections
A CDATA section contains text data and avoids having to escape special XML characters. A CDATA section starts with <![CDATA[ and ends with ]]>. Everything inside is treated as text, as-is.
The CDATA section below uses the '<' character directly:
    <formula>
      <![CDATA[
        i + 1 < k + 1
      ]]>
    </formula>
Instead of the escaped version ( &lt; ):
    <formula>i + 1 &lt; k + 1</formula>
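The sketch below, again using Python's standard xml.etree.ElementTree, suggests that the CDATA form and the escaped form carry the same text once parsed, and that the raw '<' outside a CDATA section is rejected.

```python
import xml.etree.ElementTree as ET

with_cdata = ET.fromstring("<formula><![CDATA[i + 1 < k + 1]]></formula>")
with_entity = ET.fromstring("<formula>i + 1 &lt; k + 1</formula>")

# Both spellings yield the same character data.
print(with_cdata.text)
print(with_cdata.text == with_entity.text)

# Without CDATA or escaping, the bare '<' is a syntax error.
try:
    ET.fromstring("<formula>i + 1 < k + 1</formula>")
    raw_rejected = False
except ET.ParseError:
    raw_rejected = True
print(raw_rejected)
```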
XML Namespaces
XML namespaces are used to group elements and attributes. XML namespaces allow mixing elements from different vocabularies into the same XML document, so that two tags with the same name but different semantics can be used together. XML namespaces can also easily identify the domain of an element.
The example below shows what happens when tags from different vocabularies clash. Both vocabularies define the <title> tag but with very different semantics. For one vocabulary, it represents the title of the person. For the other vocabulary it represents the title of the HTML document inside the person’s notes.
    <person>
      <name>Jim Carter</name>
      <title>Mr</title>
      <age>28</age>
      <notes>
        <html>
          <head>
            <title>Jim Carter Notes</title>
          </head>
          <body>
            <ul>
              <li>Runner</li>
              <li>Fighter</li>
            </ul>
          </body>
        </html>
      </notes>
    </person>
XML namespaces resolve this conflict. There are two ways to declare a namespace: as a default namespace or as a prefixed namespace.
The default namespace is declared using xmlns="...". All child elements without a prefix also belong to this namespace; unprefixed attributes do not.
    <person xmlns="http://www.actimem.com/namespaces/person">
      <name>Jim Carter</name>
      <title>super cool</title>
      <age>28</age>
    </person>
A prefixed namespace is declared using xmlns:prefix="...". All tags that belong to this namespace must use the prefix.
    <p:person xmlns:p="http://www.actimem.com/namespaces/person">
      <p:name>Jim Carter</p:name>
      <p:title>super cool</p:title>
      <p:age>28</p:age>
    </p:person>
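The default and the prefixed forms describe the same elements; the prefix itself is only shorthand for the namespace URI. A minimal Python sketch with the standard xml.etree.ElementTree module makes this visible, since it reports tags in the form {namespace}localname.

```python
import xml.etree.ElementTree as ET

NS = "http://www.actimem.com/namespaces/person"

default_form = ET.fromstring(
    f'<person xmlns="{NS}"><name>Jim Carter</name></person>'
)
prefixed_form = ET.fromstring(
    f'<p:person xmlns:p="{NS}"><p:name>Jim Carter</p:name></p:person>'
)

# ElementTree expands every tag to {namespace}localname.
print(default_form.tag)

# Both declarations produce identical qualified names.
print(default_form.tag == prefixed_form.tag)
print(default_form[0].tag == prefixed_form[0].tag)
```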
Finally, this is the resulting XML using namespaces that resolves the above clash:
    <person xmlns="http://www.actimem.com/namespaces/person">
      <name>Jim Carter</name>
      <title>Mr</title>
      <age>28</age>
      <notes>
        <html:html xmlns:html="http://www.w3.org/1999/xhtml">
          <html:head>
            <html:title>Jim Carter Notes</html:title>
          </html:head>
          <html:body>
            <html:ul>
              <html:li>Runner</html:li>
              <html:li>Fighter</html:li>
            </html:ul>
          </html:body>
        </html:html>
      </notes>
    </person>
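With namespaces in place, a program can tell the two title elements apart. The Python sketch below uses a shortened copy of the document and the standard xml.etree.ElementTree module; the prefixes p and h in the lookup table are arbitrary choices for this example and need not match the prefixes used in the document.

```python
import xml.etree.ElementTree as ET

# Shortened copy of the document above.
document = """<person xmlns="http://www.actimem.com/namespaces/person">
  <name>Jim Carter</name>
  <title>Mr</title>
  <notes>
    <html:html xmlns:html="http://www.w3.org/1999/xhtml">
      <html:head>
        <html:title>Jim Carter Notes</html:title>
      </html:head>
    </html:html>
  </notes>
</person>"""

# Prefix-to-URI table used only for the queries below.
namespaces = {
    "p": "http://www.actimem.com/namespaces/person",
    "h": "http://www.w3.org/1999/xhtml",
}

root = ET.fromstring(document)
person_title = root.find("p:title", namespaces).text
html_title = root.find(".//h:title", namespaces).text

print(person_title)
print(html_title)
```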
Example XML Message
To conclude this introduction to XML, we show a more complex XML document below:
    <company>
      <name>Actimem</name>
      <address>135 Holborn Road, London, E1 5JK</address>
      <website>www.actimem.com</website>

      <director id="123">
        <name>John Smith</name>
      </director>

      <employees>
        <employee id="123">
          <name>John Smith</name>
          <birthday>1965-05-15</birthday>
          <yearsOfService>15</yearsOfService>
          <department>CEO</department>
          <comments>He is the best director this company has ever had</comments>
        </employee>

        <employee id="3721">
          <name>James Lloyd</name>
          <birthday>1981-03-23</birthday>
          <yearsOfService>5</yearsOfService>
          <department>Accounts</department>
        </employee>

        <employee id="1234">
          <name>Chris Harrison</name>
          <birthday>1975-12-01</birthday>
          <yearsOfService>2</yearsOfService>
          <department>IT</department>
          <comments>
            <![CDATA[
              He is a rookie. We should <b>watch out</b> he gets the job done.
            ]]>
          </comments>
        </employee>

        <employee id="5342">
          <name>James Lloyd</name>
          <birthday>1979-08-14</birthday>
          <yearsOfService>6</yearsOfService>
          <department>IT</department>
        </employee>
      </employees>
    </company>
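To show how an application might consume such a message, here is a hedged Python sketch that queries a shortened copy of the company document with the standard xml.etree.ElementTree module. The queries themselves (IT staff, the director's employee record, total years of service) are invented for illustration.

```python
import xml.etree.ElementTree as ET

# Shortened copy of the company document (three employees, fewer fields).
document = """<company>
  <name>Actimem</name>
  <director id="123"><name>John Smith</name></director>
  <employees>
    <employee id="123">
      <name>John Smith</name>
      <yearsOfService>15</yearsOfService>
      <department>CEO</department>
    </employee>
    <employee id="1234">
      <name>Chris Harrison</name>
      <yearsOfService>2</yearsOfService>
      <department>IT</department>
    </employee>
    <employee id="5342">
      <name>James Lloyd</name>
      <yearsOfService>6</yearsOfService>
      <department>IT</department>
    </employee>
  </employees>
</company>"""

company = ET.fromstring(document)

# Names of everyone in the IT department, in document order.
it_staff = [
    employee.findtext("name")
    for employee in company.iter("employee")
    if employee.findtext("department") == "IT"
]
print(it_staff)

# Follow the director's id attribute to the matching employee record.
director_id = company.find("director").get("id")
director_record = company.find(f"./employees/employee[@id='{director_id}']")
print(director_record.findtext("department"))

# Element text is always a string, so numbers need explicit conversion.
total_years = sum(
    int(employee.findtext("yearsOfService"))
    for employee in company.iter("employee")
)
print(total_years)
```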
Eduard Manas