Document-level information


Well-formed documents

 document ::= prolog element Misc*

<!-- proper nesting -->
<root>This sentence <important>must be read together with the
following <super>superword</super></important></root>
<!-- improper nesting -->
<root>This sentence <important>must be read together with the
following <super>superword</important></super></root>
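
The difference between the two cases can be checked with any XML parser. The following is a minimal sketch that assumes Python's standard xml.etree.ElementTree module, which is not one of the parsers discussed here; the helper name is_well_formed is hypothetical.

```python
import xml.etree.ElementTree as ET

def is_well_formed(text):
    """Return True if the parser accepts text as a well-formed document."""
    try:
        ET.fromstring(text)
        return True
    except ET.ParseError:
        return False

proper = ("<root>This sentence <important>must be read together with the "
          "following <super>superword</super></important></root>")
improper = ("<root>This sentence <important>must be read together with the "
            "following <super>superword</important></super></root>")

print(is_well_formed(proper))    # properly nested elements
print(is_well_formed(improper))  # overlapping elements are rejected
```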
 

The characters and their encodings

Characters (Char) can be part of tags, attribute names, or character data. The Unicode encodings UTF-16 and UTF-8 are recognised implicitly by XML parsers. All other encodings must be declared in the XML declaration.

Consider the following small file containing non-ASCII characters (Latin-1, also known as ISO/IEC 8859-1):
<?xml version="1.0" encoding="ISO-8859-1" standalone="yes"?>
<!DOCTYPE bonneidée [ 
  <!ELEMENT bonneidée (#PCDATA)>
]>
<bonneidée>XML est une excellente idée !</bonneidée>

We note the encoding="ISO-8859-1" specification on the first line (the XML declaration). When we feed this file to some of the parsers, this is what we get:
>xml4j.sh  bonneidee.xml
bonneidee.xml: 345 ms (1 elems, 0 attrs, 0 spaces, 28 chars)

>xp.sh bonneidee.xml
<bonneidée>XML est une excellente idée !</bonneidée>

>aelfred.sh bonneidee.xml
Start document
Resolving entity: pubid=null, sysid=file:/user/goossens/bonneidee.xml
Starting external entity:  file:/user/goossens/bonneidee.xml
Doctype declaration:  bonneidée, pubid=null, sysid=null
Start element:  name=bonneidée
Character data:  "XML est une excellente idée !"
End element:  bonneidée
Ending external entity:  file:/user/goossens/bonneidee.xml
End document

>nsgmls xml.dcl bonneidee.xml 
?xml version="1.0" encoding="ISO-8859-1" standalone="yes"
(bonneidée
-XML est une excellente idée !
)bonneidée
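
The same experiment can be repeated without the command-line tools above. The sketch below assumes Python's standard xml.etree.ElementTree module; it stores the document as Latin-1 bytes, as it would be on disk, and relies on the encoding declaration for decoding.

```python
import xml.etree.ElementTree as ET

source = (
    '<?xml version="1.0" encoding="ISO-8859-1" standalone="yes"?>\n'
    '<!DOCTYPE bonneidée [\n'
    '  <!ELEMENT bonneidée (#PCDATA)>\n'
    ']>\n'
    '<bonneidée>XML est une excellente idée !</bonneidée>\n'
)
# Encode as Latin-1: each accented letter becomes a single byte.
raw = source.encode("iso-8859-1")

# The parser reads the encoding declaration and decodes the bytes itself.
root = ET.fromstring(raw)
print(root.tag)
print(root.text)
```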

Common Syntactic Constructs

White Space

White space consists of one or more space (#x20) characters, carriage returns, line feeds, or tabs.

Names and Tokens

Unicode classifies characters for convenience as letters, digits, or other characters. Letters consist of an alphabetic or syllabic base character possibly followed by one or more combining characters, or of an ideographic character.

A Name is a token beginning with a letter or one of a few punctuation characters, and continuing with letters, digits, hyphens, underscores, colons, or full stops, together known as name characters.

Names beginning with the letters ‘xml’ (in any uppercase/lowercase combinations) are reserved. Names containing the colon character are reserved for specifying the namespace.
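
As a rough illustration of these rules, the sketch below tests names in Python. It is an ASCII-only approximation: real XML names may also contain many non-ASCII letters (bonneidée, for example, would be rejected here), and the function names are hypothetical.

```python
import re

# ASCII-only approximation: first a letter, underscore, or colon,
# then letters, digits, full stops, hyphens, underscores, or colons.
ASCII_NAME = re.compile(r"[A-Za-z_:][A-Za-z0-9._:-]*")

def is_ascii_name(token):
    """Return True if token matches the ASCII approximation of a Name."""
    return ASCII_NAME.fullmatch(token) is not None

def is_reserved(name):
    """Names starting with 'xml' in any letter case are reserved."""
    return name.lower().startswith("xml")

print(is_ascii_name("welcome"), is_ascii_name("1abc"), is_reserved("XmlFoo"))
```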

Literals

Literal data is any quoted string not containing the quotation mark used as a delimiter for that string. Literals are used for specifying the content of internal entities (EntityValue), the values of attributes (AttValue), and external identifiers (SystemLiteral). Note that a SystemLiteral can be parsed without scanning for markup.

Character Data and Markup

Text consists of intermingled character data and markup. Markup takes the form of start-tags, end-tags, empty-element tags, entity references, character references, comments, CDATA section delimiters, document type declarations, and processing instructions.

The ampersand character (&) and the left angle bracket (<) may appear in their literal form only when used as markup delimiters, or within a comment, a processing instruction, or a CDATA section. Elsewhere, they must be escaped using either numeric character references or the strings &amp; and &lt;.

On the other hand, greater-than (>) may be escaped as &gt;, the single quote (') may be written &apos;, and the double quote (") may be written &quot;.
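
Escaping is usually left to a library rather than done by hand. The following sketch assumes Python's standard xml.sax.saxutils and xml.etree.ElementTree modules; the element name code and the attribute name note are made up for the example.

```python
from xml.sax.saxutils import escape, quoteattr
import xml.etree.ElementTree as ET

raw = 'if (a < b && c > d) print "ok"'

# escape() replaces &, < and > by the corresponding entity references.
escaped = escape(raw)
print(escaped)

# quoteattr() escapes the value and picks a suitable quote delimiter.
attr = quoteattr(raw)

# Round trip: the parser turns the references back into characters.
elem = ET.fromstring("<code note=%s>%s</code>" % (attr, escaped))
print(elem.text == raw, elem.get("note") == raw)
```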

Comments

Comments may appear anywhere in a document outside other markup or in the DTD at places allowed by the grammar. They are not part of the document's character data (an XML processor may, but need not, make it possible for an application to retrieve the text of comments). The string -- (double-hyphen) must not occur within comments.

An example of a comment is the following:
<!-- A <super>super & grandiose</super> comment -->
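
Both properties can be observed with a parser. The sketch below assumes Python's standard xml.etree.ElementTree module, whose default tree builder does not keep comments; the wrapping p element is added for the example.

```python
import xml.etree.ElementTree as ET

doc = "<p>Text<!-- A <super>super & grandiose</super> comment --></p>"
root = ET.fromstring(doc)

# The comment is not character data, and its <super> is not markup:
# the element has only its text and no child elements.
print(root.text)
print(len(root))

def parses(text):
    """Return True if text is accepted as well-formed."""
    try:
        ET.fromstring(text)
        return True
    except ET.ParseError:
        return False

# A double hyphen inside a comment is a well-formedness error.
print(parses("<p><!-- a -- b --></p>"))
```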

Processing Instructions

Processing instructions (PIs) allow documents to contain instructions for applications. PIs are not part of the document's character data, but must be passed through to the application. The PI begins with a target (PITarget) used to identify the application to which the instruction is directed.
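
As an illustration, the sketch below reads a processing instruction with Python's standard xml.dom.minidom module. The target render-hint and its data are invented for the example; they do not correspond to any real application.

```python
from xml.dom import minidom

doc = minidom.parseString(
    '<?xml version="1.0"?>'
    '<?render-hint columns="2"?>'
    '<welcome>Welcome to ICTP in Trieste!</welcome>'
)

# The XML declaration is not a node; the first child is the PI.
pi = doc.firstChild
print(pi.nodeType == pi.PROCESSING_INSTRUCTION_NODE)
print(pi.target)  # identifies the application addressed
print(pi.data)    # everything after the target, passed through as is

# The PI does not contribute to the character data of the document.
print(doc.documentElement.firstChild.data)
```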

CDATA Sections

CDATA sections may occur anywhere character data may occur; they are used to escape blocks of text containing characters which would otherwise be recognised as markup.

An example of a CDATA section is shown below:
<![CDATA[Un texte <super>super & grandiose</super>.]]>

CDATA sections also come in handy when you are writing technical documentation containing, e.g., C or C++ code fragments:
<![CDATA[#include <iostream> 
#include <vector> 
int array[] = { 1, 42, 3 }; // Regular "C" array. 
  ...
  for ( p1 = array; p1 != array + 3; ++p1 ) 
    cout << "array has " << *p1 << "\n"; 
]]>
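
To see what an application receives from a CDATA section, one can parse the first example above. The sketch assumes Python's standard xml.etree.ElementTree module; the wrapping p element is added for the example.

```python
import xml.etree.ElementTree as ET

doc = "<p><![CDATA[Un texte <super>super & grandiose</super>.]]></p>"
root = ET.fromstring(doc)

# The section content arrives as plain character data.
print(root.text)
print(len(root))  # no <super> child element was created

# On output, this serializer uses entity references instead of CDATA.
print(ET.tostring(root, encoding="unicode"))
```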

Prolog and Document Type Declaration

Ideally every XML document should begin with an XML declaration which specifies the version of XML being used.

The prolog

For example, the following is a complete XML document, well-formed but not valid:
<?xml version="1.0"?>
<welcome>Welcome to ICTP in Trieste!</welcome>
where the number 1.0 indicates the version of XML to which the document conforms.

Note that the following is also well-formed (without XML declaration):
<welcome>Welcome to ICTP in Trieste!</welcome>

The function of the markup in an XML document is to describe its storage and logical structure and to associate attribute-value pairs with its logical structures. The DTD can define constraints on the logical structure and support the use of predefined storage units. An XML document is valid if it has an associated DTD and if the document complies with the constraints expressed in it.

The document type declaration must appear before the first element in the document. The following turns the previously shown well-formed document into a valid one:
<?xml version="1.0" standalone="yes"?>
<!DOCTYPE welcome [ 
  <!ELEMENT welcome (#PCDATA)>
]>
<welcome>Welcome to ICTP in Trieste!</welcome>

The document type definition

The XML document type declaration contains or points to markup declarations that provide a grammar for a class of documents. This grammar is known as a document type definition (DTD). The document type declaration can point to an external subset (an external entity) containing markup declarations, it can contain markup declarations directly in an internal subset, or it can do both. The DTD for a document consists of both subsets taken together.

A DTD declares element types, attribute lists, entities, and notations. These declarations may be contained in whole or in part within parameter entities.

The external subset

The external subset and external parameter entities differ from the internal subset in that they can contain parameter-entity references in their markup declarations, not only between markup declarations.

For instance, we can rewrite the above valid document using an external DTD welcome.dtd with the following content:
<!ELEMENT welcome (#PCDATA)>

The XML file then references this file by specifying its URI:
<?xml version="1.0" standalone="yes"?>
<!DOCTYPE welcome SYSTEM "welcome.dtd">
<welcome>Welcome to ICTP in Trieste!</welcome>

It is important to realize that, if both the external and internal subsets are used, the internal subset is considered to occur before the external subset. This has the effect that entity and attribute-list declarations in the internal subset take precedence over those in the external subset.
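
This precedence follows from a general rule: when an entity is declared more than once, the first declaration encountered is the binding one. The sketch below shows that rule with two declarations in a single internal subset, assuming Python's standard xml.etree.ElementTree module. It does not load an external subset, and the entity name place is invented for the example.

```python
import xml.etree.ElementTree as ET

doc = """<?xml version="1.0"?>
<!DOCTYPE welcome [
  <!ENTITY place "Trieste">
  <!ENTITY place "Geneva">
]>
<welcome>Welcome to ICTP in &place;!</welcome>"""

root = ET.fromstring(doc)

# The first declaration of "place" is used; the second is ignored.
print(root.text)
```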

Standalone Document Declaration

Markup declarations can affect the content of the document, as passed from an XML processor to an application; examples are attribute defaults and entity declarations. The standalone document declaration, which may appear as a component of the XML declaration, signals whether or not there are such declarations which appear external to the document entity.
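
An application can inspect the declaration through the parser. The sketch below assumes Python's standard xml.parsers.expat module, whose XmlDeclHandler callback reports 1 for standalone="yes", 0 for standalone="no", and -1 when the clause is omitted; the function name standalone_flag is hypothetical.

```python
import xml.parsers.expat

def standalone_flag(document):
    """Return 1 (yes), 0 (no), -1 (clause omitted), or None (no XML declaration)."""
    seen = []
    parser = xml.parsers.expat.ParserCreate()
    # Called once if the document starts with an XML declaration.
    parser.XmlDeclHandler = (
        lambda version, encoding, standalone: seen.append(standalone)
    )
    parser.Parse(document, True)
    return seen[0] if seen else None

print(standalone_flag('<?xml version="1.0" standalone="yes"?><welcome/>'))
print(standalone_flag('<?xml version="1.0"?><welcome/>'))
```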

White space handling

For reasons of readability and maintainability it is often advantageous when editing XML documents to use white space, such as spaces, tabs, and blank lines (denoted by S as defined in rule [3]). This white space is typically irrelevant and should not be included in the final version of the document.

On the other hand, there exist occasions (poetry, source code) where white space in the source is significant and should be preserved in the final output.

There exists a special attribute xml:space that can be attached to an element to signal an intention that, in that element, white space should be preserved by applications. For example, the attribute lists of the poem and code elements could contain the following declarations in the DTD:
<!ATTLIST poem   xml:space (default|preserve) 'preserve'>
<!ATTLIST code    xml:space (default|preserve) 'preserve'>

With these declarations, the default behaviour in a document instance will be to preserve all white space inside the contents of poem and code elements.
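
The parser itself passes all white space in content to the application; honouring xml:space is the application's job. The following sketch shows one possible policy, assuming Python's standard xml.etree.ElementTree module. For simplicity the attribute is written directly in the instance rather than defaulted from a DTD, inheritance by child elements is not handled, and the function name rendered_text and the para element are hypothetical.

```python
import xml.etree.ElementTree as ET

# ElementTree reports xml:space under the reserved XML namespace.
XML_SPACE = "{http://www.w3.org/XML/1998/namespace}space"

def rendered_text(elem):
    """Collapse runs of white space unless xml:space="preserve" is set."""
    text = elem.text or ""
    if elem.get(XML_SPACE) == "preserve":
        return text
    return " ".join(text.split())

poem = ET.fromstring(
    '<poem xml:space="preserve">Roses are red,\n    violets are blue</poem>'
)
para = ET.fromstring('<para>Roses are red,\n    violets are blue</para>')

print(repr(rendered_text(poem)))  # line break and indentation kept
print(repr(rendered_text(para)))  # white space collapsed
```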

End-of-Line handling

XML parsed entities are often stored in computer files which, for editing convenience, are organised into lines that are separated by a combination of carriage-return (#xD) and line-feed (#xA) characters. To simplify the tasks of applications, XML processors will normalise the output by passing to applications a single line-end character #xA.
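
The normalisation can be observed directly. The sketch below assumes Python's standard xml.etree.ElementTree module and mixes three line-end conventions in one element; a carriage return on its own is normalised in the same way as the two-character combination.

```python
import xml.etree.ElementTree as ET

# CR+LF, a lone CR, and a lone LF in the same character data.
doc = "<lines>one\r\ntwo\rthree\nfour</lines>"
root = ET.fromstring(doc)

# The application sees a single #xA for every line end.
print(repr(root.text))
```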

Language identification

It is often useful to identify the language in which the content is written. A special attribute xml:lang is available to specify the language used in the contents and attribute values of any element in an XML document. This attribute must be declared for valid documents. The values of the attribute are language identifiers as defined by [IETF RFC 1766].

Langcode can be a two-letter language code as defined by [ISO 639], possibly augmented by a further specifying element, for instance a country variant. In such a case it is customary to give the language code in lower case, and the country code (if any) in upper case. Note that these values, unlike other names in XML documents, are case insensitive.
  <p xml:lang="fr">Cachez ce sein que je ne saurais voir.</p>
  <p xml:lang="fr-FR">Ce programme a une bogue.</p>
  <p xml:lang="fr-CA">Ce programme a un bogue.</p>
  <p xml:lang="en">The quick brown fox jumps over the lazy dog.</p>
  <p xml:lang="en-GB">What colour is it?</p>
  <p xml:lang="en-US">What color is it?</p>
    <sp qui="Faust" desc='leise' xml:lang="de">
       <l>Habe nun, ach! Philosophie,</l>
       <l>Juristerei, und Medizin</l>
    </sp> 

The intent declared with xml:lang is considered to apply to all attributes and content of the element where it is specified, unless overridden with an instance of xml:lang on another element within that content.
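
An application has to work out the inherited language itself, since the parser only reports the attribute where it is written. The sketch below is one way to do so, assuming Python's standard xml.etree.ElementTree module; the function name effective_languages and the wrapping text element are hypothetical.

```python
import xml.etree.ElementTree as ET

# ElementTree reports xml:lang under the reserved XML namespace.
XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"

def effective_languages(elem, inherited=None):
    """Yield (tag, language) pairs, applying xml:lang inheritance."""
    # An explicit xml:lang overrides the value inherited from the parent.
    lang = elem.get(XML_LANG, inherited)
    yield elem.tag, lang
    for child in elem:
        yield from effective_languages(child, lang)

doc = """<text xml:lang="en">
  <p>The quick brown fox jumps over the lazy dog.</p>
  <sp qui="Faust" desc="leise" xml:lang="de">
    <l>Habe nun, ach! Philosophie,</l>
    <l>Juristerei, und Medizin</l>
  </sp>
</text>"""

result = list(effective_languages(ET.fromstring(doc)))
for tag, lang in result:
    print(tag, lang)
```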

One can leave the specification completely open as follows:
 xml:lang  NMTOKEN  #IMPLIED
or else a series of default values for certain element types can also be specified. A multi-lingual collection of poems in French could contain notes in English and Dutch, declared as follows:
  <!ATTLIST poème  xml:lang NMTOKEN 'fr'>
  <!ATTLIST note   xml:lang NMTOKEN 'en'>
  <!ATTLIST noot   xml:lang NMTOKEN 'nl'>

