XML, XSL, two of a family of extensible languages
An XML document consists of one or more storage units (entities), all of which have content and are identified by name.
Each XML document has one entity called the document entity, which serves as the starting point for the XML processor and may contain the whole document.
Entities may be either parsed or unparsed.
A parsed entity's contents are referred to as its replacement text, which is considered an integral part of the document.
Parsed entities are invoked by name using entity references.
An unparsed entity is a resource whose contents may or may not be text, and if text, may not be XML. Each unparsed entity has an associated notation, identified by name. Beyond a requirement that an XML processor make the identifiers for the entity and notation available to the application, XML places no constraints on the contents of unparsed entities.
Unparsed entities are invoked by name, given in the value of ENTITY or ENTITIES attributes.
General entities are entities for use within the document content.
Parameter entities are parsed entities for use within the DTD.
General and parameter entities use different forms of reference and are recognised in different contexts; they occupy different namespaces, so that a parameter entity and a general entity with the same name are two distinct entities.
A character reference refers to a specific character in the ISO/IEC 10646 character set, for example one not directly accessible from available input devices.
[66] CharRef ::= '&#' [0-9]+ ';' | '&#x' [0-9a-fA-F]+ ';'
Characters referred to using character references must match the production for Char [2].
A character reference beginning with &#x corresponds to the hexadecimal representation of the character.
A character reference beginning with &# corresponds to the decimal representation of the character.
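As an illustration of production [66], the following minimal Python sketch decodes decimal and hexadecimal character references. The function name `decode_char_refs` is hypothetical, and the sketch does not check the decoded character against the Char production [2].

```python
import re

# Mirrors production [66]: '&#' [0-9]+ ';' | '&#x' [0-9a-fA-F]+ ';'
CHAR_REF = re.compile(r"&#(?:x([0-9a-fA-F]+)|([0-9]+));")


def decode_char_refs(text: str) -> str:
    """Replace each character reference with the character it denotes."""
    def _sub(match):
        hex_digits, dec_digits = match.groups()
        # '&#x' introduces hexadecimal digits, plain '&#' decimal digits.
        code_point = int(hex_digits, 16) if hex_digits else int(dec_digits, 10)
        return chr(code_point)

    return CHAR_REF.sub(_sub, text)


decoded = decode_char_refs("&#38; &#x3C; &#xA9;")
```

Here `decoded` holds an ampersand, a less-than sign, and a copyright sign, separated by spaces.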
An entity reference refers to the content of a named entity.
References to parsed general entities use ampersand (&) and semicolon (;) as delimiters.
Parameter-entity references use percent-sign (%) and semicolon (;) as delimiters.
[67] Reference ::= EntityRef | CharRef
[68] EntityRef ::= '&' Name ';'
[69] PEReference ::= '%' Name ';'
In a document without any DTD, a document with only an internal DTD subset which contains no parameter entity references, or a document with standalone='yes', the Name given in the entity reference must match that in an entity declaration.
In a well-formed document one need not declare the following entities: amp, lt, gt, apos, quot; in valid documents they should, for interoperability, be declared in the DTD.
The declaration of a parameter entity must precede any reference to it.
The declaration of a general entity must precede any reference to it which appears in a default value in an attribute-list declaration.
In a valid document with an external subset or external parameter entities with standalone='no', the Name given in the entity reference must match that in an entity declaration.
An entity reference must not contain the name of an unparsed entity.
Unparsed entities may be referred to only in attribute values declared to be of type ENTITY or ENTITIES.
A parsed entity must not contain a recursive reference to itself, either directly or indirectly.
Parameter-entity references are only allowed in the DTD.
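The rule against recursive references can be checked mechanically. The sketch below is only an illustration: it assumes the entity declarations have already been collected into a dictionary that maps each entity name to its replacement text, and it uses a simplified approximation of the Name production. The function name `find_recursion` is hypothetical.

```python
import re
from typing import Dict, List, Optional

# Simplified approximation of a general-entity reference: '&' Name ';'
GENERAL_REF = re.compile(r"&([A-Za-z_:][\w.:-]*);")


def find_recursion(entities: Dict[str, str]) -> Optional[List[str]]:
    """Return a chain of names forming a cycle, or None if there is none."""
    def visit(name: str, path: List[str]) -> Optional[List[str]]:
        if name in path:
            # The entity is reached again from itself: direct or indirect recursion.
            return path[path.index(name):] + [name]
        for ref in GENERAL_REF.findall(entities.get(name, "")):
            cycle = visit(ref, path + [name])
            if cycle:
                return cycle
        return None

    for name in entities:
        cycle = visit(name, [])
        if cycle:
            return cycle
    return None


indirect = find_recursion({"a": "see &b;", "b": "see &a;"})
harmless = find_recursion({"x": "text", "y": "&x; and more"})
```

In this sketch `indirect` reports the chain a, b, a, while `harmless` is `None`.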
Examples of character and entity references follow:
The ampersand (&#38;) and less-than (&#60;) signs...
This document was written on &date; by &Authors;.
<!-- declaration of the parameter entity "HTMLsymbol"... -->
<!ENTITY % HTMLsymbol PUBLIC
"-//W3C//ENTITIES Symbols//EN//HTML"
"http://www.w3.org/TR/xhtml1/DTD/HTMLsymbolx.ent">
<!-- ... and here it is referenced (in the DTD only!) -->
%HTMLsymbol;
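To see these rules in action, the following sketch feeds two small documents to Python's standard `xml.etree.ElementTree` parser, which is built on Expat. The replacement texts chosen for `date` and `Authors` are invented for the illustration, as is the helper name `parses`.

```python
import xml.etree.ElementTree as ET

# Both entities are declared in the internal DTD subset (values are made up).
declared = """<!DOCTYPE note [
  <!ENTITY date "1 May 1999">
  <!ENTITY Authors "the authors">
]>
<note>This document was written on &date; by &Authors;.</note>"""

text = ET.fromstring(declared).text


def parses(document: str) -> bool:
    """Return True if the parser accepts the document."""
    try:
        ET.fromstring(document)
        return True
    except ET.ParseError:
        return False


# No DTD at all, so the reference to 'date' has no matching declaration.
undeclared_ok = parses("<note>&date;</note>")
```

The parser replaces both references with their replacement text, and it rejects the second document because `date` is not declared.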
Entities are declared using the syntax below.
[70] EntityDecl ::= GEDecl | PEDecl
[71] GEDecl ::= '<!ENTITY' S Name S EntityDef S? '>'
[72] PEDecl ::= '<!ENTITY' S '%' S Name S PEDef S? '>'
[73] EntityDef ::= EntityValue | (ExternalID NDataDecl?)
[74] PEDef ::= EntityValue | ExternalID
Name identifies the entity in an entity reference or, in the case of an unparsed entity, in the value of an ENTITY or ENTITIES attribute.
If the same entity is declared more than once, the first declaration encountered is binding; this means that definitions in the internal subset override those in the external subset.
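A quick way to observe the "first declaration is binding" rule is to declare the same entity twice and look at what a parser substitutes. This sketch uses Python's `xml.etree.ElementTree`; the entity name `greeting` and its values are invented for the illustration.

```python
import xml.etree.ElementTree as ET

# The same entity is declared twice; only the first declaration should count.
document = """<!DOCTYPE d [
  <!ENTITY greeting "first">
  <!ENTITY greeting "second">
]>
<d>&greeting;</d>"""

value = ET.fromstring(document).text
```

With the Expat-based parser shipped with Python, `value` is the text of the first declaration.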
If the entity definition is an EntityValue (production [9]) the defined entity is called an internal entity. There is no separate physical storage object, and the content of the entity is given in the declaration.
An internal entity is a parsed entity. An example follows:
<!ENTITY MathML "Mathematical Markup Language">
<!ENTITY XMLS "&MathML; and other extensible markup languages">
If the entity is not internal, it is an external entity, declared using the following syntax.
[75] ExternalID ::= 'SYSTEM' S SystemLiteral
                  | 'PUBLIC' S PubidLiteral S SystemLiteral
[76] NDataDecl ::= S 'NDATA' S Name
If NDataDecl [76] is present, the entity is a general unparsed entity; otherwise it is a parsed entity.
The Name in NDataDecl [76] must match the declared name of a notation [82].
SystemLiteral in [75] is called the entity's system identifier. It is a URI that may be used to retrieve the entity. A relative URI is resolved against the location of the resource containing the entity declaration, which may be the document entity, the entity containing the external DTD subset, or some other external parameter entity. A non-ASCII character in a URI should be represented in UTF-8 as one or more bytes, each of which is then escaped as %HH, where HH is the hexadecimal value of the byte.
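The treatment of non-ASCII characters in a system identifier can be sketched with `urllib.parse.quote` from the Python standard library, which encodes a character as UTF-8 and percent-escapes each resulting byte. The helper name `escape_system_id` and the sample path are assumptions made for this illustration.

```python
from urllib.parse import quote


def escape_system_id(uri: str) -> str:
    """Percent-escape non-ASCII characters as UTF-8 bytes, keeping '/' and ':'."""
    return quote(uri, safe="/:", encoding="utf-8")


# 'é' is two bytes in UTF-8 (C3 A9), so it becomes two %HH escapes.
escaped = escape_system_id("../données/gut99.xml")
```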
In addition to a system identifier, an external identifier may include a public identifier. An XML processor attempting to retrieve the entity's content may use the public identifier to try to generate an alternative URI. If the processor is unable to do so, it must use the URI specified in the system literal.
Examples of external entity declarations:
<!ENTITY myfile SYSTEM "/user/goossens/gut99.xml">
<!ENTITY xhtml PUBLIC
"-//W3C//DTD XHTML 1.0 Transitional//EN"
"/user/goossens/xml/dtds/transitional.dtd">
<!ENTITY myFigure SYSTEM "../oxford99.eps" NDATA eps>
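How a processor might use a public identifier is not prescribed in detail. One common approach is a local catalogue that maps public identifiers to alternative locations. The sketch below assumes such a catalogue in the form of a plain dictionary; the name `resolve_external_id`, the catalogue itself, and the `file:` URI in it are all hypothetical.

```python
from typing import Dict, Optional

# Hypothetical catalogue: public identifier -> locally preferred URI.
CATALOG: Dict[str, str] = {
    "-//W3C//DTD XHTML 1.0 Transitional//EN": "file:///dtds/xhtml1-transitional.dtd",
}


def resolve_external_id(system_id: str,
                        public_id: Optional[str] = None,
                        catalog: Dict[str, str] = CATALOG) -> str:
    """Prefer a catalogue entry for the public identifier, else the system identifier."""
    if public_id is not None and public_id in catalog:
        return catalog[public_id]
    # No alternative URI could be generated: fall back to the system literal.
    return system_id


xhtml_uri = resolve_external_id(
    "/user/goossens/xml/dtds/transitional.dtd",
    "-//W3C//DTD XHTML 1.0 Transitional//EN",
)
myfile_uri = resolve_external_id("/user/goossens/gut99.xml")
```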
External parsed entities may each begin with a text declaration.
[77] TextDecl ::= '<?xml' VersionInfo? EncodingDecl S? '?>'
The text declaration must be provided literally, not by reference to a parsed entity. No text declaration may appear at any position other than the beginning of an external parsed entity.
The document entity is well-formed if it matches the production document [1].
An external general parsed entity is well-formed if it matches the production labelled extParsedEnt [78].
An external parameter entity is well-formed if it matches the production labelled extPE [79].
[78] extParsedEnt ::= TextDecl? content
[79] extPE ::= TextDecl? extSubsetDecl
An internal general parsed entity is well-formed if its replacement text matches the production labelled content [43].
All internal parameter entities are well-formed by definition.
A consequence of well-formedness in entities is that the logical and physical structures in an XML document are properly nested.
No start-tag, end-tag, empty-element tag, element, comment, processing instruction, character reference, or entity reference can begin in one entity and end in another.
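The nesting requirement can be observed with a parser. In the sketch below, which uses Python's `xml.etree.ElementTree`, one entity contains a complete element, while the other opens an element that is closed only outside the entity. The entity names and the helper `is_well_formed` are invented for the illustration.

```python
import xml.etree.ElementTree as ET


def is_well_formed(document: str) -> bool:
    """Return True if the parser accepts the document."""
    try:
        ET.fromstring(document)
        return True
    except ET.ParseError:
        return False


# The element <b> begins and ends inside the entity: properly nested.
nested = '<!DOCTYPE d [<!ENTITY em "<b>bold</b>">]><d>&em;</d>'

# The element <b> begins inside the entity but ends in the document entity.
straddling = '<!DOCTYPE d [<!ENTITY open "<b>bold">]><d>&open;</b></d>'

nested_ok = is_well_formed(nested)
straddling_ok = is_well_formed(straddling)
```

The first document is accepted; the second is rejected because the element would begin in one entity and end in another.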
Each external parsed entity in an XML document may use a different encoding for its characters.
All XML processors must be able to read entities in either UTF-8 or UTF-16.
Encodings other than UTF-8 and UTF-16 are used around the world, and certain XML processors may want to support entities that use them. Parsed entities that use such encodings must begin with a text declaration containing an encoding declaration, whose syntax is shown below.
[80] EncodingDecl ::= S 'encoding' Eq ('"' EncName '"' | "'" EncName "'")
[81] EncName ::= [A-Za-z] ([A-Za-z0-9._] | '-')*
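Production [81] translates directly into a regular expression. The following sketch checks candidate encoding names against it; the function name `is_valid_enc_name` is hypothetical, and the check is purely syntactic (it says nothing about whether a processor supports the encoding).

```python
import re

# Production [81]: [A-Za-z] ([A-Za-z0-9._] | '-')*
ENC_NAME = re.compile(r"[A-Za-z][A-Za-z0-9._-]*")


def is_valid_enc_name(name: str) -> bool:
    """Return True if the whole string matches the EncName production."""
    return ENC_NAME.fullmatch(name) is not None


results = {name: is_valid_enc_name(name)
           for name in ("UTF-8", "Shift_JIS", "ISO-8859-1", "8859-1", "UTF 8")}
```

The first three names match the production; `8859-1` fails because it does not start with a letter, and `UTF 8` fails because of the space.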
The encoding of the document entity is specified in the XML declaration.
The encoding name EncName [81] must begin with a Latin letter and may otherwise contain only Latin letters, digits, periods, underscores, and hyphens.
Possible values for EncName are: UTF-8, UTF-16, ISO-10646-UCS-2, and ISO-10646-UCS-4 for the various encodings and transformations of Unicode/ISO/IEC 10646; ISO-8859-1, ISO-8859-2, etc. for the various parts of ISO-8859; and ISO-2022-JP, Shift_JIS, and EUC-JP for the various encoded forms of JIS X-0208-1997. For other encodings it is recommended to use the names of character encodings registered (as charsets) with the Internet Assigned Numbers Authority [IANA].
An encoding declaration can occur only at the beginning of an external entity.
Entities encoded in ASCII (a subset of UTF-8) do not need an encoding declaration; entities containing texts using diacritics, encoded for instance in Latin 1 (ISO-8859-1), must always have their encoding specified.
<?xml encoding='UTF-8'?><!-- default -->
<?xml version="1.0" encoding="ISO-8859-1"?>
<?xml encoding='EUC-JP'?>
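The effect of the encoding declaration can be checked with a small experiment. The sketch below, using Python's `xml.etree.ElementTree`, parses the same Latin 1 bytes once with and once without an encoding declaration; the sample text and the variable names are chosen only for the illustration.

```python
import xml.etree.ElementTree as ET

# The accented letters are single bytes (0xE9) in ISO-8859-1.
body = "<p>été</p>".encode("iso-8859-1")

with_declaration = b'<?xml version="1.0" encoding="ISO-8859-1"?>' + body
text = ET.fromstring(with_declaration).text

# Without a declaration the parser assumes UTF-8, and 0xE9 followed by 't'
# is not a valid UTF-8 sequence.
try:
    ET.fromstring(body)
    undeclared_accepted = True
except ET.ParseError:
    undeclared_accepted = False
```

With the declaration the text is decoded correctly; without it the parser reports an error.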
Section 4.4 in the XML Specification explains in detail the contexts in which character references, entity references, and invocations of unparsed entities might appear and the required behaviour of an XML processor in each case.
Section 4.5 explains how the replacement text for internal entities is constructed.
Consider the following (from Section 4.5):
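<!ENTITY % pub "&#xc9;ditions Gallimard" >
<!ENTITY rights "All rights reserved" >
<!ENTITY book "La Peste: Albert Camus, &#xA9; 1947 %pub;. &rights;" >
The replacement text for the entity book is then:
La Peste: Albert Camus, © 1947 Éditions Gallimard. &rights;
The character references and the parameter-entity reference %pub; are expanded when the declaration is read; the general-entity reference &rights; is left as is and is expanded only when book is referenced.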
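The construction of replacement text can be sketched in a few lines of Python. The sketch assumes the literal entity value from the Section 4.5 example with its character references written out (`&#xc9;` and `&#xA9;`), uses a simplified approximation of the Name production, and takes the parameter entities as a dictionary. The function name `replacement_text` is hypothetical.

```python
import re
from typing import Dict

# Character references and parameter-entity references are expanded;
# general-entity references such as &rights; are deliberately not matched.
_TOKEN = re.compile(r"&#x([0-9a-fA-F]+);|&#([0-9]+);|%([A-Za-z_:][\w.:-]*);")


def replacement_text(literal: str, parameter_entities: Dict[str, str]) -> str:
    """Build the replacement text from a literal entity value."""
    def _sub(match):
        hex_digits, dec_digits, pe_name = match.groups()
        if hex_digits:
            return chr(int(hex_digits, 16))
        if dec_digits:
            return chr(int(dec_digits, 10))
        return parameter_entities[pe_name]

    return _TOKEN.sub(_sub, literal)


pub = replacement_text("&#xc9;ditions Gallimard", {})
book = replacement_text(
    "La Peste: Albert Camus, &#xA9; 1947 %pub;. &rights;", {"pub": pub}
)
```

In `book` the copyright sign and the publisher's name are filled in, while `&rights;` is still an unexpanded reference.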
All XML processors must recognise the five entities amp, gt, lt, apos, quot, whether they are declared or not.
For interoperability, valid documents should nevertheless declare these entities in their DTD before referencing them.
<!ENTITY lt "&#38;#60;">
<!ENTITY gt "&#62;">
<!ENTITY amp "&#38;#38;">
<!ENTITY apos "&#39;">
<!ENTITY quot "&#34;">
The < and & characters in the declarations of lt and amp are doubly escaped to meet the requirement that entity replacement be well-formed.
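Why the double escaping is needed can be shown with an ordinary entity. The sketch below uses Python's `xml.etree.ElementTree` and an invented entity name, `less`, because a parser may ignore redeclarations of the predefined entities. With double escaping the replacement text is the character reference `&#60;`, which is harmless when parsed; with single escaping the replacement text is a bare `<`, which is not well-formed content.

```python
import xml.etree.ElementTree as ET

# Replacement text is '&#60;', which is expanded to '<' as character data.
double = '<!DOCTYPE d [<!ENTITY less "&#38;#60;">]><d>a &less; b</d>'

# Replacement text is a bare '<', which would start a tag when parsed.
single = '<!DOCTYPE d [<!ENTITY less "&#60;">]><d>a &less; b</d>'

double_text = ET.fromstring(double).text

try:
    ET.fromstring(single)
    single_accepted = True
except ET.ParseError:
    single_accepted = False
```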
[82] NotationDecl ::= '<!NOTATION' S Name S (ExternalID | PublicID) S? '>'
[83] PublicID ::= 'PUBLIC' S PubidLiteral
A notation identifies by name the format of an unparsed entity, the format of an element that bears a notation attribute, or the application to which a processing instruction is addressed.
Notation declarations declare a name for a notation (used in entity and attribute-list declarations and in attribute specifications), and an external identifier for the notation to allow the XML processor or client application to locate a helper application capable of processing data in the given notation.
XML processors must provide applications with the name and external identifier(s) of any notation declared and referred to in an attribute value, attribute definition, or entity declaration. They may additionally resolve the external identifier into the system identifier, file name, or other information needed to allow the application to call a processor for data in the notation described.
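The way a processor hands notation and unparsed-entity information to the application can be sketched with the Expat binding in the Python standard library, which reports both kinds of declaration through callbacks. The system identifier `ghostview` for the notation is an assumption made for the illustration, as are the handler and variable names.

```python
import xml.parsers.expat

document = """<!DOCTYPE fig [
  <!NOTATION eps SYSTEM "ghostview">
  <!ENTITY myFigure SYSTEM "../oxford99.eps" NDATA eps>
]>
<fig/>"""

notations = {}
unparsed_entities = {}


def on_notation(name, base, system_id, public_id):
    # Called once for each notation declaration.
    notations[name] = (system_id, public_id)


def on_unparsed_entity(name, base, system_id, public_id, notation_name):
    # Called once for each entity declared with NDATA.
    unparsed_entities[name] = (system_id, notation_name)


parser = xml.parsers.expat.ParserCreate()
parser.NotationDeclHandler = on_notation
parser.UnparsedEntityDeclHandler = on_unparsed_entity
parser.Parse(document, True)
```

The parser does not fetch or interpret `../oxford99.eps`; it only passes the identifiers on, leaving it to the application to find a helper for the `eps` notation.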
The document entity serves as the root of the entity tree and a starting-point for an XML processor. The document entity has no name and might well appear on a processor input stream without any identification at all.



