The following specification is an extension to HTML5, including XHTML5 documents and documents that conform to the Polyglot Markup profile.

It specifies how to use the xml:id attribute in XML well-formed HTML5 documents so that authors MAY use XML applications that rely on tokenized id attributes of XML ID type to refer to named fragments (idrefs) in such documents.

Documents conforming to this specification may be parsed by any XML parser, but will cause (non-fatal) XML errors if the parser – via DTD, schema, default or otherwise – performs ID-type assignment for both the xml:id attribute and the id attribute.

Introduction

While HTML5 operates with idrefs and defines the id attribute as the format’s idref container, some HTML consumers of the XML kind rely on idrefs of the XML ID type (ID-type assigned attributes) for such purposes.[[!xml]]

The XHTML 1.x family of HTML documents included DOCTYPE declarations that pointed to DTDs that defined the id attribute as being of such an XML ID type. This meant that, when consumed as XML, validating XML processors could consume HTML’s id attribute as being of XML ID type.

With the HTML5 specification, the reference to a DTD has been removed from the DOCTYPE declaration, and no other applicable DOCTYPE declaration that points to a DTD has at this time been specified, whether for HTML5 or for XHTML5.

For this class of HTML consumers, HTML5 documents thus need XML parsers that do not rely on a reference to a DTD for the assignment of ID type to the id attribute. However, out of the box, most XML tools support ID assignment for the xml:id attribute but not for HTML’s id attribute. For this class of XML consumers, HTML5 documents thus need a specification that defines how to use the xml:id attribute, and it is this use case that this specification aims to address.

To solve this problem, this specification recommends that HTML’s id attribute be duplicated with an xml:id attribute. This attribute is specified to be of XML ID type without depending on a DTD, and most XML tools, out of the box, apply ID assignment to xml:id without performing non-DTD-based ID assignment for the id attribute of HTML documents. By prepping documents with this attribute as necessary, authors may continue to use XML implementations that require idrefs to be of XML ID type when consuming HTML5 and XHTML5 documents.

Terminology

The key words MUST, MUST NOT, REQUIRED, SHALL, SHALL NOT, SHOULD, SHOULD NOT, RECOMMENDED, MAY, and OPTIONAL in this specification are to be interpreted as described in [[!RFC2119]].

Syntax

Specifications

The id attribute in no namespace, id, as used on elements in the XHTML namespace, is defined by the HTML specification. [[!HTML5]]

The id attribute in the XML namespace, xml:id, is defined by the xml:id specification. [[!XML-ID]]

HTML adaptations

Usage rules

HTML elements in HTML documents and HTML elements in XML documents MUST NOT replace the id attribute with the id attribute in the XML namespace.

To allow HTML documents and XML documents to be (post-)processed by XML processing tools that rely on attributes of the XML ID type for ID references, authors MAY specify an attribute in no namespace with no prefix and with the literal local name “xml:id” on HTML elements in HTML documents, or the id attribute in the XML namespace on HTML elements in XML documents, but only if an id attribute in no namespace is also specified on the same element and both attributes have the same value, byte for byte.

The alternative of allowing xml:id to be used independently of id is not permitted because:
  • Documents should not be made dependent on processors that support xml:id (for instance, if XML tools start to support id, then one should be able to consume the document without editing it first).
  • Since the xml:id attribute, being of XML ID type, is expected to represent the element’s unique identifier, something which the id attribute is expected to do as well, it makes sense to require that the two are identical and thus represent one and the same identifier.
This also follows the pattern of the name attribute in XHTML1, which for the anchor element shared the same “name space” as the id attribute, and the recommendation was for the attributes to have identical values. Keeping the values identical facilitates simple addition and removal of xml:id and can help prevent misguided usage.

It is OPTIONAL whether all, some, one or none of the id attributes are duplicated with an xml:id attribute. That is, duplicating one id attribute with an xml:id attribute does not imply that all the id attributes have to be duplicated with xml:id attributes.
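The usage rules above can be checked mechanically. The following is a minimal sketch, not part of this specification, that assumes the document is parsed as XML with Python's standard xml.etree.ElementTree; the function name xml_id_violations is hypothetical. It compares parsed attribute values, which only approximates the byte-for-byte requirement because the XML parser normalizes attribute values before the comparison.

```python
import xml.etree.ElementTree as ET

# ElementTree exposes xml:id under the XML namespace URI.
XML_ID = "{http://www.w3.org/XML/1998/namespace}id"


def xml_id_violations(root):
    """Return (tag, id, xml:id) for every element whose xml:id lacks an identical id."""
    violations = []
    for elem in root.iter():
        xml_id = elem.get(XML_ID)
        if xml_id is None:
            continue  # an id without xml:id is always permitted
        if elem.get("id") != xml_id:
            violations.append((elem.tag, elem.get("id"), xml_id))
    return violations


doc = ET.fromstring(
    '<body xmlns="http://www.w3.org/1999/xhtml" id="MyBody" xml:id="MyBody">'
    '<p id="intro">Only id: permitted.</p>'
    '<p xml:id="orphan">Only xml:id: not permitted.</p>'
    '<p id="a" xml:id="b">Different values: not permitted.</p>'
    "</body>"
)
for tag, id_value, xml_id_value in xml_id_violations(doc):
    print(tag, id_value, xml_id_value)
```

Only the second and third paragraphs are reported: the first carries id alone, which the rules allow, and the body element carries both attributes with the same value.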

Same value of xml:id as for id

The permitted attribute values for id and xml:id, when both are used on HTML elements in HTML documents or on HTML elements in XML documents, are the common subset of the constraints of the id attribute and the constraints of the xml:id attribute:

  1. There can be no whitespace in the attribute. (Required by HTML)
  2. The character choice must be compatible with XML namespaces, which means that colon (:) is forbidden. (Required by xml:id)
  3. The character choice must be compatible with XML 1.0’s requirements for values for attributes of XML ID type. (Required by xml:id)
    • For compatibility with tools that adhere to XML 1.0 4th edition, the permitted characters are an explicit list derived from Unicode 2.0.
    • For compatibility with tools that adhere to the regime established with XML 1.0 5th edition, the permitted characters are those of the current version of Unicode minus a list of forbidden characters.
  4. The value must follow the syntax for ID values defined by XML 1.0:
    • Some characters are forbidden from occurring at all.
    • Some characters cannot occur at certain places (e.g. the first character cannot be a digit).
    • Some characters MUST occur (the value cannot consist only of digits).
    • The value cannot consist only of punctuation.

(This is different from the lang attribute, which from the outset shares the same syntax rules as xml:lang.)

For a more complete, but non-normative, list of the forbidden characters for xml:id, see the appendix.
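As an illustration of the constraints above, the following sketch tests whether a value is usable for both id and xml:id. It assumes the XML 1.0 5th edition regime and encodes the NameStartChar and NameChar productions with the colon removed (that is, an NCName). The function name is_shared_id_value is hypothetical, and tools bound to the 4th edition accept a narrower set of characters than this check does.

```python
import re

# NameStartChar from XML 1.0 (5th edition), minus the colon.
_START = (
    "A-Z_a-z\u00C0-\u00D6\u00D8-\u00F6\u00F8-\u02FF\u0370-\u037D"
    "\u037F-\u1FFF\u200C\u200D\u2070-\u218F\u2C00-\u2FEF\u3001-\uD7FF"
    "\uF900-\uFDCF\uFDF0-\uFFFD\U00010000-\U000EFFFF"
)
# NameChar adds hyphen, period, digits, middle dot and combining marks.
_REST = _START + "\\-.0-9\u00B7\u0300-\u036F\u203F\u2040"
_NCNAME = re.compile("[" + _START + "][" + _REST + "]*")


def is_shared_id_value(value):
    """Return True if value is an NCName, which also rules out whitespace and the empty string."""
    return _NCNAME.fullmatch(value) is not None


for candidate in ["MyBody", "_x", "1abc", "a:b", "a b", "42", "..."]:
    print(repr(candidate), is_shared_id_value(candidate))
```

Because an NCName must be non-empty and cannot contain whitespace, the same test also covers the constraint that HTML places on id.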

Processing

The attribute in no namespace, with no prefix and with the literal local name "xml:id", has no effect on idref processing when consumed as HTML. The ID type of the id attribute in the XML namespace in XML documents is not expected to be assigned by Web browsers, as they typically do not implement XPointer and are at this point expected to continue not to support it.

How to determine the idref for XHTML5 and HTML5 documents is defined by the HTML5 spec.

How to determine the idref based on xml:id is defined by the xml:id specification.

If both the id attribute in no namespace and the id attribute in the XML namespace are set on an element, user agents will use the id attribute in the XML namespace if they support both, and the id attribute in no namespace will be ignored for the purposes of determining the element's id.

Processing tools that apply a schema or other means in order to treat HTML id attributes as being of XML “ID” type MUST NOT be used to consume documents that apply both id and xml:id on the same element, as it is an XML validation error if an element includes two attributes of XML “ID” type.

If the document references a DTD that defines the id attribute as being of type ID (for instance a DTD from the XHTML 1.x family), then this specification does not apply. Note, however, that xml:id should not be applied to elements that already have an attribute of XML “ID” type because, again, it is an XML (validation) error if an element includes two attributes of XML “ID” type.

If the resulting value is not a valid xml:id value, and the parser supports xml:id, the parser could report an error; see the xml:id specification.

The xml:id IDL attribute ...

<!-- I feel something about IDL needs to be here, but I do not know what to say ... -->

Document requirements

When the above rules are followed, the xml:id attribute MAY be used in any HTML or XHTML document provided the document fulfills the following requirements:

Polyglot Markup requirements

An author who aims to use xml:id and who also wants to adhere to the robustness principles of the Polyglot Markup profile SHOULD duplicate all id attributes with the xml:id attribute. Only then is the xml:id extension, as defined in this specification, considered to be compatible with the principles of the Polyglot Markup profile. Conformance with the principles of the Polyglot Markup profile for xml:id can be viewed along the same lines as the xml:lang attribute: it is a polyglot feature as long as the element includes both xml:lang and lang; xml:lang must be used on all elements where lang is used.
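Duplicating every id attribute, as recommended for polyglot documents, is a mechanical step. The sketch below is one possible way to do it, assuming the document can be parsed as XML with Python's xml.etree.ElementTree; the function name duplicate_ids is hypothetical. It does not verify that each value is also a valid xml:id value, and it leaves any existing xml:id untouched.

```python
import xml.etree.ElementTree as ET

# ElementTree exposes xml:id under the XML namespace URI.
XML_ID = "{http://www.w3.org/XML/1998/namespace}id"


def duplicate_ids(root):
    """Copy each id attribute to xml:id and return the number of attributes added."""
    added = 0
    for elem in root.iter():
        value = elem.get("id")
        if value is not None and elem.get(XML_ID) is None:
            elem.set(XML_ID, value)
            added += 1
    return added


root = ET.fromstring(
    '<body xmlns="http://www.w3.org/1999/xhtml" id="MyBody">'
    '<h1 id="title">Lorem ipsum.</h1><p>Dolor sint.</p></body>'
)
print(duplicate_ids(root))
print(ET.tostring(root, encoding="unicode"))
```

Running the function a second time adds nothing, so it can be applied repeatedly as a preparation step before handing the document to an XML tool.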

The Polyglot Markup specification defines a profile of HTML that is itself extensible, and which results in well-formed XML and HTML5-compatible documents that are robust (semantics are preserved regardless of parsing) and identical (whether parsed as XML or as HTML).

While DOM equivalence, whether a document is parsed as XML or as HTML, is a core value of polyglot markup, there are some exceptions where prepping the document for as many parsers as possible wins over DOM equivalence: robustness wins over strict equivalence. A relevant example is xml:lang, which is permitted on the condition that an identical lang is used as well. The fact that xml:id is prefixed with xml: makes it similar to xml:lang. The DOM difference caused by HTML’s and XML’s differing handling of the xml: prefix is tolerated due to the semantic identity and in order to make polyglot markup supported by a wider set of XML parsers, namely parsers that do not support the lang attribute of the XHTML namespace. While XML parsers should see that xml:id belongs in the XML namespace, only XML parsers that implement xml:id will assign it the ID type (Web browsers that support XHTML are not expected to implement xml:id).
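The DOM difference described above can be observed with two parsers from the Python standard library. This is only an illustration: html.parser is used here as a stand-in for an HTML parser and is not a conforming HTML5 implementation, and the class name AttrCollector is hypothetical. The point is that the HTML side sees an attribute literally named xml:id, while the XML side sees an id attribute in the XML namespace.

```python
from html.parser import HTMLParser
import xml.etree.ElementTree as ET

MARKUP = (
    '<p xmlns="http://www.w3.org/1999/xhtml" id="intro" xml:id="intro">'
    "Dolor sint.</p>"
)


class AttrCollector(HTMLParser):
    """Record the attribute names seen on start tags."""

    def __init__(self):
        super().__init__()
        self.names = []

    def handle_starttag(self, tag, attrs):
        self.names.extend(name for name, _ in attrs)


html_view = AttrCollector()
html_view.feed(MARKUP)
html_view.close()

# ElementTree reports namespaced attributes as {namespace-uri}localname.
xml_view = sorted(ET.fromstring(MARKUP).attrib)

print(html_view.names)
print(xml_view)
```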

Example documents

The XInclude specifications define an element include in the XInclude namespace. The element can be used to concatenate different documents, or document fragments, into a new document.[[xinclude]],[[xinclude-11]] The reference to the ID is made via XPointer syntax in the xpointer attribute.

Here is an XML document “foo.xml” with an include element in the XInclude namespace, which points to a document “polyglot.html”:

<include xmlns="http://www.w3.org/2001/XInclude"
        href="http://dataormen.local/dataimport.xhtml" parse="xml" xpointer="MyBody"
        />

Here is the code of the document “polyglot.html”, in which the include element in the above file “foo.xml” refers to the id “MyBody”.

<!DOCTYPE html>
<html xmlns="http://www.w3.org/1999/xhtml">
   <head>
      <title>Include my body element, please!</title>
      <meta charset="utf-8"/>
   </head>
   <body id="MyBody" xml:id="MyBody">
      <h1>Lorem ipsum.</h1>
      <p>Dolor sint.</p>
   </body>
</html>
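To show what an xml:id-aware consumer does with the xpointer value, the following sketch looks up the element whose xml:id matches a shorthand pointer. It is a simplified stand-in, not an XInclude implementation: the function name resolve_shorthand is hypothetical, the document is embedded as a string instead of being fetched from the href, and only the xml:id attribute is consulted.

```python
import xml.etree.ElementTree as ET

# ElementTree exposes xml:id under the XML namespace URI.
XML_ID = "{http://www.w3.org/XML/1998/namespace}id"

POLYGLOT = """<!DOCTYPE html>
<html xmlns="http://www.w3.org/1999/xhtml">
   <head>
      <title>Include my body element, please!</title>
      <meta charset="utf-8"/>
   </head>
   <body id="MyBody" xml:id="MyBody">
      <h1>Lorem ipsum.</h1>
      <p>Dolor sint.</p>
   </body>
</html>"""


def resolve_shorthand(root, pointer):
    """Return the first element whose xml:id equals the pointer, or None."""
    for elem in root.iter():
        if elem.get(XML_ID) == pointer:
            return elem
    return None


target = resolve_shorthand(ET.fromstring(POLYGLOT), "MyBody")
print(target.tag)
```

If the xml:id attribute were removed from the body element, this lookup would return None even though id="MyBody" is still present, which is the situation this specification is meant to avoid.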

A non-normative list of the forbidden characters, based on the Recycled Knowledge blog:

  1. Colon
  2. ASCII control characters and their 8-bit counterparts.
  3. ASCII and Latin-1 symbolic characters, with the exceptions of hyphen, period, colon, underscore, and middle dot, which have always been permitted in XML names. These characters are commonly used as syntax delimiters either in XML itself or in other languages, and so are excluded.
  4. The Greek question mark, which looks like a semicolon and is canonically equivalent to a regular semicolon.
  5. The General Punctuation block of Unicode, with the exceptions of the zero-width joiner, zero-width non-joiner, undertie, and character-tie characters, which are required in certain languages to spell words correctly. Various kinds of blank spaces and assorted punctuation don't make sense in names.
  6. The various Unicode symbols blocks reserved for "pattern syntax", from U+2190 to U+2BFF. These characters should never appear in identifiers of any sort, as they are reserved for use as syntactic delimiters in future languages that exploit non-ASCII syntax. Many are assigned, some are not.
  7. The Ideographic Description Characters block, which is used to describe (not create) uncoded Chinese characters.
  8. The surrogate code units (which don't correspond to Unicode characters anyhow) and private-use characters. Using the latter, in names or otherwise, is very bad for interoperability.
  9. The Plane 0 non-characters at U+FDD0 to U+FDEF, U+FFFE, and U+FFFF. The non-characters on the other planes are allowed, not because they are a good idea, but to simplify implementation.