Updated DOCTYPE versioning change proposal (ISSUE-4)

The proposal was updated significantly, based on comments. I've tried to address the "compound" issue as well.

Here's a version with all the parts of a change proposal combined into a single document.  Since the discussion has been long, the rationale is long.

________________________________
Summary:
 Describe the DOCTYPE element and provide for allowing DOCTYPE definitions.
________________________________
Rationale:


 1.  The DOCTYPE has been part of HTML since its earliest versions, and is still required.  This change proposal makes its history and use clearer, without introducing any HTML interpreter changes.
 2.  The HTML5 specification is intended to replace previous versions of the HTML specification and the definition of the text/html MIME type.  Redefining a MIME type should not make previously conforming documents non-conforming; even if features are "deprecated" (conforming but not recommended), the conforming but not-recommended constructs should be described completely.
 3.  This feature is "in scope". There was an argument that features only intended for use in "controlled environments" were not in scope for the HTML working group.  (This is discussed in http://lists.w3.org/Archives/Public/public-html/2010Jan/0013.html )
 4.  In particular, the working group intends to support "polyglot" documents which are both valid XML and XHTML and also valid as HTML text/html; since XML workflows often require a !DOCTYPE with a PublicIdentifier and a SystemIdentifier, this increases the footprint of "polyglot" documents.
 5.  Other ideas for including a new versioning mechanism have been floated, e.g., an attribute on the <html> element. However, those alternatives have disadvantages - they would introduce the possibility of inconsistencies, where the DOCTYPE contains one version string and the version attribute contains another, and have little or no benefit. In particular, there were claimed advantages of a version attribute on the html element rather than using DOCTYPE:
    *   It was claimed that such a version indicator would be "easier to type correctly from memory":
       *   If an HTML author is relying on memory, the author should leave out the HTML version string and use the <!DOCTYPE html> form, since it is clearly not a "controlled environment".
       *   In any case, a simpler version indicator is not useful because in fact HTML evolves more continuously and a version indicator that was easy to remember would not actually address the use cases where a version indicator is actually useful.
    *    It was claimed that such a version indicator would be "easier to read":
       *   Even if it were true,  "ease of reading HTML markup directly" is not a strong design goal for HTML, compared to other uses.
       *   The proposal below recommends omitting a version indicator except in limited situations, and recommends readers ignore the version indicator except for specific purposes, so that "ease of reading" only matters in limited situations anyway.
       *   Whether something is "easy to read" is not an independent factor, but dependent on context and familiarity. Since the DOCTYPE element is there anyway, and web authors are familiar with it, and it is documented in every book, online tutorial and other HTML reference, using "DOCTYPE" for a version indicator will result in documents that are "easier to read" because of familiarity.
 6.  There was an argument that the change proposal was somehow related to "vastly increased reverse-engineering costs". This argument does not apply to this change proposal, see http://lists.w3.org/Archives/Public/public-html/2010Jan/0011.html .
 7.  The current HTML5 spec says the DOCTYPE is "mostly useless".  This wording should change:
    *   It was claimed that this means the same thing as "of limited utility". In fact, an informal survey showed that  "mostly useless" and "of limited utility" meant different things to a number of people:
       *    "mostly useless" was much "stronger"
       *   "mostly useless" meant that in almost all situations, the utility was zero, while "of limited utility" meant that the utility was less than expected but not uniformly different.
    *   Even if "mostly useless" and "of limited utility" could mean the same thing in some contexts, "mostly useless" was called "childish" or "petulant" and "inappropriate in a formal standards document".
 8.  Many of the arguments made in previous discussions about versions and doctypes were not careful to distinguish between "version of specification" and "version of implementation". It should be noted that many *want* a version indicator to note "version of implementation", i.e., as an indicator of "best viewed by FireFox 4.0 or later" or some such.  However, this change proposal is very clearly providing for a version of a "specification", and, in particular, of the HTML specification, with the possibility of "mix" specifications added.
 9.  Many of the arguments in previous discussions were arguing against version-specific browser behavior. But this change proposal specifically does NOT allow for (any additional) version-specific behavior, and in fact explicitly disallows it.
 10. There was one suggestion that, instead of PublicIdentifier and SystemIdentifier, ONLY the SystemIdentifier be allowed, but that the RFC 3151 URN version of the PublicIdentifier might be supplied, e.g.,
<!DOCTYPE html SYSTEM "urn:publicid:-:W3C+HTMLWG+hixie:nonsgml+html+20100401:en">
rather than
<!DOCTYPE html PUBLIC "-//W3C HTMLWG hixie//NONSGML HTML 20100401//EN" "about:legacy-compat">
This suggestion is interesting but doesn't seem to improve anything (since the URN isn't easily resolvable) when considering compatibility with existing deployed XML editing workflows.
 11. While everyone *hopes* there are never going to be any further incompatible changes to HTML in the future, there *is* a possibility that in some unfortunate situation, it will be necessary to introduce incompatible changes. In that case, it will be necessary to introduce a new version indicator, to allow (alas) processors to determine which of the incompatible interpretations was meant. While this will be unfortunate, it would be doubly unfortunate to have to introduce a new "place" for a version indicator that was previously non-conforming, which would cause even worse uproar, because documents that *didn't* want the new incompatible behavior would have no place to say explicitly which version of the incompatible behavior they wanted. By *allowing* a version indicator in conforming content today, we can avert more serious damage. Having a location for a version indicator, even if it isn't explicitly used, allows it to be used at some point in the future. In the history of computer languages, there are no languages that have not evolved, been extended, or otherwise "versioned" as long as the language has been in use.  This applies to network protocols, character encoding standards, programming languages, and certainly to every known technology found on the web. There are no known cases where a language hasn't gone through at least some minor incompatible change. The standards process is established as a way of evolving specifications and implementations so as to reduce the likelihood of complete failure to interoperate, but certainly not to guarantee that no incompatible changes will be needed in the future.
 12. There was a suggestion that the final "EN" in the PublicIdentifier might be omitted, but that didn't seem to be allowed in the FPI syntax after all, and if we're going to be FPI compatible, might as well pick up the whole thing. That's why "NONSGML" was added too.
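
For reference, the RFC 3151 transcription mentioned in item 10 can be sketched in a few lines of Python. This is an illustration only and not part of the proposal: the function name publicid_to_urn is made up here, and the character table is my reading of RFC 3151. Note that the sketch preserves case, whereas the example in item 10 lower-cases part of the identifier.

```python
# Sketch of the public-identifier -> URN transcription, as I read RFC 3151.
# The function name and table layout are illustrative assumptions.

_SINGLE = {
    " ": "+",    # a space is written as "+"
    "+": "%2B",  # the remaining reserved characters are percent-encoded
    ":": "%3A",
    "/": "%2F",
    ";": "%3B",
    "'": "%27",
    "?": "%3F",
    "#": "%23",
    "%": "%25",
}


def publicid_to_urn(public_id: str) -> str:
    """Transcribe a public identifier into a urn:publicid: URN."""
    # Normalize white space: trim the ends and collapse runs to one space.
    text = " ".join(public_id.split())
    out = []
    i = 0
    while i < len(text):
        pair = text[i:i + 2]
        if pair == "//":      # field separator becomes ":"
            out.append(":")
            i += 2
        elif pair == "::":    # double colon becomes ";"
            out.append(";")
            i += 2
        else:
            out.append(_SINGLE.get(text[i], text[i]))
            i += 1
    return "urn:publicid:" + "".join(out)


print(publicid_to_urn("-//W3C HTMLWG hixie//NONSGML HTML 20100401//EN"))
```

The two-character sequences are checked before the single-character table so that "//" is not mistaken for two encoded slashes.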


See also background document http://www.w3.org/2001/tag/doc/versioning-html/versioning-html-20090611.html "Architectural Considerations for Language Versioning on the Web".

For additional rationale and discussion, see the HTML WG tracker ISSUE-4:  http://www.w3.org/html/wg/tracker/issues/4

________________________________
Impact:

This proposal does not add any new headers or elements to HTML.  It more clearly shows the evolution and reasons for no longer relying on DOCTYPE to affect browser behavior.

This proposal does not require any changes to any browser or HTML interpreter; existing behavior is maintained.

It allows but does not require some validators to perform additional validation, in that there may be additional validation based on the PublicIdentifier or SystemIdentifier.   As behavior does not depend on the DOCTYPE, validating the DOCTYPE is not required.

This proposal allows some HTML documents that were previously conforming to remain conforming.  It also allows the continued use of PublicIdentifier and/or SystemIdentifier DOCTYPEs to be valid in new documents.
________________________________
Specific proposal:

replace section 9.1.1 of the HTML5 specification with:

9.1.1 The DOCTYPE

The DOCTYPE header element is a required element. Originally, when HTML was defined as an application of SGML (see [ISO8879]<http://www.w3.org/TR/html401/references.html#ref-ISO8879>), a valid HTML document declared what version of HTML was used in the document, with a document type declaration which named the document type definition (DTD) in use for the document.  In practice, web authors have not been careful to consistently label versions, and many, if not most, HTML documents on the web do not conform to the DTD that they specify.

It is common for implementations to trigger wildly different behavior ("quirks" modes) due to the presence of specific DOCTYPE declarations, or the absence  of a declaration altogether; see section 9.2.5.4 for details of this behavior.

For these reasons, the DOCTYPE header is REQUIRED for HTML content served as text/html (and optional for content served as an XML media type), but supplying an explicit version indicator is NOT RECOMMENDED except in limited circumstances.

The syntax of the DOCTYPE element is:

<!DOCTYPE html>
<!DOCTYPE html PUBLIC "PublicIdentifier" "SystemIdentifier">
<!DOCTYPE html SYSTEM "about:legacy-compat">


 *   <!DOCTYPE html> is the simplest, recommended form of the DOCTYPE declaration.
 *   The use of public identifiers (required in HTML 4.01) is discouraged in this specification; some public identifiers may trigger different behavior in deployed browsers (Section [#quirks-mode] in this document and [hsvonin]).
 *   The SystemIdentifier is syntactically a URI (not a "URL" or "IRI"). The SystemIdentifier was intended to be a locator for downloading a DTD and entity sets in generic SGML and XML processors, and some XML workflows designed to produce HTML require either a well-known PublicIdentifier or else a SystemIdentifier that can actually be fetched.
 *   The special URI "about:legacy-compat" is reserved for use as a SystemIdentifier in a declaration of the form:
                  <!DOCTYPE html SYSTEM "about:legacy-compat">.

 *   Except for explicitly defined behavior (used to trigger "quirks mode", see section [#parse-behavior], [#quirks-mode] and [hsvonin]), implementations which consume HTML MUST NOT use the DOCTYPE element to trigger different processing behavior.
 *   Implementations which validate HTML content SHOULD use the latest version of this specification to validate against; validating only against older specifications, or only against the indicated version, is likely to be much less useful.  See Section [#validation].
 *   HTML  documents not served as an XML media type MUST include a DOCTYPE header, since many browsers, in the absence of a DOCTYPE header, will trigger a "quirks" mode of rendering.
 *   Documents served as an XML media type MAY include a DOCTYPE header, either to allow compatible content (so-called "polyglot" documents which are both valid HTML and also valid XHTML) or to support version-specific XML processing. While the DOCTYPE header is not required, including it may help in XHTML/HTML crossover.

"html", "PUBLIC" and "SYSTEM" are case insensitive and may have additional spaces around them. The "PublicIdentifier" and "SystemIdentifier" may use either double or single (apostrophe) quote marks.

Note that XML allows additional forms of DOCTYPE declarations which are not allowed here; however, this proposal is compatible with most widely deployed XML software.

In most instances, the simple <!DOCTYPE html> form is all that is required or recommended. The form with the "SYSTEM about:legacy-compat" is provided to allow for XSLT processors.
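
As a rough illustration of the three forms above, here is a minimal Python sketch of a recognizer. It is not part of the proposed spec text, and the name parse_doctype is hypothetical. It assumes the "DOCTYPE" keyword is matched case-insensitively along with "html", "PUBLIC" and "SYSTEM", and it is deliberately lenient in accepting any quoted string after SYSTEM rather than only "about:legacy-compat".

```python
import re
from typing import Optional, Tuple

# A quoted literal using either double or single (apostrophe) quotes.
_QUOTED = r"""(?:"([^"]*)"|'([^']*)')"""

_DOCTYPE_RE = re.compile(
    r"<!DOCTYPE\s+html"
    r"(?:\s+PUBLIC\s+" + _QUOTED + r"\s+" + _QUOTED
    + r"|\s+SYSTEM\s+" + _QUOTED + r")?"
    r"\s*>",
    re.IGNORECASE,
)


def _pick(double_quoted: Optional[str], single_quoted: Optional[str]) -> Optional[str]:
    """Return whichever quoting alternative actually matched."""
    return double_quoted if double_quoted is not None else single_quoted


def parse_doctype(text: str) -> Optional[Tuple[Optional[str], Optional[str]]]:
    """Return (public_id, system_id), or None if no form matches.

    The bare <!DOCTYPE html> form yields (None, None).
    """
    match = _DOCTYPE_RE.fullmatch(text.strip())
    if match is None:
        return None
    pub_dq, pub_sq, sys_dq, sys_sq, only_dq, only_sq = match.groups()
    if pub_dq is not None or pub_sq is not None:
        return _pick(pub_dq, pub_sq), _pick(sys_dq, sys_sq)
    return None, _pick(only_dq, only_sq)


print(parse_doctype("<!DOCTYPE html>"))
print(parse_doctype("<!doctype HTML system 'about:legacy-compat'>"))
```

Consistent with the MUST NOT above, the result is meant only for validators or workflow tools; a consumer of HTML should not branch on it.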

9.1.1.1 Public Identifier

A  PublicIdentifier SHOULD NOT be used unless the content is being managed in a controlled environment where the intended version is known, and the document is well-formed; this might be the case in some XML-based workflows and editing environments, or content management systems and other production workflows.

Even though HTML is no longer being defined as an SGML application, previous versions of HTML were, and so the format of PublicIdentifier was defined to be consistent with Formal Public Identifiers of SGML (http://xml.coverpages.org/tauber-fpi.html).

Until this specification is approved as a W3C recommendation, the PublicIdentifier MAY identify the specification referenced and its date.  The pattern for the PublicIdentifier is simple. The primary template varies only in the date, given in yyyymmdd form:

"-//WHATWG//NONSGML HTML 20100401//EN"                             for the 2010 April 1 version of the WhatWG edition of the specification.
"-//W3C HTMLWG//NONSGML HTML 20100401//EN"                    for the HTML working group editor's draft of the same date.

 If multiple alternative specifications are available in a committee, the draft's or author's nickname or handle may be used to distinguish which specification is being referenced, e.g.,

"-//W3C HTMLWG hixie//NONSGML HTML 20100401//EN"
"-//W3C HTMLWG manu//NONSGML HTML 20100401//EN"

When this specification becomes a W3C Recommendation, and only then, the  PublicIdentifier:
    "-//W3C//NONSGML HTML 5.0//EN"
may be used.

However, HTML documents MUST NOT use "-//W3C//NONSGML HTML 5.0//EN" until the edition of this specification referenced is actually approved and published as a W3C Recommendation.

Note that non-standard behavior may ensue from using any of many well-known Public Identifiers; these were chosen not to trigger any such behavior.
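
The draft pattern above is mechanical enough to generate. A minimal Python sketch follows, assuming the owner string (e.g. "W3C HTMLWG") and the optional nickname are supplied by the caller; the function name draft_public_identifier is hypothetical and not part of the proposal.

```python
from datetime import date
from typing import Optional


def draft_public_identifier(owner: str, draft_date: date,
                            nickname: Optional[str] = None) -> str:
    """Build a draft PublicIdentifier following the pattern shown above.

    owner      -- e.g. "WHATWG" or "W3C HTMLWG"
    draft_date -- date of the draft, rendered as yyyymmdd
    nickname   -- optional draft or author handle, e.g. "hixie"
    """
    owner_field = f"{owner} {nickname}" if nickname else owner
    return f"-//{owner_field}//NONSGML HTML {draft_date:%Y%m%d}//EN"


print(draft_public_identifier("WHATWG", date(2010, 4, 1)))
print(draft_public_identifier("W3C HTMLWG", date(2010, 4, 1), "hixie"))
```

The sketch intentionally has no way to produce "-//W3C//NONSGML HTML 5.0//EN", since that identifier is reserved until the specification is a Recommendation.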

9.1.1.2 PublicIdentifier for compound specifications

Note that a PublicIdentifier only identifies a single specification, not a complete implementation, a suite of specifications, or a combination of vocabularies from multiple specifications. Constructing a PublicIdentifier for such a combination requires publication of an actual specification which describes that combination.

Groups wishing to support the combination of HTML and other specifications may supply short specifications showing how additional vocabularies may be used with HTML; for example, a short document "how to use RDFa with HTML" might be published. (This document would reference RDFa and HTML but not include either specification). In such case, the "+" format might be used:

"-//W3C RDFAWG//NONSGML HTML+RDFa 20100401//EN" might reference the HTML+RDFA document published by the RDFA working group.

The W3C Hypertext coordination group is encouraged to coordinate assignment of public identifiers.
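
To show how a tool might read the "+" format, here is a small Python sketch that splits a PublicIdentifier into its "//"-separated fields and lists the vocabularies named in it. The names PublicId and split_public_identifier are hypothetical, and the assumption that the last word of the description is the date or version follows only from the examples in this proposal.

```python
from typing import List, NamedTuple


class PublicId(NamedTuple):
    registration: str        # "-" in every example in this proposal
    owner: str               # e.g. "W3C RDFAWG"
    text_class: str          # e.g. "NONSGML"
    vocabularies: List[str]  # e.g. ["HTML", "RDFa"]
    version: str             # e.g. "20100401" or "5.0"
    language: str            # e.g. "EN"


def split_public_identifier(public_id: str) -> PublicId:
    """Split an identifier of the shape used in the examples above."""
    parts = public_id.split("//")
    if len(parts) != 4:
        raise ValueError(f"expected four '//'-separated fields: {public_id!r}")
    registration, owner, text, language = parts
    words = text.split()
    if len(words) != 3:
        raise ValueError(f"expected 'CLASS NAME VERSION' in: {text!r}")
    text_class, names, version = words
    return PublicId(registration, owner, text_class,
                    names.split("+"), version, language)


print(split_public_identifier("-//W3C RDFAWG//NONSGML HTML+RDFa 20100401//EN"))
```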

9.1.1.3  SystemIdentifier

The SystemIdentifier is a URL, either relative or absolute.
If no PublicIdentifier is supplied, the effect is to not have a version at all. In this case, the SystemIdentifier "about:legacy-compat" should be used:     <!DOCTYPE html SYSTEM "about:legacy-compat">

If a PublicIdentifier is supplied, the SystemIdentifier may be:


 *   An actual address (URL)  of a DTD and other XML material, as per the XML specification, which can be fetched and used by an XML processor. Note that W3C does not intend to supply or publish any such URLs or DTDs. Note that no current URL used in HTML would occur. This usage should only be used if the URL is actually resolvable.
 *   The empty string, "". This system identifier can be used in situations where there is no fetchable material related to the XML forms, but a specific version indicator is wanted and supplied by the PublicIdentifier.

Received on Sunday, 3 January 2010 00:21:59 UTC