[Bug 4372] [Serialization] Lexical checking of doctype-public from bugzilla@wiggum.w3.org on 2007-03-15 (public-qt-comments@w3.org from March 2007)

From: <bugzilla@wiggum.w3.org>
Date: Thu, 15 Mar 2007 18:20:09 +0000
To: public-qt-comments@w3.org
CC:
Message-Id: <E1HRuYb-0004By-Oh@wiggum.w3.org>

http://www.w3.org/Bugs/Public/show_bug.cgi?id=4372

------- Comment #2 from mike@saxonica.com  2007-03-15 18:20 -------
The relevant rules for XML appear to be:

[12]    PubidLiteral ::= '"' PubidChar* '"' | "'" (PubidChar - "'")* "'"
[13]    PubidChar    ::= #x20 | #xD | #xA | [a-zA-Z0-9] | [-'()+,./:=?;!*#@$_%]

and I think it's fairly straightforward for us to add a rule to the
serialization spec that says it's an error if doctype-public doesn't conform to
this syntax.
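
To make the proposed check concrete, here is a minimal sketch in Python. It
assumes only production [13] above; the function name is_valid_doctype_public
is hypothetical and not taken from any spec or processor. Since a double quote
is not a PubidChar, a serializer that always delimits the literal with double
quotes never needs the (PubidChar - "'") branch of production [12].

```python
import re

# Character class transcribed from XML 1.0 production [13] PubidChar:
# space, CR, LF, ASCII letters and digits, and the listed punctuation.
_PUBID_CHARS = re.compile(r"[ \r\na-zA-Z0-9\-'()+,./:=?;!*#@$_%]*")


def is_valid_doctype_public(value: str) -> bool:
    """Return True if every character of value is a PubidChar.

    Hypothetical helper: this checks lexical form only and does not
    verify that the value is a registered or meaningful identifier.
    """
    return _PUBID_CHARS.fullmatch(value) is not None
```

With this sketch, "-//W3C//DTD HTML 4.01//EN" is accepted, while a value
containing a double quote, an angle bracket, a tab, or a non-ASCII character is
rejected. A misspelt but lexically well-formed identifier is still accepted,
because only the character set is checked.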

The more difficult question is what to do about HTML. In principle we could
require that the doctype-public is one of the official FPIs appearing in the
HTML recommendation, for example "-//W3C//DTD HTML 4.01//EN". However, that
would almost certainly break a lot of existing stylesheets, since there is
probably a lot of code getting away with undetected typos in such a
string. Arguably XSLT processors should tell people when they are generating
bad HTML, but I personally don't want to be the one in the firing line on this:
although we could have done it earlier, it's a bad candidate for an erratum.
Also, it's not future-proof: we don't know what FPIs will be allowed in future
versions of HTML. 

I think my preference would be that we impose the same rules for HTML as we do
for XML - that is, a simple restriction on the permitted character set.

Received on Thursday, 15 March 2007 18:20:28 UTC