- From: Rick JELLIFFE <ricko@geotempo.com>
- Date: Thu, 25 May 2000 04:04:41 +0800
- To: xml-editor@w3.org, "xml-dev@xml.org" <xml-dev@xml.org>
John Cowan wrote:

> Issue PE28:
>
> Currently the XML Recommendation is silent about the handling of
> documents that contain "impossible" bytes. For example, the byte 0xFF
> cannot appear in any UTF-8 encoded document. We are considering making
> such violations of the encoding a fatal error.
>
> PRO: an improperly encoded document is not really a text document at all;
> nothing should be done on the basis of it. XML's draconian error handling rule
> should lead to a "fatal error", which means the rest of the document must
> not be parsed.
>
> CON: Some parsers may be relying on libraries supplied by the OS, which may
> not properly signal erroneous input. Is it too great a burden on the
> parser implementor to impose this restriction?

I think this goes too far for basic WF.

Instead, I would propose another level of validity, "character validity",
which XML processors should be encouraged, but not required, to support,
or to support as much as they can.

Unlike validity, which sits on top of well-formedness, "character validity"
sits more-or-less underneath well-formedness, as XML's soft underbelly.

An XML document that was "character valid" would
  1) not have any impossible bytes in any entity
  2) not have a BOM if the encoding="utf16le" or "utf16be" (and
     would meet any other encoding constraints)
  3) have all names in markup follow the NAMECHAR conventions
  4) have all data Unicode-normalized

This would keep a basic XML implementation that did not support
"character validity" simple:
  1) it can use any library for transcoding
  2) it does not have to have any special BOM handling for utf16xe
  3) it can tokenize tags based on whitespace and delimiters rather
     than NAMECHAR or NAMESTRT
  4) it does not have to check or enforce normalization

A character-validating processor should be the goal for any XML processor
not specifically aimed at ultra-lightweight uses.

Rick Jelliffe
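
To make the proposal above concrete, here is a minimal Python sketch of what checks 1, 2 and 4 of "character validity" might look like for a single entity. It is only an illustration under assumptions: the function name character_validity_problems is hypothetical, NFC is assumed as the normalization form (the proposal does not name one), and check 3 (NAMECHAR conformance of names) is left out because it needs a tokenizer.

```python
import codecs
import unicodedata


def character_validity_problems(data: bytes, encoding: str) -> list[str]:
    """Return a list of "character validity" problems found in one entity.

    Hypothetical helper: covers impossible bytes, a BOM under an explicit
    byte-order encoding, and (assumed) NFC normalization only.
    """
    problems = []
    # Canonical codec name as Python reports it, e.g. "utf-8" or "utf-16-le".
    enc = codecs.lookup(encoding).name

    # Check 2: no BOM when the byte order is already stated by the encoding.
    if enc in ("utf-16-le", "utf-16-be"):
        bom = codecs.BOM_UTF16_LE if enc == "utf-16-le" else codecs.BOM_UTF16_BE
        if data.startswith(bom):
            problems.append("BOM present despite explicit byte order")

    # Check 1: strict decoding rejects byte sequences the encoding cannot contain.
    try:
        text = data.decode(enc, errors="strict")
    except UnicodeDecodeError as exc:
        problems.append(f"impossible byte sequence at offset {exc.start}")
        return problems  # the text is unavailable, so later checks cannot run

    # Check 4: data should already be normalized (NFC is an assumption here).
    if not unicodedata.is_normalized("NFC", text):
        problems.append("text is not NFC-normalized")

    return problems
```

A lightweight processor in the sense of the proposal would simply never call such a function, while a character-validating processor would run it on every entity before parsing and decide for itself how to report the problems.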
Received on Wednesday, 24 May 2000 15:56:25 UTC