Re: 8-bit chars in US-ASCII documents (was Re: Embarrassing typo!)

* Liam Quinn wrote:
>The WDG HTML Validator labels US-ASCII documents as ISO-8859-1 when
>passing off to lq-nsgmls, and so it considers that example document valid.
>And it is valid:
>
>  "An XML document is valid if it has an associated document type
>   declaration and if the document complies with the constraints expressed
>   in it." [1]

>The 8-bit character is an error, but it's an error in a similar way to
>including <a href="foo bar"> in an HTML document.

I don't agree here. XML 1.0 reads:

  "It is a fatal error [2] if an XML entity is determined (via default,
  encoding declaration, or higher-level protocol) to be in a certain
  encoding but contains octet sequences that are not legal in that
  encoding. It is also a fatal error if an XML entity contains no
  encoding declaration and its content is not legal UTF-8 or UTF-16."[1]

I'd say that documents with fatal errors can be neither well-formed nor
valid, but the spec does not say so. Instead, it defines a fatal error as:

  "An error which a conforming XML processor must detect and report to
  the application. After encountering a fatal error, the processor may
  continue processing the data to search for further errors and may
  report such errors to the application. In order to support correction
  of errors, the processor may make unprocessed data from the document
  (with intermingled character data and markup) available to the
  application. Once a fatal error is detected, however, the processor
  must not continue normal processing (i.e., it must not continue to
  pass character data and information about the document's logical
  structure to the application in the normal way)." [2]

Anyway, undecodable documents (and documents with illegal octet
sequences are undecodable) cannot be parsed properly, so they cannot
be checked for validity or well-formedness. IMO, a validator must report
such a fatal error and may refuse further processing.
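
As a rough illustration of the check described above, here is a minimal
Python sketch that looks for the first octet that is not legal in a given
encoding. The function name find_illegal_octet and the sample document are
made up for this example; it relies only on Python's built-in strict
decoding and is not how the WDG validator or lq-nsgmls actually work.

```python
def find_illegal_octet(data: bytes, encoding: str):
    """Return the offset of the first octet that is illegal in `encoding`,
    or None if the whole byte string decodes cleanly."""
    try:
        # Strict decoding raises on the first illegal octet sequence.
        data.decode(encoding, errors="strict")
    except UnicodeDecodeError as err:
        return err.start
    return None


# Hypothetical document: 0xE9 is an 8-bit octet, so it is not US-ASCII.
doc = b"<p>caf\xe9</p>"
print(find_illegal_octet(doc, "us-ascii"))    # 6
print(find_illegal_octet(doc, "iso-8859-1"))  # None
```

Relabelling the same bytes as ISO-8859-1 makes the error disappear, which
is presumably why the example document passes once it is handed over with
that label.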

Btw., as I'm sure you know, this is worse for HTML documents. XML
documents can be encoded in UTF-8 or UTF-16 without declaring it; HTML
documents can't. You must always declare the encoding used, since the
user agent must not assume any default character encoding.
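
The XML default can be sketched in a few lines of Python. This is a
simplified illustration, not a conforming processor: the function name
legal_without_declaration is made up, the entity is assumed to carry no
encoding declaration and no higher-level protocol information, and UTF-16
is recognised only by its byte order mark.

```python
import codecs


def legal_without_declaration(entity: bytes) -> bool:
    """Return True if an XML entity with no encoding declaration is
    legal UTF-16 (signalled by a byte order mark) or legal UTF-8."""
    if entity.startswith((codecs.BOM_UTF16_BE, codecs.BOM_UTF16_LE)):
        encoding = "utf-16"  # the BOM selects the byte order
    else:
        encoding = "utf-8"   # default when nothing else is known
    try:
        entity.decode(encoding)
    except UnicodeDecodeError:
        return False         # this is the fatal-error case
    return True


print(legal_without_declaration(b"<a>\xc3\xa9</a>"))  # True
print(legal_without_declaration(b"<a>\xe9</a>"))      # False
```

The second call fails because a bare 0xE9 octet followed by "<" is not a
legal UTF-8 sequence, even though the same bytes would be fine as
ISO-8859-1.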

[1] http://www.w3.org/TR/REC-xml#NT-EncodingDecl
[2] http://www.w3.org/TR/REC-xml#dt-fatal
-- 
Björn Höhrmann { mailto:bjoern@hoehrmann.de } http://www.bjoernsworld.de
am Badedeich 7 } Telefon: +49(0)4667/981028 { http://bjoern.hoehrmann.de
25899 Dagebüll { PGP Pub. KeyID: 0xA4357E78 } http://www.learn.to/quote/

Received on Sunday, 22 April 2001 18:42:16 UTC