Re: Default encoding for new validator

Christopher R. Maden <crism@maden.org> wrote:

>When unable to detect an encoding, the new validator should use the
>prescribed defaults, which I believe still means ISO8859-1 for text/html
>over HTTP, and UTF-8 or UTF-16 for XHTML documents uploaded directly.

The HTTP specification does indeed specify ISO-8859-1 as the default value
in the absence of a "charset" parameter in the Content-Type header. However,
HTTP and HTML 4.01 are in direct conflict here, as the latter proscribes any
assumption about a default character encoding. And since a file upload is
still an HTTP transaction, although we do not normally think of it that way,
the same applies to any file upload with a text/html media type.
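
To make the conflict concrete, here is a minimal Python sketch. It is not
the validator's code; the function names and the follow_http_default flag
are made up for illustration. It reads the "charset" parameter from a
Content-Type value and, when there is none, either falls back to the HTTP
default or declines to guess, as HTML 4.01 asks.

```python
from email.message import Message

HTTP_DEFAULT = "ISO-8859-1"  # the default HTTP names for text/* types


def charset_from_content_type(value):
    """Return the charset parameter of a Content-Type value, or None."""
    msg = Message()
    msg["Content-Type"] = value
    return msg.get_param("charset")


def effective_charset(content_type, follow_http_default):
    """Pick a charset; the flag is a hypothetical policy switch."""
    declared = charset_from_content_type(content_type)
    if declared:
        return declared
    # No charset parameter: HTTP says ISO-8859-1, HTML 4.01 says do not assume.
    return HTTP_DEFAULT if follow_http_default else None
```

With follow_http_default set to False, a bare "text/html" yields None, which
corresponds to the "unable to detect" outcome.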


>With the simple interface, validating <URL: http://crism.maden.org/ >
>reports that it is unable to detect the encoding, including using
>Appendix F of XML 1.0.  Using Appendix F is inappropriate for a document
>delivered over HTTP, since the HTTP headers take precedence (and thus it
>should be interpreted as ISO8859-1), but even so, using the Appendix F
>algorithm should result in a determination of UTF-8.  Either way, since
>this page is 7-bit ASCII, the validation ought to work.

The algorithm in Appendix F of the XML Recommendation describes ways to
attempt to automatically detect the character encoding in use in the
absence of information from a higher-level protocol. Since the HTTP
transaction contained no encoding information, we attempted the Appendix F
algorithm. That algorithm, however, is intended for XML, and as such it
requires the presence of either a Unicode Byte Order Mark or an XML
Declaration. In particular, if there is no BOM, we look for the bit
patterns that represent the characters "<?xml" in various encodings.
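
The check described above can be sketched roughly as follows. This is an
illustrative Python approximation, not the validator's implementation: the
function name sniff_encoding and the returned labels are assumptions, and
only the first four bytes of "<?xml" are compared. It looks for a BOM first,
then for the start of an XML Declaration in a few encoding families, and
gives up if it finds neither.

```python
import codecs

# BOMs, longest first: the UTF-32LE BOM begins with the UTF-16LE BOM.
BOMS = [
    (codecs.BOM_UTF32_BE, "UTF-32BE"),
    (codecs.BOM_UTF32_LE, "UTF-32LE"),
    (codecs.BOM_UTF8, "UTF-8"),
    (codecs.BOM_UTF16_BE, "UTF-16BE"),
    (codecs.BOM_UTF16_LE, "UTF-16LE"),
]

# Leading bytes of "<?xml" as they appear in each encoding family.
XML_DECL = [
    (b"\x00\x00\x00<", "UTF-32BE"),
    (b"<\x00\x00\x00", "UTF-32LE"),
    (b"\x00<\x00?", "UTF-16BE"),
    (b"<\x00?\x00", "UTF-16LE"),
    (b"<?xm", "ASCII-compatible"),  # read the declaration for the real name
    (b"\x4c\x6f\xa7\x94", "EBCDIC"),
]


def sniff_encoding(data):
    """Return an encoding family guessed from the first bytes, or None."""
    for prefix, name in BOMS + XML_DECL:
        if data.startswith(prefix):
            return name
    return None  # neither a BOM nor an XML Declaration was found
```

Under these assumptions, a 7-bit ASCII page that starts with a DOCTYPE or
an <html> tag matches nothing in either table, so the function returns None.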


>The new service looks great, though.

Thanks. :-)

-- 
Interviewer: "In what language do you write your algorithms?"
    Abigail: English.
Interviewer: "What would you do if, say, Telnet didn't work?"
    Abigail: Look at the error message.

Received on Saturday, 26 October 2002 13:43:12 UTC