Re: Encoding Requested Even When XML Decl. Is Omitted

"Sean B. Palmer" <sean@mysterylights.com> wrote:

> The W3C Validator begs for a character encoding even when the page in
> question is being validated as XML, and the XML declaration is
> missing. According to the XML specification, if a declaration is
> missing, then the encoding is either UTF-8 (and possibly its subset,
> US-ASCII) or UTF-16.

That depends on the media type used.  The above rule in XML 1.0 only
applies "in the absence of information provided by an external transport
protocol (e.g. HTTP or MIME)".  According to RFC 3023, if an entity
is received with the charset parameter omitted, the default charset
value is "us-ascii" in the case of "text/xml", and there's no default
value in the case of "application/xml" (thus the default rule in XML 1.0
applies).  In both cases, a charset parameter in the HTTP Content-Type
response header, when present, takes precedence.
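
To make the rule concrete, here is a rough Python sketch of the decision
as I read it from RFC 3023 and XML 1.0.  The function name
default_xml_charset and the convention of returning None are made up for
illustration; they are not part of any specification or of the validator.

```python
from typing import Optional


def default_xml_charset(media_type: str,
                        charset_param: Optional[str] = None) -> Optional[str]:
    """Return the charset implied by the transport layer.

    None means the transport says nothing, so the XML 1.0 rules apply
    (byte order mark / encoding declaration, otherwise UTF-8 or UTF-16).
    Hypothetical helper: only the two media types discussed here are handled.
    """
    # An explicit charset parameter takes precedence in both cases.
    if charset_param:
        return charset_param.strip().lower()

    media_type = media_type.strip().lower()
    if media_type == "text/xml":
        # RFC 3023: charset omitted on text/xml defaults to us-ascii.
        return "us-ascii"
    if media_type == "application/xml":
        # RFC 3023: no default, so fall back to the XML 1.0 rules.
        return None
    raise ValueError("media type not covered by this sketch: " + media_type)
```

So default_xml_charset("text/xml") yields "us-ascii", while
default_xml_charset("application/xml") yields None and leaves the
decision to the XML processor.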

> I know that there is much room for debate in this area (given section
> 6 in RFC 2854)... but it seems to me that the validator should be able
> to gauge the character encoding of an XHTML document without an XML
> declaration.

If you serve an XHTML document as "text/html", then I strongly recommend
that you never rely on the default rule (which is extremely messy) and
always provide explicit charset information.

As a side note, it seems erroneous for the validator NOT to report a
well-formedness error when a UTF-8 document that does include
characters above the Basic Latin range is served as "text/xml" without
an explicit charset parameter.
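
A small Python illustration of why such a document cannot be read under
the "us-ascii" default: the bytes that UTF-8 uses for characters above
Basic Latin are simply not valid US-ASCII.  The helper decodes_as and
the sample document are invented for this example, and this only checks
byte-level decoding, not what the validator actually does.

```python
def decodes_as(data: bytes, charset: str) -> bool:
    """Report whether the byte sequence is valid in the given charset."""
    try:
        data.decode(charset)
    except UnicodeDecodeError:
        return False
    return True


# Hypothetical document containing U+00E9, which is outside Basic Latin.
doc = "<p>caf\u00e9</p>".encode("utf-8")

ok_as_ascii = decodes_as(doc, "us-ascii")  # False: bytes above 0x7F present
ok_as_utf8 = decodes_as(doc, "utf-8")      # True
```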

Regards,
-- 
Masayasu Ishikawa / mimasa@w3.org
W3C - World Wide Web Consortium

Received on Sunday, 3 March 2002 21:30:27 UTC