Re: Validator tests "charset" parameter of server or browser, not only the "charset" parameter of the XML

Rodrigo Witzel wrote:
> ... the Content-Type was one of
> the XML text/* sub-types (text/xml). The relevant specification (RFC 
> 3023) specifies a strong default of "us-ascii" for such documents so we 
> will use this value regardless of any encoding you may have indicated 
> elsewhere. ..."
> As a matter of fact, your website tests BOTH the markup and the 
> behaviour of my web server. Or even worse, it refuses to test my markup 
> if my server fails the test. If my XML is valid, the test should be 
> passed even though my server doesn't fulfil any other requirements.

How can the validator possibly validate your document if it does not 
know which character encoding to use to read the file?  If it's not 
correctly specified, it must default to something, which may result in 
errors being reported that would not be present had the validator known 
the correct encoding.

Say, for example, your document was encoded as UTF-8 and contained 
characters outside of the US-ASCII subset; yet because your server 
declared the content-type as text/xml but did not indicate the encoding 
with a charset parameter, the validator *must* follow the rules 
specified in RFC 3023 and parse the file as though it were encoded in 
US-ASCII.  However, because your document contained characters outside 
of the US-ASCII subset, the validator would issue a well-formedness 
error and your document would not validate, even though it would 
validate if it were parsed as UTF-8.
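
As a rough illustration of that failure, here is a minimal Python sketch. 
It is not the validator's actual code; the sample document and the helper 
name can_decode are made up for this example.  It only shows that the 
same bytes decode cleanly as UTF-8 but not as US-ASCII.

```python
# Hypothetical document body: UTF-8 bytes containing one non-ASCII character.
body = '<?xml version="1.0" encoding="UTF-8"?><p>café</p>'.encode("utf-8")


def can_decode(data: bytes, encoding: str) -> bool:
    """Return True if the bytes decode without error in the given encoding."""
    try:
        data.decode(encoding)
        return True
    except UnicodeDecodeError:
        return False


# The same bytes, read under the two encodings discussed above.
utf8_ok = can_decode(body, "utf-8")
ascii_ok = can_decode(body, "us-ascii")
```

Here utf8_ok is true and ascii_ok is false: the two bytes that encode 
"é" in UTF-8 both fall outside the 7-bit US-ASCII range.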

The moral of the story is to specify the encoding with a charset 
parameter if you are going to continue using text/xml; but note that, 
for this reason, it is not recommended that you use text/* media types 
for XML documents.

The alternative is to use application/xml, application/xhtml+xml or 
another appropriate application/*+xml media type.  The validator will 
then obey the encoding declared in the XML declaration, if present, or 
default to UTF-8 or UTF-16, as described in the XML Recommendation, 
based on the presence (or absence) of the Byte Order Mark.
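
The decision described above could be sketched roughly as follows, in 
Python.  This is a simplified illustration and not the validator's 
implementation: the function name pick_encoding is hypothetical, it 
assumes the document is XML, it lets a charset parameter take precedence 
whenever one is present, and it ignores cases such as UTF-16 without a 
Byte Order Mark.

```python
import codecs
import re


def pick_encoding(content_type: str, body: bytes) -> str:
    """Guess which encoding to use for an XML document (simplified sketch)."""
    media_type, _, params = content_type.partition(";")
    media_type = media_type.strip().lower()

    # An explicit charset parameter on the Content-Type header wins.
    charset = re.search(r'charset\s*=\s*"?([^\s";]+)"?', params, re.I)
    if charset:
        return charset.group(1).lower()

    # text/* XML types without a charset parameter default to US-ASCII,
    # regardless of what the document itself declares.
    if media_type.startswith("text/"):
        return "us-ascii"

    # application/xml and application/*+xml: look at the document itself.
    if body.startswith((codecs.BOM_UTF16_LE, codecs.BOM_UTF16_BE)):
        return "utf-16"
    decl = re.match(
        rb'(?:\xef\xbb\xbf)?<\?xml[^>]*?encoding\s*=\s*'
        rb'["\']([A-Za-z][A-Za-z0-9._-]*)["\']',
        body,
    )
    if decl:
        return decl.group(1).decode("ascii").lower()
    return "utf-8"
```

For example, pick_encoding("text/xml", body) gives "us-ascii" even when 
the body declares encoding="UTF-8", whereas the same body served as 
application/xml gives "utf-8".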

Lachlan Hunt

Received on Friday, 24 June 2005 11:08:24 UTC