Re: Validator tests "charset" parameter of server or browser, not only the "charset" parameter of the XML

Dear Olivier, Dear Lachlan

Many thanks for your detailed answers. I've read them carefully.

So the validity of markup also depends on the character set information 
provided by the external transport protocol, here HTTP. I didn't 
know this before.

From this point of view, my objection is now only an objection to the 
usability of your service, as sketched below.

Let's assume I upload the following file via web browser to your 
service. It's an XHTML file with a special character ('ü') in its body.

EXAMPLE1.HTML
<?xml version="1.0" encoding="iso-8859-1"?>
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN"
     "http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd">
<html xmlns="http://www.w3.org/1999/xhtml">
<head>
 <meta http-equiv="content-type" content="text/html; charset=iso-8859-1" />
 <title>Example1</title>
</head>
<body><p>ü</p></body>
</html>

If the browser doesn't send an HTTP charset parameter, your validation 
service reports
   "the bytes found are not valid values in the specified Character 
Encoding"

That's completely correct, but it is only part of the information I 
would like to have. Users of your service may often be interested in the 
validity of the markup when no character set is provided by any 
external transport protocol, because their applications read files 
directly from the filesystem. Because your service is a web service, 
HTTP is always involved. For such users, your service is therefore only 
of limited use.

As a user of the service, I would like to be informed about the validity 
of uploaded documents under BOTH assumptions: that the HTTP character 
set information is available, and that it is not.
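
To make the request concrete, here is a minimal sketch in Python of the 
kind of two-way report I have in mind. The function name 
`decoding_report` and the use of "us-ascii" as the HTTP-side default are 
my own assumptions for illustration (the default is taken from the RFC 
3023 note quoted below for text/xml). It only checks whether the bytes 
decode at all; it is not how the validator is implemented.

```python
# Hypothetical helper: decode the same bytes under two assumptions and
# record, for each one, whether the bytes are valid in that charset.
def decoding_report(raw, declared, http_default="us-ascii"):
    report = {}
    for label, charset in (("http_default", http_default),
                           ("document_declaration", declared)):
        try:
            raw.decode(charset)
            report[label] = (charset, True)
        except UnicodeDecodeError:
            report[label] = (charset, False)
    return report

# The body of EXAMPLE1.HTML saved as ISO-8859-1: u-umlaut (U+00FC)
# becomes the single byte 0xFC, which is not a valid US-ASCII byte.
body = "<body><p>\u00fc</p></body>".encode("iso-8859-1")
result = decoding_report(body, declared="iso-8859-1")
for label, (charset, ok) in result.items():
    print(label, charset, "decodes" if ok else "does not decode")
```

A report of this shape would tell me both that the upload fails under 
the transport default and that it is fine under its own declaration.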

Regards
Rodrigo


Olivier Thereaux wrote:
 > On Thu, Jun 23, 2005, Rodrigo Witzel wrote:
 >
 >>  "Note:The HTTP Content-Type header sent by your web browser (unknown)
 >>did not contain a "charset" parameter, but the Content-Type was one of
 >>the XML text/* sub-types (text/xml). The relevant specification (RFC
 >>3023) specifies a strong default of "us-ascii" for such documents so we
 >>will use this value regardless of any encoding you may have indicated
 >>elsewhere. ..."
 >>
 >>This irritation may be caused by the misleading title of your website.
 >>It's "Markup Validation Service".
 >
 >
 > I am not sure I understand your concern here (having a sample URI would
 > help figure out exactly the issue).
 >
 >  Yes, it is a markup validation service, and markup specifications
 > have rules related to the context in which the markup is served (HTTP,
 > content-type aka media type, etc.).
 >
 > For example:
 > * if your content is served with no charset declaration (in either the
 > HTTP header or the markup itself), then the charset used to parse
 > your document is the default one for the content-type used
 > * if your content is served at the HTTP level with charset A, and the
 > document itself declares charset B, then agents (including the
 > validator) are supposed to use A, because HTTP has precedence.
 > * etc.
 >
 > The rules quickly described above mean that if your server does
 > something wrong in serving content that would be otherwise valid, it is
 > nevertheless wrong.
 >

Lachlan Hunt wrote:

> Rodrigo Witzel wrote:
> 
>> ... the Content-Type was one of
>> the XML text/* sub-types (text/xml). The relevant specification (RFC 
>> 3023) specifies a strong default of "us-ascii" for such documents so 
>> we will use this value regardless of any encoding you may have 
>> indicated elsewhere. ..."
>>
>> As a matter of fact, your website tests BOTH the markup and the 
>> behaviour of my web server. Or even worse, it refuses to test my 
>> markup if my server fails the test. If my XML is valid, the test 
>> should be passed even though my server doesn't fulfil any other 
>> requirements.
> 
> 
> How can the validator possibly validate your document if it does not 
> know which character encoding to use to read the file?  If it's not 
> correctly specified, it must default to something, which may result in 
> errors being reported that would not be present had the validator known 
> the correct encoding.
> 
> Say, for example, your document was encoded as UTF-8 and contained 
> characters outside of the US-ASCII subset; yet because your server 
> declared the content-type as text/xml but did not indicate the encoding 
> with a charset parameter, the validator *must* follow the rules 
> specified in RFC 3023 and  parse the file as though it were encoded in 
> US-ASCII.  However, because your document contained characters outside 
> of the US-ASCII subset, the validator would issue a well-formedness 
> error and your document would not validate, even though it would 
> validate if it were parsed as UTF-8.
> 
> The moral of the story is to either specify the encoding with a charset 
> parameter, if you are going to continue using text/xml; but note that 
> for this reason, it is not recommended that you use text/* media types 
> for XML documents.
> 
> The alternative is to use application/xml, application/xhtml+xml or 
> other appropriate application/*+xml media type.  The validator will then 
> obey the encoding declared in the XML declaration, if present, or 
> default to UTF-8 or UTF-16, as described in the XML Recommendation, based 
> on the presence (or absence) of the Byte Order Mark.
> 
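
The charset-selection rules described in the two replies above can be 
sketched roughly as follows. This is an illustrative simplification in 
Python, not the validator's code: the name `effective_charset` is 
hypothetical, only the media types mentioned in this thread are 
distinguished, and the XML declaration is only read when it is written 
in an ASCII-compatible encoding.

```python
import re

# Matches an encoding pseudo-attribute in a leading XML declaration.
_XML_DECL = re.compile(
    rb'^<\?xml[^>]*?encoding\s*=\s*["\']([A-Za-z][A-Za-z0-9._-]*)["\']')

def effective_charset(media_type, http_charset, raw):
    """Pick the charset a consumer would use, per the rules quoted above."""
    # 1. An explicit HTTP charset parameter has precedence.
    if http_charset:
        return http_charset.lower()
    # 2. text/xml without a charset: strong default of us-ascii (RFC 3023).
    if media_type.lower() == "text/xml":
        return "us-ascii"
    # 3. Assumption: every other type is treated like application/xml,
    #    so the XML declaration is honoured if present.
    text = raw[3:] if raw.startswith(b"\xef\xbb\xbf") else raw
    match = _XML_DECL.match(text)
    if match:
        return match.group(1).decode("ascii").lower()
    # 4. No declaration: a UTF-16 byte order mark selects UTF-16,
    #    otherwise fall back to UTF-8.
    if raw.startswith((b"\xff\xfe", b"\xfe\xff")):
        return "utf-16"
    return "utf-8"

doc = b'<?xml version="1.0" encoding="iso-8859-1"?>\n<html/>'
print(effective_charset("text/xml", None, doc))
print(effective_charset("application/xhtml+xml", None, doc))
```

With the same document bytes, the two calls differ only in the media 
type, which is the situation discussed in this thread.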

Received on Friday, 24 June 2005 15:38:04 UTC