- From: Jukka K. Korpela <jkorpela@cs.tut.fi>
- Date: Thu, 13 Oct 2011 10:31:15 +0300
- To: Andy Schmidt <Andy_Schmidt@HM-Software.com>
- CC: www-validator@w3.org
13.10.2011 08:51, Andy Schmidt wrote:
> I just reproduced the error for you in a test location:
>
> http://validator.w3.org/check?uri=http%3A%2F%2Fwww.mediacenter.studioline.net%2Ftest%2Fbarchetta_sitemap181.xml&charset=%28detect+automatically%29&doctype=Inline&group=0

At the HTTP level, the document
http://www.mediacenter.studioline.net/test/barchetta_sitemap181.xml
is served as text/xml without a charset parameter. This is against a
strong recommendation in RFC 3023 (XML Media Types):

   Although listed as an optional parameter, the use of the charset
   parameter is STRONGLY RECOMMENDED, since this information can be
   used by XML processors to determine authoritatively the character
   encoding of the XML MIME entity.

> I have since found that the problem can be circumvented, by explicitly
> forcing the validator to UTF-8 encoding.

A better way is to make the server specify the media type (MIME type) as
text/xml;charset=utf-8 and not just text/xml (which, by the protocol,
defaults to charset=us-ascii).

> However, not everyone might think of that work-around as it is not at
> all apparent from the cryptic source code error message that the site is
> issuing. That’s why I think it might be a condition that the validator
> should handle more gracefully.

Error reporting is admittedly bad in this case, due to some bug. I tried
to isolate the bug by constructing a page, served as text/xml, internally
declared to be utf-8, and containing non-ASCII characters. I expected the
bug to relate to a failing switch from ASCII to UTF-8 as the encoding,
but that doesn't quite work.

I suspect the bug has to do with the character encoding problem. It seems
that although _browsers_ display the XML file using a UTF-8 interpretation
of the data, the _validator_ applies ASCII. And while it can report the
issue in simple cases, it seems to fail on a page containing very long
lines.
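As a side note, the RFC 3023 default described above can be sketched in a few lines; this is an illustrative snippet (the helper name `effective_charset` is my own, not part of any validator), showing the charset an XML processor should assume for a text/xml response:

```python
# Illustrative sketch: per RFC 3023, a text/xml entity whose Content-Type
# header carries no charset parameter defaults to us-ascii, regardless of
# any encoding declaration inside the XML itself.
from email.message import Message

def effective_charset(content_type: str) -> str:
    """Return the charset a conforming processor assumes for text/xml."""
    msg = Message()
    msg["Content-Type"] = content_type
    # get_param falls back to the failobj when the parameter is absent
    return msg.get_param("charset", failobj="us-ascii")

print(effective_charset("text/xml"))                # us-ascii (the default)
print(effective_charset("text/xml;charset=utf-8"))  # utf-8
```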
I made a copy of the document and inserted a line break before each <url>
tag, and now

http://www.cs.tut.fi/~jkorpela/test/sitemap.txml

does not trigger the bug any more. The diagnostic is still misleading, but
that's a different level of issue:

"Sorry, I am unable to validate this document because on line 427 it
contained one or more bytes that I cannot interpret as us-ascii (in other
words, the bytes found are not valid values in the specified Character
Encoding). Please check both the content of the file and the character
encoding indication.

The error was: ascii "\xC3" does not map to Unicode"

It's fine*) except for the last part. The notation "\xC3" is obscure, but
it apparently means octet C3 (hexadecimal). That octet is not ASCII at
all, and why would mapping to Unicode matter when processing data as
ASCII? But I'm afraid this bogus part of the report comes from the depths
of some software component that cannot be changed.

*) Well, mostly fine. I would not write "I cannot interpret as us-ascii",
as this is not about the ability of some software calling itself "I", but
about an objective impossibility caused, in this case, by the simple
definition of ASCII.

-- 
Yucca, http://www.cs.tut.fi/~jkorpela/
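For what it's worth, the complaint about octet C3 is easy to reproduce outside the validator; this small sketch (my own, making no assumption about the validator's internals) shows why a byte such as 0xC3, the lead byte of many two-byte UTF-8 sequences, cannot be interpreted as ASCII:

```python
# Octet 0xC3 appears as the first byte of UTF-8 sequences like "é" = C3 A9,
# but ASCII only defines values 0x00-0x7F, so decoding it as ASCII must fail.
data = "é".encode("utf-8")       # b'\xc3\xa9'
try:
    data.decode("ascii")
except UnicodeDecodeError as e:
    print(e)                     # 'ascii' codec can't decode byte 0xc3 ...
print(data.decode("utf-8"))      # é
```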
Received on Thursday, 13 October 2011 07:31:59 UTC