- From: Jukka K. Korpela <jkorpela@cs.tut.fi>
- Date: Thu, 13 Oct 2011 10:31:15 +0300
- To: Andy Schmidt <Andy_Schmidt@HM-Software.com>
- CC: www-validator@w3.org
13.10.2011 08:51, Andy Schmidt wrote:
> I just reproduced the error for you in a test location:
>
> http://validator.w3.org/check?uri=http%3A%2F%2Fwww.mediacenter.studioline.net%2Ftest%2Fbarchetta_sitemap181.xml&charset=%28detect+automatically%29&doctype=Inline&group=0

At the HTTP level, the document
http://www.mediacenter.studioline.net/test/barchetta_sitemap181.xml
is served as text/xml without a charset parameter. This is against a
strong recommendation in RFC 3023 (XML Media Types):

   Although listed as an optional parameter, the use of the charset
   parameter is STRONGLY RECOMMENDED, since this information can be
   used by XML processors to determine authoritatively the character
   encoding of the XML MIME entity.

> I have since found that the problem can be circumvented, by explicitly
> forcing the validator to UTF-8 encoding.

A better way is to make the server specify the media type (MIME type) as
text/xml;charset=utf-8 and not just text/xml (which, by the protocol,
defaults to charset=us-ascii).

> However, not everyone might think of that work-around as it is not at
> all apparent from the cryptic source code error message that the site is
> issuing. That’s why I think it might be a condition that the validator
> should handle more gracefully.

Error reporting is admittedly bad in this case, due to some bug. I tried
to isolate the bug by constructing a page, served as text/xml, internally
declared to be utf-8, and containing non-ASCII characters. I expected the
bug to relate to a failing switch from ASCII to UTF-8 as the encoding,
but that doesn't quite work.

I suspect the bug has to do with the character encoding problem. It seems
that although _browsers_ display the XML file using a UTF-8 interpretation
of the data, the _validator_ applies ASCII. And while it can report the
issue in simple cases, it seems to fail on a page containing very long
lines.
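As a side note, the RFC 3023 default described above can be sketched in a few lines; this is an illustrative snippet (the helper name `effective_charset` is my own, not part of any validator), showing the charset an XML processor should assume for a text/xml response:

```python
# Illustrative sketch: per RFC 3023, a text/xml entity whose Content-Type
# header carries no charset parameter defaults to us-ascii, regardless of
# any encoding declaration inside the XML itself.
from email.message import Message

def effective_charset(content_type: str) -> str:
    """Return the charset a conforming processor assumes for text/xml."""
    msg = Message()
    msg["Content-Type"] = content_type
    # get_param falls back to the failobj when the parameter is absent
    return msg.get_param("charset", failobj="us-ascii")

print(effective_charset("text/xml"))                # us-ascii (the default)
print(effective_charset("text/xml;charset=utf-8"))  # utf-8
```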
I made a copy of the document and inserted a line break before each <url>
tag, and now

http://www.cs.tut.fi/~jkorpela/test/sitemap.txml

does not trigger the bug any more. The diagnostic is still misleading, but
that's a different level of issue:

"Sorry, I am unable to validate this document because on line 427 it
contained one or more bytes that I cannot interpret as us-ascii (in other
words, the bytes found are not valid values in the specified Character
Encoding). Please check both the content of the file and the character
encoding indication.

The error was: ascii "\xC3" does not map to Unicode"

It's fine*) except for the last part. The notation "\xC3" is obscure, but
it apparently means octet C3 (hexadecimal). That octet is not ASCII at
all, and why would mapping to Unicode matter when processing data as
ASCII? But I'm afraid this bogus part of the report comes from the depths
of some software component that cannot be changed.

*) Well, mostly fine. I would not write "I cannot interpret as us-ascii",
as this is not about the ability of some software calling itself "I", but
about an objective impossibility caused, in this case, by the simple
definition of ASCII.

-- 
Yucca, http://www.cs.tut.fi/~jkorpela/
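For what it's worth, the complaint about octet C3 is easy to reproduce outside the validator; this small sketch (my own, making no assumption about the validator's internals) shows why a byte such as 0xC3, the lead byte of many two-byte UTF-8 sequences, cannot be interpreted as ASCII:

```python
# Octet 0xC3 appears as the first byte of UTF-8 sequences like "é" = C3 A9,
# but ASCII only defines values 0x00-0x7F, so decoding it as ASCII must fail.
data = "é".encode("utf-8")       # b'\xc3\xa9'
try:
    data.decode("ascii")
except UnicodeDecodeError as e:
    print(e)                     # 'ascii' codec can't decode byte 0xc3 ...
print(data.decode("utf-8"))      # é
```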
Received on Thursday, 13 October 2011 07:31:59 UTC