Re: Problems validating XML from Martin Duerst on 2007-05-30 (www-validator@w3.org from May 2007)

From: Martin Duerst <duerst@it.aoyama.ac.jp>
Date: Wed, 30 May 2007 18:22:19 +0900
To: olivier Thereaux <ot@w3.org>
Cc: www-validator@w3.org
Message-Id: <6.0.0.20.2.20070530175746.0445bc20@localhost>

Hello Olivier,

At 13:41 07/05/30, olivier Thereaux wrote:

>Many of the issues you raise are already in bugzilla, or have been  
>discussed in the past few days and fixed in the dev version.

Great!

>On May 29, 2007, at 17:19 , Martin Duerst wrote:
>
>> I used the data/file that you can find at
>> http://www.sw.it.aoyama.ac.jp/2007/PB1/examples/test.xml
>>
>> With 'direct input' at validator.w3.org, I get
>> "This page is not Valid (no Doctype found)!".
>
>Your document uses a custom document type, not in the validator's  
>catalogue.
>And without a media type to help (because you are using the direct  
>input mode) there is no unambiguous way to determine whether to use  
>XML or SGML parsing modes. The errors you get are, I believe,  
>cascading from the fallback to SGML mode, when your DTD elements are  
>XML.

Okay.

>This is a known and documented issue:
>http://www.w3.org/Bugs/Public/show_bug.cgi?id=1391
>
>It has been argued that an XML declaration should be a good enough  
>trigger, but others (Hixie among others, I believe) have disagreed,  
>as it also happens to be a valid SGML PI.

Well, yes, it happens to be a valid SGML PI, of course, because
XML is designed to work with SGML tools, with a particular SGML
declaration. 

>Generally speaking, the validator isn't the most adapted tool for  
>checking XML documents with home-made DTDs, particularly with the  
>Direct Input method. We'd like to make it better in this regard, but  
>that is not a priority. If you want to submit patches to make it  
>better in this regard, without being detrimental to its main job,

I can definitely submit a patch that goes into XML mode if an
XML declaration is present. I don't consider this as being
detrimental to the validator's job, quite to the contrary.
If that's not what you mean, please tell me.
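
To make the idea concrete, here is a minimal sketch in Python (for
illustration only; it is not validator code and not a patch). The names
looks_like_xml and choose_parse_mode, the regular expression, and the
rule that a media type wins when present are all assumptions of this
sketch. It handles only ASCII-compatible encodings with an optional
UTF-8 BOM.

```python
import re
from typing import Optional

# Hypothetical helper, not part of the validator: matches an XML declaration
# at the very start of the document, after an optional UTF-8 BOM.
_XML_DECL = re.compile(rb'^(?:\xef\xbb\xbf)?<\?xml\s+version\s*=\s*(["\'])1\.\d+\1')


def looks_like_xml(document: bytes) -> bool:
    """Return True if the document starts with an XML declaration."""
    return _XML_DECL.match(document) is not None


def choose_parse_mode(document: bytes, media_type: Optional[str] = None) -> str:
    """Pick 'xml' or 'sgml' parsing mode.

    Assumption of this sketch: a media type, when available, decides;
    without one (direct input), an XML declaration triggers XML mode
    and everything else falls back to SGML mode.
    """
    if media_type is not None:
        mt = media_type.split(";")[0].strip().lower()
        if mt in ("text/xml", "application/xml") or mt.endswith("+xml"):
            return "xml"
        return "sgml"
    return "xml" if looks_like_xml(document) else "sgml"
```

With direct input there is no media type, so only the second branch
applies: a document that starts with an XML declaration is parsed as
XML, and anything else is handled as before.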

>I believe you're familiar with the code,

Well, that was quite some time ago, and a lot of work has
gone into the validator since, but to some extent, yes.

>and you even have CVS commit access...

I didn't know that, but I'll try to make use of it.
The main problem will not be the validator code, but CVS;
getting from Subversion back to CVS is a pain.

>> Oh well, there was no doctype? I guess the validator is blind, or what?
>
>That tone is inappropriate. An aggressive or sarcastic tone isn't  
>much welcome on this public list (or you'd better be coming with  
>perfect patches to compensate).

I totally agree if such a tone was targeted at a person.
Even in the above case, I was probably a bit too direct, and
I apologize. But I guess that's just about how the average
validator user would react.

>> And if I tell it to use some preset doctype only if the
>> doctype is missing, it still tells me that the doctype
>> is missing, so it doesn't look like the "use Doctype"
>> setting in the Options is any good.
>
>This has been fixed in the dev version, soon to be beta2.
>http://qa-dev.w3.org/wmvs/HEAD/

Great to know, thanks.

>> Next, I tried with a DTD located relative to the xml file.
>
>We don't do relative SIs. Yet.
>http://www.w3.org/Bugs/Public/show_bug.cgi?id=1521

If that can be handled in the validator code, I'll try to
submit a patch. But it might take a while.
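
As a rough sketch of what handling a relative system identifier
involves (Python, for illustration only; the function name
resolve_system_id and the DTD file name test.dtd in the usage note are
hypothetical, and this is not how the validator currently works):

```python
from urllib.parse import urljoin


def resolve_system_id(document_uri: str, system_id: str) -> str:
    """Resolve a possibly relative system identifier against the document URI.

    An absolute system identifier is returned unchanged.
    """
    return urljoin(document_uri, system_id)
```

For a document at
http://www.sw.it.aoyama.ac.jp/2007/PB1/examples/test.xml and a system
identifier of "test.dtd", this yields
http://www.sw.it.aoyama.ac.jp/2007/PB1/examples/test.dtd. This needs a
base URI, so it can only work for URI input, not for direct input.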

>> Next I tried with a file with some actual non-ASCII characters.
>> http://www.sw.it.aoyama.ac.jp/2007/PB1/examples/test-UTF-8.xml.
>[...]
>> However, the results on the beta validator are detrimental. I get:
>>    Sorry! This document can not be checked.
>>
>>    Sorry, I am unable to validate this document because on line 0 it
>>    contained one or more bytes that I cannot interpret as us-ascii
>>    (in other words, the bytes found are not valid values in the specified
>>    Character Encoding). Please check both the content of the file and the
>>    character encoding indication.
>>
>> This happens with both URI and File Upload,
>
>I can't reproduce this. Did you perhaps change the encoding  
>declaration in the document to state UTF-8 instead of us-ascii?

The document didn't change; for everything reported higher up, it
always had encoding='UTF-8' in the XML declaration. The only
thing that I changed was that when I drafted the mail on Sunday,
the document was served as text/xml, and I used the Charset
override to make sure it was processed as UTF-8.

I realized that serving documents, most of which are real UTF-8,
as text/xml is a server setup problem, so now the document is
served as text/xml; charset=utf-8. I wouldn't expect the beta
validator to give different results (except for the 'tentatively'
bit) for charset override and charset from MIME type, but I don't
know the code well enough to exclude this possibility.
Also, in any case, the document never contained any non-ASCII
stuff on line 1 (the only thing there is the XML declaration).
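
To illustrate what a correct line number in that message would look
like, here is a small Python sketch (not validator code; the function
name first_undecodable_line is hypothetical). It reports the 1-based
number of the first line that cannot be decoded in a given encoding,
and it assumes an ASCII-compatible encoding such as us-ascii or UTF-8,
where splitting on the newline byte is safe.

```python
from typing import Optional


def first_undecodable_line(data: bytes, encoding: str) -> Optional[int]:
    """Return the 1-based number of the first line that cannot be
    decoded in the given encoding, or None if every line decodes."""
    for number, line in enumerate(data.split(b"\n"), start=1):
        try:
            line.decode(encoding)
        except UnicodeDecodeError:
            return number
    return None
```

For a file whose first line is only the XML declaration and whose
non-ASCII characters start on the second line, checking against
us-ascii would report line 2, and checking against utf-8 would report
no problem at all.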

>> even with utf-8 selected
>> in the options. This is a very serious bug, please fix it.
>
>The charset override was broken in the 0.8.0 beta1. It is now fixed.

This would probably explain things, see above.
Is there a plan to release a beta2?


>> For the beta version with file upload or URI input, the "line 0" error
>> raises its ugly head again.
>
>This has been fixed last week I believe.

Great!

Thanks for all your great work,     Martin.


#-#-#  Martin J. Du"rst, Assoc. Professor, Aoyama Gakuin University
#-#-#  http://www.sw.it.aoyama.ac.jp       mailto:duerst@it.aoyama.ac.jp
Received on Wednesday, 30 May 2007 09:27:49 UTC