Re: Problems validating XML

Dear Martin,

Thank you for your message. Before getting into the details, now is a
good time to remind everyone that the feedback instructions strongly
suggest doing the following before posting a bug report:
* look at bugzilla
* check the list's archives

Many of the issues you raise are already in bugzilla, or have been  
discussed in the past few days and fixed in the dev version.

On May 29, 2007, at 17:19 , Martin Duerst wrote:

> I used the data/file that you can find at
> http://www.sw.it.aoyama.ac.jp/2007/PB1/examples/test.xml
>
> With 'direct input' at validator.w3.org, I get
> "This page is not Valid (no Doctype found)!".

Your document uses a custom document type that is not in the
validator's catalogue. Without a media type to help (because you are
using the direct input mode), there is no unambiguous way to determine
whether to use the XML or the SGML parsing mode. The errors you get
are, I believe, cascading from the fallback to SGML mode, whereas your
DTD is written for XML.

This is a known and documented issue:
http://www.w3.org/Bugs/Public/show_bug.cgi?id=1391

It has been argued that an XML declaration should be a good enough  
trigger, but others (Hixie among others, I believe) have disagreed,  
as it also happens to be a valid SGML PI.
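
To illustrate the ambiguity, here is a minimal sketch in Python. It is
not the validator's actual code: the function name, the catalogue
contents and the exact order of checks are assumptions made for the
example. It only shows why an empty media type combined with an
unknown doctype ends in the SGML fallback.

```python
from typing import Dict, Optional

# Hypothetical catalogue mapping known public identifiers to a parse mode.
HYPOTHETICAL_CATALOGUE: Dict[str, str] = {
    "-//W3C//DTD XHTML 1.0 Strict//EN": "XML",
    "-//W3C//DTD HTML 4.01//EN": "SGML",
}

XML_MEDIA_TYPES = {"text/xml", "application/xml"}


def choose_parse_mode(media_type: Optional[str],
                      public_id: Optional[str],
                      catalogue: Dict[str, str]) -> str:
    """Pick "XML" or "SGML" from the media type, then the doctype."""
    mt = (media_type or "").strip().lower()
    # An XML media type (as sent with file upload or HTTP) settles it.
    if mt in XML_MEDIA_TYPES or mt.endswith("+xml"):
        return "XML"
    # Otherwise a doctype known to the catalogue can settle it.
    if public_id in catalogue:
        return catalogue[public_id]
    # Neither source is conclusive: fall back to SGML mode.
    return "SGML"
```

With direct input the media type is empty, so a custom doctype gives
`choose_parse_mode("", None, HYPOTHETICAL_CATALOGUE)`, which falls
through to "SGML".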

Generally speaking, the validator isn't the best-suited tool for
checking XML documents with home-made DTDs, particularly with the
Direct Input method. We'd like to improve it in this regard, but
that is not a priority. If you want to submit patches that improve
this without being detrimental to its main job, I believe you're
familiar with the code, and you even have CVS commit access...

> Oh well, there was no doctype? I guess the validator is blind, or  
> what?

That tone is inappropriate. An aggressive or sarcastic tone is not
welcome on this public list (or you'd better be coming with
perfect patches to compensate).

> And if I tell it to use some preset doctype only if the
> doctype is missing, it still tells me that the doctype
> is missing, so it doesn't look like the "use Doctype"
> setting in the Options is any good.

This has been fixed in the dev version, soon to be beta2.
http://qa-dev.w3.org/wmvs/HEAD/

> With file upload, I get a "missing charset for text/xml", but
> otherwise, the page is suddenly valid.

Since file upload provides a media type, the parsing mode issue
doesn't arise.


> Next, I tried with a DTD located relative to the xml file.

We don't resolve relative system identifiers (SIs). Yet.
http://www.w3.org/Bugs/Public/show_bug.cgi?id=1521
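
For reference, resolving a relative system identifier means joining it
with the URI of the document that references it. The sketch below uses
Python's standard `urllib.parse.urljoin`. It is an illustration only,
not validator code, and the function name and the DTD file name
`test.dtd` are made up for the example.

```python
from typing import Optional
from urllib.parse import urljoin


def resolve_system_identifier(document_uri: Optional[str],
                              system_id: str) -> Optional[str]:
    """Return an absolute URI for the DTD, or None if that is impossible."""
    if document_uri is None:
        # Direct input and file upload have no base URI to resolve against.
        # An identifier that is already absolute can still be used as is.
        return system_id if "://" in system_id else None
    # urljoin leaves absolute identifiers untouched and resolves
    # relative ones against the document's own location.
    return urljoin(document_uri, system_id)


base = "http://www.sw.it.aoyama.ac.jp/2007/PB1/examples/test.xml"
print(resolve_system_identifier(base, "test.dtd"))
```

Note that with direct input there is no document URI at all, so a
relative identifier cannot be resolved in that mode.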

> Next I tried with a file with some actual non-ASCII characters.
> http://www.sw.it.aoyama.ac.jp/2007/PB1/examples/test-UTF-8.xml.
[...]
> However, the results on the beta validator are detrimental. I get:
>    Sorry! This document can not be checked.
>
>    Sorry, I am unable to validate this document because on line 0 it
>    contained one or more bytes that I cannot interpret as us-ascii
>    (in other words, the bytes found are not valid values in the specified
>    Character Encoding). Please check both the content of the file and the
>    character encoding indication.
>
> This happens with both URI and File Upload,

I can't reproduce this. Did you perhaps change the encoding  
declaration in the document to state UTF-8 instead of us-ascii?
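
To check on your side whether a file's bytes match the encoding it
declares, something like the following Python sketch can help. It is
not how the validator does it, and the function name is made up for
the example. It simply tries to decode the bytes and reports the line
of the first byte that does not fit.

```python
from typing import Optional


def first_bad_line(data: bytes, encoding: str) -> Optional[int]:
    """Return the 1-based line of the first undecodable byte, or None."""
    try:
        data.decode(encoding)
    except UnicodeDecodeError as err:
        # err.start is the byte offset of the first offending byte;
        # count the newlines before it to get a line number.
        return data.count(b"\n", 0, err.start) + 1
    return None


sample = "<doc>\n<p>\u00e9</p>\n</doc>\n".encode("utf-8")
print(first_bad_line(sample, "us-ascii"))  # non-ASCII byte on line 2
print(first_bad_line(sample, "utf-8"))     # decodes cleanly
```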


> even with utf-8 selected
> in the options. This is a very serious bug, please fix it.

The charset override was broken in the 0.8.0 beta1. It is now fixed.


> Next I'm trying with some non-ASCII, with an absolute external DTD:
> http://www.sw.it.aoyama.ac.jp/2007/PB1/examples/test-UTF-8-absolute-external-dtd.xml:
>
> For the beta version, with direct input, I get the same result.
> At least the error message is slightly better, it now reads
>     Unable to Determine Parse Mode!
>
>     Neither the MIME Media Type () nor the document type for this document
>     are sufficient to reliably choose a parsing mode. Falling back to SGML
>     mode.
>
> Why is this slightly better? Because saying that an empty mime type isn't
> sufficient to decide between SGML and XML is better than saying that
> an empty mime type is used to serve both SGML and XML.

Good.

> For the beta version with file upload or URI input, the "line 0" error
> raises its ugly head again.

This was fixed last week, I believe.

Thank you,
-- 
olivier

Received on Wednesday, 30 May 2007 04:41:55 UTC