Re: Problems validating XML

Hi Martin,

This is a follow-up to your message, after some work on XML
declaration detection and a number of other fixes.

On May 29, 2007, at 17:19 , Martin Duerst wrote:
> http://www.sw.it.aoyama.ac.jp/2007/PB1/examples/test.xml

The parse mode detection algorithm has been entirely reviewed.
The test document above now passes validation through all three input
modes (URI, file upload and direct input).

> Next, I tried with a DTD located relative to the xml file.
> The xml file is at
> http://www.sw.it.aoyama.ac.jp/2007/PB1/examples/test-external-dtd.xml,
> the DTD is at
> http://www.sw.it.aoyama.ac.jp/2007/PB1/examples/Letter.dtd,
> and I reference it as <!DOCTYPE Letter SYSTEM "Letter.dtd">.

This is still not supported, and is still recorded as a bug:
http://www.w3.org/Bugs/Public/show_bug.cgi?id=1521


> Next I tried with a file with some actual non-ASCII characters.
> http://www.sw.it.aoyama.ac.jp/2007/PB1/examples/test-UTF-8.xml.
> Of course, 'direct input' produces the same error as before.
> It works with URI checking, and with file upload, provided
> the encoding is set manually in the extended interface.
>
> However, the results on the beta validator are detrimental.

This now passes validation and, as I mentioned in a previous mail,
charset override was fixed between 0.8.0b1 and b2.

> Next I'm trying with some non-ASCII, with an absolute external DTD:
> http://www.sw.it.aoyama.ac.jp/2007/PB1/examples/test-UTF-8-absolute-external-dtd.xml:
>
> With file upload, and setting charset to UTF-8, I get to a
> "tentatively valid". Same for checking the URI.

I am unsure why it was only "tentatively" valid; probably because of
the charset override. It passes cleanly now, without needing an override.

> But with direct input,
> I get
>     Unknown Parse Mode!
>
>     The MIME Media Type () for this document is used to serve both SGML
>     and XML based documents, and it is not possible to disambiguate it
>     based on the DOCTYPE Declaration in your document. Parsing will
>     continue in SGML mode.
> Which MIME Media Type? With form submission, there is NO MIME Media
> type. And as I said earlier, trying to parse something that
> starts with
>     <?xml version="1.0" encoding='UTF-8'?>
> as generic SGML is a threat to interoperability on the Web.
>
> For the beta version, with direct input, I get the same result.
> At least the error message is slightly better, it now reads
>     Unable to Determine Parse Mode!
>
>     Neither the MIME Media Type () nor the document type for this document
>     are sufficient to reliably choose a parsing mode. Falling back to SGML
>     mode.
>
> Why is this slightly better? Because saying that an empty mime type isn't
> sufficient to decide between SGML and XML is better than saying that
> an empty mime type is used to serve both SGML and XML.

Validation of your test document no longer displays this warning,
since its XML declaration is now detected.
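
To illustrate the idea only: the following is a rough Python sketch, not
the validator's actual code. The function name has_xml_declaration and
the exact pattern are assumptions made for this example. It checks
whether a document begins with an XML declaration, allowing an optional
byte order mark in front of it.

```python
import re

# Hypothetical pattern (an assumption for this sketch): an XML declaration
# at the very start of the text, optionally preceded by a byte order mark.
_XML_DECL = re.compile(
    r'^\ufeff?<\?xml\s+version\s*=\s*(["\'])1\.[0-9]+\1[^?]*\?>'
)


def has_xml_declaration(text: str) -> bool:
    """Return True if text starts with something like <?xml version="1.0"?>."""
    return _XML_DECL.match(text) is not None


if __name__ == "__main__":
    sample = "<?xml version=\"1.0\" encoding='UTF-8'?>\n<Letter/>"
    print(has_xml_declaration(sample))
```

A document for which such a check succeeds can be parsed in XML mode
even when no MIME type is available, as with direct input.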

In case there is no XML declaration, however, and the validator
*does* fall back to SGML mode for lack of an unambiguous parse mode,
the warning text has been improved, taking into account:
* whether a MIME type is known (or not, in the case of direct input)
* whether a document type was found but was not known, or no doctype
  was found.

Here is a first example of the new warning:

[Warning] Unable to Determine Parse Mode!

    It was not possible to reliably choose a parsing mode for this document, because:
  * the MIME Media Type (text/html) can be used for XML or SGML document types
  * the Document Type (http://www.w3.org/Style/HTML40-plus-blink.dtd) is not in the validator's catalog
  * No XML declaration (e.g <?xml version="1.0"?>) could be found at the beginning of the document.

The validator is falling back to SGML mode.

... and a second example:

[Warning] Unable to Determine Parse Mode!

It was not possible to reliably choose a parsing mode for this document, because:

   * in Direct Input mode, no MIME Media Type is served to the validator
   * No known Document Type could be detected
   * No XML declaration (e.g <?xml version="1.0"?>) could be found at the beginning of the document.
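
The way the two warnings above are put together can be sketched roughly
as follows. This is a Python illustration under assumptions, not the
validator's own code: the function name parse_mode_warning_reasons and
its parameters are hypothetical, and only the message wording is taken
from the examples above.

```python
from typing import List, Optional


def parse_mode_warning_reasons(
    mime_type: Optional[str],
    doctype: Optional[str],
) -> List[str]:
    """Build the list of reasons shown when falling back to SGML mode.

    Assumes the caller has already established that the parse mode is
    ambiguous: no XML declaration was found, and the doctype (if any)
    is not in the catalog.
    """
    reasons = []
    if mime_type:
        reasons.append(
            f"the MIME Media Type ({mime_type}) can be used for XML or SGML "
            "document types"
        )
    else:
        # Direct input: there is no HTTP response, hence no MIME type.
        reasons.append(
            "in Direct Input mode, no MIME Media Type is served to the validator"
        )
    if doctype:
        reasons.append(
            f"the Document Type ({doctype}) is not in the validator's catalog"
        )
    else:
        reasons.append("No known Document Type could be detected")
    reasons.append(
        'No XML declaration (e.g <?xml version="1.0"?>) could be found at '
        "the beginning of the document."
    )
    return reasons


if __name__ == "__main__":
    for reason in parse_mode_warning_reasons(None, None):
        print("  * " + reason)
```

Calling it with "text/html" and the HTML40-plus-blink DTD gives the
three reasons of the first example; calling it with neither a MIME type
nor a doctype gives those of the second.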


With the notable exception of relative system identifiers (which I
leave to you, if you are still interested in hacking on that), I
believe this resolves all the issues you raised.

All the fixes can be tested on the beta test instance, which is at the
moment up to date with the latest CVS state:
http://validator-test.w3.org/

Thanks.

-- 
olivier

Received on Thursday, 28 June 2007 11:01:25 UTC