Problems validating XML from Martin Duerst on 2007-05-29 (www-validator@w3.org from May 2007)

From: Martin Duerst <duerst@it.aoyama.ac.jp>
Date: Tue, 29 May 2007 17:19:57 +0900
To: www-validator@w3.org
Message-Id: <6.0.0.20.2.20070527142645.05147890@localhost>

Dear Validator Team,

This is in some ways a followup to the thread starting at
http://lists.w3.org/Archives/Public/www-validator/2006Oct/0005.html

Re-checking things, the situation isn't much better.

Here is what I did:

I used the data/file that you can find at
http://www.sw.it.aoyama.ac.jp/2007/PB1/examples/test.xml

With 'direct input' at validator.w3.org, I get
"This page is not Valid (no Doctype found)!". Oh well,
there was no doctype? I guess the validator is blind, or what?
The beta-test version has exactly the same problem.
And if I tell it to use some preset doctype only if the
doctype is missing, it still tells me that the doctype
is missing, so it doesn't look like the "use Doctype"
setting in the Options is any good.

With file upload, I get a "missing charset for text/xml", but
otherwise, the page is suddenly valid.

Next, I tried with a DTD located relative to the xml file.
The xml file is at http://www.sw.it.aoyama.ac.jp/2007/PB1/examples/test-external-dtd.xml, the DTD is at
http://www.sw.it.aoyama.ac.jp/2007/PB1/examples/Letter.dtd,
and I reference it as <!DOCTYPE Letter SYSTEM "Letter.dtd">.
This is an extremely simple (relative) URI Reference, and
the XML spec explicitly references RFC 3986, so it seems
to be a very clear bug. What the validator says is:
Sorry! This document can not be checked.
Fatal Error: cannot find "Letter.dtd"; tried

I could not parse this document, because it makes reference to a
system-specific file instead of using a well-known public identifier
to specify the type of markup being used.

I'm not sure how the validator 'tried', but surely not according
to RFC 3986. The beta version is no better. What it should
say (if indeed said file is missing) is:
http://www.sw.it.aoyama.ac.jp/2007/PB1/examples/Letter.dtd
not found. All the stuff about system-specific files vs.
well-known public identifiers should be cut. There are
potentially millions of DTDs around the world that can
be reached with SYSTEM, whereas the number of "well-known"
public identifiers is one or two dozen, just what the
validator maintainers decide to make available.
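
To make the expected behavior concrete, here is a minimal sketch of the
resolution step in Python, using only the standard library. The function
name resolve_system_id is made up for illustration; this is not the
validator's code, only what RFC 3986 reference resolution gives for the
two URIs above.

```python
from urllib.parse import urljoin


def resolve_system_id(document_uri: str, system_id: str) -> str:
    """Resolve a (possibly relative) SYSTEM identifier against the
    URI of the document that contains the DOCTYPE declaration."""
    return urljoin(document_uri, system_id)


base = "http://www.sw.it.aoyama.ac.jp/2007/PB1/examples/test-external-dtd.xml"
print(resolve_system_id(base, "Letter.dtd"))
# http://www.sw.it.aoyama.ac.jp/2007/PB1/examples/Letter.dtd
```

With file upload or direct input there is no base URI, so in those cases
a relative reference really cannot be resolved; the error message should
then say so, rather than talk about public identifiers.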

When changing the relative URI to an absolute one, things work; please see
http://www.sw.it.aoyama.ac.jp/2007/PB1/examples/test-absolute-external-dtd.xml

Next I tried with a file with some actual non-ASCII characters.
http://www.sw.it.aoyama.ac.jp/2007/PB1/examples/test-UTF-8.xml.
Of course, 'direct input' produces the same error as before.
It works with URI checking, and with file upload, provided
the encoding is set manually in the extended interface.

However, the results on the beta validator are much worse. I get:
Sorry! This document can not be checked.

Sorry, I am unable to validate this document because on line 0 it
contained one or more bytes that I cannot interpret as us-ascii
(in other words, the bytes found are not valid values in the specified
Character Encoding). Please check both the content of the file and the
character encoding indication.

This happens with both URI and File Upload, even with utf-8 selected
in the options. This is a very serious bug, please fix it. With
direct Input (side question: why's direct not capitalized as Direct?
this capitalization would only be correct in German), I'm back to
the validator claiming to be blind and not seeing the doctype.
"line 0" is of course also a mistake; every editor counts lines
from 1, not from 0. Please don't expose implementation details
such as that your programming language uses an index origin of 0
(which is perfectly the right thing to do for a programming language)
to the end user.
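
As a rough illustration of both points (decoding with the selected
encoding, and reporting lines counted from 1), here is a small Python
sketch. The function name first_undecodable_line and the sample document
are hypothetical and are not taken from the validator.

```python
from typing import Optional


def first_undecodable_line(data: bytes, encoding: str) -> Optional[int]:
    """Return the 1-based number of the first line holding a byte that
    cannot be decoded in `encoding`, or None if everything decodes."""
    try:
        data.decode(encoding)
    except UnicodeDecodeError as err:
        # err.start is the 0-based offset of the offending byte; the
        # number of newlines before it, plus 1, is the line an editor shows.
        return data.count(b"\n", 0, err.start) + 1
    return None


sample = (
    '<?xml version="1.0" encoding="UTF-8"?>\n'
    "<Letter>D\u00fcrst</Letter>\n"
).encode("utf-8")

print(first_undecodable_line(sample, "us-ascii"))  # 2
print(first_undecodable_line(sample, "utf-8"))     # None
```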

Next I'm trying with some non-ASCII, with an absolute external DTD:
http://www.sw.it.aoyama.ac.jp/2007/PB1/examples/test-UTF-8-absolute-external-dtd.xml:

With file upload, and setting charset to UTF-8, I get to a
"tentatively valid". Same for checking the URI. But with direct input,
I get
Unknown Parse Mode!

The MIME Media Type () for this document is used to serve both SGML
and XML based documents, and it is not possible to disambiguate it
based on the DOCTYPE Declaration in your document. Parsing will
continue in SGML mode.
Which MIME Media Type? With form submission, there is NO MIME Media
type. And as I said earlier, trying to parse something that
starts with
<?xml version="1.0" encoding='UTF-8'?>
as generic SGML is a threat to interoperability on the Web.
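
For what it's worth, noticing the XML declaration is not hard. The
following Python sketch shows one way to do it for direct input, where
no MIME type is available. The name sniff_parse_mode and the return
values are assumptions made for this example, and the pattern is
deliberately simplified (it ignores the standalone pseudo-attribute and
byte order marks).

```python
import re

# Simplified pattern for an XML declaration at the very start of the input.
XML_DECL = re.compile(
    r"""^<\?xml\s+version\s*=\s*(["'])1\.[0-9]+\1"""
    r"""(?:\s+encoding\s*=\s*(["'])([A-Za-z][A-Za-z0-9._-]*)\2)?"""
)


def sniff_parse_mode(text: str):
    """Return ("XML", declared_encoding_or_None) if the text starts with
    an XML declaration, else ("unknown", None)."""
    match = XML_DECL.match(text)
    if match:
        return "XML", match.group(3)
    return "unknown", None


print(sniff_parse_mode("<?xml version=\"1.0\" encoding='UTF-8'?>\n<Letter/>"))
# ('XML', 'UTF-8')
```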

For the beta version, with direct input, I get the same result.
At least the error message is slightly better; it now reads
Unable to Determine Parse Mode!

Neither the MIME Media Type () nor the document type for this document
are sufficient to reliably choose a parsing mode. Falling back to SGML
mode.

Why is this slightly better? Because saying that an empty mime type isn't
sufficient to decide between SGML and XML is better than saying that
an empty mime type is used to serve both SGML and XML.

For the beta version with file upload or URI input, the "line 0" error
rears its ugly head again.

Regards, Martin.

#-#-# Martin J. Du"rst, Assoc. Professor, Aoyama Gakuin University
#-#-# http://www.sw.it.aoyama.ac.jp mailto:duerst@it.aoyama.ac.jp

Received on Tuesday, 29 May 2007 08:21:16 UTC