validating XML made more difficult than necessary from Martin Duerst on 2006-10-03 (www-validator@w3.org from October 2006)

From: Martin Duerst <duerst@it.aoyama.ac.jp>
Date: Tue, 03 Oct 2006 10:46:59 +0900
To: www-validator@w3.org
Message-Id: <6.0.0.20.2.20061003100411.060f9e50@localhost>

Dear Validator Team,

These are some recent experiences with the WWW Markup validator,
and some suggestions on how to improve it.

It is great that the validator can now also be used for validating
arbitrary XML files, but the validation experience is made unnecessarily
difficult.

The file I'm trying to validate is at
http://www.sw.it.aoyama.ac.jp/2006/PB2/examples/book/book.xml,
but I'm mostly talking about validating this same document
from a file on my computer.

First, with file upload, I get a very short indication of what's
wrong, and no chance to fix (read: override) it. The error message
is as follows:
    Sorry, I am unable to validate this document because on line 10-14, 17,
    19, 23-33, 35-41, 44-51, 54-57 it contained one or more bytes that I
    cannot interpret as us-ascii (in other words, the bytes found are not
    valid values in the specified Character Encoding). Please check both
    the content of the file and the character encoding indication.
Tracing this with Ethereal, it is clear that this behavior is essentially
correct, because Opera uploads this file with a MIME type of text/xml
and no charset. But why are overrides available when validating a URI, such as at
http://validator.w3.org/check?uri=http%3A%2F%2Fwww.sw.it.aoyama.ac.jp%2F2006%2FPB2%2Fexamples%2Fbook%2Fbook.xml
(which has exactly the same problem, namely that our server sends out
the document as text/xml, which I'll fix as soon as I have given you a chance
to compare things), while no overrides are provided for file upload?
With current browsers, the MIME types and charsets sent for uploaded files
are at least as far outside the user's control as those sent by servers.
Adding the overrides should be very easy; please do so.
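
To illustrate the point, here is a minimal Python sketch (not the
validator's actual code) of how such an override could work for an
uploaded body. The function names and the `override` parameter are
hypothetical; the only assumption taken from the error message above is
that text/xml without a charset is read as us-ascii.

```python
from typing import List, Optional


def non_ascii_lines(data: bytes) -> List[int]:
    """Return the 1-based numbers of lines containing bytes outside US-ASCII."""
    return [
        number
        for number, line in enumerate(data.splitlines(), start=1)
        if any(byte > 0x7F for byte in line)
    ]


def decode_upload(data: bytes,
                  declared: Optional[str] = None,
                  override: Optional[str] = None) -> str:
    """Decode an uploaded text/xml body.

    `override` is a hypothetical user-supplied setting and wins over
    `declared` (the charset parameter sent by the browser, if any).
    With neither, fall back to us-ascii, as the validator does above.
    """
    charset = override or declared or "us-ascii"
    return data.decode(charset)  # raises UnicodeDecodeError on bad bytes
```

With a UTF-8 file uploaded as bare text/xml, `decode_upload(data)` fails
in the same way as the message quoted above, `non_ascii_lines(data)`
gives the line numbers to report, and
`decode_upload(data, override="utf-8")` succeeds.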

The second problem happens when I use direct validation. What I get is
the following error message:
    The MIME Media Type () for this document is used to serve both SGML
    and XML based documents, and it is not possible to disambiguate it
    based on the DOCTYPE Declaration in your document. Parsing will
    continue in SGML mode.

    This page is not Valid
    http://www.sw.it.aoyama.ac.jp/2006/PB2/examples/book/book.dtd!

    Below are the results of attempting to parse this document with
    an SGML parser.

[followed by no such results at all]

I get the same results from the extended interface.

There are a number of problems with this behavior, all of which can
be fixed easily, and except for the first and the last one, any single
fix would fix the basic problem:

- Don't talk about MIME types (there was none in the Ethereal trace;
  multipart/form-data doesn't use them for individual form fields).
  Explain the problem in a way the user can understand and address.
- A document starting with "<?xml" can easily be guessed to be XML
  rather than SGML.
- In this day and age of XML, making SGML the default seems terribly
  outdated, even more so because XML is W3C's own technology.
- Since you know that you may not be able to tell whether it's XML or
  SGML, provide a switch for the user to tell you.
- If you validate as SGML, please make sure you actually do so and
  produce an actual error message, even if it's just something like
    "<?xml": Document can't start with PI before DOCTYPE
  or some such (not sure that's the right error message, though).
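
The second and fourth suggestions above can be sketched in a few lines
of Python. This is only an illustration under stated assumptions, not
the validator's implementation: `guess_parse_mode` and its `user_choice`
parameter are hypothetical names, only a UTF-8 byte order mark is
handled (UTF-16 input would need more work), and SGML is kept as the
fallback simply because that is the current behavior.

```python
import codecs
from typing import Optional


def guess_parse_mode(data: bytes, user_choice: Optional[str] = None) -> str:
    """Return "xml" or "sgml" for a directly submitted document.

    An explicit user choice (the hypothetical switch) always wins.
    Otherwise the document counts as XML if it starts with "<?xml",
    optionally preceded by a UTF-8 byte order mark.
    """
    if user_choice in ("xml", "sgml"):
        return user_choice
    if data.startswith(codecs.BOM_UTF8):
        data = data[len(codecs.BOM_UTF8):]
    return "xml" if data.startswith(b"<?xml") else "sgml"
```

A document like book.xml, which begins with an XML declaration, would be
parsed as XML without the user having to do anything, and anything the
guess gets wrong can still be corrected through the switch.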

Many thanks in advance for your help.

Regards, Martin.

#-#-# Martin J. Du"rst, Assoc. Professor, Aoyama Gakuin University
#-#-# http://www.sw.it.aoyama.ac.jp mailto:duerst@it.aoyama.ac.jp

Received on Tuesday, 3 October 2006 01:47:34 UTC