
Re: validating XML made more difficult than necessary

From: olivier Thereaux <ot@w3.org>
Date: Thu, 5 Oct 2006 14:54:20 +0900
Message-Id: <F72F8E8B-59E8-4441-B512-BAA9ED66A80F@w3.org>
Cc: www-validator@w3.org
To: Martin Duerst <duerst@it.aoyama.ac.jp>

Hello Martin,

On Oct 3, 2006, at 10:46 , Martin Duerst wrote:
>
> It is great that the validator can now also be used for validating
> arbitrary XML files, but this validation experience is made
> unnecessarily difficult.

Arguably that's beyond the validator's capabilities, because the
validator's parser is still rather clumsy when it comes to XML, but
indeed, the validator can handle any kind of XML document that
references an XML DTD.


> But why are overrides available on validating a URI, such as at
> http://validator.w3.org/check?uri=http%3A%2F%2Fwww.sw.it.aoyama.ac.jp%2F2006%2FPB2%2Fexamples%2Fbook%2Fbook.xml
> (which has exactly the same problem, namely that our server sends out
> the document as text/xml, which I'll fix as soon as I have given you a
> chance to compare things), while no overrides are provided for file upload?
> With current browsers, mime types and charsets sent for uploaded files
> are at least as uncontrollable by the user as they are for servers.
> Adding the overrides should be very easy, please do so.

Unfortunately, it's not that simple. For a URI it's easy for the
validator to offer a form pre-filled with the URI value.
With file upload, there is no way for the validator to display
a form with the file to upload already specified.
I suppose what we could do is provide the user with a new form to
"upload another file" or "upload the file again".


>
>
> The second problem happens when I use direct validation. What I get is
> the following error message:
>     The MIME Media Type () for this document is used to serve both SGML
>     and XML based documents, and it is not possible to disambiguate it
>     based on the DOCTYPE Declaration in your document. Parsing will
>     continue in SGML mode.

The problem has several parts, and it is at the center of our radar:
* The validator gives too much importance to doctype detection,
rather than media type, to decide whether to use XML mode or not.
   This is something that we plan to fix, see e.g.
http://www.w3.org/Bugs/Public/show_bug.cgi?id=1500
http://www.w3.org/Bugs/Public/show_bug.cgi?id=24


* As a result, for document types which are not known in its catalog,
it chooses SGML mode by default. That's a problem:
http://www.w3.org/Bugs/Public/show_bug.cgi?id=22

* If we change the validator's behavior to give more importance to
media type, we need to do something with the direct input, which is
not HTTP or media-type aware
   (e.g. add a drop-down to choose among possible media types: SVG,
XML, XHTML, HTML, etc.).
   The Unicorn tool does that, to some extent.
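
A minimal sketch of the media-type-first decision described in the
points above, written in Python for illustration only. It is not the
validator's actual code: the function name choose_parse_mode, the
user_choice parameter (standing in for the proposed drop-down) and the
two small type sets are all assumptions made for this example. The one
outside fact it relies on is the "+xml" suffix convention for XML
media types.

```python
from typing import Optional

# Hypothetical, deliberately small tables; a real tool would need more entries.
XML_MEDIA_TYPES = {"text/xml", "application/xml"}
SGML_MEDIA_TYPES = {"text/html"}


def choose_parse_mode(media_type: Optional[str],
                      user_choice: Optional[str] = None) -> str:
    """Return "xml" or "sgml" for a document.

    Order of precedence (an assumption of this sketch):
    1. an explicit choice from the user (e.g. a drop-down on direct input),
    2. the media type, when one is available,
    3. SGML as the fallback, matching the current default.
    """
    if user_choice in ("xml", "sgml"):
        return user_choice
    if media_type:
        # Drop parameters such as "; charset=utf-8" before comparing.
        base = media_type.split(";", 1)[0].strip().lower()
        if base in XML_MEDIA_TYPES or base.endswith("+xml"):
            return "xml"
        if base in SGML_MEDIA_TYPES:
            return "sgml"
    # Direct input and uploads may carry no usable media type at all.
    return "sgml"
```

With direct input there is no media type, so the result depends
entirely on user_choice and the fallback, which is why the default
matters for tools that rely on that interface.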

> - Don't talk about mime types (there was none in the ethereal trace;
>   multipart/form-data doesn't use them for individual form fields),
>   explain the problem in a way the user can understand and address.

I prefer to make the problem go away by giving the user some way to
specify the "media type" (won't be using the word mime, I
think) in the direct input interface. I'm aware this may break some
tools relying on direct input, however. There needs to be a good
default.

> - A document starting with "<?xml" can easily be guessed to be XML
>   rather than SGML.

I guess, although I'm sure the usual suspects on this list will  
happily prove you wrong with some fun corner case.
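
For illustration, the "<?xml" guess could look like the Python sketch
below. It is a heuristic under stated assumptions, not what the
validator does: looks_like_xml is a hypothetical name, the input is
assumed to be raw bytes in an ASCII-compatible encoding (UTF-16
documents are not handled), and only an optional UTF-8 byte order mark
is skipped.

```python
import re

# "<?xml" must be followed by whitespace to be an XML declaration;
# this keeps "<?xml-stylesheet ...?>" from being mistaken for one.
_XML_DECL = re.compile(rb"<\?xml[ \t\r\n]")

_UTF8_BOM = b"\xef\xbb\xbf"


def looks_like_xml(data: bytes) -> bool:
    """Guess whether a document is XML from its first bytes.

    Returns True only when the document starts with an XML declaration,
    optionally preceded by a UTF-8 byte order mark. XML documents that
    omit the declaration are reported as False, so a False result means
    "unknown", not "SGML".
    """
    if data.startswith(_UTF8_BOM):
        data = data[len(_UTF8_BOM):]
    return _XML_DECL.match(data) is not None
```

Because a missing declaration proves nothing, a guess like this would
only be one input next to the media type and an explicit user choice.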

> - In this day and age of XML, making SGML the default seems terribly
>   outdated, even more so because XML is W3C's own technology.

Yes, and no. The current web, and thus the market for this validator,  
is still mostly using HTML <= 4.01, as far as I can tell. The current  
web hardly uses anything but text/html, and the HTML working group so  
far has been saying "don't treat it as XML". So unfortunately making  
XML the default does not seem to be in sync with the state of what  
we're dealing with.

> - As you know you may not be able to know whether it's XML or SGML,
>   provide a switch for the user to tell you.

Agreed.

-- 
olivier
Received on Thursday, 5 October 2006 05:54:32 GMT
