validator's parse mode algorithm (Was: XHTML Family Document Types and the validator) from olivier Thereaux on 2007-04-30 (www-validator@w3.org from April 2007)

From: olivier Thereaux <ot@w3.org>
Date: Mon, 30 Apr 2007 11:53:03 -0400
To: Karl Dubost <karl@w3.org>
Cc: Shane McCarron <shane@aptest.com>, www-validator@w3.org
Message-Id: <70D64B13-655F-44FC-8579-1CDCDEE6B20F@w3.org>

On Apr 29, 2007, at 19:05 , Karl Dubost wrote:
> Or if olivier gives me the steps that the validator follows now, I  
> could sketch up a diagram and we may have a better picture of how  
> it could work and if it should be modified or not.

1) First, the validator takes the Internet media type (MIME type,
Content-Type), given either by the server or, in upload mode, by the
browser, and compares it to its table. (In direct input mode, this
step is skipped.) In the validator's config, look for <MIME> in
http://dev.w3.org/cvsweb/~checkout~/validator/htdocs/config/validator.conf

This gives us the parse mode based on content type, which is either
"XML" (for XML media types) or "TBD" (for text/html, because HTML and
XHTML, which use two different parsing modes, can both be served with
this media type).
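
To make step 1 concrete, here is a rough sketch in Python (the
validator itself is not written in Python). The table entries and the
function name parse_mode_from_mime are assumptions made for
illustration; they are not copied from validator.conf.

```python
# Hypothetical stand-in for the <MIME> table in validator.conf.
# The entries are illustrative, not the real configuration.
MIME_PARSE_MODES = {
    "text/html": "TBD",              # could be HTML or XHTML
    "application/xhtml+xml": "XML",
    "application/xml": "XML",
    "text/xml": "XML",
}


def parse_mode_from_mime(content_type):
    """Return 'XML', 'TBD', or None if the type is missing or unknown."""
    if not content_type:
        # Direct input mode: there is no media type, so step 1 is skipped.
        return None
    # Drop parameters such as '; charset=utf-8' and normalise case.
    media_type = content_type.split(";", 1)[0].strip().lower()
    return MIME_PARSE_MODES.get(media_type)
```

Returning None for "unknown" keeps that case distinct from "TBD",
which means "known, but ambiguous".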

2) Then the validator pre-parses the document to fetch its doctype
(if any) and compares it to a second table (see
http://dev.w3.org/cvsweb/~checkout~/validator/htdocs/config/types.conf )
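
A minimal sketch of what such a pre-parse could look like, again in
Python. The regular expression, the two table entries and the function
names are assumptions for illustration; the real pre-parser and
types.conf are more thorough than this.

```python
import re

# Simplified pattern: finds a DOCTYPE declaration with a PUBLIC
# identifier. Matching is case-insensitive for leniency; declarations
# that only have a SYSTEM identifier are not handled here.
DOCTYPE_RE = re.compile(
    r"""<!DOCTYPE\s+[^\s>]+\s+PUBLIC\s+(["'])(.*?)\1""",
    re.IGNORECASE | re.DOTALL,
)

# Hypothetical stand-in for types.conf, with two illustrative entries.
DOCTYPE_PARSE_MODES = {
    "-//W3C//DTD HTML 4.01//EN": "SGML",
    "-//W3C//DTD XHTML 1.0 Strict//EN": "XML",
}


def public_identifier(document):
    """Return the public identifier of the doctype, or None."""
    match = DOCTYPE_RE.search(document)
    return match.group(2) if match else None


def parse_mode_from_doctype(document):
    """Return the parse mode for a known doctype, else None."""
    return DOCTYPE_PARSE_MODES.get(public_identifier(document))
```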

3) Now we have:
  - nothing if the document was sent by direct input and has an  
unknown doctype, or no doctype
  - one or two determined parse modes if either the MIME type or the
doctype is known to us.

4) We finally follow the algorithm:
  - if the parse mode defined with the mime type is unambiguous, we  
use that
  - else, if the parse mode defined with the doctype is known, we use
that
  - else, we fall back to SGML/HTML mode
(plus some warnings if the determined parse modes clash)
(see routine set_parse_mode() in
http://dev.w3.org/cvsweb/~checkout~/validator/httpd/cgi-bin/check )
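
The decision in step 4 can be sketched as follows. This is not the
actual set_parse_mode() routine; the name choose_parse_mode, the mode
label "SGML" for the fallback and the warning text are assumptions.

```python
def choose_parse_mode(mime_mode, doctype_mode):
    """Pick a parse mode from the two hints, plus any clash warnings.

    mime_mode is 'XML', 'TBD' or None; doctype_mode is a mode name
    or None when the doctype is missing or unknown.
    """
    mime_is_unambiguous = mime_mode is not None and mime_mode != "TBD"

    if mime_is_unambiguous:
        mode = mime_mode            # the media type wins
    elif doctype_mode is not None:
        mode = doctype_mode         # otherwise trust the doctype
    else:
        mode = "SGML"               # fallback: SGML/HTML mode

    warnings = []
    if mime_is_unambiguous and doctype_mode not in (None, mime_mode):
        warnings.append(
            "media type implies %s but doctype implies %s"
            % (mime_mode, doctype_mode)
        )
    return mode, warnings
```

For example, choose_parse_mode("TBD", "XML") picks the doctype's mode
with no warning, while choose_parse_mode("XML", "SGML") keeps "XML" and
reports the clash.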

**

If I understand it correctly, Shane's suggestion would be to add to  
step 2) the following:

if the document type is not in our table of known document types, but
the public identifier matches ^-//W3C//DTD XHTML, then the parse mode
determined by the document type should be XML, because we then know
that the document type is in the XHTML family, even if we don't know
everything about it.
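
As a sketch of how that addition to step 2 might look (Python, with a
hypothetical function name and a caller-supplied table standing in for
types.conf):

```python
import re

# The prefix pattern from the suggestion above.
XHTML_FAMILY_RE = re.compile(r"^-//W3C//DTD XHTML")


def doctype_mode_with_family_fallback(public_id, known_types):
    """Look up the doctype's parse mode, treating unknown XHTML-family
    public identifiers as XML.

    known_types is a dict mapping public identifiers to parse modes;
    it is a stand-in for the validator's table of known doctypes.
    """
    if public_id is None:
        return None                     # no doctype at all
    if public_id in known_types:
        return known_types[public_id]   # the table always wins
    if XHTML_FAMILY_RE.match(public_id):
        return "XML"                    # unknown, but XHTML family
    return None
```

The table lookup comes first, so the prefix match only affects doctypes
the validator does not already know.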

This sounds reasonable to me. Any objection?

-- 
olivier

Received on Monday, 30 April 2007 15:53:03 UTC