Re: Invalid XHTML Re: Another test suggesting change in the spec

Here is my code; the intent is to be liberal.
I've added comments for this message.

This code processes the root element of the document.

// check mimetype: text/html or application/xhtml+xml
boolean html = isHtmlMimetype();

// process according to section 2 of spec
if (grddlNamespace) {
    checkRootAttrs(attr);
}

if (uri != null && !uri.equals("")) {
    // root element has a namespace URI
    checkSchema(input.resolve(uri));

    // Is HTML doc if it has an HTML mimetype or an HTML namespace;
    // this allows the XHTML2 namespace too.
    html = html || isHtmlNS(uri);
} else if (localName.equalsIgnoreCase("html")) {
    // no namespace URI, but root element is <html> or a case variant,
    // so it's html
    html = true;
}

// we've done enough processing except in the HTML case
if (!html) {
    throw new SeenEnoughExpectedException();
}

// hmmm, if our root element is not html, then
// the document is a mess; better tidy it up.
if (!localName.equalsIgnoreCase("html")) {
    needTidy(); // doesn't usually return
}


I wouldn't expect the spec to call out all of these possibilities.
I suspect I should have a strict mode that switches at least some of it off.

So my code treats the document as HTML if any one of these holds:
a) the mimetype matches (text/html or application/xhtml+xml)
b) the namespace URI matches
c) the root element name matches, if unqualified

It will also tidy up any mess it finds.

This differs from your suggestion, which was (a) or ((b) and (c));
I've implemented (a) or (b) or (c). I've also implemented (a) as two
mimetypes and (b) as two namespaces.
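
As a rough illustration of how the two rules differ, the predicates could be sketched in Java as below. This is not the implementation described above: the class and method names are hypothetical, the XHTML2 namespace string is an assumption, and reading (c) in the suggested rule as a plain local-name check is also an assumption.

```java
import java.util.Locale;
import java.util.Set;

// Hypothetical helper, for illustration only.
final class HtmlDetection {
    // (a) the two mimetypes named in this message
    static final Set<String> HTML_MIME_TYPES =
        Set.of("text/html", "application/xhtml+xml");

    // (b) two namespaces; the XHTML 1 value is quoted in the thread,
    // the XHTML2 value is an assumption
    static final Set<String> HTML_NAMESPACES =
        Set.of("http://www.w3.org/1999/xhtml",
               "http://www.w3.org/2002/06/xhtml2/");

    static boolean mimeTypeMatches(String mimeType) {
        return mimeType != null
            && HTML_MIME_TYPES.contains(mimeType.toLowerCase(Locale.ROOT));
    }

    static boolean namespaceMatches(String uri) {
        return uri != null && HTML_NAMESPACES.contains(uri);
    }

    static boolean isHtmlLocalName(String localName) {
        return localName != null && localName.equalsIgnoreCase("html");
    }

    // Liberal rule: (a) or (b) or (c), where (c) only applies
    // when the root element has no namespace URI.
    static boolean isHtmlLiberal(String mimeType, String uri, String localName) {
        boolean unqualified = uri == null || uri.isEmpty();
        return mimeTypeMatches(mimeType)
            || namespaceMatches(uri)
            || (unqualified && isHtmlLocalName(localName));
    }

    // Suggested rule: (a) or ((b) and (c)).
    static boolean isHtmlSuggested(String mimeType, String uri, String localName) {
        return mimeTypeMatches(mimeType)
            || (namespaceMatches(uri) && isHtmlLocalName(localName));
    }
}
```

Under this sketch the rules disagree on, for example, an application/xml document whose unqualified root element is HTML: the liberal rule accepts it through (c), while the suggested rule rejects it because (b) fails.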

Jeremy


Dan Connolly wrote:
> On Mon, 2007-04-23 at 14:08 -0400, Chimezie Ogbuji wrote:
> [...]
>> On Mon, 2007-04-23 at 12:45 -0500, Dan Connolly wrote:
> [...]
>>> What _do_ the implementations check or depend on?
>>> MIME type, XML-wf-ness, and root element namespace?
>> GRDDL.py (in its current form) only checks for XML-wf-ness and
>> successful evaluation of the (unambiguous) XPaths outlined in the
>> specification.
> 
> There are no XPaths in the relevant section.
> 
> Oh... wait... yes there are... though only in the informative
> mechanical rules...
> 
> I think those rules match, for example, XHTML inside Atom;
> even inside an Atom document that says "The following
> XHTML is false/fictional/counter-factual..."
> 
>>> If so, I'd specify something like this...
>>>
>>>   If an information resource has a text/html representation
>>>   whose body is an XML document whose root element
>>>   bears the local name 'html' and the
>>>   namespace name 'http://www.w3.org/1999/xhtml', then ...
>>>
>> +1 On this 
> 
> That quick sketch excludes the following mime types:
>   text/xml
>   application/xml
>   application/xhtml+xml
> 
> I think that's not a good way to specify it... but I do think
> the media type has to specify XML... i.e. text/plain is no good.
> 
>> However, my original question remains: does our dependency on XHTML
>> clash with the faithful infoset 'stance'?
> 
> No. (i.e. not as far as I can tell.)
> 

-- 
Hewlett-Packard Limited
Registered Office: Cain Road, Bracknell, Berks RG12 1HN
Registered No: 690597 England

Received on Monday, 23 April 2007 19:04:11 UTC