Re: HTML 4 Profile for RDFa from Philip Taylor on 2009-05-23 (public-html@w3.org from May 2009)

From: Philip Taylor <pjt47@cam.ac.uk>
Date: Sat, 23 May 2009 17:49:19 +0100
To: Julian Reschke <julian.reschke@gmx.de>
CC: Sam Ruby <rubys@intertwingly.net>, Shane McCarron <shane@aptest.com>, RDFa Community <public-rdfa@w3.org>, "public-rdf-in-xhtml-tf.w3.org" <public-rdf-in-xhtml-tf@w3.org>, HTML WG <public-html@w3.org>
Message-ID: <4A18290F.3010402@cam.ac.uk>

Minor correction: I wrote:

> In a HTML5 text/html serialisation with no scripting, you can only get 
> the attribute "xml:lang" in no namespace.

which I think is wrong because of foreign content: you can write <div 
xml:lang=a><svg xml:lang=b></svg></div>, which will result in one 
attribute called "xml:lang" in no namespace on the div, and one called 
"lang" in the XML namespace on the svg. (But you can't get both on the 
same element, unless I'm wrong again.)


Julian Reschke wrote:
>>> Is it still underspecified once we require a valid HTML5 document as 
>>> input?
>>
>> Probably not. But I wouldn't consider it acceptable to require a valid 
>> document as input - people make mistakes all the time, and I want them 
>> to get consistent (and hopefully predictable) RDF triples out of it 
>> regardless of what implementation they use, so the specification has 
>> to deal precisely with invalid input. See 
>> http://lists.w3.org/Archives/Public/public-rdf-in-xhtml-tf/2009May/0156.html 
>> for an example of someone with precisely this kind of error.
> 
> Understood; I just wanted to understand the scope of the problem.

Okay, sure. My original comment was in the context of there not 
necessarily being a contiguous sequence of characters that corresponds 
to a parsed element, and I think that's closely related to (perhaps the
same as?) the concept of streamability (basically the ability to output
SAX events without buffering elements). The current streamability 
violations come from:

* Content inside <table>/<tr> instead of inside <td>
* Misnested <i>...<p>...</i>...</p>, <i>...<b>...</i>...</b>, etc
* Head content (link, meta, etc) between </head> and <body>
* Multiple <html> or <body> elements (their attributes get merged)
* Content after </body>
* (Can't think of any others)

Those are all parse errors, and a conforming parser is allowed to abort 
when it sees a parse error, though many of them are quite common in the 
wild.

In any other case, it seems like it ought to be theoretically possible 
to find a substring of the document that corresponds to the content of 
an element, though I may be missing some subtleties. (But current parser 
implementations don't do that, and I don't think they would willingly do 
so - they throw away the input stream and all they can do is 
re-serialise the parsed output.)

>> By "DOM" I generally mean any kind of tree structure of elements and 
>> attributes, either as an explicit data structure (DOM, XOM, 
>> ElementTree) or implicit (SAX). Would any RDFa implementation *not* 
>> parse the input HTML into that kind of structure and operate over the 
>> elements and attributes as distinct objects? (e.g. would they just use 
>> regular expressions over the input byte stream? That seems quite 
>> infeasible to me...)
> 
> Depends on the definition of "tree structure". I've been involved in 
> code that just uses a tokenizer and specialized stack, and 
> implementations like these will not do the re-arranging of elements the 
> HTML5 spec specifies for some kinds of broken input.

If they abort when there are streamability violations, that's fine (and 
is what the Validator.nu parser's unbuffered SAX output does) - the 
stream of start/end element events will always be well-nested and will 
encode a tree structure, and it would be possible to specify DOM-based 
algorithms that could be easily mapped onto that non-DOM implementation.
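
A minimal sketch of that abort-on-violation behaviour, assuming Python's
standard html.parser as the tokenizer. It is not a conforming HTML5 parser,
and this sketch is stricter than HTML5 because it ignores implied end tags;
StrictStream, StreamabilityError and stream_events are hypothetical names
used only for illustration. It emits start/end events as tags arrive and
stops as soon as an end tag does not match the innermost open element:

```python
from html.parser import HTMLParser

# Elements that never have an end tag in HTML.
VOID = {"area", "base", "br", "col", "embed", "hr", "img",
        "input", "link", "meta", "source", "track", "wbr"}


class StreamabilityError(Exception):
    """Raised when the events could no longer be emitted well-nested."""


class StrictStream(HTMLParser):
    def __init__(self):
        super().__init__()
        self.stack = []   # currently open elements, innermost last
        self.events = []  # SAX-like (kind, tag) events, in document order

    def handle_starttag(self, tag, attrs):
        self.events.append(("start", tag))
        if tag in VOID:
            self.events.append(("end", tag))
        else:
            self.stack.append(tag)

    def handle_endtag(self, tag):
        if tag in VOID:
            return  # already closed when it was opened
        if not self.stack or self.stack[-1] != tag:
            # Simplification: HTML5 would repair some of these cases,
            # which requires buffering; this sketch just aborts.
            raise StreamabilityError("unexpected </%s>" % tag)
        self.stack.pop()
        self.events.append(("end", tag))


def stream_events(markup):
    parser = StrictStream()
    parser.feed(markup)
    parser.close()
    return parser.events
```

With this, stream_events("<i>a<p>b</i>c</p>") raises StreamabilityError at
the </i>, while well-nested input yields a well-nested event list that any
tree-based algorithm could consume.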

If they don't abort and instead do some different kind of error 
handling, then they're not a conforming HTML5 parser, and in that case 
we've already failed at the goal of getting consistent behaviour.

>> [...]
> 
> That's impossible, at least for now as RDFa-in-XHTML relies on 
> XML-NS-wellformedness (so XMLNS:* would be recognized as namespace 
> declaration, right?).

Hmm, maybe a better example of what I intended is:

   <div xmlns:t="test1:">
     <div xmlns:T="test2:">
       <span property="t:x T:y">Test</span>
     </div>
   </div>

which is well-formed XML and has a clear definition in RDFa-in-XHTML, 
but the defined behaviour is impossible to reproduce in text/html 
(because xmlns:t and xmlns:T (and XMLNS:T) are parsed identically by an 
HTML parser and there's no way to distinguish them afterwards).
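
The difference can be seen with two parsers from the Python standard
library. This is only a sketch: html.parser is not an HTML5 parser, but it
lowercases attribute names in the same way, and the helper names
(xml_prefixes, html_prefix_attrs, XmlnsCollector) are made up for the
example.

```python
import io
import xml.etree.ElementTree as ET
from html.parser import HTMLParser

DOC = ('<div xmlns:t="test1:"><div xmlns:T="test2:">'
       '<span property="t:x T:y">Test</span></div></div>')


def xml_prefixes(doc):
    """Namespace declarations as an XML parser reports them (case kept)."""
    source = io.BytesIO(doc.encode("utf-8"))
    return [ns for _, ns in ET.iterparse(source, events=("start-ns",))]


class XmlnsCollector(HTMLParser):
    def __init__(self):
        super().__init__()
        self.seen = []

    def handle_starttag(self, tag, attrs):
        # html.parser hands over attribute names already lowercased.
        self.seen += [(n, v) for n, v in attrs if n.startswith("xmlns:")]


def html_prefix_attrs(doc):
    """xmlns:* attributes as an HTML tokenizer reports them."""
    parser = XmlnsCollector()
    parser.feed(doc)
    parser.close()
    return parser.seen


print(xml_prefixes(DOC))       # [('t', 'test1:'), ('T', 'test2:')]
print(html_prefix_attrs(DOC))  # [('xmlns:t', 'test1:'), ('xmlns:t', 'test2:')]
```

After HTML parsing, both declarations carry the same attribute name, so the
inner one is indistinguishable from a redeclaration of "t".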

RDFa-in-text/html could:

* Assume attributes are all treated as lowercase (breaking <div 
xmlns:T="..." property="T:..."> which works in XHTML);

* Say CURIEs (in both XHTML and HTML) match prefixes case-insensitively 
(breaking compatibility with current implementations);

* Change text/html parsing to preserve attribute case (breaking 
compatibility with current parsers);

* Use some other prefix-binding mechanism (in both XHTML and HTML) like
prefix="t=... T=..." instead of xmlns:t="..." (breaking current 
implementations and deployed content, but avoiding the mess of parsing 
differences between XHTML and HTML).

I can't think of any other solutions, so something is going to break no 
matter what is chosen.
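
To make the trade-offs concrete, here is a rough sketch of CURIE expansion
under those options. The "t=... T=..." syntax is just the hypothetical one
suggested above, and parse_prefix_attr, expand_curie and the fold_case flag
are illustrative names, not anything from a spec or an implementation.

```python
def parse_prefix_attr(value):
    """Parse the hypothetical prefix="t=test1: T=test2:" syntax."""
    mapping = {}
    for item in value.split():
        prefix, sep, uri = item.partition("=")
        if sep and prefix:
            mapping[prefix] = uri
    return mapping


def expand_curie(curie, mapping, fold_case=False):
    """Expand prefix:reference against the in-scope prefix mapping."""
    prefix, sep, reference = curie.partition(":")
    if not sep:
        return None
    if fold_case:
        # Case-insensitive matching; if two prefixes differ only by case,
        # whichever was declared last silently wins.
        prefix = prefix.lower()
        mapping = {p.lower(): u for p, u in mapping.items()}
    uri = mapping.get(prefix)
    return None if uri is None else uri + reference


# Attribute values keep their case in text/html, so a prefix attribute
# would survive parsing intact:
in_scope = parse_prefix_attr("t=test1: T=test2:")
print(expand_curie("t:x", in_scope))  # test1:x
print(expand_curie("T:y", in_scope))  # test2:y

# What an HTML parser leaves for the xmlns example: both declarations
# became xmlns:t, and the inner one is the one in scope on the span.
html_scope = {"t": "test2:"}
print(expand_curie("T:y", html_scope))                  # None
print(expand_curie("T:y", html_scope, fold_case=True))  # test2:y
print(expand_curie("t:x", html_scope))                  # test2:x
```

The last line shows the mismatch: XHTML processing would give test1:x for
the same markup.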

-- 
Philip Taylor
pjt47@cam.ac.uk
Received on Saturday, 23 May 2009 16:50:05 UTC