Re: HTML 4 Profile for RDFa from Julian Reschke on 2009-05-23 (public-rdfa@w3.org from May 2009)

From: Julian Reschke <julian.reschke@gmx.de>
Date: Sat, 23 May 2009 13:17:45 +0200
To: Philip Taylor <pjt47@cam.ac.uk>
CC: Sam Ruby <rubys@intertwingly.net>, Shane McCarron <shane@aptest.com>, RDFa Community <public-rdfa@w3.org>, "public-rdf-in-xhtml-tf.w3.org" <public-rdf-in-xhtml-tf@w3.org>, HTML WG <public-html@w3.org>
Message-ID: <4A17DB59.5030509@gmx.de>

Philip Taylor wrote:
> ...
> Indeed, it would be good to have this defined with the level of precision 
> that HTML 5 has, so we can be sure implementations will be able to agree 
> on how to extract RDFa from text/html content.
> 
> A few significant issues that I see in the current version:
> 
> What is "the @xml:lang attribute"? Is it the attribute with local name 

It's unambiguous as long as we talk about a stream of characters, right?

> "xml:lang" in no namespace (as would be produced by an HTML 5 parser 
> (and by current HTML browser parser implementations))? or the attribute 
> with local name "lang" in the namespace 
> "http://www.w3.org/XML/1998/namespace" (as would be produced by an XML 
> parser, and could be inserted in an HTML document via DOM APIs)? or both 
> (in which case both could be specified on one element, in addition to 
> "lang" in no namespace)?

Both can only be specified in the DOM, but not in a serialization (or am 
I missing something?).

That being said, it wouldn't hurt to have a section that defines special 
aspects of processing RDFa from a DOM instead of an HTML document (as a 
series of bytes/characters).
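
To make the two readings of "@xml:lang" concrete, here is a minimal sketch 
that feeds the same character stream to two parsers. It assumes Python's 
standard-library html.parser as a rough stand-in for an HTML tokenizer (it 
is not an HTML 5 parser) and xml.etree.ElementTree as the XML parser; the 
names MARKUP and AttrNames are made up for illustration.

```python
from html.parser import HTMLParser
import xml.etree.ElementTree as ET

# One serialization, read by two different parsers.
MARKUP = '<p xml:lang="en" lang="en">Hello</p>'


class AttrNames(HTMLParser):
    """Records attribute names exactly as the HTML tokenizer reports them."""

    def __init__(self):
        super().__init__()
        self.names = []

    def handle_starttag(self, tag, attrs):
        self.names.extend(name for name, _ in attrs)


html_view = AttrNames()
html_view.feed(MARKUP)
html_view.close()

# HTML side: "xml:lang" is a plain attribute name with a colon in it.
html_names = html_view.names

# XML side: the same characters become "lang" in the XML namespace.
xml_names = sorted(ET.fromstring(MARKUP).attrib)

print(html_names)
print(xml_names)
```

The serialization is the same in both cases; only the parsed 
representation differs, which is where the question of "which attribute" 
comes from.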

> "If the object of a triple would be an XMLLiteral, and the input to the 
> processor is not well-formed [XML]" - I don't understand what that means 
> in an HTML context. Is it meant to mean something like "the bytes in the 
> HTML file that correspond to the contents of the relevant element could 
> be parsed as well-formed XML (modulo various namespace declaration 
> issues)"? If so, that seems impossible to implement. The input to the 
> RDFa processor will most likely be a DOM, possibly manipulated by the 
> DOM APIs rather than coming straight from an HTML parser, so it may 
> never have had a byte representation at all.
> 
> Even without scripting, there isn't always a contiguous sequence of 
> bytes corresponding to the content of an element. E.g. if the HTML input 
> is:
>   <table>
>     <tr some-attributes-to-say-this-element-outputs-an-XMLLiteral>
>       <td> This text goes inside the table </td>
>       This text gets parsed to *outside* the table
>       <td> This text goes inside the table </td>
>     </tr>
>   </table>
> then (according to the HTML 5 parsing algorithm, and implemented in (at 
> least) Firefox) the content of the <tr> element includes the first and 
> third lines of text, but not the second. How would you decide whether 
> the content is well-formed XML?

Is it still underspecified once we require a valid HTML5 document as input?

> For this to make sense in real HTML implementations, the definition 
> should be in terms of the document layer rather than the byte layer. 

Disagreed. Many implementations never build a DOM. We're not only 
talking about browsers here.
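
As an illustration of what I mean by not building a DOM, here is a rough 
sketch of an event-driven extractor that collects @property values and 
their text content straight from parse events. It is not a conforming RDFa 
processor: it assumes Python's html.parser, assumes every start tag has a 
matching end tag (no void elements, no implied end tags), and the class 
name PropertyCollector is hypothetical.

```python
from html.parser import HTMLParser


class PropertyCollector(HTMLParser):
    """Collects (property, text) pairs from parse events; no tree is built."""

    def __init__(self):
        super().__init__()
        self.stack = []   # one [property-or-None, text-parts] entry per open element
        self.found = []   # completed (property, text) pairs

    def handle_starttag(self, tag, attrs):
        self.stack.append([dict(attrs).get("property"), []])

    def handle_data(self, data):
        # Text counts towards every open element that carries @property.
        for prop, parts in self.stack:
            if prop is not None:
                parts.append(data)

    def handle_endtag(self, tag):
        # Assumption: tags are properly nested, so the top entry matches.
        if not self.stack:
            return
        prop, parts = self.stack.pop()
        if prop is not None:
            self.found.append((prop, "".join(parts)))


collector = PropertyCollector()
collector.feed('<div xmlns:t="test:">'
               '<span typeof="t:x" property="t:y">Test</span></div>')
collector.close()
print(collector.found)
```

The only state kept is a stack of currently open elements, which is 
roughly what a streaming implementation has available.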

> ...
> How are xmlns:* attributes meant to be processed? E.g. what is the 
> expected output in the following cases:
> 
> <div xmlns:T="test:">
>   <span typeof="t:x" property="t:y">Test</span>
> </div>
> 
> <div XMLNS:t="test:">
>   <span typeof="t:x" property="t:y">Test</span>
> </div>
> 
> <div xmlns:T="test:">
>   <span typeof="T:x" property="T:y">Test</span>
> </div>
> 
> <div xmlns:t="test:">
>   <div xmlns:t="">
>     <span typeof="t:x" property="t:y">Test</span>
>   </div>
> </div>

I would expect the results to be the same for XHTML and HTML serializations.
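
A small sketch of why that does not come for free: an HTML tokenizer 
lowercases attribute names but leaves attribute values alone, so a prefix 
declared as xmlns:T arrives as "t" while a CURIE such as "T:x" keeps its 
case. This again assumes Python's html.parser as a stand-in tokenizer; 
PrefixMap and scan are hypothetical helpers, and the sketch ignores 
element scoping of declarations.

```python
from html.parser import HTMLParser


class PrefixMap(HTMLParser):
    """Records xmlns:* declarations and the raw typeof/property values."""

    def __init__(self):
        super().__init__()
        self.declared = {}   # prefix as reported by the tokenizer -> URI
        self.curies = []     # attribute values, case untouched

    def handle_starttag(self, tag, attrs):
        for name, value in attrs:
            if name.startswith("xmlns:"):
                self.declared[name[len("xmlns:"):]] = value
            elif name in ("typeof", "property"):
                self.curies.append(value)


def scan(markup):
    """Return (declared prefixes, CURIE strings) for a markup fragment."""
    parser = PrefixMap()
    parser.feed(markup)
    parser.close()
    return parser.declared, parser.curies


upper_prefix = scan('<div xmlns:T="test:">'
                    '<span typeof="T:x" property="T:y">Test</span></div>')
upper_xmlns = scan('<div XMLNS:t="test:">'
                   '<span typeof="t:x" property="t:y">Test</span></div>')
print(upper_prefix)
print(upper_xmlns)
```

In the first case the declared prefix is "t" but the CURIEs use "T", so a 
case-sensitive lookup finds nothing; an XML parser would keep "T" in both 
places. Getting the same results for both serializations therefore needs 
an explicit rule for prefix matching.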

> <div xmlns:t="test1:" id="d">
>   <span typeof="t:x" property="t:y">Test</span>
> </div>
> <script>
>   document.getElementById('d').setAttributeNS(
>     'http://www.w3.org/2000/xmlns/', 'xmlns:t', 'test2:');
>     /* (now the element has two distinct attributes,
>        each in different namespaces) */
> </script>

That example illustrates why it's dangerous to focus too much on 
processing in the DOM. Many RDFa processors will never execute the 
script. So I think considerations like the one above should be treated 
as a distinct problem (potentially in an appendix of the spec).

> ...

BR, Julian

Received on Saturday, 23 May 2009 11:18:36 UTC