Re: HTML 4 Profile for RDFa from Philip Taylor on 2009-05-14 (public-html@w3.org from May 2009)

From: Philip Taylor <pjt47@cam.ac.uk>
Date: Thu, 14 May 2009 21:11:21 +0100
To: Sam Ruby <rubys@intertwingly.net>
CC: Shane McCarron <shane@aptest.com>, RDFa Community <public-rdfa@w3.org>, "public-rdf-in-xhtml-tf.w3.org" <public-rdf-in-xhtml-tf@w3.org>, HTML WG <public-html@w3.org>
Message-ID: <4A0C7AE9.30708@cam.ac.uk>
Sam Ruby wrote:
> Shane McCarron wrote:
>> Folks,
>>
>> Thanks to you all for encouraging me to create a draft profile for 
>> RDFa in HTML 4.  This document has no official standing of course - it 
>> is just something we at ApTest have been using for a while as a way of 
>> pushing metadata into traditional web sites and user agents.
>>
>> You can find the latest version at 
>> http://www3.aptest.com/standards/rdfa-html/
>>
>> Feel free to send comments to me directly or to the public-rdfa@w3.org 
>> list if you want to share them with the community.  I look forward to 
>> seeing what you think!
> 
> A promising start!
> 
> I would hope that we could work together to get HTML 5 included and the 
> various issues that have been discussed to date resolved.

Indeed, it would be good to have this defined with the level of precision 
that HTML 5 has, so we can be sure implementations will be able to agree 
on how to extract RDFa from text/html content.

A few significant issues that I see in the current version:

What is "the @xml:lang attribute"? Is it the attribute with local name 
"xml:lang" in no namespace (as produced by an HTML 5 parser and by 
current HTML browser parsers)? Or the attribute with local name "lang" 
in the namespace "http://www.w3.org/XML/1998/namespace" (as produced by 
an XML parser, and insertable into an HTML document via DOM APIs)? Or 
both (in which case both could be specified on one element, in addition 
to "lang" in no namespace)?
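
To make the three candidates concrete, here is a minimal sketch of how a 
DOM-based processor might tell them apart. The helper name 
findLanguageCandidates and the plain-object stand-in for an element are 
assumptions made for illustration; the only thing relied on is that each 
attribute exposes namespaceURI, localName and value, as DOM Attr nodes do.

```javascript
// Namespace URI that an XML parser assigns to the "xml" prefix.
const XML_NS = 'http://www.w3.org/XML/1998/namespace';

// Hypothetical helper: report every attribute on `el` that could plausibly
// be "the language attribute". `el.attributes` is assumed to be an
// array-like of objects with namespaceURI, localName and value.
function findLanguageCandidates(el) {
  const found = {
    xmlLangNoNamespace: null,   // local name "xml:lang", no namespace
    xmlLangXmlNamespace: null,  // local name "lang", XML namespace
    lang: null,                 // local name "lang", no namespace
  };
  for (const attr of Array.from(el.attributes)) {
    if (attr.namespaceURI === null && attr.localName === 'xml:lang') {
      found.xmlLangNoNamespace = attr.value;
    } else if (attr.namespaceURI === XML_NS && attr.localName === 'lang') {
      found.xmlLangXmlNamespace = attr.value;
    } else if (attr.namespaceURI === null && attr.localName === 'lang') {
      found.lang = attr.value;
    }
  }
  return found;
}

// Stand-in for one element that carries all three attributes at once.
const langExample = { attributes: [
  { namespaceURI: null, localName: 'xml:lang', value: 'en' },
  { namespaceURI: XML_NS, localName: 'lang', value: 'fr' },
  { namespaceURI: null, localName: 'lang', value: 'de' },
] };
console.log(findLanguageCandidates(langExample));
```

The sketch deliberately does not pick a winner: which of the three values 
takes precedence is exactly what the profile would need to specify.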


"If the object of a triple would be an XMLLiteral, and the input to the 
processor is not well-formed [XML]" - I don't understand what that means 
in an HTML context. Does it mean something like "the bytes in the 
HTML file that correspond to the contents of the relevant element could 
be parsed as well-formed XML (modulo various namespace declaration 
issues)"? If so, that seems impossible to implement. The input to the 
RDFa processor will most likely be a DOM, possibly manipulated by the 
DOM APIs rather than coming straight from an HTML parser, so it may 
never have had a byte representation at all.

Even without scripting, there isn't always a contiguous sequence of 
bytes corresponding to the content of an element. E.g. if the HTML input is:
   <table>
     <tr some-attributes-to-say-this-element-outputs-an-XMLLiteral>
       <td> This text goes inside the table </td>
       This text gets parsed to *outside* the table
       <td> This text goes inside the table </td>
     </tr>
   </table>
then (according to the HTML 5 parsing algorithm, as implemented in at 
least Firefox) the content of the <tr> element includes the first and 
third lines of text, but not the second. How would you decide whether 
the content is well-formed XML?

For this to make sense in real HTML implementations, the definition 
should be in terms of the document layer rather than the byte layer. 
(The XMLLiteral should be an XML-fragment serialisation of the element, 
and some error handling (like ignoring the triple) would occur if it's 
impossible to serialise as XML, similar to the requirements in 
<http://www.whatwg.org/specs/web-apps/current-work/multipage/the-xhtml-syntax.html#serializing-xhtml-fragments>)
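
A minimal sketch of that document-layer approach follows. It is not the 
algorithm from the linked specification: the node model (plain objects 
with type, name, attrs, children, data), the function names, and the 
ASCII-only name check are all simplifying assumptions, and namespace 
declarations and control characters are not handled. It only shows the 
shape of the idea: serialise the element's contents from the tree, and 
report failure so the caller can ignore the triple.

```javascript
// Simplified, ASCII-only approximation of an XML name (assumption: the
// real XML Name production allows many more characters).
const XML_NAME = /^[A-Za-z_][A-Za-z0-9._-]*(:[A-Za-z_][A-Za-z0-9._-]*)?$/;

function escapeText(s) {
  return s.replace(/&/g, '&amp;').replace(/</g, '&lt;').replace(/>/g, '&gt;');
}

function escapeAttr(s) {
  return escapeText(s).replace(/"/g, '&quot;');
}

// Hypothetical node model:
//   { type: 'element', name, attrs: { ... }, children: [ ... ] }
//   { type: 'text', data }  or  { type: 'comment', data }
// Returns the XML serialisation, or null if the node cannot be written as XML.
function serializeXml(node) {
  if (node.type === 'text') return escapeText(node.data);
  if (node.type === 'comment') {
    // "--" inside a comment, or a trailing "-", cannot be expressed in XML.
    if (node.data.includes('--') || node.data.endsWith('-')) return null;
    return '<!--' + node.data + '-->';
  }
  if (!XML_NAME.test(node.name)) return null;
  let out = '<' + node.name;
  for (const [name, value] of Object.entries(node.attrs || {})) {
    if (!XML_NAME.test(name)) return null;
    out += ' ' + name + '="' + escapeAttr(value) + '"';
  }
  out += '>';
  for (const child of node.children || []) {
    const s = serializeXml(child);
    if (s === null) return null;
    out += s;
  }
  return out + '</' + node.name + '>';
}

// Serialise the contents of `element`; null means "ignore the triple".
function xmlLiteralFor(element) {
  let out = '';
  for (const child of element.children || []) {
    const s = serializeXml(child);
    if (s === null) return null;
    out += s;
  }
  return out;
}
```

Applied to a tree like the one an HTML 5 parser builds for the table 
example, the <tr> has only the two <td> children, so the literal is 
well-defined even though the source bytes between the tags are not.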


How are xmlns:* attributes meant to be processed? E.g. what is the 
expected output in the following cases:

<div xmlns:T="test:">
   <span typeof="t:x" property="t:y">Test</span>
</div>

<div XMLNS:t="test:">
   <span typeof="t:x" property="t:y">Test</span>
</div>

<div xmlns:T="test:">
   <span typeof="T:x" property="T:y">Test</span>
</div>

<div xmlns:t="test:">
   <div xmlns:t="">
     <span typeof="t:x" property="t:y">Test</span>
   </div>
</div>

<div xmlns:t="test1:" id="d">
   <span typeof="t:x" property="t:y">Test</span>
</div>
<script>
   document.getElementById('d').setAttributeNS(
     'http://www.w3.org/2000/xmlns/', 'xmlns:t', 'test2:');
   /* (now the element has two distinct attributes,
      each in different namespaces) */
</script>
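
The last case can be sketched as follows, to show why a DOM-based 
processor needs a rule here. The helper name prefixMappings and the 
plain-object stand-in for the element are assumptions for illustration; 
the attribute shapes mirror what the DOM exposes (namespaceURI, prefix, 
localName, value). Note also that an HTML parser lowercases attribute 
names, so in the first three cases the DOM attribute is named "xmlns:t" 
regardless of how it was written, while the prefix inside the typeof and 
property values keeps its case.

```javascript
// Namespace that an XML parser (or setAttributeNS) uses for xmlns:* attributes.
const XMLNS_NS = 'http://www.w3.org/2000/xmlns/';

// Hypothetical helper: gather the prefix mappings declared on one element,
// keeping the two possible sources apart so a conflict stays visible.
function prefixMappings(el) {
  const fromNoNamespace = {};     // "xmlns:foo" in no namespace (HTML parser)
  const fromXmlnsNamespace = {};  // "foo" in the xmlns namespace (XML parser or DOM API)
  for (const attr of Array.from(el.attributes)) {
    if (attr.namespaceURI === null && attr.localName.startsWith('xmlns:')) {
      fromNoNamespace[attr.localName.slice('xmlns:'.length)] = attr.value;
    } else if (attr.namespaceURI === XMLNS_NS && attr.prefix === 'xmlns') {
      fromXmlnsNamespace[attr.localName] = attr.value;
    }
  }
  return { fromNoNamespace, fromXmlnsNamespace };
}

// Stand-in for the element with id "d" after the script above has run.
const d = { attributes: [
  { namespaceURI: null, prefix: null, localName: 'xmlns:t', value: 'test1:' },
  { namespaceURI: XMLNS_NS, prefix: 'xmlns', localName: 't', value: 'test2:' },
] };
const mappings = prefixMappings(d);
console.log(mappings.fromNoNamespace.t, mappings.fromXmlnsNamespace.t);
// prints "test1: test2:"
```

Both mappings for "t" are present on the same element, so the output for 
the <span> depends entirely on which source the profile says to consult.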


Should the same processing rules be used for documents from both HTML 
and XHTML parsers, or would DOM-based implementations need to detect 
where the input came from and switch processing rules accordingly? If 
there is a difference, what happens if I adoptNode from an XHTML 
document into an HTML document, or vice versa?

-- 
Philip Taylor
pjt47@cam.ac.uk
Received on Thursday, 14 May 2009 20:12:01 UTC