Re: HTML 4 Profile for RDFa from Shane McCarron on 2009-05-14 (public-rdf-in-xhtml-tf@w3.org from May 2009)

From: Shane McCarron <shane@aptest.com>
Date: Thu, 14 May 2009 16:24:09 -0500
To: Philip Taylor <pjt47@cam.ac.uk>
CC: Sam Ruby <rubys@intertwingly.net>, RDFa Community <public-rdfa@w3.org>, "public-rdf-in-xhtml-tf.w3.org" <public-rdf-in-xhtml-tf@w3.org>, HTML WG <public-html@w3.org>
Message-ID: <4A0C8BF9.7060503@aptest.com>
Philip Taylor wrote:
> Indeed, it would be good to have this defined with the level of precision 
> that HTML 5 has, so we can be sure implementations will be able to 
> agree on how to extract RDFa from text/html content.
>
> A few significant issues that I see in the current version:
>
> What is "the @xml:lang attribute"? Is it the attribute with local name 
> "xml:lang" in no namespace (as would be produced by an HTML 5 parser 
> (and by current HTML browser parser implementations))? or the 
> attribute with local name "lang" in the namespace 
> "http://www.w3.org/XML/1998/namespace" (as would be produced by an XML 
> parser, and could be inserted in an HTML document via DOM APIs)? or 
> both (in which case both could be specified on one element, in 
> addition to "lang" in no namespace)?
Well - remember that the document you are looking at is written in the 
context of HTML 4.  In HTML 4 none of what you say above makes any 
sense.  Attributes are tokens - and the token "xml:lang" is what I was 
talking about.   In HTML 4 those attribute names are case-insensitive - 
I need to add something about that to the draft.  Thanks for the reminder!
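
To illustrate the token view of attributes, here is a minimal sketch of
a case-insensitive attribute lookup. It is not taken from the draft: the
helper name getAttributeToken and the [name, value] pair representation
are assumptions made only for this example.

```javascript
// Hypothetical helper (not from the draft): look up an attribute by its
// token name, ignoring case, as HTML 4 does for attribute names.
// `attrs` is assumed to be an array of [name, value] pairs.
function getAttributeToken(attrs, wanted) {
  const target = wanted.toLowerCase();
  for (const [name, value] of attrs) {
    if (name.toLowerCase() === target) {
      return value;
    }
  }
  return null; // no attribute with that token name
}

// "XML:Lang" and "xml:lang" are the same token under this rule.
const exampleAttrs = [["XML:Lang", "en"], ["property", "dc:title"]];
console.log(getAttributeToken(exampleAttrs, "xml:lang"));
```

Under this view the colon is just another character in the token, so no
namespace lookup is involved.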
>
> "If the object of a triple would be an XMLLiteral, and the input to 
> the processor is not well-formed [XML]" - I don't understand what that 
> means in an HTML context. Is it meant to mean something like "the 
> bytes in the HTML file that correspond to the contents of the relevant 
> element could be parsed as well-formed XML (modulo various namespace 
> declaration issues)"? If so, that seems impossible to implement. The 
> input to the RDFa processor will most likely be a DOM, possibly 
> manipulated by the DOM APIs rather than coming straight from an HTML 
> parser, so it may never have had a byte representation at all.
We have no presumption of how an RDFa processor is implemented.  It 
might be client side via a browser.  It might be server side.  It might 
be part of an XML tool-chain.  It doesn't really matter.  In this case, 
the document I wrote is a little too fuzzy because the idea is not 
completely cooked yet.  Here's the problem:  RDFa permits the creation 
of objects that are of type XMLLiteral.  That datatype is tightly 
defined, and as you can imagine it is expected to contain well-formed 
XML.  If a Conforming RDFa Processor were to generate triples that 
contained data of type XMLLiteral, and that data were not "well-formed" 
as defined in XML, then consumers of that data could easily be very 
surprised!


>
> Even without scripting, there isn't always a contiguous sequence of 
> bytes corresponding to the content of an element. E.g. if the HTML 
> input is:
>   <table>
>     <tr some-attributes-to-say-this-element-outputs-an-XMLLiteral>
>       <td> This text goes inside the table </td>
>       This text gets parsed to *outside* the table
>       <td> This text goes inside the table </td>
>     </tr>
>   </table>
> then (according to the HTML 5 parsing algorithm, and implemented in 
> (at least) Firefox) the content of the <tr> element includes the first 
> and third lines of text, but not the second. How would you decide 
> whether the content is well-formed XML?
Yeah - tricky.  I think you need to take a step back and think about 
goals rather than implementation strategies.  The goal here is that all 
implementations extract the same collection of triples from a given 
document.  There are a lot of ways to achieve that.  In the XHTML 
profile of RDFa we relied upon the XML parsing model.   Consequently, we 
are confident that we are being handed well-formed content.  If you do 
an implementation via the DOM, you can also be confident of that, since 
by the time content gets into the DOM you can assume the processor has 
done whatever magic was necessary and you have a node tree that you 
could turn back into content that would be well-formed.  If you are 
writing your own parser that sifts through a document character by 
character... well, you are going to have some work ahead of you!
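
The claim that a node tree can be turned back into well-formed content
can be sketched as follows. This is only an illustration: the node shape
(a string for text, or an object with name, attrs and children) and the
function names are hypothetical, and the sketch assumes that element and
attribute names are already valid XML names, which a real DOM does not
guarantee.

```javascript
// Escape text so it cannot be mistaken for markup.
function escapeText(text) {
  return text.replace(/&/g, "&amp;").replace(/</g, "&lt;").replace(/>/g, "&gt;");
}

// Attribute values additionally need the double quote escaped.
function escapeAttribute(value) {
  return escapeText(value).replace(/"/g, "&quot;");
}

// Hypothetical node shape: a string is a text node; an element is
// { name, attrs: { ... }, children: [ ... ] }.
function serializeNode(node) {
  if (typeof node === "string") {
    return escapeText(node);
  }
  const attrs = Object.entries(node.attrs || {})
    .map(([name, value]) => ` ${name}="${escapeAttribute(value)}"`)
    .join("");
  const children = (node.children || []).map(serializeNode).join("");
  // Every start tag gets a matching end tag, so the tags always balance.
  return `<${node.name}${attrs}>${children}</${node.name}>`;
}

const row = {
  name: "tr",
  children: [
    { name: "td", children: [" This text goes inside the table "] },
    { name: "td", children: [" This text goes inside the table "] },
  ],
};
console.log(serializeNode(row));
```

Because the output is built from the tree rather than from the original
bytes, the tags balance no matter how the tree was produced.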

With regard to your example above.... if I had a DOM based processor, I 
would have missed out on the second line of text I imagine.  If I wrote my own I would 
have included it ('cause that is well formed - the XML parser would have 
handed it to me).  In the XHTML profile we (sort of) address this in 
that we only tightly constrain behavior for *valid* content.  The 
content above is *invalid* according to the XHTML+RDFa schema - so while 
the behavior of existing implementations might be inconsistent, I 
personally won't get too excited about it.

In the HTML profile of RDFa, things are much the same.  We can attempt 
to be very very precise about how the parsing of the content should be 
handled, or we can rely upon the parsing model spelled out by the 
underlying specification (HTML 4 in this case).  Now, I am sure you will 
agree that HTML 4 does a pretty poor job of defining the parsing model, 
but.... is it adequate for our needs in this instance?  My belief is 
that it is adequate - at least for the vast majority of the RDFa 
processing rules.  In particular, most implementors will rely upon 
existing parsing libraries, and the problems associated with that 
parsing have been largely sorted out over the years, even to the point 
that they are being codified in the early draft HTML5 documents.

The only place I have a concern is with regard to creating XMLLiterals.  
This is a very powerful aspect of RDFa, and I am loath to disable it in 
the HTML profile if I don't have to.  Instead, I would like to identify 
a light-weight model that implementors can use.  For example, we could 
say that if an object is of type XMLLiteral, then its content is escaped 
so that there is no markup (< to &lt; etc).  This would mean that it is 
"well formed XML", and that it could be turned back into its original 
source form, which is the goal of such content.  However, it would also 
mean that a consumer of such content would need to know this and do the 
reverse transformation before using the content.  I don't know what the 
right answer is - maybe we can figure it out together?
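
A minimal sketch of the escaping idea floated above, together with the
reverse transformation a consumer would need. The function names are
hypothetical, and this only illustrates one possible model, not a
settled rule of the profile.

```javascript
// Producer side: escape markup so the literal contains no tags at all.
// The ampersand must be escaped first so existing entities survive.
function escapeMarkup(source) {
  return source
    .replace(/&/g, "&amp;")
    .replace(/</g, "&lt;")
    .replace(/>/g, "&gt;");
}

// Consumer side: reverse the transformation to recover the source form.
// The ampersand must be restored last, mirroring the order above.
function unescapeMarkup(literal) {
  return literal
    .replace(/&lt;/g, "<")
    .replace(/&gt;/g, ">")
    .replace(/&amp;/g, "&");
}

// HTML 4 content that is not well-formed XML (unclosed <p> and <br>).
const source = "<p>one<br>two &amp; three";
const literal = escapeMarkup(source);
console.log(literal);
console.log(unescapeMarkup(literal) === source);
```

The escaped form is plain character data, so it is trivially
well-formed, but the consumer has to know that it was escaped.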
>
> For this to make sense in real HTML implementations, the definition 
> should be in terms of the document layer rather than the byte layer. 
> (The XMLLiteral should be an XML-fragment serialisation of the 
> element, and some error handling (like ignoring the triple) would 
> occur if it's impossible to serialise as XML, similar to the 
> requirements in 
> <http://www.whatwg.org/specs/web-apps/current-work/multipage/the-xhtml-syntax.html#serializing-xhtml-fragments>) 
>
In HTML 5, where there is an XML serialisation method, that might make 
sense.  In HTML 4 however, we don't have that luxury.  I suppose we 
could say that the HTML 4 content is transformed into corresponding 
XHTML 1.0 content... but there are no reliable serializers out there 
that do that really.
>
> How are xmlns:* attributes meant to be processed? E.g. what is the 
> expected output in the following cases:
>
> <div xmlns:T="test:">
>   <span typeof="t:x" property="t:y">Test</span>
> </div>
>
> <div XMLNS:t="test:">
>   <span typeof="t:x" property="t:y">Test</span>
> </div>
>
> <div xmlns:T="test:">
>   <span typeof="T:x" property="T:y">Test</span>
> </div>
>
> <div xmlns:t="test:">
>   <div xmlns:t="">
>     <span typeof="t:x" property="t:y">Test</span>
>   </div>
> </div>
>
> <div xmlns:t="test1:" id="d">
>   <span typeof="t:x" property="t:y">Test</span>
> </div>
> <script>
>   document.getElementById('d').setAttributeNS(
>     'http://www.w3.org/2000/xmlns/', 'xmlns:t', 'test2:');
>     /* (now the element has two distinct attributes,
>        each in different namespaces) */
> </script>
I had not thought about this much before.  Attribute names in HTML / 
SGML are case-insensitive.  CURIE prefix names are of course NOT.  
However, I can almost guarantee you that browser-based implementations 
of the XHTML profile right now would fail to work correctly when faced 
with CURIE prefixes that differ only in case.  Interesting point - I am 
going to test that later.

I think we would be wise to advise document authors to not define 
prefixes that differ only in case.  And in the HTML profile I think it 
would be reasonable to require that prefix names are mapped to 
lower-case during processing.   Or some other solution that gets us to 
the point where a browser-based implementation that requests attribute 
names from a DOM node can still work.  My conclusion here is that prefix 
names should be treated case-insensitively in the HTML profile.  Do you 
agree?
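
A minimal sketch of the lower-casing option, assuming attributes arrive
as [name, value] pairs. The names collectPrefixes and resolveCurie are
hypothetical, and the rule itself is only the proposal made above, not
something the draft already specifies.

```javascript
// Collect prefix mappings from xmlns:* attributes, mapping both the
// attribute name and the prefix to lower case. An empty value (as in
// xmlns:t="") is stored as-is; how it should be handled is left open.
function collectPrefixes(attrs, inherited = new Map()) {
  const prefixes = new Map(inherited);
  for (const [name, value] of attrs) {
    const lower = name.toLowerCase();
    if (lower.startsWith("xmlns:") && lower.length > "xmlns:".length) {
      prefixes.set(lower.slice("xmlns:".length), value);
    }
  }
  return prefixes;
}

// Resolve a CURIE, treating its prefix case-insensitively.
// Returns null when there is no colon or the prefix is unknown.
function resolveCurie(curie, prefixes) {
  const colon = curie.indexOf(":");
  if (colon < 0) {
    return null;
  }
  const prefix = curie.slice(0, colon).toLowerCase();
  const reference = curie.slice(colon + 1);
  return prefixes.has(prefix) ? prefixes.get(prefix) + reference : null;
}

// The first three quoted cases all resolve the same way under this rule.
console.log(resolveCurie("t:x", collectPrefixes([["xmlns:T", "test:"]])));
console.log(resolveCurie("t:x", collectPrefixes([["XMLNS:t", "test:"]])));
console.log(resolveCurie("T:x", collectPrefixes([["xmlns:T", "test:"]])));
```

With this rule, two declarations whose prefixes differ only in case
collapse into one mapping, and the later one wins.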
>
> Should the same processing rules be used for documents from both HTML 
> and XHTML parsers, or would DOM-based implementations need to detect 
> where the input came from and switch processing rules accordingly? If 
> there is a difference, what happens if I adoptNode from an XHTML 
> document into an HTML document, or vice versa?
Err... What's adoptNode?  And how are these two documents getting 
together?  I mean, that's sort of out of scope of an HTML 4 profile for 
RDFa.  With regard to the first part of the question, I believe the same 
processing rules can be used.  I have an implementation that does it 
now.  So do lots of other people.  My implementation is DOM based 
though, so that makes it relatively simple to have the same rules work.

Thanks for your comments!

-- 
Shane P. McCarron                          Phone: +1 763 786-8160 x120
Managing Director                            Fax: +1 763 786-8180
ApTest Minnesota                            Inet: shane@aptest.com
Received on Thursday, 14 May 2009 21:24:57 UTC