Re: HTML 4 Profile for RDFa from Philip Taylor on 2009-05-14 (public-rdfa@w3.org from May 2009)

From: Philip Taylor <pjt47@cam.ac.uk>
Date: Fri, 15 May 2009 00:58:48 +0100
To: Shane McCarron <shane@aptest.com>
CC: Sam Ruby <rubys@intertwingly.net>, RDFa Community <public-rdfa@w3.org>, "public-rdf-in-xhtml-tf.w3.org" <public-rdf-in-xhtml-tf@w3.org>, HTML WG <public-html@w3.org>
Message-ID: <4A0CB038.6040207@cam.ac.uk>

Shane McCarron wrote:
> Philip Taylor wrote:
>> Indeed, it would be good to have this defined with the level of precision 
>> that HTML 5 has, so we can be sure implementations will be able to 
>> agree on how to extract RDFa from text/html content.
>> [...]
> Well - remember that the document you are looking at is written in the 
> context of HTML 4.  In HTML 4 none of what you say above makes any 
> sense.  Attributes are tokens - and the token "xml:lang" is what I was 
> talking about.

Yeah, I'm not sure what else you could do in the context of HTML 4. I'm 
approaching this from the context of HTML 5 - I think it would be 
valuable to define precisely the mapping from text/html to RDF triples, 
so that people can know what to expect when they run their content 
through any RDFa-aware tool, and it only seems to be feasible to define 
that in the HTML 5 context.

(This might be in addition to an HTML 4 extension like in your document, 
not necessarily a replacement, but I'm not personally interested in 
working with HTML 4. Maybe that means the "HTML 4 Profile for RDFa" 
thread is not the best place to discuss this, but better here than 
nowhere...)

>> [Stuff about XMLLiterals]
> We have no presumption of how an RDFa processor is implemented.  It 
> might be client side via a browser.  It might be server side.  It might 
> be part of an XML tool-chain.  It doesn't really matter.

Is there any implementation that is *not* based on some kind of abstract 
document model (like DOM or SAX or some custom tree structure, where 
documents are parsed into elements and attributes before any further 
processing)?

It seems to me that requiring the abstract document model to be 
re-serialised into well-formed XML (regardless of whether it originated 
from an XML parser, or from parsing HTML with missing quotes and 
unclosed <br>s, or from a DOM API, or anywhere else) would be the best 
way to ensure correctness (since the output will always be well-formed 
XML, by definition), functionality (since it would let you use 
XMLLiterals in text/html with few surprises or special cases), and 
practical implementability (since everyone should already have a tree of 
elements and attributes and be able to serialise it into XML). But that 
does rely on the concept of a document model, which only really exists in 
HTML 5 and not in HTML 4.
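
A rough sketch of that re-serialisation idea, using only the Python 
standard library. It is an illustration, not the HTML 5 parsing 
algorithm: the TreeBuilder class, the to_xml helper and the short 
VOID_ELEMENTS list are all hypothetical names made up for this example, 
and the error recovery is far cruder than what HTML 5 specifies.

```python
from html.parser import HTMLParser
import xml.etree.ElementTree as ET

# Elements that never take content or an end tag (partial list, for the sketch).
VOID_ELEMENTS = {"br", "hr", "img", "input", "link", "meta"}


class TreeBuilder(HTMLParser):
    """Builds an ElementTree from loosely written HTML (illustrative only)."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.root = ET.Element("root")  # hypothetical wrapper element
        self.stack = [self.root]

    def handle_starttag(self, tag, attrs):
        # html.parser reports None for valueless attributes; store "" instead.
        attributes = {name: value or "" for name, value in attrs}
        element = ET.SubElement(self.stack[-1], tag, attributes)
        if tag not in VOID_ELEMENTS:
            self.stack.append(element)

    def handle_endtag(self, tag):
        # Close back to the nearest matching open element; ignore stray end tags.
        for i in range(len(self.stack) - 1, 0, -1):
            if self.stack[i].tag == tag:
                del self.stack[i:]
                break

    def handle_data(self, data):
        parent = self.stack[-1]
        if len(parent):
            # Text that follows a child element belongs to that child's tail.
            last = parent[-1]
            last.tail = (last.tail or "") + data
        else:
            parent.text = (parent.text or "") + data


def to_xml(html):
    """Parse tag soup and re-serialise its first top-level element as XML."""
    builder = TreeBuilder()
    builder.feed(html)
    builder.close()
    return ET.tostring(builder.root[0], encoding="unicode")


# Missing quotes, an unclosed <br> and an unclosed <b> all go in;
# whatever comes out can be read back by an XML parser.
xml_literal = to_xml("<p class=note>one<br>two <b>bold</p>")
ET.fromstring(xml_literal)  # raises ParseError if the output were not well-formed
```

The point is only that the serialiser works from the tree, so the quality 
of the original markup stops mattering once parsing is done.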

> I think you need to take a step back and think about 
> goals rather than implementation strategies.  The goal here is that all 
> implementations extract the same collection of triples from a given 
> document.

I like that goal :-). I don't want to limit things to a single 
implementation strategy (e.g. DOM) - but some people will use that 
implementation strategy, and if other implementations are required to 
extract the same collection of triples, then it seems sensible to define 
the requirements in a way that can be easily mapped onto that 
implementation strategy (and preferably onto others), rather than 
leaving a huge gap that implementers have to sort out themselves and 
could easily get wrong.

The DOM-based model used by the HTML 5 parsing algorithm can be easily 
mapped onto common implementation strategies (DOM, SAX, XOM, 
ElementTree, etc). The token-based model of HTML 4 can't (hence the 
crazy incompatibilities between HTML parsers, and the need for a huge 
amount of work in HTML 5 to define the mapping for the first time). So 
defining RDFa triple extraction based on HTML 5 seems much more likely 
to achieve the goal than defining it based on HTML 4, and therefore 
seems a more useful thing to work on.

>> <http://www.whatwg.org/specs/web-apps/current-work/multipage/the-xhtml-syntax.html#serializing-xhtml-fragments>) 
>>
> In HTML 5, where there is an XML serialisation method, that might make 
> sense.  In HTML 4 however, we don't have that luxury.

Sounds like another benefit of defining RDFa-in-HTML based on HTML 5 
instead of suffering the restrictions of HTML 4 :-)

> [...] in the HTML profile I think it 
> would be reasonable to require that prefix names are mapped to 
> lower-case during processing.   Or some other solution that gets us to 
> the point where a browser-based implementation that requests attribute 
> names from a DOM node can still work.  My conclusion here is that prefix 
> names should be treated case-insensitively in the HTML profile.  Do you 
> agree?

HTML parsers (by which I mean HTML 5 and web browsers) don't preserve 
the case of element or attribute names. Anything processing the output 
from a parser will see everything as lowercase (or uppercase, depending 
on what API they use), so RDFa mustn't consider the case of attribute 
names to be significant. When comparing prefixes in CURIEs, I suppose it 
could do a case-insensitive comparison, but that would be unnecessary 
complexity and annoyingly inconsistent with XHTML. rdfquery and 
http://www.w3.org/2006/07/SWD/RDFa/impl/js/ appear to treat the 
attribute name as lowercase, and then case-sensitively compare against 
the CURIE prefix.
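
To make the consequence concrete, here is a minimal sketch of that 
behaviour using Python's html.parser, which lowercases attribute names 
when it reports them. PrefixCollector and expand_curie are hypothetical 
names for this example; the sketch ignores prefix scoping and everything 
else a real RDFa processor has to do, and it is not taken from either 
implementation mentioned above.

```python
from html.parser import HTMLParser


class PrefixCollector(HTMLParser):
    """Collects xmlns:* declarations as the HTML parser reports them."""

    def __init__(self):
        super().__init__()
        self.prefixes = {}

    def handle_starttag(self, tag, attrs):
        # Attribute names arrive already lowercased, whatever the author typed.
        for name, value in attrs:
            if name.startswith("xmlns:") and value:
                self.prefixes[name[len("xmlns:"):]] = value


def expand_curie(curie, prefixes):
    """Expand a CURIE with an exact (case-sensitive) prefix match, or return None."""
    prefix, _, reference = curie.partition(":")
    base = prefixes.get(prefix)
    return base + reference if base is not None else None


collector = PrefixCollector()
collector.feed(
    '<div xmlns:DC="http://purl.org/dc/elements/1.1/">'
    '<span property="DC:title">Example</span></div>'
)

print(collector.prefixes)                          # the only key is "dc"
print(expand_curie("dc:title", collector.prefixes))  # resolves
print(expand_curie("DC:title", collector.prefixes))  # None: prefix case differs
```

So a mixed-case prefix that works in XHTML quietly stops resolving in 
text/html unless the author also writes the CURIE prefix in lowercase.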

>> Should the same processing rules be used for documents from both HTML 
>> and XHTML parsers, or would DOM-based implementations need to detect 
>> where the input came from and switch processing rules accordingly? If 
>> there is a difference, what happens if I adoptNode from an XHTML 
>> document into an HTML document, or vice versa?
> Err... What's adoptNode?

http://www.w3.org/TR/DOM-Level-3-Core/core.html#Document3-adoptNode

> And how are these two documents getting together?

I might have an HTML document (containing some RDFa), which uses 
XMLHttpRequest to download an XHTML fragment (also containing some RDFa) 
and inserts it into the current page, and then I might attempt to 
extract RDF triples from the page.

> I mean, that's sort of out of scope of an HTML 4 profile for RDFa.

It's out of scope for HTML 4, but it seems necessary for the goal that 
"all implementations extract the same collection of triples from a given 
document" if you include dynamic implementations. (And 
http://rdfa.info/wiki/Dynamic-content-parsing suggests people are 
interested in dynamic implementations.)

> With regard to the first part of the question, I believe the same 
> processing rules can be used.

So I could use the "lang" attribute (instead of "xml:lang") in XHTML 
documents as well as HTML, because the same processing rules would be 
applied? (If so, it would be nice if the RDFa-in-XHTML specification 
agreed with that.)

-- 
Philip Taylor
pjt47@cam.ac.uk
Received on Thursday, 14 May 2009 23:59:38 UTC