Re: RDFa and Web Directions North 2009 from Henri Sivonen on 2009-02-16 (public-rdfa@w3.org from February 2009)

From: Henri Sivonen <hsivonen@iki.fi>
Date: Mon, 16 Feb 2009 09:54:18 +0200
To: Mark Birbeck <mark.birbeck@webbackplane.com>
Cc: Sam Ruby <rubys@intertwingly.net>, Kingsley Idehen <kidehen@openlinksw.com>, Dan Brickley <danbri@danbri.org>, Michael Bolger <michael@michaelbolger.net>, public-rdfa@w3.org, RDFa mailing list <public-rdf-in-xhtml-tf@w3.org>, Tim Berners-Lee <timbl@w3.org>, Dan Connolly <connolly@w3.org>, Ian Hickson <ian@hixie.ch>
Message-Id: <3124640F-79B3-4C3B-A2E6-5700F3D84C47@iki.fi>
On Feb 14, 2009, at 01:57, Mark Birbeck wrote:

> You seem to be implying that there is a fundamental impediment to
> creating an RDFa parser using the tools available in an HTML DOM. You
> base this assertion on Henri's document, but all his script shows is
> that objects in an HTML DOM don't have namespace information
> available.
>
> That's no surprise.
>
> My response is that this is irrelevant.

  1) Content consumer software should work both with HTML (text/html)  
and XHTML (application/xhtml+xml) if it works with one of them.

  2) For sane *software* architecture, code above the HTML/XML parsing  
layer should be able to run its dispatch code without any conditional  
branches on the HTMLness or XMLness of the origin of the data it is  
operating on. This applies to native browser code, JavaScript code  
running in a browser and non-browser (X)HTML consumers. (Even easy- 
looking tiny variations add up.)

  3) The point above is not about abstract XML architecture. It is an  
actual way of implementing software including (but not limited to)  
Gecko, WebKit, Presto (as far as can be guessed without seeing the  
code) and Validator.nu. Furthermore, the dominant design
(http://en.wikipedia.org/wiki/Dominant_Design) of HTML5 parsers for
non-browser applications is that they expose an
XML API so that the application-level code is written as if working  
with an XML parser parsing an equivalent XHTML5 file.

  4) The qname is an artifact of the Namespaces in XML layer in XML  
and should not be significant to the application. The correct way to  
do namespace-wise correct dispatch is to dispatch on the  
[namespace,local] pair. If you are inspecting the qname of an  
attribute or element for any reason other than round-tripping  
serialization, you are Doing it Wrong.

  5) Given the points above, you should also do dispatch on the  
[namespace,local] pair on the HTML side.

  6) All features going into HTML5 should be robust and sane under
scripting even if the people proposing the feature were interested in
read-only use cases outside browsers. This includes keeping
script-generated DOMs serializable.

  7) If, in order to satisfy point #2 above, your feature requires  
using getAttribute (without NS) on getting but setAttributeNS (with  
NS) on setting (to keep the XML DOM serializable!), your feature isn't  
satisfying point #6.

  8) So far, experience shows that even violations of the above
points that look small--such as lang vs. xml:lang--are more hurtful
than people imagine at first. Examples:
   a) Browsers need to inspect two attributes instead of one to  
discover the language.
   b) To abstract problem a) away in non-browser applications in a
high-performance manner (in terms of CPU instructions executed per
application-made query for an attribute), the static RAM footprint of the
Validator.nu HTML Parser is bloated by pointer size times 2328!
   c) The lang & xml:lang part of the HTML5 spec has had the highest  
incidence of validator bugs per spec sentence. (Bugs are bad and  
costly.)
Hence, all violations of the above points should be taken very
seriously even if, in isolation and on their face, the violations seem
too small to be indignant about. Violations for xml:lang
legacy are somewhat excusable. Introducing new violations isn't.

  9) If you are defining something in terms of the namespace
mapping context, but you can't use DOM Level 3 lookupPrefix() to
implement it (without violating point #2), you are Doing it Wrong.

10) Browsers aren't the only kind of Web content consumer software.  
What you are specifying should work with XML API environments other  
than the browser flavor of DOM.

11) SAX2--arguably the most correct and complete XML API there is--
when run in the Namespace-aware mode (i.e. the correct mode  
considering contemporary XML architecture) doesn't expose the  
namespace declarations as attributes. Therefore, a SAX2-based RDFa-in- 
XHTML consumer needs to use the non-attribute abstraction  
(startPrefixMapping()) for gathering the namespace mapping context.  
However, the same application-level code (see point #2) wouldn't work  
with an HTML5 parser that implements mapping from text/html to SAX2 as  
defined today in the HTML 5 draft and as sufficient for all the HTML5  
features drafted so far.

12) XOM--arguably the most correct of the well-known XML tree APIs for
Java--doesn't expose the namespace declarations as attributes.  
Therefore, a XOM-based RDFa-in-XHTML consumer needs to use the non- 
attribute abstraction for using the namespace mapping context.  
However, the same application-level code (see point #2) wouldn't work  
with an HTML5 parser that implements mapping from text/html to XOM as  
defined today in the HTML 5 draft and as sufficient for all the HTML5  
features drafted so far. (XOM even disallows including attributes
named xmlns:foo in the tree.)

13) If points 9 through 12 were addressed by changing HTML5 parsers to
expose attributes called xmlns:foo as namespace mapping context, the
change to HTML5 to enable RDFa would be notably more complex than just
adding a few attributes.
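
As a rough illustration of points 4 and 11, here is a minimal sketch in
Python (chosen for brevity; the points above are about SAX2 in general).
It assumes that the standard library's xml.sax binding, run with the
namespaces feature on, is representative of SAX2 in Namespace-aware mode.
The handler class, the scan() helper and both sample documents are made
up for this example.

```python
import io
import xml.sax
from xml.sax.handler import ContentHandler, feature_namespaces

XHTML_NS = "http://www.w3.org/1999/xhtml"
DC_NS = "http://purl.org/dc/elements/1.1/"

# Two hypothetical documents that differ only in the prefix bound to XHTML.
DEFAULT_PREFIX_DOC = (
    b'<html xmlns="http://www.w3.org/1999/xhtml"'
    b' xmlns:dc="http://purl.org/dc/elements/1.1/">'
    b'<body><span property="dc:title">Example</span></body></html>'
)
EXPLICIT_PREFIX_DOC = (
    b'<h:html xmlns:h="http://www.w3.org/1999/xhtml"'
    b' xmlns:dc="http://purl.org/dc/elements/1.1/">'
    b'<h:body><h:span property="dc:title">Example</h:span></h:body></h:html>'
)


class NamespaceAwareHandler(ContentHandler):
    """Records what a Namespace-aware SAX consumer gets to see."""

    def __init__(self):
        super().__init__()
        self.prefix_mappings = {}  # filled only via startPrefixMapping()
        self.attribute_names = []  # (namespace, local) pairs of attributes
        self.span_count = 0

    def startPrefixMapping(self, prefix, uri):
        # Namespace declarations arrive here, not as attributes.
        self.prefix_mappings[prefix] = uri

    def startElementNS(self, name, qname, attrs):
        # name is the (namespace, local) pair; qname is ignored on purpose.
        if name == (XHTML_NS, "span"):
            self.span_count += 1
        self.attribute_names.extend(attrs.getNames())


def scan(document: bytes) -> NamespaceAwareHandler:
    """Parse document in Namespace-aware mode and return the filled handler."""
    handler = NamespaceAwareHandler()
    parser = xml.sax.make_parser()
    parser.setFeature(feature_namespaces, True)
    parser.setContentHandler(handler)
    parser.parse(io.BytesIO(document))
    return handler
```

Both documents yield the same span count and the same single attribute
name, (None, "property"). The xmlns and xmlns:dc declarations never show
up among the attributes; they are only visible through
startPrefixMapping(), which is the abstraction a text/html-to-SAX2
mapping would have to feed for the same handler to work.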

> An RDFa parser needs to be able to 'spot' whether an attribute name
> begins 'xmlns:', but for that we don't need namespace support -- it's
> just string matching, no different to detecting an attribute like
> @data-length [1].

getAttributeNS(null, "data-length") works consistently in text/html
and application/xhtml+xml.
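
A small sketch of the XML half of that statement, using Python's
xml.dom.minidom as a stand-in for a Namespace-aware XML DOM. The sample
markup and variable names are made up, and the text/html half is not
shown because the standard library has no HTML5 parser.

```python
from xml.dom import minidom

XMLNS_NS = "http://www.w3.org/2000/xmlns/"

# Hypothetical XHTML-side input.
SAMPLE = (
    '<div xmlns="http://www.w3.org/1999/xhtml"'
    ' xmlns:foo="http://example.org/foo#"'
    ' data-length="42"/>'
)

element = minidom.parseString(SAMPLE).documentElement

# data-length is in no namespace, so the key is (None, "data-length").
data_length = element.getAttributeNS(None, "data-length")

# The declaration is not a no-namespace attribute called "xmlns:foo" ...
as_plain_attribute = element.getAttributeNS(None, "xmlns:foo")

# ... this DOM files it under the xmlns namespace with local name "foo".
as_declaration = element.getAttributeNS(XMLNS_NS, "foo")
```

Here data_length is found under the no-namespace key, whereas the
no-namespace lookup for "xmlns:foo" comes back empty. That is the sense in
which spotting xmlns:foo is not the same kind of operation as reading
data-length.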

>> And I wrote that "HTML parsing rules differ in visible ways from  
>> XHTML.
>> Ways that affect the specific names of attributes chose[sic] in  
>> RDFa."
>
> But the attributes in RDFa are not prefixed -- @about, @resource,
> @datatype and @content are new attributes, whilst @rel, @rev, @href
> and @src already exist -- so I don't see in what way the names were
> 'chosen' in a way that was influenced by XHTML.

Thank you for not prefixing the attribute names. However, you did
make the attribute values sensitive to the namespace mapping context.
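
To make that sensitivity concrete, a simplified sketch in Python follows.
expand_curie() and the two prefix maps are hypothetical, and the function
ignores the other cases a real RDFa processor has to handle (safe CURIEs,
reserved @rel values and so on).

```python
def expand_curie(curie: str, prefix_map: dict) -> str:
    """Expand prefix:reference using the in-scope prefix mappings."""
    prefix, colon, reference = curie.partition(":")
    if not colon:
        raise ValueError("not a prefixed CURIE: %r" % curie)
    if prefix not in prefix_map:
        raise ValueError("undeclared prefix: %r" % prefix)
    return prefix_map[prefix] + reference


# The same attribute value under two hypothetical mapping contexts.
CONTEXT_A = {"dc": "http://purl.org/dc/elements/1.1/"}
CONTEXT_B = {"dc": "http://example.org/other#"}

title_in_a = expand_curie("dc:title", CONTEXT_A)
title_in_b = expand_curie("dc:title", CONTEXT_B)
```

The attribute value is byte-for-byte the same in both cases, yet it names
two different predicates. The consumer therefore cannot interpret the
value without the namespace mapping context, however that context is
exposed by the parser.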

>> A list of the parsers alluded to above would be helpful as an  
>> existence
>> proof for the above assertion.
>
> I think you have this the wrong way round.
>
> The parsing algorithm for RDFa refers to attributes and elements,
> navigated by recursively traversing the hierarchy. It's therefore
> applicable to anything that has such a hierarchical structure, and
> that allows attribute values to be retrieved. Both HTML and XHTML DOMs
> fit this description.

But do they fit the description with the exact same above-parser code?  
(See my point #2 above.)

> So I'd like to see a proof that shows that this simple architecture
> makes it impossible to create an RDFa parser on top of an HTML DOM.
> Henri has not provided a proof of anything other than that an HTML DOM
> doesn't support namespaces, yet for some reason this 'non-proof' gets
> circulated as fact.

It is not circulated as proof that you can't implement an RDFa parser
on top of an HTML DOM. It is circulated as proof that you can't
implement an RDFa parser that a) works without conditional branches on
HTMLness/XMLness, b) doesn't violate Namespace-wise correct coding
practices and c) works on *both* HTML and XML parser output.

>> Your recent statement that "I can assure you that the parsing rules  
>> were
>> very explicitly written in such a way that the only thing they  
>> require to do
>> their work is a hierarchy of nodes, and the ability to obtain the  
>> value of
>> an attribute.", while technically true, tends to obscure more than  
>> reveal
>> when it comes to these differences.
>
> Again...what differences? I'm still confused as to what it is that
> we're being different to.
>
> Just in case what you are getting at is that there is somehow a
> difference between parsing RDFa in XHTML and parsing RDFa in HTML, I
> can only say again that there isn't -- there is only one parsing
> algorithm in RDFa.

See my points 9 through 12 above.

Do the existing RDFa parsers run different code (i.e. taking different  
branches) above the HTML and XML parsers?

Obviously, you can make an RDFa parser for text/html if the API the  
parser exposes violates the Infoset or differs from browser behavior  
and you run different code for expanding CURIEs in the text/html and  
application/xhtml+xml cases or you run Namespace-wise bogus code for  
the XML case.

>> Actually, I say differences.  I only have an existence proof for one
>> difference at the moment.  Is there more?  Beats me.  Hence my  
>> assertion
>> that a definitive list would be helpful.
>
> As I said, the "existence proof" of which you speak (Henri's one),
> proves only that namespace properties do not exist in an HTML DOM,
> whilst they do in an XHTML DOM.
>
> That's very different from being an "existence proof" that there are
> two (or more) algorithms for parsing RDFa in a DOM, since RDFa does
> not require namespaces per se.

Again, points 9 through 12 above.

> The only reason I entered this debate was to clarify the single point
> that you made, propagating Henri's false claim -- that since the HTML
> DOM does not provide namespace information, it is therefore not
> possible (or 'more difficult') to create an RDFa parser.

If you violate point #2, you make things more difficult. By how much?  
See point #8.

This problem can be addressed by using absolute URIs instead of CURIEs  
and phasing out CURIEs by declaring xmlns:http="http:" on the XML side  
during the transition. (If that makes the predicates annoyingly long,  
what you have is a fundamental problem with the idea of using URIs as  
identifiers as opposed to using them for application-level addressing  
on the Internet. In that case, you should address that problem  
directly on the level of the RDF model instead of trying to push the  
annoyance around syntactically.)
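
A sketch of why the xmlns:http="http:" declaration would work as a
transition aid, again in Python. legacy_expand() stands for a
hypothetical existing CURIE-expanding consumer and new_read() for a
hypothetical consumer that takes the value as an absolute URI; neither
is taken from a real RDFa implementation.

```python
def legacy_expand(value: str, prefix_map: dict) -> str:
    """Existing-style consumer: split at the first colon and expand."""
    prefix, colon, reference = value.partition(":")
    if not colon or prefix not in prefix_map:
        raise ValueError("cannot expand: %r" % value)
    return prefix_map[prefix] + reference


def new_read(value: str) -> str:
    """New-style consumer: the value already is the absolute URI."""
    return value


# Mapping context produced by declaring xmlns:http="http:" on the XML side.
TRANSITION_CONTEXT = {"http": "http:"}

PREDICATE = "http://purl.org/dc/elements/1.1/title"

via_legacy = legacy_expand(PREDICATE, TRANSITION_CONTEXT)
via_new = new_read(PREDICATE)
```

The legacy consumer sees the prefix "http" and the reference
"//purl.org/dc/elements/1.1/title", and concatenating the mapping "http:"
with that reference gives back the original string. Both consumers thus
arrive at the same URI, and the new one never consults the namespace
mapping context.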

If you wish to get new features added to HTML5 and the proposed syntax  
depends on element or attribute names that contain the colon  
(xmlns:foo in this case), you are just asking for trouble because the  
colon is special in XML but not in text/html (and if you ask for it to be
made special in text/html, too, you are asking for more than just adding
a few attributes).

-- 
Henri Sivonen
hsivonen@iki.fi
http://hsivonen.iki.fi/
Received on Monday, 16 February 2009 07:55:03 UTC