- From: Henri Sivonen <hsivonen@iki.fi>
- Date: Mon, 16 Feb 2009 09:54:18 +0200
- To: Mark Birbeck <mark.birbeck@webbackplane.com>
- Cc: Sam Ruby <rubys@intertwingly.net>, Kingsley Idehen <kidehen@openlinksw.com>, Dan Brickley <danbri@danbri.org>, Michael Bolger <michael@michaelbolger.net>, public-rdfa@w3.org, RDFa mailing list <public-rdf-in-xhtml-tf@w3.org>, Tim Berners-Lee <timbl@w3.org>, Dan Connolly <connolly@w3.org>, Ian Hickson <ian@hixie.ch>
On Feb 14, 2009, at 01:57, Mark Birbeck wrote:

> You seem to be implying that there is a fundamental impediment to
> creating an RDFa parser using the tools available in an HTML DOM. You
> base this assertion on Henri's document, but all his script shows is
> that objects in an HTML DOM don't have namespace information
> available.
>
> That's no surprise.
>
> My response is that this is irrelevant.

1) Content consumer software should work both with HTML (text/html) and XHTML (application/xhtml+xml) if it works with one of them.

2) For sane *software* architecture, code above the HTML/XML parsing layer should be able to run its dispatch code without any conditional branches on the HTMLness or XMLness of the origin of the data it is operating on. This applies to native browser code, JavaScript code running in a browser and non-browser (X)HTML consumers. (Even easy-looking tiny variations add up.)

3) The point above is not about abstract XML architecture. It is an actual way of implementing software including (but not limited to) Gecko, WebKit, Presto (as far as can be guessed without seeing the code) and Validator.nu. Furthermore, the dominant design (http://en.wikipedia.org/wiki/Dominant_Design) of HTML5 parsers for non-browser applications is that they expose an XML API so that the application-level code is written as if working with an XML parser parsing an equivalent XHTML5 file.

4) The qname is an artifact of the Namespaces in XML layer in XML and should not be significant to the application. The correct way to do namespace-wise correct dispatch is to dispatch on the [namespace,local] pair. If you are inspecting the qname of an attribute or element for any reason other than round-tripping serialization, you are Doing it Wrong.

5) Given the points above, you should also do dispatch on the [namespace,local] pair on the HTML side.
6) All features going into HTML5 should be robust and sane under scripting even if the people proposing the feature were interested in read-only use cases outside browsers. This includes keeping script-generated DOMs serializable.

7) If, in order to satisfy point #2 above, your feature requires using getAttribute (without NS) on getting but setAttributeNS (with NS) on setting (to keep the XML DOM serializable!), your feature isn't satisfying point #6.

8) So far, experience shows that even violations of the above points that look small--such as lang vs. xml:lang--are more hurtful than people imagine at first. Examples:

a) Browsers need to inspect two attributes instead of one to discover the language.

b) To abstract problem a) away in non-browser applications in a high-performance manner (in terms of CPU instructions executed per application-made query for an attribute), the static RAM footprint of the Validator.nu HTML Parser is bloated by pointer size times 2328!

c) The lang & xml:lang part of the HTML5 spec has had the highest incidence of validator bugs per spec sentence. (Bugs are bad and costly.)

Hence, all violations of the above points should be taken very seriously even if, in isolation and on their face, the violations seem ridiculously small to be indignant about. Violations for xml:lang legacy are somewhat excusable. Introducing new violations isn't.

9) If you are defining something in terms of the namespace mapping context, but you can't use DOM Level 3 lookupPrefix() to implement it (without violating point #2), you are Doing it Wrong.

10) Browsers aren't the only kind of Web content consumer software. What you are specifying should work with XML API environments other than the browser flavor of DOM.

11) SAX2--arguably the most correct and complete XML API there is--when run in the Namespace-aware mode (i.e. the correct mode considering contemporary XML architecture) doesn't expose the namespace declarations as attributes.
Therefore, a SAX2-based RDFa-in-XHTML consumer needs to use the non-attribute abstraction (startPrefixMapping()) for gathering the namespace mapping context. However, the same application-level code (see point #2) wouldn't work with an HTML5 parser that implements mapping from text/html to SAX2 as defined today in the HTML 5 draft and as sufficient for all the HTML5 features drafted so far.

12) XOM--arguably the most correct of the well-known XML tree APIs for Java--doesn't expose the namespace declarations as attributes. Therefore, a XOM-based RDFa-in-XHTML consumer needs to use the non-attribute abstraction for using the namespace mapping context. However, the same application-level code (see point #2) wouldn't work with an HTML5 parser that implements mapping from text/html to XOM as defined today in the HTML 5 draft and as sufficient for all the HTML5 features drafted so far. (XOM even disallows including attributes named xmlns:foo in the tree.)

13) If points 9 through 12 were addressed by changing HTML5 parsers to expose attributes called xmlns:foo as namespace mapping context, the change to HTML5 to enable RDFa would be notably more complex than just adding a few attributes.

> An RDFa parser needs to be able to 'spot' whether an attribute name
> begins 'xmlns:', but for that we don't need namespace support -- it's
> just string matching, no different to detecting an attribute like
> @data-length [1].

getAttributeNS(null, "data-length") works consistently in text/html and application/xhtml+xml.

>> And I wrote that "HTML parsing rules differ in visible ways from XHTML.
>> Ways that affect the specific names of attributes chose[sic] in RDFa."
>
> But the attributes in RDFa are not prefixed -- @about, @resource,
> @datatype and @content are new attributes, whilst @rel, @rev, @href
> and @src already exist -- so I don't see in what way the names were
> 'chosen' in a way that was influenced by XHTML.

Thank you for not prefixing the attribute names.
However, you did make the attribute values sensitive to the namespace mapping context.

>> A list of the parsers alluded to above would be helpful as an existence
>> proof for the above assertion.
>
> I think you have this the wrong way round.
>
> The parsing algorithm for RDFa refers to attributes and elements,
> navigated by recursively traversing the hierarchy. It's therefore
> applicable to anything that has such a hierarchical structure, and
> that allows attribute values to be retrieved. Both HTML and XHTML DOMs
> fit this description.

But do they fit the description with the exact same above-parser code? (See my point #2 above.)

> So I'd like to see a proof that shows that this simple architecture
> makes it impossible to create an RDFa parser on top of an HTML DOM.
> Henri has not provided a proof of anything other than that an HTML DOM
> doesn't support namespaces, yet for some reason this 'non-proof' gets
> circulated as fact.

It is not circulated as proof that you can't implement an RDFa parser on top of an HTML DOM. It is circulated as proof that you can't implement an RDFa parser that a) works without conditional branches on HTMLness/XMLness and b) without violating Namespace-wise correct coding practices on c) *both* HTML and XML parser output.

>> Your recent statement that "I can assure you that the parsing rules were
>> very explicitly written in such a way that the only thing they require to do
>> their work is a hierarchy of nodes, and the ability to obtain the value of
>> an attribute.", while technically true, tends to obscure more than reveal
>> when it comes to these differences.
>
> Again...what differences? I'm still confused as to what it is that
> we're being different to.
>
> Just in case what you are getting at is that there is somehow a
> difference between parsing RDFa in XHTML and parsing RDFa in HTML, I
> can only say again that there isn't -- there is only one parsing
> algorithm in RDFa.
See my points 9 through 12 above. Do the existing RDFa parsers run different code (i.e. taking different branches) above the HTML and XML parsers?

Obviously, you can make an RDFa parser for text/html if the API the parser exposes violates the Infoset or differs from browser behavior and you run different code for expanding CURIEs in the text/html and application/xhtml+xml cases or you run Namespace-wise bogus code for the XML case.

>> Actually, I say differences. I only have an existence proof for one
>> difference at the moment. Is there more? Beats me. Hence my assertion
>> that a definitive list would be helpful.
>
> As I said, the "existence proof" of which you speak (Henri's one),
> proves only that namespace properties do not exist in an HTML DOM,
> whilst they do in an XHTML DOM.
>
> That's very different from being an "existence proof" that there are
> two (or more) algorithms for parsing RDFa in a DOM, since RDFa does
> not require namespaces per se.

Again, points 9 through 12 above.

> The only reason I entered this debate was to clarify the single point
> that you made, propagating Henri's false claim -- that since the HTML
> DOM does not provide namespace information, it is therefore not
> possible (or 'more difficult') to create an RDFa parser.

If you violate point #2, you make things more difficult. By how much? See point #8.

This problem can be addressed by using absolute URIs instead of CURIEs and phasing out CURIEs by declaring xmlns:http="http:" on the XML side during the transition. (If that makes the predicates annoyingly long, what you have is a fundamental problem with the idea of using URIs as identifiers as opposed to using them for application-level addressing on the Internet. In that case, you should address that problem directly on the level of the RDF model instead of trying to push the annoyance around syntactically.)
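
To make point 11 and the xmlns:http="http:" suggestion concrete, here is a minimal sketch in Python. It assumes that the standard library xml.sax reader in namespace-aware mode is an acceptable stand-in for a SAX2 parser. The names PropertyCollector, collect() and SAMPLE are made up for illustration and do not come from any RDFa implementation, and the sketch only looks at the property attribute rather than implementing RDFa processing.

```python
import io
import xml.sax
from xml.sax.handler import ContentHandler, feature_namespaces

# Hypothetical sample document: one CURIE value, one absolute URI value.
SAMPLE = b"""<html xmlns="http://www.w3.org/1999/xhtml"
      xmlns:dc="http://purl.org/dc/terms/"
      xmlns:http="http:">
  <body>
    <p property="dc:title">A</p>
    <p property="http://purl.org/dc/terms/creator">B</p>
  </body>
</html>"""


class PropertyCollector(ContentHandler):
    """Illustrative consumer: expands 'property' values using the
    namespace mapping context reported through startPrefixMapping()."""

    def __init__(self):
        super().__init__()
        self.in_scope = {}            # prefix -> stack of in-scope URIs
        self.declared = {}            # every (prefix -> uri) declaration seen
        self.attribute_names = set()  # (namespace, local) pairs seen as attributes
        self.expanded = []            # expanded 'property' values

    def startPrefixMapping(self, prefix, uri):
        # The mapping context arrives here, not through the attribute list.
        self.in_scope.setdefault(prefix, []).append(uri)
        self.declared[prefix] = uri

    def endPrefixMapping(self, prefix):
        self.in_scope[prefix].pop()

    def startElementNS(self, name, qname, attrs):
        # Dispatch on the (namespace, local) pair, never on the qname.
        self.attribute_names.update(attrs.getNames())
        value = attrs.get((None, "property"))
        if value is not None:
            self.expanded.append(self.expand(value))

    def expand(self, curie):
        prefix, sep, reference = curie.partition(":")
        stack = self.in_scope.get(prefix)
        if not sep or not stack:
            return None  # no in-scope mapping for this prefix
        return stack[-1] + reference


def collect(data):
    """Parse bytes in namespace-aware mode and return the filled handler."""
    handler = PropertyCollector()
    parser = xml.sax.make_parser()
    parser.setFeature(feature_namespaces, True)
    parser.setContentHandler(handler)
    parser.parse(io.BytesIO(data))
    return handler
```

In this sketch the handler never sees an attribute named xmlns:dc or xmlns:http; the only attribute name it records is (None, "property"). With the http prefix bound to "http:", the absolute URI in the second paragraph expands to itself, so the same expand() code path serves both the CURIE and the absolute URI. The code still depends on the parser reporting prefix mappings at all, which is exactly what a text/html to SAX2 mapping as currently drafted would not do.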
If you wish to get new features added to HTML5 and the proposed syntax depends on element or attribute names that contain the colon (xmlns:foo in this case), you are just asking for trouble because the colon is special in XML but not in text/html (and if you ask for it to be made special in text/html, too, you are asking for more than just adding a few attributes).

--
Henri Sivonen
hsivonen@iki.fi
http://hsivonen.iki.fi/
Received on Monday, 16 February 2009 07:55:03 UTC