- From: Mark Birbeck <mark.birbeck@webbackplane.com>
- Date: Mon, 21 Sep 2009 16:30:32 +0100
- To: James Graham <jgraham@opera.com>
- Cc: Henri Sivonen <hsivonen@iki.fi>, Manu Sporny <msporny@digitalbazaar.com>, HTML WG <public-html@w3.org>, RDFa Developers <public-rdf-in-xhtml-tf@w3.org>
Hi James,

I feel bad pointing this out after all of your hard work...but thanks
to Elias Torres there has been a Python RDFa parser since early 2006:

  <http://rdfa.info/2006/06/04/a-python-rdfa-parser/>

Even if there wasn't such proof that it's possible, it wouldn't change
anything. This is because the issue being discussed was not:

  Can an RDFa parser process xmlns-based attributes when running on
  top of an XML stack?

The answer to that is self-evidently 'yes'.

The question raised by the WHATWG was:

  Can an RDFa parser process xmlns-based attributes in a browser
  running in HTML mode?

And that is the issue I was addressing in the comments of mine that
you quoted. And once again, the answer is 'yes'.

Regards,

Mark

On Mon, Sep 21, 2009 at 9:58 AM, James Graham <jgraham@opera.com> wrote:
> I wasn't sure whether to send this but it seems that it may be apropos to
> the conversation that Henri is having so...
>
> Please ignore it with my apologies if it doesn't add anything of value; I
> haven't been following all the discussion.
>
> Mark Birbeck wrote:
>> Hi James,
>>
>> I think you're really going to have to be more specific. You say things
>> like "one will soon run into the following problem", but you don't say
>> what the problem is. You say "clearly the tree[s] produced...will
>> require different processing", but you don't say why they require
>> different processing.
>>
>> Here's some typical code:
>>
>> function getMappingsFromElement(element, mappingList) {
>>   var attributes = element.attributes, attrName, i;
>>
>>   if (attributes) {
>>     for (i = 0; i < attributes.length; i++) {
>>       attrName = attributes[i].nodeName;
>>
>>       if (attrName.substring(0, 5) === "xmlns") {
>>         if (attrName.length === 5) {
>>           mappingList.add("", attributes[i].nodeValue);
>>         } else if (attrName.substring(5, 6) === ':') {
>>           mappingList.add(attrName.substring(6),
>>                           attributes[i].nodeValue);
>>         }
>>       }
>>     }
>>   }
>> }
>
> Here is that code rewritten in python, assuming that you are using lxml
> as your tree library:
>
> def get_mappings(element, map_list):
>     for name, value in element.attrib.items():
>         if name.startswith("xmlns"):
>             if name == "xmlns":
>                 map_list.append(("", value))
>             elif name[5] == ":":
>                 map_list.append((name[6:], value))
>
> Now let's try the code with my example from before.
>
> First the xml case:
>
> from lxml import etree
> map_list = []
> element = etree.fromstring("<html xmlns='http://www.w3.org/1999/xhtml' xmlns:foo='http://foo.example'></html>")
> get_mappings(element, map_list)
> print(map_list)
>
> >>> []
>
> now the html5lib case
>
> import html5lib
> map_list = []
> tree = html5lib.parse("<html xmlns='http://www.w3.org/1999/xhtml' xmlns:foo='http://foo.example'></html>", treebuilder="lxml")
> element = tree.getroot()
> get_mappings(element, map_list)
> print(map_list)
>
> >>> [('', 'http://www.w3.org/1999/xhtml')]
>
> So in the first case I didn't get any namespace mappings at all and in
> the second I only got the one bound to the html namespace. The reason
> for this is that lxml is namespace aware. Since Namespaces in XML 1.0
> makes things that start xmlns: and xmlns= special compared to other
> attributes, they are handled differently by the api; they do not
> appear in the ordinary list of attributes, at least when the tree is
> created using an XML parser.
> When the tree is created using html5lib,
> the xmlns attribute does appear in the tree; an attribute called xmlns
> doesn't have any special meaning in html and lxml doesn't prevent you
> from creating such an attribute (although this is arguably a bug since
> it is possible to make it serialize an element with _two_ attributes
> named xmlns which will throw a well-formedness error when parsing).
>
> The xmlns:foo attribute doesn't appear in the attributes collection in
> either case because lxml enforces the XML Namespaces 1.0 restriction
> forbidding colons in the local part of attribute names. Because this
> restriction doesn't apply in HTML, html5lib is forced to use the rules
> in section 9.2.7 of HTML5 [1] which cause the ":" to be replaced with
> "U00003A" so that the attribute xmlns:foo is written xmlnsU00003Afoo
> which, conveniently, cannot be created by an HTML parser other than
> through this substitution mechanism (because it contains uppercase
> ascii characters). On the other hand, since it is, by construction,
> a valid XML+XML Namespaces local name, it could be created by an XML
> parser.
>
> So, to summarise the situation, if we want to implement an RDFa parser
> using lxml we cannot simply look for attributes named xmlns or with
> names starting "xmlns:". Instead we must:
>
> * For xml documents use the .nsmap property of elements that stores
>   the prefix:uri map in scope for a particular element
> * For html documents look for attributes named xmlns or attributes
>   with names starting xmlnsU00003A
>
> Since the technique applied to HTML documents may produce false prefix
> bindings when applied to XML documents one must know in advance what
> type of document one is working with.
>
>> But in any case, the spec simply says "get the values of xmlns-based
>> attributes and crack them open, and we'll keep track of scoping
>> ourselves".
>
> As you can see that approach only works if the tree API you are
> working with happens to have a particular design in which namespace
> declaration attributes are visible in the same way as ordinary
> attributes. It happens that DOM is designed in this way, presumably
> because it is the result of a namespace aware API grafted atop a
> namespace unaware API. However it is not, in general, a good assumption
> for tree APIs that postdate Namespaces in XML 1.0.
>
> It is also worth noting that a consumer trying to use lxml is rather
> *better* off than one trying to use the plain ElementTree API that
> ships in the python standard library (ElementTree pioneered the API
> that lxml uses). As far as I can tell, ElementTree does not provide any
> built-in way to get at prefix mappings, so one is forced to hook in to
> the parser event stream and record the prefixes going in and out of
> scope manually [2].
>
> [1] http://www.whatwg.org/specs/web-apps/current-work/#coercing-an-html-dom-into-an-infoset
> [2] http://effbot.org/zone/element-namespaces.htm
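
For reference, the event-stream approach mentioned for plain ElementTree
can be sketched roughly as below, using only the Python standard library.
This is an illustration under stated assumptions, not code from either
message: the function name collect_prefix_mappings and the (tag, mapping)
return shape are hypothetical choices made for the example.

```python
import io
import xml.etree.ElementTree as ET


def collect_prefix_mappings(xml_text):
    """Return [(tag, {prefix: uri}), ...] for each element, in document order.

    Hypothetical helper: it tracks namespace declarations by listening to
    the parser's "start-ns" and "end-ns" events, because ElementTree does
    not expose the declarations on the element objects themselves.
    """
    in_scope = []  # stack of (prefix, uri) declarations currently in scope
    result = []
    source = io.BytesIO(xml_text.encode("utf-8"))
    events = ("start", "start-ns", "end-ns")
    for event, payload in ET.iterparse(source, events=events):
        if event == "start-ns":
            # payload is a (prefix, uri) tuple; the default namespace
            # is reported with an empty-string prefix
            in_scope.append(payload)
        elif event == "end-ns":
            # the most recently opened declaration goes out of scope
            in_scope.pop()
        else:
            # "start": payload is the element; later (inner) declarations
            # override earlier ones when the stack is turned into a dict
            result.append((payload.tag, dict(in_scope)))
    return result


if __name__ == "__main__":
    doc = ("<html xmlns='http://www.w3.org/1999/xhtml' "
           "xmlns:foo='http://foo.example'><body/></html>")
    for tag, mapping in collect_prefix_mappings(doc):
        print(tag, mapping)
```

Run on the example document from the quoted message, this reports both the
default binding and the foo binding for the html element and for its
children, which is the information the lxml attribute loop did not return
in the XML case.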
Received on Monday, 21 September 2009 15:31:35 UTC