- From: James Graham <jgraham@opera.com>
- Date: Mon, 21 Sep 2009 10:58:00 +0200
- To: Mark Birbeck <mark.birbeck@webbackplane.com>
- CC: Henri Sivonen <hsivonen@iki.fi>, Manu Sporny <msporny@digitalbazaar.com>, HTML WG <public-html@w3.org>, RDFa Developers <public-rdf-in-xhtml-tf@w3.org>
I wasn't sure whether to send this, but it seems that it may be apropos to the conversation that Henri is having, so... Please ignore it, with my apologies, if it doesn't add anything of value; I haven't been following all the discussion.

Mark Birbeck wrote:
> Hi James,
>
> I think you're really going to have to be more specific. You say things
> like "one will soon run into the following problem", but you don't say
> what the problem is. You say "clearly the tree[s] produced...will
> require different processing", but you don't say why they require
> different processing.
>
> Here's some typical code:
>
>   function getMappingsFromElement(element, mappingList) {
>     var attributes = element.attributes, attrName, i;
>
>     if (attributes) {
>       for (i = 0; i < attributes.length; i++) {
>         attrName = attributes[i].nodeName;
>
>         if (attrName.substring(0, 5) === "xmlns") {
>           if (attrName.length === 5) {
>             mappingList.add("", attributes[i].nodeValue);
>           } else if (attrName.substring(5, 6) === ':') {
>             mappingList.add(attrName.substring(6), attributes[i].nodeValue);
>           }
>         }
>       }
>     }

Here is that code rewritten in Python, assuming that you are using lxml as your tree library:

    def get_mappings(element, map_list):
        for name, value in element.attrib.items():
            if name.startswith("xmlns"):
                if name == "xmlns":
                    # Default namespace declaration: empty prefix
                    map_list.append(("", value))
                elif name[5] == ":":
                    # Store only the prefix, as substring(6) does above
                    map_list.append((name[6:], value))

Now let's try the code with my example from before.
First the XML case:

    from lxml import etree
    map_list = []
    element = etree.fromstring("<html xmlns='http://www.w3.org/1999/xhtml' xmlns:foo='http://foo.example'></html>")
    get_mappings(element, map_list)
    print(map_list)
    >>> []

Now the html5lib case:

    import html5lib
    map_list = []
    tree = html5lib.parse("<html xmlns='http://www.w3.org/1999/xhtml' xmlns:foo='http://foo.example'></html>", treebuilder="lxml")
    element = tree.getroot()
    get_mappings(element, map_list)
    print(map_list)
    >>> [('', 'http://www.w3.org/1999/xhtml')]

So in the first case I didn't get any namespace mappings at all, and in the second I only got the one bound to the HTML namespace.

The reason for this is that lxml is namespace aware. Since Namespaces in XML 1.0 makes attributes written xmlns:... and xmlns=... special compared to other attributes, they are handled differently by the API; they do not appear in the ordinary list of attributes, at least when the tree is created using an XML parser.

When the tree is created using html5lib, the xmlns attribute does appear in the tree; an attribute called xmlns doesn't have any special meaning in HTML, and lxml doesn't prevent you from creating such an attribute (although this is arguably a bug, since it makes it possible to serialize an element with _two_ attributes named xmlns, which will cause a well-formedness error when the output is parsed).

The xmlns:foo attribute doesn't appear in the attributes collection in either case, because lxml enforces the XML Namespaces 1.0 restriction forbidding colons in the local part of attribute names. Because this restriction doesn't apply in HTML, html5lib is forced to use the rules in section 9.2.7 of HTML5 [1], which cause the ":" to be replaced with "U00003A", so that the attribute xmlns:foo is written xmlnsU00003Afoo. Conveniently, that name cannot be created by an HTML parser other than through this substitution mechanism (because it contains uppercase ASCII characters).
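
The substitution described above can be sketched roughly as follows. This is only an illustration of the "U" plus six uppercase hex digits rule, not html5lib's actual code: the function name coerce_local_name and the deliberately simplified, ASCII-only set of supported characters are assumptions made for the example.

```python
import re

# Assumption: a simplified, ASCII-only idea of which characters the tree API
# accepts in a local name. Real XML names allow many more characters.
_UNSUPPORTED = re.compile(r"[^A-Za-z0-9_.\-]")


def coerce_local_name(name):
    """Replace each unsupported character with 'U' plus six uppercase hex digits."""
    return _UNSUPPORTED.sub(lambda m: "U{:06X}".format(ord(m.group())), name)


print(coerce_local_name("xmlns:foo"))  # xmlnsU00003Afoo
print(coerce_local_name("xmlns"))      # xmlns (nothing to replace)
```
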
On the other hand, since xmlnsU00003Afoo is, by construction, a valid XML + XML Namespaces local name, it could be created by an XML parser.

So, to summarise the situation, if we want to implement an RDFa parser using lxml we cannot simply look for attributes named xmlns or with names starting "xmlns:". Instead we must:

* For XML documents, use the .nsmap property of elements, which stores the prefix:uri map in scope for a particular element
* For HTML documents, look for attributes named xmlns or attributes with names starting xmlnsU00003A

Since the technique applied to HTML documents may produce false prefix bindings when applied to XML documents, one must know in advance what type of document one is working with.

> But in any case, the spec simply says "get the values of xmlns-based
> attributes and crack them open, and we'll keep track of scoping
> ourselves".

As you can see, that approach only works if the tree API you are working with happens to have a particular design in which namespace declaration attributes are visible in the same way as ordinary attributes. It happens that DOM is designed in this way, presumably because it is the result of a namespace-aware API grafted atop a namespace-unaware API. However, it is not, in general, a good assumption for tree APIs that postdate Namespaces in XML 1.0.

It is also worth noting that a consumer trying to use lxml is rather *better* off than one trying to use the plain ElementTree API that ships in the Python standard library (ElementTree pioneered the API that lxml uses). As far as I can tell, ElementTree does not provide any built-in way to get at prefix mappings, so one is forced to hook into the parser event stream and record the prefixes going in and out of scope manually [2].

[1] http://www.whatwg.org/specs/web-apps/current-work/#coercing-an-html-dom-into-an-infoset
[2] http://effbot.org/zone/element-namespaces.htm
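
For the plain ElementTree case mentioned above, recording prefixes from the parser event stream might look roughly like this. It is a minimal sketch using the standard library's iterparse with its "start-ns" and "end-ns" events; the function name prefix_maps and the shape of the returned data are illustrative choices, not something ElementTree itself provides.

```python
import io
import xml.etree.ElementTree as ET


def prefix_maps(xml_bytes):
    """Return a list of (tag, {prefix: uri}) pairs, one per element, in document order."""
    in_scope = []  # stack of (prefix, uri) declarations currently in scope
    result = []
    events = ("start", "start-ns", "end-ns")
    for event, payload in ET.iterparse(io.BytesIO(xml_bytes), events):
        if event == "start-ns":
            # payload is a (prefix, uri) tuple; the default namespace has prefix ""
            in_scope.append(payload)
        elif event == "end-ns":
            # a declaration went out of scope
            in_scope.pop()
        else:
            # "start": payload is the element; later declarations shadow earlier ones
            result.append((payload.tag, dict(in_scope)))
    return result


doc = b"<html xmlns='http://www.w3.org/1999/xhtml' xmlns:foo='http://foo.example'><body/></html>"
for tag, mapping in prefix_maps(doc):
    print(tag, mapping)
```

Unlike lxml's .nsmap, nothing here is attached to the elements themselves, so the mapping has to be carried alongside the tree by whatever code consumes it.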
Received on Monday, 21 September 2009 08:58:23 UTC