Re: FPWD Review Request: HTML+RDFa

I wasn't sure whether to send this, but it seems that it may be apropos 
to the conversation that Henri is having, so...

Please ignore it with my apologies if it doesn't add anything of value; 
I haven't been following all the discussion.

Mark Birbeck wrote:
 > Hi James,
 >
 > I think you're really going have to be more specific. You say things
 > like "one will soon run into the following problem", but you don't say
 > what the problem is. You say "clearly the tree[s] produced...will
 > require different processing", but you don't say why they require
 > different processing.
 >
 > Here's some typical code:
 >
 >   function getMappingsFromElement(element, mappingList) {
 >     var attributes = element.attributes, attrName, i;
 >
 >     if (attributes) {
 >       for (i = 0; i < attributes.length; i++) {
 >         attrName = attributes[i].nodeName;
 >
 >         if (attrName.substring(0, 5) === "xmlns") {
 >           if (attrName.length === 5) {
 >             mappingList.add("", attributes[i].nodeValue);
 >           } else if (attrName.substring(5, 6) === ':') {
 >           mappingList.add(attrName.substring(6),
 >                           attributes[i].nodeValue);
 >           }
 >         }
 >       }
 >     }

Here is that code rewritten in Python, assuming you are using lxml
as your tree library:

def get_mappings(element, map_list):
    # Collect (prefix, uri) pairs from xmlns and xmlns:* attributes,
    # mirroring the JavaScript above.
    for name, value in element.attrib.iteritems():
        if name.startswith("xmlns"):
            if name == "xmlns":
                map_list.append(("", value))
            elif name[5] == ":":
                map_list.append((name[6:], value))

Now let's try the code with my example from before.

First, the XML case:

from lxml import etree
map_list = []
element = etree.fromstring("<html xmlns='http://www.w3.org/1999/xhtml' "
                           "xmlns:foo='http://foo.example'></html>")
get_mappings(element, map_list)
print map_list

  >>> []

Now the html5lib case:

import html5lib
map_list = []
tree = html5lib.parse("<html xmlns='http://www.w3.org/1999/xhtml' "
                      "xmlns:foo='http://foo.example'></html>",
                      treebuilder="lxml")
element = tree.getroot()
get_mappings(element, map_list)
print map_list

  >>> [('', 'http://www.w3.org/1999/xhtml')]

So in the first case I didn't get any namespace mappings at all, and in
the second I only got the one bound to the HTML namespace. The reason
for this is that lxml is namespace aware. Since Namespaces in XML 1.0
makes attributes whose names start with xmlns: (and the attribute named
xmlns itself) special compared to other attributes, they are handled
differently by the API; they do not appear in the ordinary list of
attributes, at least when the tree is created using an XML parser. When
the tree is created using html5lib, the xmlns attribute does appear in
the tree; an attribute called xmlns doesn't have any special meaning in
HTML, and lxml doesn't prevent you from creating such an attribute
(although this is arguably a bug, since it makes it possible to
serialize an element with _two_ attributes named xmlns, which will
cause a well-formedness error when the output is reparsed).
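
For what it's worth, the XML-parsed tree does expose the declarations,
just not as attributes: they surface through the .nsmap property that I
use below. A minimal check, on the same document as above:

from lxml import etree

element = etree.fromstring("<html xmlns='http://www.w3.org/1999/xhtml' "
                           "xmlns:foo='http://foo.example'></html>")
print dict(element.attrib)   # {} - the declarations are not here
print element.nsmap          # {None: 'http://www.w3.org/1999/xhtml',
                             #  'foo': 'http://foo.example'}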

The xmlns:foo attribute doesn't appear in the attributes collection in
either case, because lxml enforces the Namespaces in XML 1.0
restriction forbidding colons in the local part of attribute names.
Because this restriction doesn't apply in HTML, html5lib is forced to
use the rules in section 9.2.7 of HTML5 [1], which cause the ":" to be
replaced with "U00003A", so that the attribute xmlns:foo is written
xmlnsU00003Afoo. Conveniently, that name cannot be created by an HTML
parser other than through this substitution mechanism (because it
contains uppercase ASCII characters). On the other hand, since it is,
by construction, a valid XML + Namespaces in XML local name, it could
be created by an XML parser.
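
You can see the coerced name directly by dumping the attributes of the
html5lib-built tree. Assuming the coercion behaves as described above,
the output should look something like this:

import html5lib

tree = html5lib.parse("<html xmlns='http://www.w3.org/1999/xhtml' "
                      "xmlns:foo='http://foo.example'></html>",
                      treebuilder="lxml")
print dict(tree.getroot().attrib)
# Something like:
# {'xmlns': 'http://www.w3.org/1999/xhtml',
#  'xmlnsU00003Afoo': 'http://foo.example'}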

So, to summarise the situation, if we want to implement an RDFa parser
using lxml we cannot simply look for attributes named xmlns or with
names starting "xmlns:". Instead we must:

* For XML documents, use the .nsmap property of elements, which stores
   the prefix-to-URI mappings in scope for a particular element
* For HTML documents, look for attributes named xmlns or for attributes
   with names starting with xmlnsU00003A

Since the technique applied to HTML documents may produce false prefix
bindings if applied to XML documents, one must know in advance what
type of document one is working with.
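
Putting the two rules together, a rough sketch of a dual-mode lookup
might look like the following (get_prefix_mappings and the is_html flag
are just names for this sketch; the point is precisely that the caller
has to know which kind of parse produced the tree):

def get_prefix_mappings(element, is_html):
    if not is_html:
        # XML case: lxml reports the in-scope declarations via .nsmap;
        # the default namespace is keyed by None.
        return [(prefix or "", uri)
                for prefix, uri in element.nsmap.items()]
    # HTML case: the declarations survive as ordinary attributes, with
    # ":" coerced to "U00003A" in the attribute name.
    mappings = []
    for name, value in element.attrib.iteritems():
        if name == "xmlns":
            mappings.append(("", value))
        elif name.startswith("xmlnsU00003A"):
            mappings.append((name[len("xmlnsU00003A"):], value))
    return mappings

Note that the two branches aren't even quite equivalent: .nsmap reports
everything in scope for the element, while the attribute scan only sees
declarations made on the element itself, so a real implementation still
has to track scoping itself in the HTML branch.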

 > But in any case, the spec simply says "get the values of xmlns-based
 > attributes and crack them open, and we'll keep track of scoping
 > ourselves".

As you can see, that approach only works if the tree API you are
working with happens to have a particular design in which namespace
declaration attributes are visible in the same way as ordinary
attributes. It happens that DOM is designed in this way, presumably
because it is the result of a namespace-aware API grafted atop a
namespace-unaware API. However, it is not, in general, a good
assumption for tree APIs that postdate Namespaces in XML 1.0.
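
For comparison, here is the behaviour the spec text seems to assume,
using xml.dom.minidom (which follows the DOM design described above;
I'm assuming its default parse exposes the declaration attributes, as
DOM Level 2 requires):

from xml.dom import minidom

doc = minidom.parseString("<html xmlns='http://www.w3.org/1999/xhtml' "
                          "xmlns:foo='http://foo.example'></html>")
attrs = doc.documentElement.attributes
print [attrs.item(i).name for i in range(attrs.length)]
# Something like: ['xmlns', 'xmlns:foo'] - the declarations are visible
# in exactly the same way as ordinary attributes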

It is also worth noting that a consumer trying to use lxml is rather
*better* off than one trying to use the plain ElementTree API that
ships in the Python standard library (ElementTree pioneered the API
that lxml uses). As far as I can tell, ElementTree does not provide any
built-in way to get at prefix mappings, so one is forced to hook into
the parser event stream and record the prefixes going in and out of
scope manually [2].
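
Along the lines of the approach described at [2], hooking into
iterparse's "start-ns" events looks roughly like this:

from StringIO import StringIO
from xml.etree import ElementTree

source = StringIO("<html xmlns='http://www.w3.org/1999/xhtml' "
                  "xmlns:foo='http://foo.example'></html>")
mappings = []
# ElementTree only reports declarations through parser events, so we
# record each (prefix, uri) pair as it comes into scope.
for event, pair in ElementTree.iterparse(source, events=("start-ns",)):
    mappings.append(pair)
print mappings
# [('', 'http://www.w3.org/1999/xhtml'), ('foo', 'http://foo.example')]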

[1] http://www.whatwg.org/specs/web-apps/current-work/#coercing-an-html-dom-into-an-infoset
[2] http://effbot.org/zone/element-namespaces.htm
