Re: FPWD Review Request: HTML+RDFa

Hi James,

I feel bad pointing this out after all of your hard work...but thanks
to Elias Torres there has been a Python RDFa parser since early 2006:

  <http://rdfa.info/2006/06/04/a-python-rdfa-parser/>

Even if there wasn't such proof that it's possible, it wouldn't change anything.

This is because the issue being discussed was not:

  Can an RDFa parser process xmlns-based attributes when running on
  top of an XML stack?

The answer to that is self-evidently 'yes'.

The question raised by the WHATWG was:

  Can an RDFa parser process xmlns-based attributes in a browser
  running in HTML mode?

And that is the issue I was addressing in the comments of mine that you quoted.

And once again, the answer is 'yes'.

Regards,

Mark


On Mon, Sep 21, 2009 at 9:58 AM, James Graham <jgraham@opera.com> wrote:
> I wasn't sure whether to send this but it seems that it may be apropos to
> the conversation that Henri is having so...
>
> Please ignore it with my apologies if it doesn't add anything of value; I
> haven't been following all the discussion.
>
> Mark Birbeck wrote:
>> Hi James,
>>
>> I think you're really going to have to be more specific. You say things
>> like "one will soon run into the following problem", but you don't say
>> what the problem is. You say "clearly the tree[s] produced...will
>> require different processing", but you don't say why they require
>> different processing.
>>
>> Here's some typical code:
>>
>>   function getMappingsFromElement(element, mappingList) {
>>     var attributes = element.attributes, attrName, i;
>>
>>     if (attributes) {
>>       for (i = 0; i < attributes.length; i++) {
>>         attrName = attributes[i].nodeName;
>>
>>         if (attrName.substring(0, 5) === "xmlns") {
>>           if (attrName.length === 5) {
>>             mappingList.add("", attributes[i].nodeValue);
>>           } else if (attrName.substring(5, 6) === ':') {
>>             mappingList.add(attrName.substring(6),
>>                             attributes[i].nodeValue);
>>           }
>>         }
>>       }
>>     }
>>   }
>
> Here is that code rewritten in Python, assuming that you are using lxml
> as your tree library:
>
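> def get_mappings(element, map_list):
>     # Collect (prefix, uri) pairs from xmlns-style attributes.
>     for name, value in element.attrib.items():
>         if name.startswith("xmlns"):
>             if name == "xmlns":
>                 map_list.append(("", value))
>             elif name[5] == ":":
>                 # Keep only the prefix, as substring(6) does above.
>                 map_list.append((name[6:], value))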
>
> Now let's try the code with my example from before.
>
> First the XML case:
>
> from lxml import etree
> map_list = []
> element = etree.fromstring("<html xmlns='http://www.w3.org/1999/xhtml' "
>                            "xmlns:foo='http://foo.example'></html>")
> get_mappings(element, map_list)
> print(map_list)
>
>  >>> []
>
> Now the html5lib case:
>
> import html5lib
> map_list = []
> tree = html5lib.parse("<html xmlns='http://www.w3.org/1999/xhtml' "
>                       "xmlns:foo='http://foo.example'></html>",
>                       treebuilder="lxml")
> element = tree.getroot()
> get_mappings(element, map_list)
> print(map_list)
>
>  >>> [('', 'http://www.w3.org/1999/xhtml')]
>
> So in the first case I didn't get any namespace mappings at all, and in
> the second I only got the one bound to the HTML namespace. The reason
> for this is that lxml is namespace-aware. Since Namespaces in XML 1.0
> makes attributes named xmlns or starting with xmlns: special compared
> to other attributes, they are handled differently by the API; they do
> not appear in the ordinary list of attributes, at least when the tree
> is created using an XML parser. When the tree is created using
> html5lib, the xmlns attribute does appear in the tree; an attribute
> called xmlns doesn't have any special meaning in HTML, and lxml doesn't
> prevent you from creating such an attribute (although this is arguably
> a bug, since it is possible to make it serialize an element with _two_
> attributes named xmlns, which will cause a well-formedness error when
> the output is parsed).
>
> The xmlns:foo attribute doesn't appear in the attributes collection in
> either case because lxml enforces the XML Namespaces 1.0 restriction
> forbidding colons in the local part of attribute names. Because this
> restriction doesn't apply in HTML, html5lib is forced to use the rules
> in section 9.2.7 of HTML5 [1], which cause the ":" to be replaced with
> "U00003A" so that the attribute xmlns:foo is written xmlnsU00003Afoo.
> Conveniently, that name cannot be created by an HTML parser other than
> through this substitution mechanism (because it contains uppercase
> ASCII characters). On the other hand, since it is, by construction,
> a valid XML+XML Namespaces local name, it could be created by an XML
> parser.
>
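
To make the substitution described above concrete, here is a minimal
Python sketch of the ":" to "U00003A" rewrite. It is not html5lib's
actual code: the function name coerce_attribute_name is hypothetical,
and the set of characters treated as safe is a deliberately narrow,
ASCII-only assumption rather than the full XML name grammar.

```python
import re

# Assumption: only ASCII letters, digits, "_", "-" and "." count as safe.
# The real XML name grammar allows many more characters and also
# restricts which characters may come first; this sketch ignores both.
_UNSAFE = re.compile(r"[^A-Za-z0-9_.-]")


def coerce_attribute_name(name):
    """Replace each unsafe character with "U" plus six uppercase hex digits."""
    return _UNSAFE.sub(lambda m: "U%06X" % ord(m.group()), name)


print(coerce_attribute_name("xmlns:foo"))  # xmlnsU00003Afoo
print(coerce_attribute_name("xmlns"))      # xmlns
```

Because the replacement always contains an uppercase "U", the coerced
name cannot collide with a name an HTML parser produced directly.
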
> So, to summarise the situation, if we want to implement an RDFa parser
> using lxml we cannot simply look for attributes named xmlns or with
> names starting "xmlns:". Instead we must:
>
> * For XML documents, use the .nsmap property of elements, which stores
>   the prefix:uri map in scope for a particular element
> * For HTML documents, look for attributes named xmlns or attributes
>   with names starting xmlnsU00003A
>
> Since the technique applied to HTML documents may produce false prefix
> bindings when applied to XML documents, one must know in advance what
> type of document one is working with.
>
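
The two rules in the summary above could be sketched roughly as
follows. This is only an illustration under stated assumptions: the
function name prefix_mappings and the is_html flag are hypothetical,
and the demo uses simple stand-in objects carrying the same .nsmap and
.attrib properties that lxml elements expose, so that it runs without
lxml or html5lib installed.

```python
from types import SimpleNamespace

# How the HTML-to-infoset coercion writes "xmlns:" in an attribute name.
COERCED_PREFIX = "xmlnsU00003A"


def prefix_mappings(element, is_html):
    """Return a {prefix: uri} dict for an lxml-style element."""
    if not is_html:
        # XML tree: lxml keeps declarations out of .attrib and exposes
        # the in-scope map as .nsmap, keying the default namespace by None.
        return {prefix or "": uri for prefix, uri in element.nsmap.items()}

    # HTML tree: the declarations are ordinary (coerced) attributes.
    mappings = {}
    for name, value in element.attrib.items():
        if name == "xmlns":
            mappings[""] = value
        elif name.startswith(COERCED_PREFIX):
            mappings[name[len(COERCED_PREFIX):]] = value
    return mappings


# Stand-ins for elements (assumption: real code would pass lxml elements).
xml_element = SimpleNamespace(
    nsmap={None: "http://www.w3.org/1999/xhtml", "foo": "http://foo.example"},
    attrib={},
)
html_element = SimpleNamespace(
    nsmap={},
    attrib={"xmlns": "http://www.w3.org/1999/xhtml",
            "xmlnsU00003Afoo": "http://foo.example"},
)

print(prefix_mappings(xml_element, is_html=False))
print(prefix_mappings(html_element, is_html=True))
```

Note that the two branches are not quite equivalent: .nsmap reports
everything in scope, while the attribute scan only sees declarations on
the element itself, so the HTML branch still needs its own scoping.
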
>> But in any case, the spec simply says "get the values of xmlns-based
>> attributes and crack them open, and we'll keep track of scoping
>> ourselves".
>
> As you can see, that approach only works if the tree API you are
> working with happens to have a particular design in which namespace
> declaration attributes are visible in the same way as ordinary
> attributes. It happens that DOM is designed in this way, presumably
> because it is the result of a namespace aware API grafted atop a
> namespace unaware API. However it is not, in general, a good assumption
> for tree APIs that postdate Namespaces in XML 1.0.
>
> It is also worth noting that a consumer trying to use lxml is rather
> *better* off than one trying to use the plain ElementTree API that
> ships in the Python standard library (ElementTree pioneered the API
> that lxml uses). As far as I know, ElementTree does not provide any
> built-in way to get at prefix mappings, so one is forced to hook into
> the parser event stream and record the prefixes going in and out of
> scope manually [2].
>
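
For the plain ElementTree case, the manual bookkeeping might look
something like the sketch below, which follows the general approach
described in [2]. The function name elements_with_mappings and the
sample document are made up for illustration; iterparse and its
"start-ns" and "end-ns" events are part of the standard library.

```python
import io
import xml.etree.ElementTree as ET


def elements_with_mappings(xml_text):
    """Return [(tag, {prefix: uri}), ...] with the mappings in scope."""
    scope = []    # stack of (prefix, uri) pairs currently in scope
    results = []
    source = io.BytesIO(xml_text.encode("utf-8"))
    events = ("start-ns", "end-ns", "start")
    for event, payload in ET.iterparse(source, events=events):
        if event == "start-ns":
            scope.append(payload)   # payload is a (prefix, uri) tuple
        elif event == "end-ns":
            scope.pop()             # the matching declaration leaves scope
        else:
            # "start": payload is the element; snapshot the current scope.
            results.append((payload.tag, dict(scope)))
    return results


doc = ("<html xmlns='http://www.w3.org/1999/xhtml' "
       "xmlns:foo='http://foo.example'>"
       "<body xmlns:bar='http://bar.example'/><p/></html>")

for tag, mappings in elements_with_mappings(doc):
    print(tag, mappings)
```

The namespace events arrive outside the element's own start and end
events, which is why a simple stack is enough to track scope here.
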
> [1]
> http://www.whatwg.org/specs/web-apps/current-work/#coercing-an-html-dom-into-an-infoset
> [2] http://effbot.org/zone/element-namespaces.htm
>
>

Received on Monday, 21 September 2009 15:31:35 UTC