Re: rdfa parsing issue -- was: fixed https://foafssl.org/test/WebId

From: Damian Steer <pldms@mac.com>
Date: Fri, 06 Jan 2012 14:57:04 +0000
Message-ID: <4F070BC0.4070207@mac.com>
To: Henry Story <henry.story@bblfish.net>
CC: Jürgen Jakobitsch <j.jakobitsch@semantic-web.at>, "public-xg-webid@w3.org XG" <public-xg-webid@w3.org>

Hi Henry and Jürgen,

On 06/01/12 12:49, Henry Story wrote:

> Shellac's parser parses the xhtml correctly as xhtml in fact, but 
> when the html parser is used it comes to a different conclusion.

Yes, this is becoming a classic issue, and has nothing to do with RDFa
(although RDFa obscures the issue horribly).

> RDFA 1 is defined in xhtml only I understand, so it is true that we
> are going beyond the spec by trying to parse html too. Perhaps
> this will be a lot simplified with rdfa1.1 which can be made to work
> with html5.

Yes, RDFa 1.0 is only really defined for xhtml, although useful work was
done on html 5 at the time (there are some html 5 tests). RDFa 1.1 does
address html 5, but note that it doesn't change anything here.

The problem is this:

    <div rel="foaf:depiction" href="http://2sea.org/2sealogo.png"/>
    <div rel="cert:key">
      ...
    </div>

An xml parser sees a closed div, followed by another div. An html parser
ignores the trailing slash on a non-void element like div, so it sees an
unclosed div and repairs it as follows:

    <div rel="foaf:depiction" href="http://2sea.org/2sealogo.png">
      <div rel="cert:key">
        ...
      </div>
    </div> <!-- close that div -->

i.e. one div now contains the other, so the nested cert:key picks up the
outer div's href as its subject, and thus you find

<http://2sea.org/2sealogo.png> cert:key ....
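
To make the difference concrete, here is a minimal Python sketch that
reports the nesting depth each kind of parser would assign to the two divs.
It is not the parser discussed in this thread. Python's html.parser does no
tag soup repair of its own, so the handle_startendtag override below only
emulates the html rule that "/>" is ignored on non-void elements. The names
SNIPPET, SoupNesting, xml_nesting and html_nesting are made up for the
illustration, and the body wrapper and the emptied cert:key div are
assumptions added so the fragment parses as xml.

```python
import xml.etree.ElementTree as ET
from html.parser import HTMLParser

# Assumed test fragment: the two divs from the message, wrapped in a body
# element so that the xml parser has a single root.
SNIPPET = (
    '<body>'
    '<div rel="foaf:depiction" href="http://2sea.org/2sealogo.png"/>'
    '<div rel="cert:key"></div>'
    '</body>'
)

# Elements that never have content in html, so they do not open a new level.
VOID = {"area", "base", "br", "col", "embed", "hr", "img", "input",
        "link", "meta", "source", "track", "wbr"}


def xml_nesting(markup):
    """Return (depth, rel) pairs as an xml parser sees the markup."""
    found = []

    def walk(element, depth):
        found.append((depth, element.get("rel")))
        for child in element:
            walk(child, depth + 1)

    walk(ET.fromstring(markup), 0)
    return found


class SoupNesting(HTMLParser):
    """Records (depth, rel) pairs, ignoring '/>' on non-void elements."""

    def __init__(self):
        super().__init__()
        self.depth = 0
        self.found = []

    def handle_starttag(self, tag, attrs):
        self.found.append((self.depth, dict(attrs).get("rel")))
        if tag not in VOID:
            self.depth += 1

    def handle_startendtag(self, tag, attrs):
        # Emulated html behaviour: the trailing slash does not close a div.
        self.handle_starttag(tag, attrs)

    def handle_endtag(self, tag):
        if tag not in VOID:
            self.depth -= 1


def html_nesting(markup):
    """Return (depth, rel) pairs under the emulated html rule."""
    parser = SoupNesting()
    parser.feed(markup)
    parser.close()
    return parser.found


print(xml_nesting(SNIPPET))
# [(0, None), (1, 'foaf:depiction'), (1, 'cert:key')]
print(html_nesting(SNIPPET))
# [(0, None), (1, 'foaf:depiction'), (2, 'cert:key')]
```

In the xml view both divs are siblings at depth 1. Under the html rule the
cert:key div ends up at depth 2, inside the depiction div.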

I ought to add a utility to switch the parser based on content type;
however, in practice there's so much broken xhtml out there that tag soup
parsing is much safer (although it does lead to issues like this).
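
A content type switch of that kind might look roughly like the following
Python sketch. It is only an illustration, not code from my parser: the
function name pick_parser, the XML_TYPES set and the "xml" / "tagsoup"
labels are all hypothetical, and anything that is not clearly an xml media
type falls back to tag soup parsing.

```python
from email.message import Message

# Assumed list of media types that should go to a strict xml parser.
XML_TYPES = {"application/xhtml+xml", "application/xml", "text/xml"}


def pick_parser(content_type_header):
    """Return "xml" for xml media types, otherwise "tagsoup"."""
    message = Message()
    message["Content-Type"] = content_type_header
    # get_content_type() lowercases the value and drops parameters
    # such as "; charset=utf-8".
    media_type = message.get_content_type()
    return "xml" if media_type in XML_TYPES else "tagsoup"


print(pick_parser("application/xhtml+xml; charset=utf-8"))  # xml
print(pick_parser("text/html"))                             # tagsoup
```

Note that this only helps when the server sends an honest content type, and
xhtml served as text/html would still go through the tag soup path.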

My advice would be to expect tag soup parsing in the wild and change the
html:

    <div rel="foaf:depiction" href="http://2sea.org/2sealogo.png"></div>

Hope this makes sense,

Damian
Received on Friday, 6 January 2012 14:58:20 GMT
