Re: XMLLiteral handling in RDFa in HTML from Sam Ruby on 2009-05-27 (public-html@w3.org from May 2009)

From: Sam Ruby <rubys@intertwingly.net>
Date: Wed, 27 May 2009 09:52:59 -0400
To: Shelley Powers <shelley.just@gmail.com>
CC: Manu Sporny <msporny@digitalbazaar.com>, Toby Inkster <tai@g5n.co.uk>, RDFa mailing list <public-rdf-in-xhtml-tf@w3.org>, HTMLWG WG <public-html@w3.org>
Message-ID: <4A1D45BB.7020801@intertwingly.net>

Shelley Powers wrote:
> 
>     It is worse than that.  If you consider only the set of valid and
>     well-formed XHTML 1.1 documents, it is the case that parsing all such
>     documents as text/html will produce a DOM, but it is not the case that
>     all such DOMs will be identical to the ones produced if the same sources
>     were parsed as application/xhtml+xml.
> 
>     More info: http://wiki.whatwg.org/wiki/HTML_vs._XHTML
> 
>     Most of the differences deal with things like titles, textarea, scripts,
>     and style elements.  Also, <![CDATA[...]]> ends up being treated as a
>     comment in HTML.
> 
>     The first observation is that even the Microdata proposal in the current
>     HTML 5 specification doesn't meet the criteria specified above(*), as
>     titles which contain the strings "&amp;" or "&lt;" will produce
>     different triples when those documents are parsed as text/html vs
>     application/xhtml+xml.
> 
>     As an aside: for purposes of this discussion, I suggest adopting the
>     approach of identifying content based on the MIME type.  My weblog, for
>     example, is only XHTML when served to browsers that support
>     application/xhtml+xml.  For all other browsers (e.g. Lynx, IE8), it is
>     simply HTML.  I also suggest dropping version numbers when referring to
>     HTML or XHTML.
> 
> I don't think we can drop references to HTML5, as the work is still 
> ongoing and we can't be sure what impact future efforts on the document 
> will have on RDFa. To repeat again, property came close to being 
> redefined, and that would have had seriously negative impacts on RDFa, 
> regardless of whether you used the XHTML or HTML version.

Apologies if I was unclear.  Some seem to have a notion that syntaxes 
can be different depending on what the enclosing format is.  I'm not 
convinced that that is a good idea when it comes to XHTML vs HTML.  I'm 
totally convinced that that is a bad idea when it comes to HTML4 vs HTML5.

I believe that the rules for RDFa in HTML should be the same for 
documents with the HTML5 doctype as they are for the HTML4 doctype.

>      >> Sometimes an RDFa parser, dealing with HTML,
>      >> will hit a situation where it needs to generate an XMLLiteral from
>      >> non-wellformed HTML. In these situations, it seems to me that we have a
>      >> choice of three potential "the parser MUST" actions, all of which are
>      >> roughly consistent with RDFa in XHTML:
>      >>
>      >> 1. The parser MUST ignore this triple altogether. A simple solution, and
>      >> it means that the HTML graph would be a subset of the XHTML graph. RDF
>      >> vocabularies are generally defined so that if a graph G is true, then
>      >> any graph H such that H is a subset of G is also true.
>      >
>      > The XHTML parser can't ignore the triple due to a parser error, or if it
>      > corrects the parser error, shouldn't output the malformed XMLLiteral.
>      >
>      > The HTML5lib parser will never see that the XMLLiteral was malformed.
>      >
>      >> 2. The parser MUST add the triple to the graph as normal, but MUST NOT
>      >> set the literal's datatype to XMLLiteral. They could either leave the
>      >> literal as an untyped literal (that happened to have a lot of angled
>      >> brackets in it) or perhaps set it to some HTMLLiteral datatype of our
>      >> own concoction.
>      >
>      > This would be a problem because the XML-based parser implementations
>      > would switch the datatype of the object to something like
>      > XMLCharacterStream, while the html5lib parser would output an XMLLiteral.
>      >
>      > I don't believe that there is any such thing as a malformed XMLLiteral
>      > in HTML5... is there? Can anybody think of an example of an invalid
>      > XMLLiteral in an html5 parser?
> 
>     <div about="#foo" xmlns:dc="http://purl.org/dc/elements/1.1/">
>       <span property="dc:description"><a$></span>
>     </div>
> 
>      >> 3. The parser MUST coerce the HTML fragment into a well-formed (but not
>      >> necessarily valid) XHTML fragment. The HTML5 draft gives us decent
>      >> algorithms for doing this.
>      >
>      > It does, but HTML5 has nothing to do with XHTML1.1 and XHTML2 - why
>      > should we apply HTML5's parsing rules to XHTML1.1 and XHTML2 documents?
> 
>     Browsers will apply HTML parsing rules to XHTML1.1 documents served as
>     text/html.  This can affect the triples produced by jquery.rdfa.js.
> 
>      > I don't think that this is something we can 'MUST' ourselves out of...
>      > relaxing the conformance requirements to not include XMLLiterals seems
>      > to be a mechanism that would:
>      >
>      > a) Allow variance in IF and HOW XMLLiterals are generated - which will
>      > vary based on if a document is being parsed by a SAX-based XML parser in
>      > XHTML1.1, or a DOM-based Javascript parser in HTML5.
>      > b) Not automatically disqualify all DOM-based HTML5 implementations, or
>      > non-raw-stream-based XHTML1.1 implementations.
>      >
>      > Although, even this approach bothers me quite a bit... as does getting
>      > rid of XMLLiterals altogether.
> 
>     There is a set of documents which will produce the same RDF triples
>     independent of whether the document is processed as text/html vs
>     application/xhtml+xml.
> 
>     1) I suggest that the syntaxes for RDFa in application/xhtml+xml vs RDFa
>     in text/html not be considered separately, but be developed together and
>     with an eye towards maximizing the set mentioned above.
> 
> There is already a recommended syntax for RDFa in XHTML. Whatever 
> decision is made going forward, that won't change.

It is possible for recommendations to be superseded.  I'm not suggesting 
that radical changes are required (nor, in fact, am I precluding that 
possibility).  I'm just saying that instead of a separate RDFa in HTML 
specification, I think it would make sense for the successor to the 
current RDFa in XHTML document to cover both documents served as 
application/xhtml+xml and documents served as text/html.  That matters 
particularly because the current document purports to describe RDFa in 
XHTML, yet most XHTML is served as text/html and is therefore processed 
in a number of ways that differ from what the current spec describes.
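
As a rough illustration of that divergence, here is a minimal sketch that 
runs the <a$> fragment quoted above through two of Python's standard-library 
parsers.  These are stand-ins only: neither is an RDFa processor, and 
html.parser is a tolerant tokenizer rather than a conforming HTML5 parser.  
The helper names (parses_as_xml, html_start_tags) are hypothetical.

```python
from html.parser import HTMLParser
import xml.etree.ElementTree as ET

# The fragment quoted earlier in this thread.
FRAGMENT = (
    '<div about="#foo" xmlns:dc="http://purl.org/dc/elements/1.1/">\n'
    '  <span property="dc:description"><a$></span>\n'
    '</div>'
)


def parses_as_xml(source):
    """Return True if an XML parser accepts the source as well-formed."""
    try:
        ET.fromstring(source)
    except ET.ParseError:
        return False
    return True


class StartTagCollector(HTMLParser):
    """Record every start tag the tolerant HTML tokenizer reports."""

    def __init__(self):
        super().__init__()
        self.tags = []

    def handle_starttag(self, tag, attrs):
        self.tags.append(tag)


def html_start_tags(source):
    """Tokenize the source as HTML and return the start tags seen."""
    collector = StartTagCollector()
    collector.feed(source)
    collector.close()
    return collector.tags


if __name__ == "__main__":
    print("accepted as XML:", parses_as_xml(FRAGMENT))
    print("start tags seen as HTML:", html_start_tags(FRAGMENT))
```

The XML path stops at the well-formedness error, so it has nothing from 
which to build an XMLLiteral; the HTML path carries on and still reports 
the span that holds the property attribute.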

But that's just my input, others may have different perspectives.

> However, the RDFa group could follow the SVG working group's approach by 
> providing its own RDFa 'tiny' specification, as a next version, which 
> will then end up being a subset of a RDFa 'full' specification at a 
> later time.
> 
> The 'tiny' version could specifically address the limitations that 
> HTML/DOM implementation casts, while allowing folks to freely use the 
> original RDFa specification if they serve their pages up as XHTML. The 
> next 'full' version of RDFa can then build on the 'tiny' specification, 
> adding back in capabilities allowed through the use of XHTML, while 
> ensuring consistent RDF triples.
> 
> I don't think it's as important to generate the "same" set of triples 
> as it is to ensure that people know what to expect when they serve their 
> pages up as HTML compared to XHTML, and that the triples they get with 
> HTML will be a subset of those they would get with XHTML.

I don't believe that there is a proper sub/superset relationship between 
HTML and XHTML w.r.t. RDFa; but given that RDFa in HTML is essentially 
undefined, anything is possible.  At the moment, I believe that it is 
entirely likely that an RDFa parser such as jquery.rdfa.js would pick up 
*more* triples from a document which varied the case of a number of 
elements if that document were served as text/html than if the same 
document were served as application/xhtml+xml.
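
A minimal sketch of the case issue, assuming Python's standard-library 
parsers as stand-ins (this is not jquery.rdfa.js, and it only collects 
property attribute values rather than producing triples).  The sample 
markup and the function names are made up for illustration.  The HTML 
tokenizer lowercases element and attribute names; the XML parser keeps 
them exactly as written.

```python
from html.parser import HTMLParser
import xml.etree.ElementTree as ET

# Hypothetical sample: one element written in upper case, one in lower case.
SOURCE = (
    '<div>'
    '<SPAN PROPERTY="dc:title">Hello</SPAN>'
    '<span property="dc:creator">Sam</span>'
    '</div>'
)


class PropertyCollector(HTMLParser):
    """Collect the values of every attribute named 'property'."""

    def __init__(self):
        super().__init__()
        self.properties = []

    def handle_starttag(self, tag, attrs):
        # html.parser hands over attribute names already lowercased.
        for name, value in attrs:
            if name == "property":
                self.properties.append(value)


def properties_as_html(source):
    """Property values found when the source is tokenized as HTML."""
    collector = PropertyCollector()
    collector.feed(source)
    collector.close()
    return collector.properties


def properties_as_xml(source):
    """Property values found when the source is parsed as XML."""
    root = ET.fromstring(source)
    # XML names are case-sensitive, so PROPERTY is a different attribute.
    return [el.get("property") for el in root.iter()
            if el.get("property") is not None]


if __name__ == "__main__":
    print("as HTML:", properties_as_html(SOURCE))
    print("as XML: ", properties_as_xml(SOURCE))
```

Processed as HTML, both spans contribute a property value; processed as 
XML, only the lowercase one does.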

>     2) (in the fullness of time) it would be helpful if there were a
>     validator which identified documents which cause different triples to be
>     produced.  If such a tool also identifies other conformance issues with
>     the document, it would be helpful to have an option to turn the
>     reporting of such issues off, as with many documents this will obscure
>     the set of errors that affect the production of triples.
> 
> And I'm sure there will be, in the fullness of time.
> 
>     3) Test cases should be produced with the goal of ensuring that parsers
>     looking to produce RDF triples (whether it be from microformats,
>     microdata, or RDFa) respect the MIME type of the document.
> 
> I believe that test cases have already been added in this regard.

Got a URL handy?

>      > -- manu
> 
>     - Sam Ruby
> 
>     (*) I don't know whether this is an oversight, or even a problem, but
>     when looking into the HTML5 draft, I couldn't find where itemprop
>     attributes have any effect on the RDF triples produced.
> 
> The section discussing how to generate RDF from microdata.

- Sam Ruby
Received on Wednesday, 27 May 2009 13:53:41 UTC