Re: XMLLiteral handling in RDFa in HTML

Date: Wed, 27 May 2009 08:28:12 -0500
> It is worse than that.  If you consider only the set of valid and
> well-formed XHTML 1.1 documents, it is the case that parsing all such
> documents as text/html will produce a DOM, but it is not the case that
> all such DOMs will be identical to the ones produced if the same sources
> were parsed as application/xhtml+xml.
> More info: http://wiki.whatwg.org/wiki/HTML_vs._XHTML
> Most of the differences deal with things like titles, textarea, scripts,
> and style elements.  Also, <![CDATA[...]]> ends up being treated as a
> comment in HTML.
> The first observation is that even the Microdata proposal in the current
> HTML 5 specification doesn't meet the criteria specified above(*), as
> titles which contain the strings "&amp;" or "&lt;" will produce
> different triples when those documents are parsed as text/html vs
> application/xhtml+xml.
> As an aside: for purposes of this discussion, I suggest adopting the
> approach of identifying content based on the MIME type.  My weblog, for
> example, is only XHTML when served to browsers that support
> application/xhtml+xml.  For all other browsers (e.g. Lynx, IE8), it is
> simply HTML.  I also suggest dropping version numbers when referring to

I don't think we can drop references to HTML5, as the work is still ongoing
and we can't be sure what impact future efforts on the document will have on
RDFa. To repeat, the property attribute came close to being redefined, and
that would have had seriously negative impacts on RDFa, regardless of whether
you used the XHTML or HTML version.

> >> Sometimes an RDFa parser, dealing with HTML,
> >> will hit a situation where it needs to generate an XMLLiteral from
> >> non-wellformed HTML. In these situations, it seems to me that we have a
> >> choice of three potential "the parser MUST" actions, all of which are
> >> roughly consistent with RDFa in XHTML:
> >>
> >> 1. The parser MUST ignore this triple altogether. A simple solution, and
> >> it means that the HTML graph would be a subset of the XHTML graph. RDF
> >> vocabularies are generally defined so that if a graph G is true, then
> >> any graph H such that H is a subset of G is also true.
> >
> > The XHTML parser can't ignore the triple due to a parser error, or if it
> > corrects the parser error, shouldn't output the malformed XMLLiteral.
> >
> > The HTML5lib parser will never see that the XMLLiteral was malformed.
> >
> >> 2. The parser MUST add the triple to the graph as normal, but MUST NOT
> >> set the literal's datatype to XMLLiteral. They could either leave the
> >> literal as an untyped literal (that happened to have a lot of angled
> >> brackets in it) or perhaps set it to some HTMLLiteral datatype of our
> >> own concoction.
> >
> > This would be a problem because the XML-based parser implementations
> > would switch the datatype of the object to something like
> > XMLCharacterStream, while the html5lib parser would output an XMLLiteral.
> >
> > I don't believe that there is any such thing as a malformed XMLLiteral
> > in HTML5... is there? Can anybody think of an example of an invalid
> > XMLLiteral in an html5 parser?
> <div about="#foo" xmlns:dc="http://purl.org/dc/elements/1.1/">
>    <span property="dc:description"><a$></span>
> </div>
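
To make the example above concrete, here is a minimal Python sketch (not from the original thread) that checks whether a fragment is well-formed XML using the standard-library xml.etree.ElementTree parser. The helper name is_well_formed_xml and the second sample fragment are hypothetical, added only for illustration. An XML parser rejects `<a$>` because `$` cannot appear in an element name, while an HTML parser would still build a DOM from the same bytes.

```python
import xml.etree.ElementTree as ET


def is_well_formed_xml(fragment: str) -> bool:
    """Return True if `fragment` parses as a single well-formed XML element."""
    try:
        ET.fromstring(fragment)
    except ET.ParseError:
        return False
    return True


# The span from the example above: "$" is not allowed in an XML element name.
example = '<span property="dc:description"><a$></span>'

# Hypothetical well-formed variant, for comparison only.
well_formed = '<span property="dc:description"><a>text</a></span>'

print(is_well_formed_xml(example))
print(is_well_formed_xml(well_formed))
```

A check like this only says whether the content could be emitted as an XMLLiteral as-is; which of the three options to take when it fails is the open question in this thread.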
> >> 3. The parser MUST coerce the HTML fragment into a well-formed (but not
> >> necessarily valid) XHTML fragment. The HTML5 draft gives us decent
> >> algorithms for doing this.
> >
> > It does, but HTML5 has nothing to do with XHTML1.1 and XHTML2 - why
> > should we apply HTML5's parsing rules to XHTML1.1 and XHTML2 documents?
> Browsers will apply HTML parsing rules to XHTML1.1 documents served as
> text/html.  This can affect the triples produced by jquery.rdfa.js.
> > I don't think that this is something we can 'MUST' ourselves out of...
> > relaxing the conformance requirements to not include XMLLiterals seems
> > to be a mechanism that would:
> >
> > a) Allow variance in IF and HOW XMLLiterals are generated - which will
> > vary based on if a document is being parsed by a SAX-based XML parser in
> > XHTML1.1, or a DOM-based Javascript parser in HTML5.
> > b) Not automatically disqualify all DOM-based HTML5 implementations, or
> > non-raw-stream-based XHTML1.1 implementations.
> >
> > Although, even this approach bothers me quite a bit... as does getting
> > rid of XMLLiterals altogether.
> There is a set of documents which will produce the same RDF triples
> independent of whether the document is processed as text/html vs
> application/xhtml+xml.
> 1) I suggest that the syntaxes for RDFa in application/xhtml+xml vs RDFa
> in text/html not be considered separately, but be developed together and
> with an eye towards maximizing the set mentioned above.

There is already a recommended syntax for RDFa in XHTML. Whatever decision is
made going forward, that won't change.

However, the RDFa group could follow the SVG working group's approach by
providing its own RDFa 'tiny' specification, as a next version, which will
then end up being a subset of an RDFa 'full' specification at a later time.

The 'tiny' version could specifically address the limitations that HTML/DOM
implementations impose, while allowing folks to freely use the original RDFa
specification if they serve their pages up as XHTML. The next 'full' version
of RDFa can then build on the 'tiny' specification, adding back in
capabilities allowed through the use of XHTML, while ensuring consistent RDF

I don't think it's as important to generate the "same" set of triples as it
is to ensure that people know what to expect when they serve their pages up
as HTML compared to XHTML, and that the triples they get with HTML will be a
subset of those they would get with XHTML.
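
As a rough illustration of that subset expectation, here is a small Python sketch. It assumes triples are modelled as plain (subject, predicate, object) string tuples; the function names and the sample triples are hypothetical and do not come from any RDFa test suite.

```python
from typing import Set, Tuple

Triple = Tuple[str, str, str]


def html_is_subset_of_xhtml(html: Set[Triple], xhtml: Set[Triple]) -> bool:
    """True if every triple extracted in HTML mode also appears in XHTML mode."""
    return html <= xhtml


def unexpected_html_triples(html: Set[Triple], xhtml: Set[Triple]) -> Set[Triple]:
    """Triples that appear only in the HTML graph and so break the subset rule."""
    return html - xhtml


# Made-up graphs: the XHTML graph carries one extra XMLLiteral triple.
DC = "http://purl.org/dc/elements/1.1/"
xhtml_graph = {
    ("#foo", DC + "title", '"Example"'),
    ("#foo", DC + "description", '"<em>text</em>"^^rdf:XMLLiteral'),
}
html_graph = {
    ("#foo", DC + "title", '"Example"'),
}

print(html_is_subset_of_xhtml(html_graph, xhtml_graph))
print(unexpected_html_triples(html_graph, xhtml_graph))
```

Comparing string tuples ignores blank-node renaming and literal canonicalization, so a real comparison would need a proper graph-equivalence check.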

> 2) (in the fullness of time) it would be helpful if there were a
> validator which identified documents which cause different triples to be
> produced.  If such a tool also identifies other conformance issues with
> the document, it would be helpful to have an option to turn the
> reporting of such issues off, as with many documents this will obscure
> the set of errors that affect the production of triples.

And I'm sure there will be, in the fullness of time.

> 3) Test cases should be produced with the goal of ensuring that parsers
> looking to produce RDF triples (whether it be from microformats,
> microdata, or RDFa) respect the MIME type of the document.

I believe that test cases have already been added in this regard.
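
For anyone writing such tests, the MIME-type decision itself can be sketched in a few lines of Python. This is only an assumption about how a parser might choose its mode from a Content-Type header; the function name parsing_mode and the "html"/"xml" return values are hypothetical.

```python
def parsing_mode(content_type: str) -> str:
    """Pick a parsing mode from a Content-Type header value.

    Parameters such as "; charset=utf-8" are ignored and the comparison
    is case-insensitive.
    """
    mime = content_type.split(";", 1)[0].strip().lower()
    if mime == "application/xhtml+xml":
        return "xml"
    if mime == "text/html":
        return "html"
    raise ValueError("unsupported MIME type: " + repr(mime))


print(parsing_mode("text/html; charset=utf-8"))
print(parsing_mode("application/xhtml+xml"))
```

A test case would then serve the same bytes under both MIME types and compare the triples each mode produces.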

> > -- manu
> - Sam Ruby
> (*) I don't know whether this is an oversight, or even a problem, but
> when looking into the HTML5 draft, I couldn't find where itemprop
> attributes have any effect on the RDF triples produced.
The section discussing how to generate RDF from microdata.
