- From: Sam Ruby <rubys@intertwingly.net>
- Date: Wed, 27 May 2009 01:57:56 -0400
- To: Manu Sporny <msporny@digitalbazaar.com>
- CC: Toby Inkster <tai@g5n.co.uk>, RDFa mailing list <public-rdf-in-xhtml-tf@w3.org>, HTMLWG WG <public-html@w3.org>
Manu Sporny wrote:
> Toby Inkster wrote:
>> On Mon, 2009-05-25 at 20:55 -0400, Manu Sporny wrote:
>>> So, thoughts on this issue?
>> I don't think that a big song and dance is needed over this. The issue
>> seems pretty simple to me.
>
> Hmm, I don't think it is that simple, and here's why...
>
> If you have the following markup:
>
> <div about="#foo" xmlns:dc="http://purl.org/dc/elements/1.1/">
> <span property="dc:description"><br>para1</span>
> </div>
>
> A SAX-based parser (such as Expat), parsing an XHTML document will fail
> to generate a triple due to a parser error. Even if you do some sort of
> self-healing and continue processing the document, the XMLLiteral should
> not be produced because the contents are not well-formed XML.
>
> However, an HTML5lib-based parser would correct the input to the
> following before a purely DOM-based RDFa processor could see the
> contents of the SPAN element:
>
> <div about="#foo" xmlns:dc="http://purl.org/dc/elements/1.1/">
> <span property="dc:description"><br/>para1</span>
> </div>
>
> which would then generate the following triple:
>
> <#foo>
> <http://purl.org/dc/elements/1.1/description>
> '<br xmlns="http://www.w3.org/1999/xhtml"
> xmlns:dc="http://purl.org/dc/elements/1.1/" />para1'^^rdf:XMLLiteral .
>
> So, we have the exact same markup generating two completely different
> sets of XMLLiteral triples. If one of our goals is to generate the same
> triples across different types of markup - we are failing to do so with
> the current set of processing rules.

It is worse than that. If you consider only the set of valid, and
well-formed XHTML 1.1 documents, it is the case that parsing all such
documents as text/html will produce a DOM, but it is not the case that
all such DOMs will be identical to the ones produced if the same sources
were parsed as application/xhtml+xml.

More info: http://wiki.whatwg.org/wiki/HTML_vs._XHTML

Most of the differences deal with things like titles, textarea, scripts,
and style elements.
Also, <![CDATA[...]]> ends up being treated as a comment in HTML.

The first observation is that even the Microdata proposal in the current
HTML 5 specification doesn't meet the criteria specified above(*), as
titles which contain the strings "&" or "<" will produce different
triples when those documents are parsed as text/html vs
application/xhtml+xml.

As an aside: for purposes of this discussion, I suggest adopting the
approach of identifying content based on the MIME type. My weblog, for
example, is only XHTML when served to browsers that support
application/xhtml+xml. For all other browsers (e.g. Lynx, IE8), it is
simply HTML. I also suggest dropping version numbers when referring to
HTML or XHTML.

>> Sometimes an RDFa parser, dealing with HTML,
>> will hit a situation where it needs to generate an XMLLiteral from
>> non-wellformed HTML. In these situations, it seems to me that we have a
>> choice of three potential "the parser MUST" actions, all of which are
>> roughly consistent with RDFa in XHTML:
>>
>> 1. The parser MUST ignore this triple altogether. A simple solution, and
>> it means that the HTML graph would be a subset of the XHTML graph. RDF
>> vocabularies are generally defined so that if a graph G is true, then
>> any graph H such that H is a subset of G is also true.
>
> The XHTML parser can't ignore the triple due to a parser error, or if it
> corrects the parser error, shouldn't output the malformed XMLLiteral.
>
> The HTML5lib parser will never see that the XMLLiteral was malformed.
>
>> 2. The parser MUST add the triple to the graph as normal, but MUST NOT
>> set the literal's datatype to XMLLiteral. They could either leave the
>> literal as an untyped literal (that happened to have a lot of angled
>> brackets in it) or perhaps set it to some HTMLLiteral datatype of our
>> own concoction.
>
> This would be a problem because the XML-based parser implementations
> would switch the datatype of the object to something like
> XMLCharacterStream, while the html5lib parser would output an XMLLiteral.
>
> I don't believe that there is any such thing as a malformed XMLLiteral
> in HTML5... is there? Can anybody think of an example of an invalid
> XMLLiteral in an html5 parser?

<div about="#foo" xmlns:dc="http://purl.org/dc/elements/1.1/">
<span property="dc:description"><a$></span>
</div>

>> 3. The parser MUST coerce the HTML fragment into a well-formed (but not
>> necessarily valid) XHTML fragment. The HTML5 draft gives us decent
>> algorithms for doing this.
>
> It does, but HTML5 has nothing to do with XHTML1.1 and XHTML2 - why
> should we apply HTML5's parsing rules to XHTML1.1 and XHTML2 documents?

Browsers will apply HTML parsing rules to XHTML1.1 documents served as
text/html. This can affect the triples produced by jquery.rdfa.js.

> I don't think that this is something we can 'MUST' ourselves out of...
> relaxing the conformance requirements to not include XMLLiterals seems
> to be a mechanism that would:
>
> a) Allow variance in IF and HOW XMLLiterals are generated - which will
> vary based on if a document is being parsed by a SAX-based XML parser in
> XHTML1.1, or a DOM-based Javascript parser in HTML5.
> b) Not automatically disqualify all DOM-based HTML5 implementations, or
> non-raw-stream-based XHTML1.1 implementations.
>
> Although, even this approach bothers me quite a bit... as does getting
> rid of XMLLiterals altogether.

There is a set of documents which will produce the same RDF triples
independent of whether the document is processed as text/html vs
application/xhtml+xml.

1) I suggest that the syntaxes for RDFa in application/xhtml+xml vs RDFa
in text/html not be considered separately, but be developed together and
with an eye towards maximizing the set mentioned above.

2) (in the fullness of time) it would be helpful if there were a
validator which identified documents which cause different triples to be
produced. If such a tool also identifies other conformance issues with
the document, it would be helpful to have an option to turn the reporting
of such issues off, as with many documents this will obscure the set of
errors that affect the production of triples.

3) Test cases should be produced with the goal of ensuring that parsers
looking to produce RDF triples (whether it be from microformats,
microdata, or RDFa) respect the MIME type of the document.

> -- manu

- Sam Ruby

(*) I don't know whether this is an oversight, or even a problem, but
when looking into the HTML5 draft, I couldn't find where itemprop
attributes have any effect on the RDF triples produced.
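
For anyone who wants to reproduce the well-formedness difference in the quoted `<br>` example, here is a minimal sketch in Python. It assumes the standard library's xml.etree.ElementTree as a stand-in for a strict XML parser and html.parser as a stand-in for a lenient HTML tokenizer. Neither is the Expat-based RDFa processor or html5lib mentioned in the thread, and html.parser does not build an HTML5 tree. The names FRAGMENT, is_well_formed_xml, TagCollector, and html_events are hypothetical and exist only for illustration.

```python
import xml.etree.ElementTree as ET
from html.parser import HTMLParser

# The SPAN from the quoted example, with an unclosed <br>.
FRAGMENT = '<span property="dc:description"><br>para1</span>'


def is_well_formed_xml(markup):
    """Return True if a strict XML parser accepts the markup."""
    try:
        ET.fromstring(markup)
    except ET.ParseError:
        return False
    return True


class TagCollector(HTMLParser):
    """Records the start tags and text seen by a lenient HTML tokenizer."""

    def __init__(self):
        super().__init__()
        self.events = []

    def handle_starttag(self, tag, attrs):
        self.events.append(("start", tag))

    def handle_data(self, data):
        self.events.append(("text", data))


def html_events(markup):
    """Tokenize markup as HTML and return the recorded events."""
    collector = TagCollector()
    collector.feed(markup)
    collector.close()
    return collector.events


if __name__ == "__main__":
    # The XML parser rejects the unclosed <br>; the HTML tokenizer does not.
    print(is_well_formed_xml(FRAGMENT))
    print(is_well_formed_xml(FRAGMENT.replace("<br>", "<br/>")))
    print(html_events(FRAGMENT))
```

Under these assumptions, the XML side rejects the `<br>` form and accepts the `<br/>` form, while the HTML side reports the same start tags and text for both. That is the asymmetry at issue: a processor working from the HTML side never learns that the source was not well-formed XML.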
Received on Wednesday, 27 May 2009 06:06:26 UTC