- From: Sam Ruby <rubys@intertwingly.net>
- Date: Wed, 27 May 2009 09:52:59 -0400
- To: Shelley Powers <shelley.just@gmail.com>
- CC: Manu Sporny <msporny@digitalbazaar.com>, Toby Inkster <tai@g5n.co.uk>, RDFa mailing list <public-rdf-in-xhtml-tf@w3.org>, HTMLWG WG <public-html@w3.org>
Shelley Powers wrote:
>
> It is worse than that. If you consider only the set of valid, and
> well-formed XHTML 1.1 documents, it is the case that parsing all such
> documents as text/html will produce a DOM, but it is not the case that
> all such DOMs will be identical to the ones produced if the same sources
> were parsed as application/xhtml+xml.
>
> More info: http://wiki.whatwg.org/wiki/HTML_vs._XHTML
>
> Most of the differences deal with things like titles, textarea, scripts,
> and style elements. Also, <![CDATA[...]]> ends up being treated as a
> comment in HTML.
>
> The first observation is that even the Microdata proposal in the current
> HTML 5 specification doesn't meet the criteria specified above(*), as
> titles which contain the strings "&" or "<" will produce
> different triples when those documents are parsed as text/html vs
> application/xhtml+xml.
>
> As an aside: for purposes of this discussion, I suggest adopting the
> approach of identifying content based on the MIME type. My weblog, for
> example, is only XHTML when served to browsers that support
> application/xhtml+xml. For all other browsers (e.g. Lynx, IE8), it is
> simply HTML. I also suggest dropping version numbers when referring to
> HTML or XHTML.
>
> I don't think we can drop references to HTML5, as the work is still
> ongoing and we can't be sure what impact future efforts on the document
> will have on RDFa. To repeat again, property came close to being
> redefined, and that would have had seriously negative impacts on RDFa,
> regardless of whether you used the XHTML or HTML version.

Apologies if I was unclear. Some seem to have a notion that syntaxes
can be different depending on what the enclosing format is. I'm not
convinced that that is a good idea when it comes to XHTML vs HTML. I'm
totally convinced that that is a bad idea when it comes to HTML4 vs
HTML5. I believe that the rules for RDFa in HTML should be the same for
documents with the HTML5 doctype as they are for the HTML4 doctype.

> >> Sometimes an RDFa parser, dealing with HTML,
> >> will hit a situation where it needs to generate an XMLLiteral from
> >> non-wellformed HTML. In these situations, it seems to me that we have a
> >> choice of three potential "the parser MUST" actions, all of which are
> >> roughly consistent with RDFa in XHTML:
> >>
> >> 1. The parser MUST ignore this triple altogether. A simple solution, and
> >> it means that the HTML graph would be a subset of the XHTML graph. RDF
> >> vocabularies are generally defined so that if a graph G is true, then
> >> any graph H such that H is a subset of G is also true.
> >
> > The XHTML parser can't ignore the triple due to a parser error, or if it
> > corrects the parser error, shouldn't output the malformed XMLLiteral.
> >
> > The HTML5lib parser will never see that the XMLLiteral was malformed.
> >
> >> 2. The parser MUST add the triple to the graph as normal, but MUST NOT
> >> set the literal's datatype to XMLLiteral. They could either leave the
> >> literal as an untyped literal (that happened to have a lot of angled
> >> brackets in it) or perhaps set it to some HTMLLiteral datatype of our
> >> own concoction.
> >
> > This would be a problem because the XML-based parser implementations
> > would switch the datatype of the object to something like
> > XMLCharacterStream, while the html5lib parser would output an XMLLiteral.
> >
> > I don't believe that there is any such thing as a malformed XMLLiteral
> > in HTML5... is there? Can anybody think of an example of an invalid
> > XMLLiteral in an html5 parser?
>
> <div about="#foo" xmlns:dc="http://purl.org/dc/elements/1.1/">
> <span property="dc:description"><a$></span>
> </div>
>
> >> 3. The parser MUST coerce the HTML fragment into a well-formed (but not
> >> necessarily valid) XHTML fragment. The HTML5 draft gives us decent
> >> algorithms for doing this.
> >
> > It does, but HTML5 has nothing to do with XHTML1.1 and XHTML2 - why
> > should we apply HTML5's parsing rules to XHTML1.1 and XHTML2 documents?
>
> Browsers will apply HTML parsing rules to XHTML1.1 documents served as
> text/html. This can affect the triples produced by jquery.rdfa.js.
>
> > I don't think that this is something we can 'MUST' ourselves out of...
> > relaxing the conformance requirements to not include XMLLiterals seems
> > to be a mechanism that would:
> >
> > a) Allow variance in IF and HOW XMLLiterals are generated - which will
> > vary based on if a document is being parsed by a SAX-based XML parser in
> > XHTML1.1, or a DOM-based Javascript parser in HTML5.
> > b) Not automatically disqualify all DOM-based HTML5 implementations, or
> > non-raw-stream-based XHTML1.1 implementations.
> >
> > Although, even this approach bothers me quite a bit... as does getting
> > rid of XMLLiterals altogether.
>
> There are a set of documents which will produce the same RDF triples
> independent of whether the document is processed as text/html vs
> application/xhtml+xml.
>
> 1) I suggest that the syntaxes for RDFa in application/xhtml+xml vs RDFa
> in text/html not be considered separately, but be developed together and
> with an eye towards maximizing the set mentioned above.
>
> There is already a recommended syntax for RDFa in XHTML. Whatever
> decision made going forward, that won't change.

It is possible for recommendations to be superseded. I'm not suggesting
that radical changes are required (nor, in fact, am I precluding that
possibility); I'm just saying that instead of a separate RDFa in HTML
specification, I think it would make sense for the successor to the
current RDFa in XHTML document to cover documents served as
application/xhtml+xml and documents served as text/html; particularly
given that the current document purports to describe RDFa in XHTML
despite the fact that most XHTML is served as text/html and therefore
processed in a number of ways differently than what the current spec
describes.

But that's just my input, others may have different perspectives.

> However, the RDFa group could follow the SVG working group's approach by
> providing its own RDFa 'tiny' specification, as a next version, which
> will then end up being a subset of a RDFa 'full' specification at a
> later time.
>
> The 'tiny' version could specifically address the limitations that
> HTML/DOM implementation casts, while allowing folks to freely use the
> original RDFa specification if they serve their pages up as XHTML. The
> next 'full' version of RDFa can then build on the 'tiny' specification,
> adding back in capabilities allowed through the use of XHTML, while
> ensuring consistent RDF triples.
>
> I don't think it's important to generate the "same" set of triples, as to
> ensure that people know what to expect when they serve their pages up as
> HTML compared to XHTML, and that the triples they get with HTML will be
> a subset of those they would get with XHTML.

I don't believe that there is a proper sub/superset relationship between
HTML and XHTML w.r.t. RDFa; but given that RDFa in HTML is essentially
undefined, anything is possible. At the moment, I believe that it is
entirely likely that an RDFa parser such as jquery.rdfa.js would pick up
*more* triples in a document which varied the case of a number of
elements if those documents were served as text/html vs the same
documents served as application/xhtml+xml.
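To make the case-sensitivity point concrete, here is a minimal sketch in Python. It is an illustration under stated assumptions, not a description of how jquery.rdfa.js actually works: the consumer is a hypothetical script that only looks at elements named "span" carrying a property attribute, Python's html.parser stands in for text/html processing (it is not the HTML5 parsing algorithm), and xml.etree.ElementTree stands in for application/xhtml+xml processing. The names SOURCE, PropertyCollector, properties_as_html and properties_as_xml are made up for this example.

```python
from html.parser import HTMLParser
import xml.etree.ElementTree as ET

# Hypothetical well-formed source that varies the case of one element name.
SOURCE = (
    '<div>'
    '<SPAN property="dc:title">A</SPAN>'
    '<span property="dc:creator">B</span>'
    '</div>'
)


class PropertyCollector(HTMLParser):
    """Collects property values from span elements, HTML style."""

    def __init__(self):
        super().__init__()
        self.found = []

    def handle_starttag(self, tag, attrs):
        # html.parser reports tag and attribute names in lower case,
        # so <SPAN> and <span> both arrive here as "span".
        if tag == "span":
            for name, value in attrs:
                if name == "property":
                    self.found.append(value)


def properties_as_html(source):
    """Process the source the way a case-insensitive HTML consumer might."""
    collector = PropertyCollector()
    collector.feed(source)
    collector.close()
    return collector.found


def properties_as_xml(source):
    """Process the same source as XML, where element names keep their case."""
    root = ET.fromstring(source)
    # iter("span") matches the exact name only; <SPAN> is a different element.
    return [
        element.get("property")
        for element in root.iter("span")
        if element.get("property") is not None
    ]


if __name__ == "__main__":
    print(properties_as_html(SOURCE))
    print(properties_as_xml(SOURCE))
```

With this particular input the HTML-style pass reports both dc:title and dc:creator, while the XML pass reports only dc:creator, so the text/html reading yields the larger set rather than a subset.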
> 2) (in the fullness of time) it would be helpful if there were a
> validator which identified documents which cause different triples to be
> produced. If such a tool also identifies other conformance issues with
> the document, it would be helpful to have an option to turn the
> reporting of such issues off as with many documents this will obscure
> the set of errors that affect the production of triples.
>
> And I'm sure there will be, in the fullness of time
>
> 3) Test cases should be produced with the goal of ensuring that parsers
> looking to produce RDF triples (whether it be from microformats,
> microdata, or RDFa) respect the MIME type of the document.
>
> I believe that test cases have already been added in this regard.

Got a URL handy?

>
> -- manu
>
> - Sam Ruby
>
> (*) I don't know whether this is an oversight, or even a problem, but
> when looking into the HTML5 draft, I couldn't find where itemprop
> attributes have any effect on the RDF triples produced.
>
> The section discussing how to generate RDF from microdata.

- Sam Ruby
Received on Wednesday, 27 May 2009 13:53:41 UTC