- From: Norman Gray <norman@astro.gla.ac.uk>
- Date: Tue, 14 Nov 2006 10:26:12 +0000
- To: public-grddl-comments@w3.org
Greetings.

I was talking to Harry Halpin last week, about GRDDL and HTML, and he suggested I post a couple of comments here. I haven't been following the list GRDDL discussions in detail, so apologies in advance if I'm misunderstanding an issue.

* media types and content sniffing: RFC2616 section 7.2.1 says that you may guess the media type if the content-type header is absent, implying (very nearly unambiguously) that you must not if the header is present. However sniffing the XML header to determine which type of GRDDL transformation is present, if any, doesn't violate this [I imagine I'm misunderstanding this as an issue, but I said I'd look up the reference].

* media types, 2: the spec refers to transformations on well-formed XML documents, and specifically XHTML. I presume that this refers to all documents with media types text/xml, application/xml, and */*+xml -- would that be correct? Would it be worth making explicit?

* error behaviour: The GRDDL spec doesn't say what a GRDDL processor should do if fed something which isn't one of these media types, or which purports to be but isn't, or isn't well-formed. Ought it to discuss this? Possibilities would include may/must one of halt and catch fire, signal an error, produce an empty model, do the best it can; or leave it explicitly unspecified or implementation specified. I'm not convinced it's necessary for the spec to include this, but it might be of interest to services (such as Yahoo, say), which might want to add a transformation at the top of documents that wrap user content. Can they expect that their content will be parsed right up to the point where the user's malformed content starts, for example?

* media types and HTML: The only MIME type described for XHTML is text/html <http://www.w3.org/TR/2000/REC-xhtml1-20000126/#media>. Now, there might still be one or two documents out there on the web which are not well-formed XHTML, but which are served as text/html.
A GRDDL processor might just treat this as an error, and recover or object as appropriate. However, it can probably do better, since John Cowan's TagSoup parser <http://home.ccil.org/~cowan/XML/tagsoup/> will take any old nonsense, and produce from it a (well-structured) SAX stream. Consider for example Joe Kappa's homepage (hold your nose: not pretty), which starts:

> <head profile=http://www.w3.org/2003/g/data-view>
> Joe Lambda's Home page [an example of RDF in XHTML]
> <link rel=transformation href=http://www.w3.org/2003/12/rdf-in-xhtml-xslts/grokFOAF.xsl>
>
> <div class=foaf-person>
> <h1>Joe Lambda's homepage</h2>
>
> <strong>Note: this should obsolete the <a
> href="/2003/12/rdf-in-xhtml-xslts/complete-example.html">older version
> presently.</a>
>
> <p>Hi there, my name is <span class=foaf-name>Joe Lambda</span>, and I work
> at <a href="http://www.acme.com" rel="foaf-work">ACME Inc.</a>. You can
> contact me by email at <a
> href=mailto:joe.lambda@example.org>joe.lambda@example.org</a>, or get more
> info on my <a href="http://www.example.org/~jlambda/"
> rel="foaf-home">personal home page
>
> <h2>People I know
> <li><a href="http://www.example.org/~bfoo/" rel="foaf-knows">Bill Foo</li></a>
> <li><a href="mailto:gbaz@example.com" rel="foaf-knows">G. Baz</a>
> </div>
> ...

Processing with

    % java -jar tsaxon.jar -H joe-kappa.html grokFOAF.xsl >joe-kappa.rdf

(tsaxon is a convenience version of Saxon that John Cowan distributes with his TagSoup built in) produces output which is identical to that produced from Joe Lambda's page, modulo a redundant charset declaration.

This isn't just a curiosity. There's a lot of this sort of stuff on the web, some of it surely being claimed as XHTML, and Postel's law says that a GRDDL processor should probably try to cope with it without puking. Such a processor might try to recover from parser errors, or could just use TagSoup for text/html content and handle anything.
If the resulting RDF is wrong (ie, not what the author intended), then this isn't the processor's fault. This seems to fit in with GRDDL's pragmatic motivations.

* SAX streams: One could make this a little more abstract, and avoid mentioning a specific parser, by specifying GRDDL behaviour as acting on a SAX stream (or another post-parse data model, such as the Infoset or a DOM). That opens the door to strategies like using TagSoup for text/html content, but also things like vcard4j <http://vcard4j.sourceforge.net/>, which produces a DOM from vcard input (text/directory media type).

I hope these comments are useful.

All the best,

Norman

-- 
----------------------------------------------------------------------------
Norman Gray  /  http://nxg.me.uk
eurovotech.org  /  University of Leicester, UK
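
The media-type points above can be made concrete with a small sketch. It is only an illustration, not anything the GRDDL spec defines: the function name parse_strategy and the strategy labels "sniff", "xml", "tagsoup" and "unsupported" are hypothetical, and the set of XML media types is the one presumed in the message (text/xml, application/xml, and anything ending in +xml).

```python
def parse_strategy(content_type):
    """Choose a parsing strategy from a Content-Type header value.

    Hypothetical helper: the strategy labels are illustrative only.
    """
    if not content_type:
        # RFC 2616 section 7.2.1 allows guessing only when the header is absent.
        return "sniff"
    # Drop parameters such as "; charset=utf-8" and normalise case.
    media_type = content_type.split(";", 1)[0].strip().lower()
    if media_type in ("text/xml", "application/xml") or media_type.endswith("+xml"):
        # Declared XML: use a strict, well-formedness-checking parser.
        return "xml"
    if media_type == "text/html":
        # Possibly tag soup: use a lenient parser that still yields a SAX-like stream.
        return "tagsoup"
    # Anything else (text/directory, say) needs its own front end, if any.
    return "unsupported"


if __name__ == "__main__":
    for header in (None, "application/xhtml+xml; charset=utf-8", "text/html", "text/directory"):
        print(header, "->", parse_strategy(header))
```

Keeping the decision in one place like this means a further front end (for instance one for text/directory) could be added as another branch without touching the transformation step, which only ever sees the post-parse stream.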
Received on Tuesday, 14 November 2006 10:26:31 UTC