GRDDL and HTML from Norman Gray on 2006-11-14 (public-grddl-comments@w3.org from October to December 2006)

From: Norman Gray <norman@astro.gla.ac.uk>
Date: Tue, 14 Nov 2006 10:26:12 +0000
To: public-grddl-comments@w3.org
Message-Id: <309CB289-4A09-4ECD-AC3A-1C1D3A688C49@astro.gla.ac.uk>
Greetings.

I was talking to Harry Halpin last week about GRDDL and HTML, and he
suggested I post a couple of comments here.  I haven't been following
the GRDDL discussions on the list in detail, so apologies in advance
if I'm misunderstanding an issue.

* media types and content sniffing: RFC 2616 section 7.2.1 says that
you may guess the media type if the Content-Type header is absent,
implying (very nearly unambiguously) that you must not if the header
is present.  However, sniffing the XML header to determine which type
of GRDDL transformation is present, if any, doesn't violate this.  [I
imagine I'm misunderstanding this as an issue, but I said I'd look up
the reference.]
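
A minimal sketch of that reading of RFC 2616 section 7.2.1, in Python: the declared type is trusted whenever a Content-Type header is present, and the content is inspected only when it is absent. The function name and the particular guessing rules are assumptions made for illustration; they come from neither the RFC nor the GRDDL spec.

```python
from typing import Mapping, Optional


def effective_media_type(headers: Mapping[str, str], body: bytes) -> Optional[str]:
    """Return the declared media type, guessing only if none was declared."""
    for name, value in headers.items():
        if name.lower() == "content-type" and value.strip():
            # Header present: take it at its word; drop parameters such as charset.
            return value.split(";", 1)[0].strip().lower()
    # Header absent: a guess from the content is permitted.
    # These prefixes are illustrative assumptions, not a complete sniffer.
    head = body.lstrip()[:64].lower()
    if head.startswith(b"<?xml"):
        return "application/xml"
    if head.startswith(b"<!doctype html") or head.startswith(b"<html"):
        return "text/html"
    return None
```

With this shape, a document served as text/html is never reclassified because it happens to begin with an XML declaration.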

* media types, 2: the spec refers to transformations on well-formed  
XML documents, and specifically XHTML.  I presume that this refers to  
all documents with media types text/xml, application/xml, and */*+xml  
-- would that be correct?  Would it be worth making explicit?
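
If the answer is yes, the test could be sketched as below. This is an assumption about what the spec might mean, not something it currently states; the function name is hypothetical.

```python
def is_xml_media_type(content_type: str) -> bool:
    """True for text/xml, application/xml, and any type/subtype+xml."""
    # Strip parameters (e.g. "; charset=utf-8") and normalise case.
    essence = content_type.split(";", 1)[0].strip().lower()
    if essence in ("text/xml", "application/xml"):
        return True
    type_, sep, subtype = essence.partition("/")
    # Require a non-empty type and a subtype with something before "+xml".
    return bool(sep) and bool(type_) and subtype.endswith("+xml") and len(subtype) > len("+xml")
```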

* error behaviour: The GRDDL spec doesn't say what a GRDDL processor  
should do if fed something which isn't one of these media types, or  
which purports to be but isn't, or isn't well-formed.  Ought it to  
discuss this?  Possibilities would include saying that it may/must do
one of: halt and catch fire, signal an error, produce an empty model,
or do the best it can; or the behaviour could be left explicitly
unspecified or implementation-defined.
I'm not convinced it's necessary for the spec to include this, but it  
might be of interest to services (such as Yahoo, say), which might  
want to add a transformation at the top of documents that wrap user  
content.  Can they expect that their content will be parsed right up  
to the point where the user's malformed content starts, for example?
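
To make the last question concrete, here is a sketch of one possible "do the best it can" behaviour, using the SAX parser from Python's standard library: events are delivered as the document is read, so whatever precedes the first well-formedness error has already been seen when the parser gives up. The class and function names are hypothetical, and nothing here is required by the GRDDL spec.

```python
import xml.sax


class ElementLog(xml.sax.ContentHandler):
    """Record the name of every element whose start tag is reported."""

    def __init__(self):
        super().__init__()
        self.started = []

    def startElement(self, name, attrs):
        self.started.append(name)


def parse_until_error(document: bytes):
    """Return (elements seen so far, the parse error or None)."""
    log = ElementLog()
    error = None
    try:
        xml.sax.parseString(document, log)
    except xml.sax.SAXParseException as exc:
        # The parser stops here, but earlier events were already delivered.
        error = exc
    return log.started, error


# A wrapper with a transformation link at the top, followed by
# user content whose tags are wrongly nested.
WRAPPED = (b"<html><head><link rel='transformation' href='t.xsl'/></head>"
           b"<body><p>fine</p><b><i>oops</b></i></body></html>")
started, error = parse_until_error(WRAPPED)
```

Whether a processor should act on the partial result, discard it, or report the error is exactly the policy question raised above.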

* media types and HTML: The only MIME type described for XHTML is  
text/html <http://www.w3.org/TR/2000/REC-xhtml1-20000126/#media>.   
Now, there might still be one or two documents out there on the web  
which are not well-formed XHTML, but which are served as text/html.   
A GRDDL processor might just treat this as an error, and recover or  
object as appropriate.  However, it can probably do better, since  
John Cowan's TagSoup parser <http://home.ccil.org/~cowan/XML/tagsoup/>
will take any old nonsense, and produce from it a (well-structured)
SAX stream.

Consider for example Joe Kappa's homepage (hold your nose: not  
pretty), which starts:

> <head profile=http://www.w3.org/2003/g/data-view>
> Joe Lambda's Home page [an example of RDF in XHTML]
> <link rel=transformation href=http://www.w3.org/2003/12/rdf-in-xhtml-xslts/grokFOAF.xsl>
>
> <div class=foaf-person>
> <h1>Joe Lambda's homepage</h2>
>
> <strong>Note: this should obsolete the <a
> href="/2003/12/rdf-in-xhtml-xslts/complete-example.html">older version
> presently.</a>
>
> <p>Hi there, my name is <span class=foaf-name>Joe Lambda</span>, and I work
> at <a href="http://www.acme.com" rel="foaf-work">ACME Inc.</a>. You can
> contact me by email at <a
> href=mailto:joe.lambda@example.org>joe.lambda@example.org</a>, or get more
> info on my <a href="http://www.example.org/~jlambda/"
> rel="foaf-home">personal home page
>
> <h2>People I know
>   <li><a href="http://www.example.org/~bfoo/" rel="foaf-knows">Bill
>   Foo</li></a>
>   <li><a href="mailto:gbaz@example.com" rel="foaf-knows">G. Baz</a>
> </div>
> ...

Processing with

     % java -jar tsaxon.jar -H joe-kappa.html grokFOAF.xsl >joe-kappa.rdf

(tsaxon is a convenience version of Saxon that John Cowan distributes  
with his TagSoup built in) produces output which is identical to that  
produced from Joe Lambda's page, modulo a redundant charset declaration.

This isn't just a curiosity.  There's a lot of this sort of stuff on  
the web, some of it surely being claimed as XHTML, and Postel's law  
says that a GRDDL processor should probably try to cope with it  
without puking.  Such a processor might try to recover from parser  
errors, or could just use TagSoup for text/html content and handle  
anything.  If the resulting RDF is wrong (ie, not what the author  
intended), then this isn't the processor's fault.  This seems to fit  
in with GRDDL's pragmatic motivations.
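
As an illustration of how little a tolerant parser needs, the sketch below feeds the opening lines of the page quoted above to Python's standard-library html.parser, which is not TagSoup but is similarly forgiving of unquoted attributes and mismatched tags. It pulls out only the head profile and the transformation link; the class name is hypothetical, and restricting the check to link elements is a simplification.

```python
from html.parser import HTMLParser


class GrddlHints(HTMLParser):
    """Collect the head profile and transformation links from tag soup."""

    def __init__(self):
        super().__init__()
        self.profiles = []
        self.transformations = []

    def handle_starttag(self, tag, attrs):
        attributes = dict(attrs)
        if tag == "head" and attributes.get("profile"):
            # The profile attribute is a space-separated list of URIs.
            self.profiles.extend(attributes["profile"].split())
        rel_tokens = (attributes.get("rel") or "").split()
        if tag == "link" and "transformation" in rel_tokens and attributes.get("href"):
            self.transformations.append(attributes["href"])


SOUP = """<head profile=http://www.w3.org/2003/g/data-view>
Joe Lambda's Home page [an example of RDF in XHTML]
<link rel=transformation href=http://www.w3.org/2003/12/rdf-in-xhtml-xslts/grokFOAF.xsl>

<div class=foaf-person>
<h1>Joe Lambda's homepage</h2>
"""

hints = GrddlHints()
hints.feed(SOUP)
hints.close()
```

After the run, hints.profiles and hints.transformations hold the two URIs from the quoted markup, despite the missing quotes and the h1/h2 mismatch.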

* SAX streams: One could make this a little more abstract, and avoid  
mentioning a specific parser, by specifying GRDDL behaviour as acting  
on a SAX stream (or another post-parse data model, such as the  
Infoset or a DOM).  That opens the door to strategies like using  
TagSoup for text/html content, but also things like vcard4j
<http://vcard4j.sourceforge.net/>, which produces a DOM from vCard input
(text/directory media type).
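
A rough sketch of that abstraction, again in Python: each media type is mapped to a front end that produces the same simple event list, and the GRDDL-specific step consumes only that list. The registry, the event shape, and all names are assumptions for illustration; html.parser stands in for a TagSoup-like front end.

```python
import xml.sax
from html.parser import HTMLParser
from typing import Callable, Dict, List, Tuple

Event = Tuple[str, Dict[str, str]]  # (element name, attributes)


def xml_events(data: bytes) -> List[Event]:
    """Strict front end: raises on input that is not well-formed."""
    events: List[Event] = []

    class Handler(xml.sax.ContentHandler):
        def startElement(self, name, attrs):
            events.append((name, dict(attrs.items())))

    xml.sax.parseString(data, Handler())
    return events


def html_events(data: bytes) -> List[Event]:
    """Lenient front end: accepts tag soup."""
    events: List[Event] = []

    class Handler(HTMLParser):
        def handle_starttag(self, tag, attrs):
            events.append((tag, {k: v or "" for k, v in attrs}))

    parser = Handler()
    parser.feed(data.decode("utf-8", errors="replace"))
    parser.close()
    return events


# Hypothetical registry; other front ends (e.g. for vCard) could be added.
FRONT_ENDS: Dict[str, Callable[[bytes], List[Event]]] = {
    "application/xml": xml_events,
    "application/xhtml+xml": xml_events,
    "text/html": html_events,
}


def transformations(media_type: str, data: bytes) -> List[str]:
    """Find transformation links without caring which parser ran."""
    events = FRONT_ENDS[media_type](data)
    return [attrs["href"] for name, attrs in events
            if name == "link" and "href" in attrs
            and "transformation" in attrs.get("rel", "").split()]
```

The point of the design is that transformations() never sees markup, only events, so the choice of parser becomes a deployment detail rather than part of the specified behaviour.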

I hope these comments are useful.

All the best,

Norman


-- 
----------------------------------------------------------------------------
Norman Gray  /  http://nxg.me.uk
eurovotech.org  /  University of Leicester, UK
Received on Tuesday, 14 November 2006 10:26:31 UTC