Re: GRDDL and HTML from Harry Halpin on 2006-11-27 (public-grddl-comments@w3.org from October to December 2006)

From: Harry Halpin <hhalpin@ibiblio.org>
Date: Mon, 27 Nov 2006 03:18:38 -0500 (EST)
To: Norman Gray <norman@astro.gla.ac.uk>
Cc: public-grddl-comments@w3.org
Message-ID: <Pine.LNX.4.64.0611270309190.16124@tribal.metalab.unc.edu>

Norman,
 	We discussed your comments at our last telecon (draft minutes 
[1]). There is broad support amongst members of the GRDDL WG for letting 
RDF served as "application/xml" be used by GRDDL, but the details will be 
worked out in test-cases, and once we have the test-cases done we'll try 
to get consensus on the issue.
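
 	For illustration only (this is not WG text): the set of XML media 
types you list below (text/xml, application/xml, and anything ending in 
+xml) can be checked in a few lines of Python. The function name is 
hypothetical, and whether GRDDL should accept exactly this set is the 
question still to be settled in the test-cases.

```python
def is_xml_media_type(content_type):
    """Return True for text/xml, application/xml, and */*+xml.

    Hypothetical helper; the set of types comes from Norman's
    comment, not from the GRDDL spec.
    """
    # Drop parameters such as "; charset=utf-8" and normalise case.
    media_type = content_type.split(";", 1)[0].strip().lower()
    if media_type in ("text/xml", "application/xml"):
        return True
    # Structured-syntax suffix, e.g. application/rdf+xml.
    return "/" in media_type and media_type.endswith("+xml")
```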

 	As for your comments on non-XML and HTML: since GRDDL is defined 
over the XPath Data Model, it does appear that however you get that data 
model out of the data (be it tidy, tagsoup, etc.), one can in practice 
use a GRDDL transform. We will mention this use-case in the next edition 
of our Use Case document (current version [2]). However, we might add 
that since there is no standardized "tagsoup" algorithm, while people 
*can* pull GRDDL results out of non-XML HTML, it is much safer to do so 
with XHTML. So the WG will likely only fully endorse using GRDDL with 
XHTML, although we will mention that it is possible to use it with 
non-XML HTML "at your own risk" in our Use Case docs.
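
 	As a rough sketch of the "at your own risk" point, assuming 
Python's standard library as a stand-in (it is not TagSoup or tidy, and 
it does not build an XPath data model): a strict XML parser rejects the 
opening of the malformed page you quote below, while a lenient HTML 
parser still recovers the profile and the transformation link. The class 
and function names are made up for the example.

```python
import xml.etree.ElementTree as ET
from html.parser import HTMLParser

# Trimmed from the opening lines of the malformed page quoted below.
SOUP = """<head profile=http://www.w3.org/2003/g/data-view>
<link rel=transformation
href=http://www.w3.org/2003/12/rdf-in-xhtml-xslts/grokFOAF.xsl>
"""


class GrddlLinkFinder(HTMLParser):
    """Hypothetical lenient scanner: records the head profile and any
    <link rel="transformation"> targets, ignoring everything else."""

    def __init__(self):
        super().__init__()
        self.profile = None
        self.transformations = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "head":
            self.profile = attrs.get("profile")
        elif tag == "link" and "transformation" in (attrs.get("rel") or "").split():
            self.transformations.append(attrs.get("href"))


def strict_parse_ok(markup):
    """Return True only if the markup is well-formed XML."""
    try:
        ET.fromstring(markup)
        return True
    except ET.ParseError:
        return False


finder = GrddlLinkFinder()
finder.feed(SOUP)
finder.close()

print("well-formed XML:", strict_parse_ok(SOUP))
print("profile:", finder.profile)
print("transformations:", finder.transformations)
```

Two lenient parsers need not agree on how to repair the same input, 
which is the reason for preferring XHTML.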

 	Thanks for your insightful and detailed comments, and we hope to 
hear from you again!


[1]http://lists.w3.org/Archives/Public/public-grddl-wg/2006Nov/0096.html
[2]http://www.w3.org/2001/sw/grddl-wg/doc43/scenario-gallery.htm

  On Tue, 14 Nov 2006, Norman Gray wrote:

>
>
> Greetings.
>
> I was talking to Harry Halpin last week, about GRDDL and HTML, and he 
> suggested I post a couple of comments here.  I haven't been following the 
> list GRDDL discussions in detail, so apologies in advance if I'm 
> misunderstanding an issue.
>
> * media types and content sniffing: RFC2616 section 7.2.1 says that you may 
> guess the media type if the content-type header is absent, implying (very 
> nearly unambiguously) that you must not if the header is present.  However 
> sniffing the XML header to determine which type of GRDDL transformation is 
> present, if any, doesn't violate this  [I imagine I'm misunderstanding this 
> as an issue, but I said I'd look up the reference].
>
> * media types, 2: the spec refers to transformations on well-formed XML 
> documents, and specifically XHTML.  I presume that this refers to all 
> documents with media types text/xml, application/xml, and */*+xml -- would 
> that be correct?  Would it be worth making explicit?
>
> * error behaviour: The GRDDL spec doesn't say what a GRDDL processor should 
> do if fed something which isn't one of these media types, or which purports 
> to be but isn't, or isn't well-formed.  Ought it to discuss this? 
> Possibilities would include may/must one of halt and catch fire, signal an 
> error, produce an empty model, do the best it can; or leave it explicitly 
> unspecified or implementation specified.  I'm not convinced it's necessary 
> for the spec to include this, but it might be of interest to services (such 
> as Yahoo, say), which might want to add a transformation at the top of 
> documents that wrap user content.  Can they expect that their content will be 
> parsed right up to the point where the user's malformed content starts, for 
> example?
>
> * media types and HTML: The only MIME type described for XHTML is text/html 
> <http://www.w3.org/TR/2000/REC-xhtml1-20000126/#media>.  Now, there might 
> still be one or two documents out there on the web which are not well-formed 
> XHTML, but which are served as text/html.  A GRDDL processor might just treat 
> this as an error, and recover or object as appropriate.  However, it can 
> probably do better, since John Cowan's TagSoup parser 
> <http://home.ccil.org/~cowan/XML/tagsoup/> will take any old nonsense, and 
> produce from it a (well-structured) SAX stream.
>
> Consider for example Joe Kappa's homepage (hold your nose: not pretty), which 
> starts:
>
>> <head profile=http://www.w3.org/2003/g/data-view>
>> Joe Lambda's Home page [an example of RDF in XHTML]
>> <link rel=transformation 
>> href=http://www.w3.org/2003/12/rdf-in-xhtml-xslts/grokFOAF.xsl>
>> 
>> <div class=foaf-person>
>> <h1>Joe Lambda's homepage</h2>
>> 
>> <strong>Note: this should obsolete the <a
>> href="/2003/12/rdf-in-xhtml-xslts/complete-example.html">older version
>> presently.</a>
>> 
>> <p>Hi there, my name is <span class=foaf-name>Joe Lambda</span>, and I work
>> at <a href="http://www.acme.com" rel="foaf-work">ACME Inc.</a>. You can
>> contact me by email at <a
>> href=mailto:joe.lambda@example.org>joe.lambda@example.org</a>, or get more
>> info on my <a href="http://www.example.org/~jlambda/"
>> rel="foaf-home">personal home page
>> 
>> <h2>People I know
>>  <li><a href="http://www.example.org/~bfoo/" rel="foaf-knows">Bill
>>  Foo</li></a>
>>  <li><a href="mailto:gbaz@example.com" rel="foaf-knows">G. Baz</a>
>> </div>
>> ...
>
> Processing with
>
>   % java -jar tsaxon.jar -H joe-kappa.html grokFOAF.xsl >joe-kappa.rdf
>
> (tsaxon is a convenience version of Saxon that John Cowan distributes with 
> his TagSoup built in) produces output which is identical to that produced 
> from Joe Lambda's page, modulo a redundant charset declaration.
>
> This isn't just a curiosity.  There's a lot of this sort of stuff on the web, 
> some of it surely being claimed as XHTML, and Postel's law says that a GRDDL 
> processor should probably try to cope with it without puking.  Such a 
> processor might try to recover from parser errors, or could just use TagSoup 
> for text/html content and handle anything.  If the resulting RDF is wrong 
> (ie, not what the author intended), then this isn't the processor's fault. 
> This seems to fit in with GRDDL's pragmatic motivations.
>
> * SAX streams: One could make this a little more abstract, and avoid 
> mentioning a specific parser, by specifying GRDDL behaviour as acting on a 
> SAX stream (or another post-parse data model, such as the Infoset or a DOM). 
> That opens the door to strategies like using TagSoup for text/html content, 
> but also things like vcard4j <http://vcard4j.sourceforge.net/>, which 
> produces a DOM from vcard input (text/directory media type).
>
> I hope these comments are useful.
>
> All the best,
>
> Norman
>
>
>

-- 
 				--harry

 	Harry Halpin
 	Informatics, University of Edinburgh
         http://www.ibiblio.org/hhalpin
Received on Monday, 27 November 2006 08:19:07 UTC