Re: GRDDL and HTML from Harry Halpin on 2006-11-27 (public-grddl-comments@w3.org from October to December 2006)

From: Harry Halpin <hhalpin@ibiblio.org>
Date: Mon, 27 Nov 2006 12:07:01 -0500 (EST)
To: Norman Gray <norman@astro.gla.ac.uk>
Cc: public-grddl-comments@w3.org
Message-ID: <Pine.LNX.4.64.0611271151070.32765@tribal.metalab.unc.edu>
By all means continue with this line of comments - again, the GRDDL WG has 
explicitly not made any decision on how to treat GRDDL in non-XML HTML 
(and, it has not made a decision on how to deal with RDF served as 
"application/xml" in the context of GRDDL), so it's open season for 
comments on both these issues.

A few brief comments are inline:

On Mon, 27 Nov 2006, Norman Gray wrote:

>
>
> Harry and all, hello.
>
> Thanks for your comments.
>
> Oh dear: the following's become rather long, but it does have suggested text 
> in it....
>
> On 2006 Nov 27 , at 08.18, Harry Halpin wrote:
>
>> 	As for your comments on non-XML and HTML, it does appear that since 
>> GRDDL is defined over the XPath Data Model, [...]
>
> That's not terrifically explicit in the GRDDL specification: "XPath" doesn't 
> appear anywhere in the document, and it talks throughout of X(HT)ML 
> _documents_.  It's broadly implied by the fact that GRDDL is specified as an 
> XSLT transform, but that's all as far as I can see.

 	This should be flagged, and it is a debate we're having, although
I might add that we have decided that GRDDL, while it *can* be
specified as an XSLT transform, may also be specified in other languages [1].

> Why or when (just by the way) would you want to get XML from somewhere other 
> than a document?  I can think of a few fairly wacky scenarios, but one surely 
> reasonable one is where you're using a GRDDL processor to grub through an XML 
> database.  Perhaps you have a collection of rather heterogeneous data objects 
> in an XML database, and you decide that you can most neatly manage metadata 
> there by including GRDDL transformations in strategic places.  If you want to 
> do that, then you'd be wise not to serialise and reparse the contents, but 
> pipe the database contents straight into a SAX stream, and plug that into 
> your transformer. [thinks: hmm, _I_ have a heterogeneous XML database, and it 
> just now occurs to me that this scenario may not be fanciful after all ...]

 	Yes, this is a reasonable case and thanks for bringing it up, as 
this sort of thing is becoming popular as well.
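
To make the database scenario concrete, here is a minimal sketch in Python of feeding SAX events straight from an in-memory record into a handler, with no serialise-and-reparse step. The record layout (nested `(name, attrs, children)` tuples), the `emit` helper, the `TransformationCollector` class and the example.org URL are all hypothetical; the namespace is the GRDDL data-view namespace as I understand it from the spec. A real processor would go on to apply the transformations it finds.

```python
from xml.sax.handler import ContentHandler
from xml.sax.xmlreader import AttributesNSImpl

# GRDDL namespace used for the data-view:transformation attribute.
GRDDL_NS = "http://www.w3.org/2003/g/data-view#"


class TransformationCollector(ContentHandler):
    """Collects transformation URIs seen on any element in the event stream."""

    def __init__(self):
        super().__init__()
        self.transformations = []

    def startElementNS(self, name, qname, attrs):
        value = attrs.get((GRDDL_NS, "transformation"))
        if value:
            # The attribute holds a space-separated list of URIs.
            self.transformations.extend(value.split())


def emit(node, handler):
    """Fire SAX events for a hypothetical (name, attrs, children) record."""
    name, attrs, children = node
    # Simplification: use the local name as the qname for every attribute.
    qnames = {key: key[1] for key in attrs}
    handler.startElementNS(name, name[1], AttributesNSImpl(attrs, qnames))
    for child in children:
        if isinstance(child, str):
            handler.characters(child)
        else:
            emit(child, handler)
    handler.endElementNS(name, name[1])


# A stand-in for one object pulled out of an XML database.
record = (
    (None, "observation"),
    {(GRDDL_NS, "transformation"): "http://example.org/obs2rdf.xsl"},
    ["some data"],
)

collector = TransformationCollector()
collector.startDocument()
emit(record, collector)
collector.endDocument()
print(collector.transformations)
```

The point of the sketch is only that the handler never sees angle brackets: it works on events, which is why a definition in terms of the data model rather than documents would cover this case.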

> The vCard-to-SAX case is reasonable, too, I think, modulo some subtleties 
> about where the GRDDL declarations actually appear.
>
> I'm not claiming this is a big deal -- I'm not climbing on a hobbyhorse, 
> don't worry -- just that abstracting the definition keeps things a little 
> more flexible for the future.
>
> And XML's not about angle-brackets!

 	Agreed. It's about infosets!
>
>
> More significantly:
>
> You mentioned, Harry:
>
>> However, we might add since there is not a standardized "tagsoup" 
>> algorithm, it makes sense while people *can* pull GRDDL results out of 
>> non-XML HTML, it is much safer to do so with XHTML. So the WG will likely 
>> only fully endorse using GRDDL with XHTML, although we will mention it is 
>> possible to use it with non-XML HTML "at your own risk" in our Use Case 
>> docs.
>
> That seems perfectly reasonable, and come to think of it, it wouldn't be 
> reasonable to talk about `errors' in this context, since it wouldn't be 
> feasible for such a spec to mandate error behaviour in anything but uselessly 
> generic terms.

 	Agreed.

> At the same time, I can't help feeling that saying just `if it's not 
> well-formed, all bets are off' and `it is possible...', while true, is rather 
> avoiding the issue.  In the case of someone generating (X)HTML which wraps 
> third party content (I'm thinking again of Yahoo wrapping user-generated 
> HTML), I think they could reasonably expect the GRDDL spec to give _some_ 
> clue about what ought to happen when they put a valid and metadata-rich 
> wrapper round invalid and RDF-less content.  I think they should also 
> reasonably expect that GRDDL processors _would_ have a go, and if so it would 
> be good for the spec to bless that.  In this case, I think it would be useful 
> to make it clear that if they do emit ill-formed XHTML and end up saying 
> `:your_mother a :hamster.', then it's formally their fault, and no-one's 
> allowed to sue the poor little GRDDL processor, which was only doing its best 
> in adverse circumstances.

 	I think one part of GRDDL is the focus on "the author of a
document states that the transformation will provide a faithful rendition
of the source document, or some portion of the source document, that
preserves its meaning in RDF." [2] This puts the burden on
the author to explicitly license the transform. One line of argument could
be that if the author wanted to license a faithful rendition, they would
want that rendition to be as "deterministic" and unlikely to break as
possible, and that would be one reason to use XHTML instead of tagsoup.
However, a counter-argument would be that many pages are
generated using HTML that "in the small", for a particular set of web pages,
is itself generic and even regular, so that the author would be able to
determine a transformation to RDF and specify it. More comments on this
are very welcome.



>
>
> Thanks for the link in the minutes to Henry Thompson's description[1] of the 
> recent TAG issue TagSoupIntegration-54 [2] -- this seems to be exactly the 
> same problem, and I'm encouraged by the parenthetical remark in:
>
>> Is the indefinite persistence of 'tag soup' HTML* consistent with a
>> sound architecture for the Web?  If so, (and the going-in assumption
>> is that it _is_ so), what changes, if any, to fundamental Web
>> technologies are necessary to integrate 'tag soup' with
>> HTML and well-formed XML?
>
> So while I certainly agree that it's not sensible for GRDDL to break its neck 
> specifying what to do when faced with tag-soup nonsense, and inappropriate to 
> specify a particular parser, I feel there's still probably scope for a couple 
> of formal `shoulds' in there.
>
> How about: ``GRDDL processors SHOULD attempt to recover gracefully from 
> well-formedness or validity errors, and SHOULD retain any RDF generated from 
> this process.  In this case (and only in this case), processors MAY use 
> information about the document type (gained from a Content-Type header or 
> otherwise) to assist in the best-effort parsing of the document.  If such an 
> error-recovery strategy is employed, a GRDDL processor MAY rely on the 
> generated RDF as if it had been extracted from a conformant document.''  That 
> would be coupled with a remark ``Document authors are responsible for the RDF 
> statements generated by a correctly-applied GRDDL transformation, and must be 
> aware that, confronted with ill-formed or invalid XML, GRDDL processors are 
> free to use a range of strategies to recover from errors, and free to rely on 
> the RDF thus generated.''

 	Thanks for the suggested text! We'll take this into account. 
Suggested text is *always* very much welcome.
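
As a rough illustration of the behaviour the suggested text describes, here is a sketch in Python of "strict first, best-effort second". It assumes the XHTML `<link rel="transformation">` convention, skips the head profile check a real GRDDL processor would make, and uses the standard library's lenient `html.parser` as a stand-in for whatever tag-soup recovery a processor might choose; the function and class names are hypothetical, and nothing here is mandated by the spec.

```python
import xml.etree.ElementTree as ET
from html.parser import HTMLParser

XHTML_NS = "{http://www.w3.org/1999/xhtml}"


class LinkScanner(HTMLParser):
    """Lenient scanner: collects href of every <link rel="transformation">."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        rels = (attrs.get("rel") or "").split()
        if tag == "link" and "transformation" in rels and attrs.get("href"):
            self.hrefs.append(attrs["href"])


def find_transformations(text, content_type="application/xhtml+xml"):
    """Return (hrefs, recovered); recovered is True if strict parsing failed."""
    try:
        root = ET.fromstring(text)
    except ET.ParseError:
        # Only in the error case is the media type consulted.
        media_type = content_type.split(";")[0].strip()
        if media_type not in ("text/html", "application/xhtml+xml"):
            return [], True
        scanner = LinkScanner()
        scanner.feed(text)
        scanner.close()
        return scanner.hrefs, True
    hrefs = [
        link.get("href")
        for link in root.iter(XHTML_NS + "link")
        if "transformation" in (link.get("rel") or "").split()
        and link.get("href")
    ]
    return hrefs, False


soup = (
    '<html><head profile="http://www.w3.org/2003/g/data-view">'
    '<link rel="transformation" href="http://example.org/a.xsl">'
    "</head><body><p>unclosed <b>markup</body></html>"
)
print(find_transformations(soup, "text/html"))
```

The `recovered` flag is one way a processor could record that the result came from error recovery, leaving it to the caller to decide how far to rely on it.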

> That doesn't really commit anyone to anything, but it enshrines Postel's law 
> in an appropriate balance of permissions and suggestions, appropriately 
> deprecates ill-formed documents, makes it clear that authors should make 
> their documents valid or bear the consequences, and makes it clear whose 
> fault it is (the author's) if a GRDDL processor relies on RDF statements 
> gleaned from a misunderstanding of an ill-formed document (though most of the 
> time I'm sure this would be just fine).
>
> As a tangential point, what about the case where a GRDDL processor is asked 
> to handle an XHTML document which has a DTD, but which uses the 
> xmlns:data-view technique for linking to the GRDDL transformation?  It's 
> therefore well-formed but invalid.  That case is excluded by both section 2 
> and section 4.  Are all bets off?

 	Do you mean an XHTML DTD that does not allow xmlns:data-view? I
believe that should not be a problem. Could you give us a test case (i.e.
a sample input document and your suggested output or problems that it
brings up)?

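Purely for illustration, one hypothetical shape such a test input might take is sketched below in Python: an XHTML document that declares the XHTML 1.0 Strict DTD but carries a data-view:transformation attribute on its root, so it is well-formed but not valid against that DTD. The example.org URL and the function name are made up, and the namespace is the GRDDL data-view namespace as I understand it from the spec. A non-validating parser reads the attribute without complaint.

```python
import xml.etree.ElementTree as ET

GRDDL_NS = "http://www.w3.org/2003/g/data-view#"

# Well-formed, but invalid against the declared DTD, which does not
# allow the data-view:transformation attribute on <html>.
SAMPLE = """<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN"
  "http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd">
<html xmlns="http://www.w3.org/1999/xhtml"
      xmlns:data-view="http://www.w3.org/2003/g/data-view#"
      data-view:transformation="http://example.org/page2rdf.xsl">
  <head><title>Sample</title></head>
  <body><p>Hello.</p></body>
</html>"""


def declared_transformations(text):
    """Return the transformation URIs named on the root element."""
    # ElementTree does not validate, so the DTD is not consulted here.
    root = ET.fromstring(text)
    value = root.get("{%s}transformation" % GRDDL_NS, "")
    return value.split()


print(declared_transformations(SAMPLE))
```

Whether a validating processor should reject such a document, or proceed as above, is exactly the open question.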
> Is it still appropriate to be commenting on this, or does everyone feel it's 
> resolved or uninteresting?  I get the impression that the telecon notes 
> aren't intended to be Rulings -- is that correct?  I hope this is still of 
> use.

No, the telecon notes are not rulings - they are simply notes, formal 
decisions are always noted by "RESOLVED:" - in the last meeting we only 
resolved two test cases.

> All the best,
>
> Norman
>
>
> [1] http://lists.w3.org/Archives/Public/www-tag/2006Oct/0062.html
> [2] http://www.w3.org/2001/tag/issues.html#TagSoupIntegration-54
>
[1] http://www.w3.org/2004/01/rdxh/spec#issue-whichlangs
[2] http://www.w3.org/TR/grddl/

 				--harry

 	Harry Halpin
 	Informatics, University of Edinburgh
         http://www.ibiblio.org/hhalpin
Received on Monday, 27 November 2006 18:11:49 UTC