Re: GRDDL and HTML from Norman Gray on 2006-11-27 (public-grddl-comments@w3.org from October to December 2006)

From: Norman Gray <norman@astro.gla.ac.uk>
Date: Mon, 27 Nov 2006 16:19:01 +0000
To: Harry Halpin <hhalpin@ibiblio.org>
Cc: public-grddl-comments@w3.org
Message-Id: <D79B0BB9-405E-46F8-B672-A071C2105FAC@astro.gla.ac.uk>

Harry and all, hello.

Thanks for your comments.

Oh dear: the following's become rather long, but it does have  
suggested text in it....

On 2006 Nov 27 , at 08.18, Harry Halpin wrote:

> 	As for your comments on non-XML and HTML, it does appear that  
> since GRDDL is defined over the XPath Data Model, [...]

That's not terrifically explicit in the GRDDL specification: "XPath"  
doesn't appear anywhere in the document, and it talks throughout of  
X(HT)ML _documents_.  It's broadly implied by the fact that GRDDL is  
specified as an XSLT transform, but that's all as far as I can see.

Why or when (just by the way) would you want to get XML from  
somewhere other than a document?  I can think of a few fairly wacky  
scenarios, but one surely reasonable one is where you're using a  
GRDDL processor to grub through an XML database.  Perhaps you have a  
collection of rather heterogeneous data objects in an XML database,  
and you decide that you can most neatly manage metadata there by  
including GRDDL transformations in strategic places.  If you want to  
do that, then you'd be wise not to serialise and reparse the  
contents, but pipe the database contents straight into a SAX stream,  
and plug that into your transformer. [thinks: hmm, _I_ have a  
heterogeneous XML database, and it just now occurs to me that this  
scenario may not be fanciful after all ...]
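
A minimal sketch of that idea, in Python and under stated assumptions:  
RECORDS is a hypothetical stand-in for the database, and TitleCollector  
is a hypothetical stand-in for the transformer.  The only point is that  
the consumer receives SAX events directly, and nothing is serialised or  
reparsed on the way.

```python
from xml.sax.handler import ContentHandler
from xml.sax.xmlreader import AttributesImpl

# Hypothetical stand-in for objects held in an XML database.
RECORDS = [
    {"id": "obj1", "title": "First object"},
    {"id": "obj2", "title": "Second object"},
]


class TitleCollector(ContentHandler):
    """Stand-in for the transformer: gathers (id, title) pairs from SAX events."""

    def __init__(self):
        super().__init__()
        self.pairs = []
        self._id = None
        self._text = None

    def startElement(self, name, attrs):
        if name == "object":
            self._id = attrs.getValue("id")
        elif name == "title":
            self._text = []

    def characters(self, content):
        if self._text is not None:
            self._text.append(content)

    def endElement(self, name):
        if name == "title":
            self.pairs.append((self._id, "".join(self._text)))
            self._text = None


def emit_events(records, handler):
    """Drive the handler straight from the stored records (no text, no parser)."""
    handler.startDocument()
    handler.startElement("collection", AttributesImpl({}))
    for rec in records:
        handler.startElement("object", AttributesImpl({"id": rec["id"]}))
        handler.startElement("title", AttributesImpl({}))
        handler.characters(rec["title"])
        handler.endElement("title")
        handler.endElement("object")
    handler.endElement("collection")
    handler.endDocument()


collector = TitleCollector()
emit_events(RECORDS, collector)
```

A real set-up would presumably hand the same events to an XSLT engine  
that accepts SAX input; the collector here only shows the plumbing.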

The vCard-to-SAX case is reasonable, too, I think, modulo some  
subtleties about where the GRDDL declarations actually appear.

I'm not claiming this is a big deal -- I'm not climbing on a  
hobbyhorse, don't worry -- just that abstracting the definition keeps  
things a little more flexible for the future.

And XML's not about angle-brackets!



More significantly:

You mentioned, Harry:

> However, we might add since there is not a standardized "tagsoup"  
> algorithm, it makes sense while people *can* pull GRDDL results out  
> of non-XML HTML, it is much safer to do so with XHTML. So the WG  
> will likely only fully endorse using GRDDL with XHTML, although we  
> will mention it is possible to use it with non-XML HTML "at your  
> own risk" in our Use Case docs.

That seems perfectly reasonable, and come to think of it, it wouldn't  
be reasonable to talk about `errors' in this context, since it  
wouldn't be feasible for such a spec to mandate error behaviour in  
anything but uselessly generic terms.

At the same time, I can't help feeling that saying just `if it's not  
well-formed, all bets are off' and `it is possible...', while true,  
is rather avoiding the issue.  In the case of someone generating  
(X)HTML which wraps third party content (I'm thinking again of Yahoo  
wrapping user-generated HTML), I think they could reasonably expect  
the GRDDL spec to give _some_ clue about what ought to happen when  
they put a valid and metadata-rich wrapper round invalid and RDF-less  
content.  I think they should also reasonably expect that GRDDL  
processors _would_ have a go, and if so it would be good for the spec  
to bless that.  In this case, I think it would be useful to make it  
clear that if they do emit ill-formed XHTML and end up saying  
`:your_mother a :hamster.', then it's formally their fault, and  
no-one's allowed to sue the poor little GRDDL processor, which was only  
doing its best in adverse circumstances.



Thanks for the link in the minutes to Henry Thompson's description[1]  
of the recent TAG issue TagSoupIntegration-54 [2] -- this seems to be  
exactly the same problem, and I'm encouraged by the parenthetical  
remark in:

> Is the indefinite persistence of 'tag soup' HTML* consistent with a
> sound architecture for the Web?  If so, (and the going-in assumption
> is that it _is_ so), what changes, if any, to fundamental Web
> technologies are necessary to integrate 'tag soup' with
> HTML and well-formed XML?

So while I certainly agree that it's not sensible for GRDDL to break  
its neck specifying what to do when faced with tag-soup nonsense, and  
inappropriate to specify a particular parser, I feel there's still  
probably scope for a couple of formal `shoulds' in there.

How about: ``GRDDL processors SHOULD attempt to recover gracefully  
from well-formedness or validity errors, and SHOULD retain any RDF  
generated from this process.  In this case (and only in this case),  
processors MAY use information about the document type (gained from a  
Content-Type header or otherwise) to assist in the best-effort  
parsing of the document.  If such an error-recovery strategy is  
employed, a GRDDL processor MAY rely on the generated RDF as if it  
had been extracted from a conformant document.''  That would be  
coupled with a remark ``Document authors are responsible for the RDF  
statements generated by a correctly-applied GRDDL transformation, and  
must be aware that, confronted with ill-formed or invalid XML, GRDDL  
processors are free to use a range of strategies to recover from  
errors, and free to rely on the RDF thus generated.''
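
To make the suggested behaviour concrete, here is a rough sketch in  
Python, not a definitive implementation.  It assumes the transformation  
is named by a link element with rel="transformation", skips the check  
of the head profile, and uses the standard library's html.parser purely  
as an illustrative lenient parser (the suggested text deliberately  
names none).  The function find_transformations and the two sample  
documents are hypothetical.

```python
import xml.etree.ElementTree as ET
from html.parser import HTMLParser

XHTML_NS = "{http://www.w3.org/1999/xhtml}"


class _LinkFinder(HTMLParser):
    """Lenient fallback: collects hrefs of <link rel="transformation">."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        a = dict(attrs)
        rels = (a.get("rel") or "").split()
        if tag == "link" and "transformation" in rels and a.get("href"):
            self.hrefs.append(a["href"])


def find_transformations(text):
    """Return (hrefs, recovered); recovered is True if best-effort parsing was used."""
    try:
        root = ET.fromstring(text)  # strict: raises on ill-formed input
    except ET.ParseError:
        finder = _LinkFinder()  # best-effort recovery
        finder.feed(text)
        finder.close()
        return finder.hrefs, True
    hrefs = [
        el.get("href")
        for el in root.iter(XHTML_NS + "link")
        if "transformation" in (el.get("rel") or "").split() and el.get("href")
    ]
    return hrefs, False


WELL_FORMED = (
    '<html xmlns="http://www.w3.org/1999/xhtml"><head>'
    '<link rel="transformation" href="t.xsl"/></head><body/></html>'
)
TAG_SOUP = (
    '<html><head><link rel="transformation" href="t.xsl"></head>'
    "<body><p>unclosed</body></html>"
)

strict_result = find_transformations(WELL_FORMED)
soup_result = find_transformations(TAG_SOUP)
```

The recovered flag is there only so that a caller can tell which path  
was taken; under the suggested text it would be free to rely on the  
result either way.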

That doesn't really commit anyone to anything, but it enshrines  
Postel's law in an appropriate balance of permissions and  
suggestions, appropriately deprecates ill-formed documents, makes it  
clear that authors should make their documents valid or bear the  
consequences, and makes it clear whose fault it is (the author's) if  
a GRDDL processor relies on RDF statements gleaned from a  
misunderstanding of an ill-formed document (though most of the time  
I'm sure this would be just fine).

As a tangential point, what about the case where a GRDDL processor is  
asked to handle an XHTML document which has a DTD, but which uses the  
xmlns:data-view technique for linking to the GRDDL transformation?   
It's therefore well-formed but invalid.  That case is excluded by  
both section 2 and section 4.  Are all bets off?
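
For concreteness, a small Python sketch of such a document.  It assumes  
the data-view prefix is bound to the GRDDL namespace and that the link  
is carried by a transformation attribute on the root element; the  
example.org URL is made up.  A non-validating parser reads it without  
complaint and the attribute is plainly visible, although the XHTML DTD  
does not declare it.  Whether a processor should act on it is exactly  
the open question.

```python
import xml.etree.ElementTree as ET

# Assumed GRDDL namespace, bound here to the data-view prefix.
DATA_VIEW_NS = "http://www.w3.org/2003/g/data-view#"

# Well-formed, but invalid against the declared DTD, which does not
# permit the data-view:transformation attribute on <html>.
DOC = """<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN"
  "http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd">
<html xmlns="http://www.w3.org/1999/xhtml"
      xmlns:data-view="http://www.w3.org/2003/g/data-view#"
      data-view:transformation="http://example.org/extract.xsl">
  <head><title>t</title></head>
  <body><p>hello</p></body>
</html>"""

# ElementTree does not validate and does not fetch the external DTD,
# so the parse succeeds on well-formedness alone.
root = ET.fromstring(DOC)
transformations = (root.get("{%s}transformation" % DATA_VIEW_NS) or "").split()
```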



Is it still appropriate to be commenting on this, or does everyone  
feel it's resolved or uninteresting?  I get the impression that the  
telecon notes aren't intended to be Rulings -- is that correct?  I  
hope this is still of use.

All the best,

Norman


[1] http://lists.w3.org/Archives/Public/www-tag/2006Oct/0062.html
[2] http://www.w3.org/2001/tag/issues.html#TagSoupIntegration-54

-- 
----------------------------------------------------------------------------
Norman Gray  /  http://nxg.me.uk
eurovotech.org  /  University of Leicester, UK
Received on Monday, 27 November 2006 16:19:23 UTC