Re: GRDDL and HTML from Norman Gray on 2006-11-27 (public-grddl-comments@w3.org from October to December 2006)

From: Norman Gray <norman@astro.gla.ac.uk>
Date: Mon, 27 Nov 2006 21:03:57 +0000
To: Harry Halpin <hhalpin@ibiblio.org>
Cc: public-grddl-comments@w3.org
Message-Id: <EEFBA5A0-2CE5-48A8-995B-85DADF3F8E24@astro.gla.ac.uk>
Harry, hello again.

On 2006 Nov 27, at 17.07, Harry Halpin wrote:

>> At the same time, I can't help feeling that saying just `if it's  
>> not well-formed, all bets are off' and `it is possible...', while  
>> true, is rather avoiding the issue.  In the case of someone  
>> generating (X)HTML which wraps third party content (I'm thinking  
>> again of Yahoo wrapping user-generated HTML), I think they could  
>> reasonably expect the GRDDL spec to give _some_ clue about what  
>> ought to happen when they put a valid and metadata-rich wrapper  
>> round invalid and RDF-less content.  I think they should also  
>> reasonably expect that GRDDL processors _would_ have a go, and if  
>> so it would be good for the spec to bless that.  In this case, I  
>> think it would be useful to make it clear that if they do emit ill- 
>> formed XHTML and end up saying `:your_mother a :hamster.', then  
>> it's formally their fault, and no-one's allowed to sue the poor  
>> little GRDDL processor, which was only doing its best in adverse  
>> circumstances.
>
> 	I think one part of GRDDL is the focus on "the author of a  
> document states that the transformation will provide a faithful  
> rendition of the source document, or some portion of the source  
> document, that preserves its meaning in RDF." [2] This puts the  
> burden on the author to explicitly license the transform. One line  
> of argument could be that if the author wanted to license a  
> faithful rendition, they would want that rendition to be as  
> "deterministic" and unlikely to break as possible, and that would be  
> one reason to use XHTML instead of tagsoup.

Indeed.  Authors certainly ought to produce valid/well-formed XHTML,  
and given that they do, there neither is nor should be any ambiguity  
or indeterminacy about what RDF it transforms into.

I'm thinking of the case where an author is ignorant (they're  
following a recipe), or where they know what they're doing, but have  
made an engineering decision not to obsess about cleaning up their  
HTML because...

> [...]many pages are generated using HTML that "in the small" for a  
> set of particular web-pages is itself generic and regular even, so  
> that the  author could be able to determine a transformation to RDF  
> and specify it.

...which I entirely agree with.



>> As a tangential point, what about the case where a GRDDL processor  
>> is asked to handle an XHTML document which has a DTD, but which  
>> uses the xmlns:data-view technique for linking to the GRDDL  
>> transformation?  It's therefore well-formed but invalid.  That  
>> case is excluded by both section 2 and section 4.  Are all bets off?
>
> 	Do you mean that the XHTML DTD does not allow xmlns:data-view? I  
> believe that should not be a problem. Could you give us a test case  
> (i.e. a sample input document and your suggested output or problems  
> that it brings up)?

OK.  This is probably more an issue of wording than anything deeply  
technical.

How about this?

<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN"
        "http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd">
<html xmlns="http://www.w3.org/1999/xhtml"
       xmlns:data-view="http://www.w3.org/2003/g/data-view#"
    data-view:transformation="http://www.w3.org/2003/12/rdf-in-xhtml-xslts/grokFOAF.xsl
                              http://www.w3.org/2003/12/rdf-in-xhtml-xslts/grokCC.xsl
                              http://www.w3.org/2003/12/rdf-in-xhtml-xslts/grokGeoURL.xsl">
<head>
   <title>Joe Lambda's Home page [an example of RDF in XHTML]</title>
[...]

That's a dialect of XHTML that _is_ constrained by DTD syntax (so the  
spec's section 2 doesn't apply), but it's invalid, because it has the  
xmlns: and data-view: attributes (so section 4 doesn't apply).
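
To make the point concrete, here is a rough Python sketch (mine, not anything from the GRDDL draft) of what a non-validating processor might do with a document like the one above.  It assumes the standard-library xml.etree.ElementTree parser, which checks well-formedness but does not validate against the DTD.  The function name root_transformations and the cut-down sample document (two of the three transformations, and an empty body) are illustrative assumptions only.

```python
import xml.etree.ElementTree as ET

# Namespace of the data-view:transformation attribute, as in the example above.
GRDDL_NS = "http://www.w3.org/2003/g/data-view#"

# Cut-down version of the sample document: well-formed, but not valid
# against the XHTML 1.0 Strict DTD that it names.
DOC = """<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN"
        "http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd">
<html xmlns="http://www.w3.org/1999/xhtml"
      xmlns:data-view="http://www.w3.org/2003/g/data-view#"
      data-view:transformation="http://www.w3.org/2003/12/rdf-in-xhtml-xslts/grokFOAF.xsl
                                http://www.w3.org/2003/12/rdf-in-xhtml-xslts/grokCC.xsl">
<head><title>Joe Lambda's Home page</title></head>
<body/>
</html>"""


def root_transformations(xml_text):
    """Return the transformation URIs named on the root element.

    Hypothetical helper: the parser only requires well-formedness,
    so the DTD named in the DOCTYPE is never consulted.
    """
    root = ET.fromstring(xml_text)
    value = root.get("{%s}transformation" % GRDDL_NS, "")
    # The attribute value is a whitespace-separated list of URIs.
    return value.split()


if __name__ == "__main__":
    for uri in root_transformations(DOC):
        print(uri)
```

Nothing in this sketch notices, or has any reason to notice, that the document is invalid.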

This isn't just niggling: is any purpose served by the restriction of  
section 4 to _valid_ XHTML?  Writing valid XHTML is of course good  
for one's immortal soul, but if you give out invalid but well-formed  
XHTML, with a GRDDL transformation which produces the RDF you want,  
why should the GRDDL processor care?

All that section 4 actually needs to do is define a special-case  
mechanism for a particular category of well-formed XML documents,  
which (for irrelevant reasons) can't use the section 2 mechanism.   
Thus my suggestion is (i) that a GRDDL processor should be required  
to have no problems with XHTML such as the above, and (ii) that the  
three instances of the word 'valid' in section 4 -- and indeed  
elsewhere -- could be deleted without loss.

This does also imply that a document like "<html ...><wibble/><head  
profile='...'> ... </html>" would be acceptable.  But so what?   
Indeed, the only way that either of these cases could be  
distinguished from the 'well-formed XML' of section 2 is if the GRDDL  
processor took the trouble to validate the document: this would  
simply be mad, so that the distinction between valid and invalid (but  
well-formed) documents in the spec is a distinction without a  
difference.

You want text?  Delete the 'valid' words, and:
> Stated more formally:
>
> If an XML document has an attribute with XPath /html/head/@profile,  
> and that attribute contains the string  
> "http://www.w3.org/2003/g/data-view", then the document has a GRDDL  
> transformation for each resource named by  
> /html/head/link[@rel='transformation']/@href
That clearly applies to a much larger class of documents than just  
XHTML, but again that needn't be GRDDL's problem.
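
As a sanity check that the rule is implementable without validation, here is a minimal Python sketch of it, again using the standard-library xml.etree.ElementTree parser.  The assumptions are mine: the helper names are hypothetical, namespaces are ignored by comparing local names (the XPaths above are written without prefixes), and the example.org stylesheet URI is made up for illustration.

```python
import xml.etree.ElementTree as ET

# Profile string taken from the proposed wording above.
GRDDL_PROFILE = "http://www.w3.org/2003/g/data-view"


def _local(tag):
    """Strip any '{namespace}' prefix from an ElementTree tag name."""
    return tag.rsplit("}", 1)[-1]


def profile_transformations(xml_text):
    """Return hrefs of link[@rel='transformation'] under a profiled head.

    Sketch of the rule only: the document must be well-formed XML,
    but it is never validated against any DTD.
    """
    root = ET.fromstring(xml_text)
    if _local(root.tag) != "html":
        return []
    hrefs = []
    for head in root:
        if _local(head.tag) != "head":
            continue
        # "contains the string": a plain substring test on @profile.
        if GRDDL_PROFILE not in head.get("profile", ""):
            continue
        for link in head:
            if _local(link.tag) == "link" and link.get("rel") == "transformation":
                href = link.get("href")
                if href:
                    hrefs.append(href)
    return hrefs


# Well-formed but invalid: note the stray <wibble/> element.
SAMPLE = """<html xmlns="http://www.w3.org/1999/xhtml">
<wibble/>
<head profile="http://www.w3.org/2003/g/data-view">
  <title>t</title>
  <link rel="transformation" href="http://example.org/t.xsl"/>
</head>
</html>"""

if __name__ == "__main__":
    print(profile_transformations(SAMPLE))
```

The stray wibble element is simply skipped, which is the behaviour I am arguing the spec should bless.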

All the best,

Norman


-- 
----------------------------------------------------------------------------
Norman Gray  /  http://nxg.me.uk
eurovotech.org  /  University of Leicester, UK
Received on Monday, 27 November 2006 21:04:23 UTC