- From: Norman Gray <norman@astro.gla.ac.uk>
- Date: Tue, 14 Nov 2006 10:26:12 +0000
- To: public-grddl-comments@w3.org
Greetings.
I was talking to Harry Halpin last week about GRDDL and HTML, and he
suggested I post a couple of comments here. I haven't been following
the GRDDL list discussions in detail, so apologies in advance if I'm
misunderstanding an issue.
* media types and content sniffing: RFC 2616 section 7.2.1 says that
you may guess the media type if the Content-Type header is absent,
implying (very nearly unambiguously) that you must not if the header
is present. However, sniffing the XML header to determine which type
of GRDDL transformation is present, if any, doesn't violate this [I
imagine I'm misunderstanding this as an issue, but I said I'd look up
the reference].
* media types, 2: the spec refers to transformations on well-formed
XML documents, and specifically XHTML. I presume that this refers to
all documents with media types text/xml, application/xml, and */*+xml
-- would that be correct? Would it be worth making explicit?
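
To make the question concrete, here is a minimal Python sketch of the
media-type test described above. The function name is_xml_media_type
is hypothetical, and whether GRDDL should cover exactly this set of
types is the open question, not something I know the spec to state.

```python
def is_xml_media_type(content_type: str) -> bool:
    """Return True for text/xml, application/xml, and any */*+xml type."""
    # Drop parameters such as "; charset=utf-8" and normalise case.
    media_type = content_type.split(";", 1)[0].strip().lower()
    if media_type in ("text/xml", "application/xml"):
        return True
    # Suffixed types: type/subtype+xml, e.g. application/xhtml+xml.
    type_, sep, subtype = media_type.partition("/")
    has_xml_suffix = subtype.endswith("+xml") and len(subtype) > len("+xml")
    return bool(type_) and bool(sep) and has_xml_suffix
```

Under this reading, application/xhtml+xml and application/rdf+xml
qualify, while text/html does not.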
* error behaviour: The GRDDL spec doesn't say what a GRDDL processor
should do if fed something which isn't one of these media types, or
which purports to be but isn't, or isn't well-formed. Ought it to
discuss this? Possibilities would include saying that a processor
may or must do one of: halt and catch fire, signal an error, produce
an empty model, or do the best it can; or leaving the behaviour
explicitly unspecified or implementation-defined.
I'm not convinced it's necessary for the spec to include this, but it
might be of interest to services (such as Yahoo, say), which might
want to add a transformation at the top of documents that wrap user
content. Can they expect that their content will be parsed right up
to the point where the user's malformed content starts, for example?
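
The alternatives above can be sketched in Python, purely as an
illustration. The function name elements_seen and the three policy
strings are invented for this sketch; the list of element names stands
in for a real model; and the standard library's XMLPullParser is used
only because it hands over the events it saw before reporting a
well-formedness error.

```python
import xml.etree.ElementTree as ET


def elements_seen(data, policy="signal"):
    """Return names of elements whose start tags were parsed."""
    parser = ET.XMLPullParser(events=("start",))
    parser.feed(data)
    seen = []
    try:
        # Events queued before a parse error are delivered first;
        # the error itself is raised afterwards by read_events().
        for _event, element in parser.read_events():
            seen.append(element.tag)
        parser.close()  # raises ParseError if the input was truncated
    except ET.ParseError:
        if policy == "signal":
            raise          # "signal an error"
        if policy == "empty":
            return []      # "produce an empty model"
        # any other policy: "do the best it can", keep what arrived
    return seen
```

With the best-effort policy, a transformation link at the top of a
document would still be seen even if the wrapped user content further
down is malformed, which is the situation the Yahoo-style example asks
about.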
* media types and HTML: The only MIME type described for XHTML is
text/html <http://www.w3.org/TR/2000/REC-xhtml1-20000126/#media>.
Now, there might still be one or two documents out there on the web
which are not well-formed XHTML, but which are served as text/html.
A GRDDL processor might just treat this as an error, and recover or
object as appropriate. However, it can probably do better, since
John Cowan's TagSoup parser <http://home.ccil.org/~cowan/XML/tagsoup/>
will take any old nonsense, and produce from it a (well-structured)
SAX stream.
Consider for example Joe Kappa's homepage (hold your nose: not
pretty), which starts:
> <head profile=http://www.w3.org/2003/g/data-view>
> Joe Lambda's Home page [an example of RDF in XHTML]
> <link rel=transformation href=http://www.w3.org/2003/12/rdf-in-
> xhtml-xslts/grokFOAF.xsl>
>
> <div class=foaf-person>
> <h1>Joe Lambda's homepage</h2>
>
> <strong>Note: this should obsolete the <a
> href="/2003/12/rdf-in-xhtml-xslts/complete-example.html">older version
> presently.</a>
>
> <p>Hi there, my name is <span class=foaf-name>Joe Lambda</span>,
> and I work
> at <a href="http://www.acme.com" rel="foaf-work">ACME Inc.</a>. You
> can
> contact me by email at <a
> href=mailto:joe.lambda@example.org>joe.lambda@example.org</a>, or
> get more
> info on my <a href="http://www.example.org/~jlambda/"
> rel="foaf-home">personal home page
>
> <h2>People I know
> <li><a href="http://www.example.org/~bfoo/" rel="foaf-knows">Bill
> Foo</li></a>
> <li><a href="mailto:gbaz@example.com" rel="foaf-knows">G. Baz</a>
> </div>
> ...
Processing with
    % java -jar tsaxon.jar -H joe-kappa.html grokFOAF.xsl >joe-kappa.rdf
(tsaxon is a convenience version of Saxon that John Cowan distributes
with his TagSoup built in) produces output which is identical to that
produced from Joe Lambda's page, modulo a redundant charset declaration.
This isn't just a curiosity. There's a lot of this sort of stuff on
the web, some of it surely being claimed as XHTML, and Postel's law
says that a GRDDL processor should probably try to cope with it
without puking. Such a processor might try to recover from parser
errors, or could just use TagSoup for text/html content and handle
anything. If the resulting RDF is wrong (i.e., not what the author
intended), then this isn't the processor's fault. This seems to fit
in with GRDDL's pragmatic motivations.
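
As a rough illustration of the lenient approach, the following Python
sketch finds the GRDDL profile and the transformation links in tag
soup like the page quoted above. It uses the standard library's
html.parser as a stand-in for TagSoup (it tolerates unquoted
attributes and mismatched tags, but it does not repair the tree as
TagSoup does). The class and function names are hypothetical, and
checking only link and a elements is an assumption about where
rel=transformation may appear.

```python
from html.parser import HTMLParser

GRDDL_PROFILE = "http://www.w3.org/2003/g/data-view"


class TransformationFinder(HTMLParser):
    """Collect rel=transformation links from possibly malformed HTML."""

    def __init__(self):
        super().__init__()
        self.has_profile = False
        self.transformations = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "head":
            # profile is a space-separated list of URIs
            profiles = (attrs.get("profile") or "").split()
            self.has_profile = GRDDL_PROFILE in profiles
        elif tag in ("link", "a"):
            rels = (attrs.get("rel") or "").split()
            if "transformation" in rels and attrs.get("href"):
                self.transformations.append(attrs["href"])


def find_transformations(html_text):
    """Return transformation URIs, or [] if the GRDDL profile is absent."""
    finder = TransformationFinder()
    finder.feed(html_text)
    finder.close()
    return finder.transformations if finder.has_profile else []
```

Fed the opening lines of the page above (with the link href on a
single line), this should report grokFOAF.xsl as the one
transformation, despite the missing quotes and the <h1>...</h2>
mismatch.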
* SAX streams: One could make this a little more abstract, and avoid
mentioning a specific parser, by specifying GRDDL behaviour as acting
on a SAX stream (or another post-parse data model, such as the
Infoset or a DOM). That opens the door to strategies like using
TagSoup for text/html content, but also things like vcard4j
<http://vcard4j.sourceforge.net/>, which produces a DOM from vCard
input (text/directory media type).
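
A small Python sketch of that separation, under stated assumptions:
the "GRDDL side" below is an ordinary SAX ContentHandler that never
sees the raw bytes, and the front end is chosen by media type. The
names LinkCollector, SoupToSax and collect_links are made up for the
example, and the lenient front end simply replays html.parser
callbacks as SAX events; unlike TagSoup, it does not balance the
element nesting.

```python
import xml.sax
from html.parser import HTMLParser
from xml.sax.handler import ContentHandler
from xml.sax.xmlreader import AttributesImpl


class LinkCollector(ContentHandler):
    """Consumer defined purely in terms of SAX events."""

    def __init__(self):
        super().__init__()
        self.links = []

    def startElement(self, name, attrs):
        rels = attrs.get("rel", "").split()
        if name == "link" and "transformation" in rels:
            self.links.append(attrs.get("href"))


class SoupToSax(HTMLParser):
    """Lenient front end: forwards tags and text to a SAX handler."""

    def __init__(self, handler):
        super().__init__()
        self.handler = handler

    def handle_starttag(self, tag, attrs):
        # Valueless attributes arrive as None; SAX expects strings.
        sax_attrs = AttributesImpl({k: v or "" for k, v in attrs})
        self.handler.startElement(tag, sax_attrs)

    def handle_endtag(self, tag):
        self.handler.endElement(tag)

    def handle_data(self, data):
        self.handler.characters(data)


def collect_links(data, media_type):
    """Pick a front end by media type; the handler is the same for both."""
    handler = LinkCollector()
    if media_type == "text/html":
        front = SoupToSax(handler)
        front.feed(data)
        front.close()
    else:
        # Strict XML parsing; raises SAXParseException if not well-formed.
        xml.sax.parseString(data.encode("utf-8"), handler)
    return handler.links
```

A vCard-to-events front end could be slotted in the same way for
text/directory input without touching the handler.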
I hope these comments are useful.
All the best,
Norman
--
----------------------------------------------------------------------------
Norman Gray / http://nxg.me.uk
eurovotech.org / University of Leicester, UK
Received on Tuesday, 14 November 2006 10:26:31 UTC