Re: Microformat profile URIs from Danny Ayers on 2006-08-18 (public-grddl-wg@w3.org from August 2006)

From: Danny Ayers <danny.ayers@gmail.com>
Date: Fri, 18 Aug 2006 13:58:50 +0200
To: "Ben Adida" <ben@mit.edu>
Cc: public-grddl-wg <public-grddl-wg@w3.org>
Message-ID: <1f2ed5cd0608180458p22619f57sbbb459a8f74af777@mail.gmail.com>
On 8/18/06, Ben Adida <ben@mit.edu> wrote:

> It seems to me that it's actually much more complicated than that. You'd
> have to scrape for every known microformat, and the list is obviously
> growing as time goes on.

That's assuming you want all possible triples. In practice you might
only be interested in specific domains.

> That's already problematic, because it implies
> some central repository of "known vocabularies" that are assumed for all
> web pages... that's quite a bit more centralized than most folks have
> come to expect from the web.

The web isn't short of such repositories (IANA mime types being the
obvious example, and most specs have their share of reserved strings).
I personally agree that in this case a formal registry probably
wouldn't be appropriate. (Probably irrelevant, but Atom has a nice
compromise trick for special names in link rel attributes - there is a
registry of simple strings, but as an alternative anyone can use a
URI).

> What's more problematic, though, is that the interplay between multiple
> microformats is not well defined. This is somewhat expected given that
> each microformat is optimized for its specific field, but it also means
> that it will become harder to parse one microformat without at least
> *knowing* about the others, even if you don't want to parse the others.

I don't see how this is any different from the case where you have the
profile URIs. As I understand it transformations are applied
independently, the triples resulting from each being aggregated. In
practice a lot of relationships in mixed documents may slip through
the net. (Here I like the idea of eRDF as glue, but encouraging people
to publish that way is another matter).
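A minimal sketch of that model, assuming each transformation is just a
function from the source document to a set of triples. The names
hcard_transform, hcal_transform and aggregate, and the placeholder
triples they emit, are hypothetical and not taken from any GRDDL draft:

```python
from typing import Callable, Iterable, Set, Tuple

Triple = Tuple[str, str, str]
Transform = Callable[[str], Iterable[Triple]]


def hcard_transform(doc: str) -> Iterable[Triple]:
    # Hypothetical stand-in for an hCard transformation.
    if 'class="vcard"' in doc:
        yield ("_:card", "rdf:type", "vcard:VCard")


def hcal_transform(doc: str) -> Iterable[Triple]:
    # Hypothetical stand-in for an hCalendar transformation.
    if 'class="vevent"' in doc:
        yield ("_:event", "rdf:type", "ical:Vevent")


def aggregate(doc: str, transforms: Iterable[Transform]) -> Set[Triple]:
    # Each transformation sees only the source document, never the
    # output of another one; the results are simply unioned.
    graph: Set[Triple] = set()
    for transform in transforms:
        graph.update(transform(doc))
    return graph


doc = '<div class="vevent"><span class="vcard">Danny</span></div>'
graph = aggregate(doc, [hcard_transform, hcal_transform])
```

Both triples end up in the graph, but nothing relates the card to the
event it is nested in, which is the kind of relationship that may slip
through the net whether or not profile URIs are present.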

> In other words, I think the "parsing microformats without profile URIs"
> is a pretty deep rathole as far as standardization is concerned.

Probably. But right now only a small fraction of the microformat
documents out there have profile URIs, and there's nothing yet to
suggest those proportions will change. If we are to reject the
majority of documents as out of scope, it should be a conscious
decision. In extremis, are we going to declare that GRDDL only applies
where there's a clean profile chain from valid XHTML? That may be a
reasonable course of action, I don't know.

> On this, I agree with DanC (gasp, this is not a regular occurrence!): if a
> Google-like entity wants to parse and make a best guess as to what
> metadata is included on a page, then more power to them. But to make
> "guessing a transform" standard behavior seems awfully difficult and
> error-prone.

The entities don't have to be Google-like. There are already
microformat tools that don't worry about the profile. For example,
this viewer [2] gives a uF-oriented view of this [3] profile-free
page. Why shouldn't, say, Tabulator have the same capability?
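For what it's worth, profile-free detection of this kind can be
sketched in a few lines. This is only a guess at the general approach,
not how the viewer above works: it assumes a small hard-coded table of
well-known root class names (vcard, vevent, hreview), and the names
KNOWN_ROOTS, MicroformatSniffer and sniff are hypothetical:

```python
from html.parser import HTMLParser

# Assumed table of root class names for a few well-known microformats.
KNOWN_ROOTS = {"vcard": "hCard", "vevent": "hCalendar", "hreview": "hReview"}


class MicroformatSniffer(HTMLParser):
    def __init__(self):
        super().__init__()
        self.profiles = []   # URIs declared in <head profile="...">
        self.found = set()   # microformats guessed from class names

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "head" and attrs.get("profile"):
            self.profiles.extend(attrs["profile"].split())
        # Valueless attributes are reported as None, hence the "or".
        for cls in (attrs.get("class") or "").split():
            if cls in KNOWN_ROOTS:
                self.found.add(KNOWN_ROOTS[cls])


def sniff(html):
    sniffer = MicroformatSniffer()
    sniffer.feed(html)
    sniffer.close()
    return sniffer.profiles, sniffer.found


page = ('<html><head><title>About</title></head><body>'
        '<div class="vcard"><span class="fn">Someone</span></div>'
        '</body></html>')
profiles, found = sniff(page)
```

Here profiles comes back empty while found still reports hCard, which
is exactly the guesswork a profile-driven processor would decline to
do.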

If someone wanted to build a Google-like tool, a quick route to
getting suitable data would be to hook into something like Pingerati
[1]. You would know in advance that most documents would contain
microformat data, and that most would be from a small number of
well-known vocabularies.

Ok, there is something of the liberal vs. draconian argument here, and
the bottom line is that it is up to the individual developer. But it
might help the developer if there were at least some guidelines in
place (I'm not suggesting that should be a deliverable, only that such
possibilities should be considered).

My personal preference would be no attempt to standardise profile
(and/or namespace)-free transformation itself, but recognition that it
will happen, and that it may well be the majority case because of data
quality realities. These transformations may or may not come under the
umbrella of GRDDL. There may be best practice hints, though I can't
think of any. I'd also like to see support through vocabulary terms that
express that particular graphs have in effect been scraped, without
any data license from the publisher. Assuming of course that someone
can think of a viable approach...

Cheers,
Danny.

[1] http://pingerati.net/about/
[2] http://www.xformats.org/MicroViewer/
[3] http://www.jasonkolb.com/about



-- 

http://dannyayers.com
Received on Friday, 18 August 2006 11:58:59 UTC