- From: Danny Ayers <danny.ayers@gmail.com>
- Date: Fri, 18 Aug 2006 13:58:50 +0200
- To: "Ben Adida" <ben@mit.edu>
- Cc: public-grddl-wg <public-grddl-wg@w3.org>
On 8/18/06, Ben Adida <ben@mit.edu> wrote:
> It seems to me that it's actually much more complicated than that. You'd
> have to scrape for every known microformat, and the list is obviously
> growing as time goes on.

That's assuming you want all possible triples. In practice you might only be interested in specific domains.

> That's already problematic, because it implies
> some central repository of "known vocabularies" that are assumed for all
> web pages... that's quite a bit more centralized than most folks have
> come to expect from the web.

The web isn't short of such repositories (IANA MIME types being the obvious example, and most specs have their share of reserved strings). I personally agree that in this case a formal registry probably wouldn't be appropriate. (Probably irrelevant, but Atom has a nice compromise trick for special names in link rel attributes: there is a registry of simple strings, but as an alternative anyone can use a URI.)

> What's more problematic, though, is that the interplay between multiple
> microformats is not well defined. This is somewhat expected given that
> each microformat is optimized for its specific field, but it also means
> that it will become harder to parse one microformat without at least
> *knowing* about the others, even if you don't want to parse the others.

I don't see how this is any different from the case where you have the profile URIs. As I understand it, transformations are applied independently, with the triples resulting from each being aggregated. In practice a lot of relationships in mixed documents may slip through the net. (Here I like the idea of eRDF as glue, but encouraging people to publish that way is another matter.)

> In other words, I think the "parsing microformats without profile URIs"
> is a pretty deep rathole as far as standardization is concerned.

Probably.
But right now only a small fraction of the microformat documents out there have profile URIs, and there's nothing yet to suggest those proportions will change. If we are to reject the majority of total documents as out of scope, it should be a conscious decision. In extremis, are we going to declare that GRDDL only applies where there's a clean profile chain from valid XHTML? That may be a reasonable course of action, I don't know.

> On this, I agree with DanC (gasp, this is not a regular occurrence!): if a
> Google-like entity wants to parse and make a best guess as to what
> metadata is included on a page, then more power to them. But to make
> "guessing a transform" standard behavior seems awfully difficult and
> error-prone.

The entities don't have to be Google-like. There are already microformat tools that don't worry about the profile, e.g. this viewer [2] gives a uF-oriented view of this [3] profile-free page. Why shouldn't e.g. Tabulator have the same capability?

If someone wanted to build a Google-like tool, a quick route to getting suitable data would be to hook into something like Pingerati [1]. You would know in advance that most documents would contain microformat data, and that most would be from a small number of well-known vocabularies.

Ok, there is something of the liberal vs. draconian argument here, and the bottom line is that it is up to the individual developer. But it might help the developer if there were at least some guidelines in place (I'm not suggesting that should be a deliverable, only that such possibilities should be considered).

My personal preference would be no attempt to standardise profile (and/or namespace)-free transformation itself, but recognition that it will happen, and that it may well be the majority case because of data quality realities. These transformations may or may not come under the umbrella of GRDDL. There may be best practice hints, though I can't think of any.
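
As a rough illustration of the profile-free case discussed above, here is a minimal Python sketch of a tool that guesses which microformats a page contains from class names alone. Everything in it is an assumption for illustration: the `KNOWN_ROOT_CLASSES` table, the `MicroformatSniffer` class, the `sniff` function and the `scraped` flag are hypothetical names, and the table is deliberately tiny. It only detects candidate vocabularies; it does not produce triples and is not a GRDDL transformation.

```python
from html.parser import HTMLParser

# Hypothetical, deliberately small table: root class name -> vocabulary label.
KNOWN_ROOT_CLASSES = {
    "vcard": "hCard",
    "vevent": "hCalendar",
    "hreview": "hReview",
}


class MicroformatSniffer(HTMLParser):
    """Collects declared profile URIs and any known microformat root classes."""

    def __init__(self):
        super().__init__()
        self.profiles = []   # URIs found in <head profile="...">
        self.found = set()   # vocabulary labels guessed from class names

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        # A profile attribute may hold several whitespace-separated URIs.
        if tag == "head" and attrs.get("profile"):
            self.profiles.extend(attrs["profile"].split())
        # Class attributes are whitespace-separated token lists.
        for name in (attrs.get("class") or "").split():
            if name in KNOWN_ROOT_CLASSES:
                self.found.add(KNOWN_ROOT_CLASSES[name])


def sniff(html):
    """Return declared profiles, guessed vocabularies, and a 'scraped' flag."""
    parser = MicroformatSniffer()
    parser.feed(html)
    parser.close()
    return {
        "profiles": parser.profiles,
        "guessed": sorted(parser.found),
        # True when the guess rests on class names only, with no profile URI.
        "scraped": not parser.profiles,
    }
```

Calling `sniff(page_source)` on a page with no profile attribute returns the guessed vocabularies with `scraped` set to true, so a consumer can at least tell that the result was inferred rather than declared by the publisher.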

I'd also like to see support through vocabulary terms that express that particular graphs have in effect been scraped, without any data license from the publisher. Assuming of course that someone can think of a viable approach...

Cheers,
Danny.

[1] http://pingerati.net/about/
[2] http://www.xformats.org/MicroViewer/
[3] http://www.jasonkolb.com/about

--
http://dannyayers.com
Received on Friday, 18 August 2006 11:58:59 UTC