- From: Danny Ayers <danny.ayers@gmail.com>
- Date: Wed, 28 Feb 2007 13:44:15 +0100
- To: "Microformats Discuss" <microformats-discuss@microformats.org>
- Cc: www-archive@w3.org
* When my user agent encounters an HTML document it can use various pattern-matching rules to extract embedded data, e.g. it sees the string "vevent" in an attribute and uses the conventions for hCalendar to pull out the event details. This heuristic-based approach to document interpretation is commonly known as screenscraping [1].

* When my user agent encounters an HTML document it can follow the HTML specification, specifically the part on Meta Data Profiles [2]. If it finds a URI corresponding to the hCalendar profile, it can use the conventions for hCalendar (which are encoded in a machine-readable form in the XMDP document at the profile URI) to pull out the event details. This deterministic approach to document interpretation is commonly known as parsing [3].

Being able to reliably parse documents means that there's a much better chance of the publisher's intent being preserved. This point is considerably more significant when the markup is likely to have some machine-processing rather than being directly rendered to the user: intelligent responses to mistakes are considerably more difficult for computers.

The preservation of the publisher's intent, their authorised statements, is particularly important in an environment where republication is not uncommon and provenance tracking often desirable. With scraping the chain of authority is broken at the first link.

What's more, data extraction is easier and more efficient if there's a profile URI in place. Once the head of the doc has been read, the agent has all the information it needs on how to process the body. No speculative comparisons between the body content and a list of known attribute strings.

Right now the best that can be offered for most microformats is calculated guesswork. What's needed to get beyond this is the minting of reasonably stable profile URIs and the XMDP documents placed at those URIs. XMDP profiles have already been drafted for many of the microformats (e.g. there's one for hCalendar at [4]).

I really don't understand the lack of activity from the core devotees* of microformats on this. A minimal piece of server admin - publish the existing profiles at e.g. http://microformats.org/profiles/hcalendar and it's done. Yet it must be approaching a year since it was accepted that profiles should have URIs [5]. What does this say of the microformats process?

Ok, arguably microformats can solve 80/20 of the embedded data problem without profile URIs. But ignoring the profile part of the HTML spec makes a mockery of "based on existing interoperable standards".

I really like Tantek's definition: "Microformats are the way to publish and share information on the web with higher fidelity." [6]. Right now the conventions only offer a marginal improvement in fidelity, because for the most part it's still just screenscraping.

Can someone please take this small step as soon as possible? I believe it will make a huge difference in the long term.

Cheers,
Danny.

* I'd rather avoid the negative connotations of "cabal" ;-)

[1] http://en.wikipedia.org/wiki/Screen_scraping
[2] http://www.w3.org/TR/html401/struct/global.html#h-7.4.4.3
[3] http://en.wikipedia.org/wiki/Parsing
[4] http://dannyayers.com/microformats/hcalendar-profile
[5] http://microformats.org/wiki/profile-uris
[6] http://microformats.org/wiki/what-are-microformats

--
http://dannyayers.com
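As a rough illustration of the profile-first approach described in the message, here is a minimal Python sketch using only the standard library. It assumes the profile URI http://microformats.org/profiles/hcalendar, which the message only proposes as an example and which may not be what eventually gets published. The class and function names are made up for illustration, and the sketch does not fetch or interpret the XMDP document itself; it only shows the agent deciding from the head whether to look for "vevent" in the body at all.

```python
from html.parser import HTMLParser

# Assumption: the example URI proposed in the message, not a confirmed profile URI.
HCALENDAR_PROFILE = "http://microformats.org/profiles/hcalendar"


class ProfileAwareParser(HTMLParser):
    """Reads the head profile URIs, then counts 'vevent' elements only if declared."""

    def __init__(self):
        super().__init__()
        self.profiles = set()
        self.vevent_count = 0

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "head":
            # HTML 4.01 allows a space-separated list of profile URIs here.
            self.profiles.update((attrs.get("profile") or "").split())
        elif HCALENDAR_PROFILE in self.profiles:
            # The head has told us how to read the body; no guessing needed.
            if "vevent" in (attrs.get("class") or "").split():
                self.vevent_count += 1


def declared_hcalendar_events(html):
    """Return the number of vevents, or None if the profile is not declared."""
    parser = ProfileAwareParser()
    parser.feed(html)
    if HCALENDAR_PROFILE not in parser.profiles:
        return None  # without a declaration, extraction would be screenscraping
    return parser.vevent_count
```

A document whose head carries the profile attribute yields a count; one without it yields None, leaving the caller to decide whether to fall back to scraping.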
Received on Wednesday, 28 February 2007 12:44:25 UTC