[whatwg] A Selector-based metadata proposal (was: Annotating structured data that HTML has no semantics for) from Ian Hickson on 2009-06-09 (public-whatwg-archive@w3.org from June 2009)

From: Ian Hickson <ian@hixie.ch>
Date: Tue, 9 Jun 2009 23:29:15 +0000 (UTC)
Message-ID: <Pine.LNX.4.62.0906092256121.1648@hixie.dreamhostps.com>
On Thu, 14 May 2009, Eduard Pascual wrote:
>
> I have put online a document that describes my idea/proposal for a 
> selector-based solution to metadata. The document can be found at 
> http://herenvardo.googlepages.com/CRDF.pdf Feel free to copy and/or link 
> the file wherever you deem appropriate.
> 
> Needless to say, feedback and constructive criticism to the proposal is 
> always welcome. (Note: if discussion about this proposal should take 
> place somewhere else, please let me know.)

This proposal is very similar to RDF EASE. While I sympathise with the 
goal of making semantic extraction easier, I feel this approach has 
several fundamental problems which make it inappropriate for the specific 
use cases that were brought up and which resulted in the microdata 
proposal:

 * It separates (by design) the semantics from the data that carries those 
   semantics. I think this is a level of indirection too far -- when 
   something is a heading, it should _be_ a heading, it shouldn't be 
   labeled opaquely with a transformation sheet elsewhere defining that it 
   maps to the heading semantic.

 * It is even more brittle in the face of copy-and-paste and regular 
   maintenance than, say, namespace prefixes. It is very easy to forget to 
   copy the semantic transformation rules. It is very easy to edit the 
   document such that the selectors no longer match what they used to 
   match. It's not at all obvious from looking at the page that there are 
   semantics there.

 * It relies on selectors to do something subtle. Authors have a great 
   deal of trouble understanding selectors -- if you watch a typical Web 
   author writing CSS, he will either use just class selectors, or he 
   will write selectors by trial and error until he gets the style he 
   wants. This isn't fatal for CSS because you can see the results right 
   there; for something as subtle as semantic data mining, it is extremely 
   likely that authors will make mistakes that turn their data into 
   garbage, which would make the feature impractical for large-scale use.
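
The maintenance hazard in the second point can be sketched roughly as 
follows. This is only an illustration, not CRDF syntax: the RULES table is 
a hypothetical stand-in for an external sheet of positional selectors such 
as "tr > td:nth-child(1)", and the markup and property names are made up.

```python
from html.parser import HTMLParser


class RowCells(HTMLParser):
    """Collect the text of each <td>, grouped by <tr>."""

    def __init__(self):
        super().__init__()
        self.rows = []
        self._in_cell = False

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self.rows.append([])
            self._in_cell = False
        elif tag == "td" and self.rows:
            self.rows[-1].append("")
            self._in_cell = True

    def handle_endtag(self, tag):
        if tag in ("td", "tr", "table"):
            self._in_cell = False

    def handle_data(self, data):
        if self._in_cell:
            self.rows[-1][-1] += data


# Hypothetical external rule sheet: 1-based cell position -> property.
# It lives apart from the markup, like a selector-based mapping would.
RULES = {1: "name", 2: "color"}


def extract(html):
    parser = RowCells()
    parser.feed(html)
    parser.close()
    return [
        {RULES[i]: cell.strip() for i, cell in enumerate(row, start=1) if i in RULES}
        for row in parser.rows
    ]


original = "<table><tr><td> Hedral <td> Black<tr><td> Pillar <td> White</table>"
# A later edit adds an ID column in front; the rule sheet is not updated.
edited = ("<table><tr><td> 1 <td> Hedral <td> Black"
          "<tr><td> 2 <td> Pillar <td> White</table>")

print(extract(original))  # names and colors as intended
print(extract(edited))    # no error is raised, but "name" now holds the ID
```

Nothing in the edited page looks wrong to the author, and nothing fails; 
the extracted data is simply garbage.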

I say this despite really wanting Selectors to succeed (disclosure: I'm 
one of the editors of the Selectors specification and spent years working 
on its test suite).

I think CRDF has a bright future in doing the kind of thing GRDDL does, 
and in extracting data from pages that were written by authors who did not 
want to provide semantic data (i.e. screen scraping). It's an interesting 
way of converting, say, Microformats to RDF.


Having said that, I do agree that the repetition microdata requires in 
common scenarios with blocks of repeated data is unfortunate. It is worse 
than the repetition one has just from the basic HTML markup.

e.g. this:

   <table>
    <tr>
     <td> Hedral  <td> Black
    <tr>
     <td> Pillar  <td> White
   </table>

...becomes this:

   <table>
    <tr item>
     <td itemprop=name> Hedral  <td itemprop=color> Black
    <tr item>
     <td itemprop=name> Pillar  <td itemprop=color> White
   </table>

...or even:

   <table>
    <tr item=com.example.cat>
     <td itemprop=com.example.name> Hedral  <td itemprop=com.example.color> Black
    <tr item=com.example.cat>
     <td itemprop=com.example.name> Pillar  <td itemprop=com.example.color> White
   </table>

...which is far more verbose than ideal.
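
For comparison, here is a minimal sketch of what a consumer of the 
annotated markup above might look like. It assumes the draft item="" and 
itemprop="" attributes exactly as used in these examples; the class and 
function names are made up, and it deliberately ignores nested items, 
subject="", and inline markup inside a property's text.

```python
from html.parser import HTMLParser


class ItemExtractor(HTMLParser):
    """Flat extractor for the item/itemprop attributes used above."""

    def __init__(self):
        super().__init__()
        self.items = []    # one entry per element carrying an item attribute
        self._prop = None  # property currently collecting text, if any

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        # The examples omit end tags, so any new element ends the
        # previous property's text (a simplification).
        self._prop = None
        if "item" in attrs:
            # A bare "item" attribute has the value None (untyped item).
            self.items.append({"type": attrs["item"], "props": {}})
        if "itemprop" in attrs and self.items:
            self._prop = attrs["itemprop"]
            self.items[-1]["props"][self._prop] = ""

    def handle_endtag(self, tag):
        self._prop = None

    def handle_data(self, data):
        if self._prop is not None:
            self.items[-1]["props"][self._prop] += data


def extract_items(html):
    parser = ItemExtractor()
    parser.feed(html)
    parser.close()
    return [
        {"type": item["type"],
         "props": {name: text.strip() for name, text in item["props"].items()}}
        for item in parser.items
    ]


markup = """
<table>
 <tr item>
  <td itemprop=name> Hedral  <td itemprop=color> Black
 <tr item>
  <td itemprop=name> Pillar  <td itemprop=color> White
</table>
"""

for item in extract_items(markup):
    print(item)
```

The verbosity buys locality: every fact the extractor needs sits on the 
element it describes, so no second document has to be kept in sync.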

I considered special casing tables (using <col itemprop> to set 
itemprop="" for all cells in a column) but it would require quite a lot of 
complexity in processors since they'd additionally have to implement the 
table model, and having seen the quality of some of the implementations of 
metadata extractors used on Web content, I fear that that will be far too 
much complexity. (I fear even subject="" might already be too much.) The 
simpler we make it the more reliable it will be.
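
To give a feel for the extra bookkeeping, here is a simplified sketch of 
resolving a hypothetical <col itemprop> against cells with colspan="" and 
rowspan="". It is not the full HTML table model (no header cells, no 
<colgroup>, no error handling), and the function name and data layout are 
assumptions made for illustration.

```python
def resolve_columns(col_props, rows):
    """Give each cell the property of the column it starts in.

    col_props: property name per column (None if unannotated), as a
               hypothetical <col itemprop> would declare them.
    rows:      list of rows; each cell is (text, colspan, rowspan).
    """
    occupied = set()  # (row, column) slots taken by a cell spanning from above
    result = []
    for r, row in enumerate(rows):
        col = 0
        item = {}
        for text, colspan, rowspan in row:
            # Skip slots already filled by a rowspan from an earlier row.
            while (r, col) in occupied:
                col += 1
            if col < len(col_props) and col_props[col] is not None:
                item[col_props[col]] = text
            # Reserve the slots this cell covers in the following rows.
            for dr in range(1, rowspan):
                for dc in range(colspan):
                    occupied.add((r + dr, col + dc))
            col += colspan
        result.append(item)
    return result


col_props = ["name", "color"]
rows = [
    [("Hedral", 1, 2), ("Black", 1, 1)],  # the name cell spans two rows
    [("White", 1, 1)],                    # first cell in source, second column
]

print(resolve_columns(col_props, rows))
```

A processor that just counted cells would label "White" as a name. Even 
this sketch leaves open whether the second row should inherit the spanned 
name, which is the kind of question every extractor would have to answer 
the same way.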

It also wouldn't solve the problem with other patterns, e.g. <dl> (which 
approaches like CRDF's handle fine).


I don't have a good answer for the repetition problem.

-- 
Ian Hickson               U+1047E                )\._.,--....,'``.    fL
http://ln.hixie.ch/       U+263A                /,   _.. \   _\  ;`._ ,.
Things that are impossible just take longer.   `._.-(,_..'--(,_..'`-.;.'
Received on Tuesday, 9 June 2009 16:29:15 UTC