[whatwg] A Selector-based metadata proposal (was: Annotating structured data that HTML has no semantics for) from Eduard Pascual on 2009-06-10 (public-whatwg-archive@w3.org from June 2009)

From: Eduard Pascual <herenvardo@gmail.com>
Date: Wed, 10 Jun 2009 12:03:50 +0200
Message-ID: <6ea53250906100303n56d25c39jf7dc27b262f355b@mail.gmail.com>
First of all, Ian, thank you for your reply. I appreciate any opinions on
this subject.

On Wed, Jun 10, 2009 at 1:29 AM, Ian Hickson<ian at hixie.ch> wrote:
> This proposal is very similar to RDF EASE.
Indeed, they are both CSS-based, and they fulfill similar purposes.
Let me, however, highlight some differences:
1st, EASE is tightly bound to RDFa. However, RDFa is meant for
embedding metadata, and was built with that purpose in mind; while
EASE is meant for linked metadata, so building it on top of RDFa's
embedding constructs is quite unnatural. In contrast, CRDF is built
from CSS's syntax and RDF's (not RDFa's) concepts: it only shares with
RDFa what they both inherit from RDF: the concepts and data model.
2nd, EASE is meant to be complementary to RDFa: they address (or
attempt to address) different use cases / needs (embedding vs.
linking). On the other hand (more on this below), CRDF attempts to
address both cases, plus the case where a hybrid approach is
appropriate (inlining some metadata, and linking the rest).

> While I sympathise with the
> goal of making semantic extraction easier, I feel this approach has
> several fundamental problems which make it inappropriate for the specific
> use cases that were brought up and which resulted in the microdata
> proposal:
>
> * It separates (by design) the semantics from the data with those
>   semantics.
That's not accurate. CRDF *allows* separating the semantics, but
doesn't require doing so. Everything could be inlined, and the
possibility of separation is just for when it is needed.

>   I think this is a level of indirection too far -- when
>   something is a heading, it should _be_ a heading, it shouldn't be
>   labeled opaquely with a transformation sheet elsewhere defining that it
>   maps to the heading semantic.
That doesn't make much sense. When something is a heading, it *is* a
heading. What do you mean by "should be a heading"? CRDF (as well as
many other syntaxes for RDF) allows parsers that don't know the
specific semantics of the markup language to find out that something
is actually a heading anyway; and allows expressing semantics that the
markup language has no direct support for (for example, is it a
site-section heading? a news heading? an iguana's name (used as the
main title for each iguana's page on the iguana collection example)?
something else?).

> * It is even more brittle in the face of copy-and-paste and regular
>   maintenance than, say, namespace prefixes. It is very easy to forget to
>   copy the semantic transformation rules. It is very easy to edit the
>   document such that the selectors no longer match what they used to
>   match. It's not at all obvious from looking at the page that there are
>   semantics there.
I think the whole copy-paste issue should be split into two separate scenarios:
Copy-pasting source code: with the next version of the document (which
I'm already cleaning up, and which will allow "@namespace" rules inside the
inlining attribute), this will be as brittle (and as resilient) as
prefixes are: when a fragment that includes the "@namespace"s or
prefixes it needs is copy-pasted, it will work as expected; OTOH, if a
rule relies on a namespace that is not available (declared outside of
the copy-pasted fragment), the rule will just be ignored. The risk of
the copied code clashing with declarations at its new location is
lower than it may seem: an author who is already adding CRDF code to
his pages is quite likely to review the code he's copying for the
semantics that may be there; and authoring tools that automatically
add semantic code should review whether things make sense or not when
code is pasted into them (for example, invalid/redundant properties
could/should be reported to the author).
Copy-pasting content: currently, browser support for copy-pasting CSS
styled content is mediocre and inconsistent (some browsers do it
right, some don't, some don't even try), but this is already more than
what is supported for RDFa, Microdata, or other semantic formats. With
a bit of luck, pressure for browsers to include CRDF properties when
copying content could help to get decent support for CSS properties as
well (since most of the code for these tasks would be shared).
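
A rough Python sketch of the source-code scenario above: a pasted
fragment keeps its metadata only when it carries the namespace
declarations it needs. The function name, the (name, value) pair
representation, and the expansion of prefix|name into a URI by simple
concatenation are assumptions made for illustration; none of them come
from the CRDF document.

```python
def usable_declarations(fragment_namespaces, declarations):
    """Keep prefix|name declarations whose prefix the fragment declares.

    fragment_namespaces: dict of prefix -> namespace URI found inside
    the pasted fragment. declarations: list of (name, value) pairs.
    Hypothetical model, not the CRDF processing rules.
    """
    usable = {}
    for name, value in declarations:
        prefix, sep, local = name.partition("|")
        if sep and prefix in fragment_namespaces:
            # Assumed expansion: namespace URI + local name.
            usable[fragment_namespaces[prefix] + local] = value
        # Otherwise the declaration relies on a namespace declared
        # outside the fragment (or has no prefix) and is ignored.
    return usable


decls = [("foo|myProperty", "someValue")]

# Fragment pasted together with its @namespace declaration.
with_ns = usable_declarations({"foo": "http://www.example.com/"}, decls)

# Same fragment pasted without the declaration: nothing survives.
without_ns = usable_declarations({}, decls)
```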

> * It relies on selectors to do something subtle. Authors have a great
>   deal of trouble understanding selectors -- if you watch a typical Web
>   author writing CSS, he will either use just class selectors, or he
>   will write selectors by trial and error until he gets the style he
>   wants. This isn't fatal for CSS because you can see the results right
>   there; for something as subtle as semantic data mining, it is extremely
>   likely that authors will make mistakes that turn their data into
>   garbage, which would make the feature impractical for large-scale use.
It relies on selectors to do what they do: select things. Nobody is
*asking* authors to make use of over-complicated selectors for each
piece of metadata they want to add; but CRDF tries to *allow* using
any valid selector for each case that needs it.

From what I have seen, most "newbies" can handle both class and element
selectors, and the descendant combinator. A good portion of them are
even aware that they can use the descendant combinator to combine
selectors of either type, and even that they can chain it. In general,
they use the few selectors they can manage when that's enough for the
job, and rely on classes and inline styles to simplify their lives.
Even if some users went for trial and error when crafting
their selectors for CRDF, trial implies trying, and trying implies
somehow checking that the result matches expectations. It requires
a bit more attention to catch the errors, but even this can be
simplified: just take a look at this sample (only relevant parts of
the markup are included, for simplicity):
myFile.crdf:
@namespace foo "http://www.example.com/";
myOverComplexSelector1 {
color: red; /* the CRDF parser will ignore this declaration */
foo|myProperty: someValue;
}
myOverComplexSelector2 {
color: blue;
foo|myProperty: someOtherValue;
}
myFile.html:
<link href="myFile.crdf" type="text/css">
<link href="myFile.crdf" type="text/crdf">
...
For most content, a red or blue color will denote each of the
foo|myProperty values. Mixing CSS with CRDF is ugly (but safe), and
this is not 100% reliable: something may be red or blue without having
the relevant value, or have such a value but with the color overridden
by a more specific style. But looking at patterns (and CRDF is
intended to use selectors essentially for clearly patterned cases,
such as tables or <dl>'s), it should make it possible to check, in most
cases, whether the selector is doing what's expected or not. For example, if
the selector takes the form td:nth-of-type(<whatever>), this method
allows checking it at a quick glance: even if in some cells the
contents override the color, it can be seen overall whether the
column as a whole is getting the relevant color, and hence the
relevant property value. One could go further and disable the actual
CSS (commenting it out in the HTML, or adding a "~" to the file name so
that its fetch fails) while testing the CRDF, and then there'll be no
risk of style overrides. And, if I try, I bet I can find other tricks
to visually test CRDF rules. Of course, this doesn't replace accurate
testing, which should always be done, but it is an example of how the
similarities between CRDF and CSS can be exploited to make authors'
lives easier.
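
To make the trick with the shared sheet more concrete, here is a
minimal Python sketch of how a CRDF-aware consumer might split the
declarations of a mixed sheet like myFile.crdf: names of the form
prefix|name whose prefix was declared with @namespace are treated as
metadata, and unprefixed ones (such as color) are left to the CSS
engine. The function name split_declarations, the naive
regular-expression parsing, and the expansion of prefix|name by
concatenating the namespace URI with the local name are all
assumptions for illustration; real CSS tokenization is more involved,
and none of this is taken from the CRDF document.

```python
import re

SHEET = '''
@namespace foo "http://www.example.com/";
myOverComplexSelector1 {
  color: red; /* ignored by the CRDF side */
  foo|myProperty: someValue;
}
myOverComplexSelector2 {
  color: blue;
  foo|myProperty: someOtherValue;
}
'''


def split_declarations(sheet):
    """Return (crdf, css): dicts mapping selector -> {name: value}."""
    # Strip /* ... */ comments before anything else.
    sheet = re.sub(r'/\*.*?\*/', '', sheet, flags=re.S)
    # Collect declared prefixes: @namespace prefix "uri";
    namespaces = dict(
        re.findall(r'@namespace\s+(\w+)\s+"([^"]*)"\s*;?', sheet))
    crdf, css = {}, {}
    # Naive rule matcher: selector { declarations }, no nesting.
    for selector, body in re.findall(r'([^{}@;]+)\{([^{}]*)\}', sheet):
        selector = selector.strip()
        for decl in body.split(';'):
            if ':' not in decl:
                continue
            name, value = (part.strip() for part in decl.split(':', 1))
            prefix, sep, local = name.partition('|')
            if sep and prefix in namespaces:
                # Assumed expansion: namespace URI + local name.
                uri = namespaces[prefix] + local
                crdf.setdefault(selector, {})[uri] = value
            elif not sep:
                css.setdefault(selector, {})[name] = value
            # Prefixed names with an undeclared prefix are dropped here.
    return crdf, css


crdf, css = split_declarations(SHEET)
```

Comparing the two results selector by selector is the programmatic
counterpart of the visual check: every selector that gets a color in
css should also carry the corresponding foo|myProperty value in crdf.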

There is one case left: those authors who overcomplicate things
and just don't test things at all. I have already stated that, while I
am ok with some foolproofing, I'm totally against suicide-proofing: if
someone wants to doom their own page, they don't need the specs' help
to do so, there are plenty of ways; so adding complexity to specs to
deal with such cases is just a waste.

> I say this despite really wanting Selectors to succeed (disclosure: I'm
> one of the editors of the Selectors specification and spent years working
> on its test suite).
I am aware of your work on the Selectors spec and tests, and that's
why I value your feedback on this area so much.


> I think CRDF has a bright future in doing the kind of thing GRDDL does,
I'm not sure about what GRDDL does: I just took a look through the
spec, and it seems to me that it's just an overcomplication of what
XSLT can already do; so I'm not sure if I should take that statement
as a good or a bad thing.

> and in extracting data from pages that were written by authors who did not
> want to provide semantic data (i.e. screen scraping).
Now that you mention it, I have to agree... I hadn't noticed that
before, but I'll probably make some room for this detail in the
document's intro and/or the examples.

> It's an interesting way of converting, say, Microformats to RDF.
The ability to convert Microformats to RDF was intended (although not
fully achieved: some "bad" content would be treated differently
between CRDF and Microformats); and in the same way CRDF also provides
the ability to define de-centralized Microformats.org-like
vocabularies (I'm not sure if referring to these as "microformats"
would still be appropriate).


Once again, I want to thank you for your feedback. Besides some fixes,
your mail has also convinced me to add some clarifications to the
document for some recurrent misconceptions (for example, CRDF doesn't
require, or even encourage, taking all the semantics out of the main
document: semantics should be kept as close as possible to the content
as long as this doesn't force redundancy/repetition).

Regards,
Eduard Pascual
Received on Wednesday, 10 June 2009 03:03:50 UTC