Re: Microdata to RDF: First Editor's Draft (ACTION-6) from Gregg Kellogg on 2011-10-13 (public-html-data-tf@w3.org from October 2011)

From: Gregg Kellogg <gregg@kellogg-assoc.com>
Date: Thu, 13 Oct 2011 14:09:22 -0400
To: Jeni Tennison <jeni@jenitennison.com>
CC: "public-html-data-tf@w3.org" <public-html-data-tf@w3.org>
Message-ID: <BB2E195F-D0FF-473C-9FD2-3E43B7BC6A36@greggkellogg.net>
On Oct 13, 2011, at 9:57 AM, Jeni Tennison wrote:

> Gregg,
> 
> On 12 Oct 2011, at 20:26, Gregg Kellogg wrote:
>> I have created a draft of the Microdata to RDF transformation and uploaded it to our Mercurial repository [1]. Note that the links to the draft go to the repository, and the actual draft can be used by selecting the "raw" form of the document [2]. (The lack of a current checked-out version of the Mercurial repository that can be used for direct references should be addressed at some point).
> 
> Thanks, that's looking good.
> 
>> Notable changes between this draft and the algorithm given in [3]:
>> 
>> * If a page has more than one top-level item, they are expressed in an RDF Collection to preserve original item order.
>> * If an item property has more than one value, all values are expressed in an RDF Collection to preserve original value order.
> 
> I think it might be worth, perhaps in the introduction, talking about how the goal for the mapping is to balance the preservation of information from the original microdata and the creation of idiomatic RDF. You might say something about how the results of the conversion may have to go through some level of vocabulary-specific mapping (which might include assigning datatypes to values, mapping collections to repeated properties and so on) after extraction. (Let me know if you want me to put words together for that).

Sure, I was focusing on the procedure rather than the procedural prose for now. Any suggestions for introductory, normative or other text that should go in are encouraged.

> The other thing about collections is that it looks as though you've based whether or not to create a collection for the values of a property purely on the values of the particular instance of the property. An alternative would be that if the property is used with multiple values *anywhere* in the page, it should create a collection (possibly with a single value) for consistency.
> 
> For example, if you have:
> 
>  <p itemscope itemtype="http://example.org/Book">
>    A Book written by <span itemprop="author">A.N. Author</span>
>  </p>
>  <p itemscope itemtype="http://example.org/Book">
>    Another Book written by <span itemprop="author">A.N. Author</span> and <span itemprop="author">A.N. Other</span>
>  </p>
> 
> then I think you should get:
> 
>  @prefix eg: <http://example.org/> .
>  [] a eg:Book ;
>    eg:author ("A.N. Author") ;
>    .
>  [] a eg:Book ;
>    eg:author ("A.N. Author" "A.N. Other") ;
>    .
> 
> rather than:
> 
>  @prefix eg: <http://example.org/> .
>  [] a eg:Book ;
>    eg:author "A.N. Author" ;
>    .
>  [] a eg:Book ;
>    eg:author ("A.N. Author" "A.N. Other") ;
>    .
> 
> What do you think?

Actually, this would be the only way to create multiple values of a property that weren't in a collection, so I'd rather keep it the way it is, but I can see your point. Note that, on a separate list [7], mfhepp worried about the use of collections at all, as they do not allow appropriate GoodRelations mappings. We really need to solicit more input on the whole notion of preserving order through collections when deriving RDF from Microdata.
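
To make the two options concrete, here is a rough Python sketch of both policies. The data model is an assumption made only for illustration: each item is a list of (property, value) pairs in document order, and a Python list stands in for an RDF Collection. The function names are hypothetical and do not come from the draft.

```python
from collections import defaultdict

def property_values(item):
    """Group an item's (property, value) pairs by property, keeping document order."""
    grouped = defaultdict(list)
    for name, value in item:
        grouped[name].append(value)
    return dict(grouped)

def per_instance(items):
    """Draft behaviour as described above: a collection only where this item repeats the property."""
    out = []
    for item in items:
        grouped = property_values(item)
        out.append({n: (v if len(v) > 1 else v[0]) for n, v in grouped.items()})
    return out

def page_wide(items):
    """Jeni's alternative: a collection everywhere if any item on the page repeats the property."""
    grouped_items = [property_values(item) for item in items]
    multi = {n for g in grouped_items for n, v in g.items() if len(v) > 1}
    return [{n: (v if n in multi else v[0]) for n, v in g.items()} for g in grouped_items]

# The two Book items from the example above.
books = [
    [("author", "A.N. Author")],
    [("author", "A.N. Author"), ("author", "A.N. Other")],
]
```

With the two Book items, per_instance(books) leaves the first author as a bare literal, while page_wide(books) wraps it in a one-element list, matching the two Turtle outputs quoted above.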

>> * @itemprop names which are not absolute URIs are resolved as relative URIs either to @itemtype or Document base.
>> * Resolving @itemprop names against @itemtype uses a modified algorithm using everything after "/" or "#" in the type URI.
> 
> Hixie rightly points out in [5]:
> 
> 
>> Note that the property "name" in the vocabulary "http://example.org/feline"
>> and the property "http://example.org/feline#name" have absolutely no
>> relationship in microdata. They are different properties and cannot be 
>> mechanically considered to be equivalent in any way. Any use of microdata 
>> that claims that a full URL property name is the same property as a short 
>> name in a specific vocabulary is wrong. It's two properties. They might 
>> have the same semantics and can be used as equivalent, but they are 
>> different properties and any specification that defines or uses both would 
>> need to define how to handle clashes.
> 
> 
> There are two things that come out of that.
> 
> First is that the microdata-RDF mapping spec should flag up that the generation of property names used in the spec is a wilful violation [6] of the microdata specification to create URIs which are recognisable to the users of most existing vocabularies. (The only one I know that doesn't adhere to the pattern used in the microdata/RDF mapping is the hCalendar vocabulary in the WHATWG microdata spec.)

Agreed.
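
For reference, the property URI generation under discussion could be sketched roughly as below. This is one reading of the rule, not the draft's normative text: it assumes the vocabulary prefix is the @itemtype URI up to and including its last "/" or "#", which is consistent with eg:author being produced for http://example.org/Book in the example above. The function name and parameters are hypothetical.

```python
from urllib.parse import urljoin, urlparse

def property_uri(name, item_type=None, base=""):
    """Turn an @itemprop name into a property URI (illustrative reading only)."""
    # Names that already carry a scheme are treated as absolute and left alone.
    if urlparse(name).scheme:
        return name
    if item_type:
        # Assumed rule: keep the type URI up to and including its last "/" or "#",
        # then append the short name.
        cut = max(item_type.rfind("/"), item_type.rfind("#"))
        return item_type[:cut + 1] + name
    # No @itemtype: resolve the name against the document base.
    return urljoin(base, name)
```

Under these assumptions, property_uri("author", "http://example.org/Book") gives http://example.org/author, which is exactly the kind of URI that the microdata spec says is a different property from the short name.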

> Second is that the spec needs to be clear about what happens when a short name for a property is turned into a URL that is also used in its full form in a property on that same item. My suggestion would be that the values are merged; the difficulty with that is in preserving the order of the values. Perhaps get the relevant property elements and sort them into document order before extracting their values?

Nasty (and probably artificial) corner case. I'd suggest we leave the behavior as UNDEFINED. Otherwise, we'd need to re-implement element.properties. Advice to authors is to either be explicit about @itemprop names or rely on inference, but not both.
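
For comparison, the merge Jeni suggests might look something like the sketch below. It is only an illustration of that suggestion, not proposed behavior: the (document_position, itemprop_name, value) triples, the function names, and the simplified name resolution are all assumptions.

```python
from collections import defaultdict

def resolve(name, item_type):
    """Assumed rule: absolute names pass through; short names are appended to the
    type URI truncated after its last "/" or "#"."""
    if "://" in name:
        return name
    cut = max(item_type.rfind("/"), item_type.rfind("#"))
    return item_type[:cut + 1] + name

def merged_values(item_type, props):
    """Merge short-name and full-URI uses of a property, ordered by document position.

    props: (document_position, itemprop_name, value) triples, in any order.
    """
    merged = defaultdict(list)
    for _, name, value in sorted(props, key=lambda p: p[0]):
        merged[resolve(name, item_type)].append(value)
    return dict(merged)

# One item using both the short name and the full URI for the same property.
props = [
    (2, "http://example.org/author", "A.N. Other"),
    (1, "author", "A.N. Author"),
]
```

Here merged_values("http://example.org/Book", props) puts both values under http://example.org/author in document order. Getting the document positions is the expensive part, which is the re-implementation of element.properties mentioned above.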

>> * The property value definition is updated as follows:
>>  * Values are returned as Literal, URI Reference or Blank Node
>>  * Time elements with a @datetime attribute use lexical matching against xsd:date, xsd:time, and xsd:dateTime to create an appropriate typed literal
>>  * Plain literals get language from elements' in-scope @lang
>>  * blockquote and q with @cite attribute generate a URI Reference value
> 
> I don't think that the cite attributes should generate a URI reference value; that's not the item value according to the microdata rules, and I think it would be confusing for a <blockquote> or <q> to generate different values in the microdata parse from the generated RDF.

Fair enough. It was in the original GRDDL-like interpretation at document level, so I don't think it's unreasonable, but it is an exception to element.itemValue that would be easier to leave out.

It also seems that <time> is likely going away; if it's replaced with <data>, we may be able to get a broader kind of datatype inference. Time will tell.
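
As an illustration of the lexical matching on @datetime values, a minimal Python sketch follows. The regular expressions are simplified approximations of the xsd:date, xsd:time and xsd:dateTime lexical forms, written for this example; they check shape only (a month of 13 would still match) and are not taken from the draft.

```python
import re

XSD = "http://www.w3.org/2001/XMLSchema#"

# Simplified lexical shapes (assumptions for illustration, not full XSD validation).
_TZ = r"(Z|[+-]\d{2}:\d{2})?"
_DATE = r"-?\d{4,}-\d{2}-\d{2}"
_TIME = r"\d{2}:\d{2}:\d{2}(\.\d+)?"

PATTERNS = [
    (re.compile(_DATE + "T" + _TIME + _TZ), XSD + "dateTime"),
    (re.compile(_DATE + _TZ), XSD + "date"),
    (re.compile(_TIME + _TZ), XSD + "time"),
]

def datetime_literal(value):
    """Return (lexical form, datatype URI), or (lexical form, None) for a plain literal."""
    for pattern, datatype in PATTERNS:
        if pattern.fullmatch(value):
            return (value, datatype)
    return (value, None)
```

A value such as "2011-10-13" would come back typed as xsd:date, while anything that matches none of the three shapes stays a plain literal.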

We could also look at a separate datatype entailment process, not related to Microdata; RDFa could benefit from this too. It can't be done at processing time, due to the need to avoid external dependencies. But a separate spec could discuss using rdfs:range statements in identified vocabularies for performing datatype entailment of plain literals, replacing the original values.
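
A rough sketch of what such a post-processing step might do, under several assumptions: triples are plain Python tuples, a literal object is a (lexical, datatype-or-None) pair, the vocabulary has already been fetched, and only rdfs:range values in the XSD namespace are treated as datatypes. None of these names comes from an existing spec or library.

```python
RDFS_RANGE = "http://www.w3.org/2000/01/rdf-schema#range"
XSD = "http://www.w3.org/2001/XMLSchema#"

def entail_datatypes(triples, vocabulary):
    """Replace plain literals with typed ones where the vocabulary declares an XSD range.

    triples:    (subject, predicate, object); literal objects are (lexical, datatype_or_None).
    vocabulary: (subject, predicate, object) triples with URI strings throughout.
    """
    # Collect property -> datatype from rdfs:range statements (XSD ranges only, by assumption).
    ranges = {
        s: o
        for s, p, o in vocabulary
        if p == RDFS_RANGE and isinstance(o, str) and o.startswith(XSD)
    }
    result = []
    for s, p, o in triples:
        # Only untyped (plain) literals are rewritten; URIs and typed literals pass through.
        if isinstance(o, tuple) and o[1] is None and p in ranges:
            o = (o[0], ranges[p])
        result.append((s, p, o))
    return result

# Hypothetical vocabulary and extracted data.
vocab = [("http://example.org/published", RDFS_RANGE, XSD + "date")]
data = [
    ("_:b0", "http://example.org/published", ("2011-10-13", None)),
    ("_:b0", "http://example.org/author", ("A.N. Author", None)),
]
```

Running entail_datatypes(data, vocab) would type the published value as xsd:date and leave the author literal plain. Because it operates on already-extracted triples, the same step could apply to RDFa output as well.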

Gregg

>> [1] https://dvcs.w3.org/hg/htmldata/
>> [2] https://dvcs.w3.org/hg/htmldata/raw-file/24af1cde0da1/microdata-rdf/index.html
>> [3] http://www.w3.org/TR/2011/WD-microdata-20110525/
>> [4] http://dev.w3.org/html5/md/Overview.html
> 
> [5] http://lists.w3.org/Archives/Public/public-html-data-tf/2011Oct/0067.html
> [6] http://www.w3.org/TR/html5/introduction.html#willful-violation
[7] http://lists.w3.org/Archives/Public/public-vocabs/2011Oct/0032.html
> -- 
> Jeni Tennison
> http://www.jenitennison.com
>
Received on Thursday, 13 October 2011 18:11:36 UTC