Re: Microdata to RDF: First Editor's Draft (ACTION-6)

Hi Gregg:

See inline comments:
On Oct 16, 2011, at 6:56 AM, Gregg Kellogg wrote:

> On Oct 15, 2011, at 12:52 PM, "Martin Hepp" <> wrote:
>> Hi Greg, all:
>> With respect to
>> I strongly suggest defining a "Compatibility parsing method" that produces RDF data as close as possible to the RDF data that the same vocabulary / data patterns would yield from RDFa.
> Interesting idea about having a modal processor, but I'd like to see if we can avoid it, unless it could be defined in the markup itself.
>> For instance, it should
>> 1. produce property URIs by attaching the local property name to the base URI of the vocabulary, and not to the URI of the itemtype or document,
> This would require a means of specifying a vocabulary. In the absence of a means of doing this using defined attributes, we chose to infer the vocabulary from the type. Admittedly not ideal, and as Ivan has noted when this was suggested for RDFa, it conflates two separate concepts. FWIW, I would support having Microdata honor the @vocab attribute and inference rules from RDFa, but this would require action from the HTML WG.
> Did you have some idea for establishing the vocabulary? Otherwise, inferring the vocabulary from the type URI seems like the best option to me.

Without having thought about this deeply, I would say that inferring the vocabulary from the type URI seems the best option to me.
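
To make the idea concrete, here is a minimal sketch of how a processor might infer the vocabulary from the type URI and build property URIs from it. The function names and the example.org URIs are made up for illustration, and the rule (cut at "#", otherwise at the last "/") is only one plausible heuristic, not something the draft defines. It also assumes the type URI has at least one path segment.

```python
from urllib.parse import urldefrag, urlsplit


def vocabulary_from_type(type_uri: str) -> str:
    """Guess the vocabulary base URI from an itemtype URI (heuristic)."""
    base, fragment = urldefrag(type_uri)
    if fragment:
        # Hash namespace: keep everything up to and including '#'.
        return base + "#"
    # Slash namespace: drop the last path segment, keep the trailing '/'.
    return type_uri.rsplit("/", 1)[0] + "/"


def property_uri(type_uri: str, name: str) -> str:
    """Attach a local property name to the inferred vocabulary URI."""
    if urlsplit(name).scheme:
        # The name already looks like an absolute URI: use it unchanged.
        return name
    return vocabulary_from_type(type_uri) + name


print(property_uri("http://example.org/vocab#Offering", "name"))
print(property_uri("http://example.org/vocab/Offering", "name"))
```

With this rule, the property URI depends only on the vocabulary part of the itemtype and not on the type's local name or on the document URI.
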
>> 2. try to create proper typed RDF literals if the vocabulary defines a single xsd datatype for a datatype property,
> Requiring a processor to read a vocabulary to discover rdfs:range assertions was considered and rejected by the RDFa WG, because of the burden it places on a processor (difficult to do in JavaScript due to same domain issues).
> This could be alleviated if we whitelisted a limited set of vocabularies, but that might not scale well and would require continuous action to keep current.
> As an alternative, I've considered a separate datatype inference pass that could yield a graph with datatyped literals from one with plain literals only. Would that be useful?

Yes, if you mean to infer the datatype from looking at the data without looking at the vocabulary, so that

<div itemtype="" itemscope about="">
  <div itemprop="foo">30.20</div>
</div>

would yield a <> ;
	<> "30.20"^^xsd:decimal.

based on the fact that xsd:decimal is likely the best datatype for "30.20", no matter what range was defined for "foo".

I did not check this, but I assume that most SPARQL implementations would yield proper results even if xsd:int, xsd:integer, xsd:decimal, xsd:float, and xsd:double were mixed.

Example: A parser based on looking at the data would most likely do the following:

a) integer <= 2,147,483,647 -> xsd:int
<div itemtype="" itemscope about="">
  <div itemprop="foo">30</div>
</div> a <> ;
	<> "30"^^xsd:int.

b) integer > 2,147,483,647 -> xsd:integer
<div itemtype="" itemscope about="">
  <div itemprop="foo">3000000000</div>  <!-- biggest int is 2,147,483,647 -->
</div> a <> ;
	<> "3000000000"^^xsd:integer.   <------------- xsd:integer instead of xsd:int

c) non-integer number -> xsd:decimal
<div itemtype="" itemscope about="">
  <div itemprop="foo">30.0</div> 
</div> a <> ;
	<> "30.0"^^xsd:decimal.
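
The three rules a) to c) can be sketched roughly as follows. This is only an illustration of a data-driven inference pass, not a definitive algorithm: the function name is hypothetical, the lower bound of xsd:int is my addition, and anything that does not look like a plain number is simply left untyped.

```python
import re

# Value range of xsd:int (32-bit signed integer).
XSD_INT_MIN, XSD_INT_MAX = -2_147_483_648, 2_147_483_647

_INTEGER = re.compile(r"[+-]?[0-9]+")
_DECIMAL = re.compile(r"[+-]?([0-9]+\.[0-9]*|\.[0-9]+)")


def infer_datatype(lexical: str):
    """Return a datatype guessed from the literal's text, or None."""
    text = lexical.strip()
    if _INTEGER.fullmatch(text):
        # Rules a) and b): small integers become xsd:int, larger ones xsd:integer.
        if XSD_INT_MIN <= int(text) <= XSD_INT_MAX:
            return "xsd:int"
        return "xsd:integer"
    if _DECIMAL.fullmatch(text):
        # Rule c): a non-integer number becomes xsd:decimal.
        return "xsd:decimal"
    # Not recognizably numeric: keep it as a plain literal.
    return None


for value in ["30", "3000000000", "30.0", "used"]:
    print(value, infer_datatype(value))
```

One caveat: identifiers that happen to consist of digits, such as an EAN/UPC code, would be typed as numbers by such a pass, although xsd:string is what one wants there. This is one of the cases where the consuming client needs its own heuristics.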

I assume that a triplestore will work correctly with an xsd:decimal literal when xsd:float was expected etc.

And we should always keep in mind that meaningful data consumers will always have to do data cleansing and sanity checks, so enforcing data quality at the specification level is likely overrated. A fool with a spec is still a fool.

>> 3. suppress the generation of RDF collections and other meta-data patterns that make the data break for SPARQL queries that would work for the same pattern in RDFa.
> I'm quite sympathetic to this view. Using collections was an attempt to ensure that the semantic interpretation of Microdata was consistent between RDF and JSON conversions, but I also question its value for RDF.
> This also requires an issue.
Yes. As said, it could be specified at the vocabulary level or as a parser setting.
>> For instance, the attached two examples should result in roughly the same triples.
>> One idea for implementing this is to define an owl:AnnotationProperty for owl:Ontology that sets the Microdata parsing mode.
> Interesting idea, this would make processing for a given vocabulary unambiguous, but it would also require that the vocabulary be processed when parsing.

Yes, or a simple hash-table for common vocabularies. Since the consuming clients need to implement data cleansing heuristics anyway, they could also use heuristics for this decision. Also note that it is more important for the client to know whether to expect collections or not than for the vocabulary to control the parsing mode. So defining parsing modes may be the simpler approach.
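
A minimal sketch of such a hash-table, assuming a processor that keys the parsing mode on the vocabulary prefix of the itemtype. The mode names and the example.org / example.com vocabulary URIs are placeholders I made up for illustration; the draft does not define any of them.

```python
# Hypothetical mode names; the draft does not define these.
COMPATIBLE = "compatible"  # RDFa-like triples, no RDF collections
DEFAULT = "default"        # whatever the standard algorithm produces

# Placeholder vocabulary prefixes mapped to a parsing mode.
KNOWN_VOCABULARIES = {
    "http://example.org/shop#": COMPATIBLE,
    "http://example.com/terms/": DEFAULT,
}


def parsing_mode(type_uri: str) -> str:
    """Pick the parsing mode for an itemtype URI by longest matching prefix."""
    matches = [p for p in KNOWN_VOCABULARIES if type_uri.startswith(p)]
    if not matches:
        # Unknown vocabulary: fall back to the default mode.
        return DEFAULT
    return KNOWN_VOCABULARIES[max(matches, key=len)]


print(parsing_mode("http://example.org/shop#Offering"))
print(parsing_mode("http://unknown.example/Thing"))
```

The table avoids fetching and processing the vocabulary at parse time, at the cost of having to be maintained by hand.
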
>> Best
>> Martin
> Thanks very much for your constructive feedback.
You are very welcome!
> Gregg
> P.S., I also note that your RDFa example assumes some datatype inference; doing this through post-processing would satisfy both Microdata and RDFa use cases.
I must admit I don't get what you want to say with this. There are cases in GR where the rdfs:range of a property is not the actual best xsd:datatype but a supertype, but these cases are rare.

>> a) Microdata
>> <div itemscope itemtype="" itemid="#offer">
>> <div itemprop="name">Hepp Personal SCSI Controller Card</div>
>> <div itemprop="description">The Hepp Personal SCSI is a 16-bit 
>> add-on card that allows attaching up to seven SCSI devices to your computer.</div>
>> <link itemprop="hasBusinessFunction" 
>>    href="" />
>> <div itemscope itemprop="hasPriceSpecification" 
>>      itemtype="">Price: 
>>   <meta itemprop="hasCurrency" content="USD">$
>>   <span itemprop="hasCurrencyValue">99.99</span>
>>   <time itemprop="validThrough" datetime="2012-11-30T23:59:59Z"></time> 
>> </div>
>> Condition: <div itemprop="condition">used</div>
>> EAN/UPC: <span itemprop="hasEAN_UCC-13">1234567890123</span>
>> MPN: <span itemprop="hasMPN">PSCSI</span>
>> Article No. <span itemprop="hasStockKeepingUnit">123-456</span>
>> Availability: <span itemscope itemprop="hasInventoryLevel" 
>>      itemtype="">
>>   <meta itemprop="hasMinValueFloat" content="1.0">In-stock
>> </span>
>> <img itemprop="" src="" 
>>      alt="text" />
>> <link itemprop="" href="" />
>> </div>
>> b) RDFa
>> <div typeof="gr:Offering" about="#offer">
>> <div property="gr:name">Hepp Personal SCSI Controller Card</div>
>> <div property="gr:description">The Hepp Personal SCSI is a 16-bit add-on card that allows 
>> attaching up to seven SCSI devices to your computer.</div>
>> <div rel="gr:hasBusinessFunction" 
>>    resource=""></div>
>> <div rel="gr:hasPriceSpecification">
>>   <div typeof="gr:UnitPriceSpecification">Price: 
>>    <span property="gr:hasCurrency" content="USD">$</span>
>>    <span property="gr:hasCurrencyValue">99.99</span>
>>    <div property="gr:validThrough" datatype="xsd:dateTime" 
>>         content="2012-11-30T23:59:59Z"></div> 
>>   </div>
>> </div>
>> Condition: <div property="gr:condition">used</div>
>> EAN/UPC: <span property="gr:hasEAN_UCC-13" datatype="xsd:string">1234567890123</span>
>> MPN: <span property="gr:hasMPN" datatype="xsd:string">PSCSI</span>
>> Article No. <span property="gr:hasStockKeepingUnit" datatype="xsd:string">123-456</span>
>> Availability: <div rel="gr:hasInventoryLevel"> 
>>      <div typeof="gr:QuantitativeValue">
>>        <div property="gr:hasMinValueFloat" content="1.0" datatype="xsd:float">In-stock</div>
>>      </div>
>> </div>
>> <div rel="schema:image">
>>   <img src="" alt="text" />
>> </div>
>> <div rel="foaf:page" resource=""></div>
>> </div>

Received on Tuesday, 18 October 2011 20:09:06 UTC