Re: Microdata to RDF: First Editor's Draft (ACTION-6)

On Oct 18, 2011, at 1:08 PM, Martin Hepp wrote:

Hi Gregg:

See inline comments:
On Oct 16, 2011, at 6:56 AM, Gregg Kellogg wrote:

On Oct 15, 2011, at 12:52 PM, "Martin Hepp" wrote:

2. try to create proper typed RDF literals if the vocabulary defines a single xsd datatype for a datatype property,

Requiring a processor to read a vocabulary to discover rdfs:range assertions was considered and rejected by the RDFa WG, because of the burden it places on a processor (difficult to do in JavaScript due to same-domain issues).

This could be alleviated if we whitelisted a limited set of vocabularies, but that might not scale well and would require continuous action to keep current.

As an alternative, I've considered a separate datatype inference pass that could yield a graph with datatyped literals from one with plain literals only. Would that be useful?

Yes, if you mean to infer the datatype from looking at the data without looking at the vocabulary, so that

<div itemtype="" itemscope about="">
 <div itemprop="foo">30.20</div>

would yield a <> ;
<> "30.20"^^xsd:decimal.

based on the fact that xsd:decimal is likely the best datatype for "30.20", no matter what range was defined for "foo".

I did not check this, but I assume that most SPARQL implementations would yield proper results even if xsd:int, xsd:integer, xsd:decimal, xsd:float, and xsd:double were mixed.

Example: A parser based on looking at the data would most likely do the following:

a) integer <= 2,147,483,647 -> xsd:int
<div itemtype="" itemscope about="">
 <div itemprop="foo">30</div>
</div> a <> ;
<> "30"^^xsd:int.

b) integer > 2,147,483,647 -> xsd:integer
<div itemtype="" itemscope about="">
 <div itemprop="foo">3000000000</div>  <!-- biggest int is 2,147,483,647 -->
</div> a <> ;
<> "3000000000"^^xsd:integer.   <------------- xsd:integer instead of xsd:int

c) non-integer number -> xsd:decimal
<div itemtype="" itemscope about="">
 <div itemprop="foo">30.0</div>
</div> a <> ;
<> "30.0"^^xsd:decimal.

I assume that a triplestore will work correctly with an xsd:decimal literal when xsd:float was expected, etc.
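The three cases a) to c) above can be sketched as a small lexical classifier. This is only an illustration of the rule as described, not a proposal for spec text: the function name `infer_datatype` and the regular expressions are assumptions, and anything that does not look like a number is left as a plain literal.

```python
import re

# Bounds of xsd:int (a 32-bit signed integer).
XSD_INT_MIN = -2_147_483_648
XSD_INT_MAX = 2_147_483_647

# Lexical shapes: optional sign plus digits, or digits with a decimal point.
_INTEGER = re.compile(r"[+-]?[0-9]+")
_DECIMAL = re.compile(r"[+-]?([0-9]+\.[0-9]*|\.[0-9]+)")


def infer_datatype(lexical):
    """Guess an xsd datatype from the text alone; None means 'keep plain'."""
    text = lexical.strip()
    if _INTEGER.fullmatch(text):
        value = int(text)
        # Case a): fits in xsd:int; case b): too large, so use xsd:integer.
        if XSD_INT_MIN <= value <= XSD_INT_MAX:
            return "xsd:int"
        return "xsd:integer"
    if _DECIMAL.fullmatch(text):
        # Case c): non-integer number.
        return "xsd:decimal"
    return None


for sample in ["30", "3000000000", "30.0", "used"]:
    print(sample, "->", infer_datatype(sample))
```

Note that a purely lexical rule like this also types identifier-like strings: `infer_datatype("1234567890123")` gives `xsd:integer`, even where a publisher would want a string.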

And we should always keep in mind that meaningful data consumers will always have to do data cleansing and sanity checks, so enforcing data quality at the specification level is likely overrated. A fool with a spec is still a fool.

Well, I don't think we can always infer the datatype by performing lexical matching; I think this would be too aggressive. In ISSUE-1 [1], I suggest that this could be done by processing the vocabulary at parse time, or from a cached version of the vocabulary. For most purposes, important vocabularies will be cached anyway, so this doesn't impose a real runtime burden in the normal case. If it's not cached, there should probably be language that allows an implementation to produce triples using a generic algorithm, perhaps with information in a processor graph (à la RDFa) that indicates this choice.
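To make the cached-vocabulary idea concrete, here is a rough sketch of how a processor might behave. Everything in it is hypothetical: the cache layout, the `http://example.org/vocab#` namespace, the property names, and the shape of the processor-graph note are assumptions for illustration, not anything defined in the draft.

```python
# Hypothetical cache: vocabulary URI -> {property name -> xsd datatype}.
# A real processor would fill this from vocabulary documents it has fetched.
VOCAB_CACHE = {
    "http://example.org/vocab#": {
        "price": "xsd:float",
        "sku": "xsd:string",
    },
}


def type_literal(vocab, prop, lexical, processor_graph):
    """Return (lexical, datatype); datatype is None for a plain literal."""
    ranges = VOCAB_CACHE.get(vocab)
    if ranges is None:
        # Vocabulary not cached: fall back to the generic behaviour and
        # record that choice so a consumer can tell what happened.
        processor_graph.append(
            ("warning", f"vocabulary {vocab} not cached; plain literal emitted")
        )
        return (lexical, None)
    # Vocabulary cached: use its declared datatype if the property has one.
    return (lexical, ranges.get(prop))


notes = []
print(type_literal("http://example.org/vocab#", "price", "99.99", notes))
print(type_literal("http://example.org/other#", "price", "99.99", notes))
print(notes)
```

The point of the second return path is that the output stays usable when the vocabulary is unavailable, while the note in `processor_graph` makes the fallback visible.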



3. suppress the generation of RDF collections and other meta-data patterns that make the data break for SPARQL queries that would work for the same pattern in RDFa.

I'm quite sympathetic to this view. Using collections was an attempt to ensure that the semantic interpretation of Microdata was consistent between RDF and JSON conversions, but I also question its value for RDF.

This also requires an issue.
Yes. As said, it could be specified at the vocabulary level or as a parser setting.

For instance, the attached two examples should result in roughly the same triples.

One idea for implementing this is to define an owl:AnnotationProperty for owl:Ontology that sets the Microdata parsing mode.

Interesting idea, this would make processing for a given vocabulary unambiguous, but it would also require that the vocabulary be processed when parsing.

Yes, or a simple hash-table for common vocabularies. Since the consuming clients need to implement data cleansing heuristics anyway, they could also use heuristics for this decision. Also note that it is more important for the client to know whether to expect collections or not than for the vocabulary to control the parsing mode. So defining parsing modes may be the simpler approach.
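A minimal sketch of the hash-table approach, assuming a parser that already has the property values in document order. The table contents, the mode names "collection" and "plain", and the blank-node labels are all hypothetical; the sketch only shows how the two modes lead to different triples for the same markup.

```python
# Hypothetical lookup table: vocabulary URI -> parsing mode.
PARSING_MODES = {
    "http://example.org/ordered#": "collection",
    "http://example.org/flat#": "plain",
}


def emit_values(subject, prop, values, vocab, default_mode="plain"):
    """Return triples for a multi-valued property under the chosen mode."""
    mode = PARSING_MODES.get(vocab, default_mode)
    if mode == "plain":
        # One triple per value, the same pattern RDFa would produce.
        return [(subject, prop, value) for value in values]
    if not values:
        return [(subject, prop, "rdf:nil")]
    # Collection mode: build an rdf:List. The labels are only unique
    # within a single call, which is enough for this illustration.
    nodes = [f"_:l{i}" for i in range(len(values))]
    triples = [(subject, prop, nodes[0])]
    for i, value in enumerate(values):
        rest = nodes[i + 1] if i + 1 < len(values) else "rdf:nil"
        triples.append((nodes[i], "rdf:first", value))
        triples.append((nodes[i], "rdf:rest", rest))
    return triples


for vocab in ("http://example.org/flat#", "http://example.org/ordered#"):
    print(emit_values("_:item", "ex:name", ["a", "b"], vocab))
```

A query for `?s ex:name ?o` matches the values directly only in the plain mode, which is why the client needs to know which mode was used.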



Thanks very much for your constructive feedback.

You are very welcome!

P.S., I also note that your RDFa example assumes some datatype inference; doing this through post-processing would satisfy both Microdata and RDFa use cases.
I must admit I don't get what you want to say with this. There are cases in GR where the rdfs:range of a property is not the actual best xsd:datatype but a supertype, but these cases are rare.

a) Microdata
<div itemscope itemtype="" itemid="#offer">
<div itemprop="name">Hepp Personal SCSI Controller Card</div>
<div itemprop="description">The Hepp Personal SCSI is a 16-bit
add-on card that allows attaching up to seven SCSI devices to your computer.</div>
<link itemprop="hasBusinessFunction"
  href="" />
<div itemscope itemprop="hasPriceSpecification"
 <meta itemprop="hasCurrency" content="USD">$
 <span itemprop="hasCurrencyValue">99.99</span>
 <time itemprop="validThrough" datetime="2012-11-30T23:59:59Z"></time>
Condition: <div itemprop="condition">used</div>
EAN/UPC: <span itemprop="hasEAN_UCC-13">1234567890123</span>
MPN: <span itemprop="hasMPN">PSCSI</span>
Article No. <span itemprop="hasStockKeepingUnit">123-456</span>
Availability: <span itemscope itemprop="hasInventoryLevel"
 <meta itemprop="hasMinValueFloat" content="1.0">In-stock

<img itemprop="" src=""
    alt="text" />
<link itemprop="" href="" />

b) RDFa

<div typeof="gr:Offering" about="#offer">
<div property="gr:name">Hepp Personal SCSI Controller Card</div>
<div property="gr:description">The Hepp Personal SCSI is a 16-bit add-on card that allows
attaching up to seven SCSI devices to your computer.</div>
<div rel="gr:hasBusinessFunction"
<div rel="gr:hasPriceSpecification">
 <div typeof="gr:UnitPriceSpecification">Price:
  <span property="gr:hasCurrency" content="USD">$</span>
  <span property="gr:hasCurrencyValue">99.99</span>
  <div property="gr:validThrough" datatype="xsd:dateTime"
Condition: <div property="gr:condition">used</div>
EAN/UPC: <span property="gr:hasEAN_UCC-13" datatype="xsd:string">1234567890123</span>
MPN: <span property="gr:hasMPN" datatype="xsd:string">PSCSI</span>
Article No. <span property="gr:hasStockKeepingUnit" datatype="xsd:string">123-456</span>
Availability: <div rel="gr:hasInventoryLevel">
    <div typeof="gr:QuantitativeValue">
      <div property="gr:hasMinValueFloat" content="1.0" datatype="xsd:float">In-stock</div>
<div rel="schema:image">
 <img src="" alt="text" />
<div rel="foaf:page" resource=""></div>

Received on Wednesday, 19 October 2011 07:14:26 UTC