Re: Microdata to RDF conversion from Philip Jägenstedt on 2010-01-20 (public-html@w3.org from January 2010)

From: Philip Jägenstedt <philipj@opera.com>
Date: Wed, 20 Jan 2010 23:35:01 +0100
To: "Ian Hickson" <ian@hixie.ch>
Cc: "HTML WG" <public-html@w3.org>
Message-ID: <op.u6urofgcsr6mfa@worf>

On Tue, 19 Jan 2010 09:22:25 +0100, Ian Hickson <ian@hixie.ch> wrote:

> On Sun, 17 Jan 2010, Philip Jägenstedt wrote:

>> This algorithm uses the http://purl.org/dc/terms/ namespace, while the
>> mapping at <http://dev.w3.org/html5/md/#conversion-to-rdf> uses the
>> http://purl.org/dc/elements/1.1/ namespace. http://purl.org/dc/terms/
>> seems to be the canonical namespace at this time, so I suggest just
>> using that.
>
> Wait, what? I'm confused. What exactly are you saying should change?

The works vocabulary maps itemprop="title" to
http://purl.org/dc/elements/1.1/title, while the algorithm for converting a
document to RDF maps <title>foo</title> to http://purl.org/dc/terms/title.
Unless there's some specific reason for this, I suggest using
http://purl.org/dc/terms/title in both cases, as /elements/1.1/ is
apparently a legacy namespace (see
http://dublincore.org/documents/dcmi-terms/#H3).
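
To make the suggestion concrete, here is a minimal sketch in Python of both
mappings sharing one predicate. The function names and the plain-tuple triple
representation are hypothetical, chosen only for illustration; they are not
taken from the spec.

```python
# Assumption: triples are plain (subject, predicate, object) tuples.
DC_TERMS = "http://purl.org/dc/terms/"


def works_title_triple(item_subject, value):
    """Triple for itemprop="title" in the works vocabulary (hypothetical helper)."""
    return (item_subject, DC_TERMS + "title", value)


def document_title_triple(document_url, value):
    """Triple for the document's <title> element (hypothetical helper)."""
    return (document_url, DC_TERMS + "title", value)


if __name__ == "__main__":
    print(works_title_triple("_:work", "foo"))
    print(document_title_triple("http://example.com/", "foo"))
```

With a single namespace constant, the two code paths cannot drift apart the
way the two spec sections currently do.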

>> There's an issue with how vocabularies that use subitems are currently
>> handled. In short, triples are only generated if the item either has a
>> type which is an absolute URL or if the item property is an absolute
>> URL. This prevents site-private data from being exported as RDF, which
>> is a good thing. However, for vocabularies which have an item type for
>> the top-level item but not for subitems (which seems quite unnecessary)
>> this means that no triples are generated for the subitems, even though
>> the subitem can reasonably be considered to be using the same vocabulary as
>> the typed top-level item. To illustrate the point, here's the output of
>> the RDF extraction (as Turtle) from the Jack Bauer example if the
>> current spec is honored: [...]
>>
>> As you see, the structured subitems org, adr, etc just point to blank
>> nodes with no further triples for those nodes. My fix is to pass on the
>> type of the parent item when generating triples for subitems as a default,
>> which is overridden if the subitem defines its own type (as e.g. agent
>> does in the above). I think this is sensible and it certainly produces a
>> more complete RDF graph: [...]
>
> That works if you know the vocabulary and thus know that the nested
> subitem is from that vocabulary, but it seems highly suspect in the case
> where you don't know that. Also, consider:
>
>    <div itemscope itemtype="http://example.com/person">
>     <p itemprop="school" itemscope>
>      I go to school in the <span itemprop="class">middle</span> classroom.
>     </p>
>     <p itemprop="demographics" itemscope>
>      I am <span itemprop="class">middle</span>-classed.
>     </p>
>    </div>
>
> (A bit contrived, but you get the idea.) It would be wrong to use the
> same predicate for both itemprop="class" cases.

Since no itemtype is used for itemprop="school", http://example.com/person  
must define this as part of its vocabulary, unless the above is an example  
of invalid markup. Since it's all one vocabulary, using the same prefix  
for the RDF predicates seems quite logical.
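
As a rough illustration of the difference, here is a minimal Python sketch of
triple generation with and without passing the parent's type down to untyped
subitems. Everything in it is an assumption made for the example: items are
modelled as dicts, blank nodes are named _:b0, _:b1, ..., and the predicate is
formed as type + "#" + name, which is a placeholder and not the spec's exact
rule.

```python
from itertools import count
from urllib.parse import urlparse


def is_absolute_url(value):
    """Rough check: any non-empty string with a URL scheme counts as absolute."""
    return bool(value) and bool(urlparse(value).scheme)


def generate_triples(item, inherit_type=True):
    """Return (subject, predicate, object) triples for a nested item dict.

    With inherit_type=False, untyped subitems produce no triples of their
    own. With inherit_type=True, they fall back to the parent's type unless
    they define one themselves.
    """
    triples = []
    ids = count()

    def walk(node, parent_type):
        subject = f"_:b{next(ids)}"
        own_type = node.get("type")
        if is_absolute_url(own_type):
            effective_type = own_type      # the subitem's own type wins
        elif inherit_type:
            effective_type = parent_type   # proposed default: parent's type
        else:
            effective_type = None          # no type, so no vocabulary prefix
        for name, values in node.get("properties", {}).items():
            for value in values:
                obj = walk(value, effective_type) if isinstance(value, dict) else value
                if is_absolute_url(name):
                    predicate = name
                elif effective_type is not None:
                    # Placeholder predicate scheme, assumed for this sketch.
                    predicate = f"{effective_type}#{name}"
                else:
                    continue               # site-private data: no triple
                triples.append((subject, predicate, obj))
        return subject

    walk(item, None)
    return triples


# The person/school/demographics example from above, as nested dicts.
person = {
    "type": "http://example.com/person",
    "properties": {
        "school": [{"properties": {"class": ["middle"]}}],
        "demographics": [{"properties": {"class": ["middle"]}}],
    },
}

if __name__ == "__main__":
    for triple in generate_triples(person, inherit_type=False):
        print("without inheritance:", triple)
    for triple in generate_triples(person, inherit_type=True):
        print("with inheritance:   ", triple)
```

Without inheritance, the school and demographics blank nodes have no triples
of their own. With it, both "class" properties get the same predicate under
the http://example.com/person prefix, which is the behaviour being debated
here.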

> Long story short, I think it's better to just use itemtype="" everywhere
> you want to start a new vocabulary. This does mean the vCard vocabulary
> doesn't really convert to RDF well, but what's the use case for that? I
> would have thought most people would just use vCard, if they wanted to
> convert this to another format.

Surely the org, adr, tel, etc. subitems in the hcard vocabulary shouldn't
be considered independent, reusable vocabularies in themselves? If they
are, then they should have their own itemtype, just like agent does
(another hcard). In short, subitems are sometimes used to structure data
within the same domain. This is most obvious with the "n" property/item,
which is more or less a structured version of "fn". If only completely
flat microdata vocabularies can be converted to RDF using the
generic algorithm and everything else requires vocabulary-specific
workarounds or round trips through a third format, then the RDF extraction
algorithm isn't of much use at all and should be dropped. I would prefer,
though, that we keep some interoperability with the RDF
model/toolchains, especially in a case like this where the result mirrors
the microdata tree without surprises.

(For what it's worth, http://www.w3.org/TR/vcard-rdf uses a single  
namespace.)

Thanks for all the other fixes!

-- 
Philip Jägenstedt
Core Developer
Opera Software
Received on Wednesday, 20 January 2010 22:35:37 UTC