Re: Yahoo's RDF vocabularies from Manu Sporny on 2009-03-24 (public-rdfa@w3.org from March 2009)

From: Manu Sporny <msporny@digitalbazaar.com>
Date: Mon, 23 Mar 2009 22:24:01 -0400
To: Peter Mika <pmika@yahoo-inc.com>
CC: RDFa <public-rdf-in-xhtml-tf@w3.org>, RDFa Community <public-rdfa@w3.org>
Message-ID: <49C84441.3020609@digitalbazaar.com>
Peter Mika wrote:
>> SearchMonkey outlines what people must do to get their video listings in
>> Yahoo's search service using the Yahoo Media vocabulary. So while you
>> can write your own SearchMonkey apps using your own vocabularies, you
>> can (right now) only use Yahoo's vocabulary if you want the enhanced
>> search listings to show up on the main Yahoo search page, right? This
>> sends a pretty strong message - use Yahoo's vocabularies or you won't
>> show up in the enhanced listings.
>>   
> It's again important to clarify that that is not the case. You can use
> any RDFa vocabulary and build a SearchMonkey application that acts on
> data formatted according to that vocabulary.

The point wasn't about what would show up in a developer's SearchMonkey
application - it was about what would show up in Yahoo's search results.

So, if Metacafe were to use this as their markup:

<div xmlns:dcterms="http://purl.org/dc/terms/"
     xmlns:media="http://purl.org/media#"
     xmlns:video="http://purl.org/media/video#"
     about="#cute-puppy" typeof="video:Recording">

<img rev="media:depiction"
   src="http://s.mcstatic.com/thumb/767922.jpg" />
<span property="dcterms:title">OMG Cute Puppies!</span>
<object rel="media:download"
   href="http://www.metacafe.com/fplayer/767922/cute_puppy.swf" />
</div>

and you did a search like this:

http://search.yahoo.com/search?p=site%3Ametacafe.com+cute+puppies

Would you get the extended video search listing, or just a regular search
listing?

I would hope that, in time, one would get an extended video search
listing using a variety of popular video vocabularies.

>> Taking something that already exists for syndication purposes and
>> transforming it into an RDF vocabulary on a 1-to-1 basis is not a best
>> practice because the syndication format makes some very strong
>> assumptions about the data in the stream. RSS data is fairly strongly
>> typed data, and is machine generated in a controlled environment. RDFa
>> data is usually not strongly typed, is generated by humans as well as
>> machines, and is not in a very controlled environment.
>>   
> I'm not sure I follow you on this distinction... what do you mean by RSS
> is 'fairly strongly typed' data while RDFa is not? RDFa has explicit
> typing, just like XML Schema.

Slight clarification - when I said "RSS data", I meant "Media RSS
data", but I don't think that was clear.

There are two parts to this issue:

1. Lack of typing in Yahoo's Media RDF vocabulary.
2. The implications of not specifying typing and ranges in RDF
   vocabularies.

The first part - Yes, RDFa has explicit typing, but Yahoo's media
vocabulary doesn't specify any ranges for the vocab. So, even though
RDFa can specify type info, Yahoo's Media vocabulary doesn't mention
what those types should be in the OWL definitions. There is no way to
validate that the triples are well typed.

Yet many of the properties are integer-only, such as:

media:channels
media:fileSize
media:height
media:views
media:width

Some are float:

media:framerate
media:samplingrate
media:bitrate

Some are xsd:duration:

media:duration

Some are set-based ranges:

mediarss:content @expression - can be (sample | full | nonstop)

However, none of that type information is mentioned at any point in the
vocabularies.

The second part is the effect that this has on publishing behavior. I'm
assuming that tools will be built to check type and range information,
but if Yahoo doesn't specify types and ranges, those tools will be unable
to tell which triples are valid and which are not.

In Yahoo Media RDF, specifying a sample rate of "forty-four kilohertz"
is just as valid as "44.1" and there is nothing in the spec that says
that one is more correct than the other. There is certainly no way for a
machine to make this distinction - which is important, because we need
to be able to do that in order to do basic data validation.
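To make the validation point concrete, here is a minimal Python sketch of
the kind of check a tool could run if ranges were declared. The
ASSUMED_RANGES table is hypothetical: it only encodes the types suggested
in the lists above, none of which the vocabulary actually states, and the
lexical patterns are simplified stand-ins for real XML Schema datatype
validation.

```python
import re

# Hypothetical range table: the vocabulary itself declares none of this.
# The datatypes are the ones suggested in the lists above.
ASSUMED_RANGES = {
    "media:channels": "integer",
    "media:fileSize": "integer",
    "media:height": "integer",
    "media:views": "integer",
    "media:width": "integer",
    "media:framerate": "float",
    "media:samplingrate": "float",
    "media:bitrate": "float",
    "media:duration": "duration",
}

# Simplified lexical checks, not full XML Schema datatype validation.
LEXICAL_FORMS = {
    "integer": re.compile(r"[+-]?\d+"),
    "float": re.compile(r"[+-]?(\d+(\.\d*)?|\.\d+)([eE][+-]?\d+)?"),
    "duration": re.compile(
        r"-?P(?=.)(\d+Y)?(\d+M)?(\d+D)?"
        r"(T(?=\d)(\d+H)?(\d+M)?(\d+(\.\d+)?S)?)?"
    ),
}


def is_valid(prop, literal):
    """Return True or False when a range is known, None when nothing can be checked."""
    datatype = ASSUMED_RANGES.get(prop)
    if datatype is None:
        return None  # no declared range, so the literal cannot be judged
    return LEXICAL_FORMS[datatype].fullmatch(literal.strip()) is not None
```

With a table like this, is_valid("media:samplingrate", "44.1") passes and
the spelled-out kilohertz value fails. Without one, every property falls
into the None branch and the two literals are indistinguishable to a
machine.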

If Yahoo's media vocab isn't specific, web authors will create bad
data for Yahoo - making your engineering team's job harder in the long
run. More importantly, it will be impossible to write automatic
validation tools so that authoring suites will generate proper output.

>> MediaRSS also contains both elements /and/ attributes to refine the
>> meaning of the elements. However, it seems that only the elements made
>> it over to ymedia, which is unfortunate because a great deal of semantic
>> fidelity is lost without the attributes.
>>   
> I'm not sure I understand what you are referring to. In this case width
> and height are attributes in XML, and properties in RDF.

Hmm, you're right - fewer attributes were left out of the port than I
had previously thought.

Here are some elements and attributes of Media RSS that didn't make it
into Yahoo's Media vocab:

mediarss:rating
mediarss:credit    - @role
mediarss:restriction
mediarss:thumbnail - @time
mediarss:content   - @expression, @isDefault
mediarss:text      - @start, @end

So, Yahoo's Media vocabulary wasn't necessarily a direct port of MediaRSS?

>> Also, the vocabulary specifies both ymedia:title /and/ suggests the use
>> of dc:title. There is no need for ymedia:title since you're just
>> re-defining what dc:title already does. There is an argument for helping
>> web authors by only requiring them to include one vocabulary, but it is
>> at the expense of teaching people that it's okay to re-create entire
>> vocabularies under that argument - which is detrimental to all of this
>> in the long run.
>>
>> I think the solution would be something along the lines of @profile
>> pre-loading a set of vocabularies, so Yahoo could use multiple
>> vocabularies in a stack without creating undue burden on the HTML author:
>>
>> <html profile="http://search.yahoo.com/searchmonkey-vocabs.html">
>> ... <span property="dc:title">Puppies</span> ...
>>   ... <span property="format:width" content="1080">HD</span> ..
>>   
> This would be incompatible with RDFa, no?

Right now, yes. In the future, maybe not.

We're working on a mechanism that would allow web authors to define
keyword and vocabulary bundles and include them via the @profile attribute.

This would enable markup like what is described above and uF-like markup
like this:

<body profile="http://microformats.org/profiles/hcard">
   <p>The @profile attribute can be used to load prefix declarations and
      token mappings from an external resource. This test was suggested
      by <span typeof="vcard" property="fn">Shane McCarron</span> and
      <span typeof="vcard" property="fn">Mark Birbeck</span>.
   </p>
</body>

which would generate the following triples:

_:bnode0
   <http://www.w3.org/1999/02/22-rdf-syntax-ns#type>
      <http://microformats.org/profiles/hcard#vcard> .
_:bnode0
   <http://microformats.org/profiles/hcard#fn>
      "Shane McCarron" .
_:bnode1
   <http://www.w3.org/1999/02/22-rdf-syntax-ns#type>
      <http://microformats.org/profiles/hcard#vcard> .
_:bnode1
   <http://microformats.org/profiles/hcard#fn>
      "Mark Birbeck" .

This would allow you to give your web authors a single profile URL for
all of Yahoo's vocabularies, so that they don't have to include 5-8
xmlns declarations every time they want to use Yahoo vocabularies.
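
As a rough illustration of the token mapping in the hCard example, the
Python sketch below reproduces those four triples. It assumes the
simplest possible rule, that a bare token resolves to the profile URL
plus '#' plus the token. The mechanism is still being worked out, so the
rule and the function names here are hypothetical.

```python
RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"


def expand_token(profile, token):
    # Assumed rule: a bare token resolves to <profile>#<token>.
    return profile + "#" + token


def vcard_triples(profile, names):
    """Yield N-Triples lines for one typeof="vcard" / property="fn" span per name.

    Names are written out as-is; a real serializer would escape quotes
    and backslashes in the literal.
    """
    for index, name in enumerate(names):
        subject = "_:bnode%d" % index
        yield "%s <%s> <%s> ." % (subject, RDF_TYPE, expand_token(profile, "vcard"))
        yield '%s <%s> "%s" .' % (subject, expand_token(profile, "fn"), name)


lines = list(vcard_triples("http://microformats.org/profiles/hcard",
                           ["Shane McCarron", "Mark Birbeck"]))
```

The author only ever writes the profile URL once; every token after that
is expanded by the processor.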

>>> We do publish OWL definitions for the vocabularies at [2].
>>>     
>>
>> Good! But that's so 2007! :)
>>
>> Why not mark up the same pages that define the human readable vocabulary
>> with a machine readable one using RDFa, like these pages do:
>>
>> http://purl.org/media/
>> http://purl.org/media/audio
>> http://purl.org/media/video
>> http://purl.org/commerce/
>>   
> No one has ever requested it until now ;) Why is it better than a
> separate OWL document? OWL is very 2009 ;)

It's better than a separate OWL document because it is both human and
machine-readable. OWL documents are machine-readable and
developer-readable. By keeping your OWL document separate from your
Yahoo developer documentation, you risk them getting out of sync.

Your OWL document is not reference-able by a computer, either. So, if I
wanted to write a validator for Yahoo's Media vocabulary, I can't just
use the same URL that I define in xmlns:ymedia. In other words, the
document at the Yahoo namespace URL:

http://search.yahoo.com/searchmonkey/media

is not machine readable. As an aside, there should really be a '#' at
the end of that URL, otherwise this:

media:Article

will be expanded to this:

http://search.yahoo.com/searchmonkey/mediaArticle

which is not dereference-able.
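
The expansion is plain string concatenation, which a short Python sketch
makes easy to see. The function name and the two prefix tables are
illustrative only; the namespace URL is the one quoted above, with and
without the trailing '#'.

```python
def expand_curie(curie, prefixes):
    """Expand prefix:reference by concatenating the namespace and the reference."""
    prefix, reference = curie.split(":", 1)
    return prefixes[prefix] + reference


# The namespace URL as quoted above, and the same URL with a trailing '#'.
without_hash = {"media": "http://search.yahoo.com/searchmonkey/media"}
with_hash = {"media": "http://search.yahoo.com/searchmonkey/media#"}

broken = expand_curie("media:Article", without_hash)
fixed = expand_curie("media:Article", with_hash)
```

Here broken ends in "mediaArticle", while fixed ends in "media#Article",
which at least points back at the vocabulary document.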

>> Hmm, maybe... I thought the general sense on the web was that schema
>> versioning was a bad idea and should not be done. If you really need to
>> shift versions, you can always point people at a new URL and clearly
>> mark the old URL as deprecated, as the Dublin Core folks did.
>>   
> I'm not sure if the world is ready for that...

What do you mean? Why does the world need to be ready for this? If we're
done with a vocabulary and there is a better way forward, we'd start
using the new vocabulary, wouldn't we? Why shouldn't we just use a new
URL for that new vocabulary?

>> Hope to keep working through the issues and providing feedback. Thanks
>> for replying and reading through this rather long set of thoughts :)
>>
> I'm happy to respond. We should probably try to avoid using this mailing
> list for discussions about specific vocabularies. In particular, based
> on evidence all over the Web, discussions about media vocabularies tend to
> be lengthy and most likely uninteresting for anyone on this list who is
> not working with media content.

Agreed. I won't follow up too much on this - I tried to keep the
comments so that they would address general vocabulary design and what
people should be aware of when creating and consuming somebody else's
vocabulary. Sometimes a public critique, as long as it is constructive,
helps others learn about the nuances of a particular technology such as
RDFa.

That, or this discussion just drove 100+ list members away from RDFa :)

-- manu

-- 
Manu Sporny
President/CEO - Digital Bazaar, Inc.
blog: Absorbing Costs Considered Harmful
http://blog.digitalbazaar.com/2009/02/27/absorbing-costs-harmful
Received on Tuesday, 24 March 2009 02:24:41 UTC