Re: Using schema.org Dataset metadata properties from Dan Brickley on 2014-09-16 (public-csv-wg@w3.org from September 2014)

From: Dan Brickley <danbri@google.com>
Date: Tue, 16 Sep 2014 12:31:54 +0100
To: Jeni Tennison <jeni@jenitennison.com>, Thomas Baker <tom@tombaker.org>
Cc: CSV on the Web Working Group <public-csv-wg@w3.org>
Message-ID: <CAK-qy=5L8qHL61vRq9C1G2unftD5KkBLfr-2tNeJTR-EgTR_UA@mail.gmail.com>
+Cc: Tom Baker from Dublin Core

On 13 September 2014 17:28, Jeni Tennison <jeni@jenitennison.com> wrote:
> Hi,
>
> In the current metadata document here:
>
>   http://w3c.github.io/csvw/metadata/#common-properties
>
> the spec adopts the list of Dublin Core properties for describing tables etc. As ISSUE 6 says, this might not be the right choice: there might be other standard vocabularies that should be used instead or as well.
>
> On the call this week, Dan suggested using schema.org instead, namely the properties on Dataset here:
>
>   http://schema.org/Dataset
>
> The properties there are informed by DCAT which itself was informed by Dublin Core.
>
> Any thoughts?

As a WG co-chair, as a loyal member of the DC community, and as
someone in large part responsible for schema.org day-to-day, I've
stepped back from this conversation so far.

I first met Tom Baker (cc:'d) at the 5th Dublin Core meeting, October
6-8, 1997 in Helsinki, Finland. That was the week that W3C announced
the first draft RDF spec, which also began the long and noble
tradition of DC being used in W3C RDF-related spec examples:

http://www.w3.org/TR/WD-rdf-syntax-971002/

<?namespace href="http://purl.org/DublinCore/RDFschema" as="DC"?>
<?namespace href="http://www.w3.org/schemas/rdf-schema" as="RDF"?>
<RDF:serialization>
  <RDF:assertions href="http://www.webnuts.net/Jan97.html">
    <DC:subject>
      <RDF:resource id="subject_001">
        <DC:scheme>Dewey Decimal Code</DC:scheme>
        <DC:lang>English</DC:lang>
        <RDF:PropValue>020 - Library Science</RDF:PropValue>
      </RDF:resource>
    </DC:subject>
  </RDF:assertions>
</RDF:serialization>

This even pre-dates modern XML namespaces :)

Some aspects of this thread feel like a conversation that hasn't
stopped since those days. "When do we describe something as a 'thing',
or with a string? Or a link?". Both DC and schema.org navigate those
tradeoffs, and in similar ways. They both try to gently encourage
thing-centric data modeling, while acknowledging that many important
repositories are full of information based on ambiguous or vague
strings. We take what we can get.

There are good cases for both Dublin Core and Schema.org in CSV
metadata files. If you have a repository/collection whose metadata
generally is DC-based, it is entirely reasonable to want to have
DC-based CSV metadata, and W3C CSVW metadata files absolutely should
support that use case. JSON-LD makes that relatively easy.
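
As a rough sketch of why JSON-LD makes this easy: the same description can be written with either vocabulary just by switching prefixed property names against a shared context. Everything specific below is an assumption for illustration (the `describe_table` helper, the example.org URL, and the choice of `dc:title`/`schema:name` and `dc:publisher`/`schema:publisher` as rough counterparts); none of it is taken from the CSVW draft.

```python
import json

# Prefix-to-namespace map used as the JSON-LD context.
# The prefix names are an illustrative choice, not mandated by any spec.
CONTEXT = {
    "dc": "http://purl.org/dc/terms/",
    "schema": "http://schema.org/",
}


def describe_table(url, title, publisher, vocab="dc"):
    """Hypothetical helper: describe a CSV file with DC or schema.org terms."""
    if vocab == "dc":
        props = {"dc:title": title, "dc:publisher": publisher}
    elif vocab == "schema":
        # Plain strings are used for simplicity; schema.org would prefer
        # an Organization or Person node for the publisher.
        props = {"schema:name": title, "schema:publisher": publisher}
    else:
        raise ValueError(f"unknown vocabulary: {vocab!r}")
    return {"@context": CONTEXT, "@id": url, **props}


dc_doc = describe_table(
    "http://example.org/data.csv", "Example data", "Example Org", vocab="dc"
)
schema_doc = describe_table(
    "http://example.org/data.csv", "Example data", "Example Org", vocab="schema"
)

print(json.dumps(dc_doc, indent=2))
print(json.dumps(schema_doc, indent=2))
```

A publisher whose repository is DC-based would emit the first form, and a schema.org-centric one the second, without the surrounding metadata structure changing.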

My personal feeling for how DC and Schema.org should best relate (also
see http://www.slideshare.net/danbri/what-is-left-to-do-dublin-core-2012-keynote
) is that schema.org is weakest on controlled values for properties,
and this is an area where the DC community (through connection to the
digital library / GLAM world) can excel. In terms of the actual
vocabulary terms (basic properties and types) their expressivity is
similar, except schema.org's is much larger (I think around 1200 terms
now, and growing). I don't expect schema.org as a project to add a lot
of controlled enumerations, whereas DC could easily find a role doing
just that e.g. SKOS thesauri, controlled terms for educational
technology publishing, etc. There is certainly room for both, even if
there are overlaps.

In terms of explicit mappings, we did have a DC/schema.org mapping
task force a couple of years ago, but we let it fizzle out without
finalizing its output. More recently the schema.org site codebase has
been open-sourced, posted on GitHub, and has grown some features that
make it worthwhile revisiting those mappings.

The master file defining schema.org is
https://github.com/rvguha/schemaorg/blob/master/data/schema.rdfa

Recently we have been publishing frozen snapshots of that with every
release, e.g. http://schema.org/release/20140912/20140912-v1.91.rdfa.html
although there is a need for more structure around those. Schema.org
is currently updated fairly often, see
http://schema.org/docs/releases.html for a release history or GitHub
for the full details.

Within those machine readable files there are already some basic
mappings to DC, e.g. Event:

 <div typeof="rdfs:Class" resource="http://schema.org/Event">
      <span class="h" property="rdfs:label">Event</span>
      <span property="rdfs:comment">An event happening at a certain
time and location, such as a concert, lecture, or festival. Ticketing
information may be added via the &#39;offers&#39; property. Repeated
events may be structured as separate Event objects.</span>
       <span>Subclass of: <a property="rdfs:subClassOf"
href="http://schema.org/Thing">Thing</a></span>
       <link property="owl:equivalentClass"
href="http://purl.org/dc/dcmitype/Event"/>
    </div>

... which in turn gets re-published in per-term pages like
http://schema.org/Event as follows:

<div id="mainContent" vocab="http://schema.org/" typeof="rdfs:Class"
resource="http://schema.org/Event">
  <link property="owl:equivalentClass"
href="http://purl.org/dc/dcmitype/Event"/>
</div>...

Now that this is possible we should go back and put in the rest of the
draft mappings.
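
For anyone wanting to check which mappings are already present, something like the following could pull the equivalence links out of markup shaped like the Event example above. It is a minimal sketch using only Python's standard library, not a real RDFa processor: it assumes each `owl:equivalentClass` link belongs to the most recently seen `resource` attribute, and the `EquivalenceExtractor` class and `SAMPLE` string are hypothetical names introduced here. Matching `owl:equivalentProperty` as well is also an assumption about how property mappings might be marked up.

```python
from html.parser import HTMLParser

# Properties treated as vocabulary mappings (the second is an assumption).
MAPPING_PROPERTIES = ("owl:equivalentClass", "owl:equivalentProperty")


class EquivalenceExtractor(HTMLParser):
    """Collect (subject, property, target) triples for equivalence links."""

    def __init__(self):
        super().__init__()
        self.current_resource = None  # last `resource` attribute seen
        self.mappings = []

    def handle_starttag(self, tag, attrs):
        # Self-closing tags such as <link .../> also arrive here, because
        # HTMLParser's default handle_startendtag delegates to this method.
        attrs = dict(attrs)
        if "resource" in attrs:
            self.current_resource = attrs["resource"]
        if attrs.get("property") in MAPPING_PROPERTIES and "href" in attrs:
            self.mappings.append(
                (self.current_resource, attrs["property"], attrs["href"])
            )


# Abbreviated copy of the Event definition quoted above.
SAMPLE = """
<div typeof="rdfs:Class" resource="http://schema.org/Event">
  <span class="h" property="rdfs:label">Event</span>
  <span>Subclass of: <a property="rdfs:subClassOf"
    href="http://schema.org/Thing">Thing</a></span>
  <link property="owl:equivalentClass"
    href="http://purl.org/dc/dcmitype/Event"/>
</div>
"""

extractor = EquivalenceExtractor()
extractor.feed(SAMPLE)
for subject, prop, target in extractor.mappings:
    print(subject, prop, target)
```

Run over the full schema.rdfa file, the same approach would list the terms that already carry a DC mapping, which gives a starting point for seeing what the draft mappings would still need to add.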

As a long time Dublin Core person I don't want to advocate against DC
here, but I do think there are advantages to using schema.org:

1. When we map the actual payloads of CSV data into triples, schema.org's
added depth will make it more useful than DC. So schema.org will
re-appear within mappings/templates anyway, whether normatively
encouraged or not.

2. It has the attention of publishers and consumers at large scale.
Schema.org went from nothing to being on 7+ million domains in 3
years, and is still being actively evolved. It is not a classic formal
standards activity, but it builds on standards and conducts most of its
discussion/collaboration in public, on GitHub and the W3C
public-vocabs list.

3. It is relatively easy to get it extended. No promises, but if there
are unaddressed use cases, at this moment a change request to
schema.org is much more likely to result in changes / improvements
than a change request to Dublin Core.

The downside of this is that the thing is constantly evolving, which
goes against some W3C instincts w.r.t. making normative references.
And it is under the stewardship of the 4 sponsor search engines
(Yandex, Yahoo, Bing, Google), which is not everyone's preferred
model.

I would be happy with either DC or schema.org as the default for CSVW
metadata, but would prefer either way that we make sure publishers can
choose which they prefer since CSVs are often a smaller part of a
larger story. I wouldn't want to hold back DC-centric systems from
using DC, or schema.org-centric systems from using schema.org.

Beyond that I suggest we collect concrete metadata use cases, see
what in practice is missing from either DC or schema.org, and try to
get the missing pieces added to one or the other.

I've copied Tom as we were chatting earlier and he may have thoughts
to add. As an RDFish person I'm just happy that both vocabs share a
common underlying data model at least (and overlapping
communities...).

cheers,

Dan
Received on Tuesday, 16 September 2014 11:32:26 UTC