RE: Library of Congress Subject Headings as SKOS Linked Data

> From: public-swd-wg-request@w3.org [mailto:public-swd-wg-
> request@w3.org] On Behalf Of Ed Summers
> Sent: Monday, June 09, 2008 9:54 AM
> To: SWD Working SWD; public-lod@w3.org
> Subject: Library of Congress Subject Headings as SKOS Linked Data
> 
> 
> I'd like to announce an experimental linked-data, SKOS representation
> of the Library of Congress Subject Headings (LCSH) [1] ... and also
> ask for some help.
> 
> Each concept is identified with a URI like:
> 
>   http://lcsh.info/sh95000541#concept
> 
> When responding to requests for concept URIs, the server content
> negotiates to determine which representation of the concept to return:
> 
>   - application/xhtml+xml
>   - application/json
>   - text/n3
>   - application/rdf+xml

You might find what we have done for our Terminology Services project applicable or adaptable to your situation.  One aspect of the service is that a resource can have multiple representations, e.g., HTML, JSON, RDF or XML.  Each representation has its own URI, following the TAG finding in [1]:

http://example.org/content/resource.html 
http://example.org/content/resource.json 
http://example.org/content/resource.rdf 
http://example.org/content/resource.xml 

[1] <http://www.w3.org/2001/tag/doc/alternatives-discovery.html>

The service also uses a generic URI for the resource that allows content negotiation and auto-discovery of the multiple representations.  For example, the generic request:

GET http://example.org/content/resource
Accept: application/xhtml+xml

returns a 200 OK HTTP status with the same HTML representation returned by the URI:

http://example.org/content/resource.html 

When no Accept header is given, or the Accept header is */*, the service returns a 300 Multiple Choices HTTP status along with a stripped RDF-XML entity that describes the representations available for the resource.  We chose a stripped RDF-XML document because 1) we could generate a concrete XML schema that the server and clients could validate against or process with standard XML tools like XSLT or XQuery, and 2) the document is also usable by RDF tools in a semantic context.
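The decision described above can be sketched roughly as follows.  This is only an illustration, not our service code: the REPRESENTATIONS table, the function names, and the 406 fallback for unsupported types are assumptions, and the Accept parsing is deliberately simplified (q-values only, no specificity rules).

```python
# Hypothetical table: one media type per representation URI.
REPRESENTATIONS = {
    "application/xhtml+xml": "http://example.org/content/resource.html",
    "application/json": "http://example.org/content/resource.json",
    "application/rdf+xml": "http://example.org/content/resource.rdf",
    "application/xml": "http://example.org/content/resource.xml",
}


def parse_accept(header):
    """Return the media types in an Accept header, most preferred first."""
    entries = []
    for position, part in enumerate(header.split(",")):
        fields = [field.strip() for field in part.split(";")]
        media_type = fields[0].lower()
        if not media_type:
            continue
        q = 1.0
        for param in fields[1:]:
            name, _, value = param.partition("=")
            if name.strip().lower() == "q":
                try:
                    q = float(value)
                except ValueError:
                    q = 0.0
        if q > 0:
            # Sort by descending q, then by position in the header.
            entries.append((-q, position, media_type))
    return [media_type for _, _, media_type in sorted(entries)]


def negotiate(accept_header):
    """Return (status, uri) for a request to the generic resource URI."""
    if accept_header is None or not accept_header.strip():
        return 300, None  # no preference: describe all representations
    for media_type in parse_accept(accept_header):
        if media_type in REPRESENTATIONS:
            return 200, REPRESENTATIONS[media_type]
        if media_type == "*/*":
            return 300, None  # wildcard reached before any known type
    return 406, None  # assumption: nothing acceptable is available
```

With this sketch, a client that lists a supported type ahead of */* gets that representation, while a client whose only usable entry is */* gets the description document.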

The stripped RDF-XML document is shown at the end of this message.  It is probably generic enough that we might officially publish the schema if there is community interest.  I would welcome thoughts on whether we should do so, as well as any other comments or suggestions on the schema.

Here is a basic description of the schema:

1) The document element is the Resource element.
2) The document element specifies an rdf:about attribute containing the relative URI for the generic resource.
3) The document element specifies an xml:base attribute containing the base URI for the generic resource. Note that you can use absolute URIs, but xml:base reduces redundancy in the URIs given for each representation, see (6) below.
4) The document element contains a Representations element that is a collection of all the representations for the resource.
5) The Representations element contains a Representation element for each representation of the resource.
6) Each Representation element specifies an rdf:about attribute containing the relative URI for the resource representation.  See note in (3).
7) Each Representation element contains a generic type describing the type of resource.
8) Each Representation element contains one or more languages that the resource is available in.
9) Each Representation element contains one or more formats that the resource is available in.  Note that format covers both the media types and the character sets the resource is available in; the rdf:datatype attribute indicates which of the two a given value is.
10) Each Representation element contains references to one or more standards that the representation conforms to.
11) Each Representation element asserts that it is a format of the generic resource.

This document is used to drive validation of incoming requests to the generic resource URI.  For example:

1) dct:language drives what is acceptable for the Accept-Language HTTP header.
2) dct:format drives what is acceptable for the Accept HTTP header. Preferred content type is specified first.
3) dct:format drives what is acceptable for the Accept-Charset HTTP header.
4) dct:conformsTo drives schema validation or implemented features on either the server or client side.
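As a rough illustration of how a client or server might pull these values out of the description document, here is a minimal sketch using Python's standard xml.etree.ElementTree.  The function name accepted_values and the shape of the returned dictionary are assumptions; the embedded document is an abbreviated copy of the one at the end of this message.

```python
import xml.etree.ElementTree as ET
from urllib.parse import urljoin

NS = "urn:uuid:D30A7E67-31BF-40A3-9956-9668674FCD84"
DCT = "http://purl.org/dc/terms/"
RDF = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"
XML = "http://www.w3.org/XML/1998/namespace"
IMT = DCT + "IMT"
CHARSETS = "http://www.iana.org/assignments/character-sets#"

# Abbreviated copy of the description document (two representations only).
DOCUMENT = """<Resource rdf:about="resource"
  xml:base="http://example.org/content/"
  xmlns="urn:uuid:D30A7E67-31BF-40A3-9956-9668674FCD84"
  xmlns:dct="http://purl.org/dc/terms/"
  xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#">
  <Representations rdf:parseType="Collection">
    <Representation rdf:about="resource.html">
      <dct:language rdf:datatype="http://purl.org/dc/terms/RFC1766">en</dct:language>
      <dct:format rdf:datatype="http://purl.org/dc/terms/IMT">application/xhtml+xml</dct:format>
      <dct:format rdf:datatype="http://purl.org/dc/terms/IMT">text/html</dct:format>
      <dct:format rdf:datatype="http://www.iana.org/assignments/character-sets#">UTF-8</dct:format>
    </Representation>
    <Representation rdf:about="resource.json">
      <dct:language rdf:datatype="http://purl.org/dc/terms/RFC1766">en</dct:language>
      <dct:format rdf:datatype="http://purl.org/dc/terms/IMT">application/json</dct:format>
      <dct:format rdf:datatype="http://www.iana.org/assignments/character-sets#">UTF-8</dct:format>
    </Representation>
  </Representations>
</Resource>"""


def accepted_values(xml_text):
    """Map each representation URI to the header values it supports."""
    root = ET.fromstring(xml_text)
    base = root.get("{%s}base" % XML, "")
    result = {}
    for rep in root.iter("{%s}Representation" % NS):
        # Resolve the relative rdf:about against xml:base.
        uri = urljoin(base, rep.get("{%s}about" % RDF))
        entry = {"accept": [], "accept_charset": [], "accept_language": []}
        for language in rep.findall("{%s}language" % DCT):
            entry["accept_language"].append(language.text)
        for fmt in rep.findall("{%s}format" % DCT):
            # rdf:datatype tells media types and character sets apart.
            datatype = fmt.get("{%s}datatype" % RDF)
            if datatype == IMT:
                entry["accept"].append(fmt.text)
            elif datatype == CHARSETS:
                entry["accept_charset"].append(fmt.text)
        result[uri] = entry
    return result
```

Document order is preserved, so the first entry in each "accept" list is the preferred content type for that representation.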

It is possible for multiple representations to share the same content types.  For example, you might have multiple XML representations using different schemas, but the content types for them are always application/xml and text/xml.  When a client sends an Accept HTTP header with the content type application/xml, the service returns a 300 Multiple Choices HTTP status listing just the representations that have a dct:format with a content type of application/xml.  So that a client can retrieve a representation in a specific XML schema, the service assigns a local content type to each XML representation, e.g., application/x-example.org-schema1 for XML schema 1 and application/x-example.org-schema2 for XML schema 2.  The client can then request exactly the schema it requires.
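A minimal sketch of that selection rule follows.  The representation URIs in the table and the 406 fallback are hypothetical; only the two local content types come from the example above.

```python
# Hypothetical representations: two XML schemas that share the generic
# XML content types, plus a JSON representation.
REPRESENTATIONS = {
    "resource.schema1.xml": [
        "application/x-example.org-schema1",
        "application/xml",
        "text/xml",
    ],
    "resource.schema2.xml": [
        "application/x-example.org-schema2",
        "application/xml",
        "text/xml",
    ],
    "resource.json": ["application/json"],
}


def select(media_type):
    """Return (status, matching URIs) for a single requested media type."""
    matches = [
        uri
        for uri, formats in REPRESENTATIONS.items()
        if media_type in formats
    ]
    if len(matches) == 1:
        return 200, matches  # unambiguous: serve this representation
    if matches:
        return 300, matches  # ambiguous: describe only the candidates
    return 406, []  # assumption: no representation has this type
```

Here select("application/xml") yields a 300 with both XML representations, while select("application/x-example.org-schema2") resolves to the single schema 2 representation.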

You can verify that this is in fact a valid RDF document by using the RDF validator's *extended interface* where you *must* check the Advanced Feature "RDF is NOT enclosed in <RDF>...</RDF> tags".

<http://www.w3.org/rdf/validator>


<?xml version="1.0"?>
<Resource
  rdf:about="resource"
  xml:base="http://example.org/content/"
  xmlns="urn:uuid:D30A7E67-31BF-40A3-9956-9668674FCD84"
  xmlns:dct="http://purl.org/dc/terms/"
  xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
>
  <Representations rdf:parseType="Collection">

    <Representation rdf:about="resource.html">
      <dct:type rdf:datatype="http://www.w3.org/2001/XMLSchema#token">XHTML document</dct:type>
      <dct:language rdf:datatype="http://purl.org/dc/terms/RFC1766">en</dct:language>
      <dct:format rdf:datatype="http://purl.org/dc/terms/IMT">application/xhtml+xml</dct:format>
      <dct:format rdf:datatype="http://purl.org/dc/terms/IMT">text/html</dct:format>
      <dct:format rdf:datatype="http://www.iana.org/assignments/character-sets#">UTF-8</dct:format>
      <dct:conformsTo rdf:resource="http://www.w3.org/TR/2002/REC-xhtml1-20020801"/>
      <dct:isFormatOf rdf:resource="resource"/>
    </Representation>

    <Representation rdf:about="resource.json">
      <dct:type rdf:datatype="http://www.w3.org/2001/XMLSchema#token">JSON document</dct:type>
      <dct:language rdf:datatype="http://purl.org/dc/terms/RFC1766">en</dct:language>
      <dct:format rdf:datatype="http://purl.org/dc/terms/IMT">application/json</dct:format>
      <dct:format rdf:datatype="http://purl.org/dc/terms/IMT">application/ecmascript</dct:format>
      <dct:format rdf:datatype="http://purl.org/dc/terms/IMT">application/javascript</dct:format>
      <dct:format rdf:datatype="http://purl.org/dc/terms/IMT">text/ecmascript</dct:format>
      <dct:format rdf:datatype="http://purl.org/dc/terms/IMT">text/javascript</dct:format>
      <dct:format rdf:datatype="http://www.iana.org/assignments/character-sets#">UTF-8</dct:format>
      <dct:conformsTo rdf:resource="http://www.ietf.org/rfc/rfc4627"/>
      <dct:isFormatOf rdf:resource="resource"/>
    </Representation>

    <Representation rdf:about="resource.rdf">
      <dct:type rdf:datatype="http://www.w3.org/2001/XMLSchema#token">RDF document</dct:type>
      <dct:language rdf:datatype="http://purl.org/dc/terms/RFC1766">en</dct:language>
      <dct:format rdf:datatype="http://purl.org/dc/terms/IMT">application/rdf+xml</dct:format>
      <dct:format rdf:datatype="http://www.iana.org/assignments/character-sets#">UTF-8</dct:format>
      <dct:conformsTo rdf:resource="http://www.w3.org/TR/2004/REC-rdf-syntax-grammar-20040210"/>
      <dct:isFormatOf rdf:resource="resource"/>
    </Representation>

    <Representation rdf:about="resource.xml">
      <dct:type rdf:datatype="http://www.w3.org/2001/XMLSchema#token">XML document</dct:type>
      <dct:language rdf:datatype="http://purl.org/dc/terms/RFC1766">en</dct:language>
      <dct:format rdf:datatype="http://purl.org/dc/terms/IMT">application/xml</dct:format>
      <dct:format rdf:datatype="http://purl.org/dc/terms/IMT">text/xml</dct:format>
      <dct:format rdf:datatype="http://www.iana.org/assignments/character-sets#">UTF-8</dct:format>
      <dct:conformsTo rdf:resource="http://www.w3.org/TR/2006/REC-xml-20060816"/>
      <dct:isFormatOf rdf:resource="resource"/>
    </Representation>

  </Representations>
</Resource>

Content negotiation does have some problems, especially when the user agent (UA) is a browser.  We found this out with our Terminology Services project, where we also serve LCSH, in MARC-XML, XHTML, JSON, SKOS, and Zthes.  Browser requests from Firefox carry an Accept HTTP header that prefers XHTML (application/xhtml+xml), so accessing a generic URI from Firefox will always retrieve the XHTML representation.  Internet Explorer, on the other hand, doesn't prefer XHTML over */* and is able to retrieve the stripped RDF-XML description of the representations from our service.  I'm not saying Firefox is incorrect, but it's a difference that has to be explained to people accessing or testing our service with a browser.

Which brings me to another issue.  MARC (ISO 2709) has a registered media type of application/marc [2], but neither MARC-XML nor SKOS has a registered media type.  It would be preferable for both to have registered media types so you can distinguish MARC-XML from just any XML (application/xml) and SKOS from any RDF (application/rdf+xml).  In our Terminology Services project we made the distinction by assigning local media types for MARC-XML (application/x-oclc-tspilot.marcxml) and SKOS (application/x-oclc-tspilot.skos), but we don't feel that this is desirable from a long-term perspective.

[2] <http://www.loc.gov/marc/marcimt.html>


Andy.

Received on Monday, 9 June 2008 15:51:16 UTC