Reusing Spreadsheets

GRDDL can be useful for integrating data from general-purpose XML dialects (that is, dialects that are not domain-oriented) as well as highly specialized dialects such as in the health care example, covered previously. One general-purpose XML dialect that may contain a trove of accumulated information is that of spreadsheets. Reusing information stored in spreadsheets may be quite valuable, and GRDDL provides one mechanism for accessing this information. In this example, we will specifically consider the problem of gleaning information from Microsoft® Excel spreadsheets, although other spreadsheet-like XML dialects would be able to take advantage of the same basic mechanism.

Jane serves as the secretary for a small group with her two friends, David and Robin, that meets once a month. She tracks the attendance at these meetings using a simple Excel spreadsheet, and she starts a new spreadsheet each year. She wants the members of this group to be able to query these accumulated statistics freely, and she recognizes that RDF would support this kind of merging and querying functionality. She decides to use GRDDL to allow any of the members of the group to glean RDF from any of these attendance records and query the data along with any other RDF that may be available.

Jane intends to use a GRDDL transformation called xcel-mf2rdf.xsl, which requires the Excel spreadsheet to conform to a particular microformat profile. She first must identify which cells in her spreadsheet are data cells. In the case of an attendance spreadsheet, the data cells are the attendance indicators, and she identifies these cells by giving them the name "Data". She must also identify the header cells. In this case, the header cells are the cells containing names and dates; Jane identifies these cells by giving them the name "Header". Next, Jane gives each data and header cell an additional name, which serves as the local name of the property for that cell. She names the date cells "date", the member name cells "name", and the attendance cells "present". Finally, Jane must set two custom properties globally on the spreadsheet. The first property is called "profile", and this particular microformat profile has profile URI http://www.mnot.net/2005/08/xcel-mf. The second property is called "namespace", and provides the namespace to be used for RDF properties in the GRDDL results; Jane chooses the namespace URI ahttp://example.org/attendance/.

Resulting Excel spreadsheet screenshot

Attendance spreadsheet with header cells selected

Since GRDDL operates on XML documents, she saves her Excel files using the XML dialect that Excel provides. After saving them as XML, she adds the reference to this transformation to the root element of each attendance document. Following the directives of the Excel microformat profile, and including the appropriate GRDDL reference, this is a slice of the resulting spreadsheet document:

<?xml version="1.0"?>
<?mso-application progid="Excel.Sheet"?>
<Workbook
  xmlns="urn:schemas-microsoft-com:office:spreadsheet"
  xmlns:o="urn:schemas-microsoft-com:office:office"
  xmlns:x="urn:schemas-microsoft-com:office:excel"
  xmlns:dt="uuid:C2F41010-65B3-11d1-A29F-00AA00C14882"
  xmlns:ss="urn:schemas-microsoft-com:office:spreadsheet"
  xmlns:html="http://www.w3.org/TR/REC-html40"
  xmlns:grddl="http://www.w3.org/2003/g/data-view#"
  grddl:transformation="xcel-mf2rdf.xsl">
 <!-- ... -->
 <CustomDocumentProperties xmlns="urn:schemas-microsoft-com:office:office">
  <profile dt:dt="string">http://www.mnot.net/2005/08/xcel-mf</profile>
  <namespace dt:dt="string">ahttp://example.org/attendance/</namespace>
 </CustomDocumentProperties>
 <!-- ... -->
 <Worksheet ss:Name="Sheet1">
  <Table>
   <!-- ... -->
   <Row ss:Index="3">
    <!-- ... -->
    <Cell ss:StyleID="s26"><Data ss:Type="String">2006-04</Data><NamedCell
      ss:Name="Header"/><NamedCell ss:Name="date"/></Cell>
    <!-- ... -->
   </Row>
   <!-- ... -->
   <Row>
    <Cell ss:StyleID="s25"><Data ss:Type="String">Robin</Data><NamedCell
      ss:Name="name"/><NamedCell ss:Name="Header"/></Cell>
    <!-- ... -->
    <Cell><Data ss:Type="String">n</Data><NamedCell ss:Name="present"/><NamedCell
      ss:Name="Data"/></Cell>
    <!-- ... -->
   </Row>
   <!-- ... -->
  </Table>
  <!-- ... -->
 </Worksheet>
</Workbook>

When processed by a GRDDL-aware agent, a document such as this will correspond to RDF triples such as the following:

<?xml version="1.0" encoding="UTF-8"?>
<rdf:RDF
  xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
  xmlns="ahttp://example.org/attendance/">
 <!-- ... -->
 <rdf:Description>
  <name>Robin</name>
  <date>2006-04</date>
  <present>n</present>
 </rdf:Description>
 <!-- ... -->
</rdf:RDF>

Jane and the other members of the group can now use this data in a variety of situations. For example, suppose there exist other records of decisions that were made at these meetings. A record of one such decision might look like the following:

<?xml version="1.0" encoding="UTF-8"?>
<rdf:RDF
  xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
  xmlns="ahttp://example.org/events/">
 <rdf:Description>
  <date>2006-03</date>
  <label>choose new meeting location</label>
 </rdf:Description>
</rdf:RDF>

Merging these triples with the GRDDL results from the attendance record spreadsheets, Jane can now ask questions such as "who attended the meeting at which we decided to choose the new meeting location?" In SPARQL:

PREFIX att: <ahttp://example.org/attendance/>
PREFIX ev: <ahttp://example.org/events/>
SELECT ?name WHERE
{
  ?event ev:label "choose new meeting location" .
  ?event ev:date ?date .
  ?attendance att:date ?date .
  ?attendance att:name ?name .
  ?attendance att:present "y" .
}

Which would give the following results:

<?xml version="1.0" encoding="UTF-8"?>
<sparql xmlns="http://www.w3.org/2005/sparql-results#">
  <head>
    <variable name="event"/>
    <variable name="attendance"/>
    <variable name="date"/>
    <variable name="name"/>
  </head>
  <results distinct="false" ordered="false">
    <result>
      <binding name="name">
        <literal>Jane</literal>
      </binding>
    </result>
    <result>
      <binding name="name">
        <literal>David</literal>
      </binding>
    </result>
  </results>
</sparql>

This indicates that Jane and David were present at the meeting where that decision was made.

This microformat for Excel is defined by Mark Nottingham in "Adding Semantics to Excel with Microformats and GRDDL". The version of the transformation script here is based on Mark's original, although it has a few significant changes. Also, note that there certainly exist other ways to think of spreadsheet data in terms of RDF. Developing a GRDDL solution for those approaches is left as an exercise for the reader.