Describing tables with schema.org

Last updated: 8/05/2013

Overview & motivation

Many useful data exist in the form of tables, either inside HTML documents, or as standalone files in various formats (CSV, XLS, ODF), or in relational database management systems. While tables are a natural representation of data for many applications, they do not provide the necessary semantics to enable search engines to effectively index, and surface to users the information they contain.

The goal of this proposal is to help bridge between data tables and triples, by providing a simple mechanism to describe data tables so that their contents can be understood in terms of entities and properties.

This work is similar in spirit to the R2RML W3C recommendation. The main differences are that 1) this proposal is oriented towards the needs and expertise of mainstream schema.org publishers and 2) that it is not aimed at relational databases, but focuses on annotating tables that are available on the Web, and tries to make the mark-up as simple as possible for this use cases. It is likely that R2RML configurations can be automatically generated in many cases.

This is an early proposal. Comments are welcome.

Marking up HTML tables

Consider the list of paintings by Rembrandt on the following wikipedia page:

http://en.wikipedia.org/wiki/List_of_paintings_by_Rembrandt

Here are the first few rows (image URLs in the first column are rendered in the table):

Image

Title

Year

Technique

Dimensions

Gallery

The Operation (Touch)

1624/1625

Oil on panel

21.6 x 17.7 cm.

Private collection

The Spectacles-pedlar (Sight)

1624/1625

Oil on panel

21 x 17.8 cm.

Stedelijk Museum De Lakenhal, Leiden

The Three Singers (Hearing)

1624/1625

Oil on panel

21.6 x 17.8 cm.

Collection W. Baron van Dedem

We would like each row of the table to be described as a http://schema.org/Painting. Today, this has to be done by marking up each row individually. This process is repetitive and error-prone, and the resulting mark-up is verbose.

Instead, we propose marking up the columns in the header of the table to say which properties of Painting they correspond to.

Here is what it looks like in rdfa syntax:

<table typeof="Painting" vocab="http://schema.org/">

  <thead>

    <tr>

      <th property="image">Image</th>

      <th property="name">Title</th>

      <th property="dateCreated">Year</th>

      <th>Technique</th>

      <th>Dimensions</th>

      <th property="contentLocation">Gallery</th>

    </tr>

  </thead>

<tbody>...</tbody>

</table>

By putting marking on the table header, we're saying that each row of the table corresponds to an instance of the class specified in typeof (here, http://schema.org/Painting), and that the annotated columns contain the values of the properties of Painting they are marked up with.

For instance, the first row of the table will be interpreted as a Painting instance with the following property values:

This specific way of interpreting mark-up to apply to all the rows of a table is only intended for annotations on table elements and their headers (inside a thead or th elements). This differenciates it from regular mark-up, which describes individual entities and therefore is not meaningful on table and table header elements.

Note that the values of columns can be either text (e.g., the title or the year), or URLs (image), or both. These should be resolved based on the type of the target property. Such tables (e.g., in Wikipedia) will often vary unpredictably between hypertext and simpler atomic values. For now we do not address the question of normalizing such content.

Side comment: The example above does not map all the columns of the table, because the Painting schema does not define properties that correspond to Technique and Dimensions. The use of "contentLocation" for Gallery is also a bit of a stretch.

Constant properties

The table above omits a critical piece of information about the paintings: their painter! This can be represented by adding a "constant" property, using a meta tag:

<table typeof="Painting" vocab="http://schema.org/">

  <thead>

    <tr>

      <th property="image">Image</th>

      <th property="name">Title</th>

      <th property="dateCreated">Year</th>

      <th>Technique</th>

      <th>Dimensions</th>

      <th property="contentLocation">Gallery</th>

      <link property="author"  

            href="http://en.wikipedia.org/wiki/Rembrandt"/>

    </tr>

  </thead>

<tbody>...</tbody>

</table>

Marking up external tables (CSV, etc)

Many useful tables live outside HTML documents, in CSV files. Mark-up doesn't really work in this case. Instead we're proposing using JSON-LD to define table mappings.

Suppose our paintings table is instead available as a CSV file, with the following contents (dropping the image, Technique and Dimensions columns for readability)

Title,Year,Gallery

The Operation (Touch),1624/1625,Private collection

The Spectacles-pedlar (Sight),1624/1625,Stedelijk Museum De Lakenhal,Leiden

The Three Singers (Hearing),1624/1625,Collection W. Baron van Dedem

First, we need a mechanism to reference the columns of this table. For CSV tables on the Web, we can use URI fragment identifiers to address columns inside the table, either by their header or by their position.

For example, assuming the CSV table is available at the following URL:

http://wp.org/rembrandt-paintings.csv

The Year column can be referenced by its name:

http://wp.org/rembrandt-paintings.csv#col:Year

Or by its position

http://wp.org/rembrandt-paintings.csv#col:1

Mapping our CSV table and its columns to Painting and its properties can be done in JSON-LD as follows:

{

  "@context": "http://schema.org/",

  "@type": "Painting",

  "name": "{http://wp.org/rembrandt-paintings.csv#col:Title}",

  "dateCreated" : "{http://wp.org/rembrandt-paintings.csv#col:Year}",

  "contentLocation" : "{http://wp.org/rembrandt-paintings.csv#col:Gallery}",

  "author": "http://en.wikipedia.org/wiki/Rembrandt"

}

This definition above can either be embedded in an HTML document about the table, by placing it in a script tag with the type attribute set to application/ld+json (per JSON-LD spec), or provided as part of a dataset bundle (see below).

The meaning of this definition is very similiar to the mark-up version above. @type specifes the type of the elements that will be generated from the table (i.e., Painting). Properties of painting can reference columns of tables or constant strings. References to table columns are surrounded by curly braces (e.g., "{http://wp.org/rembrandt-paintings.csv#col:Title}", but constant values are not (e.g., "http://en.wikipedia.org/wiki/Rembrandt").

This syntax is based on the RFC for URI templates, and allows for richer patterns, as described below.

Patterns

In some cases, it is useful to combine values from multiple columns together, or with constant strings in order to create values. This can be achieved by mixing references (in curly braces) with text. For instance, we could generate a description of the painting by concatenating the Title, Year and Gallery columns, separated by text as follows:

{

  "@context": {

    "@vocab" : "http://schema.org/",

    "rp" : "http://wp.org/rembrandt-paintings.csv#"

  },

  "@type": "Painting",

  "name": "{rp:col:Title}",

  "dateCreated" : "{rp:col:Year}",

  "contentLocation" : "{rp:col:Gallery}",

  "author" : "http://en.wikipedia.org/wiki/Rembrandt",

  "description" : "{rp:col:Title}, {rp::col:Year}, {rp:col:Gallery}."

}

The description field will generate the following description for the first row in the table:

The Operation (Touch), 1624/1625, Private collection.

Note that this example used the JSON-LD prefix expansion mechanism to keep references to columns short, so given the context definition "rp" : "http://wp.org/rembrandt-paintings.csv#" the name property value "{rp:col:Title}" is equivalent to "{http://wp.org/rembrandt-paintings.csv#col:Title}".

It it tempting to make the pattern language richer, to support restructuring and transforming the values foud in tables. There is a trade-off between that expressiveness and the simplicity and ease-of-use of the language.

Bundling datasets

Many datasets consist of more than a single table. They may contain multiple CSV files, and the description of the entities they contain may be quite complex.

We propose extending the http://schema.org/Dataset class to act as a container for such datasets, by adding the following property to it:

For example, consider an art catalog dataset consisting of many CSV files. Each CSV file lists the paintings of one painter in the catalog.

We propose creating one Dataset that "bundles" all these files, and a json-ld file that contains the entity mappings for each table.

The dataset definition will look like:

<script type="application/ld+json">

{

  "@context": "http://schema.org",

  "@id": "paintings-dataset",

  "@type": "Dataset",

  "datasetElement": [

    {"@id": "rembrandt-paintings.csv",

     "@type": "DataDownload",

     "encodingFormat": "text/csv",

     "contentUrl": "http://wp.org/rembrandt-paintings.csv"

    },

    {"@id": "renoir-paintings.csv",

     "@type": "DataDownload",

     "encodingFormat": "text/csv",

     "contentUrl": "http://wp.org/renoir-paintings.csv"

    },

    …,

    {"@id": "paintings",

     "@type": "DataDownload",

     "encodingFormat": "application/json",

     "contentUrl": "http://wp.org/paintings.json"

    }

   ]

}

</script>

The CSV files for each painter look like the one in the example above. The paintings.json file contains a list of table mappings like the one above:

[{

  "@context": {

    "@vocab" : "http://schema.org/",

    "p1" : "http://wp.org/rembrandt-paintings.csv#"

  },

  "@type": "Painting",

  "name": "{p1:col:Title}",

  "dateCreated" : "{p1:col:Year}",

  "contentLocation" : "{p1:col:Gallery}",

  "author" : "http://en.wikipedia.org/wiki/Rembrandt"

},

{

  "@context": {

    "@vocab" : "http://schema.org/",

    "p2" : "http://wp.org/renoir-paintings.csv#"

  },

  "@type": "Painting",

  "name": "{p2:col:Title}",

  "dateCreated" : "{p2:col:Year}",

  "contentLocation" : "{p2:col:Gallery}",

  "author" : "http://en.wikipedia.org/wiki/Renoir"

},...]

Identifiers and references

In schema.org and other vocabularies, URLs are used as primary identifiers of entities. We illustrate through an example how they can easily be created and used for referencing entities.

We will use the patterns described above to generate URLs for entities from column values.

In the case or URLs, the pattern language we use here is actually very similar to URI templates.

Suppose we have two tables, one that lists countries and one that lists cities.

Countries

country-code

country-name

DE

Germany

FR

France

Cities

city-code

city-name

city-country

PAR

Paris

FR

BER

Berlin

DE

Let's assume these table exist as two CSV files at the following locations:

The country table can be described as follows:

{

  "@context": {

    "@vocab": "http://schema.org",

    "t1": "http://my.domain.org/countries.csv#"

  }

  "@type": "Country",

  "@id": "http://my.domain.org/country/{t1:col:country-code}",

  "name": "{t1:col:country-name}"

}]

The "@id" field is used in JSON-LD to specify the URL of an entity. We use a pattern to construct a URL for each country, based on its country code value

The city table description can create a reference to the country of each city through its url:

{

  "@context":{

    "@vocab": "http://schema.org",

    "t2": "http://my.domain.org/cities.csv#"

  }

  "@type": "City",

  "@id": "http://my.domain.org/city/{t2:col:city-code}",

  "name": "{t2:col:city-name}",

  "containedIn": "http://my.domain.org/country/{t2:col:city-country}"

},

Nested descriptions

The example above is a normalized representation: Because each country may contain multiple cities, cities and countries are represented as separate tables, and each table containing exactly one row per entity.

In reality, tables are often denormalized, with redundant information in them. For instance, there may be a single city / country table that looks like:

city-code

city-name

city-country

country-name

PAR

Paris

FR

France

LIL

Lille

FR

France

BER

Berlin

DE

Germany

MUN

Munich

DE

Germany

This table can be described as follows:

{

  "@context": {

    "@vocab": "http://schema.org",

    "t3": "http://my.domain.org/cities-counties.csv#"

  }

  "@type": "City",

  "@id": "http://my.domain.org/city/{t3:col:city-code}",

  "name": "{t3:col:city-name}",

  "containedIn": {

     "@type": "Country",

     "@id": "http://my.domain.org/country/{t3:col:city-country}",

     "name": "{t3:col:country-name}"

  }

}

We use JSON-LD's ability to embed structured values to nest the definition of countries (with their URLs and names) inside the definition of cities.

Each row of the table will still generate one city entity, however not every row will generate a different country. Countries with the same "@id" attribute will be merged into a single entity. This happens naturally when the triples are generated, because they will have the same subject.

Using JSON-LD with HTML tables

At the beginning of this document, we showed how to describe (simple) HTML tables with mark-up, and then introduced a JSON-LD syntax to mark-up external CSV tables. This approach can also be used for HTML tables, provided identifiers are defined for table columns.

For instance, the Country table above can be written in HTML as follows:

  <table>

    <tr>

      <th id="country-code">country</th>

      <th id="country-name">name</th>

    </tr>

    <tr>

      <td>FR</td><td>France</td>

    </tr>

    <tr>

      <td>DE</td><td>Germany</td>

    </tr>

  </table>

This table can be described by a JSON-LD fragment embedded in the same document:

<script type="application/ld+json">

{

  "@context": "http://schema.org",

  "@type": "Country",

  "@id": "http://example.com/country/{#country-code}",

  "name": "{#country-name}"

}

</script>

The only difference with the CSV table case is that columns are referenced via the HTML identifiers in the table definition, instead of URL fragments for CSV. Compared to the direct mark-up approach we described at the beginning of the document, annotation through references to columns with ids is more flexible.

Related Vocabularies