Last updated: 08/15/2013
Many useful data exist in the form of tables, either inside HTML documents, or as standalone files in various formats (CSV, XLS, ODF), or in relational database management systems. While tables are a natural representation of data for many applications, they do not provide the necessary semantics to enable search engines to effectively index, and surface to users the information they contain.
The goal of this proposal is to help bridge between data tables and triples, by providing a simple mechanism to describe data tables so that their contents can be understood in terms of entities and properties.
This work is similar in spirit to the R2RML W3C recommendation. The main differences are that 1) this proposal is oriented towards the needs and expertise of mainstream schema.org publishers and 2) that it is not aimed at relational databases, but focuses on annotating tables that are available on the Web, and tries to make the mark-up as simple as possible for this use cases. It is likely that R2RML configurations can be automatically generated in many cases.
This is an early proposal. Comments are welcome.
Consider the list of paintings by Rembrandt on the following wikipedia page:
http://en.wikipedia.org/wiki/List_of_paintings_by_Rembrandt
Here are the first few rows (image URLs in the first column are rendered in the table):
Image | Title | Year | Technique | Dimensions | Gallery |
The Operation (Touch) | 1624/1625 | Oil on panel | 21.6 x 17.7 cm. | Private collection | |
The Spectacles-pedlar (Sight) | 1624/1625 | Oil on panel | 21 x 17.8 cm. | Stedelijk Museum De Lakenhal, Leiden | |
The Three Singers (Hearing) | 1624/1625 | Oil on panel | 21.6 x 17.8 cm. | Collection W. Baron van Dedem |
We would like each row of the table to be described as a http://schema.org/Painting. Today, this has to be done by marking up each row individually. This process is repetitive and error-prone, and the resulting mark-up is verbose.
Instead, we propose marking up the columns in the header of the table to say which properties of Painting they correspond to.
Here is what it looks like in rdfa syntax:
<table typeof="SetOf/Painting" vocab="http://schema.org/">
<thead>
<tr>
<th property="image">Image</th>
<th property="name">Title</th>
<th property="dateCreated">Year</th>
<th>Technique</th>
<th>Dimensions</th>
<th property="contentLocation">Gallery</th>
</tr>
</thead>
<tbody>...</tbody>
</table>
By putting this marking on the table header, we're saying that each row of the table corresponds to an instance of Painting, and that the annotated columns contain the values of the properties of Painting they are marked up with.
For instance, the first row of the table will be interpreted as a Painting instance with the following property values:
In order to distinguish this mark-up from regular mark-up, we insert the string SetOf/ in the type declaration. The type of the elements is determined by removing SetOf/ from the type. The type declaration corresponds to a full URL where SetOf is inserted at the beginning of the path fragment. The typeof declaration in the example above can be expanded to typeof="http://schema.org/SetOf/Painting", which means that the rows of the table are instances of http://schema.org/Painting. Note that this scheme is not limited to schema.org vocabulary, but can be used with any vocabulary by inserting SetOf at the beginning of the paths of URLs that identify types/classes.
Note that the values of columns can be either text (e.g., the title or the year), or URLs (image), or both. These should be resolved based on the type of the target property. Such tables (e.g., in Wikipedia) will often vary unpredictably between hypertext and simpler atomic values. For now we do not address the question of normalizing such content.
Side comment: The example above does not map all the columns of the table, because the Painting schema does not define properties that correspond to Technique and Dimensions. The use of "contentLocation" for Gallery is also a bit of a stretch.
The table above omits a critical piece of information about the paintings: their painter! This can be represented by adding a "constant" property, using a meta tag:
<table typeof="SetOf/Painting" vocab="http://schema.org/">
<thead>
<tr>
<th property="image">Image</th>
<th property="name">Title</th>
<th property="dateCreated">Year</th>
<th>Technique</th>
<th>Dimensions</th>
<th property="contentLocation">Gallery</th>
<link property="author"
href="http://en.wikipedia.org/wiki/Rembrandt"/>
</tr>
</thead>
<tbody>...</tbody>
</table>
Marking up external tables (CSV, etc)
Many useful tables live outside HTML documents, in CSV files. Mark-up doesn't really work in this case. Instead we're proposing using JSON-LD to define table mappings.
Suppose our paintings table is instead available as a CSV file, with the following contents (dropping the image, Technique and Dimensions columns for readability)
Title,Year,Gallery
The Operation (Touch),1624/1625,Private collection
The Spectacles-pedlar (Sight),1624/1625,Stedelijk Museum De Lakenhal,Leiden
The Three Singers (Hearing),1624/1625,Collection W. Baron van Dedem
First, we need a mechanism to reference the columns of this table. For CSV tables on the Web, we can use URI fragment identifiers to address columns inside the table, either by their header or by their position.
For example, assuming the CSV table is available at the following URL:
http://wp.org/rembrandt-paintings.csv
The Year column can be referenced by its name:
http://wp.org/rembrandt-paintings.csv#col:Year
Or by its position
http://wp.org/rembrandt-paintings.csv#col:1
Mapping our CSV table and its columns to Painting and its properties can be done in JSON-LD as follows:
{
"@context": "http://schema.org/",
"@type": "SetOf/Painting",
"name": "{http://wp.org/rembrandt-paintings.csv#col:Title}",
"dateCreated" : "{http://wp.org/rembrandt-paintings.csv#col:Year}",
"contentLocation" : "{http://wp.org/rembrandt-paintings.csv#col:Gallery}",
"author": "http://en.wikipedia.org/wiki/Rembrandt"
}
This definition above can either be embedded in an HTML document about the table, by placing it in a script tag with the type attribute set to application/ld+json (per JSON-LD spec), or provided as part of a dataset bundle (see below).
The meaning of this definition is very similiar to the mark-up version above. @type specifes the type of the elements that will be generated from the table (i.e., Painting). Properties of painting can reference columns of tables or constant strings. References to table columns are surrounded by curly braces (e.g., "{http://wp.org/rembrandt-paintings.csv#col:Title}", but constant values are not (e.g., "http://en.wikipedia.org/wiki/Rembrandt").
This syntax is based on the RFC for URI templates, and allows for richer patterns, as described below.
In some cases, it is useful to combine values from multiple columns together, or with constant strings in order to create values. This can be achieved by mixing references (in curly braces) with text. For instance, we could generate a description of the painting by concatenating the Title, Year and Gallery columns, separated by text as follows:
{
"@context": {
"@vocab" : "http://schema.org/",
"rp" : "http://wp.org/rembrandt-paintings.csv#"
},
"@type": "SetOf/Painting",
"name": "{rp:col:Title}",
"dateCreated" : "{rp:col:Year}",
"contentLocation" : "{rp:col:Gallery}",
"author" : "http://en.wikipedia.org/wiki/Rembrandt",
"description" : "{rp:col:Title}, {rp::col:Year}, {rp:col:Gallery}."
}
The description field will generate the following description for the first row in the table:
The Operation (Touch), 1624/1625, Private collection.
Note that this example used the JSON-LD prefix expansion mechanism to keep references to columns short, so given the context definition "rp" : "http://wp.org/rembrandt-paintings.csv#" the name property value "{rp:col:Title}" is equivalent to "{http://wp.org/rembrandt-paintings.csv#col:Title}".
It it tempting to make the pattern language richer, to support restructuring and transforming the values foud in tables. There is a trade-off between that expressiveness and the simplicity and ease-of-use of the language.
Many datasets consist of more than a single table. They may contain multiple CSV files, and the description of the entities they contain may be quite complex.
We propose extending the http://schema.org/Dataset class to act as a container for such datasets, by adding the following property to it:
For example, consider an art catalog dataset consisting of many CSV files. Each CSV file lists the paintings of one painter in the catalog.
We propose creating one Dataset that "bundles" all these files, and a json-ld file that contains the entity mappings for each table.
The dataset definition will look like:
<script type="application/ld+json">
{
"@context": "http://schema.org",
"@id": "paintings-dataset",
"@type": "Dataset",
"datasetElement": [
{"@id": "rembrandt-paintings.csv",
"@type": "DataDownload",
"encodingFormat": "text/csv",
"contentUrl": "http://wp.org/rembrandt-paintings.csv"
},
{"@id": "renoir-paintings.csv",
"@type": "DataDownload",
"encodingFormat": "text/csv",
"contentUrl": "http://wp.org/renoir-paintings.csv"
},
…,
{"@id": "paintings",
"@type": "DataDownload",
"encodingFormat": "application/json",
"contentUrl": "http://wp.org/paintings.json"
}
]
}
</script>
The CSV files for each painter look like the one in the example above. The paintings.json file contains a list of table mappings like the one above:
[{
"@context": {
"@vocab" : "http://schema.org/",
"p1" : "http://wp.org/rembrandt-paintings.csv#"
},
"@type": "SetOf/Painting",
"name": "{p1:col:Title}",
"dateCreated" : "{p1:col:Year}",
"contentLocation" : "{p1:col:Gallery}",
"author" : "http://en.wikipedia.org/wiki/Rembrandt"
},
{
"@context": {
"@vocab" : "http://schema.org/",
"p2" : "http://wp.org/renoir-paintings.csv#"
},
"@type": "SetOf/Painting",
"name": "{p2:col:Title}",
"dateCreated" : "{p2:col:Year}",
"contentLocation" : "{p2:col:Gallery}",
"author" : "http://en.wikipedia.org/wiki/Renoir"
},...]
In schema.org and other vocabularies, URLs are used as primary identifiers of entities. We illustrate through an example how they can easily be created and used for referencing entities.
We will use the patterns described above to generate URLs for entities from column values.
In the case or URLs, the pattern language we use here is actually very similar to URI templates.
Suppose we have two tables, one that lists countries and one that lists cities.
Countries
country-code | country-name |
DE | Germany |
FR | France |
Cities
city-code | city-name | city-country |
PAR | Paris | FR |
BER | Berlin | DE |
Let's assume these table exist as two CSV files at the following locations:
The country table can be described as follows:
{
"@context": {
"@vocab": "http://schema.org",
"t1": "http://my.domain.org/countries.csv#"
}
"@type": "SetOf/Country",
"@id": "http://my.domain.org/country/{t1:col:country-code}",
"name": "{t1:col:country-name}"
}]
The "@id" field is used in JSON-LD to specify the URL of an entity. We use a pattern to construct a URL for each country, based on its country code value
The city table description can create a reference to the country of each city through its url:
{
"@context":{
"@vocab": "http://schema.org",
"t2": "http://my.domain.org/cities.csv#"
}
"@type": "SetOf/City",
"@id": "http://my.domain.org/city/{t2:col:city-code}",
"name": "{t2:col:city-name}",
"containedIn": "http://my.domain.org/country/{t2:col:city-country}"
},
The example above is a normalized representation: Because each country may contain multiple cities, cities and countries are represented as separate tables, and each table containing exactly one row per entity.
In reality, tables are often denormalized, with redundant information in them. For instance, there may be a single city / country table that looks like:
city-code | city-name | city-country | country-name |
PAR | Paris | FR | France |
LIL | Lille | FR | France |
BER | Berlin | DE | Germany |
MUN | Munich | DE | Germany |
This table can be described as follows:
{
"@context": {
"@vocab": "http://schema.org",
"t3": "http://my.domain.org/cities-counties.csv#"
}
"@type": "SetOf/City",
"@id": "http://my.domain.org/city/{t3:col:city-code}",
"name": "{t3:col:city-name}",
"containedIn": {
"@type": "SetOf/Country",
"@id": "http://my.domain.org/country/{t3:col:city-country}",
"name": "{t3:col:country-name}"
}
}
We use JSON-LD's ability to embed structured values to nest the definition of countries (with their URLs and names) inside the definition of cities.
Each row of the table will still generate one city entity, however not every row will generate a different country. Countries with the same "@id" attribute will be merged into a single entity. This happens naturally when the triples are generated, because they will have the same subject.
At the beginning of this document, we showed how to describe (simple) HTML tables with mark-up, and then introduced a JSON-LD syntax to mark-up external CSV tables. This approach can also be used for HTML tables, provided identifiers are defined for table columns.
For instance, the Country table above can be written in HTML as follows:
<table>
<tr>
<th id="country-code">country</th>
<th id="country-name">name</th>
</tr>
<tr>
<td>FR</td><td>France</td>
</tr>
<tr>
<td>DE</td><td>Germany</td>
</tr>
</table>
This table can be described by a JSON-LD fragment embedded in the same document:
<script type="application/ld+json">
{
"@context": "http://schema.org",
"@type": "SetOf/Country",
"@id": "http://example.com/country/{#country-code}",
"name": "{#country-name}"
}
</script>
The only difference with the CSV table case is that columns are referenced via the HTML identifiers in the table definition, instead of URL fragments for CSV. Compared to the direct mark-up approach we described at the beginning of the document, annotation through references to columns with ids is more flexible.