Re: Best Practices for Converting CSV into LOD?

On Mon, 2010-08-09 at 10:37 -0600, Wood, Jamey wrote: 
> Are there any established best practices for converting CSV data into LOD-friendly RDF?  For example, I would like to produce an LOD-friendly RDF version of the "2001 - Present Net Generation by State by Type of Producer by Energy Source" CSV data at:
> 
>   http://www.eia.doe.gov/cneaf/electricity/epa/epa_sprdshts_monthly.html
> 
> I'm attaching a sample of a first stab at this.  Questions I'm running into include the following:
> 
> 
>  1.  Should one try to convert primitive data types (particularly strings) into URI references?  Or just leave them as primitives?  Or perhaps provide both (with separate predicate names)?  For example, the  sample EIA data I reference has two-letter state abbreviations in one column.  Should those be left alone or converted into URIs?

If the code corresponds to a concept that already has a useful URI to
link to, then yes.

Where the string is a code but no URI scheme exists for it, one
approach is to create a set of SKOS concepts to represent the codes,
recording the original code string using skos:notation.
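
As a rough sketch of the SKOS approach, the following Python builds Turtle
for a small concept scheme of state codes. The example.org URIs and the
function names are made up for illustration, and it assumes labels contain
no characters that need escaping in Turtle.

```python
# Hypothetical URIs; replace with a namespace you control.
BASE = "http://example.org/eia/state/"
SCHEME = "http://example.org/eia/states"

PREFIXES = "@prefix skos: <http://www.w3.org/2004/02/skos/core#> .\n"


def state_concept(code, label):
    """Return Turtle for one SKOS concept representing a state code."""
    return (
        f"<{BASE}{code}> a skos:Concept ;\n"
        f'    skos:notation "{code}" ;\n'      # the original CSV code string
        f'    skos:prefLabel "{label}"@en ;\n'
        f"    skos:inScheme <{SCHEME}> .\n"
    )


def states_to_turtle(states):
    """Return a Turtle document for a dict mapping code -> label."""
    parts = [PREFIXES, f"<{SCHEME}> a skos:ConceptScheme .\n"]
    # Sort by code so the output is stable between runs.
    parts += [state_concept(code, label) for code, label in sorted(states.items())]
    return "\n".join(parts)


if __name__ == "__main__":
    print(states_to_turtle({"CO": "Colorado", "AK": "Alaska"}))
```

Each data row would then point at the concept URI rather than carry the
bare two-letter string, while the string itself stays recoverable through
skos:notation.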

> 2.  Should one merge separate columns from the original data in order to align to well-known RDF types?  For example, the sample EIA data has separate "Year" and "Month" columns.  Should those be merged in the RDF version so that an "xs:gYearMonth" type can be used?

Probably. Merging is useful if you are going to query via the merged
form. For year/month there is also an argument for keeping the separate
values, so that you can query by month independent of year.
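
A minimal sketch of producing both forms in Python, assuming the CSV holds
numeric year and month values. The dictionary keys are hypothetical
predicate names, not part of any vocabulary.

```python
XSD = "http://www.w3.org/2001/XMLSchema#"


def to_g_year_month(year, month):
    """Merge year and month into an xsd:gYearMonth lexical form (YYYY-MM)."""
    y, m = int(year), int(month)
    if not 1 <= m <= 12:
        raise ValueError(f"month out of range: {month!r}")
    return f"{y:04d}-{m:02d}"


def row_literals(year, month):
    """Return (lexical form, datatype URI) pairs for the merged and separate forms."""
    y, m = int(year), int(month)
    return {
        "yearMonth": (to_g_year_month(y, m), XSD + "gYearMonth"),
        "year": (f"{y:04d}", XSD + "gYear"),
        "month": (f"--{m:02d}", XSD + "gMonth"),  # gMonth is written --MM
    }


if __name__ == "__main__":
    for name, (value, datatype) in row_literals("2001", "3").items():
        print(name, value, datatype)
```

Keeping the separate month value means a query such as "all Januaries"
becomes a plain match instead of string manipulation on the merged value.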

> 3.  Should one attempt to introduce some sort of hierarchical structure (to make the LOD more "browseable")?  The "skos:related" triples in the attached sample are an initial attempt to do that.  Is this a good idea?  If so, is that a reasonable predicate to use?  If it is a reasonable thing to do, we would presumably craft these triples so that one could navigate through the entire LOD (e.g. "state" -> "state/year" -> "state/year/month" -> "state/year/month/typeOfProducer" -> "state/year/month/typeOfProducer/energySource").

Another approach is to use one of the statistics-in-RDF representations
so that you can slice by the dimensions in the data.

There is the Scovo vocabulary [1]. 

Recently a group of us have been working on an updated vocabulary for
statistics [2] based on the SDMX standard [3]. At a recent Open Data
Foundation workshop [4] we agreed to partition the SDMX-in-RDF work into
a simple "Data Cube" vocabulary [5] and extension vocabularies to
support particular domains such as aggregate statistics (SDMX) and maybe
eventually micro-data (DDI).

The Data Cube vocabulary is very much a work in progress but I think we
have now closed out all the main open design questions, have a draft
vocab and aim to get the initial documentation to a usable state over
the coming few weeks.
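
To give a feel for the dimensional style, here is a rough Python sketch that
treats each CSV row as one observation keyed by its dimensions, slices by
fixing some of them, and writes one observation out as Turtle. The qb:
namespace and the Observation and dataSet terms are those of the Data Cube
vocabulary as later published, so the draft may differ. Everything under
example.org, the dimension names and the sample values are invented for
illustration.

```python
QB = "http://purl.org/linked-data/cube#"  # published namespace; draft may differ
EX = "http://example.org/eia/"            # hypothetical namespace

DIMENSIONS = ("state", "yearMonth", "producerType", "energySource")

# Made-up sample rows, not real EIA figures.
observations = [
    {"state": "CO", "yearMonth": "2001-01", "producerType": "utility",
     "energySource": "coal", "generation": 100},
    {"state": "CO", "yearMonth": "2001-02", "producerType": "utility",
     "energySource": "coal", "generation": 200},
    {"state": "AK", "yearMonth": "2001-01", "producerType": "utility",
     "energySource": "gas", "generation": 300},
]


def slice_by(rows, **fixed):
    """Return the observations whose dimension values match everything in `fixed`."""
    return [r for r in rows if all(r[k] == v for k, v in fixed.items())]


def observation_turtle(obs, index):
    """Render one observation as Turtle, one property per dimension plus the measure."""
    lines = [
        f"<{EX}obs/{index}> a <{QB}Observation> ;",
        f"    <{QB}dataSet> <{EX}dataset/netGeneration> ;",
    ]
    for dim in DIMENSIONS:
        lines.append(f'    <{EX}dimension/{dim}> "{obs[dim]}" ;')
    lines.append(f'    <{EX}measure/generation> {obs["generation"]} .')
    return "\n".join(lines)


if __name__ == "__main__":
    for i, obs in enumerate(slice_by(observations, state="CO")):
        print(observation_turtle(obs, i))
        print()
```

In real data the dimension values would more usefully be URIs (for example
the SKOS concepts for states) than plain strings; strings are used here only
to keep the sketch short.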

Feel free to ping me offline if you would like to follow up on this.

Dave

[1] http://semanticweb.org/wiki/Scovo
[2] http://code.google.com/p/publishing-statistical-data/
[3] http://sdmx.org/
[4] http://www.odaf.org/blog/?p=39
[5] http://publishing-statistical-data.googlecode.com/svn/trunk/specs/src/main/html/cube.html

Received on Wednesday, 11 August 2010 03:13:58 UTC