- From: Dave Reynolds <dave.e.reynolds@gmail.com>
- Date: Mon, 09 Aug 2010 22:07:34 +0100
- To: "Wood, Jamey" <Jamey.Wood@nrel.gov>
- Cc: "public-lod@w3.org" <public-lod@w3.org>
On Mon, 2010-08-09 at 10:37 -0600, Wood, Jamey wrote:
> Are there any established best practices for converting CSV data into LOD-friendly RDF? For example, I would like to produce an LOD-friendly RDF version of the "2001 - Present Net Generation by State by Type of Producer by Energy Source" CSV data at:
>
> http://www.eia.doe.gov/cneaf/electricity/epa/epa_sprdshts_monthly.html
>
> I'm attaching a sample of a first stab at this. Questions I'm running into include the following:
>
> 1. Should one try to convert primitive data types (particularly strings) into URI references? Or just leave them as primitives? Or perhaps provide both (with separate predicate names)? For example, the sample EIA data I reference has two-letter state abbreviations in one column. Should those be left alone or converted into URIs?

If the code corresponds to a concept which has a useful URI to link to then "yes". In cases where the string is a code but there isn't an existing URI scheme then one approach is to create a set of SKOS concepts to represent the codes, recording the original code string using skos:notation.

> 2. Should one merge separate columns from the original data in order to align to well-known RDF types? For example, the sample EIA data has separate "Year" and "Month" columns. Should those be merged in the RDF version so that an "xs:gYearMonth" type can be used?

Probably. Merging is useful if you are going to query via the merged form. In a case like year/month there could be an argument for also keeping the separate forms as well to enable you to query by month, independent of year.

> 3. Should one attempt to introduce some sort of hierarchical structure (to make the LOD more "browseable")? The "skos:related" triples in the attached sample are an initial attempt to do that. Is this a good idea? If so, is that a reasonable predicate to use? If it is a reasonable thing to do, we would presumably craft these triples so that one could navigate through the entire LOD (e.g. "state" -> "state/year" -> "state/year/month" -> "state/year/month/typeOfProducer" -> "state/year/month/typeOfProducer/energySource").

Another approach is to use one of the statistics-in-RDF representations so that you can slice by the dimensions in the data. There is the Scovo vocabulary [1]. Recently a group of us have been working on an updated vocabulary for statistics [2] based on the SDMX standard [3]. At a recent Open Data Foundation workshop [4] we agreed to partition the SDMX-in-RDF work into a simple "Data Cube" vocabulary [5] and extension vocabularies to support particular domains such as aggregate statistics (SDMX) and maybe eventually micro-data (DDI).

The Data Cube vocabulary is very much a work in progress but I think we have now closed out all the main open design questions, have a draft vocab and aim to get the initial documentation to a usable state over the coming few weeks.

Feel free to ping me off line if you would like to follow up on this.

Dave

[1] http://semanticweb.org/wiki/Scovo
[2] http://code.google.com/p/publishing-statistical-data/
[3] http://sdmx.org/
[4] http://www.odaf.org/blog/?p=39
[5] http://publishing-statistical-data.googlecode.com/svn/trunk/specs/src/main/html/cube.html
Received on Wednesday, 11 August 2010 03:13:58 UTC
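
As a rough illustration of the answers to questions 1 and 2 in the message above, here is a minimal sketch in Python that turns one CSV row into N-Triples lines. It mints a SKOS concept for the state code with the original string kept as skos:notation, and it emits the merged xsd:gYearMonth value alongside the separate year and month. The http://example.org/eia/ namespace, the property names under def/, the column names "Year", "Month" and "State", and the sample row are all assumptions made up for this sketch; they are not taken from the EIA spreadsheet or from the attached sample.

```python
import csv
import io

# Hypothetical base namespace, used only for illustration.
BASE = "http://example.org/eia/"
SKOS = "http://www.w3.org/2004/02/skos/core#"
XSD = "http://www.w3.org/2001/XMLSchema#"
RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"

# Made-up sample row; the column names are assumed, not the real EIA headers.
SAMPLE = "Year,Month,State\n2001,1,AK\n"


def state_concept_triples(code):
    """Mint a SKOS concept for a state code, keeping the code as skos:notation."""
    subject = f"<{BASE}state/{code}>"
    return [
        f"{subject} <{RDF_TYPE}> <{SKOS}Concept> .",
        f'{subject} <{SKOS}notation> "{code}" .',
    ]


def period_triples(subject, year, month):
    """Emit the merged gYearMonth value plus the separate year and month."""
    y, m = int(year), int(month)
    return [
        f'{subject} <{BASE}def/period> "{y:04d}-{m:02d}"^^<{XSD}gYearMonth> .',
        f'{subject} <{BASE}def/year> "{y:04d}"^^<{XSD}gYear> .',
        # The xsd:gMonth lexical form is two hyphens followed by the month.
        f'{subject} <{BASE}def/month> "--{m:02d}"^^<{XSD}gMonth> .',
    ]


def convert(csv_text):
    """Convert CSV text to a list of N-Triples lines (duplicates not removed)."""
    triples = []
    for row in csv.DictReader(io.StringIO(csv_text)):
        state = row["State"]
        month = int(row["Month"])
        subject = f"<{BASE}obs/{state}/{row['Year']}-{month:02d}>"
        triples += state_concept_triples(state)
        # Link the row to the state concept by URI instead of a plain string.
        triples.append(f"{subject} <{BASE}def/state> <{BASE}state/{state}> .")
        triples += period_triples(subject, row["Year"], row["Month"])
    return triples


if __name__ == "__main__":
    print("\n".join(convert(SAMPLE)))
```

Keeping the separate year and month values costs two extra triples per row, but it allows a query by month across all years without string manipulation on the merged value. Where a well-known URI already exists for a state, linking to that instead of minting a new concept would follow the first suggestion in the reply.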