Re: Best Practices for Converting CSV into LOD? from Axel Rauschmayer on 2010-08-09 (public-lod@w3.org from August 2010)

From: Axel Rauschmayer <axel@rauschma.de>
Date: Mon, 9 Aug 2010 21:12:54 +0200
To: "Wood, Jamey" <Jamey.Wood@nrel.gov>
Cc: "public-lod@w3.org" <public-lod@w3.org>
Message-Id: <492DD16B-28D6-4FE9-A029-47C54053FF06@rauschma.de>
I gave this a shot in a previous version of Hyena. By prepending one or more special rows, one could control how the columns were converted: what predicate to use and how to convert the content. If a column specification was missing, defaults were used. There were several options: If a cell value was similar to a tag, resources could be auto-created (the cell value became the resource label, and existing resources were looked up via their labels). One could also split a cell value prior to processing it (to account for multiple values per column).
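To make the idea of prepended specification rows concrete, here is a minimal sketch in Python. It is not Hyena's actual format: the "#predicate" and "#split" directive names, the layout (one leading directive cell followed by one cell per data column), and the example.org default namespace are all assumptions made for illustration.

```python
import csv
import io

# Hypothetical default namespace, used when a column has no predicate spec.
DEFAULT_NS = "http://example.org/vocab/"


def _cell(cells, i):
    """Return the i-th spec cell, or '' if the spec row is too short."""
    return cells[i].strip() if i < len(cells) else ""


def parse_with_spec_rows(text):
    """Read leading '#directive' rows as column specs, then convert the data rows."""
    rows = list(csv.reader(io.StringIO(text)))
    spec = {}
    # Special rows come first; their first cell is '#' plus a directive name.
    while rows and rows[0] and rows[0][0].startswith("#"):
        directive, *cells = rows.pop(0)
        spec[directive[1:]] = cells
    header, *data = rows
    columns = []
    for i, title in enumerate(header):
        # Fall back to a default predicate derived from the column title.
        predicate = _cell(spec.get("predicate", []), i) or (
            DEFAULT_NS + title.strip().replace(" ", "_")
        )
        separator = _cell(spec.get("split", []), i)
        columns.append((predicate, separator))
    records = []
    for row in data:
        record = {}
        for (predicate, separator), value in zip(columns, row):
            # Split multi-valued cells only if a separator was specified.
            values = value.split(separator) if separator else [value]
            record[predicate] = [v.strip() for v in values if v.strip()]
        records.append(record)
    return records


SAMPLE = """#predicate,http://xmlns.com/foaf/0.1/name,
#split,,;
Name,Tags
Alice,rdf; csv
"""

records = parse_with_spec_rows(SAMPLE)
```

In this sample the first column gets an explicit predicate, while the second falls back to the default and is split on ";" into two values.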

Creating meaningful URIs for predicates and rows (resources) is especially important, but tricky. Ideally, import would work bi-directionally (and idempotently): Changes you make in RDF can be written back to the spreadsheet, and changes in the spreadsheet can be reimported without causing chaos.
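One way to get idempotent re-imports is to derive each row's URI only from the values of a few key columns, so that the same row always maps to the same resource. The following is a rough sketch under that assumption; the base URI, the function name and the choice of key columns are hypothetical, and someone still has to pick the key columns.

```python
from urllib.parse import quote

# Hypothetical base URI for generated resources.
BASE = "http://example.org/resource/"


def row_uri(row, key_columns):
    """Build a deterministic URI from the values of the chosen key columns."""
    # Normalize and percent-encode each key value so it is safe in a URI path.
    parts = [quote(row[name].strip().lower(), safe="") for name in key_columns]
    return BASE + "/".join(parts)


row = {"State": "CO", "Year": "2001", "Month": "1", "Value": "123"}
uri = row_uri(row, ["State", "Year", "Month"])
```

Because non-key cells (here "Value") do not take part in the URI, editing them in the spreadsheet and re-importing updates the existing resource instead of creating a new one.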

Even though my solution worked OK and I do not see how it could be done better, I was not completely happy with it, because writing this kind of CSV/RDF mapping is beyond the capabilities of normal end users. One could automatically create URIs for predicates from column titles, but as for reliable URIs for rows ("primary keys"), I am at a loss. So it seems like one is stuck with letting an expert write an import specification and hiding it from end users. In that case, my solution of embedding such a spec in the spreadsheet should be re-thought. And it seems like a simple script might be a better solution than a complex specification language that can handle all the special cases. For example, I hadn’t even thought about two cells contributing to the same literal. Maybe a JVM-hosted scripting language (such as Jython) could be used, but even raw Java is not so bad and has the advantage of superior tool support.
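As an illustration of the "simple script" route, the sketch below covers two of the cases mentioned: deriving a predicate URI from a column title, and letting two cells (year and month) contribute to one typed literal. The vocabulary namespace and both function names are assumptions for this example; the literal is written in N-Triples syntax.

```python
import re

# Hypothetical vocabulary namespace for generated predicates.
VOCAB = "http://example.org/vocab/"
XSD = "http://www.w3.org/2001/XMLSchema#"


def predicate_uri(column_title):
    """Turn a column title into a camelCase predicate URI."""
    words = re.findall(r"[A-Za-z0-9]+", column_title)
    if not words:
        raise ValueError("column title contains no usable characters")
    first, *rest = words
    return VOCAB + first.lower() + "".join(w.capitalize() for w in rest)


def year_month_literal(year, month):
    """Merge separate year and month cells into one xsd:gYearMonth literal."""
    return '"%04d-%02d"^^<%sgYearMonth>' % (int(year), int(month), XSD)


predicate = predicate_uri("Type of Producer")
literal = year_month_literal("2001", "1")
```

A script like this keeps the special cases in ordinary code, at the price of needing someone who can program to write it.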

This is important stuff, as many people have all kinds of lists in Excel, which would make great LOD data. It also shows that spreadsheets are hard to beat when it comes to getting started quickly: You just enter your data. Should someone come up with a simpler way of translating CSV data, then that might translate to general usability improvements for entering LOD data.

On Aug 9, 2010, at 18:37 , Wood, Jamey wrote:

> Are there any established best practices for converting CSV data into LOD-friendly RDF?  For example, I would like to produce an LOD-friendly RDF version of the "2001 - Present Net Generation by State by Type of Producer by Energy Source" CSV data at:
> 
>  http://www.eia.doe.gov/cneaf/electricity/epa/epa_sprdshts_monthly.html
> 
> I'm attaching a sample of a first stab at this.  Questions I'm running into include the following:
> 
> 
> 1.  Should one try to convert primitive data types (particularly strings) into URI references?  Or just leave them as primitives?  Or perhaps provide both (with separate predicate names)?  For example, the  sample EIA data I reference has two-letter state abbreviations in one column.  Should those be left alone or converted into URIs?
> 2.  Should one merge separate columns from the original data in order to align to well-known RDF types?  For example, the sample EIA data has separate "Year" and "Month" columns.  Should those be merged in the RDF version so that an "xs:gYearMonth" type can be used?
> 3.  Should one attempt to introduce some sort of hierarchical structure (to make the LOD more "browseable")?  The "skos:related" triples in the attached sample are an initial attempt to do that.  Is this a good idea?  If so, is that a reasonable predicate to use?  If it is a reasonable thing to do, we would presumably craft these triples so that one could navigate through the entire LOD (e.g. "state" -> "state/year" -> "state/year/month" -> "state/year/month/typeOfProducer" -> "state/year/month/typeOfProducer/energySource").
> 4.  Any other considerations that I'm overlooking?
> 
> Thanks,
> Jamey
> <generation_state_mon.rdf>

-- 
Dr. Axel Rauschmayer
Axel.Rauschmayer@ifi.lmu.de
http://hypergraphs.de/
### Hyena: organize your ideas, free at hypergraphs.de/hyena/
Received on Monday, 9 August 2010 19:13:25 UTC