Re: Intro and Thoughts on CSV2RDF

Jeremy, all,

I'll try to find some time to do it. However, the use case would express the
general notion of "why I need to convert to RDF". Basically, any RDF use case
can be applied here.

danbri, do you have any particular use cases in mind?


Juan Sequeda
+1-575-SEQ-UEDA
www.juansequeda.com


On Fri, Feb 14, 2014 at 4:31 AM, Tandy, Jeremy <
jeremy.tandy@metoffice.gov.uk> wrote:

>  Hi Juan - thanks for your thoughts. I note that currently we don't have
> any use cases on the wiki <https://www.w3.org/2013/csvw/wiki/Use_Cases>
> that discuss publication of (semantically enabled) CSV from relational
> database tables. We need use cases, expressed as a user-driven /
> outcome-driven narrative (e.g. tell a story about what someone is trying to
> achieve rather than use abstract functional requirements) with real data
> examples, in order to establish our requirements for the eventual spec.
>
>
>
> Can you add something to the wiki please so that I can incorporate it into
> the "Use cases and requirements" documentation?
>
>
>
> Many thanks, Jeremy
>
>
>
> *From:* Juan Sequeda [mailto:juanfederico@gmail.com]
> *Sent:* 12 February 2014 16:58
> *To:* public-csv-wg@w3.org
> *Subject:* Intro and Thoughts on CSV2RDF
>
>
>
> All,
>
>
>
> Quick intro (even though I did make an intro on the first call): I'm
> finishing my PhD in CS at UT Austin. My research focuses on the integration
> of relational databases with the semantic web. A result of my research is
> Ultrawrap [1], a Relational Database to RDF (RDB2RDF) system capable of
> running SPARQL queries as fast as SQL queries. Ultrawrap has been
> productized and is compliant with the W3C Direct Mapping and R2RML
> standards for RDB2RDF. Ultrawrap is currently being commercialized by my
> startup, Capsenta [2]. Ultrawrap is being used to generate the RDF dumps of
> Musicbrainz. Additionally, the data behind Constitute Project [3] comes
> from hundreds of CSVs that were converted to RDF using Ultrawrap. I've been
> involved in the RDB2RDF space since the first workshop in 2007, through the
> XG and WG, as editor of the Direct Mapping spec and as implementor of both
> standards.
>
>
>
> So... I believe I can bring some thoughts to the table wrt CSV to RDF.
> Part of these thoughts come from conversations that I have had previously
> with danbri.
>
>
>
> I saw in today's minutes that the RDB2RDF topic came up. I agree with Axel
> that "CSV2RDF should be just a "dialect/small modification" of the existing
> RDB2RDF spec". I actually encourage that there exists both a Direct Mapping
> (completely automated mapping) and a modification of R2RML.
>
>
>
> The following issues arise:
>
> - How do you know whether the first row is a header or not?
>
> - How do you know whether there exists an id attribute/field which acts as
> a unique identifier for the tuple (i.e., a primary key)?
>
>
>
> Therefore, there needs to be a way to state this in a standard way. I'm
> assuming this is going to go somewhere. Given this information, the Direct
> Mapping standard should apply transparently (or so I believe at this
> moment).
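
To make the point above concrete, here is a minimal sketch of a Direct Mapping-style conversion once the two facts (header or not, key column or not) are stated explicitly. It only loosely follows the Direct Mapping IRI shapes and skips percent-encoding and datatypes. The function name, its parameters, and the positional column names `col1`, `col2`, ... are assumptions made for illustration, not part of any spec.

```python
import csv
import io


def direct_map(csv_text, base, table, has_header=True, key_column=None):
    """Yield (subject, predicate, object) triples for each CSV row.

    Loosely follows the Direct Mapping shapes; no percent-encoding is done.
    """
    rows = list(csv.reader(io.StringIO(csv_text)))
    if not rows:
        return
    if has_header:
        header, data = rows[0], rows[1:]
    else:
        # No header declared: invent positional column names (an assumption).
        header = [f"col{i + 1}" for i in range(len(rows[0]))]
        data = rows
    for n, row in enumerate(data, start=1):
        record = dict(zip(header, row))
        if key_column is not None:
            # A declared key gives every row a stable IRI.
            subject = f"<{base}{table}/{key_column}={record[key_column]}>"
        else:
            # No declared key: fall back to one blank node per row.
            subject = f"_:{table}{n}"
        yield (subject, "rdf:type", f"<{base}{table}>")
        for name, value in record.items():
            yield (subject, f"<{base}{table}#{name}>", f'"{value}"')
```

For example, `list(direct_map("id,name\n1,Alice\n", "http://example.com/", "people", key_column="id"))` yields one type triple plus one literal triple per column, all with the subject `<http://example.com/people/id=1>`.
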
>
>
>
> Now with R2RML, I believe some changes need to be made. R2RML was made to
> take advantage of SQL as much as possible; that is why you can define a
> mapping on a table or on a SQL query. Take for example the following R2RML
> mappings for Musicbrainz [4]. You can see that the tuples from "SELECT *
> FROM artist WHERE artist.type = 1" are mapped to instances of
> mo:SoloMusicArtist while tuples from "SELECT * FROM artist WHERE
> artist.type = 2" are mapped to instances of mo:MusicGroup. I'm not sure how
> to do this without a SQL engine. Therefore, should SQL engines be involved
> in the CSV to RDF transformation?
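
One possible answer to the question above, sketched under assumptions: load the CSV into an in-memory SQLite database and run the same queries against it. The helper names, the `http://example.com/artist/{id}` subject template, and the sample columns are hypothetical; only the two queries and the two mo: classes come from the mapping discussed above.

```python
import csv
import io
import sqlite3


def load_csv(conn, table, csv_text):
    """Load CSV text (first row = header) into a SQLite table."""
    rows = list(csv.reader(io.StringIO(csv_text)))
    header, data = rows[0], rows[1:]
    # NUMERIC affinity turns "1" into the integer 1 on insert, so a filter
    # like type = 1 matches. Caveat: a code such as "007" would become 7.
    columns = ", ".join(f'"{name}" NUMERIC' for name in header)
    placeholders = ", ".join("?" for _ in header)
    conn.execute(f'CREATE TABLE "{table}" ({columns})')
    conn.executemany(f'INSERT INTO "{table}" VALUES ({placeholders})', data)


def typed_subjects(conn, query, rdf_class, template):
    """Run a SQL query and yield one rdf:type triple per resulting row."""
    cursor = conn.execute(query)
    names = [d[0] for d in cursor.description]
    for row in cursor:
        record = dict(zip(names, row))
        yield (template.format(**record), "rdf:type", rdf_class)


conn = sqlite3.connect(":memory:")
load_csv(conn, "artist", "id,name,type\n1,Bjork,1\n2,Queen,2\n")
solo = list(typed_subjects(
    conn,
    "SELECT * FROM artist WHERE artist.type = 1",
    "mo:SoloMusicArtist",
    "<http://example.com/artist/{id}>",
))
groups = list(typed_subjects(
    conn,
    "SELECT * FROM artist WHERE artist.type = 2",
    "mo:MusicGroup",
    "<http://example.com/artist/{id}>",
))
```

The trade-off is that every CSV2RDF processor would then need an embedded SQL engine, and that column typing has to be guessed or declared somewhere, since CSV carries no types.
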
>
>
>
> Another instance where R2RML relies heavily on SQL is when you want to
> translate database codes into IRIs [5]. For example, you may have a code
> value "eng" which should be mapped to some URI such as
> http://example.com/engineering, which is part of a well-defined
> thesaurus/vocabulary.
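
Without SQL, the same translation could be expressed as a plain lookup table. This is only a sketch: the `CODE_TO_IRI` table and the `translate_code` helper are hypothetical names, and the single entry is the "eng" example from above.

```python
# Hypothetical translation table: code value -> IRI in the target vocabulary.
CODE_TO_IRI = {
    "eng": "http://example.com/engineering",
}


def translate_code(code, table=CODE_TO_IRI):
    """Return the IRI for a code, or None when the code is not in the table."""
    return table.get(code)


def translate_column(records, column, table=CODE_TO_IRI):
    """Yield (code, iri) pairs for one column, skipping unknown codes."""
    for record in records:
        iri = translate_code(record[column], table)
        if iri is not None:
            yield (record[column], iri)
```

Whether unknown codes should be skipped, kept as literals, or reported as errors is a design decision that this sketch does not settle; it skips them.
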
>
>
>
> We have implemented a CSV2RDF in Ultrawrap which uses the Direct Mapping
> and R2RML standards as-is. The only assumption we have at the moment is
> that the first row is a header and the first attribute acts as a primary
> key.
>
>
>
> Another topic I've discussed with danbri is the case where you have a set
> of CSVs which are basically the dumps of all the relational tables of a
> database. In that case there are implicit foreign keys, so there should be
> a way to describe the relationships (foreign keys) between different CSVs.
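
As an illustration of what such a description could drive, here is a small sketch that takes a foreign-key declaration and emits link triples between rows of two CSVs. The `foreign_keys` dictionary format is a made-up placeholder for whatever metadata syntax is eventually chosen; the IRI shapes loosely follow the Direct Mapping's reference triples, without percent-encoding.

```python
import csv
import io


def link_triples(csv_text, base, table, key_column, foreign_keys):
    """Yield reference triples from rows of one CSV to rows of another.

    foreign_keys maps a column in this CSV to a
    (target_table, target_key_column) pair. This format is hypothetical.
    """
    for record in csv.DictReader(io.StringIO(csv_text)):
        subject = f"<{base}{table}/{key_column}={record[key_column]}>"
        for column, (target_table, target_key) in foreign_keys.items():
            value = record[column]
            if value == "":
                continue  # an empty cell is treated like NULL: no link
            yield (
                subject,
                f"<{base}{table}#ref-{column}>",
                f"<{base}{target_table}/{target_key}={value}>",
            )


people_csv = "ID,fname,addr\n7,Bob,18\n8,Sue,\n"
links = list(link_triples(
    people_csv,
    "http://example.com/",
    "People",
    "ID",
    {"addr": ("Addresses", "ID")},
))
```

Here the row for Bob links to the Addresses row with ID 18, while the row for Sue, whose `addr` cell is empty, produces no link.
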
>
>
>
> These are my initial thoughts. Looking forward to hearing what others have
> to say.
>
>
>
> [1] http://www.sciencedirect.com/science/article/pii/S1570826813000383
>
> [2] http://www.capsenta.com/
>
> [3] https://www.constituteproject.org/
>
> [4]
> https://github.com/LinkedBrainz/MusicBrainz-R2RML/blob/master/mappings/artist.ttl
>
> [5] http://www.w3.org/TR/r2rml/#example-translationtable
>
>
>  Juan Sequeda
> +1-575-SEQ-UEDA
> www.juansequeda.com
>

Received on Friday, 14 February 2014 15:22:28 UTC