- From: Juan Sequeda <juanfederico@gmail.com>
- Date: Tue, 18 Feb 2014 09:05:00 -0600
- To: Ivan Herman <ivan@w3.org>
- Cc: Andy Seaborne <andy@apache.org>, "public-csv-wg@w3.org" <public-csv-wg@w3.org>
- Message-ID: <CAMVTWDzHaiu1BhqC1bqUtCaw_7d58TPenmDKL_YY+2hk3Ttbhg@mail.gmail.com>
On Tue, Feb 18, 2014 at 5:50 AM, Ivan Herman <ivan@w3.org> wrote:

> Andy Seaborne wrote:
> > On 12/02/14 16:57, Juan Sequeda wrote:
> >> So... I believe I can bring some thoughts to the table wrt CSV to RDF.
> >> Part of these thoughts come from conversations that I have had
> >> previously with danbri.
> >>
> >> I saw in today's minutes that the RDB2RDF topic came up. I agree with
> >> Axel that "CSV2RDF should be just a "dialect/small modification" of the
> >> existing RDB2RDF spec". I actually encourage that there exists both a
> >> Direct Mapping (completely automated mapping) and a modification of
> >> R2RML.
> >>
> >> The following issues arise:
> >> - How do you know if the first column is a header or not.
> >> - How do you know if there exists an id attribute/field which acts as a
> >> unique identifier for the tuple (i.e primary key).
> >>
> >> Therefore, there needs to be a way to state this in a standard way. I'm
> >> assuming this is going to go somewhere. Given this information, the
> >> Direct Mapping standard should apply transparently (or so I believe at
> >> this moment).
> >>
> >> Now with R2RML, I believe some changes need to be made. R2RML was made
> >> to take advantage of SQL as much as possible; that is why you can define
> >> a mapping on table or on a sql query. Take for example the following
> >> R2RML mappings for Musicbrainz [4]. You can see that the tuples from
> >> "SELECT * FROM artist WHERE artist.type = 1" are mapped to instances of
> >> mo:SoloMusicArtist while tuples from "SELECT * FROM artist WHERE
> >> artist.type = 2" are mapped to instances of mo:MusicGroup. I'm not sure
> >> how to do this without a SQL engine. Therefore, should SQL engines be
> >> involved in the CSV to RDF transformation?
> >>
> >> Another instance where R2RML relies heavily on SQL is when you want to
> >> translate database codes into IRIs [5]. For example, if you have a code
> >> value "eng" which should be mapped to some URI
> >> http://example.com/engineering, which is part of a well defined
> >> thesaurus/vocabulary.
> >
> > I agree that R2RML is a possible starting point and also that it does not
> > apply automatically.
> >
> > There often isn't an explicit primary key nor proper foreign keys. If the
> > CSV conversion process can influence the CSV format, then there is a lot
> > that can be done but if the CSV format is fixed, it may not be ideal.
>
> I am not sure what you mean here when you say 'influence'. I presume we
> should take the CSV format as a black box, meaning that this group will not
> (and should not) define what CSV is and we should be able to take anything
> that is out there....

+1

> The attached metadata may be used to define a 'primary key', ie, a column
> that serves as such, but I do not think we should even talk about foreign
> keys in this context...

Fair enough.

> The problem with R2RML, as Juan's example above shows, that it goes very
> quickly to the usage of SQL as a tool for all kinds of clever tricks. Which
> is great for R2RML, but I do not think we should go down that route for CSV.
> I am curious what our use case will reveal in this sense but my gut feeling
> tells me that most of the CSV data out there are structurally simple (albeit
> possibly huge...)
>
> Ivan
>
> > A single table may be a denormalized view and somehow the data structuring
> > needs to be put back into the output.
> >
> > It might be useful if we have a very simple concrete synthetic example to
> > talk about in discussing conversion options.
> >
> > Here's a contribution:
> >
> > ----------------------------
> > "Sales Region"," Quarter"," Sales"
> > "North","Q1",10
> > "North","Q2",15
> > "North","Q3",7
> > "North","Q4",25
> > "South","Q1",9
> > "South","Q2",15
> > "South","Q3",16
> > "South","Q4",31
> > ----------------------------
> >
> > There are two sales regions, each with 4 sales results.
> >
> > This needs some kind of term resolution to turn e.g. "North" into a URI
> > for the northern sales region. It could be by an external lookup or by URI
> > template as in R2RML. External lookup gives better linking.
> >
> > Defining "views" may help replacing the SQL with something.
> >
> > Using direct mapping seems to involve doing CSV->RDF, then RDF->RDF to
> > lift out the data in a more useful form. I have doubts about two stages
> > processes like this because the real outcome is going the first step alone.
> > Rows-encoded-in-RDF then pushes the burden onto the data consumer; it's a
> > barrier to reuse. Of course, it's easier to add mechanically. The whole
> > area of times and dates is messy but important.
> > Calculations might be done in a way that utilizes javascript, which given
> > the likely audience makes it not all new technology, and is a a route to
> > custom conversion.
> >
> > Andy
> >
> >
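
To make the two term-resolution options in Andy's example concrete (URI template versus external lookup), here is a minimal Python sketch over his sample table. Everything beyond the CSV itself is an assumption made for illustration: the http://example.com/ namespace, the property names under def/, the function name rows_to_triples, and the choice to mint a subject from the row number because the table has no declared primary key. It is not the Direct Mapping or R2RML algorithm, only a rough picture of what a row-by-row conversion might look like without a SQL engine.

```python
import csv
import io
from urllib.parse import quote

# Andy's sample table, copied from the message above.
SAMPLE = '''"Sales Region"," Quarter"," Sales"
"North","Q1",10
"North","Q2",15
"North","Q3",7
"North","Q4",25
"South","Q1",9
"South","Q2",15
"South","Q3",16
"South","Q4",31
'''

BASE = "http://example.com/"  # hypothetical namespace, not from the thread
XSD_INTEGER = "http://www.w3.org/2001/XMLSchema#integer"


def rows_to_triples(text, base=BASE, lookup=None):
    """Turn each CSV row into three (subject, predicate, object) triples.

    `lookup` is an optional dict from a region label to an existing URI
    (the "external lookup" option); labels that are not in it fall back
    to a URI template (the "as in R2RML" option).
    """
    lookup = lookup or {}
    reader = csv.reader(io.StringIO(text))
    # The sample header has stray leading spaces, so normalise the names.
    header = [name.strip() for name in next(reader)]
    triples = []
    for number, row in enumerate(reader, start=1):
        record = dict(zip(header, (cell.strip() for cell in row)))
        # Assumption: no primary key is declared, so use the row number.
        subject = f"<{base}sale/{number}>"
        label = record["Sales Region"]
        region_uri = lookup.get(label, base + "region/" + quote(label))
        quarter = record["Quarter"]
        sales = record["Sales"]
        triples.append((subject, f"<{base}def/region>", f"<{region_uri}>"))
        triples.append((subject, f"<{base}def/quarter>", f'"{quarter}"'))
        triples.append(
            (subject, f"<{base}def/sales>", f'"{sales}"^^<{XSD_INTEGER}>')
        )
    return triples


if __name__ == "__main__":
    # Print the result in an N-Triples-like form, one statement per line.
    for s, p, o in rows_to_triples(SAMPLE):
        print(s, p, o, ".")
```

Passing something like lookup={"North": "http://example.org/id/northern-region"} (again a made-up URI) swaps the templated region URI for an existing one, which is the "better linking" case. Whether the header row and key column are declared in metadata or guessed is exactly the open question from the start of the thread; the sketch simply hard-codes both.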
Received on Tuesday, 18 February 2014 15:05:51 UTC