- From: Ivan Herman <ivan@w3.org>
- Date: Tue, 18 Feb 2014 17:34:45 +0100
- To: Juan Sequeda <juanfederico@gmail.com>
- CC: Andy Seaborne <andy@apache.org>, "public-csv-wg@w3.org" <public-csv-wg@w3.org>
- Message-ID: <53038BA5.7020605@w3.org>
Juan Sequeda wrote:
> Andy,
>
> On Tue, Feb 18, 2014 at 5:25 AM, Andy Seaborne <andy@apache.org> wrote:
>
> > On 12/02/14 16:57, Juan Sequeda wrote:
> >
> > > So... I believe I can bring some thoughts to the table wrt CSV to RDF.
> > > Part of these thoughts come from conversations that I have had
> > > previously with danbri.
> > >
> > > I saw in today's minutes that the RDB2RDF topic came up. I agree with
> > > Axel that "CSV2RDF should be just a "dialect/small modification" of the
> > > existing RDB2RDF spec". I actually encourage that there exists both a
> > > Direct Mapping (completely automated mapping) and a modification of R2RML.
> > >
> > > The following issues arise:
> > > - How do you know if the first column is a header or not.
> > > - How do you know if there exists an id attribute/field which acts as a
> > >   unique identifier for the tuple (i.e. primary key).
> > >
> > > Therefore, there needs to be a way to state this in a standard way. I'm
> > > assuming this is going to go somewhere. Given this information, the
> > > Direct Mapping standard should apply transparently (or so I believe at
> > > this moment).
> > >
> > > Now with R2RML, I believe some changes need to be made. R2RML was made
> > > to take advantage of SQL as much as possible; that is why you can define
> > > a mapping on a table or on a SQL query. Take for example the following
> > > R2RML mappings for Musicbrainz [4]. You can see that the tuples from
> > > "SELECT * FROM artist WHERE artist.type = 1" are mapped to instances of
> > > mo:SoloMusicArtist while tuples from "SELECT * FROM artist WHERE
> > > artist.type = 2" are mapped to instances of mo:MusicGroup. I'm not sure
> > > how to do this without a SQL engine. Therefore, should SQL engines be
> > > involved in the CSV to RDF transformation?
> > >
> > > Another instance where R2RML relies heavily on SQL is when you want to
> > > translate database codes into IRIs [5].
> > > For example, you may have a code value "eng" which should be mapped
> > > to some URI http://example.com/engineering, which is part of a
> > > well-defined thesaurus/vocabulary.
> >
> > I agree that R2RML is a possible starting point and also that it does not
> > apply automatically.
> >
> > There often isn't an explicit primary key nor proper foreign keys.
>
> That is why I suggest that there is a standard way of defining which
> attribute/column can "act" as a primary key.

Right. This should be part of the metadata that we will define for CSV anyway.

> Same for foreign key. But that may be putting the cart ahead of the horse at
> the moment.

+1 for putting the cart...

Ivan

> > If the CSV conversion process can influence the CSV format, then there is
> > a lot that can be done but if the CSV format is fixed, it may not be ideal.
>
> What do you mean by "CSV conversion process"? Is it some pre-processing step
> (maybe done as a SQL query) before the data gets generated as a CSV?
>
> > A single table may be a denormalized view and somehow the data structuring
> > needs to be put back into the output.
> >
> > It might be useful if we have a very simple concrete synthetic example to
> > talk about in discussing conversion options.
> >
> > Here's a contribution:
> >
> > ----------------------------
> > "Sales Region"," Quarter"," Sales"
> > "North","Q1",10
> > "North","Q2",15
> > "North","Q3",7
> > "North","Q4",25
> > "South","Q1",9
> > "South","Q2",15
> > "South","Q3",16
> > "South","Q4",31
> > ----------------------------
> >
> > There are two sales regions, each with 4 sales results.
> >
> > This needs some kind of term resolution to turn e.g. "North" into a URI
> > for the northern sales region. It could be by an external lookup or by
> > URI template as in R2RML. External lookup gives better linking.
> >
> > Defining "views" may help replacing the SQL with something.
>
> In this example, what would be the subject?
>
> > Using direct mapping seems to involve doing CSV->RDF, then RDF->RDF to
> > lift out the data in a more useful form. I have doubts about two-stage
> > processes like this because the real outcome is going to be the first
> > step alone. Rows-encoded-in-RDF then pushes the burden onto the data
> > consumer; it's a barrier to reuse. Of course, it's easier to add
> > mechanically.
>
> The Direct Mapping is useful to have quick/dirty RDF. If your schema is
> normalized, or in this case, if the CSV is normalized, then the RDF that comes
> out is fairly "good".
>
> Additionally, there may be users who would want to do an RDF->RDF
> transformation. This is where Direct Mapping helps.
>
> If users want to do a CSV -> RDF transformation, then this is where a mapping
> language comes in.
>
> Nevertheless, I'm a huge advocate for automation, hence the Direct Mapping.
> Actually, we have been observing that Ultrawrap's users usually first run the
> Direct Mapping. The resulting mapping is represented as R2RML. Then they go in
> to edit the R2RML mapping.
>
> > The whole area of times and dates is messy but important.
> >
> > Calculations might be done in a way that utilizes JavaScript, which given
> > the likely audience makes it not all new technology, and is a route to
> > custom conversion.
> >
> > Andy
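
A minimal sketch of one possible answer to the "what would be the subject?" question for the sales sample quoted above, using URI templates in the spirit of R2RML rather than an external lookup. Everything not in the sample is an assumption made for illustration: the http://example.com/ base IRI, the predicate names, the slug() helper, and the choice of ("Sales Region", "Quarter") as a composite key identifying each row. The thread itself leaves these choices open.

```python
import csv
import io

# The sample data from the thread; header cells carry stray leading spaces.
SALES_CSV = '''"Sales Region"," Quarter"," Sales"
"North","Q1",10
"North","Q2",15
"North","Q3",7
"North","Q4",25
"South","Q1",9
"South","Q2",15
"South","Q3",16
"South","Q4",31
'''

BASE = "http://example.com/"  # hypothetical base IRI
XSD_INTEGER = "http://www.w3.org/2001/XMLSchema#integer"


def slug(text):
    """Turn a cell value into a simple IRI path segment (hypothetical helper)."""
    return text.strip().lower().replace(" ", "-")


def csv_to_ntriples(text, key_columns):
    """Yield N-Triples lines, one subject per row built from key_columns."""
    rows = csv.reader(io.StringIO(text))
    header = [name.strip() for name in next(rows)]
    for cells in rows:
        row = dict(zip(header, (c.strip() for c in cells)))
        # Assumption: the key columns together identify the row.
        subject = BASE + "sales/" + "/".join(slug(row[k]) for k in key_columns)
        # Term resolution by template: "North" -> <.../region/north>.
        region = BASE + "region/" + slug(row["Sales Region"])
        yield f"<{subject}> <{BASE}region> <{region}> ."
        yield f'<{subject}> <{BASE}quarter> "{row["Quarter"]}" .'
        yield f'<{subject}> <{BASE}sales> "{row["Sales"]}"^^<{XSD_INTEGER}> .'


triples = list(csv_to_ntriples(SALES_CSV, ["Sales Region", "Quarter"]))
print("\n".join(triples))
```

Passing the key columns as an argument mirrors the suggestion in the thread that the column or columns acting as a primary key should be stated in metadata rather than guessed from the CSV itself.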
Received on Tuesday, 18 February 2014 16:35:03 UTC