Re: CSV2RDF and R2RML

Juan Sequeda wrote:
> Andy,
>
>
> On Tue, Feb 18, 2014 at 5:25 AM, Andy Seaborne <andy@apache.org
> <mailto:andy@apache.org>> wrote:
>
>     On 12/02/14 16:57, Juan Sequeda wrote:
>
>         So... I believe I can bring some thoughts to the table wrt CSV to RDF.
>         Part of these thoughts come from conversations that I have had
>         previously with danbri.
>
>         I saw in today's minutes that the RDB2RDF topic came up. I agree with
>         Axel that CSV2RDF should be just a "dialect/small modification" of the
>         existing RDB2RDF spec. I would actually encourage having both a Direct
>         Mapping (a completely automated mapping) and a modification of R2RML.
>
>         The following issues arise:
>         - How do you know whether the first row is a header or not?
>         - How do you know whether there is an id attribute/field that acts as
>         a unique identifier for the tuple (i.e., a primary key)?
>
>         Therefore, there needs to be a standard way to state this. I'm
>         assuming this is going to go somewhere. Given this information, the
>         Direct Mapping standard should apply transparently (or so I believe at
>         this moment).
>
>         Now with R2RML, I believe some changes need to be made. R2RML was made
>         to take advantage of SQL as much as possible; that is why you can
>         define a mapping on a table or on a SQL query. Take, for example, the
>         following R2RML mappings for Musicbrainz [4]. You can see that the
>         tuples from "SELECT * FROM artist WHERE artist.type = 1" are mapped to
>         instances of mo:SoloMusicArtist, while tuples from "SELECT * FROM
>         artist WHERE artist.type = 2" are mapped to instances of
>         mo:MusicGroup. I'm not sure how to do this without a SQL engine.
>         Therefore, should SQL engines be involved in the CSV to RDF
>         transformation?
>
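For readers less familiar with R2RML, here is a minimal sketch of the pattern
Juan describes. The rr: terms follow the R2RML spec; the subject URI template
and the {id} column are just illustrative:

----------------------------
@prefix rr: <http://www.w3.org/ns/r2rml#> .
@prefix mo: <http://purl.org/ontology/mo/> .

# Tuples selected by the first query become mo:SoloMusicArtist instances.
<#SoloArtistMap>
    rr:logicalTable [ rr:sqlQuery "SELECT * FROM artist WHERE artist.type = 1" ] ;
    rr:subjectMap [
        rr:template "http://example.com/artist/{id}" ;  # illustrative pattern
        rr:class mo:SoloMusicArtist
    ] .

# A second triples map does the same for groups, with a different query.
<#GroupMap>
    rr:logicalTable [ rr:sqlQuery "SELECT * FROM artist WHERE artist.type = 2" ] ;
    rr:subjectMap [
        rr:template "http://example.com/artist/{id}" ;
        rr:class mo:MusicGroup
    ] .
----------------------------

Evaluating the WHERE clauses is exactly the part that needs a SQL engine.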
>         Another instance where R2RML relies heavily on SQL is when you want to
>         translate database codes into IRIs [5]. For example, a code value
>         "eng" may need to be mapped to the URI
>         http://example.com/engineering, which is part of a well-defined
>         thesaurus/vocabulary.
>
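In R2RML this code-to-IRI translation is usually pushed into SQL via an R2RML
view. A hedged sketch, with a made-up employee table and target URIs:

----------------------------
@prefix rr: <http://www.w3.org/ns/r2rml#> .
@prefix ex: <http://example.com/ns#> .

# A SQL CASE expression rewrites the code into a full IRI. Without a SQL
# engine to evaluate the CASE, this approach does not carry over to plain
# CSV input.
<#DeptMap>
    rr:logicalTable [ rr:sqlQuery """
        SELECT id,
               CASE dept
                   WHEN 'eng' THEN 'http://example.com/engineering'
                   WHEN 'mkt' THEN 'http://example.com/marketing'
               END AS dept_iri
        FROM employee
    """ ] ;
    rr:subjectMap [ rr:template "http://example.com/employee/{id}" ] ;
    rr:predicateObjectMap [
        rr:predicate ex:department ;
        rr:objectMap [ rr:column "dept_iri" ; rr:termType rr:IRI ]
    ] .
----------------------------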
>
>     I agree that R2RML is a possible starting point and also that it does not
>     apply automatically.
>
>     There is often no explicit primary key, nor are there proper foreign keys.
>
>
> That is why I suggest that there be a standard way of defining which
> attribute/column can "act" as the primary key.

Right. This should be part of the metadata that we will define for CSV anyway.
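Just to make the idea concrete, such metadata might look something like the
sketch below. The csvm: vocabulary is entirely hypothetical; the group has not
defined any of these terms yet:

----------------------------
@prefix csvm: <http://example.org/csv-metadata#> .  # hypothetical vocabulary

# Hypothetical metadata for one CSV file: the first row is a header, and
# the "id" column acts as the primary key.
<sales.csv> csvm:hasHeaderRow true ;
    csvm:primaryKey "id" .
----------------------------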
>
> Same for foreign keys. But that may be putting the cart before the horse at
> the moment.

+1 for not putting the cart before the horse...

Ivan

>  
>
>      If the CSV conversion process can influence the CSV format, then there is
>     a lot that can be done, but if the CSV format is fixed, it may not be
>     ideal.
>
>
> What do you mean by "CSV conversion process"? Is it some preprocessing step
> (maybe done as a SQL query) before the data gets generated as a CSV?
>
>
>     A single table may be a denormalized view and somehow the data structuring
>     needs to be put back into the output.
>
>     It might be useful if we have a very simple concrete synthetic example to
>     talk about in discussing conversion options.
>
>     Here's a contribution:
>
>     ----------------------------
>     "Sales Region"," Quarter"," Sales"
>     "North","Q1",10
>     "North","Q2",15
>     "North","Q3",7
>     "North","Q4",25
>     "South","Q1",9
>     "South","Q2",15
>     "South","Q3",16
>     "South","Q4",31
>     ----------------------------
>
>     There are two sales regions, each with 4 sales results.
>
>     This needs some kind of term resolution to turn e.g. "North" into a URI
>     for the northern sales region.  It could be done by an external lookup or
>     by a URI template as in R2RML. An external lookup gives better linking.
>
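As a hedged illustration of the URI-template option, assuming an R2RML-style
template could reference CSV column names directly (nothing defines this yet):

----------------------------
@prefix rr: <http://www.w3.org/ns/r2rml#> .

# Hypothetical: "North" would become <http://example.com/salesRegion/North>
# with no external lookup. In real R2RML, a column name containing a space
# would have to be written as a delimited identifier, {"Sales Region"}.
[] rr:subjectMap [
    rr:template "http://example.com/salesRegion/{Sales Region}"
] .
----------------------------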
>     Defining "views" may help in replacing the SQL with something else.
>
>
> In this example, what would be the subject?
>  
>
>
>     Using direct mapping seems to involve doing CSV->RDF, then RDF->RDF to
>     lift the data out in a more useful form.  I have doubts about two-stage
>     processes like this because the real outcome is that only the first step
>     gets done.  Rows-encoded-in-RDF then pushes the burden onto the data
>     consumer; it's a barrier to reuse. Of course, it's easier to add
>     mechanically.
>
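To see the concern, here is roughly what a direct-mapping-style conversion of
the first data row of the sales CSV above might produce; the base URI and the
row-as-blank-node convention are assumptions, not anything specified:

----------------------------
@base <http://example.com/sales.csv> .

# One blank node per row, one property per column; the space in
# "Sales Region" has to be percent-encoded in the property IRI. Whether
# the 10 comes out as a number or a string is itself an open question.
[] <#Sales%20Region> "North" ;
   <#Quarter> "Q1" ;
   <#Sales> 10 .
----------------------------

The consumer still has to work out that "North" and "Q1" identify a sales
region and a quarter.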
>
> The Direct Mapping is useful for getting quick-and-dirty RDF. If your schema
> is normalized, or in this case, if the CSV is normalized, then the RDF that
> comes out is fairly "good".
>
> Additionally, there may be users who would want to do an RDF->RDF
> transformation. This is where the Direct Mapping helps.
>
> If users want to do a CSV -> RDF transformation, then this is where a mapping
> language comes in.
>
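For what it is worth, a hedged sketch of what such a mapping language might
look like if R2RML were adapted to point at a CSV file instead of a SQL table.
The ex:csvFile property is invented here; nothing like it has been
standardized:

----------------------------
@prefix rr: <http://www.w3.org/ns/r2rml#> .
@prefix ex: <http://example.com/ns#> .

# Hypothetical adaptation: the logical table is a CSV file rather than a
# SQL table or query.
<#SalesMap>
    rr:logicalTable [ ex:csvFile <sales.csv> ] ;   # invented property
    rr:subjectMap [
        # A composite key: region plus quarter identifies a row.
        rr:template "http://example.com/sales/{Sales Region}/{Quarter}"
    ] ;
    rr:predicateObjectMap [
        rr:predicate ex:amount ;
        rr:objectMap [ rr:column "Sales" ]
    ] .
----------------------------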
> Nevertheless, I'm a huge advocate for automation, hence the Direct Mapping.
> Actually, we have been observing that Ultrawrap's users usually run the
> Direct Mapping first. The resulting mapping is represented as R2RML. Then
> they go in and edit the R2RML mapping.
>  
>
>     The whole area of times and dates is messy but important.
>
>     Calculations might be done in a way that utilizes JavaScript, which,
>     given the likely audience, makes it not an entirely new technology, and
>     is a route to custom conversion.
>
>             Andy
>
>
>

Received on Tuesday, 18 February 2014 16:35:03 UTC