Looking at CSV2RDF from another perspective

Dear all,
in our lab we work on an extension of R2RML that can handle different 
input sources instead of only relational databases; one of these is CSV.
We believe that a generic mapping model that works with different 
formats should be defined. At the moment, R2RML is the only W3C 
standard indicating how relational databases should be mapped to RDF, 
but how mappings from other file formats to RDF should be handled is 
not defined.

On this list I have been following the discussion around, among other 
things, defining a "dialect" of R2RML for CSV. This sounds reasonable, 
but would we then need an extension for XML files too? Then for JSON, 
and so on and so forth? So, in a nutshell, my point is that instead of 
defining dialects, we should address the core mapping language that 
takes care of the mapping per se and adjust it to each input: 
generalizing what can be generalized and keeping it specific where 
necessary.

To this end, we tried to identify the core of R2RML, the part that 
defines the mapping per se, by excluding its database-specific 
characteristics, and defined it as RML. This ends up being the core 
mapping model; on top of it, we only need to specify what a "dialect" 
expects for each specific case, namely how to extract the data from 
the input source, how to iterate over it, and how to refer to the data 
values. RML was defined as a superset of R2RML, so R2RML can be 
considered as adding the RDB-specific concepts to RML.
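
To give an idea, here is a minimal sketch in Turtle of what this looks 
like for a CSV source, following the vocabulary of the draft at [1]; 
the file name, column names and the ex: terms are only illustrative 
assumptions:

  @prefix rr:  <http://www.w3.org/ns/r2rml#> .
  @prefix rml: <http://semweb.mmlab.be/ns/rml#> .
  @prefix ql:  <http://semweb.mmlab.be/ns/ql#> .
  @prefix ex:  <http://example.com/ns#> .

  # The logical source carries the dialect-specific part: where the
  # data lives and how values are referenced (here: CSV column names).
  <#SalesMapping>
    rml:logicalSource [
      rml:source "sales.csv" ;
      rml:referenceFormulation ql:CSV
    ] ;
    # Everything below is the format-independent core shared with R2RML.
    rr:subjectMap [ rr:template "http://example.com/sale/{SaleID}" ] ;
    rr:predicateObjectMap [
      rr:predicate ex:region ;
      rr:objectMap [ rml:reference "Region" ]
    ] .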

We believe that the way R2RML's model is defined allows us to extend 
the core model to accommodate any other file format, as the model is 
mapping-oriented rather than source-oriented.
RML just derives from the R2RML core and adapts the references to the 
files, taking advantage of what each file format's definition offers. 
What R2RML "taught" us is to handle the input source according to what 
it offers, namely to use column names and queries, as we learnt from 
SQL. In the same spirit, we consider XPath to point to the data values 
in the case of XML, and JSONPath for JSON. What we have for CSV is 
columns, so that is what we use.
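Concretely, only the logical source would change per format; the rest 
of the mapping stays untouched (again a sketch, prefixes as above, with 
assumed file names, iterators and references):

  # XML: iterate and reference with XPath.
  [] rml:logicalSource [
       rml:source "sales.xml" ;
       rml:referenceFormulation ql:XPath ;
       rml:iterator "/sales/sale"
     ] .

  # JSON: the same pattern, but with JSONPath.
  [] rml:logicalSource [
       rml:source "sales.json" ;
       rml:referenceFormulation ql:JSONPath ;
       rml:iterator "$.sales[*]"
     ] .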
Some insights into the model's definition can be found at [1], but 
unfortunately the details of the model are still in the pipeline to be 
published.

Below are a few remarks on specific points that different people made 
during the discussion:

Juan raised the question of how we know there exists an id 
attribute/field that acts as a unique identifier for the tuple (i.e. a 
primary key).
Looking at our use cases, since we mostly map CSV, XML and JSON files, 
we do not have the luxury of a primary key in any of these cases. But 
we still need to find a way to map the data. Indeed, having a primary 
key is handy, but one can define what s/he considers to be the primary 
key, or a machine can infer what a primary key could be. Going a step 
further, do we really need a primary key in every file?
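For instance, nothing stops the mapping author from promoting any 
column to the role of identifier in the subject template (a sketch 
with an assumed "Email" column as the chosen key; prefixes as above):

  <#PersonMapping>
    rml:logicalSource [
      rml:source "people.csv" ;
      rml:referenceFormulation ql:CSV
    ] ;
    # No declared primary key: the author simply decides that the
    # "Email" column identifies each row.
    rr:subjectMap [ rr:template "http://example.com/person/{Email}" ] .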
Coming to the example that Andy brought into the discussion (regarding 
Sales region, North, South etc.), I couldn't agree more with Ivan's 
point that during mapping the definition of the file format (in that 
case CSV) should not be in doubt. And that is one of the RML model's 
cornerstones: defining the core mapping model and handling whatever is 
case-specific where it can be handled.
So, if we have no primary keys in the case of CSV and no query 
language like SQL, we need to treat CSV files according to what their 
definition allows. Otherwise, any conversion (e.g. turning a CSV file 
into a database and then mapping it, because one wants to profit from 
SQL and R2RML) is out of the scope of the mapping language (or mapping 
dialect). If CSV can carry metadata by definition, we may use it.
To conclude my point: from our point of view, it makes more sense to 
have a well-defined ql:CSV language that allows one to properly point 
to the data within a file, rather than redefining a dialect mapping 
language for every file format. Remark: the query language in RML 
defines how a file needs to be processed, *not* which format the data 
is in (e.g. it could be either ql:XPath or ql:XQuery for the XML 
format). In the case of CSV, we consider the column names. An 
"enriched" CSV would offer us more potential to refer to its data and 
further automate the mappings.
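To make the remark concrete, the same XML file could be declared with 
either formulation, since the query language names the processing 
method, not the format (ql:XQuery is hypothetical here; it is not part 
of the draft at [1]):

  # One format (XML), two possible query languages.
  [] rml:logicalSource [
       rml:source "sales.xml" ;
       rml:referenceFormulation ql:XPath ;
       rml:iterator "/sales/sale"
     ] .

  [] rml:logicalSource [
       rml:source "sales.xml" ;
       rml:referenceFormulation ql:XQuery ;   # hypothetical term
       rml:iterator "for $s in /sales/sale return $s"
     ] .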

Last, if the CSV file doesn't contain enough information to generate a 
triple as we want it, and we need a triple where the subject is 
ex:region1, as Andy mentioned, then couldn't defining the mapping 
against a complementary file, in whatever format, be a solution? Do we 
care to turn this CSV file into RDF, or are we mostly interested in 
using the data of this CSV file to describe a domain? That falls out 
of the scope of this WG, I assume, but I think it is still relevant as 
food for thought.
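
For what it's worth, a sketch of what such a cross-source mapping 
could look like, reusing R2RML's join mechanism over two logical 
sources (file names, references and the ex:inRegion predicate are 
assumptions; prefixes as above):

  <#RegionMapping>
    rml:logicalSource [
      rml:source "regions.xml" ;
      rml:referenceFormulation ql:XPath ;
      rml:iterator "/regions/region"
    ] ;
    rr:subjectMap [ rr:template "http://example.com/region/{@id}" ] .

  <#SalesMapping>
    rml:logicalSource [
      rml:source "sales.csv" ;
      rml:referenceFormulation ql:CSV
    ] ;
    rr:subjectMap [ rr:template "http://example.com/sale/{SaleID}" ] ;
    rr:predicateObjectMap [
      rr:predicate ex:inRegion ;
      rr:objectMap [
        # Join each CSV row to the complementary XML source.
        rr:parentTriplesMap <#RegionMapping> ;
        rr:joinCondition [
          rr:child "Region" ;   # CSV column
          rr:parent "@name"     # XPath reference in the parent source
        ]
      ]
    ] .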

With this mail I don't aim to claim that RML is the optimal solution, 
even though it could be applicable in the case of CSV; and I am only 
referring to non-direct mappings. I am only trying to point out that, 
before defining a "dialect" (and possibly another one for each 
different file format needed), it may be a good idea to approach the 
problem from a global point of view.

Kind regards,
Anastasia

[1] http://semweb.mmlab.be/rml/spec.html
-- 

Anastasia Dimou
@natadimou | mmlab.be | iminds.be
Semantic Web - Linked Open Data Researcher
Ghent University, Belgium - Multimedia Lab - iMinds
Gaston Crommenlaan 8 bus 201, B-9050 Ledeberg-Ghent, Belgium

Received on Friday, 21 February 2014 13:47:15 UTC