- From: Anastasia Dimou <anastasia.dimou@ugent.be>
- Date: Fri, 21 Feb 2014 14:45:43 +0100
- To: W3C CSV on the Web Working Group <public-csv-wg@w3.org>
- Message-ID: <53075887.6030207@ugent.be>
Dear all,

In our lab we work on an extension of R2RML that can handle different input sources instead of only relational databases; one of these is CSV. We believe that a generic mapping model that works across formats should be defined. At the moment only R2RML is a W3C standard, and it specifies how relational databases should be mapped to RDF; how the mappings from other file formats to RDF should be handled is not defined.

On this list I have been following the discussion around, among other things, defining a "dialect" of R2RML for CSV. This sounds reasonable, but would we then need an extension for XML files too? And then another for JSON, and so on? So, in a nutshell, my point is that instead of addressing dialects, we should address the core mapping language that takes care of the mapping per se and adjust it to each input: generalizing what can be generalized, and keeping things specific only where necessary.

To this end, we tried to identify the core of R2RML, the part that defines the mapping per se once its database-specific characteristics are excluded, and defined it as RML. This turns out to be the core mapping model, and on top of it we only need to specify what a "dialect" expects in each specific case, namely: how to extract the data from the input source, how to iterate over the input source, and how to refer to the data values. RML was defined as a superset of R2RML; conversely, R2RML can be considered as RML plus the RDB-specific concepts.

We believe that the way R2RML's model is defined allows us to extend the core model to accommodate any other file format, because the model is mapping-oriented rather than source-oriented. RML simply derives from the R2RML core and adapts the references to the files, taking advantage of what each file format's definition offers. What R2RML "taught" us is to handle the input source according to what that input offers, namely to use column names and queries, as we learnt from SQL. In the same spirit, we consider XPath to point to the data values in the case of XML, and JSONPath for JSON. What we have for CSV is columns, and that is what we use. Some insights into the model's definition can be found at [1], but unfortunately the details of the model are still in the pipeline to be published.
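To make this concrete, here is a minimal sketch of two triples maps along the lines of the current draft [1]; the file names, iterator and references are invented for illustration. Note how only the logical source and the way values are referenced change between the two formats, while the mapping vocabulary itself stays that of R2RML:

    @prefix rr:  <http://www.w3.org/ns/r2rml#> .
    @prefix rml: <http://semweb.mmlab.be/ns/rml#> .
    @prefix ql:  <http://semweb.mmlab.be/ns/ql#> .
    @prefix ex:  <http://example.com/> .

    # CSV input: references are plain column names
    <#SalesFromCSV> a rr:TriplesMap ;
        rml:logicalSource [
            rml:source "sales.csv" ;
            rml:referenceFormulation ql:CSV
        ] ;
        rr:subjectMap [ rr:template "http://example.com/region/{Region}" ] ;
        rr:predicateObjectMap [
            rr:predicate ex:sales ;
            rr:objectMap [ rml:reference "Sales" ]
        ] .

    # XML input: same mapping vocabulary, but an iterator is declared
    # and the references are XPath expressions
    <#SalesFromXML> a rr:TriplesMap ;
        rml:logicalSource [
            rml:source "sales.xml" ;
            rml:referenceFormulation ql:XPath ;
            rml:iterator "/sales/region"
        ] ;
        rr:subjectMap [ rr:template "http://example.com/region/{@name}" ] ;
        rr:predicateObjectMap [
            rr:predicate ex:sales ;
            rr:objectMap [ rml:reference "total" ]
        ] .

The rml:reference property generalizes rr:column: for a CSV source it holds a column name, for an XML source an XPath expression, and so on.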
Below are a few remarks on points that different people made during the discussion.

Juan raised the question of how we know that there exists an id attribute/field which acts as a unique identifier for the tuple (i.e. a primary key). Looking at our use cases, since we mostly map CSV, XML and JSON files, we do not have the luxury of a primary key in any of these cases, but we still need to find a way to map the data. Having a primary key is certainly handy, but one can define what s/he considers to be the primary key, or a machine can detect what a primary key could be. Going a step further: do we really need a primary key in every file?
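Something like the following would already do for the first option; it is again only a sketch with invented file and column names. Even without a declared primary key, the mapping author can promote whatever column, or combination of columns, s/he trusts to be identifying into the subject, using a plain R2RML template (prefixes as above):

    # the mapping author decides that first_name + last_name identify a row
    <#PersonFromCSV> a rr:TriplesMap ;
        rml:logicalSource [
            rml:source "persons.csv" ;
            rml:referenceFormulation ql:CSV
        ] ;
        rr:subjectMap [
            rr:template "http://example.com/person/{first_name}_{last_name}"
        ] .

Note that nothing CSV-specific is needed for this: the choice of key lives entirely in the generic, R2RML part of the model.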
Coming to the example that Andy brought into the discussion (the sales regions, North, South, etc.), I couldn't agree more with Ivan saying that during mapping the definition of the file format (in that case CSV) should not be in doubt. And that is one of the RML model's cornerstones: define the core mapping model, and handle whatever is case-specific as far as it can be handled. So, if we have no primary keys in the case of CSV, and no query language such as SQL, we need to treat CSV files according to what their definition allows us. Anything beyond that, e.g. turning a CSV file into a database and then mapping it because one wants to profit from SQL and R2RML, is out of the scope of the mapping language (or mapping dialect). If CSV can carry metadata by definition, we may use them.

To conclude my point: from our point of view, it makes more sense to have a well-defined ql:CSV language that allows one to properly point to the data within a file than to re-define a dialect mapping language for every file format. Remark: the Query Language in RML defines how a file needs to be processed, *not* which format the data is in (e.g. for the XML format it could be ql:XPath just as well as ql:XQuery). In the case of CSV, we consider the column names; an "enriched" CSV would offer us more potential to refer to its data and to further automate the mappings.

Last, if the CSV file doesn't contain enough information to generate a triple as we want it, and we need a triple where the subject is ex:region1, as Andy mentioned, then couldn't defining the mapping against a complementary file, in whatever format, be a solution? Do we care about turning this CSV file into RDF, or are we mostly interested in using the data of this CSV file to describe a domain? That falls out of the scope of this WG, I assume, but it is still relevant as food for thought, I think.

With this mail I don't aim to claim that RML is the optimal solution, even though it could be applicable in the case of CSV, and I am only referring to non-direct mappings. I am only trying to point out that before defining a "dialect" (and possibly another one for each different file format needed), maybe it is a good idea to face the problem from a global point of view.

Kind regards,
Anastasia

[1] http://semweb.mmlab.be/rml/spec.html

--
Anastasia Dimou
@natadimou | mmlab.be | iminds.be
Semantic Web - Linked Open Data Researcher
Ghent University, Belgium - Multimedia Lab - iMinds
Gaston Crommenlaan 8 bus 201, B-9050 Ledeberg-Ghent, Belgium

Received on Friday, 21 February 2014 13:47:15 UTC