- From: Anastasia Dimou <anastasia.dimou@ugent.be>
- Date: Sun, 19 Oct 2014 16:43:43 +0200
- To: public-csv-wg@w3.org, Dan Brickley <danbri@google.com>
Dear Konstantine, Dan, all,
> <konstant@iit.demokritos.gr> wrote:
>> Dan, all,
>>
>> there's also dbpedia, which comes as a set of 685 CSVs in addition to
>> RDF [1]. So that gives us both the CSV and the RDF that is intended as
>> equivalent. I will describe it in more detail, if you find this use case interesting.
> Definitely interesting - although I would say ambitious too. I don't
> think the current tool I've been using (RMLProcessor) is designed to
> handle datasets of that scale.
The existing processor was developed as a proof of concept for RML. It's
not optimized in terms of mapping execution performance, but it was
tested with larger files, too. As far as I can tell, these files could be
mapped (it would take quite some time, though), but doing so wouldn't
offer much more insight regarding mapping multiple CSVs. What is being
examined at the moment is how to express such mapping definitions/rules
and how to refer to these mappings.
> My suggestion would be that we start by
> getting our story straight at the smaller end of the scale:
>
> 1. For a single CSV file, eg. the events example, how do we cite the
> (R2)RML mapping file in its metadata.json?
> (i.e. see scenarios//events/attempts/attempt-1 in github)
To this end, GRDDL [2] is the closest standard.
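Purely for illustration, a GRDDL-style pointer (GRDDL links a document to
its transformation via rel="transformation") might be mirrored in
metadata.json along these lines; the key names here are only hypothetical,
not an agreed vocabulary:

  {
    "url": "../../source/events-listing.csv",
    "transformation": "mapping-events.rml.ttl"
  }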
> 2. How does that extend to the case of multiple CSVs (i.e.
> scenarios/chinook/)? e.g. Chinook: Album.csv Customer.csv Genre.csv
> InvoiceLine.csv Playlist.csv Track.csv Artist.csv Employee.csv
> Invoice.csv MediaType.csv PlaylistTrack.csv
That's one reason why specifying the input file for the RML mapping (like
"rml-path-to-file":"../../source/events-listing.csv") in the metadata
might not be that useful.
If the metadata of the events-listing.csv file is described, why specify
the input file again? It is redundant, since the mapping refers to this
file by definition, and the file is also specified in the rml:Source of
the Triples Map. If the mapping document also contains Triples Maps that
use other input sources, those sources are specified via the rml:Source
of the corresponding Triples Maps. Again, there is no reason for them to
be (re-)specified, so there is no extra concern in the case of mapping
multiple CSVs either.
This way, only the CSV files that are actually needed are used. In this
example, even though Employee.csv is in the same bundle of files as
Album.csv (and the others), it is not required in order to map the data
in the Album.csv file, for instance.
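As a rough sketch of what I mean (written in JSON-LD rather than Turtle
only to keep one notation in this mail; the @id and the subject template
are my own choices, not taken from any existing mapping), a Triples Map
for Album.csv already carries its own source, so the metadata has nothing
to repeat:

  {
    "@context": {
      "rr": "http://www.w3.org/ns/r2rml#",
      "rml": "http://semweb.mmlab.be/ns/rml#",
      "ql": "http://semweb.mmlab.be/ns/ql#"
    },
    "@id": "#AlbumMapping",
    "@type": "rr:TriplesMap",
    "rml:logicalSource": {
      "rml:source": "Album.csv",
      "rml:referenceFormulation": { "@id": "ql:CSV" }
    },
    "rr:subjectMap": {
      "rr:template": "http://example.com/album/{AlbumId}"
    }
  }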
Regarding references to the RML mapping (e.g.
"mapping-info-experimental":{"rml":"mapping-events.rml.ttl", ...} in
your example), it might be good to cover both the case of a reference to
a mapping document and the case of a reference to a Triples Map
directly. References to multiple mapping documents might also be necessary.
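To make that concrete, one possible shape (reusing the key names from your
example; the array form, the fragment identifier pointing at a single
Triples Map, and the extra file name are only assumptions):

  {
    "mapping-info-experimental": {
      "rml": [
        "mapping-events.rml.ttl",
        "mapping-events.rml.ttl#EventsTriplesMap",
        "mapping-venues.rml.ttl"
      ]
    }
  }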
> 3. How does that extend to the case of multiple mappings? e.g. you
> might have a mapping into schema.org and another into CIDOC or
> SKOS/FOAF/DC, or another into Wikidata triples etc. What would the
> metadata.json look like there?
Wouldn't multiple references to different mapping documents be enough?
Otherwise, a solution could be to provide a reference to a
recommended/default mapping and the option to provide alternatives
(as SKOS does for labels; prefRML and altRML, for instance).
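Sticking to your metadata.json example, that could look roughly like the
following (prefRML/altRML and the file names are invented here just to
show the idea, with one mapping per target vocabulary you mentioned):

  {
    "mapping-info-experimental": {
      "prefRML": "mapping-events-schemaorg.rml.ttl",
      "altRML": [
        "mapping-events-skos.rml.ttl",
        "mapping-events-wikidata.rml.ttl"
      ]
    }
  }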
> Having a processor for these kinds of mapping that parallelized (on
> hadoop or whatever) would be interesting I'm sure for large datasets
> like dbpedia. Presumably easier for a single CSV than multiple since
> partitioning is easier (assuming each row's mapping to triples is
> independent).
Indeed, it's not a one-size-fits-all situation. Even though the mapping
definitions in RML might be the same, different sources and different
combinations of sources might require different ways of processing them;
thus, different processors might be preferred on different occasions, but
that's beyond the scope of defining the mapping per se.
> For our work the metadata.json ought to be a natural entry point, so I
> hope template/mapping demos will start to use that rather than their
> own configurations/parameters. I have also started looking into
> getting RMLProcessor to accept other RDF formats besides Turtle, even
> though R2RML is officially Turtle-only. In some cases a simple mapping
> could then be fully embedded inside the JSON-LD metadata file...
>
> Dan
> [1] http://wiki.dbpedia.org/DBpediaAsTables
Kind regards,
Anastasia
[2] http://www.w3.org/TR/grddl/