Re: Call for Editors!

Gregg,

On Fri, Mar 21, 2014 at 12:13 PM, Gregg Kellogg <gregg@greggkellogg.net> wrote:

> On Mar 21, 2014, at 7:34 AM, Juan Sequeda <juanfederico@gmail.com> wrote:
>
>
> On Fri, Mar 21, 2014 at 4:30 AM, Ivan Herman <ivan@w3.org> wrote:
>
>> Hi Juan,
>>
>> thanks.
>>
>> (Just to make it clear: I do not try to be picky here, I am just trying
>> to articulate the issues for future reference...)
>>
>> - This is a clear use case for RDF, i.e., converting CSV data
>> into RDF.
>>
>> - It also reflects the need to use a common vocabulary for all the
>> converted CSV files from a particular domain, thereby binding the data to
>> Linked Data — meaning that somewhere in the workflow we need a tool that
>> goes beyond direct mapping.
>>
>> I wonder whether this should not be added to a next version of the UCR
>> document (I will let Jeremy look into this). Mainly in terms of direct
>> mapping vs. something richer, it is a good practical example!
>>
>> But (and here is where I am picky:-(
>>
>> I understand you used the existing toolkit around R2RML and, in your
>> case, being one of the co-editors of the Direct Mapping document, this was
>> the absolutely natural thing to do. However... I am looking at use cases
>> that contrast an R2RML approach with the JSON-centric approach. Because,
>> at some point, we may have to choose, right? Would your work have been
>> equally possible (try to forget about your personal experience) using
>> CSV-JSON? Would it have been easier, more difficult, or equal?
>>
>
> You mean Gregg's CSV-LD approach (https://www.w3.org/2013/csvw/wiki/CSV-LD
> )?
>
> If so... well... I need to look at Gregg's approach in detail. I just
> gave it a quick look, and I don't see how data values can be used in a
> template to generate an IRI.
>
>
> The section on Composite Primary Keys [1] describes how a property can be
> defined that uses field values to compose an IRI (or BNode identifier, in
> this case). The use is not limited to primary keys, and it could be used as
> a stand-in for any position.
>

In your example in [1], how do you define (or know) that "_:{Sales Region}" is
going to turn into a bnode (I'm assuming that is what's supposed to happen)?
You need a way to define whether a term is going to be a Literal, IRI, or
BNode. In R2RML, you have rr:termType.

See: http://www.w3.org/TR/r2rml/#termtype
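For reference, this is roughly how rr:termType disambiguates the generated
term in R2RML. The table and column names below are invented for
illustration; rr:template and rr:termType themselves are standard R2RML
vocabulary:

```turtle
@prefix rr: <http://www.w3.org/ns/r2rml#> .

# Hypothetical triples map; "SALES" and its columns are made up.
<#SalesMap> a rr:TriplesMap ;
    rr:logicalTable [ rr:tableName "SALES" ] ;
    rr:subjectMap [
        rr:template "{SALES_REGION}_{QUARTER}" ;
        # Without rr:termType, a templated subject map defaults to rr:IRI.
        rr:termType rr:BlankNode
    ] .
```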

This is just one example. I'm sure you could go through all of R2RML and
make your approach cover everything (right?).

Then it boils down to having two completely different syntaxes with the
same semantics.

Actually, can't you just represent an R2RML mapping, which is in Turtle
syntax, as JSON-LD? How would that be different from your proposed CSV-LD?
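(For what it's worth: since an R2RML mapping is itself an RDF graph, it can
be re-serialized as JSON-LD mechanically. A sketch, using a hypothetical
"SALES" table and invented column names:)

```json
{
  "@context": { "rr": "http://www.w3.org/ns/r2rml#" },
  "@id": "#SalesMap",
  "@type": "rr:TriplesMap",
  "rr:logicalTable": { "rr:tableName": "SALES" },
  "rr:subjectMap": {
    "rr:template": "{SALES_REGION}_{QUARTER}",
    "rr:termType": { "@id": "rr:BlankNode" }
  }
}
```

Only the serialization changes here; the vocabulary and semantics remain
R2RML's — which is exactly the question: how would that differ from CSV-LD?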


> If I have RDB and CSV, wouldn't it be more convenient to have one mapping
> language (well actually, it would be two but they would be practically
> identical), instead of using R2RML for the RDB and some other mapping
> language for the CSV?
>
>
> That may be. In any case, I think it will need to describe additional
> linking metadata in a fairly generic way. Because of the different
> character of the R2RML and JSON use cases, this may end up adding more
> complexity than it's worth.
>

"describe additional linking metadata in a fairly generic way"... sorry,
I'm confused here. Can you elaborate?


> At one time, I was imagining how a JSON-LD native query language might
> look, which could then be an alternative syntax for SPARQL. You can imagine
> the equivalent of SPARQL variables as having meaning within a JSON-LD
> document. If we considered it from this point, then a SPARQL/TARQL-like
> solution which emits triples based on variables in a query pattern which
> are associated with column names could have a similar representation in a
> Turtle-like or JSON-LD-like syntax.
>

mmm... it's hard for me to visualize this. Can you give an example?

And would this be another alternative to R(C)2RML, Relational(CSV) to RDF
Mapping Language and CSV-LD?
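(To make the idea concrete: a TARQL-style mapping might look something like
the following, where CSV column headers become SPARQL variables and each row
supplies one set of bindings. The prefix, column names, and generated
properties are all hypothetical:)

```sparql
PREFIX ex: <http://example.org/>

# Hypothetical sketch: TARQL-style binding of CSV columns to variables.
CONSTRUCT {
  ?region a ex:SalesRegion ;
          ex:quarter ?Quarter .
}
WHERE {
  # Each CSV row binds ?Sales_Region and ?Quarter from its columns.
  BIND (IRI(CONCAT('http://example.org/region/',
                   ENCODE_FOR_URI(?Sales_Region))) AS ?region)
}
```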



> What I like about the CSV-LD syntax is that (again IMO) it is easy to see
> how the content of the CSV is directly mapped to JSON, but I'm admittedly
> biased here.
>

I think we are both biased here :)



>
> Gregg
>
> [1] https://www.w3.org/2013/csvw/wiki/CSV-LD#Composite_Primary_Keys
>
>
>
>> Thanks
>>
>> Ivan
>>
>> On 21 Mar 2014, at 24:25 , Juan Sequeda <juanfederico@gmail.com> wrote:
>>
>> > Ivan, all,
>> >
>> > This is our use-case:
>> >
>> > Constitute Project [1] is a search engine for the world's constitutions.
>> This is a project funded by Google Ideas [2]. We, Capsenta, did the mapping
>> of the constitution data to RDF and OWL. All of the data was originally in
>> Excel spreadsheets (i.e. CSV files). What we did was to import the spreadsheets
>> into SQL Server, and then used Direct Mapping, R2RML and Ultrawrap to map
>> the data to RDF. Why did we want to use RDF/OWL? Several reasons:
>> >
>> > 1) RDF (graph data model) is flexible. We don't know what is going to
>> happen to constitutional data later. So we need to be ready for change
>> > 2) We currently have 189 constitutions, each in its own spreadsheet.
>> We need to integrate this data.
>> > 3) We created an ontology about constitutional topics. Naturally, we
>> want to represent this in OWL.
>> > 4) We want to link to other datasets, such as DBpedia
>> > 5) RDF is becoming the standard to publish open data.
>> >
>> > These reasons are not specific to Constitute. They apply to any CSV
>> dataset that needs to be searched or integrated with other datasets. More info
>> can be found in our 2013 Semantic Web Challenge submission [3]. We won 2nd
>> prize :)
>> >
>> > Constitute is having a lot of impact. We know for a fact that
>> constitutional drafters of Tunisia, Egypt, and now Mongolia have been using
>> Constitute.
>> >
>> > Btw, interesting fact: On average, 5 constitutions are written from
>> scratch every year. A constitution lasts on average 20 years. People who
>> write constitutions have never done that before and will never do that
>> again; that is why they want to search through existing constitutions.
>> >
>> > [1] https://www.constituteproject.org/#/
>> > [2] https://www.google.com/ideas/projects/constitute/
>> > [3]
>> http://challenge.semanticweb.org/2013/submissions/swc2013_submission_12.pdf
>> >
>> >
>> > Juan Sequeda
>> > +1-575-SEQ-UEDA
>> > www.juansequeda.com
>> >
>> >
>> > On Thu, Mar 20, 2014 at 12:53 PM, Ivan Herman <ivan@w3.org> wrote:
>> > Sorry if I sound like a broken record, but I would really like to see
>> and understand the CSV->RDF use cases, also in terms of the people who are
>> likely to use that. Learning CSV-LD or R2RML-CSV requires a learning curve.
>> The question is which of the two is steeper for the envisaged user base.
>> >
>> > (I do not have anything against any of the two, but we may have to make
>> a choice at some point if we go down that route...)
>> >
>> > Ivan
>> >
>> > On 20 Mar 2014, at 18:47 , Gregg Kellogg <gregg@greggkellogg.net>
>> wrote:
>> >
>> > > On Mar 20, 2014, at 10:39 AM, Juan Sequeda <juanfederico@gmail.com>
>> wrote:
>> > >
>> > >> If there is going to be a CSV to RDF mapping, shouldn't it be
>> relatively close to (if not almost equal to) R2RML? I foresee users doing
>> RDB2RDF mappings with R2RML and having a few (or many) CSV files that they
>> would like to map to RDF too. They would want to continue using the same
>> tool.
>> > >>
>> > >> What we do is import the CSVs to a RDB, and then use R2RML. So as a
>> user who needs to transform to RDF, I would want to have something almost
>> equivalent to R2RML.
>> > >
>> > > This certainly is a valid use case. I was considering what the impact
>> on developers using these tools might be. If there is a single tool (and
>> spec) which handles the relevant use cases, then it might simplify the life
>> of developers. Nothing against R2RML, and if that's the chain a developer's
>> working with, the same logic would indicate that having to use something
>> like CSV-LD would be a burden.
>> > >
>> > > Gregg
>> > >
>> > >> Juan Sequeda
>> > >> +1-575-SEQ-UEDA
>> > >> www.juansequeda.com
>> > >>
>> > >>
>> > >> On Thu, Mar 20, 2014 at 12:08 PM, Gregg Kellogg <
>> gregg@greggkellogg.net> wrote:
>> > >> On Mar 20, 2014, at 9:52 AM, Andy Seaborne <andy@apache.org> wrote:
>> > >>
>> > >> > On 20/03/14 15:34, Ivan Herman wrote:
>> > >> >>
>> > >> >> On 20 Mar 2014, at 16:03 , Juan Sequeda <juanfederico@gmail.com>
>> wrote:
>> > >> >>
>> > >> >>> I would say yes :)
>> > >> >>>
>> > >> >>> 1) Direct Mapping is completely automatic
>> > >> >>> 2) R2RML is manual.
>> > >> >>
>> > >> >> Correct. The question for me is: do the use cases around justify
>> the extra (non-trivial) effort of defining an R2RML-CSV? Remember that the
>> definition of R2RML took over two years:-(
>> > >> >
>> > >> > Caution warranted - it needs to be scoped downwards.  I hope (but
>> cannot prove) that the CSV mapping is less of a mountain.
>> > >> >
>> > >> > CSV-LD is an R2RML(-ish) mapping.  Gregg's already started, so not
>> 2 years :-)
>> > >>
>> > >> Yes, CSV-LD is much like R2RML, but I think we could complete a spec
>> in fairly short order.
>> > >>
>> > >> For the direct mapping, this could be a default mapping done by
>> automatically constructing a context along the lines Andy had suggested,
>> and could fall out of that spec as well.
>> > >>
>> > >> One consideration is converting CSV files with a very large number
>> of rows. The CSV-LD model would essentially create a document for each row,
>> so converting to RDF could be streamed, but some provision for BNode
>> identifiers would need to be made, so that if some value maps to a BNode,
>> it would be preserved across records and not result in a new BNode being
>> minted, even though it had the same identifier. This isn't really a
>> problem, but it would mean specifying an algorithm that extended the
>> existing JSON-LD conversion algorithms to the degree that the BNode
>> identifier mapping would persist so that the conversion can be streamed.
>> > >>
>> > >> Gregg
>> > >>
>> > >> >       Andy
>> > >> >
>> > >> >>
>> > >> >> Ivan
>> > >> >>
>> > >> >>>
>> > >> >>> Direct Mapping bootstraps the R2RML.
>> > >> >>>
>> > >> >>> Btw, I would be interested in participating in the CSV to RDF
>> effort.
>> > >> >>>
>> > >> >>>
>> > >> >>> Juan Sequeda
>> > >> >>> +1-575-SEQ-UEDA
>> > >> >>> www.juansequeda.com
>> > >> >>>
>> > >> >>>
>> > >> >>> On Thu, Mar 20, 2014 at 6:44 AM, Andy Seaborne <andy@apache.org>
>> wrote:
>> > >> >>> On 20/03/14 11:31, Ivan Herman wrote:
>> > >> >>>
>> > >> >>> On 20 Mar 2014, at 11:40 , Andy Seaborne <andy@apache.org>
>> wrote:
>> > >> >>>
>> > >> >>> On 19/03/14 23:09, Jeni Tennison wrote:
>> > >> >>> Hi,
>> > >> >>>
>> > >> >>> Now that the first two of our documents are getting published as
>> first public working drafts, we are moving on to the next stage of our
>> work, namely looking at conversion from tabular data into other formats.
>> > >> >>>
>> > >> >>> We have a wiki document here:
>> > >> >>>
>> > >> >>>    https://www.w3.org/2013/csvw/wiki/Conversions
>> > >> >>>
>> > >> >>> that describes in very broad terms what we need to do.
>> > >> >>>
>> > >> >>> Specifically, we're looking for volunteers to lead the efforts /
>> edit four documents, specifying:
>> > >> >>>
>> > >> >>>    * Conversion of CSV to RDF
>> > >> >>>
>> > >> >>> RDF to RDF had two conversion documents.
>> > >> >>>
>> > >> >>> I guess you meant RDB to RDF...
>> > >> >>>
>> > >> >>> Yes.  Typo.  s/x42/x46/ -- only one bit out.
>> > >> >>>
>> > >> >>>         Andy
>> > >> >>>
>> > >> >>>
>> > >> >>>
>> > >> >>> Ivan
>> > >> >>>
>> > >> >>>
>> > >> >>> (with no strong advocacy)
>> > >> >>> With hindsight, was it a good idea? Should we do the same?
>> > >> >>>
>> > >> >>>    * Conversion of CSV to JSON and/or a browser API
>> > >> >>>    * Conversion of CSV to XML (possibly pending actually having
>> a use case for this)
>> > >> >>>    * Conversion of CSV into a tabular data platform / framework
>> / store (eg into a spreadsheet application or relational database or
>> application like R)
>> > >> >>>
>> > >> >>> Please step forward, by editing the wiki, to lead the work on
>> one of these documents and/or volunteer to help someone else with the work
>> that needs to go into it. Obviously everything will be discussed on the
>> list, but lead editors are instrumental in framing those discussions.
>> > >> >>>
>> > >> >>> Thanks,
>> > >> >>>
>> > >> >>> Jeni
>> > >> >>> --
>> > >> >>> Jeni Tennison
>> > >> >>> http://www.jenitennison.com/
>> > >> >>>
>> > >> >>>
>> > >> >>>
>> > >> >>>
>> > >> >>>
>> > >> >>> ----
>> > >> >>> Ivan Herman, W3C
>> > >> >>> Digital Publishing Activity Lead
>> > >> >>> Home: http://www.w3.org/People/Ivan/
>> > >> >>> mobile: +31-641044153
>> > >> >>> GPG: 0x343F1A3D
>> > >> >>> FOAF: http://www.ivan-herman.net/foaf
>> > >> >>>
>> > >> >>>
>> > >> >>>
>> > >> >>>
>> > >> >>>
>> > >> >>>
>> > >> >>>
>> > >> >>>
>> > >> >>
>> > >> >>
>> > >> >> ----
>> > >> >> Ivan Herman, W3C
>> > >> >> Digital Publishing Activity Lead
>> > >> >> Home: http://www.w3.org/People/Ivan/
>> > >> >> mobile: +31-641044153
>> > >> >> GPG: 0x343F1A3D
>> > >> >> FOAF: http://www.ivan-herman.net/foaf
>> > >> >>
>> > >> >>
>> > >> >>
>> > >> >>
>> > >> >>
>> > >> >
>> > >> >
>> > >>
>> > >>
>> > >
>> >
>> >
>> > ----
>> > Ivan Herman, W3C
>> > Digital Publishing Activity Lead
>> > Home: http://www.w3.org/People/Ivan/
>> > mobile: +31-641044153
>> > GPG: 0x343F1A3D
>> > FOAF: http://www.ivan-herman.net/foaf
>> >
>> >
>> >
>> >
>> >
>> >
>>
>>
>> ----
>> Ivan Herman, W3C
>> Digital Publishing Activity Lead
>> Home: http://www.w3.org/People/Ivan/
>> mobile: +31-641044153
>> GPG: 0x343F1A3D
>> FOAF: http://www.ivan-herman.net/foaf
>>
>>
>>
>>
>>
>>
>
>

Received on Friday, 21 March 2014 17:35:17 UTC