RE: What to do when "primary key" cell values are blank from Tandy, Jeremy on 2014-06-06 (public-csv-wg@w3.org from June 2014)

From: Tandy, Jeremy <jeremy.tandy@metoffice.gov.uk>
Date: Fri, 6 Jun 2014 14:45:42 +0000
To: Andy Seaborne <andy@apache.org>, "public-csv-wg@w3.org" <public-csv-wg@w3.org>
Message-ID: <2624871D9A05174691BD59F8EFD68AE20884574C@EXXCMPD1DAG3.cmpd1.metoffice.gov.uk>
Hi Andy -

Thanks for the feedback. I have now re-written the R-PrimaryKey requirement [1] to focus _only_ on unique identification of _rows_ within a dataset as opposed to the identification of entities described by a given row. I have amended all the use cases where this distinction was muddled such that where I was talking about unique identifiers for the _entity_ I now refer to R-URIMapping to convert the local identifier in the dataset to a globally unique URI.

I've dropped all suggestions about skipping rows where the primary key is blank.

In Use Case #24 - Expressing a hierarchy within occupational listings [2], I now use the Conditional Processing requirement [3] to skip rows where the "unique identifier" field is blank.
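
As a rough illustration of that kind of conditional processing (a sketch only, not text from the use case document), the Python below keeps just the rows whose identifier cell is non-blank. The column names and the sample rows are assumptions made up for this example.

```python
import csv
import io

# Hypothetical sample data; column names are assumptions for illustration only.
SAMPLE = """major_group,title
11-0000,Management Occupations
,Chief Executives
13-0000,Business and Financial Operations Occupations
"""

def rows_with_identifier(text, id_column):
    """Yield only the rows whose identifier cell is non-blank."""
    for row in csv.DictReader(io.StringIO(text)):
        value = (row.get(id_column) or "").strip()
        if value:
            yield row

# One "pass" that extracts only rows describing a major group.
selected = list(rows_with_identifier(SAMPLE, "major_group"))
```

The skipped rows are not lost: a second pass with a different identifier column could pick them up.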

I think that this has fixed the confusion in the document.

Jeremy

[1] http://w3c.github.io/csvw/use-cases-and-requirements/#R-PrimaryKey 
[2] http://w3c.github.io/csvw/use-cases-and-requirements/#UC-ExpressingHierarchyWithinOccupationalListings 
[3] http://w3c.github.io/csvw/use-cases-and-requirements/#R-ConditionalProcessingBasedOnCellValues 

> -----Original Message-----
> From: Andy Seaborne [mailto:andy@apache.org]
> Sent: 06 June 2014 12:15
> To: Tandy, Jeremy; public-csv-wg@w3.org
> Subject: Re: What to do when "primary key" cell values are blank
> 
> On 06/06/14 11:47, Tandy, Jeremy wrote:
> > Hi Andy
> >
> > ... so it looks like I've got confused in my terminology:
> >
> > """\"The\" primary key is different in each pass.  The note in
> > R-PrimaryKey does not meet our experiences."""
> >
> > ... and ...
> >
> > """\"Primary\" is being overloaded between uniquely identifying a row
> (structural to CSV files), and uniquely identifying an entity
> (modelling).  In denormalised data, entities might get repeated on
> different rows."""
> >
> > I've clearly been thinking about the "modelling" case not the
> "structural" case. Can you help me clarify with some suggested
> alternative text?
> >
> 
> R-PrimaryKey seems to take a design position and I think there are
> alternatives depending on the data and intent.
> 
> Maybe drop these 2 items that seem to me to be one specific choice that
> is not always the right one for all conversions:
> 
> ----
> Where a row contains a primary key cell that is blank or empty, that
> row shall be ignored.
> ----
> 
> because an alternative approach is to generate a primary key anyway
> (e.g. UUID based or based on row number).  This may be patched up later
> or not.  Skipping loses the information.
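
A minimal sketch of that alternative, assuming a hypothetical `id` column and made-up sample rows: rows with a blank key are kept and given either a row-number based key or a UUID based one. The `row-N` and `urn:uuid:` key formats are choices made for this example only.

```python
import csv
import io
import uuid

# Hypothetical sample data; the "id" column name is an assumption.
SAMPLE = """id,name
a1,first
,second
"""

def with_row_keys(text, key_column, use_uuid=False):
    """Return (key, row) pairs, generating a key where the cell is blank."""
    keyed_rows = []
    # Data rows are numbered from 1; the header line is not counted.
    for number, row in enumerate(csv.DictReader(io.StringIO(text)), start=1):
        key = (row.get(key_column) or "").strip()
        if not key:
            # Generate a surrogate key instead of skipping the row.
            key = f"urn:uuid:{uuid.uuid4()}" if use_uuid else f"row-{number}"
        keyed_rows.append((key, row))
    return keyed_rows

keyed = with_row_keys(SAMPLE, "id")
```

Row-number keys are repeatable across runs as long as row order is unchanged; UUID keys are not, which matters if the keys are to be patched up later.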
> 
> 
> 
> I don't think data is as clean as this seems to see it:
> ----
> Note
> 
> Assumption that a row within a CSV file describes a single entity for
> which a primary key can be assigned.
> ----
> 
> In the hierarchy extraction example, the deduced identifier for
> the "11-1011.03" row could induce another triple subject.
> 
> soc:11-1011.00 skos:narrower soc:11-1011.03 .
> 
> (using :narrower, not :broader)
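
To make the deduction concrete, here is a small sketch. It assumes, based only on the single example above, that the parent code is formed by replacing the suffix after the dot with "00". The function name and the plain-string triple output are illustrative.

```python
def narrower_triple(code):
    """Return a skos:narrower triple linking the deduced parent to `code`.

    Assumption: the parent is the same code with its ".NN" suffix set to ".00".
    Returns None when the code has no suffix or is itself a ".00" code.
    """
    base, _, suffix = code.partition(".")
    if not suffix or suffix == "00":
        return None
    return f"soc:{base}.00 skos:narrower soc:{code} ."

triple = narrower_triple("11-1011.03")
```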
> 
> In the Land registry example, a transaction row has the address on it
> but the address can be used multiple places.  There are two entities in
> the row (imagine a conversion that just extracted the addresses).
> 
> In order to share the address, the subject URI for an address is a hash
> of its parts and in the RDF is a separate entity to the transaction
> record.  That's its "primary key" - not the transaction's "primary
> key".
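
A hedged sketch of the hashed-address idea follows. The base namespace, the choice of SHA-1, and the normalisation (trim, upper-case, join with "|") are all assumptions for illustration, not a description of the actual Land Registry conversion.

```python
import hashlib

# Hypothetical namespace for address entities.
BASE = "http://example.org/address/"

def address_uri(parts):
    """Build a subject URI for an address from a hash of its parts."""
    # Normalise so the same address on different rows yields the same URI.
    normalised = "|".join(p.strip().upper() for p in parts)
    digest = hashlib.sha1(normalised.encode("utf-8")).hexdigest()
    return BASE + digest

# Two transaction rows carrying the same address map to one address entity.
a = address_uri(["1 High Street", "Exeter", "EX1 1AA"])
b = address_uri([" 1 high street", "exeter ", "EX1 1AA"])
```

The transaction itself would still get its own identifier; the hash only identifies the address.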
> 
> 	Andy
> 
> > Thanks in anticipation.
> >
> > Jeremy
> >
> >
> >
> >> -----Original Message-----
> >> From: Andy Seaborne [mailto:andy@apache.org]
> >> Sent: 06 June 2014 10:23
> >> To: public-csv-wg@w3.org
> >> Subject: Re: What to do when "primary key" cell values are blank
> >>
> >> On 06/06/14 09:53, Tandy, Jeremy wrote:
> >>> Hi - when putting together Use Case #24 - Expressing a hierarchy
> >> within occupational listings [1] I was considering how primary key
> >> behaviour might work. In this use case, there are four different
> >> types of entity described in a single CSV file. I inferred that we
> >> might apply four different templates to pull out the relevant
> >> contents and transform into RDF. A given row describes _one of_ the
> >> types of entity, meaning that the primary key column asserted, say,
> >> for extracting "SOC Major Group" concepts will often be blank.
> >>>
> >>> I have stated in the use case that:
> >>>> Where the value in the designated primary key column is blank, the
> >> row is ignored.
> >>>
> >>> I have also added this constraint to the primary key requirement
> [2].
> >>>
> >>> Please advise if this is inappropriate!
> >>
> >> We use template conversion - we often run multiple templates on the
> >> same CSV, essentially extracting different kinds of entity on each
> >> pass.
> >> "The" primary key is different in each pass.  The note in
> >> R-PrimaryKey does not meet our experiences.
> >>
> >> JeniT's condition extract is an example where it might be done as a
> >> pass to generate the skos:broader separately from the "code
> >> rdfs:label ....".
> >>
> >> "Primary" is being overloaded between uniquely identifying a row
> >> (structural to CSV files), and uniquely identifying an entity
> >> (modelling).  In denormalised data, entities might get repeated on
> >> different rows.
> >>
> >> 	Andy
> >>
> >>>
> >>> Regards, Jeremy
> >>>
> >>>
> >>> [1] http://w3c.github.io/csvw/use-cases-and-requirements/#UC-ExpressingHierarchyWithinOccupationalListings
> >>> [2] http://w3c.github.io/csvw/use-cases-and-requirements/#R-PrimaryKey
> >>>
> >>
> >
Received on Friday, 6 June 2014 14:46:15 UTC