Re: CSV on the web: question re null / missing values from Rufus Pollock on 2014-08-06 (public-csv-wg@w3.org from August 2014)

From: Rufus Pollock <rufus.pollock@okfn.org>
Date: Wed, 6 Aug 2014 16:00:49 +0100
To: Peter Parslow <Peter.Parslow@ordnancesurvey.co.uk>
Cc: Dan Brickley <danbri@google.com>, "public-csv-wg@w3.org" <public-csv-wg@w3.org>
Message-ID: <CAKssCpPtwyi8nNvjpfO_6ofbLF9XxqfLpfzOqjfZjmC5qoOUmw@mail.gmail.com>
This is a really interesting question. I note this arose with respect to
Tabular Data Package and JSON Table Schema and someone opened this specific
issue:

https://github.com/dataprotocols/dataprotocols/issues/97

The suggestion there was to add a specific field named "missing_value"
which would define the value or symbol used to mark a missing value.
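
To make that concrete, here is a rough sketch (not taken from the issue or
from any spec) of how a consumer might apply such a declaration. Only the
"missing_value" key name comes from the suggestion; the schema layout, the
read_rows helper, the column names and the sample values are all assumptions
made up for illustration.

```python
import csv
import io

# Hypothetical schema fragment. Only the "missing_value" key name comes from
# the suggestion in the issue; the surrounding layout is assumed.
schema = {
    "fields": [
        {"name": "uprn"},
        {"name": "ward_code", "missing_value": "none"},
    ]
}


def read_rows(text, schema):
    """Yield dict rows, replacing each declared missing-value token with None."""
    # Only fields that declare a token take part in the substitution.
    tokens = {
        f["name"]: f["missing_value"]
        for f in schema["fields"]
        if "missing_value" in f
    }
    for row in csv.DictReader(io.StringIO(text)):
        yield {
            name: None if name in tokens and value == tokens[name] else value
            for name, value in row.items()
        }


# Made-up sample data: the second address has no ward.
sample = "uprn,ward_code\n100,E05000001\n101,none\n"
rows = list(read_rows(sample, schema))
```

Declaring the token per field rather than per file means a string that is
the missing marker in one column is left alone in every other column.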

Rufus


On 6 August 2014 13:45, Peter Parslow <Peter.Parslow@ordnancesurvey.co.uk>
wrote:

> Dan,
> That looks like a 'yes'.
>
> I can add some scenarios, from our own CSV data products.
>
> This morning I was looking at "AddressBase Plus", so the specific use case
> is a missing string value (although the string in question is actually an
> identifier in a different dataset). There are many fields in an address
> which are not always populated. For example, many (UK) addresses lie within
> an electoral ward, but not all do - so sometimes the electoral ward column
> is empty, indicating that this particular address does not lie in a ward.
> The use case is that we would like to be explicit about that, rather than
> risk it being interpreted as 'missing by accident'. The solution would
> appear to be for us to create an 'out of range' value to populate the field
> with ('none' springs to mind); the slight difficulty being that the
> electoral ward codes are created by another agency - so we run the risk of
> them creating a value of 'none' for some (no doubt good) reason! Especially
> if we chose the same 'none' across all the potentially missing string
> values (house name, street name, etc.).
>
> My reference to GML is more generic, because we have a European standard
> now for expressing addresses in GML, and GML includes a reasonably good
> model for being explicit about what nil & absent values mean.
>
> Peter
> -----Original Message-----
> From: Dan Brickley [mailto:danbri@google.com]
> Sent: 06 August 2014 11:25
> To: Peter Parslow
> Cc: public-csv-wg@w3.org
> Subject: Re: CSV on the web: question re null / missing values
>
> On 6 August 2014 11:16, Peter Parslow
> <Peter.Parslow@ordnancesurvey.co.uk> wrote:
> > Does this working group intend to publish anything (e.g. advice) on how
> > to handle null values in CSV data? Perhaps as part of the metadata work,
> > given that current usage varies. I would like to see some guidance covering:
> >
> > * 'the meaning of null' - i.e. recognition of the range of
> > possibilities. OGC's gml:nilReasonType
> > (http://www.opengeospatial.org/standards/gml) extends the idea of
> > xsi:nil (http://www.w3.org/TR/xmlschema-1/#xsi_nil;
> > http://www.w3.org/TR/2001/REC-xmlschema-0-20010502/#Nils)
> >
> > * Giving null values in columns of different data types
> >
> > * Any interaction with whether strings are quoted or not
>
> Hi Peter,
>
> Interesting point. I think this will come to the fore as we go deeper into
> templates and mappings. Perhaps there are use cases that could capture more
> detailed requirements than we now have. But
> http://www.w3.org/TR/2014/WD-csvw-ucr-20140701/#R-MissingValueDefinition
> does touch on the issue already:
>
> "R-MissingValueDefinition: Ability to declare a "missing value" token and,
> optionally, a reason for the value to be missing
>
> Significant amounts of existing tabular text data include values such as
> -999. Typically, these are outside the normal expected range of values and
> are meant to infer that the value for that cell is missing.
> Automated parsing of CSV files needs to recognise such missing value
> tokens and behave accordingly. Furthermore, it is often useful for a data
> publisher to declare why a value is missing; e.g. withheld or
> aboveMeasurementRange
>
> Motivation: SurfaceTemperatureDatabank, OrganogramData, OpenSpendingData,
> NetCdFcDl, PaloAltoTreeData and PlatformIntegrationUsingSTDF."
>
> At a minimum it could be useful to add the links you provide to any future
> updates on the csvw-ucr doc. Do you have scenarios in mind that are not
> captured in the above list of use cases?
>
> cheers,
>
> Dan
>
>



-- 

Rufus Pollock
Founder and President | skype: rufuspollock | @rufuspollock
<https://twitter.com/rufuspollock>
Open Knowledge <http://okfn.org/> - see how data can change the world
http://okfn.org/ | @okfn <http://twitter.com/OKFN> | Open Knowledge on
Facebook <https://www.facebook.com/OKFNetwork> | Blog <http://blog.okfn.org/>

Received on Wednesday, 6 August 2014 15:01:19 UTC