RE: [csvw] Is row by row processing sufficient? (#20) from Tandy, Jeremy on 2014-06-11 (public-csv-wg@w3.org from June 2014)

From: Tandy, Jeremy <jeremy.tandy@metoffice.gov.uk>
Date: Wed, 11 Jun 2014 13:35:57 +0000
To: Stasinos Konstantopoulos <konstant@iit.demokritos.gr>, "public-csv-wg@w3.org" <public-csv-wg@w3.org>
Message-ID: <2624871D9A05174691BD59F8EFD68AE208846D53@EXXCMPD1DAG3.cmpd1.metoffice.gov.uk>
We talked about this on the call. 

Here's the example I am talking about, taken from [Use Case #24 - Expressing a hierarchy within occupational listings][1].

The data file can be found [here][2].

Here's a snippet of the data (with whitespace added for clarity):

"""
{snip}
Major Group,Minor Group,Broad Group,Detailed Occupation,                                            ,,,,,
           ,           ,           ,                   ,                                            ,,,,,
{snip}
           ,           ,           ,            13-2099,          "Financial Specialists, All Other",,,,,
    15-0000,           ,           ,                   ,       Computer and Mathematical Occupations,,,,,
           ,    15-1100,           ,                   ,                        Computer Occupations,,,,,
           ,           ,    15-1110,                   ,Computer and Information Research Scientists,,,,,
           ,           ,           ,            15-1111,Computer and Information Research Scientists,,,,,
{snip}
           ,           ,    15-1190,                   ,          Miscellaneous Computer Occupations,,,,,
           ,           ,           ,            15-1199,           "Computer Occupations, All Other",,,,,
           ,    15-2000,           ,                   ,            Mathematical Science Occupations,,,,,
{snip}
"""

This is a multi-level hierarchy.

Values in the "Broad Group" column are sometimes "ditto above" and sometimes "empty" (null).

Filling out the grid of values explicitly gives me:

"""
{snip}
Major Group,Minor Group,Broad Group,Detailed Occupation,                                            ,,,,,
           ,           ,           ,                   ,                                            ,,,,,
{snip}
    13-0000,    13-2000,    13-2090,            13-2099,          "Financial Specialists, All Other",,,,,
    15-0000,           ,           ,                   ,       Computer and Mathematical Occupations,,,,,
    15-0000,    15-1100,           ,                   ,                        Computer Occupations,,,,,
    15-0000,    15-1100,    15-1110,                   ,Computer and Information Research Scientists,,,,,
    15-0000,    15-1100,    15-1110,            15-1111,Computer and Information Research Scientists,,,,,
{snip}
    15-0000,    15-1100,    15-1190,                   ,          Miscellaneous Computer Occupations,,,,,
    15-0000,    15-1100,    15-1190,            15-1199,           "Computer Occupations, All Other",,,,,
    15-0000,    15-2000,           ,                   ,            Mathematical Science Occupations,,,,,
{snip}
"""

I guess that the processing model we could use is:
- if blank, then same as above (assuming that previous row has been filled out already)

... so it may not be as complicated as I initially thought.
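
As a rough illustration of the "if blank, then same as above" rule, here is a minimal Python sketch. The function name `fill_down`, the choice of columns, and the inline sample rows are hypothetical and only for illustration; nothing here is a proposed processing model. Note that, applied naively, the rule also fills blanks that are meant to be empty: the last row of the sample inherits 15-1111 in the "Detailed Occupation" column.

```python
import csv
import io


def fill_down(rows, columns):
    """Yield rows where blank cells in `columns` take the value from the row above.

    Assumes the previous row has already been filled out, so state is
    carried from row to row in `previous`.
    """
    previous = {}
    for row in rows:
        filled = list(row)
        for i in columns:
            cell = filled[i].strip()
            if cell == "":
                # Blank: same as above (or still blank if nothing came before).
                cell = previous.get(i, "")
            filled[i] = cell
            previous[i] = cell
        yield filled


# A few rows taken from the snippet, without the padding whitespace.
SAMPLE = """\
15-0000,,,,Computer and Mathematical Occupations
,15-1100,,,Computer Occupations
,,15-1110,,Computer and Information Research Scientists
,,,15-1111,Computer and Information Research Scientists
,,15-1190,,Miscellaneous Computer Occupations
"""

rows = list(fill_down(csv.reader(io.StringIO(SAMPLE)), columns=range(4)))
for row in rows:
    print(row)
```

The hierarchy columns of the fourth row come out fully populated, but the fifth row shows why blank alone cannot mean both "ditto" and "empty".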

To do this we would need to add explicit "empty" or null values (e.g. "\n") that get copied when processing each row and can then cascade down too. For example:

"""
{snip}
Major Group,Minor Group,Broad Group,Detailed Occupation,                                            ,,,,,
           ,           ,           ,                   ,                                            ,,,,,
{snip}
           ,           ,           ,            13-2099,          "Financial Specialists, All Other",,,,,
    15-0000,         \n,         \n,                 \n,       Computer and Mathematical Occupations,,,,,
           ,    15-1100,           ,                   ,                        Computer Occupations,,,,,
           ,           ,    15-1110,                   ,Computer and Information Research Scientists,,,,,
           ,           ,           ,            15-1111,Computer and Information Research Scientists,,,,,
{snip}
           ,           ,    15-1190,                 \n,          Miscellaneous Computer Occupations,,,,,
           ,           ,           ,            15-1199,           "Computer Occupations, All Other",,,,,
           ,    15-2000,         \n,                 \n,            Mathematical Science Occupations,,,,,
{snip}
"""

... which would be fully populated like:

"""
{snip}
Major Group,Minor Group,Broad Group,Detailed Occupation,                                            ,,,,,
         \n,         \n,         \n,                 \n,                                            ,,,,,
{snip}
    13-0000,    13-2000,    13-2090,            13-2099,          "Financial Specialists, All Other",,,,,
    15-0000,         \n,         \n,                 \n,       Computer and Mathematical Occupations,,,,,
    15-0000,    15-1100,         \n,                 \n,                        Computer Occupations,,,,,
    15-0000,    15-1100,    15-1110,                 \n,Computer and Information Research Scientists,,,,,
    15-0000,    15-1100,    15-1110,            15-1111,Computer and Information Research Scientists,,,,,
{snip}
    15-0000,    15-1100,    15-1190,                 \n,          Miscellaneous Computer Occupations,,,,,
    15-0000,    15-1100,    15-1190,            15-1199,           "Computer Occupations, All Other",,,,,
    15-0000,    15-2000,         \n,                 \n,            Mathematical Science Occupations,,,,,
{snip}
"""

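The explicit-marker variant can be sketched in Python as follows. This is only an illustration under stated assumptions: the marker is taken to be the two literal characters backslash and "n" as written in the example, the names `NULL_TOKEN` and `fill_down_with_nulls` are hypothetical, and a blank cell with nothing above it is treated as null. The marker is kept in the carried state so that it cascades down, and is only turned into `None` in the output.

```python
import csv
import io

NULL_TOKEN = r"\n"  # assumption: literal backslash + n marks an explicit null


def fill_down_with_nulls(rows, columns, null_token=NULL_TOKEN):
    """Fill blank cells from the row above; explicit null markers cascade too."""
    previous = {}
    for row in rows:
        filled = list(row)
        for i in columns:
            cell = filled[i].strip()
            if cell == "":
                # Blank: same as above; with nothing above, assume null.
                cell = previous.get(i, null_token)
            previous[i] = cell  # remember the marker itself so it cascades
            filled[i] = None if cell == null_token else cell
        yield filled


# Rows from the example with markers added, without the padding whitespace.
SAMPLE = r"""15-0000,\n,\n,\n,Computer and Mathematical Occupations
,15-1100,,,Computer Occupations
,,15-1110,,Computer and Information Research Scientists
,,,15-1111,Computer and Information Research Scientists
,,15-1190,\n,Miscellaneous Computer Occupations
,,,15-1199,"Computer Occupations, All Other"
,15-2000,\n,\n,Mathematical Science Occupations
"""

rows = list(fill_down_with_nulls(csv.reader(io.StringIO(SAMPLE)), columns=range(4)))
for row in rows:
    print(row)
```

With the markers in place the 15-1190 row no longer inherits 15-1111, and the 15-2000 row resets both lower levels of the hierarchy.
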
Jeremy

[1]: http://w3c.github.io/csvw/use-cases-and-requirements/index.html#UC-ExpressingHierarchyWithinOccupationalListings

[2]: http://w3c.github.io/csvw/use-cases-and-requirements/soc_structure_2010.csv


> -----Original Message-----
> From: Tandy, Jeremy [mailto:jeremy.tandy@metoffice.gov.uk]
> Sent: 11 June 2014 11:25
> To: Stasinos Konstantopoulos; public-csv-wg@w3.org
> Subject: RE: [csvw] Is row by row processing sufficient? (#20)
> 
> Hi Stasinos - you make some great points that I had not considered. I'd
> like to get some feedback on this topic at today's teleconf. I'll add
> this item to the agenda.
> 
> Jeremy
> 
> > -----Original Message-----
> > From: Stasinos Konstantopoulos [mailto:konstant@iit.demokritos.gr]
> > Sent: 11 June 2014 10:45
> > To: w3c/csvw
> > Cc: Tandy, Jeremy
> > Subject: Re: [csvw] Is row by row processing sufficient? (#20)
> >
> > Jeremy, all,
> >
> > Merged cells in spreadsheets are serialized to CSV as empty cells that
> > mean "same as above" or "same as on the left", depending on the range
> > that was merged. I think that this is common enough that it deserves
> > to be handled without expecting that the publisher massages their
> > data. In fact, the "same as on the left" case appears in the Excel
> > files of Use Case #8 - Analyzing Scientific Spreadsheets [1].
> >
> > Furthermore, this is very similar to:
> >
> > R-MissingValueDefinition: Ability to declare a "missing value" token
> > and, optionally, a reason for the value to be missing [2]
> >
> > My proposal is to extend R-MissingValueDefinition to something along
> > the lines of:
> >
> > Ability to declare a "missing value" token and, optionally, a reason
> > for the value to be missing or an action to be taken to fill in the
> > value. Actions to be taken should be selected from a closed vocabulary
> > to be specified by the WG; including "same as above" and "same as on
> > the left" (from UC-8).
> >
> > Other interesting actions (e.g., "default value = V") might be found
> > in use cases if we look at them from this perspective.
> >
> > In this case, UC-8 should also require R-MissingValueDefinition.
> >
> > Best,
> > Stasinos
> >
> >
> > [1] http://w3c.github.io/csvw/use-cases-and-requirements/index.html#UC-AnalyzingScientificSpreadsheets
> > [2] http://w3c.github.io/csvw/use-cases-and-requirements/index.html#R-MissingValueDefinition
> >
> > On 11 June 2014 12:03, Jeremy Tandy <notifications@github.com> wrote:
> > > In the Processing Model of the Generating RDF from Tabular Data on the
> > > Web doc, there is an issue raised stating:
> > >
> > > """
> > > Independently processed rows - is this always the case?
> > > """
> > >
> > > There are examples (see Use Case #24 - Expressing a hierarchy within
> > > occupational listings) where "blank" fields imply "ditto" to the field
> > > above (or the last time that field was not blank). At first glance,
> > > this seems pretty trivial, yet the example in the use case uses a
> > > multi-level hierarchy, and sometimes "blank" means "empty" (null)
> > > not "ditto". As such, the arbitrary processing required to "guess
> > > the behaviour applied to blank cells" is somewhat challenging.
> > >
> > > As such, I recommend that we don't try to process this mode of
> > > behaviour during the transformation. If people have CSV data with
> > > "blanks that mean ditto", they need to fill in the blanks first.
> > >
> > > Given that, I suggest that we stick with the model that processes each
> > > row independently and does not require us to maintain state from row
> > > to row.
> > >
Received on Wednesday, 11 June 2014 13:36:30 UTC