Re: getting up to speed from Kev Kirkland on 2014-09-15 (public-csv-wg@w3.org from September 2014)

From: Kev Kirkland <kev@dataunity.org>
Date: Mon, 15 Sep 2014 19:53:05 +0100
To: "Ingram, William A" <wingram2@illinois.edu>
Cc: Jeni Tennison <jeni@jenitennison.com>, Ivan Herman <ivan@w3.org>, W3C CSV on the Web Working Group <public-csv-wg@w3.org>
Message-ID: <CAPNZP6K9jnzizZ0TUGddsVR3uLD0j1=avWqqtX+kRu1O7qjhBQ@mail.gmail.com>
MultipleHeadingRows is one that interests me too - it comes up often in the
Market Research field. The data I had to deal with looks similar to Use
Case #2 [1].

Python Pandas [2] is the best software package I've found so far for
dealing with multiple headers. It has 'levels' which show how row/column
headers are grouped together. In Pandas all rows and columns have an index
and "hierarchical indexing" is used when the indexes are grouped together
into bigger hierarchical structures [3].

Using Example 1 (in Use Case #2) you might say row 5 is a column level
called "Measure", row 6 is a level called "Unit", row 7 "Indicator" and row
8 "Employment status" (I'm guessing the most appropriate names for the
levels as I couldn't see the correct terms in the metadata in the zip file).

I'm a bit rusty with Pandas, but I think it lets you specify hierarchical
indexing when you load a CSV file [4] (see the 'header' parameter). Unlike
in a lot of other systems, 'header' isn't a simple boolean for present or
absent, but can carry more detail (such as which rows the headers appear on).
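
As a rough sketch of the 'header' parameter mentioned above: passing a list
of row numbers to read_csv turns several header rows into one multi-level
column index. The CSV content below is a made-up sample, not the actual Use
Case #2 file, and the level names are the guessed ones from this message.

```python
import io

import pandas as pd

# Hypothetical sample: four header rows followed by two data rows.
# Row positions are 0-based within this sample, not rows 5-8 of the
# original spreadsheet.
csv_text = """\
Level,Level,Rate,Rate
thousands,thousands,percent,percent
Employment,Unemployment,Employment,Unemployment
All aged 16+,All aged 16+,All aged 16-64,All aged 16+
100,10,70.0,7.0
110,9,71.0,6.5
"""

# header takes a list of row numbers; each listed row becomes one level
# of the resulting column MultiIndex.
df = pd.read_csv(io.StringIO(csv_text), header=[0, 1, 2, 3])

# The level names are not in the file, so assign them by hand (guessed names).
df.columns.names = ["Measure", "Unit", "Indicator", "Employment status"]

# Select every column under the top-level "Rate" heading.
rates = df["Rate"]

# Select every "Unemployment" column, whichever measure it belongs to.
unemployment = df.xs("Unemployment", axis=1, level="Indicator")

print(df)
```

Selecting by level name is the part that a plain "header present: yes/no"
flag cannot support, which is why the row positions of the headers seem
worth recording in the metadata.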

It would be great to have this type of information in the CSV on the Web
metadata as it's very useful for reading files. Pandas has hierarchical
indexing on both rows and columns so it can deal with data that looks like
pivot tables (or OLAP style results).
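
To illustrate the point about both axes, here is a minimal sketch of a
pivot-table-like frame with hierarchical indexes on rows and columns. The
region, year, channel and measure labels and all the numbers are invented
purely for illustration.

```python
import pandas as pd

# Hypothetical row hierarchy: Region > Year.
rows = pd.MultiIndex.from_product(
    [["North", "South"], [2013, 2014]], names=["Region", "Year"]
)

# Hypothetical column hierarchy: Channel > Measure.
cols = pd.MultiIndex.from_product(
    [["Online", "Retail"], ["Units", "Revenue"]], names=["Channel", "Measure"]
)

values = [
    [1, 2, 3, 4],
    [5, 6, 7, 8],
    [9, 10, 11, 12],
    [13, 14, 15, 16],
]
table = pd.DataFrame(values, index=rows, columns=cols)

# A single cell is addressed by one tuple per axis.
cell = table.loc[("South", 2013), ("Retail", "Units")]

# Slice one branch of the row hierarchy.
north = table.loc["North"]

# Slice one label from an inner column level.
units = table.xs("Units", axis=1, level="Measure")

print(table)
```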

One potential issue with hierarchical indexing with levels is that each
level is assumed to be homogeneous. In Example 1 (Use Case #2) columns C
and D have a bit more information (they are total level figures) which
wouldn't be captured in the level definition.

Thanks,

Kev

[1]
http://w3c.github.io/csvw/use-cases-and-requirements/#UC-PublicationOfNationalStatistics
[2] http://pandas.pydata.org/
[3]
http://pandas.pydata.org/pandas-docs/stable/indexing.html#hierarchical-indexing-multiindex
[4]
http://pandas.pydata.org/pandas-docs/stable/generated/pandas.io.parsers.read_csv.html

On 15 September 2014 18:30, Ingram, William A <wingram2@illinois.edu> wrote:

> I got started on UC 12 -- the chemical structures case (requires:
> WellFormedCsvCheck, CsvValidation, MultipleHeadingRows, and
> UnitMeasureDefinition). Perhaps this was not the best case to begin with,
> but I feel like I'm in too deep to turn back now. :)
>
> I hit a wall in trying to describe these as JSON, particularly the multiple
> header rows. Is there any background reading I should read to learn more
> about csv to json? Or has no one figured this out yet?
>
> Thanks,
> Bill
>
>
> On 9/11/14, 4:45 PM, "Ingram, William A" <wingram2@illinois.edu> wrote:
>
> >Thanks! This is helpful.
> >
> >On 9/11/14, 12:40 PM, "Jeni Tennison" <jeni@jenitennison.com> wrote:
> >
> >>Hi Bill,
> >>
> >>I’d suggest taking one of the use cases in the use case document:
> >>
> >>  http://w3c.github.io/csvw/use-cases-and-requirements/
> >>
> >>and creating a metadata document for it based on the metadata document:
> >>
> >>  http://w3c.github.io/csvw/metadata/
> >>
> >>as a good way of getting to understand the use cases that we have and the
> >>current ideas we have around the metadata that can help validate or
> >>display or convert CSV files.
> >>
> >>You could then take that use case and try to frame what an ideal
> >>JSON/XML/RDF representation of the same CSV data would look like.
> >>
> >>The framework that Dan’s put together for structuring these explorations
> >>here:
> >>
> >>  https://github.com/w3c/csvw/tree/gh-pages/examples/tests
> >>
> >>describes how to structure this work.
> >>
> >>Jeni
> >>
> >>-----Original Message-----
> >>From: Ivan Herman <ivan@w3.org>
> >>Reply: Ivan Herman <ivan@w3.org>
> >>Date: 11 September 2014 at 13:58:12
> >>To: Ingram, William A <wingram2@illinois.edu>
> >>Cc: W3C CSV on the Web Working Group <public-csv-wg@w3.org>
> >>Subject:  Re: getting up to speed
> >>
> >>>
> >>> On 10 Sep 2014, at 21:04 , Ingram, William A wrote:
> >>>
> >>> > Hi all,
> >>> >
> >>> > I'm new to the working group, still trying to get my bearings and
> >>>find a
> >>> > place to fit in. Last week was topsy-turvy -- we moved our offices to
> >>> > another building -- but I'm settled in now and ready to get to work.
> >>> >
> >>> > Is there a smallish project I can work on this week that would a.)
> >>>allow
> >>> > me to get my hands dirty with the material, and b.) be of some use to
> >>>the
> >>> > group?
> >>> >
> >>>
> >>> I think we will know more once we have a decision on the templating
> >>>story. Once the general
> >>> lines are there, we should (a) write the document, and (b) have
> >>>independent implementations...
> >>>
> >>> Welcome to the group!
> >>>
> >>> Ivan
> >>>
> >>> > Thanks,
> >>> > Bill
> >>> >
> >>> > --
> >>> > Bill Ingram
> >>> > Manager, Repository Services
> >>> > University of Illinois Library
> >>> >
> >>> > New Office:
> >>> > 422 Main Library, MC 522
> >>> > 1408 W Gregory Dr
> >>> > Urbana, IL 61801
> >>> >
> >>>
> >>>
> >>> ----
> >>> Ivan Herman, W3C
> >>> Digital Publishing Activity Lead
> >>> Home: http://www.w3.org/People/Ivan/
> >>> mobile: +31-641044153
> >>> GPG: 0x343F1A3D
> >>> WebID: http://www.ivan-herman.net/foaf#me
> >>>
> >>>
> >>>
> >>>
> >>>
> >>>
> >>
> >>--
> >>Jeni Tennison
> >>http://www.jenitennison.com/
>



-- 
www.dataunity.org
twitter: @data_unity
Received on Monday, 15 September 2014 18:53:34 UTC