Re: getting up to speed

On Mon, 15 Sep 2014 19:53:05 +0100, Kev Kirkland <kev@dataunity.org>
wrote:

> MultipleHeadingRows is one that interests me too - it comes up often in the
> Market Research field. The data I had to deal with looks similar to Use
> Case #2 [1].
> 
> Python Pandas [2] is the best software package I've found so far for
> dealing with multiple headers. It has 'levels' which show how row/column
> headers are grouped together. In Pandas all rows and columns have an index
> and "hierarchical indexing" is used when the indexes are grouped together
> into bigger hierarchical structures [3].

Being the author of Perl5's Text::CSV_XS [1] I am fully open to *any*
valuable extension I can make to help you out.

I implemented RFC7111 [2] in the week it was released [3].
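
For those who have not read the RFC: a row fragment looks like
"row=3;6-9;15-*", with 1-based, inclusive ranges and "*" meaning the
last row. Below is a minimal sketch of that selection in Python, for
illustration only. It is not the Text::CSV_XS code; the function name
select_rows is hypothetical, and it assumes the rows are already parsed
and that only row= fragments need handling.

```python
def select_rows(rows, fragment):
    """Return the rows picked by a fragment such as 'row=3;6-9;15-*'."""
    kind, _, spec = fragment.partition("=")
    if kind != "row":
        raise ValueError("this sketch only handles row= fragments")
    last = len(rows)
    picked = []
    for part in spec.split(";"):
        start, _, end = part.partition("-")
        # "*" stands for the last row; positions are 1-based.
        lo = last if start == "*" else int(start)
        hi = lo if not end else (last if end == "*" else int(end))
        # Convert the 1-based inclusive range to a Python slice.
        picked.extend(rows[lo - 1:min(hi, last)])
    return picked


rows = [[str(n)] for n in range(1, 11)]  # ten one-field rows: "1" .. "10"
selected = select_rows(rows, "row=2;9-*")
print(selected)
```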

Not only is it my goal to make Text::CSV_XS the fastest, most
reliable and most configurable tool available for CSV parsing, I'd
also like to use it to promote perl.

Not only perl5, but also perl6. Together with someone else I have
started on a perl6 port of this module that we hope to have working by
April 2015. If you have any (well-documented) requirements for multiple
headers, I'd be glad to think about the most useful way to implement
them and accommodate the W3C's wishes.

[1] http://metacpan.org/module/Text::CSV_XS
[2] http://tools.ietf.org/html/rfc7111
[3] https://metacpan.org/pod/Text::CSV_XS#fragment

> Using Example 1 (in Use Case #2) you might say row 5 is a column level
> called "Measure", row 6 is a level called "Unit", row 7 "Indicator" and row
> 8 "Employment status" (I'm guessing the most appropriate names for the
> levels as I couldn't see the correct terms in the metadata in the zip file).
> 
> I'm a bit rusty with Pandas, but I think it lets you specify hierarchical
> indexing when you load a CSV file [4] (see the 'header' parameter). Unlike
> a lot of other systems, 'headers' isn't a simple boolean for present or
> absent, but can add more detail (like the rows which the headers appear on).
> 
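
To make the idea concrete: the pandas 'header' parameter accepts a list
of row numbers as well as a single one. Below is a minimal sketch of the
same idea using only the Python standard library, so it does not depend
on pandas. The sample data and the function name read_multi_header are
made up for illustration; the sketch assumes every heading row has one
cell per column (no merged or blank heading cells).

```python
import csv
import io

# Hypothetical sample: two heading rows followed by two data rows.
SAMPLE = """\
Measure,Rate,Rate,Level,Level
Status,Employed,Unemployed,Employed,Unemployed
2013,71.5,7.6,29900,2500
2014,72.9,6.2,30700,2000
"""


def read_multi_header(text, n_header_rows):
    """Return (columns, records); each column key is a tuple with one
    entry per heading row."""
    rows = list(csv.reader(io.StringIO(text)))
    header_rows, data_rows = rows[:n_header_rows], rows[n_header_rows:]
    # zip(*header_rows) transposes the heading rows into one tuple per column.
    columns = list(zip(*header_rows))
    records = [dict(zip(columns, row)) for row in data_rows]
    return columns, records


columns, records = read_multi_header(SAMPLE, 2)
print(columns[1])
print(records[0][("Rate", "Unemployed")])
```

Keying each column by a tuple keeps every heading level available to the
consumer, which is roughly what a hierarchical column index provides.
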
> It would be great to have this type of information in the CSV on the Web
> metadata as it's very useful for reading files. Pandas has hierarchical
> indexing on both rows and columns so it can deal with data that looks like
> pivot tables (or OLAP style results).
> 
> One potential issue with hierarchical indexing with levels is that each
> level is assumed to be homogeneous. In Example 1 (Use Case #2) columns C
> and D have a bit more information (they are total level figures) which
> wouldn't be captured in the level definition.
> 
> Thanks,
> 
> Kev
> 
> [1] http://w3c.github.io/csvw/use-cases-and-requirements/#UC-PublicationOfNationalStatistics
> [2] http://pandas.pydata.org/
> [3] http://pandas.pydata.org/pandas-docs/stable/indexing.html#hierarchical-indexing-multiindex
> [4] http://pandas.pydata.org/pandas-docs/stable/generated/pandas.io.parsers.read_csv.html
> 
> On 15 September 2014 18:30, Ingram, William A <wingram2@illinois.edu> wrote:
> 
> > I got started on UC 12 -- the chemical structures case (requires:
> > WellFormedCsvCheck, CsvValidation, MultipleHeadingRows, and
> > UnitMeasureDefinition). Perhaps this was not the best case to begin with,
> > but I feel like I'm in too deep to turn back now. :)
> >
> > I hit a wall in trying to describe these as JSON, particularly the multiple
> > header rows. Is there any background reading I should read to learn more
> > about csv to json? Or has no one figured this out yet?
> >
> > Thanks,
> > Bill

-- 
H.Merijn Brand  http://tux.nl   Perl Monger  http://amsterdam.pm.org/
using perl5.00307 .. 5.19   porting perl5 on HP-UX, AIX, and openSUSE
http://mirrors.develooper.com/hpux/        http://www.test-smoke.org/
http://qa.perl.org   http://www.goldmark.org/jeff/stupid-disclaimers/

Received on Tuesday, 16 September 2014 08:48:50 UTC