Re: CSV parser specification? from Rufus Pollock on 2014-03-10 (public-csv-wg@w3.org from March 2014)

From: Rufus Pollock <rufus.pollock@okfn.org>
Date: Mon, 10 Mar 2014 13:04:19 +0000
To: Alf Eaton <eaton.alf@gmail.com>
Cc: "public-csv-wg@w3.org" <public-csv-wg@w3.org>
Message-ID: <CAKssCpMKCT_whSJp8m6jqeROBuWMWjYipKp33rjC=6+FfqPp_Q@mail.gmail.com>
Great stuff Alf.

I note that trimming trailing (or leading) space is the purpose of the
"skipInitialSpace" attribute in the CSV Dialect Description Format spec
<http://dataprotocols.org/csv-dialect/>.
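
As a point of comparison, Python's standard `csv` module has a similarly named `skipinitialspace` option. The sketch below uses made-up sample data and only shows how that one parser behaves; it is not a statement about the dialect spec itself. In Python, the option drops spaces directly after a delimiter and leaves trailing spaces in place.

```python
import csv
import io

# Hypothetical sample: a space after each comma, plus a trailing space after "Alice".
sample = 'name, city\nAlice , London\n'

def parse(text, skip):
    # skipinitialspace only affects spaces immediately following the delimiter.
    return list(csv.reader(io.StringIO(text), skipinitialspace=skip))

rows_default = parse(sample, False)  # leading spaces are kept
rows_skipped = parse(sample, True)   # leading spaces dropped, trailing ones kept
```

Trimming trailing space would therefore still need a separate step (or a separate dialect option) in a parser that behaves like this one.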

Rufus





On 6 March 2014 14:12, Alf Eaton <eaton.alf@gmail.com> wrote:

> Yes, indeed - my starting point has basically been to take that
> specification and try to find things that it doesn't cover.
>
> The main things I've found have been backslash escaping (supported by
> PHP's CSV parser, for example), trimming trailing space (supported by
> some CSV parsers), and multiple header rows/columns (supported by
> _pandas_, at least).
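
To make the backslash-escaping case concrete, here is a minimal sketch using Python's standard `csv` module rather than PHP. The sample line is invented, and `escapechar` and `doublequote` are that module's own dialect parameters, used here on the assumption that they are a fair stand-in for the behaviour described above.

```python
import csv
import io

# Hypothetical line in which quotes inside a quoted field are escaped as \"
# instead of being doubled as "".
line = 'a,"say \\"hi\\"",c\n'

reader = csv.reader(
    io.StringIO(line),
    escapechar='\\',    # treat backslash as the escape character
    doublequote=False,  # do not interpret "" as an escaped quote
)
fields = next(reader)
```

A dialect description would need some equivalent of these two settings to round-trip such a file.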
>
> That's not counting all the options that some parsers have for doing
> data conversion/transformation, which is a separate issue...
>
> Alf
>
> On 6 March 2014 14:04, Rufus Pollock <rufus.pollock@okfn.org> wrote:
> > I note there is this existing "mini-spec" for describing CSV dialects
> > that may be useful here:
> >
> > http://dataprotocols.org/csv-dialect/
> >
> > Both inference of given CSV structure (csv "sniffing") and validation
> > would, of course, be separate issues.
> >
> > Rufus
> >
> >
> > On 6 March 2014 13:54, Alf Eaton <eaton.alf@gmail.com> wrote:
> >>
> >> It occurred to me that there are essentially 3 levels of "strictness"
> >> that a parser might need to support:
> >>
> >> 1. "Strict": no options for the parser, all the encoding, escaping and
> >> delimiter options are fixed (similar to JSON).
> >> 2. "Intermediate": dialect options can be given to the parser (or it
> >> may use auto-detection), allowing different encodings, delimiters,
> >> enclosures, escape characters and header rows/columns (similar to how
> >> most CSV encoders and parsers currently work).
> >> 3. "Liberal": the parser may need extra options to be able to generate
> >> clean data, such as removing leading/trailing rows or columns,
> >> trimming whitespace, converting date formats, etc (similar to how some
> >> more complex CSV parsers currently work).
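
The three levels above could be sketched as three entry points layered on one another. This is only an illustration built on Python's standard `csv` module; the function names, the `skip_rows` and `trim` options, and the sample data are all hypothetical, and character encoding is left out because the input is already decoded text.

```python
import csv
import io

def parse_strict(text):
    """Level 1: no options; comma, double quote and doubled-quote escaping are fixed."""
    return list(csv.reader(io.StringIO(text, newline="")))

def parse_intermediate(text, delimiter=",", quotechar='"', escapechar=None):
    """Level 2: the caller supplies dialect options."""
    reader = csv.reader(
        io.StringIO(text, newline=""),
        delimiter=delimiter,
        quotechar=quotechar,
        escapechar=escapechar,
    )
    return list(reader)

def parse_liberal(text, skip_rows=0, trim=False, **dialect):
    """Level 3: dialect options plus clean-up of the parsed rows."""
    rows = parse_intermediate(text, **dialect)[skip_rows:]
    if trim:
        rows = [[cell.strip() for cell in row] for row in rows]
    return rows

# Hypothetical messy export: a banner row, semicolons, and stray whitespace.
messy = "Exported 2014-03-06\nid; name\n1; Alf \n"
clean = parse_liberal(messy, skip_rows=1, trim=True, delimiter=";")
```

Each level only adds options to the one below it, which matches the idea of recommending the strict form for publishing while still describing what a more liberal parser needs.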
> >>
> >> I've attempted to write these up and include some examples:
> >> https://github.com/hubgit/csvw/wiki/CSV-Strictness
> >>
> >> It may be that the syntax document should aim for the strictest of
> >> these, as a recommendation for publishing data as CSV, but then
> >> describe the options that a parser would need in order to be more
> >> liberal and handle existing CSV files.
> >>
> >> Alf
> >>
> >>
> >> On 5 March 2014 23:49, Gregg Kellogg <gregg@greggkellogg.net> wrote:
> >> > On Mar 5, 2014, at 3:48 PM, Jeni Tennison <jeni@jenitennison.com> wrote:
> >> >
> >> >> Alf,
> >> >>
> >> >> See http://w3c.github.io/csvw/syntax/#parsing. Please add yourself as
> >> >> an editor and feel free to edit that content.
> >> >>
> >> >> What it really needs to do better is link back to describe the
> >> >> creation of the tabular data model that's described in the earlier
> >> >> section. Note that that model doesn't contain anything about indexes
> >> >> of columns or rows, so I have left that out of the parsing
> >> >> description too.
> >> >
> >> > If we define a syntax, we probably also want EBNF to describe it. This
> >> > is a simple first-cut at that:
> >> >
> >> > # EBNF description of CSV+
> >> > [1] csv       ::= header record+
> >> > [2] header    ::= record
> >> > [3] record    ::= fields ("\r\n" | "\n")
> >> > [4] fields    ::= field ("," fields)*
> >> > [5] field     ::= WS* rawfield WS*
> >> > [6] rawfield  ::= '"' QCHAR* '"'
> >> >                 | SCHAR*
> >> > [7] QCHAR     ::= [^"] | '""'
> >> > [8] SCHAR     ::= [^",\r\n]
> >> > [9] WS        ::= [ \t]
> >> >
> >> > Of course, it can't do field counting. We should probably place further
> >> > restrictions on QCHAR and SCHAR to avoid control characters. If header
> >> > weren't optional, it would be better defined as in RFC4180, but if the
> >> > syntax allows it to be optional, this would make it not an LL(1) grammar,
> >> > which isn't too much of an issue.
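
One way to read the grammar above is as a small hand-written scanner. The following is a sketch under a few assumptions, not a reference implementation: the names `parse_record` and `parse_csv` are made up, the overlap between WS and SCHAR is resolved by stripping spaces and tabs from both ends of unquoted fields, and, like the grammar, it does no field counting and requires every record (including the last) to end with a line break.

```python
import re

# field ::= WS* rawfield WS*, with rawfield either quoted or a run of SCHAR.
FIELD = re.compile(r'[ \t]*(?:"((?:[^"]|"")*)"|([^",\r\n]*))[ \t]*')
EOL = re.compile(r'\r\n|\n')

def parse_record(text, pos):
    """Parse one record starting at pos; return (fields, position after EOL)."""
    fields = []
    while True:
        m = FIELD.match(text, pos)  # always matches, possibly an empty field
        quoted, plain = m.group(1), m.group(2)
        if quoted is not None:
            fields.append(quoted.replace('""', '"'))  # undo doubled quotes
        else:
            fields.append(plain.strip(" \t"))         # assumption: trim unquoted
        pos = m.end()
        if text.startswith(",", pos):
            pos += 1
            continue
        eol = EOL.match(text, pos)
        if eol is None:
            raise ValueError(f"expected ',' or line ending at offset {pos}")
        return fields, eol.end()

def parse_csv(text):
    """csv ::= header record+ ; returns (header, list of records)."""
    records, pos = [], 0
    while pos < len(text):
        fields, pos = parse_record(text, pos)
        records.append(fields)
    if len(records) < 2:
        raise ValueError("grammar requires a header plus at least one record")
    return records[0], records[1:]

# Hypothetical input mixing CRLF and LF, quoted and unquoted fields.
header, rows = parse_csv('name, note\r\n"Alf" ,"said ""hi"", then left"\n  x , y\n')
```

Making the header optional would only change `parse_csv`, since the record-level rules are the same either way.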
> >> >
> >> > If people feel that we should stick closer to the RFC4180 grammar, it
> >> > might just be a matter of loosening CRLF and adding the non-ASCII
> >> > Unicode range to TEXTDATA, although expressed as W3C EBNF.
> >> >
> >> > If there's general consensus on adding this, I'm happy to put it in the
> >> > spec.
> >> >
> >> > Gregg
> >> >
> >> >> Jeni
> >> >>
> >> >> ------------------------------------------------------
> >> >> From: Alf Eaton eaton.alf@gmail.com
> >> >> Reply: Alf Eaton eaton.alf@gmail.com
> >> >> Date: 5 March 2014 at 18:32:01
> >> >> To: public-csv-wg@w3.org public-csv-wg@w3.org
> >> >> Subject:  CSV parser specification?
> >> >>
> >> >>>
> >> >>> Are there any plans to write a specification for a CSV parser, that
> >> >>> would cover all the kinds of files described in the use cases?
> >> >>>
> >> >>> I had a go at an outline today[1], in an attempt to organise my
> >> >>> thoughts about which parameters would be useful to a parser at which
> >> >>> points during the process.
> >> >>>
> >> >>> pandas[2] is the closest tool I've found that incorporates most or all
> >> >>> of these (particularly the generation of "multi-index" keys using
> >> >>> multiple header rows and index columns), though it also includes a lot
> >> >>> of parameters that are only relevant to parsing/transforming the
> >> >>> values of each cell, which I think should probably be in a separate
> >> >>> step.
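
The "multi-index" idea can be illustrated without pandas: combine several header rows into one tuple key per column. This is a rough stdlib sketch with a hypothetical function name and invented data, not how pandas implements it (pandas itself accepts a list such as `header=[0, 1]` in `read_csv` for this purpose).

```python
import csv
import io

def read_with_header_rows(text, header_rows=2):
    """Return one dict per data row, keyed by a tuple built from the header rows."""
    rows = list(csv.reader(io.StringIO(text, newline="")))
    headers, body = rows[:header_rows], rows[header_rows:]
    keys = list(zip(*headers))  # one tuple per column, e.g. ('2013', 'Q3')
    return [dict(zip(keys, row)) for row in body]

# Hypothetical table with a year row above a quarter row.
sample = "2013,2013,2014\nQ3,Q4,Q1\n10,12,15\n"
records = read_with_header_rows(sample)
```

Note that the cell values stay as strings here, which keeps value conversion in the separate step suggested above.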
> >> >>>
> >> >>> Alf
> >> >>>
> >> >>> [1] https://github.com/hubgit/csvw/wiki/CSV-Parsing
> >> >>> [2] http://pandas.pydata.org/pandas-docs/stable/io.html#io-read-csv-table
> >> >>>
> >> >>>
> >> >>>
> >> >>
> >> >> --
> >> >> Jeni Tennison
> >> >> http://www.jenitennison.com/
> >> >>
> >> >
> >>
> >
> >
> >
> > --
> >
> > Rufus Pollock
> >
> > Founder and CEO | skype: rufuspollock | @rufuspollock
> >
> > The Open Knowledge Foundation
> >
> > Empowering through Open Knowledge
> >
> > http://okfn.org/ | @okfn | OKF on Facebook |  Blog  |  Newsletter
>



-- 
Rufus Pollock
Founder and CEO | skype: rufuspollock | @rufuspollock <https://twitter.com/rufuspollock>
The Open Knowledge Foundation <http://okfn.org/>
Empowering through Open Knowledge
http://okfn.org/ | @okfn <http://twitter.com/OKFN> | OKF on Facebook <https://www.facebook.com/OKFNetwork> | Blog <http://blog.okfn.org/> | Newsletter <http://okfn.org/about/newsletter>
Received on Monday, 10 March 2014 13:04:48 UTC