Re: CSV parser specification? from Rufus Pollock on 2014-03-06 (public-csv-wg@w3.org from March 2014)

From: Rufus Pollock <rufus.pollock@okfn.org>
Date: Thu, 6 Mar 2014 14:04:07 +0000
To: Alf Eaton <eaton.alf@gmail.com>
Cc: "public-csv-wg@w3.org" <public-csv-wg@w3.org>
Message-ID: <CAKssCpNXTvVoxTG73G+1zkAQcHBRmYUTzuy23FN1ecx05H=MRg@mail.gmail.com>
I note there is this existing "mini-spec" for describing CSV dialects that
may be useful here:

http://dataprotocols.org/csv-dialect/

Both inference of a given CSV file's structure (CSV "sniffing") and
validation would, of course, be separate issues.
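
To illustrate the difference between supplying a dialect and sniffing one,
here is a minimal Python sketch using the standard csv module. The option
names (delimiter, quotechar, doublequote, skipinitialspace) are the csv
module's own; I am assuming they correspond to the mini-spec's fields rather
than quoting it, and both helper functions are hypothetical names made up
for this example.

```python
import csv
import io


def sniff_dialect_options(sample, candidates=",;\t|"):
    """Guess dialect options from a text sample (CSV "sniffing").

    Hypothetical helper: returns a plain dict so the result can be
    stored or published next to the data file.
    """
    dialect = csv.Sniffer().sniff(sample, delimiters=candidates)
    return {
        "delimiter": dialect.delimiter,
        "quotechar": dialect.quotechar,
        "doublequote": dialect.doublequote,
        "skipinitialspace": dialect.skipinitialspace,
    }


def read_with_dialect(text, dialect_options):
    """Parse CSV text using explicitly supplied dialect options."""
    return list(csv.reader(io.StringIO(text, newline=""), **dialect_options))


sample = "name;age\nalice;30\nbob;25\n"
options = sniff_dialect_options(sample)   # inference step
rows = read_with_dialect(sample, options) # parsing step, dialect given
```

Keeping the two steps apart means a publisher can state the dialect up
front and a consumer only falls back to sniffing when no description exists.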

Rufus


On 6 March 2014 13:54, Alf Eaton <eaton.alf@gmail.com> wrote:

> It occurred to me that there are essentially 3 levels of "strictness"
> that a parser might need to support:
>
> 1. "Strict": no options for the parser, all the encoding, escaping and
> delimiter options are fixed (similar to JSON).
> 2. "Intermediate": dialect options can be given to the parser (or it
> may use auto-detection), allowing different encodings, delimiters,
> enclosures, escape characters and header rows/columns (similar to how
> most CSV encoders and parsers currently work).
> 3. "Liberal": the parser may need extra options to be able to generate
> clean data, such as removing leading/trailing rows or columns,
> trimming whitespace, converting date formats, etc (similar to how some
> more complex CSV parsers currently work).
>
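
The three levels above could be sketched roughly as follows in Python with
the standard csv module. This is only an illustration under my own
assumptions: the function names and the skip_leading_rows,
skip_trailing_rows and trim parameters are hypothetical, and date conversion
is left out.

```python
import csv
import io

# Level 1, "strict": every option is fixed and callers cannot change it.
STRICT_OPTIONS = {
    "delimiter": ",",
    "quotechar": '"',
    "doublequote": True,
    "strict": True,  # csv module flag: raise csv.Error on malformed input
}


def parse_strict(text):
    return list(csv.reader(io.StringIO(text, newline=""), **STRICT_OPTIONS))


def parse_intermediate(text, **dialect_options):
    """Level 2: the caller supplies dialect options such as delimiter."""
    return list(csv.reader(io.StringIO(text, newline=""), **dialect_options))


def parse_liberal(text, skip_leading_rows=0, skip_trailing_rows=0,
                  trim=True, **dialect_options):
    """Level 3: dialect options plus clean-up of the parsed rows."""
    rows = parse_intermediate(text, **dialect_options)
    end = max(skip_leading_rows, len(rows) - skip_trailing_rows)
    rows = rows[skip_leading_rows:end]
    if trim:
        rows = [[cell.strip() for cell in row] for row in rows]
    return rows


messy = "Report 2014\n name ; age \n alice ; 30 \nTotal;1\n"
clean = parse_liberal(messy, skip_leading_rows=1, skip_trailing_rows=1,
                      delimiter=";")
```

Each level wraps the one before it, so the clean-up options stay separate
from the dialect options.
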
> I've attempted to write these up and include some examples:
> https://github.com/hubgit/csvw/wiki/CSV-Strictness
>
> It may be that the syntax document should aim for the strictest of
> these, as a recommendation for publishing data as CSV, but then
> describe the options that a parser would need in order to be more
> liberal and handle existing CSV files.
>
> Alf
>
>
> On 5 March 2014 23:49, Gregg Kellogg <gregg@greggkellogg.net> wrote:
> > On Mar 5, 2014, at 3:48 PM, Jeni Tennison <jeni@jenitennison.com> wrote:
> >
> >> Alf,
> >>
> >> See http://w3c.github.io/csvw/syntax/#parsing. Please add yourself as
> >> an editor and feel free to edit that content.
> >>
> >> What it really needs to do better is link back to describe the creation
> >> of the tabular data model that's described in the earlier section. Note
> >> that that model doesn't contain anything about indexes of columns or rows,
> >> so I have left that out of the parsing description too.
> >
> > If we define a syntax, we probably also want EBNF to describe it. This
> > is a simple first-cut at that:
> >
> > # EBNF description of CSV+
> > [1] csv       ::= header record+
> > [2] header    ::= record
> > [3] record    ::= fields ("\r\n" | "\n")
> > [4] fields    ::= field ("," fields)*
> > [5] field     ::= WS* rawfield WS*
> > [6] rawfield  ::= '"' QCHAR* '"'
> >                 | SCHAR*
> > [7] QCHAR     ::= [^"] | '""'
> > [8] SCHAR     ::= [^",\r\n]
> > [9] WS        ::= [ \t]
> >
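
For anyone who wants to experiment with the grammar above, here is a small
hand-written Python parser that follows its productions. It is a sketch
under assumptions of mine, not part of the proposal: the function names are
hypothetical, whitespace around unquoted fields is stripped (the grammar is
ambiguous there, since SCHAR also matches spaces and tabs), a missing
terminator on the final record is tolerated, and the mandatory header is not
enforced.

```python
def parse_csv(text):
    """Return a list of records, each a list of field strings."""
    records, pos = [], 0
    while pos < len(text):
        fields, pos = _parse_record(text, pos)
        records.append(fields)
    return records


def _parse_record(text, pos):
    # record ::= fields ("\r\n" | "\n")
    fields = []
    while True:
        value, pos = _parse_field(text, pos)
        fields.append(value)
        if pos < len(text) and text[pos] == ",":
            pos += 1
            continue
        break
    if text.startswith("\r\n", pos):
        pos += 2
    elif text.startswith("\n", pos):
        pos += 1
    elif pos < len(text):
        raise ValueError(f"unexpected character at offset {pos}")
    return fields, pos


def _skip_ws(text, pos):
    # WS ::= [ \t]
    while pos < len(text) and text[pos] in " \t":
        pos += 1
    return pos


def _parse_field(text, pos):
    # field ::= WS* rawfield WS*
    pos = _skip_ws(text, pos)
    if pos < len(text) and text[pos] == '"':
        pos += 1
        chars = []
        while True:
            if pos >= len(text):
                raise ValueError("unterminated quoted field")
            if text.startswith('""', pos):  # escaped quote inside QCHAR*
                chars.append('"')
                pos += 2
            elif text[pos] == '"':          # closing quote
                pos += 1
                break
            else:
                chars.append(text[pos])
                pos += 1
        return "".join(chars), _skip_ws(text, pos)
    start = pos
    while pos < len(text) and text[pos] not in '",\r\n':
        pos += 1
    return text[start:pos].rstrip(" \t"), pos


records = parse_csv('a, b ,"c ""x"", d"\r\n1,2,3\n')
```

As noted below for the grammar itself, nothing here checks that every
record has the same number of fields.
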
> > Of course, it can't do field counting. We should probably place further
> > restrictions on QCHAR and SCHAR to avoid control characters. If header
> > weren't optional, it would be better defined as in RFC4180, but if the
> > syntax allows it to be optional, this would make it not an LL(1) grammar,
> > which isn't too much of an issue.
> >
> > If people feel that we should stick closer to the RFC4180 grammar, it
> > might just be a matter of loosening CRLF and adding the non-ASCII Unicode
> > range to TEXTDATA, although expressed as W3C EBNF.
> >
> > If there's general consensus on adding this, I'm happy to put it in the
> > spec.
> >
> > Gregg
> >
> >> Jeni
> >>
> >> ------------------------------------------------------
> >> From: Alf Eaton eaton.alf@gmail.com
> >> Reply: Alf Eaton eaton.alf@gmail.com
> >> Date: 5 March 2014 at 18:32:01
> >> To: public-csv-wg@w3.org public-csv-wg@w3.org
> >> Subject:  CSV parser specification?
> >>
> >>>
> >>> Are there any plans to write a specification for a CSV parser that
> >>> would cover all the kinds of files described in the use cases?
> >>>
> >>> I had a go at an outline today[1], in an attempt to organise my
> >>> thoughts about which parameters would be useful to a parser at which
> >>> points during the process.
> >>>
> >>> pandas[2] is the closest tool I've found that incorporates most or all
> >>> of these (particularly the generation of "multi-index" keys using
> >>> multiple header rows and index columns), though it also includes a lot
> >>> of parameters that are only relevant to parsing/transforming the
> >>> values of each cell, which I think should probably be in a separate
> >>> step.
> >>>
> >>> Alf
> >>>
> >>> [1] https://github.com/hubgit/csvw/wiki/CSV-Parsing
> >>> [2] http://pandas.pydata.org/pandas-docs/stable/io.html#io-read-csv-table
> >>>
> >>>
> >>>
> >>
> >> --
> >> Jeni Tennison
> >> http://www.jenitennison.com/
> >>
> >
>
>


-- 


Rufus Pollock
Founder and CEO | skype: rufuspollock | @rufuspollock <https://twitter.com/rufuspollock>
The Open Knowledge Foundation <http://okfn.org/>
Empowering through Open Knowledge
http://okfn.org/ | @okfn <http://twitter.com/OKFN> | OKF on Facebook <https://www.facebook.com/OKFNetwork> | Blog <http://blog.okfn.org/> | Newsletter <http://okfn.org/about/newsletter>
Received on Thursday, 6 March 2014 14:04:35 UTC