Re: CSV parser specification? from Alf Eaton on 2014-03-06 (public-csv-wg@w3.org from March 2014)

From: Alf Eaton <eaton.alf@gmail.com>
Date: Thu, 6 Mar 2014 14:12:33 +0000
To: Rufus Pollock <rufus.pollock@okfn.org>
Cc: "public-csv-wg@w3.org" <public-csv-wg@w3.org>
Message-ID: <CAJVrAaQQYpBYbE-1nxLsmzRfMZg2MN3Gea+-uf=n_=15YCTAfg@mail.gmail.com>
Yes, indeed - my starting point has basically been to take that
specification and try to find things that it doesn't cover.

The main things I've found have been backslash escaping (supported by
PHP's CSV parser, for example), trimming trailing space (supported by
some CSV parsers), and multiple header rows/columns (supported by
_pandas_, at least).
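
As a rough illustration of the first two of those, here is a sketch using
Python's standard csv module (not one of the parsers named above). The
sample data and the parse_with_backslash_escapes helper are made up for
this example. Backslash escaping can be expressed as a dialect option,
whereas trailing-space trimming has to be a post-processing step, since
the module only offers skipinitialspace for leading space.

```python
import csv
import io

# Hypothetical sample: backslash-escaped quotes and trailing spaces in fields.
SAMPLE = 'name,comment\r\nalice  ,"she said \\"hi\\""\r\nbob,plain text   \r\n'


def parse_with_backslash_escapes(text):
    """Parse CSV text where quotes are escaped as \\" rather than doubled."""
    reader = csv.reader(
        io.StringIO(text, newline=""),
        escapechar="\\",    # backslash escaping, as in PHP's CSV functions
        doublequote=False,  # a doubled quote is no longer treated as an escape
    )
    # Trailing-space trimming is not a csv-module option, so do it per cell.
    return [[cell.rstrip() for cell in row] for row in reader]


rows = parse_with_backslash_escapes(SAMPLE)
# rows == [['name', 'comment'], ['alice', 'she said "hi"'], ['bob', 'plain text']]
```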

That's not counting all the options that some parsers have for doing
data conversion/transformation, which is a separate issue…

Alf

On 6 March 2014 14:04, Rufus Pollock <rufus.pollock@okfn.org> wrote:
> I note there is this existing "mini-spec" for describing CSV dialects that
> may be useful here:
>
> http://dataprotocols.org/csv-dialect/
>
> Both inference of given CSV structure (csv "sniffing") and validation would,
> of course, be separate issues.
>
> Rufus
>
>
> On 6 March 2014 13:54, Alf Eaton <eaton.alf@gmail.com> wrote:
>>
>> It occurred to me that there are essentially 3 levels of "strictness"
>> that a parser might need to support:
>>
>> 1. "Strict": no options for the parser, all the encoding, escaping and
>> delimiter options are fixed (similar to JSON).
>> 2. "Intermediate": dialect options can be given to the parser (or it
>> may use auto-detection), allowing different encodings, delimiters,
>> enclosures, escape characters and header rows/columns (similar to how
>> most CSV encoders and parsers currently work).
>> 3. "Liberal": the parser may need extra options to be able to generate
>> clean data, such as removing leading/trailing rows or columns,
>> trimming whitespace, converting date formats, etc (similar to how some
>> more complex CSV parsers currently work).
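
One possible way to picture the three levels is as three entry points with
progressively more options. This is only a sketch on top of Python's
standard csv module; the function names, option names and sample data are
all hypothetical, and encoding is assumed to have been handled before the
text reaches the parser.

```python
import csv
import io


def read_rows(text, **fmt):
    """Run the stdlib reader over already-decoded text."""
    return list(csv.reader(io.StringIO(text, newline=""), **fmt))


def parse_strict(text):
    # Level 1: nothing is configurable (comma, double quote, doubled-quote escaping).
    return read_rows(text, delimiter=",", quotechar='"', doublequote=True, strict=True)


def parse_intermediate(text, header_rows=1, **dialect):
    # Level 2: the caller supplies dialect options such as delimiter or escapechar.
    rows = read_rows(text, **dialect)
    return rows[:header_rows], rows[header_rows:]


def parse_liberal(text, skip_leading=0, skip_trailing=0, trim=True,
                  header_rows=1, **dialect):
    # Level 3: clean-up options applied on top of the dialect options.
    rows = read_rows(text, **dialect)
    rows = rows[skip_leading:len(rows) - skip_trailing]
    if trim:
        rows = [[cell.strip() for cell in row] for row in rows]
    return rows[:header_rows], rows[header_rows:]


# Made-up file with a title row, a totals row and padded cells.
messy = "Report 2014\r\nid;name\r\n1; Ann \r\n2;Bob\r\nTotal;2\r\n"
header, data = parse_liberal(messy, skip_leading=1, skip_trailing=1, delimiter=";")
```

Here header comes back as a list of header rows and data as the remaining
rows; date conversion and similar value-level transformations are left out
on purpose.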
>>
>> I've attempted to write these up and include some examples:
>> https://github.com/hubgit/csvw/wiki/CSV-Strictness
>>
>> It may be that the syntax document should aim for the strictest of
>> these, as a recommendation for publishing data as CSV, but then
>> describe the options that a parser would need in order to be more
>> liberal and handle existing CSV files.
>>
>> Alf
>>
>>
>> On 5 March 2014 23:49, Gregg Kellogg <gregg@greggkellogg.net> wrote:
>> > On Mar 5, 2014, at 3:48 PM, Jeni Tennison <jeni@jenitennison.com> wrote:
>> >
>> >> Alf,
>> >>
>> >> See http://w3c.github.io/csvw/syntax/#parsing. Please add yourself as
>> >> an editor and feel free to edit that content.
>> >>
>> >> What it really needs to do better is link back to describe the creation
>> >> of the tabular data model that’s described in the earlier section. Note that
>> >> that model doesn’t contain anything about indexes of columns or rows, so I
>> >> have left that out of the parsing description too.
>> >
>> > If we define a syntax, we probably also want EBNF to describe it. This
>> > is a simple first-cut at that:
>> >
>> > # EBNF description of CSV+
>> > [1] csv       ::= header record+
>> > [2] header    ::= record
>> > [3] record    ::= fields ("\r\n" | "\n")
>> > [4] fields    ::= field ("," fields)*
>> > [5] field     ::= WS* rawfield WS*
>> > [6] rawfield  ::= '"' QCHAR* '"'
>> >                 | SCHAR*
>> > [7] QCHAR     ::= [^"] | '""'
>> > [8] SCHAR     ::= [^",\r\n]
>> > [9] WS        ::= [ \t]
>> >
>> > Of course, it can’t do field counting. We should probably place further
>> > restrictions on QCHAR and SCHAR to avoid control characters. If header
>> > weren’t optional, it would be better defined as in RFC4180, but if the
>> > syntax allows it to be optional, this would make it not an LL(1) grammar,
>> > which isn’t too much of an issue.
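
The grammar above can be turned into a small hand-written parser. The
following is only a sketch under a few assumptions that the grammar itself
leaves open: whitespace around an unquoted field is dropped (WS and SCHAR
overlap, so this is one reading among several), a doubled quote inside a
quoted field is unescaped to a single quote, and, as noted, no field
counting is done. The parse_csv_plus name and the sample text are made up.

```python
import re

# field ::= WS* rawfield WS*, with rawfield either quoted or a run of SCHAR.
FIELD = re.compile(r'[ \t]*(?:"((?:[^"]|"")*)"|([^",\r\n]*))[ \t]*')
LINE_END = re.compile(r"\r\n|\n")


def parse_csv_plus(text):
    """Return (header, records) for text matching csv ::= header record+."""
    pos, records = 0, []
    while pos < len(text):
        record = []
        while True:
            m = FIELD.match(text, pos)  # always matches; a field may be empty
            quoted, plain = m.group(1), m.group(2)
            if quoted is not None:
                record.append(quoted.replace('""', '"'))
            else:
                record.append(plain.rstrip(" \t"))
            pos = m.end()
            if text.startswith(",", pos):
                pos += 1  # consume the separator and read the next field
                continue
            break
        end = LINE_END.match(text, pos)
        if end is None:
            raise ValueError(f"expected ',' or a line ending at offset {pos}")
        pos = end.end()
        records.append(record)
    if len(records) < 2:
        raise ValueError("the grammar requires a header plus at least one record")
    return records[0], records[1:]


sample = 'id, name ,note\r\n1,"Smith, J.","said ""hi"""\n2,  Bob\t,\n'
header, records = parse_csv_plus(sample)
```

Both line endings are accepted within the same input, matching rule [3].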
>> >
>> > If people feel that we should stick closer to the RFC4180 grammar, it
>> > might just be a matter of loosening CRLF and adding the non-ASCII Unicode
>> > range to TEXTDATA, although expressed as W3C EBNF.
>> >
>> > If there’s general consensus to adding this, I’m happy to put it in the
>> > spec.
>> >
>> > Gregg
>> >
>> >> Jeni
>> >>
>> >> ------------------------------------------------------
>> >> From: Alf Eaton eaton.alf@gmail.com
>> >> Reply: Alf Eaton eaton.alf@gmail.com
>> >> Date: 5 March 2014 at 18:32:01
>> >> To: public-csv-wg@w3.org public-csv-wg@w3.org
>> >> Subject:  CSV parser specification?
>> >>
>> >>>
>> >>> Are there any plans to write a specification for a CSV parser, that
>> >>> would cover all the kinds of files described in the use cases?
>> >>>
>> >>> I had a go at an outline today[1], in an attempt to organise my
>> >>> thoughts about which parameters would be useful to a parser at which
>> >>> points during the process.
>> >>>
>> >>> pandas[2] is the closest tool I've found that incorporates most or all
>> >>> of these (particularly the generation of "multi-index" keys using
>> >>> multiple header rows and index columns), though it also includes a lot
>> >>> of parameters that are only relevant to parsing/transforming the
>> >>> values of each cell, which I think should probably be in a separate
>> >>> step.
>> >>>
>> >>> Alf
>> >>>
>> >>> [1] https://github.com/hubgit/csvw/wiki/CSV-Parsing
>> >>> [2] http://pandas.pydata.org/pandas-docs/stable/io.html#io-read-csv-table
>> >>>
>> >>>
>> >>>
>> >>
>> >> --
>> >> Jeni Tennison
>> >> http://www.jenitennison.com/
>> >>
>> >
>>
>
>
>
> --
>
> Rufus Pollock
>
> Founder and CEO | skype: rufuspollock | @rufuspollock
>
> The Open Knowledge Foundation
>
> Empowering through Open Knowledge
>
> http://okfn.org/ | @okfn
Received on Thursday, 6 March 2014 14:13:25 UTC