- From: Alf Eaton <eaton.alf@gmail.com>
- Date: Mon, 10 Mar 2014 14:50:21 +0000
- To: Rufus Pollock <rufus.pollock@okfn.org>
- Cc: "public-csv-wg@w3.org" <public-csv-wg@w3.org>
Yes, skipInitialSpace covers removing initial space (it's not clear
whether it means a single character or all the whitespace, but it seems
like the latter). I have seen parsers offer options to trim space from
the start, end or both ends of the field - presumably having whitespace
at the other end of the value is less common, but someone needed it at
some point. There's also the difference between trimming (removing all
the whitespace; probably more useful for hand-edited CSV files) and
just removing a single character (more likely to be an actual
dialect/export option in some tools).

Alf

On 10 March 2014 13:04, Rufus Pollock <rufus.pollock@okfn.org> wrote:
> Great stuff Alf.
>
> I note that trimming trailing (or leading) space is the purpose of the
> "skipInitialSpace" attribute in the CSV Dialect Description Format
> spec.
>
> Rufus
>
> On 6 March 2014 14:12, Alf Eaton <eaton.alf@gmail.com> wrote:
>>
>> Yes, indeed - my starting point has basically been to take that
>> specification and try to find things that it doesn't cover.
>>
>> The main things I've found have been backslash escaping (supported by
>> PHP's CSV parser, for example), trimming trailing space (supported by
>> some CSV parsers), and multiple header rows/columns (supported by
>> _pandas_, at least).
>>
>> That's not counting all the options that some parsers have for doing
>> data conversion/transformation, which is a separate issue…
>>
>> Alf
>>
>> On 6 March 2014 14:04, Rufus Pollock <rufus.pollock@okfn.org> wrote:
>> > I note there is this existing "mini-spec" for describing CSV
>> > dialects that may be useful here:
>> >
>> > http://dataprotocols.org/csv-dialect/
>> >
>> > Both inference of given CSV structure (CSV "sniffing") and
>> > validation would, of course, be separate issues.
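For comparison with existing tools, Python's built-in csv module treats
this exactly as a dialect option: skipinitialspace removes the run of
spaces following each delimiter (all of them, not a single character),
while trimming at the other end of the value has to be done as a
post-processing step. A minimal sketch:

```python
import csv
import io

# skipinitialspace drops the spaces that follow each delimiter -
# note that it removes the whole run, not just one character
row = next(csv.reader(io.StringIO("a,  b,   c"), skipinitialspace=True))

# trimming trailing (or both-end) whitespace is not a dialect option
# in this parser, so it becomes a separate post-processing step
trimmed = [field.strip() for field in next(csv.reader(io.StringIO("a , b ")))]
```

The strip() step corresponds to the "trimming" behaviour discussed
above, which here lives outside the dialect entirely.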
>> >
>> > Rufus
>> >
>> > On 6 March 2014 13:54, Alf Eaton <eaton.alf@gmail.com> wrote:
>> >>
>> >> It occurred to me that there are essentially 3 levels of
>> >> "strictness" that a parser might need to support:
>> >>
>> >> 1. "Strict": no options for the parser; all the encoding, escaping
>> >> and delimiter options are fixed (similar to JSON).
>> >> 2. "Intermediate": dialect options can be given to the parser (or
>> >> it may use auto-detection), allowing different encodings,
>> >> delimiters, enclosures, escape characters and header rows/columns
>> >> (similar to how most CSV encoders and parsers currently work).
>> >> 3. "Liberal": the parser may need extra options to be able to
>> >> generate clean data, such as removing leading/trailing rows or
>> >> columns, trimming whitespace, converting date formats, etc.
>> >> (similar to how some more complex CSV parsers currently work).
>> >>
>> >> I've attempted to write these up and include some examples:
>> >> https://github.com/hubgit/csvw/wiki/CSV-Strictness
>> >>
>> >> It may be that the syntax document should aim for the strictest of
>> >> these, as a recommendation for publishing data as CSV, but then
>> >> describe the options that a parser would need in order to be more
>> >> liberal and handle existing CSV files.
>> >>
>> >> Alf
>> >>
>> >> On 5 March 2014 23:49, Gregg Kellogg <gregg@greggkellogg.net> wrote:
>> >> > On Mar 5, 2014, at 3:48 PM, Jeni Tennison <jeni@jenitennison.com>
>> >> > wrote:
>> >> >
>> >> >> Alf,
>> >> >>
>> >> >> See http://w3c.github.io/csvw/syntax/#parsing. Please add
>> >> >> yourself as an editor and feel free to edit that content.
>> >> >>
>> >> >> What it really needs to do better is link back to describe the
>> >> >> creation of the tabular data model that’s described in the
>> >> >> earlier section.
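The "intermediate" level described above corresponds closely to how
Python's csv module works: the dialect is either passed explicitly or
inferred with csv.Sniffer, an existing implementation of the CSV
"sniffing" that the thread treats as a separate issue. A sketch using
an invented semicolon-delimited sample:

```python
import csv
import io

data = 'name;comment\nalice;"semi;colons, inside"\n'

# level 2 ("intermediate"): the dialect options are given explicitly
rows = list(csv.reader(io.StringIO(data), delimiter=';', quotechar='"'))

# or the dialect is auto-detected ("sniffed") from a sample
dialect = csv.Sniffer().sniff(data)
rows_sniffed = list(csv.reader(io.StringIO(data), dialect))
```

Sniffing is heuristic, so it can fail or guess wrongly on small or
unusual samples - one reason to keep it out of the core parsing model.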
>> >> >> Note that that model doesn’t contain anything about indexes of
>> >> >> columns or rows, so I have left that out of the parsing
>> >> >> description too.
>> >> >
>> >> > If we define a syntax, we probably also want EBNF to describe it.
>> >> > This is a simple first cut at that:
>> >> >
>> >> > # EBNF description of CSV+
>> >> > [1] csv      ::= header record+
>> >> > [2] header   ::= record
>> >> > [3] record   ::= fields ("\r\n" | "\n")
>> >> > [4] fields   ::= field ("," field)*
>> >> > [5] field    ::= WS* rawfield WS*
>> >> > [6] rawfield ::= '"' QCHAR* '"'
>> >> >              |   SCHAR*
>> >> > [7] QCHAR    ::= [^"] | '""'
>> >> > [8] SCHAR    ::= [^",\r\n]
>> >> > [9] WS       ::= [ \t]
>> >> >
>> >> > Of course, it can’t do field counting. We should probably place
>> >> > further restrictions on QCHAR and SCHAR to avoid control
>> >> > characters. If header weren’t optional, it would be better
>> >> > defined as in RFC 4180, but if the syntax allows it to be
>> >> > optional, this would make it not an LL(1) grammar, which isn’t
>> >> > too much of an issue.
>> >> >
>> >> > If people feel that we should stick closer to the RFC 4180
>> >> > grammar, it might just be a matter of loosening CRLF and adding
>> >> > the non-ASCII Unicode range to TEXTDATA, although expressed as
>> >> > W3C EBNF.
>> >> >
>> >> > If there’s general consensus on adding this, I’m happy to put it
>> >> > in the spec.
>> >> >
>> >> > Gregg
>> >> >
>> >> >> Jeni
>> >> >>
>> >> >> ------------------------------------------------------
>> >> >> From: Alf Eaton <eaton.alf@gmail.com>
>> >> >> Reply: Alf Eaton <eaton.alf@gmail.com>
>> >> >> Date: 5 March 2014 at 18:32:01
>> >> >> To: public-csv-wg@w3.org <public-csv-wg@w3.org>
>> >> >> Subject: CSV parser specification?
>> >> >>
>> >> >>> Are there any plans to write a specification for a CSV parser
>> >> >>> that would cover all the kinds of files described in the use
>> >> >>> cases?
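The field-level productions of Gregg's grammar (fields, field,
rawfield, QCHAR, SCHAR, WS) translate almost mechanically into a
regular expression, which makes them easy to sanity-check. This is only
a sketch for a single record: the record/header rules and the suggested
control-character restrictions are omitted, and parse_record is an
illustrative name, not anything from the spec:

```python
import re

# field ::= WS* rawfield WS*, where rawfield is either a quoted field
# (QCHAR allows any non-quote character or the "" escape) or a run of
# SCHARs; the lookahead anchors each match at a comma or end of record
FIELD = re.compile(
    r'[ \t]*(?:"(?P<q>(?:[^"]|"")*)"|(?P<s>[^",\r\n]*?))[ \t]*(?=,|$)')

def parse_record(line):
    """Split one record (without its terminator) into field values."""
    fields, pos = [], 0
    while True:
        m = FIELD.match(line, pos)
        if m is None:
            # e.g. a stray quote inside an unquoted field, which the
            # grammar's SCHAR rule also rejects
            raise ValueError("malformed record")
        q = m.group('q')
        fields.append(q.replace('""', '"') if q is not None else m.group('s'))
        pos = m.end()
        if pos == len(line):
            return fields
        pos += 1  # consume the ","
```

Note how the WS* on both sides of rawfield gives exactly the
leading/trailing trimming behaviour discussed earlier in the thread.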
>> >> >>>
>> >> >>> I had a go at an outline today[1], in an attempt to organise
>> >> >>> my thoughts about which parameters would be useful to a parser
>> >> >>> at which points during the process.
>> >> >>>
>> >> >>> pandas[2] is the closest tool I've found that incorporates
>> >> >>> most or all of these (particularly the generation of
>> >> >>> "multi-index" keys using multiple header rows and index
>> >> >>> columns), though it also includes a lot of parameters that are
>> >> >>> only relevant to parsing/transforming the values of each cell,
>> >> >>> which I think should probably be in a separate step.
>> >> >>>
>> >> >>> Alf
>> >> >>>
>> >> >>> [1] https://github.com/hubgit/csvw/wiki/CSV-Parsing
>> >> >>> [2] http://pandas.pydata.org/pandas-docs/stable/io.html#io-read-csv-table
>> >> >>
>> >> >> --
>> >> >> Jeni Tennison
>> >> >> http://www.jenitennison.com/
>> >
>> > --
>> > Rufus Pollock
>> > Founder and CEO | skype: rufuspollock | @rufuspollock
>> > The Open Knowledge Foundation
>> > Empowering through Open Knowledge
>> > http://okfn.org/ | @okfn | OKF on Facebook | Blog | Newsletter
>
> --
> Rufus Pollock
> Founder and CEO | skype: rufuspollock | @rufuspollock
> The Open Knowledge Foundation
> Empowering through Open Knowledge
> http://okfn.org/ | @okfn | OKF on Facebook | Blog | Newsletter
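The "multi-index" keys mentioned for pandas (read_csv with
header=[0, 1]) can be approximated with the standard library by zipping
multiple header rows into tuple keys. A sketch using invented data:

```python
import csv
import io

# made-up sample with two header rows, pandas-style (header=[0, 1])
data = ("year,sales,sales\n"
        ",Q1,Q2\n"
        "2013,10,20\n"
        "2014,30,40\n")

rows = list(csv.reader(io.StringIO(data)))
header_rows, body = rows[:2], rows[2:]

# zip the header rows column-wise into tuple ("multi-index") keys
columns = [tuple(col) for col in zip(*header_rows)]
records = [dict(zip(columns, row)) for row in body]
```

Cell values stay as strings here; converting them is exactly the
separate parsing/transformation step the thread argues for.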
Received on Monday, 10 March 2014 14:51:10 UTC