Re: CSV parser specification?

It occurred to me that there are essentially 3 levels of "strictness"
that a parser might need to support:

1. "Strict": no options for the parser, all the encoding, escaping and
delimiter options are fixed (similar to JSON).
2. "Intermediate": dialect options can be given to the parser (or it
may use auto-detection), allowing different encodings, delimiters,
enclosures, escape characters and header rows/columns (similar to how
most CSV encoders and parsers currently work).
3. "Liberal": the parser may need extra options to be able to generate
clean data, such as removing leading/trailing rows or columns,
trimming whitespace, converting date formats, etc (similar to how some
more complex CSV parsers currently work).

I've attempted to write these up and include some examples:
https://github.com/hubgit/csvw/wiki/CSV-Strictness
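
To make this concrete, here's a rough sketch of what each level might
look like as calls to pandas' read_csv (the file names and the 'date'
column are invented for illustration; the exact options would depend
on the dialect):

import io
import pandas as pd

# Level 1, "strict": no options; the dialect is fixed, like JSON.
strict = pd.read_csv(io.StringIO('a,b\n1,2\n'))

# Level 2, "intermediate": dialect options for encoding, delimiter,
# enclosure, and header/index rows and columns.
intermediate = pd.read_csv('data.csv',
                           encoding='windows-1252',
                           sep=';',
                           quotechar="'",
                           header=[0, 1],  # two header rows -> multi-index
                           index_col=0)

# Level 3, "liberal": clean-up options on top of the dialect.
liberal = pd.read_csv('messy.csv',
                      engine='python',        # needed for skipfooter
                      skiprows=3,             # drop leading junk rows
                      skipfooter=1,           # drop a trailing totals row
                      skipinitialspace=True,  # trim whitespace after delimiters
                      parse_dates=['date'],   # convert a date column
                      dayfirst=True)

The level 3 options are where a "dialect" shades into a transformation
pipeline, which is part of why I think the cell-level conversions
belong in a separate step.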

It may be that the syntax document should aim for the strictest of
these, as a recommendation for publishing data as CSV, but then
describe the options that a parser would need in order to be more
liberal and handle existing CSV files.

Alf


On 5 March 2014 23:49, Gregg Kellogg <gregg@greggkellogg.net> wrote:
> On Mar 5, 2014, at 3:48 PM, Jeni Tennison <jeni@jenitennison.com> wrote:
>
>> Alf,
>>
>> See http://w3c.github.io/csvw/syntax/#parsing. Please add yourself as an editor and feel free to edit that content.
>>
>> What it really needs to do better is link back to describe the creation of the tabular data model that’s described in the earlier section. Note that that model doesn’t contain anything about indexes of columns or rows, so I have left that out of the parsing description too.
>
> If we define a syntax, we probably also want EBNF to describe it. This is a simple first cut at that:
>
> # EBNF description of CSV+
> [1] csv       ::= header record+
> [2] header    ::= record
> [3] record    ::= fields ("\r\n" | "\n")
> [4] fields    ::= field ("," field)*
> [5] field     ::= WS* rawfield WS*
> [6] rawfield  ::= '"' QCHAR* '"'
>                 | SCHAR*
> [7] QCHAR     ::= [^"] | '""'
> [8] SCHAR     ::= [^",\r\n]
> [9] WS        ::= [ \t]
>
> Of course, it can’t do field counting. We should probably place further restrictions on QCHAR and SCHAR to avoid control characters. If the header weren’t optional, it would be better defined as in RFC4180, but if the syntax allows it to be optional, the grammar would no longer be LL(1), which isn’t too much of an issue.
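>
> As a quick sanity check, a line like
>
>   "a ""quoted"" field",plain
>
> (followed by a newline) should match [3] as two fields: the first a
> quoted rawfield whose QCHARs include the escaped '""', giving the
> value a "quoted" field, and the second an unquoted SCHAR* rawfield
> giving plain.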
>
> If people feel that we should stick closer to the RFC4180 grammar, it might just be a matter of loosening CRLF and adding the non-ASCII Unicode range to TEXTDATA, expressed as W3C EBNF.
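>
> For example, something like:
>
>   TEXTDATA ::= [#x20-#x21] | [#x23-#x2B] | [#x2D-#x7E] | [#x80-#x10FFFF]
>
> i.e. the RFC4180 ranges plus a non-ASCII range.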
>
> If there’s general consensus on adding this, I’m happy to put it in the spec.
>
> Gregg
>
>> Jeni
>>
>> ------------------------------------------------------
>> From: Alf Eaton eaton.alf@gmail.com
>> Reply: Alf Eaton eaton.alf@gmail.com
>> Date: 5 March 2014 at 18:32:01
>> To: public-csv-wg@w3.org public-csv-wg@w3.org
>> Subject:  CSV parser specification?
>>
>>>
>>> Are there any plans to write a specification for a CSV parser that
>>> would cover all the kinds of files described in the use cases?
>>>
>>> I had a go at an outline today[1], in an attempt to organise my
>>> thoughts about which parameters would be useful to a parser at which
>>> points during the process.
>>>
>>> pandas[2] is the closest tool I've found that incorporates most or
>>> all of these (particularly the generation of "multi-index" keys
>>> using multiple header rows and index columns), though it also
>>> includes a lot of parameters that are only relevant to
>>> parsing/transforming the values of each cell, which I think should
>>> probably be in a separate step.
>>>
>>> Alf
>>>
>>> [1] https://github.com/hubgit/csvw/wiki/CSV-Parsing
>>> [2] http://pandas.pydata.org/pandas-docs/stable/io.html#io-read-csv-table
>>>
>>
>> --
>> Jeni Tennison
>> http://www.jenitennison.com/
>>
>
