- From: Rufus Pollock <rufus.pollock@okfn.org>
- Date: Thu, 6 Mar 2014 14:04:07 +0000
- To: Alf Eaton <eaton.alf@gmail.com>
- Cc: "public-csv-wg@w3.org" <public-csv-wg@w3.org>
- Message-ID: <CAKssCpNXTvVoxTG73G+1zkAQcHBRmYUTzuy23FN1ecx05H=MRg@mail.gmail.com>
I note there is this existing "mini-spec" for describing CSV dialects that
may be useful here: http://dataprotocols.org/csv-dialect/

Both inference of a given CSV structure (CSV "sniffing") and validation
would, of course, be separate issues.

Rufus

On 6 March 2014 13:54, Alf Eaton <eaton.alf@gmail.com> wrote:
> It occurred to me that there are essentially 3 levels of "strictness"
> that a parser might need to support:
>
> 1. "Strict": no options for the parser; all the encoding, escaping and
> delimiter options are fixed (similar to JSON).
> 2. "Intermediate": dialect options can be given to the parser (or it
> may use auto-detection), allowing different encodings, delimiters,
> enclosures, escape characters and header rows/columns (similar to how
> most CSV encoders and parsers currently work).
> 3. "Liberal": the parser may need extra options to be able to generate
> clean data, such as removing leading/trailing rows or columns,
> trimming whitespace, converting date formats, etc. (similar to how some
> more complex CSV parsers currently work).
>
> I've attempted to write these up and include some examples:
> https://github.com/hubgit/csvw/wiki/CSV-Strictness
>
> It may be that the syntax document should aim for the strictest of
> these, as a recommendation for publishing data as CSV, but then
> describe the options that a parser would need in order to be more
> liberal and handle existing CSV files.
>
> Alf
>
>
> On 5 March 2014 23:49, Gregg Kellogg <gregg@greggkellogg.net> wrote:
> > On Mar 5, 2014, at 3:48 PM, Jeni Tennison <jeni@jenitennison.com> wrote:
> >
> >> Alf,
> >>
> >> See http://w3c.github.io/csvw/syntax/#parsing. Please add yourself as
> >> an editor and feel free to edit that content.
> >>
> >> What it really needs to do better is link back to describe the creation
> >> of the tabular data model that's described in the earlier section.
> >> Note that that model doesn't contain anything about indexes of columns
> >> or rows, so I have left that out of the parsing description too.
> >
> > If we define a syntax, we probably also want EBNF to describe it. This
> > is a simple first cut at that:
> >
> > # EBNF description of CSV+
> > [1] csv      ::= header record+
> > [2] header   ::= record
> > [3] record   ::= fields ("\r\n" | "\n")
> > [4] fields   ::= field ("," fields)*
> > [5] field    ::= WS* rawfield WS*
> > [6] rawfield ::= '"' QCHAR* '"'
> >              | SCHAR*
> > [7] QCHAR    ::= [^"] | '""'
> > [8] SCHAR    ::= [^",\r\n]
> > [9] WS       ::= [ \t]
> >
> > Of course, it can't do field counting. We should probably place further
> > restrictions on QCHAR and SCHAR to avoid control characters. If header
> > weren't optional, it would be better defined as in RFC4180; but if the
> > syntax allows it to be optional, this would make it not an LL(1) grammar,
> > which isn't too much of an issue.
> >
> > If people feel that we should stick closer to the RFC4180 grammar, it
> > might just be a matter of loosening CRLF and adding the non-ASCII
> > Unicode range to TEXTDATA, although expressed as W3C EBNF.
> >
> > If there's general consensus to adding this, I'm happy to put it in the
> > spec.
> >
> > Gregg
> >
> >> Jeni
> >>
> >> ------------------------------------------------------
> >> From: Alf Eaton eaton.alf@gmail.com
> >> Reply: Alf Eaton eaton.alf@gmail.com
> >> Date: 5 March 2014 at 18:32:01
> >> To: public-csv-wg@w3.org public-csv-wg@w3.org
> >> Subject: CSV parser specification?
> >>
> >>> Are there any plans to write a specification for a CSV parser, that
> >>> would cover all the kinds of files described in the use cases?
> >>>
> >>> I had a go at an outline today [1], in an attempt to organise my
> >>> thoughts about which parameters would be useful to a parser at which
> >>> points during the process.
> >>>
> >>> pandas [2] is the closest tool I've found that incorporates most or
> >>> all of these (particularly the generation of "multi-index" keys using
> >>> multiple header rows and index columns), though it also includes a lot
> >>> of parameters that are only relevant to parsing/transforming the
> >>> values of each cell, which I think should probably be in a separate
> >>> step.
> >>>
> >>> Alf
> >>>
> >>> [1] https://github.com/hubgit/csvw/wiki/CSV-Parsing
> >>> [2] http://pandas.pydata.org/pandas-docs/stable/io.html#io-read-csv-table
> >>
> >> --
> >> Jeni Tennison
> >> http://www.jenitennison.com/

--
Rufus Pollock
Founder and CEO | skype: rufuspollock | @rufuspollock <https://twitter.com/rufuspollock>
The Open Knowledge Foundation <http://okfn.org/>
Empowering through Open Knowledge
http://okfn.org/ | @okfn <http://twitter.com/OKFN> | OKF on Facebook <https://www.facebook.com/OKFNetwork> | Blog <http://blog.okfn.org/> | Newsletter <http://okfn.org/about/newsletter>
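Gregg's EBNF sketch above can be exercised with a small hand-rolled scanner. The following is an illustrative sketch only (it follows the grammar's rules by name: a field is optional whitespace around a rawfield, which is either a quoted run of QCHARs with `""` escapes or a bare run of SCHARs); a production parser such as Python's csv module handles far more (encodings, configurable dialects, error reporting):

```python
# Illustrative sketch of a scanner for the EBNF above; not a production parser.

def parse_csv(text):
    """Parse CSV text into a list of records (lists of field strings)."""
    rows, i, n = [], 0, len(text)
    while i < n:
        fields = []
        while True:
            while i < n and text[i] in " \t":        # field ::= WS* rawfield WS*
                i += 1
            if i < n and text[i] == '"':             # rawfield ::= '"' QCHAR* '"'
                i += 1
                buf = []
                while i < n:
                    if text[i] == '"':
                        if text.startswith('""', i):  # QCHAR: '""' escapes a quote
                            buf.append('"')
                            i += 2
                        else:
                            i += 1                    # closing quote
                            break
                    else:
                        buf.append(text[i])
                        i += 1
                field = "".join(buf)
                while i < n and text[i] in " \t":    # trailing WS*
                    i += 1
            else:                                    # rawfield ::= SCHAR*
                start = i
                while i < n and text[i] not in ',"\r\n':
                    i += 1
                field = text[start:i].rstrip(" \t")  # strip trailing WS*
            fields.append(field)
            if i < n and text[i] == ",":             # fields ::= field ("," fields)*
                i += 1
            else:
                break
        if text.startswith("\r\n", i):               # record ::= fields ("\r\n" | "\n")
            i += 2
        elif i < n and text[i] == "\n":
            i += 1
        rows.append(fields)
    return rows
```

For example, `parse_csv('a, "b,1" ,c\r\nd,e,f\n')` yields `[['a', 'b,1', 'c'], ['d', 'e', 'f']]`. As Gregg notes, the grammar alone cannot enforce a consistent field count per record; that remains a separate validation step.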
Received on Thursday, 6 March 2014 14:04:35 UTC
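The dialect "mini-spec" Rufus links to describes parser options as a small JSON-style object. As a hedged sketch of the idea (the property names below are illustrative approximations; consult the linked spec for the authoritative list), such a description maps naturally onto the dialect parameters of Python's csv module:

```python
# Hypothetical sketch: mapping a CSV Dialect-style description onto
# Python's csv module. Property names are illustrative, not authoritative.
import csv
import io

dialect_description = {
    "delimiter": ";",
    "quoteChar": '"',
    "doubleQuote": True,        # '""' inside a quoted field means a literal quote
    "skipInitialSpace": True,   # ignore whitespace immediately after the delimiter
}

def reader_from_dialect(text, d):
    """Build a csv.reader configured from a dialect description dict."""
    return csv.reader(
        io.StringIO(text),
        delimiter=d.get("delimiter", ","),
        quotechar=d.get("quoteChar", '"'),
        doublequote=d.get("doubleQuote", True),
        skipinitialspace=d.get("skipInitialSpace", False),
    )

rows = list(reader_from_dialect('a;"b;1";c\r\n', dialect_description))
# rows == [['a', 'b;1', 'c']]
```

The stdlib's csv.Sniffer provides the kind of dialect inference ("sniffing") that Rufus notes is a separate issue from describing a dialect explicitly.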