- From: Rufus Pollock <rufus.pollock@okfn.org>
- Date: Mon, 10 Mar 2014 13:04:19 +0000
- To: Alf Eaton <eaton.alf@gmail.com>
- Cc: "public-csv-wg@w3.org" <public-csv-wg@w3.org>
- Message-ID: <CAKssCpMKCT_whSJp8m6jqeROBuWMWjYipKp33rjC=6+FfqPp_Q@mail.gmail.com>
Great stuff Alf. I note that trimming trailing (or leading) space is the
purpose of the "skipInitialSpace" attribute in the CSV Dialect Description
Format spec <http://dataprotocols.org/csv-dialect/>.

Rufus

On 6 March 2014 14:12, Alf Eaton <eaton.alf@gmail.com> wrote:

> Yes, indeed - my starting point has basically been to take that
> specification and try to find things that it doesn't cover.
>
> The main things I've found have been backslash escaping (supported by
> PHP's CSV parser, for example), trimming trailing space (supported by
> some CSV parsers), and multiple header rows/columns (supported by
> _pandas_, at least).
>
> That's not counting all the options that some parsers have for doing
> data conversion/transformation, which is a separate issue...
>
> Alf
>
> On 6 March 2014 14:04, Rufus Pollock <rufus.pollock@okfn.org> wrote:
> > I note there is this existing "mini-spec" for describing CSV dialects
> > that may be useful here:
> >
> > http://dataprotocols.org/csv-dialect/
> >
> > Both inference of given CSV structure (csv "sniffing") and validation
> > would, of course, be separate issues.
> >
> > Rufus
> >
> > On 6 March 2014 13:54, Alf Eaton <eaton.alf@gmail.com> wrote:
> >>
> >> It occurred to me that there are essentially 3 levels of "strictness"
> >> that a parser might need to support:
> >>
> >> 1. "Strict": no options for the parser, all the encoding, escaping and
> >> delimiter options are fixed (similar to JSON).
> >> 2. "Intermediate": dialect options can be given to the parser (or it
> >> may use auto-detection), allowing different encodings, delimiters,
> >> enclosures, escape characters and header rows/columns (similar to how
> >> most CSV encoders and parsers currently work).
> >> 3. "Liberal": the parser may need extra options to be able to generate
> >> clean data, such as removing leading/trailing rows or columns,
> >> trimming whitespace, converting date formats, etc (similar to how some
> >> more complex CSV parsers currently work).
> >>
> >> I've attempted to write these up and include some examples:
> >> https://github.com/hubgit/csvw/wiki/CSV-Strictness
> >>
> >> It may be that the syntax document should aim for the strictest of
> >> these, as a recommendation for publishing data as CSV, but then
> >> describe the options that a parser would need in order to be more
> >> liberal and handle existing CSV files.
> >>
> >> Alf
> >>
> >> On 5 March 2014 23:49, Gregg Kellogg <gregg@greggkellogg.net> wrote:
> >> > On Mar 5, 2014, at 3:48 PM, Jeni Tennison <jeni@jenitennison.com> wrote:
> >> >
> >> >> Alf,
> >> >>
> >> >> See http://w3c.github.io/csvw/syntax/#parsing. Please add yourself as
> >> >> an editor and feel free to edit that content.
> >> >>
> >> >> What it really needs to do better is link back to describe the creation
> >> >> of the tabular data model that's described in the earlier section. Note
> >> >> that that model doesn't contain anything about indexes of columns or
> >> >> rows, so I have left that out of the parsing description too.
> >> >
> >> > If we define a syntax, we probably also want EBNF to describe it. This
> >> > is a simple first-cut at that:
> >> >
> >> > # EBNF description of CSV+
> >> > [1] csv      ::= header record+
> >> > [2] header   ::= record
> >> > [3] record   ::= fields ("\r\n" | "\n")
> >> > [4] fields   ::= field ("," fields)*
> >> > [5] field    ::= WS* rawfield WS*
> >> > [6] rawfield ::= '"' QCHAR* '"'
> >> >                | SCHAR*
> >> > [7] QCHAR    ::= [^"] | '""'
> >> > [8] SCHAR    ::= [^",\r\n]
> >> > [9] WS       ::= [ \t]
> >> >
> >> > Of course, it can't do field counting. We should probably place further
> >> > restrictions on QCHAR and SCHAR to avoid control characters. If header
> >> > weren't optional, it would be better defined as in RFC4180, but if the
> >> > syntax allows it to be optional, this would make it not an LL(1) grammar,
> >> > which isn't too much of an issue.
> >> >
> >> > If people feel that we should stick closer to the RFC4180 grammar, it
> >> > might just be a matter of loosening CRLF and adding the non-ASCII Unicode
> >> > range to TEXTDATA, although expressed as W3C EBNF.
> >> >
> >> > If there's general consensus to adding this, I'm happy to put it in the
> >> > spec.
> >> >
> >> > Gregg
> >> >
> >> >> Jeni
> >> >>
> >> >> ------------------------------------------------------
> >> >> From: Alf Eaton eaton.alf@gmail.com
> >> >> Reply: Alf Eaton eaton.alf@gmail.com
> >> >> Date: 5 March 2014 at 18:32:01
> >> >> To: public-csv-wg@w3.org public-csv-wg@w3.org
> >> >> Subject: CSV parser specification?
> >> >>
> >> >>>
> >> >>> Are there any plans to write a specification for a CSV parser, that
> >> >>> would cover all the kinds of files described in the use cases?
> >> >>>
> >> >>> I had a go at an outline today[1], in an attempt to organise my
> >> >>> thoughts about which parameters would be useful to a parser at which
> >> >>> points during the process.
> >> >>>
> >> >>> pandas[2] is the closest tool I've found that incorporates most or all
> >> >>> of these (particularly the generation of "multi-index" keys using
> >> >>> multiple header rows and index columns), though it also includes a lot
> >> >>> of parameters that are only relevant to parsing/transforming the
> >> >>> values of each cell, which I think should probably be in a separate
> >> >>> step.
> >> >>>
> >> >>> Alf
> >> >>>
> >> >>> [1] https://github.com/hubgit/csvw/wiki/CSV-Parsing
> >> >>> [2] http://pandas.pydata.org/pandas-docs/stable/io.html#io-read-csv-table
> >> >>
> >> >> --
> >> >> Jeni Tennison
> >> >> http://www.jenitennison.com/
> >
> > --
> > Rufus Pollock
> > Founder and CEO | skype: rufuspollock | @rufuspollock
> > The Open Knowledge Foundation
> > Empowering through Open Knowledge
> > http://okfn.org/ | @okfn | OKF on Facebook | Blog | Newsletter

--
Rufus Pollock
Founder and CEO | skype: rufuspollock | @rufuspollock <https://twitter.com/rufuspollock>
The Open Knowledge Foundation <http://okfn.org/>
Empowering through Open Knowledge
http://okfn.org/ | @okfn <http://twitter.com/OKFN> | OKF on Facebook <https://www.facebook.com/OKFNetwork> | Blog <http://blog.okfn.org/> | Newsletter <http://okfn.org/about/newsletter>
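
For readers following the thread, here is a minimal sketch of how the "CSV+" EBNF quoted above might be read as a hand-written parser. It is not code from any participant or from the csvw drafts: the function names (parse_field, parse_record, parse_csv) are hypothetical, and two choices are assumptions on my part. First, the grammar is ambiguous for unquoted fields because SCHAR also matches spaces and tabs, so this sketch treats surrounding whitespace as WS and trims it. Second, it follows the grammar strictly in requiring every record, including the last, to end with a line break.

```python
WS = " \t"  # rule: WS ::= [ \t]


def parse_field(text, pos):
    """field ::= WS* rawfield WS*; returns (value, next_position)."""
    n = len(text)
    while pos < n and text[pos] in WS:  # leading WS*
        pos += 1
    if pos < n and text[pos] == '"':
        # rawfield ::= '"' QCHAR* '"', where '""' stands for one quote
        pos += 1
        chars = []
        while True:
            if pos >= n:
                raise ValueError("unterminated quoted field")
            ch = text[pos]
            if ch == '"':
                if pos + 1 < n and text[pos + 1] == '"':
                    chars.append('"')
                    pos += 2
                    continue
                pos += 1  # closing quote
                break
            chars.append(ch)  # QCHAR may include line breaks
            pos += 1
        value = "".join(chars)
    else:
        # rawfield ::= SCHAR*, where SCHAR ::= [^",\r\n]
        start = pos
        while pos < n and text[pos] not in '",\r\n':
            pos += 1
        # Assumption: trailing spaces/tabs count as WS, not field content.
        value = text[start:pos].rstrip(WS)
    while pos < n and text[pos] in WS:  # trailing WS*
        pos += 1
    return value, pos


def parse_record(text, pos):
    """record ::= fields ("\\r\\n" | "\\n"); returns (fields, next_position)."""
    fields = []
    while True:
        value, pos = parse_field(text, pos)
        fields.append(value)
        if pos < len(text) and text[pos] == ",":
            pos += 1
            continue
        break
    if text.startswith("\r\n", pos):
        pos += 2
    elif text.startswith("\n", pos):
        pos += 1
    else:
        raise ValueError(f"expected a line break at offset {pos}")
    return fields, pos


def parse_csv(text):
    """csv ::= header record+; returns (header, list_of_records)."""
    header, pos = parse_record(text, 0)
    records = []
    while pos < len(text):
        record, pos = parse_record(text, pos)
        records.append(record)
    if not records:
        raise ValueError("the grammar requires at least one record after the header")
    return header, records


sample = 'name , "quote ""x""" \r\n a,"b\nc"\n'
header, records = parse_csv(sample)
```

As Gregg notes, nothing here counts fields, so rows of differing lengths are accepted; a check that each record has len(header) fields would have to be layered on top.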
Received on Monday, 10 March 2014 13:04:48 UTC