- From: Alf Eaton <eaton.alf@gmail.com>
- Date: Mon, 10 Mar 2014 14:50:21 +0000
- To: Rufus Pollock <rufus.pollock@okfn.org>
- Cc: "public-csv-wg@w3.org" <public-csv-wg@w3.org>
Yes, skipInitialSpace covers removing initial space (it's not clear
whether it means a single character or all the whitespace, but it seems
like the latter). I have seen parsers offer options to trim space from
the start, end or both ends of the field - presumably having whitespace
at the other end of the value is less common, but someone needed it at
some point. There's also the difference between trimming (removing all
the whitespace; probably more useful for hand-edited CSV files) and
just removing a single character (more likely to be an actual
dialect/export option in some tools).

Alf

On 10 March 2014 13:04, Rufus Pollock <rufus.pollock@okfn.org> wrote:
> Great stuff Alf.
>
> I note that trimming trailing (or leading) space is the purpose of the
> "skipInitialSpace" attribute in the CSV Dialect Description Format
> spec.
>
> Rufus
>
> On 6 March 2014 14:12, Alf Eaton <eaton.alf@gmail.com> wrote:
>>
>> Yes, indeed - my starting point has basically been to take that
>> specification and try to find things that it doesn't cover.
>>
>> The main things I've found have been backslash escaping (supported by
>> PHP's CSV parser, for example), trimming trailing space (supported by
>> some CSV parsers), and multiple header rows/columns (supported by
>> _pandas_, at least).
>>
>> That's not counting all the options that some parsers have for doing
>> data conversion/transformation, which is a separate issue…
>>
>> Alf
>>
>> On 6 March 2014 14:04, Rufus Pollock <rufus.pollock@okfn.org> wrote:
>> > I note there is this existing "mini-spec" for describing CSV
>> > dialects that may be useful here:
>> >
>> > http://dataprotocols.org/csv-dialect/
>> >
>> > Both inference of given CSV structure (CSV "sniffing") and
>> > validation would, of course, be separate issues.
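For comparison with existing tools, Python's built-in csv module treats
this exactly as a dialect option: skipinitialspace removes the run of
spaces following each delimiter (all of them, not a single character),
while trimming at the other end of the value has to be done as a
post-processing step. A minimal sketch:

```python
import csv
import io

# skipinitialspace drops the spaces that follow each delimiter -
# note that it removes the whole run, not just one character
row = next(csv.reader(io.StringIO("a,  b,   c"), skipinitialspace=True))

# trimming trailing (or both-end) whitespace is not a dialect option
# in this parser, so it becomes a separate post-processing step
trimmed = [field.strip() for field in next(csv.reader(io.StringIO("a , b ")))]
```

The strip() step corresponds to the "trimming" behaviour discussed
above, which here lives outside the dialect entirely.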
>> >
>> > Rufus
>> >
>> > On 6 March 2014 13:54, Alf Eaton <eaton.alf@gmail.com> wrote:
>> >>
>> >> It occurred to me that there are essentially 3 levels of
>> >> "strictness" that a parser might need to support:
>> >>
>> >> 1. "Strict": no options for the parser; all the encoding, escaping
>> >> and delimiter options are fixed (similar to JSON).
>> >> 2. "Intermediate": dialect options can be given to the parser (or
>> >> it may use auto-detection), allowing different encodings,
>> >> delimiters, enclosures, escape characters and header rows/columns
>> >> (similar to how most CSV encoders and parsers currently work).
>> >> 3. "Liberal": the parser may need extra options to be able to
>> >> generate clean data, such as removing leading/trailing rows or
>> >> columns, trimming whitespace, converting date formats, etc.
>> >> (similar to how some more complex CSV parsers currently work).
>> >>
>> >> I've attempted to write these up and include some examples:
>> >> https://github.com/hubgit/csvw/wiki/CSV-Strictness
>> >>
>> >> It may be that the syntax document should aim for the strictest of
>> >> these, as a recommendation for publishing data as CSV, but then
>> >> describe the options that a parser would need in order to be more
>> >> liberal and handle existing CSV files.
>> >>
>> >> Alf
>> >>
>> >> On 5 March 2014 23:49, Gregg Kellogg <gregg@greggkellogg.net> wrote:
>> >> > On Mar 5, 2014, at 3:48 PM, Jeni Tennison <jeni@jenitennison.com>
>> >> > wrote:
>> >> >
>> >> >> Alf,
>> >> >>
>> >> >> See http://w3c.github.io/csvw/syntax/#parsing. Please add
>> >> >> yourself as an editor and feel free to edit that content.
>> >> >>
>> >> >> What it really needs to do better is link back to describe the
>> >> >> creation of the tabular data model that’s described in the
>> >> >> earlier section.
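The "intermediate" level described above corresponds closely to how
Python's csv module works: the dialect is either passed explicitly or
inferred with csv.Sniffer, an existing implementation of the CSV
"sniffing" that the thread treats as a separate issue. A sketch using
an invented semicolon-delimited sample:

```python
import csv
import io

data = 'name;comment\nalice;"semi;colons, inside"\n'

# level 2 ("intermediate"): the dialect options are given explicitly
rows = list(csv.reader(io.StringIO(data), delimiter=';', quotechar='"'))

# or the dialect is auto-detected ("sniffed") from a sample
dialect = csv.Sniffer().sniff(data)
rows_sniffed = list(csv.reader(io.StringIO(data), dialect))
```

Sniffing is heuristic, so it can fail or guess wrongly on small or
unusual samples - one reason to keep it out of the core parsing model.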
>> >> >> Note that that model doesn’t contain anything about indexes of
>> >> >> columns or rows, so I have left that out of the parsing
>> >> >> description too.
>> >> >
>> >> > If we define a syntax, we probably also want EBNF to describe it.
>> >> > This is a simple first cut at that:
>> >> >
>> >> > # EBNF description of CSV+
>> >> > [1] csv      ::= header record+
>> >> > [2] header   ::= record
>> >> > [3] record   ::= fields ("\r\n" | "\n")
>> >> > [4] fields   ::= field ("," field)*
>> >> > [5] field    ::= WS* rawfield WS*
>> >> > [6] rawfield ::= '"' QCHAR* '"'
>> >> >              |   SCHAR*
>> >> > [7] QCHAR    ::= [^"] | '""'
>> >> > [8] SCHAR    ::= [^",\r\n]
>> >> > [9] WS       ::= [ \t]
>> >> >
>> >> > Of course, it can’t do field counting. We should probably place
>> >> > further restrictions on QCHAR and SCHAR to avoid control
>> >> > characters. If header weren’t optional, it would be better
>> >> > defined as in RFC 4180, but if the syntax allows it to be
>> >> > optional, this would make it not an LL(1) grammar, which isn’t
>> >> > too much of an issue.
>> >> >
>> >> > If people feel that we should stick closer to the RFC 4180
>> >> > grammar, it might just be a matter of loosening CRLF and adding
>> >> > the non-ASCII Unicode range to TEXTDATA, although expressed as
>> >> > W3C EBNF.
>> >> >
>> >> > If there’s general consensus on adding this, I’m happy to put it
>> >> > in the spec.
>> >> >
>> >> > Gregg
>> >> >
>> >> >> Jeni
>> >> >>
>> >> >> ------------------------------------------------------
>> >> >> From: Alf Eaton <eaton.alf@gmail.com>
>> >> >> Reply: Alf Eaton <eaton.alf@gmail.com>
>> >> >> Date: 5 March 2014 at 18:32:01
>> >> >> To: public-csv-wg@w3.org <public-csv-wg@w3.org>
>> >> >> Subject: CSV parser specification?
>> >> >>
>> >> >>> Are there any plans to write a specification for a CSV parser
>> >> >>> that would cover all the kinds of files described in the use
>> >> >>> cases?
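The field-level productions of Gregg's grammar (fields, field,
rawfield, QCHAR, SCHAR, WS) translate almost mechanically into a
regular expression, which makes them easy to sanity-check. This is only
a sketch for a single record: the record/header rules and the suggested
control-character restrictions are omitted, and parse_record is an
illustrative name, not anything from the spec:

```python
import re

# field ::= WS* rawfield WS*, where rawfield is either a quoted field
# (QCHAR allows any non-quote character or the "" escape) or a run of
# SCHARs; the lookahead anchors each match at a comma or end of record
FIELD = re.compile(
    r'[ \t]*(?:"(?P<q>(?:[^"]|"")*)"|(?P<s>[^",\r\n]*?))[ \t]*(?=,|$)')

def parse_record(line):
    """Split one record (without its terminator) into field values."""
    fields, pos = [], 0
    while True:
        m = FIELD.match(line, pos)
        if m is None:
            # e.g. a stray quote inside an unquoted field, which the
            # grammar's SCHAR rule also rejects
            raise ValueError("malformed record")
        q = m.group('q')
        fields.append(q.replace('""', '"') if q is not None else m.group('s'))
        pos = m.end()
        if pos == len(line):
            return fields
        pos += 1  # consume the ","
```

Note how the WS* on both sides of rawfield gives exactly the
leading/trailing trimming behaviour discussed earlier in the thread.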
>> >> >>>
>> >> >>> I had a go at an outline today[1], in an attempt to organise
>> >> >>> my thoughts about which parameters would be useful to a parser
>> >> >>> at which points during the process.
>> >> >>>
>> >> >>> pandas[2] is the closest tool I've found that incorporates
>> >> >>> most or all of these (particularly the generation of
>> >> >>> "multi-index" keys using multiple header rows and index
>> >> >>> columns), though it also includes a lot of parameters that are
>> >> >>> only relevant to parsing/transforming the values of each cell,
>> >> >>> which I think should probably be in a separate step.
>> >> >>>
>> >> >>> Alf
>> >> >>>
>> >> >>> [1] https://github.com/hubgit/csvw/wiki/CSV-Parsing
>> >> >>> [2] http://pandas.pydata.org/pandas-docs/stable/io.html#io-read-csv-table
>> >> >>
>> >> >> --
>> >> >> Jeni Tennison
>> >> >> http://www.jenitennison.com/
>> >
>> > --
>> > Rufus Pollock
>> > Founder and CEO | skype: rufuspollock | @rufuspollock
>> > The Open Knowledge Foundation
>> > Empowering through Open Knowledge
>> > http://okfn.org/ | @okfn | OKF on Facebook | Blog | Newsletter
>
> --
> Rufus Pollock
> Founder and CEO | skype: rufuspollock | @rufuspollock
> The Open Knowledge Foundation
> Empowering through Open Knowledge
> http://okfn.org/ | @okfn | OKF on Facebook | Blog | Newsletter
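The "multi-index" keys mentioned for pandas (read_csv with
header=[0, 1]) can be approximated with the standard library by zipping
multiple header rows into tuple keys. A sketch using invented data:

```python
import csv
import io

# made-up sample with two header rows, pandas-style (header=[0, 1])
data = ("year,sales,sales\n"
        ",Q1,Q2\n"
        "2013,10,20\n"
        "2014,30,40\n")

rows = list(csv.reader(io.StringIO(data)))
header_rows, body = rows[:2], rows[2:]

# zip the header rows column-wise into tuple ("multi-index") keys
columns = [tuple(col) for col in zip(*header_rows)]
records = [dict(zip(columns, row)) for row in body]
```

Cell values stay as strings here; converting them is exactly the
separate parsing/transformation step the thread argues for.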
Received on Monday, 10 March 2014 14:51:10 UTC