- From: Dan Brickley <danbri@danbri.org>
- Date: Wed, 25 Feb 2015 07:58:07 -0800
- To: public-csv-wg@w3.org
- Message-ID: <CAFfrAFrDBMPoDzfmyST9WA_HyMxCFt8O5AdLv3ifp35a4Pw=zg@mail.gmail.com>
Maybe the attachments were too big? ---------- Forwarded message ---------- From: Juergen Umbrich <juergen.umbrich@wu.ac.at> Date: 25 February 2015 at 07:16 Subject: Fwd: Analysis of different datatypes per column To: Dan Brickley <danbri@danbri.org>, Jeni Tennison <jeni@jenitennison.com> Hi, here the email, i forgot to gzip one of the text files, and the email was 7MB- > that might be the reason that the email was not delivered > Begin forwarded message: > > Hi all, > > attached is a short survey of the amount of different “datatypes” per columns in CSV files and our logs containing the CSV URL and at what column we detected which types. > > DISCLAIMER: > We are not sure how representative these statistics are for now due to the following reasons: > *) we only analysed 9000 csv files due to temporarily limited amount of resources and some not performance optimised code > *) we used our datatype detection algorithm which are far from optimal > > Sorry, that we were not able to compile the report with more documents and a more accurate datatype detection. > Anyway, i hope it sheds some light and can be of any help. > > Below are some more details to better understand the results > The explanation of the data file: > first column = the different data types for a column ( separated by “_”= > second column = the count of tables for which we detected more than 1 datatype in one column > > > Sorting them by their count, we can see that > *) the most prominent pair of different datatypes per column is TEXT and NUMERICAL ( e.g ALPHA_FLOAT, ALPHA_INTEGER, ….) > *) we observed also very often a mix of empty cell values in a column > *) mix of different numerical values ( INTEGER and FLOAT , or various ranges of integer values) > > Little bit more details: > *) 2882 tables contain a column with text and float values (ALPHA_FLOAT) > *) 2513 tables contain a column with either characters+numbers or only float > roughly the same amount of tables have in addition also at least one empty cell in the column > *) 1848 tables have a column with FLOAT or Integer values > > Explanation of our datatypes detection algorithm and datatypes. > > for each cell in a column: > 1) EMPTY -check: cell is empty or contains “null” > 2) EMAIL based on email_regex = '[a-zA-Z0-9_\.\+-]+@[a-zA-Z0-9-]+\.[a-zA-Z0-9-\.]+’ > 3) RESOURCE = check if the cell value might be xx://xxx resource ( this is very buggy code :S) > 4) if cell contains numbers > 4a) ALPAHNUM = numbers and characters > 4b) DATE if it matches one of the date_regex ( see below) > 4c) FLOAT typecasting to float > 4d) YEAR if val >= 1400 and int_val <= 2100 > 4e) YEAR_MONTH = if val >= 197000 and int_val < 210000 > 4f) INTEGER if no comma and does not match any of YEAR/Date/etc. > 5) if contains alpha > 5a) text if len(cell) >50 > 5b) ALPHA = all characters in the string are alphabetic and there is at least one character > 5c) COMPLEX = if nothing of the above matches > !!! You already recognised that this is far from optimal or very accurate and by looking again at the code and the results we discovered *many* improvements and input for an improved datatype algorithm > -- Dr. Jürgen Umbrich WU Vienna, Institute for Information Business
Attachments
- application/x-gzip attachment: csv_urls_with_type.txt.gz
- text/csv attachment: csv_datatypes.csv
Received on Wednesday, 25 February 2015 15:58:40 UTC