- From: Juergen Umbrich <juergen.umbrich@wu.ac.at>
- Date: Wed, 4 Mar 2015 15:34:46 +0100
- To: David Booth <david@dbooth.org>
- Cc: "public-csv-wg@w3.org" <public-csv-wg@w3.org>
- Message-Id: <50A2D381-0783-4B25-B67F-36F094EC2D14@wu.ac.at>
Hi David, Seems there happen some concatenating errors, ALPHA++ should be ALPHA+, the same for FLOAT+ which should be FLOAT best Jürgen > On 04 Mar 2015, at 15:32, David Booth <david@dbooth.org> wrote: > > What does ALPHA++ mean? It appears more than once in the results, but it is not listed in the key below. ALPHA+ is listed, but not ALPHA++ . > > Thanks, > David Booth > > On 03/04/2015 08:44 AM, Juergen Umbrich wrote: >> Hi all, >> >> we compiled a new list of common datatype tuples in columns of > > csv files ( see attached file). >> >> Overall, we parsed 50007 csv files and parsed for the following datatypes: >> >> EMPTY= empty string or “null” >> ALPHA = all characters in the string are alphabetic and there is at least one character >> ALPHA+ = alphabetic characters + special characters >> EMAIL: email patterns >> URL = based on a regex to detect IP, URL, FTP …. ( from the python django library) >> ALPHANUM = alphabetical and numerical characters >> ALPHANUM+ = alphabetical and numerical characters, plus special characters >> DATE = if parseable by the dateutil.parser.parse(cell) (python) >> NUMBER = only digits >> FLOAT = digits with “.” or “,” separation >> NUMERIC+= NUMBERS + special character which are not “.” or “,” >> UNKNOWN = everything which does not fit the top part >> >> >> best >> Jürgen >>> On 25 Feb 2015, at 16:58, Dan Brickley <danbri@danbri.org> wrote: >>> >>> Maybe the attachments were too big? >>> >>> ---------- Forwarded message ---------- >>> From: Juergen Umbrich <juergen.umbrich@wu.ac.at> >>> Date: 25 February 2015 at 07:16 >>> Subject: Fwd: Analysis of different datatypes per column >>> To: Dan Brickley <danbri@danbri.org>, Jeni Tennison <jeni@jenitennison.com> >>> >>> >>> Hi, >>> here the email, >>> i forgot to gzip one of the text files, and the email was 7MB- > that >>> might be the reason that the email was not delivered >>> >>> >>>> Begin forwarded message: >>>> >>>> Hi all, >>>> >>>> attached is a short survey of the amount of different “datatypes” per columns in CSV files and our logs containing the CSV URL and at what column we detected which types. >>>> >>>> DISCLAIMER: >>>> We are not sure how representative these statistics are for now due to the following reasons: >>>> *) we only analysed 9000 csv files due to temporarily limited amount of resources and some not performance optimised code >>>> *) we used our datatype detection algorithm which are far from optimal >>>> >>>> Sorry, that we were not able to compile the report with more documents and a more accurate datatype detection. >>>> Anyway, i hope it sheds some light and can be of any help. >>>> >>>> Below are some more details to better understand the results >>>> The explanation of the data file: >>>> first column = the different data types for a column ( separated by “_”= >>>> second column = the count of tables for which we detected more than 1 datatype in one column >>> >>>> >>>> >>>> Sorting them by their count, we can see that >>>> *) the most prominent pair of different datatypes per column is TEXT and NUMERICAL ( e.g ALPHA_FLOAT, ALPHA_INTEGER, ….) >>>> *) we observed also very often a mix of empty cell values in a column >>>> *) mix of different numerical values ( INTEGER and FLOAT , or various ranges of integer values) >>>> >>>> Little bit more details: >>>> *) 2882 tables contain a column with text and float values (ALPHA_FLOAT) >>>> *) 2513 tables contain a column with either characters+numbers or only float >>>> roughly the same amount of tables have in addition also at least one empty cell in the column >>>> *) 1848 tables have a column with FLOAT or Integer values >>>> >>>> Explanation of our datatypes detection algorithm and datatypes. >>>> >>>> for each cell in a column: >>>> 1) EMPTY -check: cell is empty or contains “null” >>>> 2) EMAIL based on email_regex = '[a-zA-Z0-9_\.\+-]+@[a-zA-Z0-9-]+\.[a-zA-Z0-9-\.]+’ >>>> 3) RESOURCE = check if the cell value might be xx://xxx resource ( this is very buggy code :S) >>>> 4) if cell contains numbers >>>> 4a) ALPAHNUM = numbers and characters >>>> 4b) DATE if it matches one of the date_regex ( see below) >>>> 4c) FLOAT typecasting to float >>>> 4d) YEAR if val >= 1400 and int_val <= 2100 >>>> 4e) YEAR_MONTH = if val >= 197000 and int_val < 210000 >>>> 4f) INTEGER if no comma and does not match any of YEAR/Date/etc. >>>> 5) if contains alpha >>>> 5a) text if len(cell) >50 >>>> 5b) ALPHA = all characters in the string are alphabetic and there is at least one character >>>> 5c) COMPLEX = if nothing of the above matches >>>> !!! You already recognised that this is far from optimal or very accurate and by looking again at the code and the results we discovered *many* improvements and input for an improved datatype algorithm >>>> >>> >>> -- >>> Dr. Jürgen Umbrich >>> WU Vienna, Institute for Information Business >>> <csv_urls_with_type.txt.gz><csv_datatypes.csv><signature.asc> >> >> -- >> Dr. Jürgen Umbrich >> WU Vienna, Institute for Information Business >> >> >> >> >> > -- Dr. Jürgen Umbrich WU Vienna, Institute for Information Business
Received on Wednesday, 4 March 2015 14:35:17 UTC