- From: David Booth <david@dbooth.org>
- Date: Wed, 04 Mar 2015 09:32:04 -0500
- To: Juergen Umbrich <juergen.umbrich@wu.ac.at>, "public-csv-wg@w3.org" <public-csv-wg@w3.org>
What does ALPHA++ mean? It appears more than once in the results, but it is not listed in the key below. ALPHA+ is listed, but not ALPHA++ . Thanks, David Booth On 03/04/2015 08:44 AM, Juergen Umbrich wrote: > Hi all, > > we compiled a new list of common datatype tuples in columns of > csv files ( see attached file). > > Overall, we parsed 50007 csv files and parsed for the following datatypes: > > EMPTY= empty string or “null” > ALPHA = all characters in the string are alphabetic and there is at least one character > ALPHA+ = alphabetic characters + special characters > EMAIL: email patterns > URL = based on a regex to detect IP, URL, FTP …. ( from the python django library) > ALPHANUM = alphabetical and numerical characters > ALPHANUM+ = alphabetical and numerical characters, plus special characters > DATE = if parseable by the dateutil.parser.parse(cell) (python) > NUMBER = only digits > FLOAT = digits with “.” or “,” separation > NUMERIC+= NUMBERS + special character which are not “.” or “,” > UNKNOWN = everything which does not fit the top part > > > best > Jürgen >> On 25 Feb 2015, at 16:58, Dan Brickley <danbri@danbri.org> wrote: >> >> Maybe the attachments were too big? >> >> ---------- Forwarded message ---------- >> From: Juergen Umbrich <juergen.umbrich@wu.ac.at> >> Date: 25 February 2015 at 07:16 >> Subject: Fwd: Analysis of different datatypes per column >> To: Dan Brickley <danbri@danbri.org>, Jeni Tennison <jeni@jenitennison.com> >> >> >> Hi, >> here the email, >> i forgot to gzip one of the text files, and the email was 7MB- > that >> might be the reason that the email was not delivered >> >> >>> Begin forwarded message: >>> >>> Hi all, >>> >>> attached is a short survey of the amount of different “datatypes” per columns in CSV files and our logs containing the CSV URL and at what column we detected which types. >>> >>> DISCLAIMER: >>> We are not sure how representative these statistics are for now due to the following reasons: >>> *) we only analysed 9000 csv files due to temporarily limited amount of resources and some not performance optimised code >>> *) we used our datatype detection algorithm which are far from optimal >>> >>> Sorry, that we were not able to compile the report with more documents and a more accurate datatype detection. >>> Anyway, i hope it sheds some light and can be of any help. >>> >>> Below are some more details to better understand the results >>> The explanation of the data file: >>> first column = the different data types for a column ( separated by “_”= >>> second column = the count of tables for which we detected more than 1 datatype in one column >> >>> >>> >>> Sorting them by their count, we can see that >>> *) the most prominent pair of different datatypes per column is TEXT and NUMERICAL ( e.g ALPHA_FLOAT, ALPHA_INTEGER, ….) >>> *) we observed also very often a mix of empty cell values in a column >>> *) mix of different numerical values ( INTEGER and FLOAT , or various ranges of integer values) >>> >>> Little bit more details: >>> *) 2882 tables contain a column with text and float values (ALPHA_FLOAT) >>> *) 2513 tables contain a column with either characters+numbers or only float >>> roughly the same amount of tables have in addition also at least one empty cell in the column >>> *) 1848 tables have a column with FLOAT or Integer values >>> >>> Explanation of our datatypes detection algorithm and datatypes. >>> >>> for each cell in a column: >>> 1) EMPTY -check: cell is empty or contains “null” >>> 2) EMAIL based on email_regex = '[a-zA-Z0-9_\.\+-]+@[a-zA-Z0-9-]+\.[a-zA-Z0-9-\.]+’ >>> 3) RESOURCE = check if the cell value might be xx://xxx resource ( this is very buggy code :S) >>> 4) if cell contains numbers >>> 4a) ALPAHNUM = numbers and characters >>> 4b) DATE if it matches one of the date_regex ( see below) >>> 4c) FLOAT typecasting to float >>> 4d) YEAR if val >= 1400 and int_val <= 2100 >>> 4e) YEAR_MONTH = if val >= 197000 and int_val < 210000 >>> 4f) INTEGER if no comma and does not match any of YEAR/Date/etc. >>> 5) if contains alpha >>> 5a) text if len(cell) >50 >>> 5b) ALPHA = all characters in the string are alphabetic and there is at least one character >>> 5c) COMPLEX = if nothing of the above matches >>> !!! You already recognised that this is far from optimal or very accurate and by looking again at the code and the results we discovered *many* improvements and input for an improved datatype algorithm >>> >> >> -- >> Dr. Jürgen Umbrich >> WU Vienna, Institute for Information Business >> <csv_urls_with_type.txt.gz><csv_datatypes.csv><signature.asc> > > -- > Dr. Jürgen Umbrich > WU Vienna, Institute for Information Business > > > > >
Received on Wednesday, 4 March 2015 14:32:35 UTC