Re: Analysis of different datatypes per column from David Booth on 2015-03-04 (public-csv-wg@w3.org from March 2015)

From: David Booth <david@dbooth.org>
Date: Wed, 04 Mar 2015 09:32:04 -0500
To: Juergen Umbrich <juergen.umbrich@wu.ac.at>, "public-csv-wg@w3.org" <public-csv-wg@w3.org>
Message-ID: <54F71764.4000504@dbooth.org>
What does ALPHA++ mean?  It appears more than once in the results, but 
it is not listed in the key below.  ALPHA+ is listed, but not ALPHA++ .

Thanks,
David Booth

On 03/04/2015 08:44 AM, Juergen Umbrich wrote:
> Hi all,
>
> we compiled a new list of common datatype tuples in columns of
 > csv files ( see attached file).
>
> Overall, we parsed 50007 csv files and parsed for the following datatypes:
>
> EMPTY= empty string or “null”
> ALPHA = all characters in the string are alphabetic and there is at least one character
> ALPHA+ = alphabetic characters + special characters
> EMAIL: email patterns
> URL = based on a regex to detect IP, URL, FTP …. ( from the python django library)
> ALPHANUM = alphabetical and numerical characters
> ALPHANUM+ = alphabetical and numerical characters, plus special characters
> DATE = if parseable by the dateutil.parser.parse(cell) (python)
> NUMBER = only digits
> FLOAT = digits with “.” or “,” separation
> NUMERIC+= NUMBERS + special character which are not “.” or “,”
> UNKNOWN = everything which does not fit the top part
>
>
> best
>    Jürgen
>> On 25 Feb 2015, at 16:58, Dan Brickley <danbri@danbri.org> wrote:
>>
>> Maybe the attachments were too big?
>>
>> ---------- Forwarded message ----------
>> From: Juergen Umbrich <juergen.umbrich@wu.ac.at>
>> Date: 25 February 2015 at 07:16
>> Subject: Fwd: Analysis of different datatypes per column
>> To: Dan Brickley <danbri@danbri.org>, Jeni Tennison <jeni@jenitennison.com>
>>
>>
>> Hi,
>> here the email,
>> i forgot to gzip one of the text files, and the email was 7MB- > that
>> might be the reason that the email was not delivered
>>
>>
>>> Begin forwarded message:
>>>
>>> Hi all,
>>>
>>> attached is a short survey of the amount of different “datatypes” per columns in CSV files and our logs containing the CSV URL and at what column we detected which types.
>>>
>>> DISCLAIMER:
>>> We are not sure how representative these statistics are for now due to the following reasons:
>>>       *) we only analysed 9000 csv files due to temporarily limited amount of resources and some not performance  optimised code
>>>       *) we used our datatype detection algorithm which are far from optimal
>>>
>>> Sorry, that we were not able to compile the report with more documents and a more accurate datatype detection.
>>> Anyway, i hope it sheds some light and can be of any help.
>>>
>>> Below are some more details to better understand the results
>>> The explanation of the data file:
>>>       first column = the different data types for a column ( separated by “_”=
>>>       second column = the count of tables for which we detected more than 1 datatype  in one column
>>
>>>
>>>
>>> Sorting them by their count, we can see that
>>>       *) the most prominent pair of different datatypes per column is TEXT and NUMERICAL ( e.g ALPHA_FLOAT, ALPHA_INTEGER, ….)
>>>       *) we observed also very often a mix of empty cell values in a column
>>>       *) mix of different numerical values ( INTEGER and FLOAT , or various ranges of integer values)
>>>
>>> Little bit more details:
>>>       *) 2882 tables contain a column with text and float values (ALPHA_FLOAT)
>>>       *) 2513 tables contain a column with either characters+numbers  or only float
>>>               roughly the same amount of tables have in addition also at least one empty cell in the column
>>>       *) 1848 tables have a column with FLOAT or Integer values
>>>
>>> Explanation of our datatypes detection algorithm and datatypes.
>>>
>>>       for each cell in a column:
>>>               1) EMPTY -check: cell is empty or contains “null”
>>>               2) EMAIL based on email_regex = '[a-zA-Z0-9_\.\+-]+@[a-zA-Z0-9-]+\.[a-zA-Z0-9-\.]+’
>>>               3) RESOURCE = check if the cell value might be xx://xxx resource ( this is very buggy code :S)
>>>               4) if cell contains numbers
>>>                       4a) ALPAHNUM = numbers and characters
>>>                       4b) DATE if it matches one of the date_regex ( see below)
>>>                       4c) FLOAT typecasting to float
>>>                       4d) YEAR  if val >= 1400 and int_val <= 2100
>>>                       4e) YEAR_MONTH = if val >= 197000 and int_val < 210000
>>>                       4f) INTEGER if no comma and does not match any of YEAR/Date/etc.
>>>               5) if contains alpha
>>>                       5a) text if len(cell) >50
>>>                       5b) ALPHA = all characters in the string are alphabetic and there is at least one character
>>>                       5c) COMPLEX = if nothing of the above matches
>>> !!! You already recognised that this is far from optimal or very accurate and by looking again at the code and the results we discovered *many* improvements and input for an improved datatype algorithm
>>>
>>
>> --
>> Dr. Jürgen Umbrich
>> WU Vienna, Institute for Information Business
>> <csv_urls_with_type.txt.gz><csv_datatypes.csv><signature.asc>
>
> --
> Dr. Jürgen Umbrich
> WU Vienna, Institute for Information Business
>
>
>
>
>
Received on Wednesday, 4 March 2015 14:32:35 UTC