Re: Updates to the use-case document

Hey Jeremy,


On 26 May 2014, at 15:49 , Tandy, Jeremy <jeremy.tandy@metoffice.gov.uk> wrote:

> Yesterday I updated the biodiversity / GBIF / Darwin Core Archive use case <http://w3c.github.io/csvw/use-cases-and-requirements/#UC-PublicationOfBiodiversityInformation> & am awaiting comments.
> 
> Today I have updated the RTL use case <http://w3c.github.io/csvw/use-cases-and-requirements/#UC-SupportingRightToLeftDirectionality>; cleaning up the text and example data files / images for the Arabic example. I decided to remove the Hebrew example as the web-page which was referenced provided different content to the CSV file, so it was impossible to make a comparison between the two. I had a hunt around on the Israeli Gov web site for relevant resources, but my lack of Hebrew meant that I drew a blank. That said, I think the Arabic example provides sufficient illustration. Comments please - especially Yakov who was the original contributor. 
> 
> ... and apologies to Eric for deleting some of your work in getting rid of the Hebrew example :-(
> 

I am not sure the following remark is correct: "In contrast, over the wire and in non-Unicode-aware text editors" (right after the example picture for the Egyptian election result). If the text editor were not Unicode-aware, the Arabic characters would not be displayed correctly at all...

The text editor simply reflects what comes over the wire. In this case the wire order may look unintuitive to an RTL reader, because it comes in the 'wrong' order, so to speak, i.e., it does not seem to come in a 'logical' order.

That being said, this is good news. In contrast to what I worried about before, the example shows that, in the _logical_ sense, the left-to-right internal representation is fine, i.e., the 0th field in a row is the (row) header, the 1st field is the next cell, etc. E.g., for JSON generation, the logical way of producing a row would be simply to follow the cells in left-to-right order, i.e., there may be no need to handle some sort of inverse ordering of the fields.

Are we sure that all CSV files for Arabic and Hebrew will indeed be encoded this way? Is it possible that some CSV files will do it the other way round, i.e., with the 0th field being the 'last' field in the row, etc.? I no longer have the pointer to the Hebrew CSV files that you had in a previous version; it may be worth checking. We do have a problem if CSV files do not follow the same order every time!
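To make the point concrete, here is a minimal sketch (using Python's stdlib, with a placeholder Arabic word I made up rather than the actual cells from the example file): a CSV row containing Arabic text is stored in logical order, so cell 0 is the row header regardless of how a bidi-aware display renders it, and a JSON row can be generated by walking the cells in index order.

```python
import csv
import io
import json

# A CSV line as it travels over the wire, in logical order.
# "محافظة" ("governorate") is a hypothetical placeholder header cell.
line = "محافظة,12345,67890\r\n"
row = next(csv.reader(io.StringIO(line)))

# Cell 0 is the row header in the logical sense, even though a
# bidi-aware renderer would display the Arabic text right-to-left.
# JSON generation just follows the cells left-to-right; no inverse
# ordering of the fields is needed.
print(json.dumps(row, ensure_ascii=False))
```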


> Regarding the health informatics use case (HL7) <http://w3c.github.io/csvw/use-cases-and-requirements/#UC-HealthLevelSevenHL7>, further information from Eric Prud'hommeaux indicates that HL7 might be more than we can (or want to) cope with. See an excerpt from his email [1] where you'll see an example included. From what I can see, this is _NOT_ regular tabular data. The "microsyntax" in each field is complicated but could be worked out; the real issue to me is that the rows are not uniform - they have different numbers of fields. Furthermore, it appears that the data is parsed according to a specific set of rules defined in a "table", and without this table there's no way to label the parsed attributes.
> 
> I propose that we review this in more detail to see if we should include this use case. Personally, I don't think it adds anything - except to illustrate that row-oriented data can be more complicated than our tabular data model! I propose to drop this use case.
> 

... or keep it precisely to illustrate what you just said: a warning to the reader that row-oriented data does not necessarily mean CSV! (Either way is fine with me; I will go with the flow.)
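For the record, the '^' microparsing itself is mechanically simple; what is missing without the version-specific table is the labels. A rough sketch, taking the patient-name field from the example message below, with component labels that are my assumption of what the V2.5 table would supply:

```python
# Split an HL7 v2 field into components on '^'.
field = "DOE^JOHN^^^^"
components = field.split("^")

# Hypothetical labels: in real HL7 these come from the
# version-specific table; without it the components are unnamed.
labels = ["family_name", "given_name"]
name = dict(zip(labels, components))
print(name)  # {'family_name': 'DOE', 'given_name': 'JOHN'}
```

The trailing empty components ('^^^^') survive the split as empty strings, which is exactly why the table is needed to know which positions carry meaning.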

Ivan


> Finally, I note that JeniT suggested (during our teleconf, 14-May) that she would add an additional use case based around ONS data to help underpin the data model. Is there any progress on this?
> 
> Other than that, there's still work to do on the Requirements and I feel like we should review the email lists since FPWD to make sure nothing relating to use cases has fallen through the net.
> 
> Jeremy 
> 
> ---
> 
> [1] Email from Eric Prud'hommeaux, 21-May
> 
> [a potential narrative for the use case ...]
> John Doe is being transferred from one clinic to another to receive specialized care. The machine-readable transfer documentation includes his name, patient ID, his visit to the first clinic, and some information about his next of kin. The visit info (and many other fields) requires microparsing on the '^' separator to extract further structured information about, for example, the referring physician.
> 
> [on the HL7 data format ...]
>> I think you want to give up on this one because the message format is 
>> hilariously complex and requires a ton of extra info to parse. For 
>> instance, the header in the first line of
>> 
>> MSH|^~\&|EPIC|EPICADT|SMS|SMSADT|199912271408|CHARRIS|ADT^A04|1817457|
>> MSH|D|2.5|
>> PID||0493575^^^2^ID 
>> PID||1|454721||DOE^JOHN^^^^|DOE^JOHN^^^^|19480203|M||B|254 MYSTREET 
>> PID||AVE^^MYTOWN^OH^44123^USA||(216)123-4567|||M|NON|||
>> NK1||ROE^MARIE^^^^|SPO||(216)123-4567||EC|||||||||||||||||||||||||||
>> PV1||O|168 ~219~C~PMA^^^^^^^^^||||277^ALLEN 
>> PV1||O|MYLASTNAME^BONNIE^^^^|||||||||| 
>> PV1||O|||2688684|||||||||||||||||||||||||199912271408||||||002376853
>> 
>> says that the rest must be parsed with V2.5 tables (I think you'll see 2.2 to 2.6 in the wild). The data is oriented in rows, so I'm not sure how applicable CSV techniques would be. It's also 3- or maybe 4-dimensional ("^~\&" being the declared separators for the fields within fields in this particular document).
>> 
>> The V2.5 table tells you how to parse the rest of the fields, e.g. the PID field, which happens to include subfields like lastname and firstname ("DOE" and "JOHN" respectively). Without that table, there's no way to know how to label the parsed attributes.
> 


----
Ivan Herman, W3C 
Digital Publishing Activity Lead
Home: http://www.w3.org/People/Ivan/
mobile: +31-641044153
GPG: 0x343F1A3D
WebID: http://www.ivan-herman.net/foaf#me

Received on Monday, 26 May 2014 14:11:07 UTC