Re: RTL Directionality use case #19 added from Ivan Herman on 2014-05-11 (public-csv-wg@w3.org from May 2014)

From: Ivan Herman <ivan@w3.org>
Date: Sun, 11 May 2014 10:21:08 +0200
To: Jeni Tennison <jeni@jenitennison.com>
Cc: W3C CSV on the Web Working Group <public-csv-wg@w3.org>, Yakov Shafranovich <yakov-ietf@shaftek.org>
Message-Id: <B98E6BA3-8E5D-4FA8-980D-E19A9A2FE190@w3.org>
Jeni,

first of all, I agree that we should defer to I18N experts. 

Although I agree with the statements in your first two paragraphs below, I do not think I agree with your conclusions. 

The problem is bidi: making a global decision on whether the document is inherently ltr or rtl based on the first few directional characters (or words) is error prone. A wrong decision, in the case of HTML, would affect whether bidi texts are displayed correctly. Unicode has its own control characters for direction that can surround, say, an English word in a Hebrew text, but I am not even sure how one would type them; [1] clearly advises relying on markup rather than on the extra Unicode characters.
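
To make the failure mode concrete, here is a minimal Python sketch of a "first strong character" heuristic. It is only an illustration, not the algorithm in the syntax document: the function name first_strong_direction and the sample header rows are assumptions made up for this example. It relies on the standard unicodedata.bidirectional() call, which returns the Unicode bidi class of a character.

```python
import unicodedata


def first_strong_direction(text):
    """Guess a direction from the first strongly directional character.

    Returns "rtl", "ltr", or None when no strong character is found.
    Hypothetical helper, for illustration only.
    """
    for ch in text:
        bidi_class = unicodedata.bidirectional(ch)
        if bidi_class in ("R", "AL"):   # Hebrew, Arabic letters
            return "rtl"
        if bidi_class == "L":           # Latin and other LTR letters
            return "ltr"
    return None


# A header row that is mostly Hebrew but happens to start with "ID".
# The Hebrew words are written as escapes to keep the source unambiguous.
mixed_header = "ID,\u05e9\u05dd,\u05e2\u05d9\u05e8"
hebrew_first = "\u05e9\u05dd,\u05e2\u05d9\u05e8,ID"

print(first_strong_direction(mixed_header))
print(first_strong_direction(hebrew_first))
```

The same three headers yield different guesses depending only on which one comes first in the byte stream, which is the kind of wrong global decision described above.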

In the case of CSV we have an analogous situation; it is not a matter of display, it is a matter of deciding which column plays what role (header column, or columns to be skipped). 
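
A small Python sketch of why this matters for parsing, assuming a hypothetical direction flag and hypothetical helper names (parse_row, header_cells); none of these come from the syntax document. It shows that a "header column count" counted from the "beginning of each row" selects different cells depending on how the direction is interpreted.

```python
import csv
import io


def parse_row(line, direction="ltr"):
    """Split one CSV line into cells.

    With direction="rtl" the cells are taken from the right, i.e. the
    order is reversed. The flag itself is an assumption for this sketch.
    """
    cells = next(csv.reader(io.StringIO(line)))
    return cells if direction == "ltr" else cells[::-1]


def header_cells(line, header_column_count, direction="ltr"):
    """Return the cells that would be treated as header columns."""
    return parse_row(line, direction)[:header_column_count]


row = "a,b,c"
print(header_cells(row, 1))          # counted from the start of the byte stream
print(header_cells(row, 1, "rtl"))   # counted from the right
```

Two parsers given the same file but different flags would disagree on which column is the header column, so the parameters are underdefined unless the direction is fixed one way or the other.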

Again, I do not want to present myself as being an expert on this matter: I am certainly not. The only thing I learned from Richard Ishida (our I18N guy at W3C) over the years is that it is complicated, and certainly more complicated than it looks:-) Hence I think we should defer to I18N experts on this.

Indeed, at the end of the day, this may have to be pushed to the IETF, but I still believe that the parsing algorithm we describe in our own document (albeit non-normatively) is incomplete...

Ivan

[1] http://www.w3.org/International/questions/qa-bidi-controls




On 10 May 2014, at 22:10 , Jeni Tennison <jeni@jenitennison.com> wrote:

> Do you agree that in the equivalent case in a document, the first character in a paragraph is the first character in the byte stream for the paragraph? Do you think that’s different in a RTL document? If so, how?
> 
> Do you agree that in the equivalent case in a document, the first word in a paragraph is the sequence of characters before the first non-word character? Do you think that’s different in a RTL document? If so, how?
> 
> If you agree with both the cases above, I can’t see how you would disagree with the definition of the first cell in a row (and therefore the first column) being the sequence of characters before the first comma, whether or not the table is LTR or RTL.
> 
> In all cases, the fact the paragraph or table is *displayed* with the first character/word/column on the right by applications that are aware of how to work out the text is RTL, is distinct from the structure of the file.
> 
> I suggest that we define it like this in the document (albeit non-normatively, since the actual definition will have to be in the RFC) and then ask the I18N WG as they’re the experts. 
> 
> Jeni
> 
> ------------------------------------------------------
> From: Ivan Herman ivan@w3.org
> Reply: Ivan Herman ivan@w3.org
> Date: 9 May 2014 at 19:39:28
> To: Jeni Tennison jeni@jenitennison.com
> Cc: W3C CSV on the Web Working Group public-csv-wg@w3.org, Yakov Shafranovich yakov-ietf@shaftek.org
> Subject:  Re: RTL Directionality use case #19 added
> 
>> Jeni,
>> 
>> sigh. We seem to be between a rock and a hard place. I fully understand what you say. However...
>> the current algorithm refers to 'skip column' and 'header column count', which refer
>> to the 'beginning of each row'. These parameters are then underdefined; they should
>> be dependent on the direction. If we decide that these values are always LTR, then these
>> parameters should also appear in the metadata to interpret the data properly...
>> 
>> We know that the headers can indeed be lost, let alone the fact that the file can be local  
>> on the disk...
>> 
>> As I said, sigh...
>> 
>> Ivan
>> 
>> ---
>> Ivan Herman
>> Tel:+31 641044153
>> http://www.ivan-herman.net
>> 
>> (Written on mobile, sorry for brevity and misspellings...)
>> 
>> 
>> 
>>> On 9 May 2014, at 18:45, Jeni Tennison wrote:
>>> 
>>> Ivan,
>>> 
>>> Using a RTL/LTR flag during parsing seems like a really bad idea because the number of
>>> the column is used in references into the table, and if different parsers are given different
>>> flags then the column numbering will be different which will mean that they interpret
>>> the table and references into it completely differently.
>>> 
>>> Anyway, we can’t say anything normative about that in the model document: it’s down
>>> to the definition of text/csv. Yakov might want to separately add a bidi flag as a media
>>> type parameter. I think it would be a bad idea to because media type parameters get lost
>>> so easily (leading to tables being interpreted completely differently), but it’s an
>>> option.
>>> 
>>> Jeni
>>> 
>>> ------------------------------------------------------
>>> From: Ivan Herman ivan@w3.org
>>> Reply: Ivan Herman ivan@w3.org
>>> Date: 9 May 2014 at 10:32:37
>>> To: Jeni Tennison jeni@jenitennison.com
>>> Cc: W3C CSV on the Web Working Group public-csv-wg@w3.org, Yakov Shafranovich yakov-ietf@shaftek.org
>>> Subject: Re: RTL Directionality use case #19 added
>>> 
>>>> Hi Jeni,
>>>> 
>>>>> On 08 May 2014, at 19:38 , Jeni Tennison wrote:
>>>>> 
>>>>> Ivan,
>>>>> 
>>>>> OK. I think the important thing to note in the model document is that, from a parsing standpoint,
>>>>> the normal parsing applies.
>>>>> 
>>>>> But given your comments I’m inclined to move the majority of the text into the metadata
>>>>> spec, to have the ability to set the default paragraph direction at the level of the table,
>>>>> columns and cells.
>>>> 
>>>> This actually raises the issue about the 'dialect' settings in section 5 of the syntax  
>>>> document and the metadata entries in the metadata document.
>>>> 
>>>> As far as I am concerned, the RTL/LTR flag for the table as a whole should be in the parsing
>>>> section: knowledge about this flag is indeed important to establish the correct
>>>> order of the columns. I.e., if the flag is RTL, that means the correct order, when turned into
>>>> an internal representation of the core tabular data, is to take the columns from the right.
>>>> I do not think that we should necessarily depend on the metadata to establish this step (although,
>>>> as information, it is probably a good idea to have that information in the metadata,
>>>> too).
>>>> 
>>>> I think a RTL/LTR flag for each column or for a cell in the metadata part indeed makes lots
>>>> of sense.
>>>> 
>>>> 
>>>>> Falling back on the Unicode Bidirectional Algorithm is the right thing to do in the case
>>>>> explicit bidi instructions aren’t found.
>>>> 
>>>> +1
>>>> 
>>>>> I’ll rework along those lines.
>>>>> 
>>>>> Jeni
>>>>> 
>>>>> ------------------------------------------------------
>>>>> From: Ivan Herman ivan@w3.org
>>>>> Reply: Ivan Herman ivan@w3.org
>>>>> Date: 8 May 2014 at 17:28:56
>>>>> To: Jeni Tennison jeni@jenitennison.com
>>>>> Cc: W3C CSV on the Web Working Group public-csv-wg@w3.org, Yakov Shafranovich yakov-ietf@shaftek.org  
>>>>> Subject: Re: RTL Directionality use case #19 added
>>>>> 
>>>>>> Hi Jeni,
>>>>>> 
>>>>>> thanks, I was the one nagging about this:-)
>>>>>> 
>>>>>>> On 08 May 2014, at 16:47 , Jeni Tennison wrote:
>>>>>>> 
>>>>>>> Hi,
>>>>>>> 
>>>>>>> I was tasked with adding something about RTL directionality based on use case #19.  
>>>>>>> 
>>>>>>> I’ve put something into the Model document here:
>>>>>>> 
>>>>>>> http://w3c.github.io/csvw/syntax/#bidirectionality-in-csv-files
>>>>>>> 
>>>>>>> Comments welcome.
>>>>>> 
>>>>>> Well... I am not sure the algorithm you describe will work easily in practice, at least
>>>>>> for the table directionality...
>>>>>> 
>>>>>> - Retrieving the directionality may be a fairly complex task at the user level. I must admit
>>>>>> I am not sure how I would do that, say, in Python off the top of my head (I say Python because
>>>>>> that is the programming language I am most familiar with)
>>>>>> 
>>>>>> - Just as HTML needs its own tags to handle bidi, don't we incur the danger of running into similar
>>>>>> problems? What if some of the headers and, to be really unlucky, both the rightmost
>>>>>> and the leftmost, are left-to-right text (e.g., mixing English and Hebrew headers)?
>>>>>> The algorithm might believe that the table is a LTR one, although it is, in fact, a RTL one,
>>>>>> or vice versa
>>>>>> 
>>>>>> Bottom line: I believe we need to add a separate flag in section 5 to indicate directionality,
>>>>>> the same way as directionality may be set explicitly for an HTML file. This should have
>>>>>> the highest priority in setting the table directionality
>>>>>> 
>>>>>> B.t.w., there may be a number of rows that are to be skipped. I presume those should not
>>>>>> come into play for the table directionality.
>>>>>> 
>>>>>> Thanks!
>>>>>> 
>>>>>> Ivan
>>>>>> 
>>>>>> P.S. We may want to ask the advice of somebody in the I18N community. Should I ask either
>>>>>> Felix or Richard, our two international guys in the team? Or should we send an official
>>>>>> note to the I18N WG? The latter is probably the best approach.
>>>>>> 
>>>>>> 
>>>>>>> 
>>>>>>> Jeni
>>>>>>> 
>>>>>>> ------------------------------------------------------
>>>>>>> From: Ivan Herman ivan@w3.org
>>>>>>> Reply: Ivan Herman ivan@w3.org
>>>>>>> Date: 6 May 2014 at 08:20:29
>>>>>>> To: Eric Stephan ericphb@gmail.com
>>>>>>> Cc: Jeremy Tandy jeremy.tandy@metoffice.gov.uk, Ceolin, D. d.ceolin@vu.nl, Yakov Shafranovich yakov-ietf@shaftek.org, W3C CSV on the Web Working Group public-csv-wg@w3.org
>>>>>>> Subject: Re: RTL Directionality use case #19 added
>>>>>>> 
>>>>>>>> Oops, sorry, I read the other mail first and I put my comment onto that one:-( Just for the
>>>>>>>> records:
>>>>>>>> 
>>>>>>>> I believe that use case should also lead to a separate requirement, something like any
>>>>>>>> parser should be informed about RTL and retrieve the headers accordingly for the CSV+.
>>>>>>>> 
>>>>>>>> Ivan
>>>>>>>> 
>>>>>>>>> On 06 May 2014, at 05:54 , Eric Stephan wrote:
>>>>>>>>> 
>>>>>>>>> I've made an attempt at the use case for the RTL directionality and
>>>>>>>>> checked in the edits in github.
>>>>>>>>> http://w3c.github.io/csvw/use-cases-and-requirements/index.html
>>>>>>>>> 
>>>>>>>>> What is still missing, is incorrect, and needs to be changed?
>>>>>>>>> 
>>>>>>>>> Thanks,
>>>>>>>>> 
>>>>>>>>> Eric Stephan
>>>>>>>> 
>>>>>>>> 
>>>>>>>> ----
>>>>>>>> Ivan Herman, W3C
>>>>>>>> Digital Publishing Activity Lead
>>>>>>>> Home: http://www.w3.org/People/Ivan/
>>>>>>>> mobile: +31-641044153
>>>>>>>> GPG: 0x343F1A3D
>>>>>>>> WebID: http://www.ivan-herman.net/foaf#me
>>>>>>> 
>>>>>>> --
>>>>>>> Jeni Tennison
>>>>>>> http://www.jenitennison.com/
> 
> --  
> Jeni Tennison
> http://www.jenitennison.com/


----
Ivan Herman, W3C 
Digital Publishing Activity Lead
Home: http://www.w3.org/People/Ivan/
mobile: +31-641044153
GPG: 0x343F1A3D
WebID: http://www.ivan-herman.net/foaf#me
Received on Sunday, 11 May 2014 08:21:40 UTC