Re: Add scripts to XForms input-mode script list in Appendix E (PR#106) from Martin Duerst on 2008-06-13 (public-forms@w3.org from June 2008)

From: Martin Duerst <duerst@it.aoyama.ac.jp>
Date: Fri, 13 Jun 2008 16:21:49 +0900
To: "Steven Pemberton" <steven.pemberton@cwi.nl>, "John Boyer" <boyerj@ca.ibm.com>
Cc: "Richard Ishida" <ishida@w3.org>, "Felix Sasaki" <fsasaki@w3.org>, "Forms WG" <public-forms@w3.org>
Message-Id: <6.0.0.20.2.20080613155400.09bffe30@localhost>
Hello Steven,

This is a quick answer. Please feel free to ask for more details
if you need them.

At 23:23 08/06/12, Steven Pemberton wrote:
>Hi Martin,
>
>We are here at the Forms FtF and trying to come to some resolution on your  
>last call comment.
>
>Our basic problem (and why we originally asked if you would be willing to  
>do the work) is that we don't understand the algorithm you used to select  
>which entries in http://unicode.org/iso15924/iso15924-codes.html and  
>http://java.sun.com/j2se/1.4.2/docs/api/java/lang/Character.UnicodeBlock.html  
>should end up in the inputmode list.

ISO 15924:2004 (the first version of that standard)
wasn't finished when this list was created, so it's
irrelevant to what's currently in the list (but it's very relevant
and helpful for any updates). The list says it's based on Unicode 3.2,
which contained cherokee but not cypriot. As you may be able to
guess from http://unicode.org/iso15924/iso15924-codes.html,
even a script like Latin (surely one of the first to be added)
has an addition date of 2004-05-01. The bases for the original
list were http://www.unicode.org/Public/3.2-Update/Scripts-3.2.0.txt
and http://java.sun.com/j2se/1.4.2/docs/api/java/lang/Character.UnicodeBlock.html,
but not http://unicode.org/iso15924/iso15924-codes.html.

From Scripts-3.2.0.txt, I think I only excluded INHERITED, a clear
non-starter. I'm unable to find an actual example of a block name
that I used from Java. The Comment column makes clear where something
is special, either because it needs a clarification of meaning or
because it is 'made up' in the sense that it isn't taken from
the indicated sources.


>Could you give us some help here: what is it about particular entries in  
>those two lists that make them suitable or not for use as inputmode  
>values? Just to take one example early in the alphabet, why is cherokee in  
>the list, and cypriot not (both of which were added to the ISO list on the  
>same date).

Many of the entries in the Java source are not suitable because they
are blocks, not scripts. As an example, LATIN_EXTENDED_A is completely
useless as a set of letters to indicate any kind of input device.

Almost all entries in http://unicode.org/iso15924/iso15924-codes.html
are suitable, with very few exceptions:
- Font class distinctions useful for librarians but irrelevant to keyboards
  (Cyrillic (Old Church Slavonic variant), Latin (Fraktur variant),
   Latin (Gaelic variant), Syriac variants)
- Private use (Qaaa-Qabx)
- unXXX: Zxxx, Zyyy, Zzzz (Code for unwritten documents,
  Code for undetermined script, Code for uncoded script)

But damage should be minimal even if these are officially allowed
by a new version of the spec, because in practice, nobody will try
to use them.
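
As an illustration only, the exceptions listed above could be sketched
in Python roughly as follows. The function names are made up for this
sketch, and the concrete variant codes (Cyrs, Latf, Latg, Syre, Syrj,
Syrn) are an assumption taken from the ISO 15924 code table, not
something the XForms spec defines.

```python
import re

# Shape of an ISO 15924 alphabetic code: one upper-case letter, three lower-case.
CODE_SHAPE = re.compile(r"[A-Z][a-z]{3}")

# Font-class variants named above (codes assumed from the ISO 15924 table).
FONT_VARIANTS = {"Cyrs", "Latf", "Latg", "Syre", "Syrj", "Syrn"}

# Unwritten documents, undetermined script, uncoded script.
UNSPECIFIC = {"Zxxx", "Zyyy", "Zzzz"}


def is_private_use(code: str) -> bool:
    """True if code lies in the private-use range Qaaa..Qabx."""
    return CODE_SHAPE.fullmatch(code) is not None and "Qaaa" <= code <= "Qabx"


def is_sensible_for_inputmode(code: str) -> bool:
    """True if code has the right shape and is not one of the exceptions."""
    if CODE_SHAPE.fullmatch(code) is None:
        return False
    return not (code in FONT_VARIANTS or code in UNSPECIFIC or is_private_use(code))


# Hypothetical usage: a few codes, including one from each exception group.
for code in ["Cher", "Cprt", "Latf", "Qabc", "Zzzz"]:
    print(code, is_sensible_for_inputmode(code))
```

Such a filter is optional: leaving the exceptions allowed does little
harm, since nobody is likely to use them.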

Please also note that some of the scripts at
http://unicode.org/iso15924/iso15924-codes.html are not yet
encoded in Unicode. This is the case if a script can be very
clearly identified, but work is still ongoing on how to exactly
identify and code its characters. Again, it doesn't hurt to
allow use of these codes, because nobody will try to use them
as long as the characters aren't actually coded.

>If you can help us understand this, maybe we can understand how to  
>describe which future values are suitable for later addition.

I think that just using text along the lines below as I provided it
would be fine. You just have to make sure that you don't exclude
scripts that actually exist and may be useful in practice, even
in (relatively speaking) small numbers. If the spec allows some
script codes that don't make sense for input-mode (e.g. Zzzz,
Code for uncoded script), the potential damage is zero in my opinion.
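
To make the case rule in the proposed text (quoted further down)
concrete, here is a minimal Python sketch. It only checks the shape of
a token (first letter upper-case, remaining three lower-case) and shows
that the existing four-letter tokens 'thai' and 'user' cannot collide
with it. The helper names are hypothetical, and the sketch does not
check that a code is actually registered in ISO 15924.

```python
import re

# First letter upper-case, remaining three lower-case, as in the proposed text.
ISO15924_SHAPE = re.compile(r"[A-Z][a-z]{3}")

# Existing four-letter tokens mentioned in the quoted comments; all lower-case.
LEGACY_FOUR_LETTER = {"thai", "user"}


def to_script_token(code: str) -> str:
    """Normalize a four-letter code to the proposed case convention."""
    if len(code) != 4 or not code.isascii() or not code.isalpha():
        raise ValueError(f"not a four-letter script code: {code!r}")
    return code[0].upper() + code[1:].lower()


def looks_like_iso15924_token(token: str) -> bool:
    """Case-sensitive shape check; does not consult the ISO 15924 registry."""
    return ISO15924_SHAPE.fullmatch(token) is not None


# Hypothetical usage with the Tifinagh example from the quoted proposal.
print(to_script_token("tfng"))
print([looks_like_iso15924_token(t) for t in sorted(LEGACY_FOUR_LETTER)])
```

Because the comparison is case-sensitive, a lower-case legacy token
never matches the shape of an ISO 15924 code.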


Regards,   Martin.


>Thanks!
>
>Best wishes,
>
>Steven
>
>On Thu, 21 Feb 2008 07:48:58 +0100, Martin Duerst <duerst@it.aoyama.ac.jp>  
>wrote:
>
>> Hello John,
>>
>> Here's a resend of the mail I sent to Steven earlier this year.
>>
>>
>> At 13:34 08/01/10, Martin Duerst wrote:
>>> Hello Steven,
>>>
>>> Thanks for contacting me. Hope everything is well with you.
>>>
>>> [I cut out the thread because currently, my mailer seems to have
>>> occasional weird problems with sending long messages.]
>>>
>>> At 01:30 08/01/10, Steven Pemberton wrote:
>>>> Hi Martin,
>>>>
>>>> Any news on this?
>>>
>>> Well, yes and no. I have to admit that I had the editing token,
>>> and didn't act on it. I also have to admit that I was a bit demotivated
>>> by the fact that I did the actual work,
>>
>> That referred to my earlier creation of a list of script tokens that
>> needed to be added.
>>
>>
>>> and it would have been rather
>>> easy for somebody on your side to contribute, e.g. at least for cross-
>>> checking.
>>>
>>> But by chance, I got an idea that I think should meet all our
>>> concerns in a simple way. What we want to do is to add tokens
>>> for more scripts. You suggested that we simply say that other
>>> scripts are also allowed. I responded that because there are
>>> some irregularities/transformations with spelling, things are
>>> not so easy. I still believe that to be the case, but I agree
>>> that having to update the list by hand is work that we should
>>> try to avoid. The solution to this may be quite easy, actually:
>>>
>>> Use ISO 15924 four-letter script codes.
>>> (http://unicode.org/iso15924/iso15924-codes.html)
>>>
>>> As a result, I propose the following changes:
>>>
>>> In E.3.1, Script Tokens
>>> (http://www.w3.org/TR/2007/REC-xforms-20071029/#mode-scripts,
>>> or similar in whatever version of your spec that's actually affected),
>>> add at the end of the first paragraph, the following sentence:
>>>
>>>>>>>
>>> For scripts added to Unicode after version 3.2, use the four-letter
>>> ISO 15924 script code, with the first letter in upper-case and the
>>> remaining three letters in lower case.
>>>>>>>
>>
>> If we want to add an example, we could do that. Here is a proposal:
>> Add after the sentence above:
>>
>>>>>>
>> For example, the script token for Tifinagh (used in North Africa)
>> is Tfng.
>>>>>>
>>
>>
>>> Please add references as you see fit (different specs have somewhat
>>> differing traditions on how much and what to add as references).
>>> Here's a list of possible candidates:
>>>
>>> The standard itself, also available on the net at
>>> http://unicode.org/iso15924/standard/index.html.
>>> The registration authority Web page:
>>> http://unicode.org/iso15924/.
>>> The list of alphabetical codes:
>>> http://unicode.org/iso15924/iso15924-codes.html
>>>
>>> Comments:
>>> - The clause about case is necessary because these tokens are
>>>  case-sensitive.
>>> - There are currently two tokens with 4 letters, namely 'thai' and
>>>  'user'. But because they are all lower-case, there is no potential
>>>  for conflict.
>>>
>>> I think this is the fastest and cleanest (including future-proofness)
>>> way to deal with this issue, and I'm sure that somebody in your
>>> group can do the editing more quickly and safely than me, but I'd
>>> be extremely glad to do some proofreading and cross-checking.
>>>
>>> Regards,    Martin.
>>>
>>> #-#-#  Martin J. Du"rst, Assoc. Professor, Aoyama Gakuin University
>>> #-#-#  http://www.sw.it.aoyama.ac.jp     mailto:duerst@it.aoyama.ac.jp
>>
>> Regards,   Martin.
>>
>>
>> #-#-#  Martin J. Du"rst, Assoc. Professor, Aoyama Gakuin University
>> #-#-#  http://www.sw.it.aoyama.ac.jp      mailto:duerst@it.aoyama.ac.jp
>>
>
>


#-#-#  Martin J. Du"rst, Assoc. Professor, Aoyama Gakuin University
#-#-#  http://www.sw.it.aoyama.ac.jp       mailto:duerst@it.aoyama.ac.jp
Received on Friday, 13 June 2008 07:31:10 UTC