RE: Gental Reminder: Re: Segmentation of Indian Languages viz UAX#29 [I18N-ACTION-484]

Hi,

The W3C I18N Working Group has given me an action item [1] to cross-post this discussion to our public-i18n-indic@ list. We believe the broader Indic script community is likely to be interested in this work, and we are very interested in documenting the various segmentation requirements as part of the embryonic “Indic Layout Requirements” [2] document that Swaran mentions below.

Because the Unicode editors list is a closed list, I have bcc'd that list when forwarding this note.

Addison

[1] http://www.w3.org/International/track/actions/484

[2] https://www.w3.org/International/wiki/Project_radar#Indic_Layout_Requirements

      http://w3c.github.io/ilreq/

Addison Phillips
Principal SDE, I18N Architect (Amazon)
Chair (W3C I18N WG)

Internationalization is not a feature.
It is an architecture.



From: Book [mailto:book-bounces@unicode.org] On Behalf Of Eric Muller
Sent: Wednesday, December 02, 2015 1:54 PM
To: Mark Davis ☕️; Swaran Lata
Cc: Somnath Chandra; Prashant verma; Unicode Book; Steven R. Loomis (Google+)
Subject: Re: Gental Reminder: Re: Segmentation of Indian Languages viz UAX#29

A couple of (personal) comments:

- the document makes no mention of the joiners in the syllable structure.

- as an implementer, with UAX#14 well established, I would much prefer to have a description of line breaking in those terms. As it stands, I am left having to work out whether a UAX#14 implementation matches the requirements or not.

- you indicate that hyphenation can occur at syllable boundaries. This seems to differ from the hyphenation patterns provided with OpenOffice. For example, those patterns prevent hyphenation before an avagraha, a Bengali khanda ta, or a Malayalam chillu. Also, I believe that hyphenation (= line breaking inside a word) in Malayalam may or may not result in the display of a hyphen (because so many lines are going to be hyphenated).

Eric Muller


On 12/2/2015 3:12 AM, Mark Davis ☕️ wrote:
Dear Swaran Lata,

I had some initial feedback, which I circulated for comment with a few people but didn't get any additional feedback.

You could submit your document for the UTC meeting in January, but I'd suggest that you address this initial feedback first and submit a revision, in the interests of time. The process for submitting a document, once you are ready, is described at http://www.unicode.org/pending/docsubmit.html#submit_email.

My feedback was:
1. Is this document for only scripts of India, or only modern scripts of India, or only Hindi, etc.? Presumably it would exclude Urdu.
2. In order for us to consider any change, we'd have to have the precise list of characters encompassed by each of your categories. Would the composition of the classes of Consonant, Vowel, etc. be derivable by looking at http://unicode.org/Public/UNIDATA/IndicSyllabicCategory.txt (and/or http://unicode.org/Public/UNIDATA/IndicPositionalCategory.txt)?
   a. If so, are they exactly the same, or are there particular exceptions?
3. Is there any practical limit to the {CH}C?
   a. Could one have 10 consonants linked that way, for example?
   b. Having a particular limit makes it much easier to support in certain algorithms.
4. Rule 3 has CH; would it not be {CH}CH?
   a. And again, is there a limit to the number?
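
As a rough illustration of questions 2 to 4 (this is not taken from the report): if the Consonant and Virama classes could be derived from IndicSyllabicCategory.txt, a bounded {CH}C could be checked with a regular expression. The Python sketch below rests on several assumptions: C and H are read as the file's Consonant and Virama categories, the two sample lines stand in for the full file, and parse_categories, cluster_pattern and max_links are hypothetical names.

```python
import re

# Two sample lines in the format of IndicSyllabicCategory.txt.
# Assumption: an illustrative Devanagari-only subset, not the full file.
SAMPLE = """\
0915..0939    ; Consonant # Lo  [37] DEVANAGARI LETTER KA..DEVANAGARI LETTER HA
094D          ; Virama # Mn       DEVANAGARI SIGN VIRAMA
"""


def parse_categories(text):
    """Map each category name to the set of characters listed for it."""
    classes = {}
    for line in text.splitlines():
        data = line.split("#", 1)[0].strip()  # drop the trailing comment
        if not data:
            continue
        code_range, category = (field.strip() for field in data.split(";"))
        first, _, last = code_range.partition("..")
        lo = int(first, 16)
        hi = int(last, 16) if last else lo
        chars = classes.setdefault(category, set())
        chars.update(chr(cp) for cp in range(lo, hi + 1))
    return classes


def char_class(chars):
    """Build a regex character class from a set of characters."""
    return "[" + "".join(re.escape(c) for c in sorted(chars)) + "]"


def cluster_pattern(classes, max_links):
    """Compile {CH}C with at most max_links repetitions of CH (hypothetical bound)."""
    c = char_class(classes["Consonant"])
    h = char_class(classes["Virama"])
    return re.compile(f"(?:{c}{h}){{0,{max_links}}}{c}")


classes = parse_categories(SAMPLE)
stra = "\u0938\u094D\u0924\u094D\u0930"  # SA + virama + TA + virama + RA
print(bool(cluster_pattern(classes, 2).fullmatch(stra)))  # True
print(bool(cluster_pattern(classes, 1).fullmatch(stra)))  # False
```

With a fixed max_links, the longest possible match is known in advance, which is the kind of simplification a stated limit would allow.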
Notes:
- There are various kinds of gaps or inconsistencies in the document; for example, the "rule separator" is not used, and Rule 3 doesn't appear in table 3.2.
- The end target of this could possibly be CLDR instead of a UAX, if it is more language-based than script-based, but the UTC can consider that.
Regards,

Mark

On Wed, Dec 2, 2015 at 10:17 AM, Swaran Lata <slata@deity.gov.in> wrote:
Dear Sir,

I hope you have looked at the report "Indic Text Segmentation" mentioned in the appended mail. We await your kind feedback.

With kind regards,

Swaran Lata,
Senior Director & HoD (TDIL Programme)
Department of Electronics and Information Technology (DeitY), MC&IT,
Government of India,
Electronics Niketan, 6, CGO Complex, Room No. 2072
Lodhi Road, New Delhi - 110 003
INDIA

-------- Original Message --------
From: "Swaran Lata" <slata@deity.gov.in>
Date: Nov 20, 2015 1:11:20 PM
Subject: Re: Segmentation of Indian Languages viz UAX#29
To: Mark Davis ☕️ <mark@macchiato.com>, "Steven R. Loomis (Google+)" <srl@icu-project.org>
Cc: Somnath Chandra <schandra@mit.gov.in>, Prashant verma <vermaprashant1@gmail.com>
Dear Mr. Mark and Mr. Steven,

We have examined UAX#29 for its applicability to Indian languages. We have prepared a report that broadly covers the following aspects of orthographic syllable boundaries in Indian languages:

  I. Additional information on Indic orthographic syllable boundaries, based on the tailored grapheme cluster defined in UAX#29
 II. An ABNF definition of valid segmentation for orthographic syllables in Indian languages
III. No-break rules for determining Indic syllable boundaries
 IV. Information for identifying boundaries for first-letter styling and vertical text
  V. Guiding principles of line breaking for Indian languages

A copy of the report is enclosed for your reference; if appropriate, you may kindly include it in the agenda of the next UTC meeting for discussion.

regards,
Swaran Lata,
Director & HoD (TDIL Programme)
Department of Electronics and Information Technology (DeitY), MC&IT,
Government of India,
Electronics Niketan, 6, CGO Complex, Room No. 2072
Lodhi Road, New Delhi - 110 003
INDIA


On 01/10/15 05:36 PM, Mark Davis ☕️ <mark@macchiato.com> wrote:
There are really three areas for segmentation. I've added Steven, who can give you more info, especially about http://uli.unicode.org/


If there are relatively simple rules that apply to classes of characters across scripts, they can be applied in UAX#29. In some simple cases, exceptions can also be made for particular scripts. For that you would submit a proposal to the UTC.

If there are relatively simple rules that apply to different languages, then you can submit a proposal to CLDR for the per-language rules.

There is a third option, which is the ULI project, which among other things has been focusing on segmentation that requires data-based processing. And I'll leave it to Steven to talk more about that.



Mark

— Il meglio è l’inimico del bene —

On Thu, Oct 1, 2015 at 11:55 AM, Swaran Lata <slata@deity.gov.in> wrote:

Dear Dr. Mark Davis,
            We have worked with the W3C Internationalization Group on defining the text segmentation requirements in the context of the multilingual web. Most of these requirements are applicable to UAX#29 and can be examined for incorporation there, to improve the text segmentation of Indian languages in the context of Unicode data. I am looking at the Mongolian Report as a reference.
            Kindly advise on the process to be followed for sending recommendations on this subject.
            With regards.

Swaran Lata,
Director & HoD (TDIL Programme)
Department of Electronics and Information Technology (DeitY), MC&IT,
Government of India,
Electronics Niketan, 6, CGO Complex, Room No. 2072
Lodhi Road, New Delhi - 110 003
INDIA



Telfax: +91-11-24363525
E-mail: slata@deity.gov.in


Received on Wednesday, 9 December 2015 13:15:40 UTC