Review comments on Indic layout doc from Richard Ishida on 2014-09-29 (public-i18n-indic@w3.org from July to September 2014)

From: Richard Ishida <ishida@w3.org>
Date: Mon, 29 Sep 2014 20:44:31 +0100
To: indic <public-i18n-indic@w3.org>, prashant verma <vermaprashant1@gmail.com>, Somnath Chandra <schandra@mit.gov.in>
Message-ID: <5429B69F.10605@w3.org>
These are review comments on 
http://www.w3.org/International/docs/indic-layout/



First, structure of the document.

I suggest that section 5, ABNF segmentation, be moved to immediately 
after the introduction, since it is central to much of the rest of the 
document.

The title 'Issues in Indic Layout' is a throwback to a previous version 
of the document.  I think that if we keep that heading, we should change 
it to "Requirements for Indic Layout". However, the whole of the 
document is about requirements for Indic layout, so I suggest that we 
adopt the following organisation:

Introduction
Units of text in Indic Scripts
 Text segmentation
 Indic Syllable boundaries (this is the current ABNF section)
Line breaking
First letter styling
Letter spacing
Vertical arrangements...
Collation
then the end matter

====




Now some more detailed comments, per section.



Section 1.1 Indic language complexities

[1] the document should indicate what SI No means

[2] in addition to the link to South-Asian-Scripts, I suggest pointing 
the reader to Unicode Technical Note #10, An Introduction to Indic 
Scripts, the latest version of which is to be found at 
http://rishida.net/scripts/indic-overview/

[3] fig 1's picture is 3579x4493 pixels - far too big to be included in 
the document, and I've had problems downloading it even on the desktop. 
We should create a smaller version.  And by the way, are there any 
copyright issues in using it?



Section 1.2 Basic components of Indian languages

[1] section 1.2.1: "Unicode uses a 16 bit encoding that provides code 
point for more than 65000 characters (65536)."  This is waaay out of 
date. Unicode has over 1 million code points available (1,114,112 in 
total, U+0000 to U+10FFFF). Please correct.

[2] fig 3: again, we should check it's ok to use this

[3] section 1.2.4: "This section provides the basic alphabet system of 
Devanagari Script i.e Consonants, Vowels, Modifiers, Matras, Halant, 
Nukta etc."  should probably say "the basic alphabetic system of 
Devanagari script as used for Hindi"

[4] to be consistent, we should explain the function of the visarga and 
the halant (and probably mention that the latter is called virama by 
Unicode).

[5] section 1.2.1, CLDR: "It is a part of the W3C and Unicode Standard."
It's not a W3C standard.
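
To make the point in [1] concrete, here is a minimal sketch in Python. 
It only assumes the standard library; the helper name utf16_units is my 
own, for illustration. It shows the size of the code space, and that 16 
bits is just the size of one UTF-16 code unit, not a limit on the number 
of characters.

```python
import sys

# The Unicode code space runs from U+0000 to U+10FFFF inclusive.
CODE_SPACE_SIZE = 0x10FFFF + 1
print(CODE_SPACE_SIZE)

# Python exposes the highest code point it supports.
print(sys.maxunicode == 0x10FFFF)


def utf16_units(text):
    """Return how many 16-bit code units UTF-16 needs for text."""
    return len(text.encode("utf-16-le")) // 2


# DEVANAGARI LETTER KA is in the Basic Multilingual Plane: one unit.
print(utf16_units("\u0915"))

# U+10000 is the first code point beyond the BMP: a surrogate pair.
print(utf16_units("\U00010000"))
```
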



Section 2.1 First letter

[1] The first para, except the last sentence, and the para immediately 
after the 2 pictures are CSS-specific, and so should be removed from 
this document. (They may be useful in the other document that will map 
the requirements to technology in order to point out the delta.)

[2] "the sequence of characters in the first syllable is as follows in 
memory:"
I suggest:
"the sequence of characters in the first syllable as stored in memory is 
as shown at the top of Figure 4."

[3] "There are two default grapheme clusters here. The first includes 
the SA+VIRAMA+THA+I. (The second is the last two characters, T+II.)"
That is incorrect. There are three grapheme clusters (which is why this 
is problematic, of course): SA+VIRAMA, THA+I and TA+II.
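
The count in [3] can be checked with a small sketch. Note the 
assumptions: this is not a full implementation of the Unicode text 
segmentation rules, just a simplified approximation in which a 
combining mark (general category Mn, Mc or Me) attaches to the 
preceding character and everything else starts a new cluster. The 
function name approx_clusters is hypothetical.

```python
import unicodedata


def approx_clusters(text):
    """Group each base character with the combining marks that follow it.

    Simplified approximation of default grapheme clusters, for
    illustration only.
    """
    clusters = []
    for ch in text:
        is_mark = unicodedata.category(ch) in ("Mn", "Mc", "Me")
        if clusters and is_mark:
            clusters[-1] += ch      # mark joins the preceding cluster
        else:
            clusters.append(ch)     # anything else starts a new cluster
    return clusters


# SA + VIRAMA + THA + I + TA + II, written as escapes to be unambiguous.
word = "\u0938\u094D\u0925\u093F\u0924\u0940"

clusters = approx_clusters(word)
print(len(clusters))
for cluster in clusters:
    print(" + ".join(unicodedata.name(ch) for ch in cluster))
```

Under this approximation the virama does not pull the following 
consonant into its cluster, which is why first-letter styling based on 
grapheme clusters alone would pick up only SA+VIRAMA.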



Section 2.2 Letter Spacing

[1] <h3> markup is used for many lines of text in this section where it 
is not appropriate. Please remove/fix.

[2] Fig 6: I suggest turning this picture into prose, so that 
explanations can be added. There appear to be 3 approaches illustrated: 
one is segmentation by grapheme cluster, another by syllable, and I'm 
not at all sure what the third one is.

[3] I think the document needs to be clearer about which of the three 
approaches just mentioned are actually appropriate (all? some?), and 
give some idea of the frequency of use and what is preferred, or if that 
information is not available, to at least say so clearly.


Section 3 Text segmentation

[1] I think the following text would naturally sit after the rest of the 
text in this section:

"Word boundaries are used in a number of different contexts. The most 
familiar ones are selection (double-click mouse selection, or “move to 
next word” control-arrow keys), and “Whole Word Search” for search and 
replace. They are also used in database queries, to determine whether 
elements are within a certain number of words of one another. Some 
special sentence boundaries like the double poorna virama, possibly with 
numbers (as in Sanskrit text, shlokas etc.)"

[2] What is the requirement wrt word boundaries?  This isn't clear.

[3] "Solution :
Grapheme Cluster Boundaries: Indic Syllable definition [See section 5]
Possible Extension for handling some cases Mouse Selection: At Indic 
syllable and code point level"

This needs significant expansion.  (Note that grapheme cluster 
boundaries are not equivalent to the syllable definition.)

[4] Give a picture of the danda.  What about the double danda?

[5] "The precise determination of text elements may vary according to 
orthographic conventions for a given script or language."
In some scripts it also depends on the operation being applied by the 
application. Is that the case for Hindi?
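
To illustrate the note in [3] that grapheme cluster boundaries are not 
equivalent to the syllable definition, here is a rough sketch. It does 
NOT implement the ABNF in section 5. It assumes a deliberately 
simplified, hypothetical syllable rule in which a consonant following a 
virama stays in the same unit. The function name split_units and its 
join_after_virama flag are my own.

```python
import unicodedata

VIRAMA = "\u094D"  # DEVANAGARI SIGN VIRAMA (halant)


def split_units(text, join_after_virama):
    """Split text into units.

    With join_after_virama=False, only combining marks attach to the
    preceding character (a rough grapheme-cluster approximation).
    With join_after_virama=True, a character that follows a virama
    also stays in the same unit (a rough, assumed syllable rule).
    """
    units = []
    for ch in text:
        is_mark = unicodedata.category(ch) in ("Mn", "Mc", "Me")
        after_virama = bool(
            join_after_virama and units and units[-1].endswith(VIRAMA)
        )
        if units and (is_mark or after_virama):
            units[-1] += ch
        else:
            units.append(ch)
    return units


# SA + VIRAMA + THA + I + TA + II
word = "\u0938\u094D\u0925\u093F\u0924\u0940"

cluster_like = split_units(word, join_after_virama=False)
syllable_like = split_units(word, join_after_virama=True)
print(len(cluster_like), len(syllable_like))
```

The two segmentations give different boundaries for the same word, so 
the document should say which one each operation (selection, cursor 
movement, deletion) is expected to use.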



Section 4 Line breaking

[1] "Hyphens are used when a word remains incomplete at the end of a 
line while writing or when specifying a range."
This sentence is ambiguous wrt what follows. I suggest just dropping it.

[2] "Rule 2: The definition of Indic syllable may be used to break the 
line and a hyphen should be at the breaking point so that word can be 
read intuitively"
Can a Hindi word be broken at any syllable boundary? If so, we should 
say so.



Section 5 ABNF ...

[1] "needs to be evolved" -> "is provided here"

[2] "V(upper case) is complete vowel"
I think the generally used term is 'independent vowel', no?



Section 6 Contributors

[1] please remove the baroque styling.



Other editorial

There are still many editorial nits, such as spaces before punctuation, 
'is' instead of 'are', words running together, missing 's' in plurals, 
etc. It would be good to clean these up as we work through the text.



Hope that helps,
ri
Received on Monday, 29 September 2014 19:45:02 UTC