- From: Richard Ishida <ishida@w3.org>
- Date: Mon, 29 Sep 2014 20:44:31 +0100
- To: indic <public-i18n-indic@w3.org>, prashant verma <vermaprashant1@gmail.com>, Somnath Chandra <schandra@mit.gov.in>
These are review comments on http://www.w3.org/International/docs/indic-layout/

First, structure of the document.

I suggest that section 5, ABNF segmentation, be moved to immediately after the introduction, since it is central to much of the rest of the document.

The title 'Issues in Indic Layout' is a throwback to a previous version of the document. I think that if we keep that heading, we should change it to "Requirements for Indic Layout". However, the whole of the document is about requirements for Indic layout, so I suggest that we adopt the following organisation:

    Introduction
    Units of text in Indic Scripts
    Text segmentation
    Indic Syllable boundaries (this is the current ABNF section)
    Line breaking
    First letter styling
    Letter spacing
    Vertical arrangements...
    Collation
    then the end matter

====

Now some more detailed comments, per section.

Section 1.1 Indic language complexities

[1] the document should indicate what SI No means

[2] in addition to the link to South-Asian-Scripts, I suggest pointing the reader to Unicode Technical Note #10, An Introduction to Indic Scripts, the latest version of which is to be found at http://rishida.net/scripts/indic-overview/

[3] fig 1's picture is 3579x4493 pixels - far too big to be included in the document, and I've had problems downloading it even on the desktop. We should create a smaller version. And by the way, are there any copyright issues in using it?

Section 1.2 Basic components of Indian languages

[1] section 1.2.1: "Unicode uses a 16 bit encoding that provides code point for more than 65000 characters (65536)." This is waaay out of date. Unicode has over 1 million codepoints available. Please correct.

[2] fig 3: again, we should check it's ok to use this

[3] section 1.2.4: "This section provides the basic alphabet system of Devanagari Script i.e Consonants, Vowels, Modifiers, Matras, Halant, Nukta etc."
should probably say "the basic alphabetic system of Devanagari script as used for Hindi"

[4] to be consistent, we should explain the function of the visarga and the halant (and probably mention that the latter is called virama by Unicode).

[5] section 1.2.1, CLDR: "It is a part of the W3C and Unicode Standard." It's not a W3C standard.

Section 2.1 First letter

[1] The first para, except the last sentence, and the para immediately after the 2 pictures are CSS-specific, and so should be removed from this document. (They may be useful in the other document that will map the requirements to technology in order to point out the delta.)

[2] "the sequence of characters in the first syllable is as follows in memory:" I suggest: "the sequence of characters in the first syllable as stored in memory is as shown at the top of Figure 4."

[3] "There are two default grapheme clusters here. The first includes the SA+VIRAMA+THA+I. (The second is the last two characters, T+II.)" That is incorrect. There are three grapheme clusters (which is why this is problematic, of course): SA+VIRAMA, THA+I and TA+II.

Section 2.2 Letter Spacing

[1] <h3> markup is used for many lines of text in this section where it is not appropriate. Please remove/fix.

[2] Fig 6: I suggest turning this picture into prose, so that explanations can be added. There appear to be 3 approaches illustrated: one is segmentation by grapheme cluster, another by syllable, and I'm not at all sure what the third one is.

[3] I think the document needs to be clearer about which of the three approaches just mentioned are actually appropriate (all? some?), and give some idea of the frequency of use and what is preferred, or if that information is not available, to at least say so clearly.

Section 3 Text segmentation

[1] I think the following text would naturally sit after the rest of the text in this section: "Word boundaries are used in a number of different contexts.
The most familiar ones are selection (double-click mouse selection, or “move to next word” control-arrow keys), and “Whole Word Search” for search and replace. They are also used in database queries, to determine whether elements are within a certain number of words of one another. Some special sentence boundaries like the double poorna virama, possibly with numbers (as in Sanskrit text, shlokas etc.)"

[2] What is the requirement wrt word boundaries? This isn't clear.

[3] "Solution : Grapheme Cluster Boundaries: Indic Syllable definition [See section 5] Possible Extension for handling some cases Mouse Selection: At Indic syllable and code point level" This needs significant expansion. (Note that grapheme cluster boundaries are not equivalent to the syllable definition.)

[4] Give a picture of the danda. What about the double danda?

[5] "The precise determination of text elements may vary according to orthographic conventions for a given script or language." In some scripts it also depends on the operation being applied by the application. Is that the case for Hindi?

Section 4 Line breaking

[1] "Hyphens are used when a word remains incomplete at the end of a line while writing or when specifying a range." This sentence is ambiguous wrt what follows. I suggest just dropping it.

[2] "Rule 2: The definition of Indic syllable may be used to break the line and a hyphen should be at the breaking point so that word can be read intuitively" Can a Hindi word be broken at any syllable boundary? If so, we should say so.

Section 5 ABNF ...

[1] "needs to be evolved" -> "is provided here"

[2] "V(upper case) is complete vowel" I think the generally used term is 'independent vowel', no?

Section 6 Contributors

[1] please remove the baroque styling.

Other editorial

There are still many editorial nits, such as spaces before punctuation, 'is' instead of 'are', words running together, missing 's' in plurals, etc. It would be good to clean these up as we work through the text.
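
As a rough illustration of the comments on section 1.2.1 (size of the Unicode code space) and section 2.1 [3] (three grapheme clusters rather than two), here is a minimal Python sketch. It is only an approximation: the function name approx_grapheme_clusters and the rule it uses (attach every combining mark to the preceding base character) are assumptions made for this example, not the full UAX #29 algorithm, and the sample string is assembled from the character names quoted in the comment above rather than copied from the document.

```python
import sys
import unicodedata

# Size of the Unicode code space: U+0000 through U+10FFFF.
CODE_SPACE_SIZE = sys.maxunicode + 1


def approx_grapheme_clusters(text):
    """Split text into approximate grapheme clusters.

    Simplifying assumption of this sketch: a new cluster starts at every
    character that is not a combining mark (general category Mn, Mc or Me).
    This is not a complete implementation of UAX #29.
    """
    clusters = []
    for ch in text:
        if clusters and unicodedata.category(ch).startswith("M"):
            # Combining mark: attach it to the current cluster.
            clusters[-1] += ch
        else:
            # Base character (or first character): start a new cluster.
            clusters.append(ch)
    return clusters


# SA + VIRAMA + THA + VOWEL SIGN I + TA + VOWEL SIGN II
WORD = "\u0938\u094d\u0925\u093f\u0924\u0940"

if __name__ == "__main__":
    print(CODE_SPACE_SIZE)
    for cluster in approx_grapheme_clusters(WORD):
        print([unicodedata.name(c) for c in cluster])
```

Under this simplified rule the sample splits into SA+VIRAMA, THA+I and TA+II. A complete segmentation library may group the characters differently, in particular around conjuncts, so the sketch should be read as an aid to the discussion rather than as a reference implementation.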

Hope that helps,
ri
Received on Monday, 29 September 2014 19:45:02 UTC