- From: Kang-Hao (Kenny) Lu <kennyluck@csail.mit.edu>
- Date: Mon, 20 Jun 2011 17:45:07 +0800
- To: HTML WG <public-html@w3.org>, WWW International <www-international@w3.org>, Indic Community <public-i18n-indic@w3.org>, CJK discussion <public-i18n-cjk@w3.org>
I was recently surprised by the fact, if I understand correctly, that HTML5 counts Unicode code points for @maxlength, which means that a surrogate pair is counted as 1. However, WebKit's implementation[1] counts grapheme clusters instead, and other implementations simply use the length of the internal representation of the string, measured in 16-bit code units. I wonder if there's a rationale for the current choice of definition.

Some test cases that might help:

*Case 1* "𠂇", a non-BMP Unicode character I've never seen in my life
  spec: 1 (if I understand correctly)
  WebKit: 1
  others: 2
  The same holds for "\ud840\udc87", at least in ECMAScript5.

*Case 2* "\ud840", an unpaired surrogate code point
  spec: 1 (if I understand correctly)
  WebKit: 1
  others: 1

*Case 3* "A\u0301", an "A" with acute (non-normalized)
  spec: 2
  WebKit: 1
  others: 2

*Case 4* "นี้ก็ดี", some Thai text with 7 Unicode characters and 3 grapheme clusters
  spec: 7
  WebKit: 3
  others: 7

*Case 5* "विकिपीडिया", some Indic text with 10 Unicode characters and 5 grapheme clusters
  spec: 10
  WebKit: 5
  others: 10

== Methods of counting ==

1. Count the number of Unicode code points. Each unpaired surrogate code point counts as one. This, I suppose, is the current definition.
2. Count the number of grapheme clusters. This is WebKit's implementation when ICU is linked.
3. Simply count the length of the DOMString.

Interestingly, Twitter does Method 1 (but throws an error when encountering unpaired surrogates), while SMSGupShup, which seems to be a popular microblogging service in India, does Method 3 (and not 2), like any normal website.

What seems most important is whether users of complex scripts, for example the Indic community, would find it unnatural to count the length of a string in grapheme clusters. If that's the case, we should be able to rule out Method 2.
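For what it's worth, Methods 1 and 3 can be sketched in Python as below. This is only an illustration of the two counting rules, not how any browser implements @maxlength; the function names are made up for this example. Method 2 is left out because it needs a grapheme segmentation library (such as ICU), which the Python standard library does not provide.

```python
def count_code_points(s: str) -> int:
    """Method 1: count Unicode code points.

    A Python str is a sequence of code points, so a non-BMP character
    counts as 1 and an unpaired surrogate also counts as 1.
    """
    return len(s)


def count_utf16_units(s: str) -> int:
    """Method 3: count 16-bit code units, as a DOMString length would.

    'surrogatepass' allows unpaired surrogates to be encoded instead of
    raising UnicodeEncodeError.
    """
    return len(s.encode("utf-16-le", "surrogatepass")) // 2


# Test strings from Cases 1 to 4, written with escapes (hypothetical labels).
cases = {
    "case 1 (non-BMP character)": "\U00020087",
    "case 2 (unpaired surrogate)": "\ud840",
    "case 3 (A + combining acute)": "A\u0301",
    "case 4 (Thai text)": "\u0e19\u0e35\u0e49\u0e01\u0e47\u0e14\u0e35",
}

for label, text in cases.items():
    # Print only the counts; printing a lone surrogate itself may fail.
    print(label, count_code_points(text), count_utf16_units(text))
```

For Cases 1 to 4 this gives 1/2, 1/1, 2/2 and 7/7 (Method 1 / Method 3), which matches the "spec" and "others" figures in the test cases.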
I am not sure Method 1 is really such a good idea. If a microblogging service is connected to an SMS service, which normally uses UTF-16 for encoding[2], and limits the number of characters, you would be able to double the size of the message by using non-BMP characters. Using Method 1 might surprise users of non-BMP characters, but is that common enough? Even Weibo, the biggest microblogging service in China, ignores non-BMP characters and does Method 3. I am not sure why Twitter is playing smart here.

Thoughts? I am not very knowledgeable about use cases of @maxlength.

[1] https://bugzilla.mozilla.org/show_bug.cgi?id=535043
[2] http://en.wikipedia.org/wiki/SMS#Message_size

Cheers,
Kenny
Received on Monday, 20 June 2011 09:41:20 UTC