Should @maxlength count 16-bits, grapheme clusters or Unicode code points? from Kang-Hao (Kenny) Lu on 2011-06-20 (public-i18n-cjk@w3.org from April to June 2011)

From: Kang-Hao (Kenny) Lu <kennyluck@csail.mit.edu>
Date: Mon, 20 Jun 2011 17:45:07 +0800
To: HTML WG <public-html@w3.org>, WWW International <www-international@w3.org>, Indic Community <public-i18n-indic@w3.org>, CJK discussion <public-i18n-cjk@w3.org>
Message-ID: <4DFF16A3.8000606@csail.mit.edu>

I was recently surprised to learn that, if I understand correctly,
HTML5 counts Unicode code points for @maxlength, which means that a
surrogate pair is counted as 1. However, WebKit's
implementation[1] counts grapheme clusters instead, and other
implementations simply use the length of the internal representation of
the string, measured in 16-bit code units. I wonder if there is a
rationale for the current choice of definition.

Some test cases that might help:

*Case 1*

"𠂇", a non-BMP Unicode character I've never seen in my life

spec: 1 (if I understand correctly)
WebKit: 1
others: 2

the same for "\ud840\udc87", at least in ECMAScript5

*Case 2*

"\ud840", an unpaired surrogate code point

spec: 1 (if I understand correctly)
WebKit: 1
others: 1

*Case 3*

"A\u0301", an "A" with acute (non-normalized)

spec: 2
WebKit: 1
others: 2

*Case 4*

"นี้ก็ดี" some Thai text with 7 Unicode characters and 3 grapheme clusters

spec: 7
WebKit: 3
others: 7

*Case 5*

"विकिपीडिया" some Indic text with 10 Unicode characters and 5 grapheme
clusters

spec: 10
WebKit: 5
others: 10

== Methods of counting ==
1. Count the number of Unicode code points. Each unpaired surrogate code
point counts as one. This, I suppose, is the current definition.
2. Count the number of grapheme clusters. This is WebKit's
implementation when ICU is linked.
3. Simply count the length of the DOMString.
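
As a rough illustration of the three methods, here is a minimal sketch in JavaScript. The function names are made up for this example. Method 2 assumes an engine that provides Intl.Segmenter, a much newer API than anything discussed in this thread, and its extended grapheme clusters may not match what WebKit with ICU actually does.

```javascript
// Method 3: length of the DOMString, i.e. the number of 16-bit code units.
function countCodeUnits(s) {
  return s.length;
}

// Method 1: Unicode code points. The string iterator walks by code point
// and yields each unpaired surrogate as its own element, so it counts as one.
function countCodePoints(s) {
  return Array.from(s).length;
}

// Method 2: grapheme clusters (assumption: Intl.Segmenter is available;
// it segments into extended grapheme clusters).
function countGraphemes(s) {
  const segmenter = new Intl.Segmenter(undefined, { granularity: "grapheme" });
  return Array.from(segmenter.segment(s)).length;
}

// Cases 1 to 3 from above, written with escapes.
const cases = ["\ud840\udc87", "\ud840", "A\u0301"];
for (const s of cases) {
  console.log(
    JSON.stringify(s),
    "code points:", countCodePoints(s),
    "graphemes:", countGraphemes(s),
    "code units:", countCodeUnits(s)
  );
}
```

For "\ud840\udc87" this gives 1 code point and 2 code units, and for "A\u0301" it gives 2 code points, 1 grapheme cluster and 2 code units, matching the "spec", "WebKit" and "others" rows of Cases 1 and 3.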

Interestingly, Twitter does Method 1 (but throws an error when it
encounters unpaired surrogates), whereas SMSGupShup, which seems to be a
popular microblogging service in India, does Method 3 (and not 2), like
any normal website.
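
The behaviour described for Twitter (Method 1, with an error on unpaired surrogates) could look roughly like the sketch below. This is only a guess at the observable behaviour, not Twitter's actual code, and strictCodePointLength is a hypothetical name.

```javascript
// Hypothetical helper: count code points (Method 1), but reject any
// string that contains an unpaired surrogate.
function strictCodePointLength(s) {
  let count = 0;
  // for...of walks the string by code point; a well-formed surrogate pair
  // arrives as one two-unit element, a lone surrogate as a one-unit element.
  for (const ch of s) {
    const unit = ch.charCodeAt(0);
    if (ch.length === 1 && unit >= 0xd800 && unit <= 0xdfff) {
      throw new RangeError("unpaired surrogate at code point index " + count);
    }
    count += 1;
  }
  return count;
}
```

With this, "\ud840\udc87" has length 1, while "\ud840" on its own is rejected instead of being counted as 1.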

What seems to be most important is whether users of complex scripts,
for example the Indic community, would find it unnatural to count the
length of a string in grapheme clusters. If that's the case, we should
be able to rule out Method 2.

I am not sure Method 1 is really such a good idea, because if a
microblogging service is connected to an SMS service, which normally
uses UTF-16 for encoding[2], and limits the number of characters, you
would be able to double the size of the message by using non-BMP
characters. Using Method 1 might give users of non-BMP characters
surprises, but is that common enough? Even Weibo, the biggest
microblogging service in China, ignores non-BMP characters and does
Method 3. I am not sure why Twitter is playing smart here.
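
To make the doubling concrete, here is a small sketch. The limit of 70 and the function names are assumptions for illustration only; the point is that two messages with the same Method 1 length can differ by a factor of two in UTF-16 size.

```javascript
// Hypothetical limit, counted in code points (Method 1).
const LIMIT = 70;

// True if the message is within the limit when counted by code points.
function fitsByCodePoints(s, limit) {
  return Array.from(s).length <= limit;
}

// Size in bytes when encoded as UTF-16: two bytes per 16-bit code unit.
function utf16ByteLength(s) {
  return s.length * 2;
}

const bmpMessage = "\u4e2d".repeat(LIMIT);          // BMP characters only
const nonBmpMessage = "\ud840\udc87".repeat(LIMIT); // non-BMP characters only

// Both messages pass the Method 1 check...
console.log(fitsByCodePoints(bmpMessage, LIMIT), fitsByCodePoints(nonBmpMessage, LIMIT));
// ...but the second one is twice as large on the wire: 140 vs. 280 bytes.
console.log(utf16ByteLength(bmpMessage), utf16ByteLength(nonBmpMessage));
```

Under Method 3 the second message would have length 140 and be rejected by the same limit.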

Thoughts? I am not very knowledgeable about use cases of @maxlength.

[1] https://bugzilla.mozilla.org/show_bug.cgi?id=535043
[2] http://en.wikipedia.org/wiki/SMS#Message_size

Cheers,
Kenny

Received on Monday, 20 June 2011 09:41:20 UTC