Re: Should @maxlength count 16-bits, grapheme clusters or Unicode code points? from Martin J. Dürst on 2011-06-21 (public-i18n-indic@w3.org from April to June 2011)

From: Martin J. Dürst <duerst@it.aoyama.ac.jp>
Date: Tue, 21 Jun 2011 12:56:35 +0900
To: "Kang-Hao (Kenny) Lu" <kennyluck@csail.mit.edu>
CC: HTML WG <public-html@w3.org>, WWW International <www-international@w3.org>, Indic Community <public-i18n-indic@w3.org>, CJK discussion <public-i18n-cjk@w3.org>
Message-ID: <4E001673.6030207@it.aoyama.ac.jp>

Hello Kenny,

As far as I remember, the main purpose of maxlength is client-side data 
validation in the sense that it avoids sending data to the server and 
then getting a message back saying "entry too long".

The main reason for an entry being too long is that it has to fit into a 
fixed-length field in a database. Then the question is how the length of 
that field is measured. The answer is "it depends". It may be bytes or 
code points of UTF-8, it may be bytes or code points of UTF-16, or it 
may be something else entirely (think e.g. Punycode).
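
To make the "it depends" concrete, here is a minimal Python sketch that
measures the same string three ways. It is only an illustration: the
function name measure_lengths is made up for this example, and which
count matters depends entirely on how the backend stores the field.

```python
def measure_lengths(text: str) -> dict:
    """Return several possible 'lengths' of the same string.

    Note: a string holding an unpaired surrogate cannot be encoded
    with these codecs and raises UnicodeEncodeError.
    """
    return {
        # Python's len() on str counts Unicode code points.
        "code_points": len(text),
        # Each UTF-16 code unit is two bytes; the "-le" codec adds no BOM.
        "utf16_code_units": len(text.encode("utf-16-le")) // 2,
        # Number of bytes when stored as UTF-8.
        "utf8_bytes": len(text.encode("utf-8")),
    }


# A non-BMP character (U+20087): one code point, two UTF-16 code units.
print(measure_lengths("\U00020087"))
# "A" followed by a combining acute accent (U+0301).
print(measure_lengths("A\u0301"))
```

A fixed-length column sized in UTF-8 bytes and a limit expressed in
UTF-16 code units can therefore disagree about the very same input.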

It's a typical case of a feature that made a lot of sense in an 
ASCII-only or ISO-8859(-1) only world, but just doesn't work easily for 
today's multilingual world.

With that in mind, I'd prefer if HTML5 talked about the limits of this 
feature rather than invested a lot of time to define it to every detail. 
Even if you can get it to work exactly the same in every browser, this 
still won't mean that it's really useful (other than as an 'about' value 
that still has to be checked on the server side).

As for microblogging services, for example, each of them can do what it wants. 
There was a talk (see 
http://www.unicodeconference.org/program-d.htm#S7-T2) last year at the 
Unicode conference where Matt Sanford explained what they did about 
internationalization, and how they measured characters, not code units 
or some such. The fact that Weibo doesn't count surrogate pairs as 
characters may be based on the fact that they use UTF-16 internally 
somewhere, or it may be just by accident. If I were a microblogging 
service, deciding on how to count surrogates wouldn't be high on my 
priority list; either way doesn't make much of a difference. As for 
unpaired surrogate code points, the right thing is to not let them enter 
the system. Garbage-in-garbage-out was never a good idea.
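
As a rough sketch of what "not letting them enter the system" could
look like, assuming a Python backend: once text has been decoded into a
Python str, a well-formed surrogate pair has already become a single
code point, so any code point left in the range U+D800..U+DFFF is an
unpaired surrogate. The names contains_surrogate and accept_input are
hypothetical, chosen only for this example.

```python
def contains_surrogate(text: str) -> bool:
    """True if any code point lies in the surrogate range U+D800..U+DFFF."""
    return any(0xD800 <= ord(ch) <= 0xDFFF for ch in text)


def accept_input(text: str) -> str:
    """Reject input with unpaired surrogates instead of storing it."""
    if contains_surrogate(text):
        raise ValueError("input contains an unpaired surrogate code point")
    return text


accept_input("\U00020087")    # fine: a real non-BMP character
try:
    accept_input("\ud840")    # a lone high surrogate is refused
except ValueError as err:
    print(err)
```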

Regards,   Martin.

P.S.: There will be another talk by Matt at this year's Unicode 
conference, please see 
http://www.unicodeconference.org/conference-at-a-glance.htm.

On 2011/06/20 18:45, Kang-Hao (Kenny) Lu wrote:
> I was recently surprised by the fact, if I understand correctly, that
> HTML5 counts Unicode code points for @maxlength, which means that a
> surrogate pair would be counted as 1. However, WebKit's
> implementation[1] counts grapheme clusters instead, and other
> implementations simply use the length of the internal representation of
> a string, which is measured in 16-bit code units. I wonder if there's a
> rationale for the current choice of definition.
>
> Some test cases that might help:
>
> *Case 1*
>
> "𠂇", a non-BMP Unicode character I've never seen in my life
>
> spec: 1 (if I understand correctly)
> WebKit: 1
> others: 2
>
> the same for "\ud840\udc87", at least in ECMAScript5
>
> *Case 2*
>
> "\ud840", an unpaired surrogate code point
>
> spec: 1 (if I understand correctly)
> WebKit: 1
> others: 1
>
> *Case 3*
>
> "A\u0301", an "A" with acute (non-normalized)
>
> spec: 2
> WebKit: 1
> others: 2
>
> *Case 4*
>
> "นี้ก็ดี" some Thai text with 7 Unicode characters and 3 grapheme clusters
>
> spec: 7
> WebKit: 3
> others: 7
>
> *Case 5*
>
> "विकिपीडिया" some Indic text with 10 Unicode characters and 5 grapheme
> clusters
>
> spec: 10
> WebKit: 5
> others: 10
>
>
> == Methods of counting ==
> 1. Count the number of Unicode code points. Each unpaired surrogate
> code point counts as one. This, I suppose, is the current definition.
> 2. Count the number of grapheme clusters. This is WebKit's
> implementation when ICU is linked.
> 3. Simply count the length of the DOMString.
>
> Interestingly, Twitter does Method 1 (but throws an error when
> encountering unpaired surrogates), but SMSGupShup, which seems to be a
> popular microblogging service in India, does Method 3 (and not 2), like
> any normal website.
>
> What seems to be most important is whether users of complex scripts,
> for example the Indic community, would find it unnatural to count the
> length of a string in grapheme clusters. If that's the case, we should
> be able to rule out Method 2.
>
> I am not sure Method 1 is really such a good idea because if a
> microblogging service is connected to an SMS service, which normally
> uses UTF-16 for encoding[2], and limits the number of characters, you
> would be able to double the size of the message by using non-BMP
> characters. Using Method 1 might give users of non-BMP characters
> surprises, but is that common enough? Even Weibo, the biggest
> microblogging service in China, ignores non-BMP characters and does
> Method 3. I am not sure why Twitter is playing smart here.
>
> Thoughts? I am not very knowledgeable about use cases of @maxlength.
>
> [1] https://bugzilla.mozilla.org/show_bug.cgi?id=535043
> [2] http://en.wikipedia.org/wiki/SMS#Message_size
>
>
> Cheers,
> Kenny
>
>

Received on Tuesday, 21 June 2011 03:57:24 UTC