Re: For review: Migrating to Unicode

Richard Ishida wrote:

> Comments are being sought on this article prior to final release.

| the ASCII-look-alike bytes contained in UTF-16 and UTF-32 text
| might be a problem for some network devices or file processing
| tools.

s/might be/are/ is clearer in conjunction with "some".  It is a
real problem, not hypothetical.
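
A minimal Python sketch of the effect, for illustration only.  The
sample character U+4E2F is my own choice, not taken from the article:
its UTF-16LE form is byte-for-byte the ASCII string "/N", while its
UTF-8 form contains no byte below 0x80.

```python
# Sample values are illustrative assumptions, not taken from the article.
cjk = "\u4e2f"                     # a CJK ideograph picked for its byte values

utf16 = cjk.encode("utf-16-le")    # bytes 0x2F 0x4E, i.e. ASCII "/" and "N"
utf8 = cjk.encode("utf-8")         # three bytes, every one of them >= 0x80


def ascii_lookalikes(data):
    """Return the byte values in data that fall in the 7-bit ASCII range."""
    return [b for b in data if b < 0x80]


print(ascii_lookalikes(utf16))     # byte values a 7-bit-minded tool could misread
print(ascii_lookalikes(utf8))      # nothing here looks like ASCII
print("A".encode("utf-16-le"))     # an ASCII letter drags a NUL byte along
```

A tool that scans for "/" as a path separator, or treats NUL as a
string terminator, would trip over the UTF-16 bytes but not over the
UTF-8 ones.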

| Outgoing XML should always be encoded in UTF-8

Maybe add "or its proper subset US-ASCII", because that avoids any
potential problems with a text/xml Content-Type.  Maybe say this:
"but note that US-ASCII is the default for Content-Type text/xml".

| Examples are HTTP,

s/HTTP/HTTP and MIME/, as a Content-Type works for mail and news as
well.  HTTP adopted it from MIME.

| the external encoding specification may duplicate one that's part
| of the byte sequence - that's a good thing

Dubious: it can be a pain when the two declarations differ.  Maybe
"usually a good thing" or similar (often, generally, typically,
dunno, but definitely not always).

| users commonly change the browser encoding

Why would they still do this?  This sounds as if it was written in 1996.

| Windows-1252, an extension of ISO-8859-1

Is "extension" strictly correct?  Or is it only a "variation"?
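
The question can be made concrete with a small Python sketch.  It
relies only on Python's own "cp1252" and "latin-1" codecs, so it shows
how those codecs behave and does not settle the terminology: the two
agree everywhere except 0x80..0x9F, where latin-1 yields C1 controls
and cp1252 yields printable characters (or nothing at all).

```python
def same_in_both(byte_value):
    """True if the byte decodes to the same character in both codecs."""
    raw = bytes([byte_value])
    try:
        return raw.decode("cp1252") == raw.decode("latin-1")
    except UnicodeDecodeError:
        # Python's cp1252 codec leaves a few bytes (e.g. 0x81) undefined.
        return False


# Collect every byte value on which the two codecs disagree.
differing = [b for b in range(0x100) if not same_in_both(b)]
print([hex(b) for b in differing])
```

If the 0x80..0x9F range counts as unassigned in ISO-8859-1, then
"extension" fits; if it counts as C1 controls, "variation" is the
safer word.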

| such as UTF-8, EUC-KR, ISO 2022-JP

US-ASCII is another prominent example allowing validation.
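
A hedged sketch of what "allowing validation" means in practice,
using Python's strict decoders.  The helper name decodes_as is made up
for this example.

```python
def decodes_as(data, encoding):
    """True if data is a well-formed byte sequence in the given encoding."""
    try:
        data.decode(encoding, errors="strict")
        return True
    except UnicodeDecodeError:
        return False


e_acute = "\u00e9".encode("utf-8")             # two bytes: 0xC3 0xA9

print(decodes_as(b"plain text", "ascii"))      # pure 7-bit input
print(decodes_as(e_acute, "ascii"))            # bytes >= 0x80 fail US-ASCII
print(decodes_as(e_acute, "utf-8"))            # well-formed UTF-8
print(decodes_as(b"\xe9", "utf-8"))            # lone lead byte, malformed
print(decodes_as(bytes(range(256)), "latin-1"))  # accepts any byte at all
```

The last line is the contrast: an encoding in which every byte
sequence is valid cannot be validated this way.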

| emoji

http://en.wikipedia.org/w/index.php?title=Emoji&oldid=196580748
is the permalink for this page as it was when I looked at it.  You
find it by following "Cite this page" in the "Toolbar" (if you use
the default "skin", in essence a stylesheet).  Plain Wikipedia
links are a moving target, not good enough for your article.

| ISO-8859-1 | Western European |  10% | 100% |

Interesting, where did you find 10% as a "typical expansion"?
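
For reference, the arithmetic behind such a figure can be sketched in
Python.  The sample strings are my own and say nothing about where
the article's 10% comes from: each character above 0x7F costs one
extra byte in UTF-8, so the expansion equals the share of non-ASCII
characters in the text.

```python
def utf8_expansion(text):
    """Relative size increase when ISO-8859-1 text is re-encoded as UTF-8.

    Assumes text is non-empty and fully representable in ISO-8859-1.
    """
    legacy_size = len(text.encode("latin-1"))
    utf8_size = len(text.encode("utf-8"))
    return utf8_size / legacy_size - 1.0


print(utf8_expansion("abc"))                 # ASCII only: no growth
print(utf8_expansion("caf\u00e9"))           # one character in four is non-ASCII
print(utf8_expansion("\u00e9\u00e8"))        # all non-ASCII: the 100% maximum
```

On this reading, 10% would correspond to text in which roughly one
character in ten lies outside US-ASCII.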

| representing completely different character sets from ASCII.

Maybe s/completely/completely or slightly/ for the old ISO 646
variants of US-ASCII.  What's an example of "completely"?

| 0x0E, 0x0F, and 0x1B are not used

It could make sense to note that 0x1B 0x5B is a 7-bit variant of
0x9B and harmless.  0x1B followed by 0x40..0x4D or 0x50..0x5F
might all be "harmless", representing 0x80..0x8D and 0x90..0x9F,
but I guess only CSI (0x9B) is relevant for legacy files.  CSI
has nothing to do with ISO 2022 magic.
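
The mapping described above is simply "second byte plus 0x40".  A
minimal Python sketch, with a made-up helper name:

```python
ESC = 0x1B


def c1_equivalent(second_byte):
    """Map the byte following ESC (0x40..0x5F) to its 8-bit C1 control."""
    if not 0x40 <= second_byte <= 0x5F:
        raise ValueError("not a 7-bit representation of a C1 control")
    return second_byte + 0x40


print(hex(c1_equivalent(0x5B)))    # ESC 0x5B is the 7-bit form of CSI
print(hex(c1_equivalent(0x4E)))    # SS2, one of the two values skipped above
print(hex(c1_equivalent(0x4F)))    # SS3, the other one
```

The two values left out of the ranges above, 0x4E and 0x4F, map to
the single shifts SS2 (0x8E) and SS3 (0x8F), which presumably is why
they are not in the "harmless" list.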

| UTF-7

For that you don't need Wikipedia, it's defined in RFC 2152, but
if you like Wikipedia better please use a Permalink (see above).

| "Œ" (Œ) or "€" (€).

s /"€"/"€"/ (= s/"€"/"€"/ in the source)

| RI: As you mention in the next section, stripping doesn't
| always happen, and that can be problematic sometimes, eg.
| in PHP. Perhaps look at stripping in these two sections again.

That's apparently an editorial annotation.

 Frank

Received on Thursday, 20 March 2008 23:01:10 UTC