- From: Frank Ellermann <nobody@xyzzy.claranet.de>
- Date: Fri, 21 Mar 2008 00:02:48 +0100
- To: www-international@w3.org
Richard Ishida wrote:

> Comments are being sought on this article prior to final release.

| the ASCII-look-alike bytes contained in UTF-16 and UTF-32 text
| might be a problem for some network devices or file processing
| tools.

s/might be/are/ is clearer in conjunction with "some". It is a
real problem, not hypothetical.

| Outgoing XML should always be encoded in UTF-8

Maybe add "or its proper subset US-ASCII", because that avoids
any potential problems with a text/xml Content-Type. Maybe say
this: "but note that US-ASCII is the default for Content-Type
text/xml".

| Examples are HTTP,

s/HTTP/HTTP and MIME/ as a Content-Type works for mail and news
as well. HTTP adopted it from MIME.

| the external encoding specification may duplicate one that's part
| of the byte sequence - that's a good thing

Dubious, it can be a pain when the info differs. Maybe "usually
a good thing" or similar (often, generally, typically, dunno, but
definitely not always).

| users commonly change the browser encoding

Why would they still do this ? This sounds as if written in 1996.

| Windows-1252, an extension of ISO-8859-1

Is "extension" strictly correct ? Or is it only a "variation" ?

| such as UTF-8, EUC-KR, ISO 2022-JP

US-ASCII is another prominent example allowing validation.

| emoji

http://en.wikipedia.org/w/index.php?title=Emoji&oldid=196580748
is the Permalink for this page when I looked at it, you find it
by following the "Cite this page" in the "Toolbar" (if you use
the default "skin", in essence a stylesheet). Plain Wikipedia
links are a moving target, not good enough for your article.

| ISO-8859-1 | Western European | 10% | 100% |

Interesting, where did you find 10% as a "typical expansion" ?

| representing completely different character sets from ASCII.

Maybe s/completely/completely or slightly/ for the old ISO 646
variants of US-ASCII. What's an example for "completely" ?
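
On the Windows-1252 question above, one way to see where the two
encodings actually differ is to compare them byte by byte. The
following is only a sketch in Python, assuming that Python's bundled
"cp1252" and "latin-1" codecs are faithful to the respective tables;
the helper names decode_or_none and differing_bytes are made up for
illustration.

```python
def decode_or_none(byte_value, codec):
    """Decode a single byte; return None if the codec leaves it undefined."""
    try:
        return bytes([byte_value]).decode(codec)
    except UnicodeDecodeError:
        return None


def differing_bytes():
    """Byte values that the cp1252 and latin-1 codecs decode differently."""
    return [
        b for b in range(256)
        if decode_or_none(b, "cp1252") != decode_or_none(b, "latin-1")
    ]


diff = differing_bytes()
# Every difference should fall in the 0x80..0x9F block.
print(f"differences span 0x{min(diff):02X}..0x{max(diff):02X}")
# 0x80 is the euro sign in cp1252, but a C1 control in Python's latin-1.
print(repr(decode_or_none(0x80, "cp1252")), repr(decode_or_none(0x80, "latin-1")))
```

If the comparison comes out as expected, the two agree everywhere
except 0x80..0x9F. Whether that makes Windows-1252 an "extension" or
a "variation" then depends on whether the C1 controls are counted as
part of ISO-8859-1, which the sketch cannot decide.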

| 0x0E, 0x0F, and 0x1B are not used

It could make sense to note that 0x1B 0x5B is a 7bit variant of
0x9B and harmless. 0x1B followed by 0x40..0x4D and 0x50..0x5F
might be all "harmless" representing 0x80..0x8D and 0x90..0x9F,
but I guess only CSI (0x9B) is relevant for legacy files. CSI
has nothing to do with ISO 2022 magic.

| UTF-7

For that you don't need Wikipedia, it's defined in RFC 2152, but
if you like Wikipedia better please use a Permalink (see above).

| "Œ" (Œ) or "€" (€).

s /"€"/"€"/ (= s/"€"/"&euro;"/ in the source)

| RI: As you mention in the next section, stripping doesn't
| always happen, and that can be problematic sometimes, eg.
| in PHP. Perhaps look at stripping in these two sections again.

That's apparently an editorial annotation.

Frank

Received on Thursday, 20 March 2008 23:01:10 UTC