- From: Christophe Strobbe <christophe.strobbe@esat.kuleuven.ac.be>
- Date: Fri, 22 Apr 2005 02:13:29 +0200
- To: wendy@w3.org, wai-gl <w3c-wai-gl@w3.org>
At 18:08 20/04/2005, Wendy Chisholm wrote:
>(...)
>
>Draft definitions (not quite proposals):
>
> * text - A sequence of characters. Characters are those included in
>   the Unicode character set. Refer to Characters (in Extensible
>   Markup Language (XML) 1.1) for more information about the accepted
>   character range.

During today's conference call (or yesterday's, in my case), Wendy clarified that any character set may be used as long as all characters in the set can be mapped to Unicode characters. Although Unicode wants to define "a universal character set that defines all the characters needed for writing the majority of living languages in use on computers" (quoted from Wendy's draft definition of Unicode), we should bear in mind that this work is not finished.

I propose rephrasing the definition of text as:

"A sequence of characters. Characters are those included in the Unicode character set or any other character set or character encoding scheme registered with the Internet Assigned Numbers Authority."

I write "other character set or character encoding scheme" because the two should not be confused. ISO/IEC 10646 is a character set; UTF-8, UTF-16 and UTF-32 are character encoding schemes (often simply called "encodings"). The Unicode Consortium defines Unicode as a "character encoding scheme", although it defines a character set as well as encoding schemes.

Wendy's definition was based on the XML 1.1 spec (see [1], where the definition of character only takes into account Unicode and ISO/IEC 10646). However, elsewhere, the XML spec also takes the existence of other "encodings" into account (see [6], which talks about external parsed entities). Some of the examples given there ("ISO-2022-JP", "Shift_JIS" and "EUC-JP") are encodings of JIS character sets, not Unicode. Similarly, the WCAG definition of "text" should reflect the existence of other recognised character sets/encoding schemes.
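To illustrate the difference between a character set and a character encoding scheme, here is a minimal Python sketch. It is only an illustration, not part of the proposed definition. The sample string and the function name decode_to_code_points are my own assumptions; the codec names are the ones Python's standard library uses for the three encodings cited from the XML spec.

```python
# Sketch: one piece of text, several encoding schemes, one set of code points.
# The sample string and the helper name are illustrative assumptions.

SAMPLE = "日本語"  # three characters that exist in both JIS X 0208 and Unicode

# Python codec names for ISO-2022-JP, Shift_JIS and EUC-JP, plus UTF-8.
ENCODINGS = ["iso2022_jp", "shift_jis", "euc_jp", "utf-8"]


def decode_to_code_points(data: bytes, encoding: str) -> list[str]:
    """Decode bytes with the given scheme and list the Unicode code points."""
    return [f"U+{ord(ch):04X}" for ch in data.decode(encoding)]


# Each encoding scheme produces a different byte sequence for the same text...
encoded = {name: SAMPLE.encode(name) for name in ENCODINGS}

# ...but every byte sequence maps back to the same Unicode characters.
code_points = {
    name: decode_to_code_points(data, name) for name, data in encoded.items()
}

for name in ENCODINGS:
    print(f"{name:12} {encoded[name].hex(' '):40} {' '.join(code_points[name])}")
```

The bytes differ per encoding, while the decoded code points are identical. This only shows the case where a mapping to Unicode exists; it says nothing about characters that some users consider poorly served by Han unification.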
With the proposed definition, we don't need to worry about characters from Chinese, Japanese or Korean that might not have been properly handled in Unicode's "Han unification" according to some users of these languages. See [2] for one such view, and [3] and [4] for more background. [5] is the Unicode Consortium's version of the "Han Unification History".

Regards,

Christophe Strobbe

[1] http://www.w3.org/TR/2004/REC-xml11-20040204/#charsets
[2] http://www.hastingsresearch.com/net/04-unicode-limitations.shtml (Why Unicode Won't Work on the Internet: Linguistic, Political, and Technical Limitations). Note that some of the controversy about Han unification seems to be caused by misunderstandings.
[3] http://tclab.kaist.ac.kr/~otfried/Mule/unihan.html (Han Unification and Unicode)
[4] http://encyclopedia.laborlawtalk.com/Han_unification
[5] http://www.unicode.org/book/appA.pdf
[6] http://www.w3.org/TR/2004/REC-xml11-20040204/#charencoding

>(...)
>
>--
>wendy a chisholm
>world wide web consortium
>web accessibility initiative
>http://www.w3.org/WAI/
>/--
Received on Friday, 22 April 2005 00:14:00 UTC