Re: Thinking aloud...Definitions (pre-Guideline 1.1 summary)

At 18:08 20/04/2005, Wendy Chisholm wrote:
>(...)
>
>Draft definitions (not quite proposals):
>
>   * text - A sequence of characters. Characters are those included in
>     the Unicode character set. Refer to Characters (in Extensible
>     Markup Language (XML) 1.1) for more information about the accepted
>     character range.

During today's conference call (or yesterday's, in my case), 
Wendy clarified that any character set may be used
as long as all characters in the set can be mapped to Unicode characters.
Although Unicode aims to define "a universal character set that defines all
the characters needed for writing the majority of living languages
in use on computers" (quoted from Wendy's draft definition of Unicode),
we should bear in mind that this work is not finished. I therefore propose
rephrasing the definition of text as:

"A sequence of characters. Characters are those included in the Unicode 
character set or any other character set or character encoding scheme 
registered with the Internet Assigned Numbers Authority."


I write "other character set or character encoding scheme" because the two
should not be confused. ISO/IEC 10646 is a character set; UTF-8, UTF-16
and UTF-32 are character encoding schemes (often simply called "encodings"). 
The Unicode Consortium defines Unicode as a "character encoding scheme",
although the Unicode Standard defines a character set as well as
encoding schemes.
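
To make the distinction concrete, here is a small Python sketch (my own
illustration, not taken from any of the specs). It takes one character,
shows its single code point in the character set, and then shows the
different byte sequences that three encoding schemes produce for it.
The sample character is an arbitrary choice.

```python
# One character from the coded character set (ISO/IEC 10646 / Unicode).
# The sample character is an arbitrary choice for illustration.
ch = "\u6f22"  # the Han character U+6F22

# The character set assigns it exactly one code point.
code_point = ord(ch)

# Each encoding scheme serialises that same code point to different bytes.
# The "-be" suffix fixes big-endian byte order so no byte order mark is added.
encoded = {name: ch.encode(name) for name in ("utf-8", "utf-16-be", "utf-32-be")}

for name, data in encoded.items():
    print(f"U+{code_point:04X} in {name}: {data.hex()}")
```

The code point stays the same in all three cases; only the bytes differ.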

Wendy's definition was based on the XML 1.1 spec (see [1], where the 
definition of character only takes into account Unicode and ISO/IEC 10646).
However, elsewhere, the XML spec also takes the existence of other
"encodings" into account (see [6], which talks about external parsed
entities). Some of the examples given there ("ISO-2022-JP", "Shift_JIS"
and "EUC-JP") are encodings of JIS character sets, not Unicode.
Similarly, the WCAG definition of "text" should reflect the existence
of other recognised character sets/encoding schemes.
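
As a rough illustration of text in these encodings being mapped to
Unicode, the following Python sketch (my own example, with an arbitrary
sample string) encodes the same text with Python's codecs for the three
encodings mentioned above and decodes it again. It assumes that Python's
"iso2022_jp", "shift_jis" and "euc_jp" codecs correspond to the
IANA-registered names.

```python
# Arbitrary sample text; any string of characters found in JIS X 0208 would do.
text = "日本語"

# Python codec names, assumed to correspond to the IANA names
# "ISO-2022-JP", "Shift_JIS" and "EUC-JP".
jis_codecs = ("iso2022_jp", "shift_jis", "euc_jp")

# The same characters serialise to different bytes in each encoding...
jis_bytes = {name: text.encode(name) for name in jis_codecs}

# ...and decoding maps each byte sequence back to the same Unicode characters.
jis_decoded = {name: data.decode(name) for name, data in jis_bytes.items()}

for name in jis_codecs:
    code_points = " ".join(f"U+{ord(c):04X}" for c in jis_decoded[name])
    print(f"{name}: {jis_bytes[name].hex()} -> {code_points}")
```

This only shows the round trip for characters that have a Unicode
mapping; it says nothing about characters that users consider to be
missing or wrongly unified.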

With the proposed definition, we don't need to worry about Chinese,
Japanese or Korean characters that, according to some users of these
languages, were not properly handled in Unicode's "Han unification".
See [2] for one such view, and [3] and [4] for more background.
[5] is the Unicode Consortium's version of the "Han Unification History".


Regards,
Christophe Strobbe


[1] http://www.w3.org/TR/2004/REC-xml11-20040204/#charsets
[2] http://www.hastingsresearch.com/net/04-unicode-limitations.shtml
(Why Unicode Won't Work on the Internet: Linguistic, Political, and Technical 
Limitations). Note that some of the controversy about Han unification seems
to be caused by misunderstandings.
[3] http://tclab.kaist.ac.kr/~otfried/Mule/unihan.html (Han Unification and
Unicode)
[4] http://encyclopedia.laborlawtalk.com/Han_unification
[5] http://www.unicode.org/book/appA.pdf
[6] http://www.w3.org/TR/2004/REC-xml11-20040204/#charencoding




>(...)
>
>-- 
>wendy a chisholm
>world wide web consortium
>web accessibility initiative
>http://www.w3.org/WAI/
>/--
>
>
>

Received on Friday, 22 April 2005 00:14:00 UTC