RE: Thinking aloud...Definitions (pre-Guideline 1.1 summary)

At 19:08 22/04/2005, Richard Ishida wrote:

>> From: w3c-wai-gl-request@w3.org 
>> [mailto:w3c-wai-gl-request@w3.org] On Behalf Of Christophe Strobbe
>> Sent: 22 April 2005 01:13
>
>...
>> At 18:08 20/04/2005, Wendy Chisholm wrote:
>> >(...)
>> >
>> >Draft definitions (not quite proposals):
>> >
>> >   * text - A sequence of characters. Characters are those 
>> included in
>> >     the Unicode character set. Refer to Characters (in Extensible
>> >     Markup Language (XML) 1.1) for more information about 
>> the accepted
>> >     character range.
>> 
>> During today's conference call (or yesterday's, in my case), 
>> Wendy clarified that any character set may be used as long as 
>> all characters in the set can be mapped to Unicode characters.
>> Although Unicode wants to define "a universal character set 
>> that defines all the characters needed for writing the 
>> majority of living languages in use on computers" (quoted 
>> from Wendy's draft definition of Unicode), we should bear in 
>> mind that this work is not finished. I propose rephrasing the 
>> definition of text as:
>> 
>> "A sequence of characters. Characters are those included in 
>> the Unicode character set or any other character set or 
>> character encoding scheme registered with the Internet 
>> Assigned Numbers Authority."
>
>The Document Character Set of XML and HTML is defined to be bounded by the
>ISO 10646 / Unicode repertoire, so I disagree that you can say "or any other
>character set".  As an alternative, due to the confusion that sometimes
>surrounds character set vs character encoding terminology, I'd suggest
>something like "Characters are those included in the Unicode / ISO/IEC
>10646 repertoire."

1. I did not write "any other character set" but only those registered
with IANA. Is that the same thing?
2. Bounding the character repertoire of HTML and XML to that of Unicode /
ISO/IEC 10646 is only acceptable if all other character sets that are
commonly used on the Web are subsets of Unicode / ISO/IEC 10646. If that
is true (and the experts cited by Joe Clark confirm this), I rest
my case.
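One way to spot-check the subset claim for a few of the encodings mentioned in this thread is to round-trip a sample string through them. This is only a minimal sketch: the sample string and the helper name round_trip are made up for illustration, the codec names are the ones Python's standard library uses, and a successful round trip for one string obviously does not prove anything about the full repertoires.

```python
# Sample Japanese text; Python's str type holds Unicode code points.
sample = "日本語のテキスト"

# Codec names as spelled in Python's standard library (an assumption of
# this sketch; the IANA names are ISO-2022-JP, Shift_JIS and EUC-JP).
legacy_encodings = ["iso2022_jp", "shift_jis", "euc_jp"]


def round_trip(text, encoding):
    """Encode text to legacy bytes, decode it back, and return the result."""
    legacy_bytes = text.encode(encoding)
    return legacy_bytes.decode(encoding)


# Map each encoding to the string obtained after the round trip.
results = {enc: round_trip(sample, enc) for enc in legacy_encodings}

for enc, decoded in results.items():
    # Show whether the round trip was lossless and which Unicode
    # code points the legacy bytes mapped to.
    print(enc, decoded == sample, [f"U+{ord(ch):04X}" for ch in decoded])
```

If any character in the sample had no Unicode mapping, the decode step would fail or the comparison would be false.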


>...
>> Wendy's definition was based on the XML 1.1 spec (see [1], 
>> where the definition of character only takes into account 
>> Unicode and ISO/IEC 10646).
>> However, elsewhere, the XML spec also takes the existence of 
>> other "encodings" into account (see [6], which talks about 
>> external parsed entities). Some of the examples given there 
>> ("ISO-2022-JP", "Shift_JIS"
>> and "EUC-JP" ) are encodings of JIS character sets, not Unicode.
>
>The repertoires of these character sets are all included in Unicode. This is
>not an issue. 
>
>I'm not sure it is necessary or wise, though, to try to define the character
>range in this way. If you are using XML 1.0, the character range that is
>admissible is much smaller.

OK, referring to XML 1.0 is safer. I assumed that Wendy was more interested
in Unicode's character repertoire than in the differences between XML 1.0
and XML 1.1.
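For comparison, here is a rough sketch of the Char productions of the two XML versions, as I read them in the Recommendations. The function names are hypothetical, and the sketch covers only the Char production; it ignores the rules for names and the RestrictedChar handling in XML 1.1.

```python
def is_xml10_char(cp):
    """Char production of XML 1.0 (tab, LF, CR, then the three ranges)."""
    return (
        cp in (0x9, 0xA, 0xD)
        or 0x20 <= cp <= 0xD7FF
        or 0xE000 <= cp <= 0xFFFD
        or 0x10000 <= cp <= 0x10FFFF
    )


def is_xml11_char(cp):
    """Char production of XML 1.1 (everything from U+0001 except the
    surrogate block, U+FFFE and U+FFFF)."""
    return (
        0x1 <= cp <= 0xD7FF
        or 0xE000 <= cp <= 0xFFFD
        or 0x10000 <= cp <= 0x10FFFF
    )


# Code points matched by the XML 1.1 production but not by XML 1.0.
only_in_xml11 = [
    cp for cp in range(0x110000)
    if is_xml11_char(cp) and not is_xml10_char(cp)
]

print(len(only_in_xml11))                        # 28
print([f"U+{cp:04X}" for cp in only_in_xml11])   # C0 controls only
```

Under this reading, a definition of text that points at either version ends up with the same repertoire of non-control characters.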

>> 
>> With the proposed definition, we don't need to worry about 
>> characters from Chinese, Japanese or Korean that might not 
>> have been properly handled in Unicode's 'Han unification' 
>> according to some users of these languages.
>> See [2] for one such view, and [3] and [4] for more background. 
>> [5] is the Unicode Consortium's version of the "Han 
>> Unification History".
>
>This is way out of date, apart from anything else.  Unicode currently
>contains over 70,000 Han characters, and has plenty of room for more.  There
>is the potential for 16 additional planes of code point locations, which
>makes just over a million in total - not 94K as stated in this article.
>Please take a look at the latest standard at http://www.unicode.org/

I know that Unicode has moved beyond the Basic Multilingual Plane since some
of those articles were written, but that does not necessarily mean that
the objections raised in those articles have been addressed. Again, if
these objections are unjustified or out of date, I rest my case.
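For the record, the "just over a million" figure can be checked with a few lines of arithmetic. This is a small sketch with hypothetical constant names; it assumes the usual layout of one Basic Multilingual Plane plus 16 supplementary planes of 65,536 code points each.

```python
PLANE_SIZE = 0x10000        # 65,536 code points per plane
SUPPLEMENTARY_PLANES = 16   # the additional planes beyond the BMP

# Total code space: the BMP plus the 16 supplementary planes.
total_code_points = PLANE_SIZE * (1 + SUPPLEMENTARY_PLANES)

# The surrogate block U+D800..U+DFFF is reserved for UTF-16 and can
# never be assigned to characters.
surrogates = 0xDFFF - 0xD800 + 1

print(total_code_points)                # 1114112
print(total_code_points - surrogates)   # 1112064
```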

Regards,

Christophe Strobbe


>> 
>> Regards,
>> Christophe Strobbe
>> 
>> 
>> [1] http://www.w3.org/TR/2004/REC-xml11-20040204/#charsets
>> [2] http://www.hastingsresearch.com/net/04-unicode-limitations.shtml
>> (Why Unicode Won't Work on the Internet: Linguistic, 
>> Political, and Technical Limitations). Note that some of the 
>> controversy about Han unification seems to be caused by 
>> misunderstandings.
>> [3] http://tclab.kaist.ac.kr/~otfried/Mule/unihan.html (Han 
>> Unification and
>> Unicode)
>> [4] http://encyclopedia.laborlawtalk.com/Han_unification
>> [5] http://www.unicode.org/book/appA.pdf
>> [6] http://www.w3.org/TR/2004/REC-xml11-20040204/#charencoding

-- 
Christophe Strobbe
K.U.Leuven - Departement of Electrical Engineering - Research Group on Document Architectures
Kasteelpark Arenberg 10 - 3001 Leuven-Heverlee - BELGIUM
tel: +32 16 32 85 51 
tel mobile: +32 473 97 70 25
fax: +32 16 32 85 39 
http://www.docarch.be/ 

Received on Saturday, 23 April 2005 15:10:17 UTC