
RE: Thinking aloud...Definitions (pre-Guideline 1.1 summary)

From: Richard Ishida <ishida@w3.org>
Date: Fri, 22 Apr 2005 18:08:32 +0100
To: "'Christophe Strobbe'" <christophe.strobbe@esat.kuleuven.ac.be>, <wendy@w3.org>, "'wai-gl'" <w3c-wai-gl@w3.org>
Message-Id: <20050422170829.DDEF14F1D8@homer.w3.org>

> From: w3c-wai-gl-request@w3.org 
> [mailto:w3c-wai-gl-request@w3.org] On Behalf Of Christophe Strobbe
> Sent: 22 April 2005 01:13

> At 18:08 20/04/2005, Wendy Chisholm wrote:
> >(...)
> >
> >Draft definitions (not quite proposals):
> >
> >   * text - A sequence of characters. Characters are those included in
> >     the Unicode character set. Refer to Characters (in Extensible
> >     Markup Language (XML) 1.1) for more information about the accepted
> >     character range.
> During today's conference call (or yesterday's, in my case), 
> Wendy clarified that any character set may be used as long as 
> all characters in the set can be mapped to Unicode characters.
> Although Unicode wants to define "a universal character set 
> that defines all the characters needed for writing the 
> majority of living languages in use on computers" (quoted 
> from Wendy's draft definition of Unicode), we should bear in 
> mind that this work is not finished. I propose rephrasing the 
> definition of text as:
> "A sequence of characters. Characters are those included in 
> the Unicode character set or any other character set or 
> character encoding scheme registered with the Internet 
> Assigned Numbers Authority."

The Document Character Set of XML and HTML is defined to be bounded by the
ISO 10646 / Unicode repertoire, so I disagree that you can say "or any other
character set".  As an alternative, due to the confusion that sometimes
surrounds character set vs character encoding terminology, I'd suggest
something like "Characters are those included in the Unicode / ISO/IEC
10646 repertoire."

> Wendy's definition was based on the XML 1.1 spec (see [1], 
> where the definition of character only takes into account 
> Unicode and ISO/IEC 10646).
> However, elsewhere, the XML spec also takes the existence of 
> other "encodings" into account (see [6], which talks about 
> external parsed entities). Some of the examples given there 
> ("ISO-2022-JP", "Shift_JIS" and "EUC-JP") are encodings of JIS
> character sets, not Unicode.

The repertoires of these character sets are all included in Unicode. This is
not an issue. 
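
As a rough illustration of that point, here is a minimal sketch assuming
Python 3 and its standard codecs. The sample string and variable names are
illustrative choices, not taken from the thread. The same Japanese text,
encoded in each of the three encodings mentioned above, decodes back to an
identical sequence of Unicode code points:

```python
# Sample text chosen for illustration; any text in the JIS repertoire would do.
sample = "日本語"

# Python codec names for ISO-2022-JP, Shift_JIS and EUC-JP.
encodings = ["iso2022_jp", "shift_jis", "euc_jp"]

# Each encoding represents the text with a different byte sequence...
encoded = {name: sample.encode(name) for name in encodings}

# ...but decoding each byte sequence yields the same Unicode string.
decoded = {name: data.decode(name) for name, data in encoded.items()}

# The Unicode code points behind the sample text.
code_points = [f"U+{ord(ch):04X}" for ch in sample]

for name in encodings:
    print(name, encoded[name], decoded[name] == sample)
print(code_points)
```

The byte sequences differ per encoding, while the decoded characters do not.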

I'm not sure it is necessary or wise, though, to try to define the character
range in this way. If you are using XML 1.0, the character range that is
admissible is much smaller.
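
A minimal sketch of the difference, assuming Python 3. The function names
are hypothetical, and the ranges are transcribed from the Char productions
of XML 1.0 and XML 1.1. It covers the Char production only; the rules for
characters allowed in names differ separately between the two versions and
are not modelled here.

```python
def is_xml10_char(cp: int) -> bool:
    """True if code point cp matches the Char production of XML 1.0."""
    return (
        cp in (0x9, 0xA, 0xD)
        or 0x20 <= cp <= 0xD7FF
        or 0xE000 <= cp <= 0xFFFD
        or 0x10000 <= cp <= 0x10FFFF
    )


def is_xml11_char(cp: int) -> bool:
    """True if code point cp matches the Char production of XML 1.1."""
    return (
        0x1 <= cp <= 0xD7FF
        or 0xE000 <= cp <= 0xFFFD
        or 0x10000 <= cp <= 0x10FFFF
    )


# Code points that XML 1.1 accepts as Char but XML 1.0 rejects.
only_in_xml11 = [
    cp for cp in range(0x110000)
    if is_xml11_char(cp) and not is_xml10_char(cp)
]
print([hex(cp) for cp in only_in_xml11])
```

A definition of "text" that points at one version's character range would
therefore not hold unchanged for documents written against the other.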

> Similarly, the WCAG definition of "text" should reflect the 
> existence of other recognised character sets/encoding schemes.

Not really.  Again, the Document Character Set is by definition Unicode.
Schemes that encode these characters in different ways (i.e. assign them to
different numbers) have nothing to do with a definition of what constitutes
text here.

> With the proposed definition, we don't need to worry about 
> characters from Chinese, Japanese or Korean that might not 
> have been properly handled in Unicode's 'Han unification' 
> according to some users of these languages.
> See [2] for one such view, and [3] and [4] for more background. 
> [5] is the Unicode Consortium's version of the "Han 
> Unification History".

This is way out of date, apart from anything else.  Unicode currently
contains over 70,000 Han characters, and has plenty of room for more.  There
is the potential for 16 additional planes of code point locations, which
makes just over a million in total - not 94K as stated in this article.
Please take a look at the latest standard at http://www.unicode.org/
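
The "just over a million" figure can be checked with a little arithmetic.
This is a minimal sketch assuming Python 3; the constant names are
illustrative. The code space is the Basic Multilingual Plane plus the 16
additional planes, each holding 65,536 code point locations:

```python
import sys

PLANE_SIZE = 0x10000        # 65,536 code point locations per plane
ADDITIONAL_PLANES = 16      # planes beyond the Basic Multilingual Plane
TOTAL_PLANES = 1 + ADDITIONAL_PLANES

# Total code point locations across all planes.
total_code_points = PLANE_SIZE * TOTAL_PLANES

# Highest code point location, U+10FFFF.
max_code_point = total_code_points - 1

# Surrogate code points (U+D800..U+DFFF) are reserved for UTF-16
# and cannot be assigned to characters.
surrogates = 0xDFFF - 0xD800 + 1
assignable_upper_bound = total_code_points - surrogates

print(total_code_points, hex(max_code_point), assignable_upper_bound)

# Python's own upper limit for code points matches this value.
print(sys.maxunicode == max_code_point)
```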

> Regards,
> Christophe Strobbe
> [1] http://www.w3.org/TR/2004/REC-xml11-20040204/#charsets
> [2] http://www.hastingsresearch.com/net/04-unicode-limitations.shtml
> (Why Unicode Won't Work on the Internet: Linguistic, 
> Political, and Technical Limitations). Note that some of the 
> controversy about Han unification seems to be caused by 
> misunderstandings.
> [3] http://tclab.kaist.ac.kr/~otfried/Mule/unihan.html (Han Unification
> and Unicode)
> [4] http://encyclopedia.laborlawtalk.com/Han_unification
> [5] http://www.unicode.org/book/appA.pdf
> [6] http://www.w3.org/TR/2004/REC-xml11-20040204/#charencoding
> >(...)
> >
> >-- 
> >wendy a chisholm
> >world wide web consortium
> >web accessibility initiative
> >http://www.w3.org/WAI/
> >/--
> >
> >
> >
Received on Friday, 22 April 2005 17:08:32 UTC
