
Once more into the breach (I18N)

From: David G. Durand <dgd@cs.bu.edu>
Date: Fri, 13 Jun 1997 10:59:17 -0500
Message-Id: <v03007803afc71af2b9d6@[205.181.197.93]>
To: w3c-sgml-wg@w3.org
I see that the character set issue is raising its hoary, moss-festooned
head again. While I'm not a character set expert, and I don't even play one
on TV, I do have a suggestion. It's old, but then the topic is too.

We should stick with the decision that the document coded character set
(i.e. the character repertoire, numerical character codes, and preferred
binary transmission formats) should stay Unicode.

We should allow wider character repertoires to be dealt with by processors,
as no finite character set ever seems to be enough. In fact we should allow
as many characters as a user can stomach declaring, and ones that are as
weird as they want.

In fact, we should bring back SDATA, _strictly defined_ as "name for a
character not represented in Unicode, and thus not possible to encode
directly as a literal character of XML syntax."

Since just putting strings in SDATA leaves a bit too much freedom, we will
impose the following constraints.

Unregistered characters (like unregistered FPIs, _not_ guaranteed unique)
are represented by SDATA values containing any string of Unicode characters
not starting with a left bracket "[" character.

ISO 10646 characters are represented by SDATA values containing the
official name of the character enclosed in a "[[" "]]" pair of delimiters.

SDATA strings delimited by "[" and "]" that are not ISO 10646 character
codes (i.e. single brackets instead of a pair) are _reserved syntax_. It is
an error to have such an SDATA string in an XML 1.0 document, but they may
later be used for encoding registered glyphs when appropriate standards and
software exist.
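
As a rough sketch of how a processor might tell the three cases apart
(not a definitive implementation): the function name classify_sdata and
the three labels are hypothetical, the code does not check that a
bracketed name really is an official ISO 10646 name, and treating every
other string that starts with "[" as reserved is my own assumption about
cases the rules above leave open.

```python
def classify_sdata(value: str) -> str:
    """Classify an SDATA value under the three proposed cases.

    The labels "unregistered", "iso10646" and "reserved" are
    hypothetical names chosen for this sketch.
    """
    # Case 1: any string not starting with "[" names an unregistered
    # character (not guaranteed unique, like an unregistered FPI).
    if not value.startswith("["):
        return "unregistered"
    # Case 2: a non-empty name wrapped in "[[" ... "]]" is taken to be
    # an official ISO 10646 character name (the name is not validated).
    if value.startswith("[[") and value.endswith("]]") and len(value) > 4:
        return "iso10646"
    # Case 3 (assumption): everything else starting with "[" is reserved
    # syntax, which would be an error in an XML 1.0 document.
    return "reserved"
```

A processor could reject the document whenever this returns "reserved"
and pass the other two kinds through to the application.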

This means that Private-Use characters are allowed, but should probably be
discouraged, just as other compatibility characters are discouraged in
Unicode applications.

This should be easy to implement: Applications that don't care to handle
non-Unicode characters can simply treat SDATA entities as strings. It would
probably be nicer to tack on additional text to produce something like
"[Undefined character: Humlan Vowel Squiggle right]"
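
A minimal sketch of that fallback, assuming the application already has
the SDATA value as a string; the function name render_sdata_fallback and
the annotate flag are hypothetical, and the wording of the marker text
is simply the example given above.

```python
def render_sdata_fallback(value: str, annotate: bool = True) -> str:
    """Return display text for an SDATA value the application cannot render.

    With annotate=False the value is passed through as a plain string;
    with annotate=True it is wrapped in an explanatory marker.
    """
    if not annotate:
        # Simplest behaviour: treat the SDATA entity as an ordinary string.
        return value
    # Nicer behaviour: make it obvious that a character is missing.
    return f"[Undefined character: {value}]"
```

For example, render_sdata_fallback("Humlan Vowel Squiggle right") gives
"[Undefined character: Humlan Vowel Squiggle right]".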

I earlier promised not to spontaneously attempt to re-animate the SDATA
question, but it is now so apropos of the ongoing discussion that I do so
without guilt.

   -- David

_________________________________________
David Durand              dgd@cs.bu.edu  \  david@dynamicDiagrams.com
Boston University Computer Science        \  Sr. Analyst
http://www.cs.bu.edu/students/grads/dgd/   \  Dynamic Diagrams
--------------------------------------------\  http://dynamicDiagrams.com/
MAPA: mapping for the WWW                    \__________________________
Received on Friday, 13 June 1997 11:02:56 EDT
