Re: accented characters, etc. from Murray Altheim on 1999-12-02 (www-html@w3.org from December 1999)

From: Murray Altheim <altheim@eng.sun.com>
Date: Thu, 02 Dec 1999 13:18:11 -0800
To: Sean Healy <jalopeura@hotmail.com>
CC: www-html@w3.org
Message-ID: <3846E213.FE30ECE7@eng.sun.com>
Sean Healy wrote:
> 
> I'm new to the list, and I didn't see anything like this in the archives for
> the last few months, so here goes:
> 
> The current list of accented letters available in HTML isn't nearly enough.
> Is it possible to put an overstrike tag in the next version that will allow
> authors to specify two (or more) characters to place overtop each other.
> There is something similar with the strikeout tag that places a line through
> letters.  Perhaps something like <OS>~n</OS> could replace &ntilde; (for
> those of you with HTML-enabled readers, &lt;OS&gt;~n&lt;/OS&gt; and
> &amp;ntilde;).  This would be a big step toward true internationalization.

When SGML became an ISO standard it included a rather large set of 
character entities, a set often supported in various ways by tools.
This was the original direction for "internationalization", but is 
in reality not internationalization at all, but localization.

Back in 1997 I posted a list of the ISO character entities (plus some
others) that is probably in dire need of update, but you can get an 
idea of what's available:

   "ISO Character Entity Sets"
   http://www.altheim.com/specs/charents.html

We could certainly include all of the ISO sets in XHTML (as is done
in DocBook and many other SGML languages) but to what end?

> If there's some technical reason why this is impossible with SGML, could
> someone explain it for me in layman's terms?

It's not impossible at all, it's a matter of interoperability. This isn't
so much a case of whether or not the named character entity exists, rather
it's whether (and how) it would be supported in commonly-available tools 
(like browsers and editors). We in the W3C HTML WG have resisted adding 
any new character entities for two simple reasons:

  1. the current set of HTML character entities (a small subset culled
     from the larger ISO sets*) isn't yet fully supported by all
     browsers. Pragmatically, adding new characters knowing they won't
     be supported without an overhaul of how unknown font glyphs can be
     reliably displayed would be pointless, setting an expectation that
     would likely be unmet.

  2. XML uses Unicode as its base character model. The direction in
     the industry has been away from attempting to come up with named
     entity sets for all languages (which is not only an exercise in
     frustration and favoritism, but a practical impossibility) and
     toward using native Unicode encodings.

So that, for example, rather than relying on some named character entity
for the Arabic 'dotless noon with small tah', a text editor would simply
support some type of input (perhaps using an Arabic keyboard or an
on-screen input device), and encode the actual character number (0x06BB) 
into the file much as we type an 'a' (0x0061). Currently, if you lack
an Arabic keyboard you can input '&#x06BB;' into your XML file for the
same result.
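
To make the mapping between a character and its numeric reference
concrete, here is a minimal sketch in Python. The helper name
char_ref is only an illustration, not part of any standard or library;
html.unescape is from the Python standard library and is used here
merely to show that the reference resolves back to the same character.

```python
import html

def char_ref(ch: str) -> str:
    """Return the hexadecimal numeric character reference for one character.

    Hypothetical helper for illustration only.
    """
    return f"&#x{ord(ch):04X};"

rnoon = "\u06BB"                 # the Arabic character discussed above
ref_rnoon = char_ref(rnoon)      # "&#x06BB;"
ref_a = char_ref("a")            # "&#x0061;"

# Resolving the reference yields the original character again.
decoded = html.unescape(ref_rnoon)
```

Whether the decoded character can actually be displayed still depends
on the fonts available to the application.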

The problem is, again, font support. "True internationalization" will
come not from increasing the number of named character entities but
from internationalized operating systems (that can handle Unicode
character encodings) and a means of obtaining generalized font support
for character glyphs not installed on one's OS. 

The W3C draft for Scalable Vector Graphics (SVG) includes a feature
that would allow for definition of a font, and perhaps this may be
generalizable to support such a need on the Web. I can imagine SVG
documents that exist solely as "modules" that define a font, and are
included in documents in order to provide that font support. But 
perhaps not. I don't see any notion of such document modularity in 
the SVG draft, but it's certainly not precluded. There are a number
of large vendors very interested in SVG, so we should remain hopeful. 

We still have a ways to go before a document can be posted on the Web that 
includes an arbitrary mix of Unicode characters (say, from ten languages)
that can be reliably, interoperably displayed on all browsers. But I
expect that within the next five years we'll see widespread support
for multi-language documents on the Web, thanks to XML. *True*
internationalization.

Murray

* you can tell the Unicode character number and which ISO file a particular
  entity comes from in its comment, e.g., for '&copy;':

  <!ENTITY copy "&#169;"><!-- copyright sign, U+00A9 ISOnum -->

    'U+00A9' is the hexadecimal equivalent for decimal 169.
    'ISOnum' indicates this entity comes from the 
      'ISO 8879-1986//ENTITIES Numeric and Special Graphic//EN' set.
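
  The decimal/hexadecimal relationship can be checked with a few lines
  of Python. This is only a rough sketch: the regular expression is an
  assumption that fits this one declaration and is not a general SGML
  or DTD parser.

```python
import re

# The entity declaration quoted above.
decl = '<!ENTITY copy "&#169;"><!-- copyright sign, U+00A9 ISOnum -->'

# Pull out the entity name and its decimal character number.
m = re.match(r'<!ENTITY\s+(\w+)\s+"&#(\d+);">', decl)
name = m.group(1)
decimal = int(m.group(2))

# Format the same number in the U+XXXX hexadecimal notation.
code_point = f"U+{decimal:04X}"
character = chr(decimal)
```
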
...........................................................................
Murray Altheim, SGML Grease Monkey         <mailto:altheim@eng.sun.com>
Member of Technical Staff, Tools Development & Support
Sun Microsystems, 901 San Antonio Rd., UMPK17-102, Palo Alto, CA 94303-4900

   the honey bee is sad and cross and wicked as a weasel
   and when she perches on you boss she leaves a little measle -- archy
Received on Thursday, 2 December 1999 16:17:35 UTC