- From: Murray Altheim <altheim@eng.sun.com>
- Date: Thu, 02 Dec 1999 13:18:11 -0800
- To: Sean Healy <jalopeura@hotmail.com>
- CC: www-html@w3.org
Sean Healy wrote:
>
> I'm new to the list, and I didn't see anything like this in the archives for
> the last few months, so here goes:
>
> The current list of accented letters available in HTML isn't nearly enough.
> Is it possible to put an overstrike tag in the next version that will allow
> authors to specify two (or more) characters to place overtop each other.
> There is something similar with the strikeout tag that places a line through
> letters. Perhaps something like <OS>~n</OS> could replace ñ (for
> those of you with HTML-enabled readers, <OS>~n</OS> and
> &ntilde;). This would be a big step toward true internationalization.

When SGML became an ISO standard it included a rather large set of character
entities, a set often supported in various ways by tools. This was the
original direction for "internationalization", but in reality it is not
internationalization at all, but localization.

Back in 1997 I posted a list of the ISO character entities (plus some others)
that is probably in dire need of update, but you can get an idea of what's
available:

   "ISO Character Entity Sets"
   http://www.altheim.com/specs/charents.html

We could certainly include all of the ISO sets in XHTML (as is done in
DocBook and many other SGML languages) but to what end?

> If there's some technical reason why this is impossible with SGML, could
> someone explain it for me in layman's terms?

It's not impossible at all, it's a matter of interoperability. This isn't so
much a case of whether or not the named character entity exists, rather it's
whether (and how) it would be supported in commonly-available tools (like
browsers and editors).

We in the W3C HTML WG have resisted adding any new character entities for
two simple reasons:

1. the current set of HTML character entities (a small subset culled from
   the larger ISO sets*) isn't even yet fully supported by all browsers.
   Pragmatically, adding new characters knowing they won't be supported
   without an overhaul of how unknown font glyphs can be reliably displayed
   would be pointless, setting an expectation that would likely be unmet.

2. XML uses Unicode as its base character model. The direction in the
   industry has been away from attempting to come up with named entity
   sets for all languages (which is not only an exercise in frustration
   and favoritism, but a practical impossibility) and toward using native
   Unicode encodings. So, for example, rather than relying on some named
   character entity for the Arabic 'dotless noon with small tah', a text
   editor would simply support some type of input (perhaps using an Arabic
   keyboard or an on-screen input device), and encode the actual character
   number (0x06BB) into the file much as we type an 'a' (0x0061).
   Currently, if you lack an Arabic keyboard you can input a numeric
   character reference such as '&#x06BB;' into your XML file for the same
   result. The problem is, again, font support.

"True internationalization" will happen not by increasing the number of
named character entities but with internationalized operating systems (that
can handle Unicode character encodings) and a means of obtaining generalized
font support for character glyphs not installed on one's OS.

The W3C draft for Scalable Vector Graphics (SVG) includes a feature that
would allow for definition of a font, and perhaps this may be generalizable
to support such a need on the Web. I can imagine SVG documents that exist
solely as "modules" that define a font, and are included in documents in
order to provide that font support. But perhaps not. I don't see any notion
of such document modularity in the SVG draft, but it's certainly not
precluded. There are a number of large vendors very interested in SVG, so
we should remain hopeful.
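A minimal sketch of the character-number idea from point 2, assuming Python 3 and its standard html module. The helper name to_ncr is hypothetical, and html.unescape only stands in for what a browser or XML parser does when it resolves a reference; it is not an XML parser itself:

```python
import html


def to_ncr(ch: str) -> str:
    """Return a hexadecimal numeric character reference for one character."""
    return f"&#x{ord(ch):04X};"


# The Arabic letter from point 2, identified by its Unicode character number.
arabic = "\u06BB"

print(to_ncr(arabic))  # &#x06BB;
print(to_ncr("a"))     # &#x0061;

# Resolving the reference gives back the same character, so typing the
# reference and typing the character directly are equivalent in the file.
print(html.unescape("&#x06BB;") == arabic)    # True

# A named entity is only an alias for a character number as well.
print(html.unescape("&ntilde;") == "\u00F1")  # True
```

Whether the resulting character can actually be displayed still depends on font support, which none of this addresses.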
We still have a ways to go before a document can be posted on the Web that
includes an arbitrary mix of Unicode characters (say, from ten languages)
that can be reliably, interoperably displayed on all browsers. But I expect
that within the next five years we'll see widespread support for
multi-language documents on the Web, thanks to XML. *True*
internationalization.

Murray

* you can tell the Unicode character number and which ISO file a particular
  entity comes from in its comment, e.g., for '©':

     <!ENTITY copy "&#169;"><!-- copyright sign, U+00A9 ISOnum -->

  'U+00A9' is the hexadecimal equivalent for decimal 169. 'ISOnum' indicates
  this entity comes from the 'ISO 8879-1986//ENTITIES Numeric and Special
  Graphic//EN' set.

...........................................................................
Murray Altheim, SGML Grease Monkey <mailto:altheim@eng.sun.com>
Member of Technical Staff, Tools Development & Support
Sun Microsystems, 901 San Antonio Rd., UMPK17-102, Palo Alto, CA 94303-4900

   the honey bee is sad and cross and wicked as a weasel
   and when she perches on you boss she leaves a little measle -- archy
Received on Thursday, 2 December 1999 16:17:35 UTC