- From: Michael Sperberg-McQueen <U35395@UICVM.CC.UIC.EDU>
- Date: Tue, 10 Sep 96 12:38:48 CDT
- To: W3C SGML Working Group <w3c-sgml-wg@w3.org>
Help! We need answers to some simple factual questions. I don't know the answers. I sure hope someone else does. For example:

- How easy is it to find libraries to deal with ISO 10646 in general, or Unicode in general, or UTF-8 in particular?
- Do these libraries coexist well with current versions of yacc, lex, bison, and flex?
- Are there relatively simple ways either of converting from the system character sets of prominent platforms into Unicode / UTF-8, or of persuading standard tools to emit Unicode / UTF-8 data?

I would like, on principle, to commit ourselves to proper support for i18n -- but I would equally like to keep to our goal of twenty pages of documentation. I think we can have both if:

- there are good libraries, freely available, to handle wide characters -- at least UTF-8 encoding of Unicode ...
- they work with yacc and lex (or, probably more important, flex and bison) and reasonably widely available C compilers (notably gcc)
- we can include a clear set of dos and don'ts for programmers to follow, so that those used to thinking of characters as seven-bit numbers can have a prayer of writing code that actually works with wide character sets.
- we can point people to sources of information and instruction.
- we can specify a reasonably straightforward way to work with XML on systems that don't have system support for Unicode. Current Java implementations may be worth emulating here; they seem to work very well with non-Unicode data despite the unbending fundamental principle that Java data and program source are all, always, Unicode, period.

It isn't enough for internationalization to be *possible*; we need to say, crisply and clearly and *briefly*, what the requirements are and how to meet them.
About the absolute necessity of non-Latin-1 characters and the relative importance of ease of implementation and support for culturally apt markup -- well, SGML has *always* made it possible to use non-Latin-1 characters, and HTML has not; at the same time, HTML has been relatively easy to implement, and SGML has not. Which is supported by more software? Which is used by more people?

As noted: I'm in favor of i18n. The best way to advance that cause, though, is to provide a simple spec that shows implementors how to support i18n. Complexity of treatment, or even worse complexity of implementation, will not help the cause. If we want XML to support i18n, we have to find ways to help implementors find their way through the attendant problems.

-C. M. Sperberg-McQueen
 ACH / ACL / ALLC Text Encoding Initiative
 University of Illinois at Chicago
 tei@uic.edu
Received on Tuesday, 10 September 1996 14:10:51 UTC