Re: Concrete syntax, character sets from Michael Sperberg-McQueen on 1996-09-10 (w3c-sgml-wg@w3.org from September 1996)

From: Michael Sperberg-McQueen <U35395@UICVM.CC.UIC.EDU>
Date: Tue, 10 Sep 96 12:38:48 CDT
To: W3C SGML Working Group <w3c-sgml-wg@w3.org>
Message-Id: <199609101810.OAA17611@www10.w3.org>

Help!  We need answers to some simple factual questions.  I don't know
the answers.  I sure hope someone else does.  For example:

  - How easy is it to find libraries to deal with ISO 10646 in general,
or Unicode in general, or UTF-8 in particular?
  - Do these libraries coexist well with current versions of yacc, lex,
bison, and flex?
  - Are there relatively simple ways either of converting from the
system character sets of prominent platforms into Unicode / UTF-8, or
of persuading standard tools to emit Unicode / UTF-8 data?

I would like, on principle, to commit ourselves to proper support for
i18n -- but I would equally like to keep to our goal of twenty pages of
documentation.  I think we can have both if:

  - there are good libraries, freely available, to handle wide
characters -- at least the UTF-8 encoding of Unicode ...
  - they work with yacc and lex (or, probably more important, flex and
bison) and reasonably widely available C compilers (notably gcc)
  - we can include a clear set of dos and don'ts for programmers to
follow, so that those used to thinking of characters as seven-bit
numbers can have a prayer of writing code that actually works with wide
character sets.
  - we can point people to sources of information and instruction.
  - we can specify a reasonably straightforward way to work with
XML on systems that don't have system support for Unicode.  Current
Java implementations may be worth emulating here; they seem to work
very well with non-Unicode data despite the unbending fundamental
principle that Java data and program source are all, always, Unicode,
period.

It isn't enough for internationalization to be *possible*; we need to
say, crisply and clearly and *briefly*, what the requirements are and
how to meet them.

About the absolute necessity of non-Latin-1 characters and the relative
importance of ease of implementation and support for culturally apt
markup -- well, SGML has *always* made it possible to use non-Latin-1
characters, and HTML has not; at the same time, HTML has been relatively
easy to implement, and SGML has not.  Which is supported by more
software?  Which is used by more people?

As noted:  I'm in favor of i18n.  The best way to advance that cause,
though, is to provide a simple spec that shows implementors how to
support i18n.  Complexity of treatment, or, even worse, complexity of
implementation, will not help the cause.  If we want XML to support
i18n, we have to find ways to help implementors find their way through
the attendant problems.

-C. M. Sperberg-McQueen
 ACH / ACL / ALLC Text Encoding Initiative
 University of Illinois at Chicago
 tei@uic.edu

Received on Tuesday, 10 September 1996 14:10:51 UTC