Re: Concrete syntax, character sets from Gavin Nicol on 1996-09-10 (w3c-sgml-wg@w3.org from September 1996)

From: Gavin Nicol <gtn@ebt.com>
Date: Tue, 10 Sep 1996 20:17:03 GMT
To: U35395@UICVM.CC.UIC.EDU
CC: w3c-sgml-wg@w3.org
Message-Id: <199609102017.UAA04386@wiley.EBT.COM>
>  - How easy is it to find libraries to deal with ISO 10646 in general,
>or Unicode in general, or UTF-8 in particular?

There are a few. I remember a group in France (I just moved, so I've
lost my URL list for the moment) produced such a library
recently. Also, FreeBSD has a fairly good runes package. MSVC and
most modern OSes also come with wide versions of the whole str* family.

>  - Do these libraries coexist well with current versions of yacc, lex,
>bison, and flex?

Yes, though not as easily as one would wish; it is still easy. I'm
also surprised that everyone is so hooked on YACC. There are better
tools available, and for the trivial grammars we are talking about,
hand-written parsers would take very little time to write.
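
To give a feel for the scale involved, here is a minimal sketch of a
hand-written scanner in Python. The grammar is a toy one invented for
illustration (start tags, end tags and text, no attributes), not the
syntax under discussion, and the name scan is hypothetical.

```python
def scan(text):
    """Split text into ('start', name), ('end', name) and ('text', s)
    tokens for a toy tag syntax with no attributes."""
    tokens = []
    pos = 0
    while pos < len(text):
        if text[pos] == "<":
            # Tag: everything up to the next '>' is the tag body.
            close = text.find(">", pos)
            if close == -1:
                raise ValueError("unterminated tag at offset %d" % pos)
            body = text[pos + 1:close]
            if body.startswith("/"):
                tokens.append(("end", body[1:]))
            else:
                tokens.append(("start", body))
            pos = close + 1
        else:
            # Text: runs until the next '<' or the end of input.
            nxt = text.find("<", pos)
            if nxt == -1:
                nxt = len(text)
            tokens.append(("text", text[pos:nxt]))
            pos = nxt
    return tokens


print(scan("<p>text</p>"))
```

Because Python strings are indexed by character rather than by byte,
the same loop works unchanged on wide data.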

>  - Are there relatively simple ways of either converting from the
>system character sets of prominent platforms into Unicode / UTF-8, or
>ways of persuading standard tools to emit Unicode/utf-8 data?

Yes. The tables are available at www.unicode.org, and writing a tool
to take the tables and turn them into a conversion engine is fairly
trivial.
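
A minimal sketch of such a table-driven converter, in Python. It
assumes a table layout of two hexadecimal columns (legacy byte value,
Unicode code point) followed by a "#" comment. SAMPLE_TABLE is a
made-up three-entry table, not a real code page, and the names
load_table and to_utf8 are hypothetical.

```python
# Hypothetical sample in the assumed layout:
# legacy byte value, Unicode code point, then a comment.
SAMPLE_TABLE = """\
# legacy byte -> Unicode (illustrative only)
0x41\t0x0041\t# LATIN CAPITAL LETTER A
0xC1\t0x0430\t# CYRILLIC SMALL LETTER A
0xE9\t0x00E9\t# LATIN SMALL LETTER E WITH ACUTE
0x81\t\t# UNDEFINED
"""


def load_table(text):
    """Build a {byte value: code point} dict from mapping-table text."""
    table = {}
    for line in text.splitlines():
        fields = line.split("#", 1)[0].split()  # drop the comment
        if len(fields) < 2:
            continue  # comment-only line or unmapped byte
        table[int(fields[0], 16)] = int(fields[1], 16)
    return table


def to_utf8(data, table, replacement="\ufffd"):
    """Convert legacy single-byte data to UTF-8 via the table.
    Bytes missing from the table become the replacement character."""
    chars = [chr(table[b]) if b in table else replacement for b in data]
    return "".join(chars).encode("utf-8")


table = load_table(SAMPLE_TABLE)
print(to_utf8(b"A\xc1\xe9", table))
```

Multi-byte legacy encodings would need a longer key than a single
byte, but the shape of the tool stays the same.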

>  - there are good libraries, freely available, to handle wide
>characters -- at least utf-8 encoding of Unicode ...

Most OSes/compilers have them...

>  - they work with yacc and lex (or, probably more important, flex and
>bison) and reasonably widely available C compilers (notably gcc)

This requires a little work, but not much. GCC can support wide
characters, though the input side of GCC (last time I looked) still
required more work to be truly I18N. The libraries from the GNU folk
do support wide characters.

>  - we can include a clear set of dos and donts for programmers to
>follow, so that those used to thinking of characters as seven-bit
>numbers can have a prayer of writing code that actually works with wide
>character sets.

If clients and servers do content negotiation correctly, a 7-bit parser
will never see wide data (except via numeric character references
etc., which can be handled in any way they wish).

>- we can point people to sources of information and instruction.

The I18N pages at W3C are a reasonable starting point.

>  - we can specify a reasonably straightforward way to work with
>XML on systems that don't have system support for Unicode.  Current
>Java implementations may be worth emulating here; they seem to work
>very well with non-Unicode data despite the unbending fundamental
>principle that Java data and program source are all, always, Unicode,
>period.

Well, Java doesn't yet handle Japanese properly... nor do its
localised input streams work properly...

>It isn't enough for internationalization to be *possible*; we need to
>say, crisply and clearly and *briefly*, what the requirements are and
>how to meet them.

Easy to do. If we stick to a single document character set, then all we
need to do is require (as the HTML I18N draft does) that all numeric
character references be resolved in the document character set, which
in practice *may* mean that an XML application just tosses data
outside the acceptable range of its internal representation.
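
A minimal sketch of that rule in Python, under stated assumptions: the
document character set is taken to be ISO 10646, only decimal
references of the form &#N; are handled, and the application's
internal representation is assumed to stop at U+00FF. MAX_INTERNAL
and resolve_ncrs are hypothetical names.

```python
import re

# Hypothetical limit: an application whose internal representation
# only holds code points up to U+00FF.
MAX_INTERNAL = 0xFF

# Decimal numeric character references such as &#233;
_NCR = re.compile(r"&#([0-9]+);")


def resolve_ncrs(text, max_code=MAX_INTERNAL):
    """Replace each &#N; with character number N of the document
    character set, dropping references the application cannot hold."""
    def replace(match):
        code = int(match.group(1))
        if code > max_code:
            return ""  # toss data outside the acceptable range
        return chr(code)
    return _NCR.sub(replace, text)


print(resolve_ncrs("caf&#233; &#26085;&#26412;"))
print(resolve_ncrs("caf&#233; &#26085;&#26412;", max_code=0x10FFFF))
```

The reference always means the same character regardless of how the
document was encoded in transit; only the application's own range
decides whether the character is kept or dropped.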

>markup -- well, SGML has *always* made it possible to use non-Latin-1
>characters, and HTML has not; at the same time, HTML has been relatively
>easy to implement, and SGML has not.  Which is supported by more
>software?  Which is used by more people?

This is not due to I18N problems, though they certainly play a role. I
would argue that other "features" are more to blame.

>As noted:  I'm in favor of i18n.  The best way to advance that cause,
>though, is to provide a simple spec that shows implementors how to
>support i18n.  Complexity of treatment, or even worse complexity of
>implementation, will not help the cause.  If we want XML to support
>i18n, we have to find ways to help implementors find their way through
>the attendant problems.

Once the core syntax is decided, I would be more than happy to define
the I18N features required. Actually, we could probably get Rick to
join in that effort as well, to our benefit.
Received on Tuesday, 10 September 1996 16:18:13 UTC