Re: A truly multilingual WWW from Rick Jelliffe on 1995-01-02 (ietf-http-wg@w3.org from January to March 1995)

From: Rick Jelliffe <ricko@allette.com.au>
Date: Mon, 2 Jan 1995 19:59:43 +1100 (EST)
To: Gavin Nicol <gtn@ebt.com>, http-wg%cuckoo.hpl.hp.com@hplb.hpl.hp.com
Cc: david_goldsmith@taligent.com, html-wg@oclc.org, alb@ebt-inc.ebt.com, www@unicode.org
Message-Id: <Pine.ULT.3.90.950102185525.5971A-100000@chuckd>
David Goldsmith writes:

> Again, thanks to Gavin for putting the work into writing up this proposal.
> Does anyone else have comments?
...
> Section 3.2.2 
> 
> I'm of two minds about supporting other non-ASCII character sets. On the
> one hand, lots of people are doing this already, and you can't very well
> expect them to stop. On the other hand, formalizing this will impede the
> goal of getting to internationalization of the WWW, where any client can
> talk to any server. You would get into the situation where clients and
> servers spoke different character sets, just like now (although you'd know
> it up front, rather than finding out when garbage showed up on your
> screen).

Unless non-Latin-based languages are treated as equal to Latin-based 
languages, WWW should be renamed. I think the Accept-charset would 
address this well: also, it suggests an architecture/protocol, and so is 
inherently enabling rather than fettering.

> I suppose that if the number of supported character sets was limited
> (unlikely), clients could support them all, translating through Unicode.
> This doesn't seem like it's a probably outcome.

I think Gavin's solution comes down to that every WWW server should be 
able to negotiate and supply:

* some ISO 646 based lowest-common-denominator encoding, e.g. ISO 8859-1;
* some ISO 10646 encoding for rest-of-world documents; and
* the local national character set (in English-speaking countries, this is 
not needed). 
 
The first allows good support of English readers, and majority-white
countries and their ex-colonies well. (I mention "white" not to impute
racism or at least parochialism, but to emphasise that for WWW to be able
reach a browsership in non-Western countries other than the technical and
educated classes, the small Latin-based character sets are of no use.) The
second allows support of international access of documents. The third
allows national support of local languages. 

> This feature is probably necessary to support existing practice, but I
> worry it will lead to "Balkanization" of the WWW, namely clients and
> servers that don't speak to each other. 

Local/regional character sets and encodings won't go away in the forseeable 
future. WWW should have a "lingua franca" character set, and that 
character set shouldn't be a national or regional character set: ISO 
10646/Unicode is the only choice.

Such a three character set approach (LCD, internationals,
national/regional) is also fairly efficient: it means that clients and
servers can negotiate the character set for the least amount of character
mapping. 

Gavin goes further and specifies particular encodings of ISO 10646 that 
every server should support: I think that naturally follows.

> Of course, even with Unicode, if
> your server is serving up Thai, and I don't have any Thai fonts, I'm out of
> luck. It would be nice, though, if that didn't happen just because you and
> I use different character sets.

Perhaps a smart server could serve the fonts too: it could have 
"Accept-charset" for the character sets and encodings it will handle, and 
"Display-charset" for the character sets it can display. The smart server 
would figure out what is needed. Alternatively, it could be done on an 
exception basis: if a client can't display a character, it sends a 
request somewhere to a font-server (back to the WWW server?): this may be 
slow, but this is handling exceptions not normal operation.


As a side comment, I'd mention that the Extended Reference Concrete Syntax
(ERCS) proposals I am involved in developing have as an aim to allow SGML
tagging using native-language tag names (GIs), if the name characters also
appear in Unicode.  This is in part to allow structure-based searching of
documents by non-English speakers using meaningful (to them) tag names. At
the moment, it is a reasonable assumption that SGML markup will only use
ISO 646 characters: the ERCS would hopefully change that, as far as SGML
documents go. If some version of HTML allows arbitrary tag names, I hope 
that there are no ISO 646 dependencies.

-ricko

Rick Jelliffe                          email: ricko@allette.com.au
Allette Systems                        phone: +61 2 262 4777
Sydney, Australia                      fax:   +61 2 262 4774
Received on Tuesday, 17 January 1995 01:02:43 UTC