- From: Rick Jelliffe <ricko@allette.com.au>
- Date: Mon, 2 Jan 1995 19:59:43 +1100 (EST)
- To: Gavin Nicol <gtn@ebt.com>, http-wg%cuckoo.hpl.hp.com@hplb.hpl.hp.com
- Cc: david_goldsmith@taligent.com, html-wg@oclc.org, alb@ebt-inc.ebt.com, www@unicode.org
David Goldsmith writes:

> Again, thanks to Gavin for putting the work into writing up this proposal.
> Does anyone else have comments?
...
> Section 3.2.2
>
> I'm of two minds about supporting other non-ASCII character sets. On the
> one hand, lots of people are doing this already, and you can't very well
> expect them to stop. On the other hand, formalizing this will impede the
> goal of getting to internationalization of the WWW, where any client can
> talk to any server. You would get into the situation where clients and
> servers spoke different character sets, just like now (although you'd know
> it up front, rather than finding out when garbage showed up on your
> screen).

Unless non-Latin-based languages are treated as equal to Latin-based
languages, WWW should be renamed. I think the Accept-charset would address
this well: also, it suggests an architecture/protocol, and so is inherently
enabling rather than fettering.

> I suppose that if the number of supported character sets was limited
> (unlikely), clients could support them all, translating through Unicode.
> This doesn't seem like it's a probably outcome.

I think Gavin's solution comes down to this: every WWW server should be able
to negotiate and supply:

* some ISO 646 based lowest-common-denominator encoding, e.g. ISO 8859-1;
* some ISO 10646 encoding for rest-of-world documents; and
* the local national character set (in English-speaking countries, this is
  not needed).

The first gives good support to English readers, and to majority-white
countries and their ex-colonies. (I mention "white" not to impute racism or
at least parochialism, but to emphasise that for WWW to be able to reach a
browsership in non-Western countries other than the technical and educated
classes, the small Latin-based character sets are of no use.) The second
allows support of international access of documents. The third allows
national support of local languages.
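As a rough sketch of the three-tier idea above, a server might pick a character set along the following lines. Everything here is an assumption for illustration: the function name `choose_charset`, the particular charset labels (UTF-8 standing in for "some ISO 10646 encoding", EUC-JP standing in for a local national set), and the simple comma-separated header format with no preference weights. It is not taken from Gavin's proposal.

```python
# Hypothetical three-tier charset selection; all names are illustrative only.
# Order: lowest common denominator, an ISO 10646 encoding, a local national set.
SERVER_CHARSETS = ["iso-8859-1", "utf-8", "euc-jp"]


def choose_charset(accept_charset, available=SERVER_CHARSETS):
    """Return the first client-listed charset the server can supply.

    Falls back to the lowest-common-denominator charset (available[0])
    when nothing the client listed is available.
    """
    # Split the comma-separated header value and normalise case/whitespace.
    requested = [tok.strip().lower() for tok in accept_charset.split(",") if tok.strip()]
    for name in requested:
        if name in available:
            return name
    return available[0]
```

A client listing its national set first would receive documents with no mapping at all, while a client listing only a set the server lacks would fall back to the lowest-common-denominator encoding.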

> This feature is probably necessary to support existing practice, but I
> worry it will lead to "Balkanization" of the WWW, namely clients and
> servers that don't speak to each other.

Local/regional character sets and encodings won't go away in the foreseeable
future. WWW should have a "lingua franca" character set, and that character
set shouldn't be a national or regional character set: ISO 10646/Unicode is
the only choice.

Such a three-character-set approach (LCD, international, national/regional)
is also fairly efficient: it means that clients and servers can negotiate the
character set that needs the least amount of character mapping.

Gavin goes further and specifies particular encodings of ISO 10646 that every
server should support: I think that naturally follows.

> Of course, even with Unicode, if
> your server is serving up Thai, and I don't have any Thai fonts, I'm out of
> luck. It would be nice, though, if that didn't happen just because you and
> I use different character sets.

Perhaps a smart server could serve the fonts too: the client could send
"Accept-charset" for the character sets and encodings it will handle, and
"Display-charset" for the character sets it can display. The smart server
would figure out what is needed.

Alternatively, it could be done on an exception basis: if a client can't
display a character, it sends a request to a font-server somewhere (back to
the WWW server?). This may be slow, but it is handling exceptions, not normal
operation.

As a side comment, I'd mention that the Extended Reference Concrete Syntax
(ERCS) proposals I am involved in developing aim to allow SGML tagging using
native-language tag names (GIs), if the name characters also appear in
Unicode. This is in part to allow structure-based searching of documents by
non-English speakers using meaningful (to them) tag names.
At the moment, it is a reasonable assumption that SGML markup will only use
ISO 646 characters: the ERCS would hopefully change that, as far as SGML
documents go. If some version of HTML allows arbitrary tag names, I hope that
there are no ISO 646 dependencies.

-ricko

Rick Jelliffe            email: ricko@allette.com.au
Allette Systems          phone: +61 2 262 4777
Sydney, Australia        fax:   +61 2 262 4774

Received on Tuesday, 17 January 1995 01:02:43 UTC