- From: David Goldsmith <david_goldsmith@taligent.com>
- Date: Mon, 16 Jan 1995 16:41:26 -0800
- To: Gavin Nicol <gtn@ebt.com>, http-wg%cuckoo.hpl.hp.com@hplb.hpl.hp.com
- Cc: html-wg@oclc.org, rick0@allette.com.au, alb@ebt-inc.ebt.com, www@unicode.org
Sorry to take so long to reply to Gavin Nicol's proposal; first there were the holidays, then things got busy at work. In general, I think the document does a good job of discussing the issues, and I agree (not surprisingly) with the idea of using Unicode for this purpose. I do have some specific comments on aspects of the document. Section 3.2.1 There is no character set "UNICODE-1-1-UCS-2"; the UCS-2 form is identified by "UNICODE-1-1" (cf. RFC 1641). There is also no "UNICODE-1-1-UTF-8", but I am in the process of getting it registered right now. I don't particularly consider UTF-7 a good candidate for WWW work, since HTTP is an 8-bit-clean protocol, but if someone had another context in mind where a 7-bit safe form was needed, I'd like to know about it. Similarly, later in the section it says that "[UTF-7] is perfect for WWW and MIME uses". The same comment applies. It says at the end of 3.2.1 that the costs of translating documents to Unicode might appear prohibitive, but that caches would solve the problem. I guess I don't understand why transcoding a typical HTML document using a table would be expensive. I would expect it to be much cheaper than the time needed to push the document out over the wire. Why would this be prohibitively expensive? Section 3.2.2 I'm of two minds about supporting other non-ASCII character sets. On the one hand, lots of people are doing this already, and you can't very well expect them to stop. On the other hand, formalizing this will impede the goal of getting to internationalization of the WWW, where any client can talk to any server. You would get into the situation where clients and servers spoke different character sets, just like now (although you'd know it up front, rather than finding out when garbage showed up on your screen). I suppose that if the number of supported character sets was limited (unlikely), clients could support them all, translating through Unicode. This doesn't seem like it's a probably outcome. This feature is probably necessary to support existing practice, but I worry it will lead to "Balkanization" of the WWW, namely clients and servers that don't speak to each other. Of course, even with Unicode, if your server is serving up Thai, and I don't have any Thai fonts, I'm out of luck. It would be nice, though, if that didn't happen just because you and I use different character sets. Section 3.2.4 This is probably the section I have the most problem with. Unicode specifically was designed with the idea that attributes such as language, fonts, etc. would be encoded out-of-band, via high level tags or even out of the character stream entirely (see section 2 of the Unicode Standard, volume 1). In particular, the private use area is specifically reserved for use by corporations and end users, by private agreement, and trying to assign a code in the private use area for general public use is contrary to the spirit and probably the letter of the standard. I know that there has been considerable resistance to other attempts to include formatting codes in the standard, so I don't think this is the right approach. On the other hand, using embedded escape sequences (like HTML) is perfectly all right, and in fact on page 15 of volume 1 there is an example of this kind. The only difference in this example is that a character code is not assigned as an escape sequence. Although the proposal could be amended to be compatible by using existing character codes, I think this would duplicate work being done for HTML, so I think this problem should be dealt with at a higher level. Furthermore, I don't think it needs to be dealt with before Unicode can start being used for WWW purposes. The only time such a tag will be used is (for example) if a client could display both Japanese and Chinese; the tag would then help specify which one to choose. For the vast majority of monolingual clients, you'll use whatever font is available, so the tag would be ignored. Plain Unicode is more than capable of basic legibility when displaying text, without additional language or font tags (again, see the design principles in section 2 of the standard). Such display may not be typographically ideal, but then HTML is nowhere near rich enough for advanced typography anyway (right now, anyway). For now, then I propose that this issue be deferred. High-level tags are the right way to deal with it, because like other font, size, style, etc. information, it's not critical to the basic legibility of the information. If HTML 3.0 has a language tag, all the more reason not to invent another mechanism. It certainly doesn't belong at the level of character codes. Section 5.1 I'm pleased to hear that the issue on not requiring stict MIME line break rules has been dealt with. It would be nice if there were a reference of some sort here so that people could see the decision in writing (assuming it's written down somewhere). Section 6 It's worth adding the URL for the Unicode WWW page, http://unicode.org/, in addition to the ISBN numbers for the two books. Again, thanks to Gavin for putting the work into writing up this proposal. Does anyone else have comments? ---------------------------- David Goldsmith david_goldsmith@taligent.com Senior Scientist Taligent, Inc. 10201 N. DeAnza Blvd. Cupertino, CA 95014-2233
Received on Monday, 16 January 1995 16:49:29 UTC