- From: Gavin Nicol <gtn@ebt.com>
- Date: Sun, 25 Dec 1994 10:56:45 -0500
- To: http-wg%cuckoo.hpl.hp.com@hplb.hpl.hp.com
- Cc: html-wg@oclc.org, rick0@allette.com.au, alb@ebt-inc.ebt.com, www@unicode.org
Below is a paper that I have put together proposing a method for making the WWW a truly international information repository. It is submitted in the hope that it will raise discussion of the issues involved, and perhaps serve as a goal for which we should all strive. Certainly, the architecture outlined will not come to pass overnight, but both client and server implementors could start implementing the changes now, and as the tools spread, so the architecture will too. I would appreciate any and all comments. All errors of language or fact are solely my responsibility (i.e. if I am making a fool of anyone, I am making it of myself ;-)), and I would appreciate correction. An HTML version of this paper is available upon request.

Merry Christmas to all!

-------------------------

HANDLING MULTILINGUAL DOCUMENTS IN THE WWW

Gavin T. Nicol
Electronic Book Technologies, Japan
1-29-9 Tsurumaki, Setagaya-ku, Tokyo 154, Japan
+81-3-3706-7351
gtn@ebt.com

ABSTRACT. The World Wide Web has enjoyed explosive growth in recent years, and there are now millions of people using it all around the world. Despite the fact that the Internet, and the World Wide Web, span the globe, there is, as yet, no well-defined way of handling documents that contain multiple languages, character sets, or encodings thereof. In this document, a method is proposed for cleanly handling such multilingual text on the WWW.

1. Requirements for multilingual applications

There are many issues facing a system claiming to be multilingual, though all of them fall into one of three categories:

  1. Data representation issues
  2. Data manipulation issues
  3. Data display issues

This document is primarily concerned with data representation, though display issues are also discussed in some detail. Data manipulation is discussed only briefly.

1.1 DATA REPRESENTATION ISSUES

In general, the major data representation issues are character set selection and character set encoding. The biggest problem with character set selection is that most are standards, and as Andy Tanenbaum once noted:

  The nice thing about standards is that there are so many to choose from.

Character set encodings suffer in much the same way: there are a large number of character set encodings, and the number is *not* decreasing. Any application that claims to be multilingual must obviously support the character sets, and encodings, used to represent the information it is processing. It should be noted that multilingual data could conceivably contain multiple languages, character sets, and encodings, further complicating the problem.

1.2 DATA DISPLAY ISSUES

Quite obviously, for a viewing application to be considered multilingual, it must be able to present multilingual data to the reader in a sensible manner. The common problem to overcome here is font mapping; however, languages around the world have different writing directions as well, and some languages have mixed writing directions, which should also be handled "correctly."

Note that in the above paragraph "in a sensible manner" does not necessarily mean "be able to be rendered in its native format." One other possibility would be to render the multilingual document as a phonetic rendering in ASCII. For example, if some Japanese text was sent to a person who cannot read Kanji, Hiragana, or Katakana, the browser could conceivably map the Japanese text into something like the following:

  Nihongo, tokuni Kanji, wa totemo muzukashii desu.

possibly with some extra text indicating that this is Japanese.
Another possibility is machine translation (which is becoming more viable year by year).

1.3 DATA MANIPULATION ISSUES

In order to be multilingual, the application must be able to manipulate multilingual data (including such issues as collation, though this is not currently needed for browsing the WWW). One major issue here is the representation of a character within the application, and the representation of strings. Some applications use multibyte formats throughout; others use fixed-width, wide characters throughout; still others use a combination.

1.4 SUMMARY OF REQUIREMENTS

A multilingual application must be able to:

  1. Support the character sets and encodings used to represent the information being manipulated.
  2. Present the data meaningfully if the application is required to display the data.
  3. Manipulate multilingual data internally.

2. Bringing multilingual capabilities to the WWW

Having established some basic requirements, it is now time to look at how the above fits into the World Wide Web.

2.1 MIME ISSUES

One of the problems with representing multilingual documents in the WWW is that MIME explicitly merges the character set and character set encoding together. In fact, it is probably more accurate to say that MIME specifies only the character set encoding, which in turn defines a character set by implication. For example:

  charset=unicode-1-1-utf-7

specifies the UTF-7 encoding of Unicode explicitly, but specifies Unicode itself only by implication. In addition, the MIME specification states that for the text/* data types, all line breaks must be indicated by a CRLF pair. This implies that certain encodings cannot be used within the text/* data types if the WWW is to be strictly MIME conformant.

2.2 SGML ISSUES

SGML does not define the representation of characters. ISO 8879 defines a code set as a set of bit combinations of equal size, a character as an atom of information, a character repertoire as a group of characters, and a character set as a mapping of a character repertoire onto a code set such that each character is represented by a unique bit combination within the code set. As such, an SGML parser is independent of the physical representation of the data, and there is often an internal representation of characters that can be quite different from that used in data storage. ISO 8879 also defines some methods for handling things like ISO 2022, but some encodings, for languages such as Thai, cannot be handled by SGML even if the SGML declaration is altered (though it is possible for the application to deal with this within, or before, the entity manager).

2.3 BROWSER ISSUES

Basic requirements for a multilingual WWW browser were listed in section 1.4. Let's now look at each as it applies to a WWW browser like Mosaic.

2.3.1 Support the character sets and encodings

As noted above, there are a huge number of character sets and encodings. If the above requirement is taken literally, it means that in order to have a multilingual WWW, each browser must potentially be able to understand this huge number of character sets and encodings. Taken at face value, it also means that the SGML parser would need to handle a large number of character sets, and one would not be able to have multiple character sets within a document (or, to be precise, SGML provides no way for the parser to handle a given bit combination representing more than one character within a given document).
As such, the brute force approach would almost certainly have to map the multiple character sets and encodings to a single internal representation (character sets could conceivably be paged in and out, but that approach would be very complicated to implement). This internal representation would almost certainly have to be 16 bits wide, or wider. SGML working group 8 has stated that multilingual systems should map data storage encodings to wide characters before, or in, the entity manager (and the sp parser from James Clark serves as a good model of their recommendations).

2.3.2 Present the data meaningfully

One of the great benefits of SGML (and HTML to a lesser degree) is that it is independent of the display technology. As such, the presentation issue depends very much on the application (for example, rendering multilingual documents on a TTY might require the aforementioned phonetic rendering, while GUI-based systems require font mapping). A common thread running through all display technologies is the need for some mapping from the character codes to some output representation. While not required, using 16 bit (or greater) codes will probably simplify the mapping task, as it requires only a single lookup table.

2.3.3 Internal multilingual data manipulation

There are a large number of issues here, but in general, having a fixed-width character eases the task of manipulating that data, especially in languages such as C, because it makes memory management and indexing easier. In addition, fixed-width codes require no synchronisation.
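To make the indexing point concrete, here is a minimal sketch (in Python, using an illustrative mixed-language string; nothing in it is part of this proposal) contrasting direct indexing into fixed-width 16-bit codes with the linear scan that a variable-width encoding such as UTF-8 forces:

  from array import array

  text = "Nihongo: 日本語"                       # illustrative mixed ASCII/Japanese text

  utf8 = text.encode("utf-8")                     # variable width: 1-3 bytes per character here
  ucs2 = array("H", (ord(c) for c in text))       # fixed width: one 16-bit code per character

  # Fixed width: the Nth character is a direct index.
  print(hex(ucs2[9]))                             # 0x65e5, the code for the first kanji

  # Variable width: finding the Nth character means scanning from the start,
  # because byte offsets and character indices no longer coincide.
  def nth_char_utf8(data, n):
      count = 0
      i = 0
      while i < len(data):
          if data[i] & 0xC0 != 0x80:              # a lead byte (not a continuation byte)
              if count == n:
                  j = i + 1
                  while j < len(data) and data[j] & 0xC0 == 0x80:
                      j += 1
                  return data[i:j].decode("utf-8")
              count += 1
          i += 1
      raise IndexError(n)

  print(nth_char_utf8(utf8, 9))                   # same character, found only by a linear scan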
3. Is Unicode the answer?

In a word, YES! Though there are a number of issues that need to be resolved in order for it to be used effectively.

3.1 WHAT ARE THE BENEFITS?

First, let's step back and look at the overall architecture. The desired features in a multilingual WWW system are:

  1. Allow publishers to create and manipulate documents in the local character set and character encoding scheme.
  2. Allow readers to follow a URL to a document, and expect to be able to read it irrespective of the character set and encoding used for document creation.
  3. Allow readers to save the document in their preferred local encoding.
  4. Scalability: will the solution work even when large numbers of languages, character sets, and encodings are used within a single document?
  5. Keep implementations as simple as possible.

Unicode solves most of these items quite nicely. The second and fourth points are covered because Unicode provides codes for most languages of the world. It does not cover all languages completely, but it certainly contains enough for most common uses, and there are ways of handling the uncommon cases. The last point is covered because the UCS-2 encoding of Unicode is a 16 bit wide, fixed-width encoding. As pointed out above, using such fixed-width characters simplifies SGML parsing, display, and data manipulation. The first and third points are not covered directly, but simple translation techniques can be used to achieve them, as outlined below.

3.2 INCORPORATING UNICODE INTO THE WWW

The following outlines a proposal for incorporating Unicode into the WWW in such a way that all of the above points are addressed. Where the word server is used, it should be taken to mean a HyperText Transfer Protocol (HTTP) server unless specifically qualified otherwise.

3.2.1 Unicode incorporation architecture

In order to make multilingual support as painless as possible, it is proposed that all HTTP servers for multilingual documents *should* be able to convert documents from the local character set encoding to UCS-2, UTF-8, and UTF-7 (the 16, 8 and 7 bit encodings of Unicode). It is also proposed that all HTTP clients *should* be able to parse UCS-2, UTF-8 and UTF-7. It is *recommended* that browsers allow the data to be saved as UTF-7, UTF-8, or UCS-2 (similar to the current FTP interface). If possible, a browser *should* also allow the data to be saved in the local character set encoding, but that might not always be possible (for example, saving a document containing Arabic on an ASCII-based system).

Documents sent from servers would then use a content type of:

  Content-Type: text/...; charset=UNICODE-1-1-UTF-7
  Content-Type: text/...; charset=UNICODE-1-1-UTF-8
  Content-Type: text/...; charset=UNICODE-1-1-UCS-2

though UTF-8 and UCS-2 will need some additional encoding applied to them in order to be strictly MIME compliant. An alternative is to use an application/* type specifier instead.

This architecture has the following benefits:

Conceptual simplicity
  The model is very clear conceptually: a document is created in the local encoding, which is then converted into a commonly understood form. Client-side processing occurs using this representation.

Ease of client implementation
  The number of clients far exceeds that of servers. Thus, forcing clients to deal with a large number of character sets and encodings requires more effort, and the lag in the update period will be greater. Imagine, for example, that someone wants to add support for some new character set encoding they invented. In order for support to be added to clients, one would have to update all clients that could potentially access such data. Compare this to having to update only the servers working with that encoding directly. In addition, because Unicode and UCS-2 are already well defined, it is possible to write an SGML parser, display subsystem, and data manipulation functions optimised for that representation, realising significant performance gains. While initial implementations will need to write the code for handling Unicode, and the encodings thereof, it is expected that freely distributable libraries for such things will appear.

MIME and ASCII compatibility
  UTF-7 and UTF-8 are largely ASCII compatible. In addition, UTF-7 was designed as a method for encoding Unicode in MIME, so it is perfect for WWW and MIME uses.

HTML compatibility
  Recent drafts of the HTML specification state that a MIME charset parameter will override the default ISO-8859-1, so this presents no problem. In addition, it should not require large changes to current HTML parsers.

Truly multilingual
  As noted, this proposal uses Unicode. All languages defined within ISO 10646 can be used within a single document. SGML (HTML) parsers will work with UCS-2 directly, so different languages would be treated identically at the parsing level.

While the cost of performing translations from the local encoding to one of the Unicode encodings might appear prohibitive, it is believed that intelligent servers will cache translated documents in a manner similar to current proxy caches. Between this, and the fact that most WWW documents are small, the performance hit should not be overly significant.
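As an illustration of the conversion step just described, the following is a minimal sketch of what a server-side translation might look like, assuming a document authored in EUC-JP; the local encoding, file name, and function name are illustrative only, and UTF-16BE stands in for UCS-2 since the two coincide for Basic Multilingual Plane text:

  import codecs

  LOCAL_ENCODING = "euc_jp"                  # whatever the publisher authored in (assumed)
  UNICODE_FORMS = {                          # the encodings every client is asked to parse
      "UNICODE-1-1-UTF-7": "utf_7",
      "UNICODE-1-1-UTF-8": "utf_8",
      "UNICODE-1-1-UCS-2": "utf_16_be",      # 16-bit codes; equivalent to UCS-2 for BMP text
  }

  def serve_as(path, charset):
      """Read a document stored in the local encoding and re-encode it in the
      requested Unicode form, returning the body and a Content-Type header line."""
      with codecs.open(path, encoding=LOCAL_ENCODING) as f:
          text = f.read()                    # decoded into the internal wide-character form
      body = text.encode(UNICODE_FORMS[charset])
      header = "Content-Type: text/html; charset=" + charset
      return body, header

  # Example call (hypothetical file name):
  # body, header = serve_as("doc-euc.html", "UNICODE-1-1-UTF-8")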
3.2.2 Extension to the basic architecture

Requiring that *all* non-ASCII documents be converted to Unicode is probably a very poor idea, as it would incur significant overhead. Instead, HTTP clients can indicate encoding preferences via the Accept: field in the request header. For example:

  Accept: text/html; charset=iso-2022-jp
  Accept: text/html; charset=unicode-1-1-ucs-2
  Accept: text/*; charset=unicode-1-1-utf-7

If a server is able to deliver the document in one of the preferred encodings, it should do so. This will allow clients and servers sharing a common local encoding to transfer documents without the overhead of Unicode translation. Note that most encodings will need additional encoding to strictly conform to the text/* MIME types.

Assuming that all clients are indeed able to parse UTF-7, UTF-8, and UCS-2, the server should default to delivering multilingual documents in one of these encodings, as this gives the greatest probability that the client receives something it can meaningfully process.

3.2.3 Accept-charset

While the charset parameter is sufficient for implementing the extended architecture, it requires that a charset parameter be defined for each MIME type in which character set encoding negotiation is desired. Also, the server must be able to parse all the type specifications, and make meaningful decisions based upon them. As the number of deliverable types increases, so does the complexity of the server's format negotiation subsystem.

It seems desirable to be able to say to the server "for all data in which charset is meaningful, send it to me encoded as xxxxx", as this would tend to simplify decoding the data into the application's internal character representation. This can be accomplished by wildcarding the Accept: field, but a somewhat cleaner alternative would be to have an additional field, Accept-charset:, which sets the default encoding; requests like:

  Accept: text/html; charset=iso-2022-jp

could then be used to decide the next best encoding if that of Accept-charset: cannot be delivered. Such cases would probably be uncommon if one assumes that multilingual data will be sent as Unicode.
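A minimal sketch of this negotiation follows; the header values, the set of deliverable charsets, and the function name are illustrative rather than prescribed. The logic prefers Accept-charset:, falls back to charset parameters on Accept:, and finally defaults to a Unicode encoding that all clients are assumed to parse:

  SERVER_CAN_SEND = {"iso-2022-jp", "unicode-1-1-utf-7",
                     "unicode-1-1-utf-8", "unicode-1-1-ucs-2"}
  DEFAULT_CHARSET = "unicode-1-1-utf-7"

  def choose_charset(accept_charset, accept):
      """Pick the charset to deliver from the Accept-charset: and Accept: values."""
      # 1. Honour Accept-charset: if the server can produce it.
      if accept_charset and accept_charset.strip().lower() in SERVER_CAN_SEND:
          return accept_charset.strip().lower()
      # 2. Otherwise look for charset parameters on the Accept: field.
      for item in accept.split(","):
          for param in item.split(";")[1:]:
              name, _, value = param.partition("=")
              if name.strip().lower() == "charset" and \
                 value.strip().lower() in SERVER_CAN_SEND:
                  return value.strip().lower()
      # 3. Fall back to a Unicode encoding every client is assumed to parse.
      return DEFAULT_CHARSET

  print(choose_charset(None,
        "text/html; charset=iso-2022-jp, text/*; charset=unicode-1-1-utf-7"))
  # -> iso-2022-jp: client and server share a local encoding, so no translation is needed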
3.2.4 Presentational hints for Unicode

While Unicode certainly serves as an excellent lowest common denominator for multilingual documents, systems using Unicode require more information than that contained in the character codes themselves. Probably the best known example of this is the Han unification used in Unicode: Unicode defines codes for characters that are shared between Chinese, Korean, and Japanese, but the glyph images used in each language are different. Hence, we need to know the language in which the character code occurred in order to display it "correctly". Another interesting case occurs in corporate names in Japan. It is common for Japanese corporations to use a slightly different glyph image for the characters that make up the company name as a way of distinguishing themselves. Again, we need extra information to map the base code onto the correct glyph image. Such data are referred to as presentational hints within this document.

Given that a conversion from the local character set to Unicode is being performed by the server, and that this conversion is automatic, it seems possible for the conversion process to automatically include presentational hints in the converted output. Applications that understand the hints can use them to improve the conversion resolution where necessary, while other applications can simply ignore them, or remove them from the data stream. (Strictly speaking, presentational hints are not necessary, as in most (perhaps all) cases the text will be legible even if the glyph image is not quite correct. Rather, they are desirable for top quality, and especially typographic quality, output.)

The problem of representing presentational hints is a difficult one. Obviously, it is better to represent such data as tags, rather than as codes, and HTML 3.0 includes a <LANG> tag in the DTD specifically for this. However, high-level tag use (eg. defining the tags in a DTD) fails for the following reasons:

  1. It is not transparent. The application processing the data stream must be able to parse the tags, even if it cannot do anything with them. This necessarily complicates the parser.
  2. There are probably a huge number of presentational hints that could be used, and the list is dynamic, as societal trends tend to alter languages. Good examples can be found by comparing almost any current written form of a language to that used 100 years ago. Some languages have even changed dramatically in the last 50 years.

This argues for a low-level tag which is basically transparent to anything parsing the input data stream. This in turn implies that the presentational hints either take effect before the parser, or that they can be manipulated unambiguously as data (or that they can unambiguously be removed from the data stream).

This paper will not attempt to define a format for presentational hints. Rather, three methods are outlined below, and it is hoped that subsequent discussion will lead to a decision as to which is more applicable to the WWW.

Method 1: Code-based presentation hints
  Here, codes from the Private Use Area are allocated to represent presentational hints. The advantages of this method are that hints and data can be treated identically, and that hints can be removed transparently. The disadvantages are that it stops other applications from using the Private Use Area, and also that the Private Use Area has a limited range. In addition, this "pollutes" the character set with non-character data. (A rough sketch of this method is given after the three methods.)

Method 2: Encoding-based presentation hints
  Here, an encoding is used which has space for presentation hints. An example of this is the ICODE encoding proposed by Masataka Ohta, which uses 21 bits. Mr. Ohta also defines an encoding called IUTF which is upwardly compatible with UTF-2. This method has all of the advantages of method 1, but would require at least 21 bits of storage per character.

Method 3: Tag-based presentation hints
  Here, tags are defined to represent presentational hints. One tag might potentially serve multiple purposes (for example, a LANG tag can serve to specify any language). The key difference between this method and the high-level tag method is that tag interpretation here occurs before the application proper sees the data. As such, the tags can be removed transparently. This method would require a certain amount of API for handling the tags, probably based on callbacks. In addition, it would appear that one code from the Private Use Area will be needed in order to make tag identification completely unambiguous for all data streams.
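As a rough illustration of Method 1 only, the sketch below brackets a language hint between two arbitrarily chosen Private Use Area codes (they are hypothetical, not allocated by this proposal or by Unicode), showing how hint-aware applications could read the hint while hint-unaware applications strip it transparently:

  # Hypothetical Private Use Area codes; nothing here is assigned by any standard.
  HINT_START = "\uE000"    # begins a hint
  HINT_END   = "\uE001"    # ends a hint

  def add_language_hint(text, lang):
      """Prefix text with an in-band language hint (e.g. lang="ja")."""
      return HINT_START + lang + HINT_END + text

  def strip_hints(stream):
      """Transparently remove hints, for applications that do not use them."""
      out, in_hint = [], False
      for ch in stream:
          if ch == HINT_START:
              in_hint = True
          elif ch == HINT_END:
              in_hint = False
          elif not in_hint:
              out.append(ch)
      return "".join(out)

  hinted = add_language_hint("\u9AA8", "ja")   # a unified Han code, hinted as Japanese
  assert strip_hints(hinted) == "\u9AA8"       # hint-unaware consumers see only the text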
4. Summary

To summarise, this document proposes the following:

  1. All servers of multilingual documents should be able to convert documents from their local encoding into UCS-2, UTF-8, and UTF-7.
  2. All clients should be able to at least parse such data.
  3. Clients and servers should also be able to transfer data directly if they share a local encoding.
  4. A method should be decided upon which allows presentational hints to be inserted into the data stream to aid in glyph image disambiguation.

It is believed that given such an architecture, the World Wide Web will become truly multilingual, and truly a World Wide Web.

5. Discussion

The following are a few notes on recent developments which have some bearing on the contents of this document.

5.1 NOTES ON RELAXED CONTENT PARSING IN HTTP

A recent development is that the HTTP Working Group has basically decided that HTTP will not require strict MIME conformance for textual data types. In effect, the recent decision says that the parsing of the message data should be done in accordance with the specified character set encoding. This allows multilingual servers to send any encoding the client claims to understand, including UCS-2, the 16 bit encoding of Unicode. This should simplify data processing enormously.

5.2 EXTENDED REFERENCE CONCRETE SYNTAX

Recently, a proposal for an extended reference concrete syntax (ERCS) for SGML was sent to SGML Open. In this ERCS, the SGML declaration defines the BASESET as "ISO 10646:199?//CHARSET UCS-2//EN", and the DESCSET parameter gives 16 bits for each character.

5.3 ISOLATING APPLICATIONS FROM LANGUAGE AND CHARACTER REPRESENTATION ISSUES

The author believes that most applications, and especially those based on SGML, are basically independent of the underlying language and character representations. Obviously, characters need to be assigned to sets for parsing purposes (like LCNMSTRT, for example), but most applications should never need to know anything more about a character until the glyph image, or information about the glyph image, is required. In other words, most parsers and processors need not be language-aware, but a text display system probably does. This proposal tries to emphasise this as much as possible by trying to provide a uniform character code stream for the application to process.

5.4 UNICODE DATA: DATA MANIPULATION AND FONT HANDLING

For an excellent look at the issues involved, and one possible solution to them, it is well worth reading the papers about the Plan 9 system from Bell Laboratories.

6. Bibliography

  The Plan 9 Papers. ftp://research.att.com/

  East Asian Character Set Issues: A Proposal For An Extended Reference Concrete Syntax. Rick Jelliffe, Allette Systems, Sydney, Australia.

  The SGML Handbook. Charles Goldfarb. Oxford University Press. ISBN 0-19-853737-9.

  The Unicode Standard, Version 1.1. Version 1.0, Volume 1: ISBN 0-201-56788-1; Version 1.0, Volume 2: ISBN 0-201-60845-6.

  Using Unicode with MIME. D. Goldsmith. http://ds.internic.net/rfc/rfc1641

  UTF-7: A Mail Safe Transformation Format of Unicode. D. Goldsmith and M. Davis. http://ds.internic.net/rfc/rfc1642

  MIME (Multipurpose Internet Mail Extensions) Part 1. N. Borenstein and N. Freed. http://ds.internic.net/rfc/rfc1521.ps

  MIME (Multipurpose Internet Mail Extensions) Part 2. K. Moore. http://ds.internic.net/rfc/rfc1522.txt

  Hypertext Transfer Protocol -- HTTP/1.0. T. Berners-Lee, R. T. Fielding, H. Frystyk Nielsen. ftp://ds.internic.net/internet-drafts/draft-fielding-http-spec-01.txt

_______________________________________________________________________________
This document in no way reflects the opinions of Electronic Book Technologies. All opinions contained herein are solely those of the author.
_______________________________________________________________________________