A truly multilingual WWW

Below is a paper that I have put together proposing a method for
making the WWW a truly international information repository. It is
submitted in the hope that it will raise discussion of the issues
involved, and perhaps serve as a goal for which we should all
strive. Certainly, the architecture outlined will not come to pass
overnight, but both client and server implementors could start
implementing the changes now, and as the tools spread, so the
architecture will too.

I would appreciate any and all comments. All errors of language or
fact are solely my responsibility (i.e. if I am making a fool of
anyone, I am making it of myself ;-)), and I would appreciate
correction.  

An HTML version of this paper is available upon request.

Merry Christmas to all!

-------------------------

                  HANDLING MULTILINGUAL DOCUMENTS IN THE WWW
                
                                           Gavin T. Nicol
                                           Electronic Book Technologies, Japan
                                           1-29-9 Tsurumaki, Setagaya-ku,
                                           Tokyo 154,
                                           Japan
                                           +81-3-3706-7351
                                           gtn@ebt.com 
    
ABSTRACT.
     The World Wide Web has enjoyed explosive growth in recent years, and
     there are now millions of people using it all around the world.
     Despite the fact that the Internet, and the World Wide Web, span
     the globe, there is, as yet, no well-defined way of handling
     documents that contain multiple languages, character sets, or
     encodings thereof. In this document, a method is proposed for
     cleanly handling such multilingual text on the WWW. 
     
1. Requirements for multilingual applications
   
   There are many issues facing a system claiming to be multilingual,
   though all issues fall into one of three categories:
    1. Data representation issues
    2. Data manipulation issues
    3. Data display issues
       
   This document is primarily concerned with data representation, though
   display issues are also discussed in some detail. Data manipulation
   is touched on only briefly.
   
  1.1 DATA REPRESENTATION ISSUES
  
   In general, the major data representation issues are character set
   selection, and character set encoding. The biggest problem with
   character set selection is that there are many competing standards;
   as Andy Tanenbaum once noted:
   
     The nice thing about standards is that there are so many to choose
     from. 
     
   Character set encodings suffer in much the same way. There are a large
   number of character set encodings, and the number is *not* decreasing.
   
   Any application that claims to be multilingual must obviously
   support the character sets, and encodings, used to represent the
   information it is processing. It should be noted that multilingual
   data could conceivably contain multiple languages, character sets, and
   encodings, further complicating the problem.
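
   To make the distinction concrete, here is a minimal sketch (in
   Python, used purely for illustration): the same three-character
   Japanese string produces a different byte sequence under each of
   three encodings in common use.

      # The same characters, three different storage encodings.
      text = u"\u65e5\u672c\u8a9e"        # "Nihongo" (Japanese)
      for name in ("iso2022_jp", "euc_jp", "utf-8"):
          print(name, repr(text.encode(name)))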
   
  1.2 DATA DISPLAY ISSUES
   
   Quite obviously, for a viewing application to be considered
   multilingual, it must be able to present multilingual data to the
   reader in a sensible manner. The most obvious problem to overcome
   here is font mapping; however, languages around the world have
   different writing directions as well, and some languages mix
   writing directions, which should also be handled "correctly."

   Note that in the above paragraph "in a sensible manner" does not
   necessarily mean "able to be rendered in its native format." One
   other possibility would be to render the multilingual document
   phonetically in ASCII. For example, if some Japanese text were sent
   to a person who cannot read Kanji, Hiragana, or Katakana, the
   browser could conceivably map the Japanese text into something like
   the following:

       Nihongo, tokuni Kanji, wa totemo muzukashii desu.

   possibly with some extra text indicating that this is Japanese.
   Another possibility is machine translation (which is becoming more
   viable year by year).
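
   As a sketch of such a phonetic fallback (hypothetical: the mapping
   table below is a toy, and a real system would need a full
   kana/Kanji-to-romaji dictionary), a browser might do something like:

      # Toy transliteration table; entries are illustrative only.
      KANA_TO_ROMAJI = {u"\u306b": "ni", u"\u307b": "ho",
                        u"\u3093": "n",  u"\u3054": "go"}

      def asciify(text):
          """Render unknown scripts phonetically in ASCII."""
          return "".join(KANA_TO_ROMAJI.get(ch, ch) for ch in text)

      print(asciify(u"\u306b\u307b\u3093\u3054"))   # -> "nihongo"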
   
  1.3 DATA MANIPULATION ISSUES
   
   In order to be multilingual, the application must be able to
   manipulate multilingual data (including such issues as collation,
   though this is not currently needed for browsing the WWW). One major
   issue here is the representation of a character within the
   application, and the representation of strings. In some applications,
   multibyte formats are used throughout; in others, fixed-width, wide
   characters are used; in still others, a combination of the two is used.
   
  1.4 SUMMARY OF REQUIREMENTS
   
   A multilingual application must be able to:
    1. Support the character sets and encodings used to represent the
       information being manipulated.
    2. Present the data meaningfully if the application is required to
       display the data.
    3. Manipulate multilingual data internally.
       
2. Bringing multilingual capabilities to the WWW

   Having established some basic requirements, it is now time to look at
   how the above fits into the World Wide Web.
   
  2.1 MIME ISSUES
   
   One of the problems with representing multilingual documents in the
   WWW is that MIME explicitly merges the character set and character set
   encodings together. In fact, it is probably more accurate to say that
   MIME specifies only the character set encoding, which in turn defines
   a character set by implication. For example:

    charset=unicode-1-1-utf-7

   actually specifies the UTF-7 encoding of Unicode explicitly, but only
   specifies Unicode implicitly.
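
   In other words, one token carries two pieces of information. A small
   sketch of the implication (in Python; the table entries are
   illustrative, not a registry):

      # A MIME charset token names an encoding; the character set
      # it belongs to follows only by implication.
      CHARSET_IMPLIES = {
          "unicode-1-1-utf-7": ("Unicode 1.1", "UTF-7"),
          "unicode-1-1-utf-8": ("Unicode 1.1", "UTF-8"),
          "iso-2022-jp":       ("JIS X 0208 and ASCII", "ISO-2022-JP"),
      }

      def implied_charset(token):
          return CHARSET_IMPLIES.get(token.lower(), (None, token))

      print(implied_charset("UNICODE-1-1-UTF-7"))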
   
   In addition, the MIME specification states that for the text/* data
   types, all line breaks must be indicated by a CRLF pair. This implies
   that certain encodings cannot be used within the text/* data types if
   the WWW is to be strictly MIME conformant.
   
  2.2 SGML ISSUES
   
   SGML does not define the representation of characters. ISO 8879
   defines a code set as a set of bit combinations of equal size, a
   character as an atom of information, a character repertoire as a group
   of characters, and a character set as a mapping of a character
   repertoire onto a code set such that each character is represented by a
   unique bit combination within the code set. As such, an SGML parser is
   independent of the physical representation of the data, and there is
   often an internal representation of characters that could be quite
   different to that used in data storage.
   
   ISO 8879 also defines some methods for handling things like ISO-2022,
   but some encodings for languages such as Thai cannot be handled by
   SGML, even if the SGML declaration is altered (though, it is possible
   for the application to deal with this within, or before, the entity
   manager).
   
  2.3 BROWSER ISSUES
  
   Basic requirements for a multilingual WWW browser were listed in
   section 1.4. Let's now look at each as it applies to a WWW browser
   like Mosaic.
   
    2.3.1 Support the character sets and encodings
   
   As noted above, there are a huge number of character sets, and
   encodings. If the above requirement is taken literally, it means that
   in order to have a multilingual WWW, each browser must potentially be
   able to understand this huge number of character sets and encodings.
   Taken at face value it also means that the SGML parser would need to
   handle a large number of character sets, and one would not be able to
   have multiple character sets within a document (or to be precise,
   SGML provides no way for the parser to handle a given bit combination
   representing more than one character within a given document). As
   such, the brute force approach would almost certainly have to map the
   multiple character sets and encodings to a single internal
   representation (though it is conceivable that character sets could be
   paged in and out, this approach would be very complicated to
   implement). This internal representation would almost certainly have
   to be 16 bits wide, or wider. SGML Working Group 8 has stated that
   multilingual systems should map data storage encodings to wide
   characters before, or in, the entity manager (and the sp parser from
   James Clark serves as a good model of their recommendations).
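
   A minimal sketch of that recommendation (in Python; the UTF-16
   buffer stands in for a UCS-2 style internal form):

      def to_internal(raw, storage_encoding):
          """Entity-manager stage: map any storage encoding to a
          single wide (16-bit) internal representation."""
          return raw.decode(storage_encoding).encode("utf-16-be")

      raw = u"\u306b\u307b\u3093".encode("iso2022_jp")
      internal = to_internal(raw, "iso2022_jp")  # 2 bytes per character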
   
    2.3.2 Present the data meaningfully
   
   One of the great benefits of SGML (and HTML to a lesser degree), is
   that it is independent of the display technology. As such, the
   presentation issue depends very much on the application (for example,
   rendering multilingual documents on a TTY might require the
   aforementioned phonetic rendering, while GUI based systems require
   font mapping).
   
   A common thread running through all display technologies is the need
   for some mapping from the character codes to some output
   representation. While not required, using 16 bit (or greater) codes
   will probably simplify the mapping task as it requires only a single
   lookup table.
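
   A sketch of such a single-table mapping (the ranges and font names
   are illustrative only):

      # One range lookup from a 16-bit code to a font.
      FONT_RANGES = [
          (0x0000, 0x007F, "latin-font"),
          (0x3040, 0x309F, "kana-font"),
          (0x4E00, 0x9FFF, "han-font"),
      ]

      def font_for(code_point):
          for lo, hi, font in FONT_RANGES:
              if lo <= code_point <= hi:
                  return font
          return "fallback-font"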
   
    2.3.3 Internal multilingual data manipulation
   
   There are a large number of issues here, but in general, having a
   fixed width character eases the task of manipulating that data,
   especially in languages such as C, because it makes memory management
   and indexing easier. In addition, fixed width codes require no
   synchronisation.
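
   The difference is easy to demonstrate (a sketch; the byte counts
   hold for this example string only):

      text = u"\u8a9eABC"              # one Kanji, three ASCII letters
      utf8 = text.encode("utf-8")      # variable width: 6 bytes
      ucs2 = text.encode("utf-16-be")  # fixed width: 8 bytes
      # In the fixed-width buffer, character i begins at byte 2*i;
      # no scan from the start of the string is required.
      third = ucs2[4:6]                # the third character, "B"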
   
3. Is Unicode the answer?
   
   In a word, YES! There are, however, a number of issues that need to
   be resolved in order for it to be used effectively.
   
  3.1 WHAT ARE THE BENEFITS?
   
   First, let's step back and look at the overall architecture.
   
   The desired features in a multilingual WWW system are:
    1. Allow publishers to create and manipulate documents in the local
       character set and character encoding scheme
    2. Allow readers to follow a URL to a document, and expect to be able
       to read it irrespective of the character set and encoding used for
       document creation.
    3. Allow readers to save the document in their preferred local
       encoding.
    4. Scalability. Will the solution work even when large numbers of
       languages, character sets, and encodings are used within a single
       document?
    5. Keep implementations as simple as possible.
       
   Unicode solves most of these items quite nicely.
   
   The second and fourth points are covered because Unicode provides
   codes for most languages of the world. It does not cover all languages
   completely, but it certainly contains enough for most common uses, and
   there are ways for handling the uncommon cases.
   
   The last point is covered because the UCS-2 encoding of Unicode is a 16
   bit wide, fixed width encoding. As pointed out above, using such fixed
   width characters simplifies SGML parsing, display, and data
   manipulation.
   
   The first and third points are not covered directly, but simple
   translation techniques can be used to achieve them as outlined below. 
   
  3.2 INCORPORATING UNICODE INTO THE WWW
     
   The following outlines a proposal for incorporating Unicode into the
   WWW in such a way that all of the above points are solved. Where the
   word server is used, it should be taken to mean a HyperText Transfer
   Protocol (HTTP) server unless specifically qualified otherwise.
   
     3.2.1 Unicode incorporation architecture
   
   In order to make multilingual support as painless as possible, it is
   proposed that all HTTP servers for multilingual documents *should* be
   able to convert documents from the local character set encoding to
   UCS-2, UTF-8, and UTF-7 (16, 8 and 7 bit encodings of Unicode). It is
   also proposed that all HTTP clients *should* be able to parse UCS-2,
   UTF-8 and UTF-7. It is *recommended* that browsers allow the data to be
   saved as UTF-7, UTF-8, or UCS-2 (similar to the current ftp
   interface). If possible, a browser *should* also allow the data to be
   saved in the local character set encoding, but that might not always
   be possible (for example, saving a document containing Arabic on an
   ASCII based system). Documents sent from servers would then use a
   content type of:

     Content-Type: text/...; charset=UNICODE-1-1-UTF-7
     Content-Type: text/...; charset=UNICODE-1-1-UTF-8
     Content-Type: text/...; charset=UNICODE-1-1-UCS-2

   Note that UTF-8 and UCS-2 will need some additional encoding applied
   to them in order to be strictly MIME compliant. An alternative is to
   use an application/* type specifier instead.
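
   A sketch of the server-side translation step (a hypothetical
   function; Python's utf_7 and utf-16-be codecs stand in for
   UNICODE-1-1-UTF-7 and UCS-2):

      MIME_LABEL = {"utf_7":     "UNICODE-1-1-UTF-7",
                    "utf-8":     "UNICODE-1-1-UTF-8",
                    "utf-16-be": "UNICODE-1-1-UCS-2"}

      def translate(raw, local_encoding, wire_encoding="utf_7"):
          """Decode from the local encoding, re-encode for the wire."""
          body = raw.decode(local_encoding).encode(wire_encoding)
          header = ("Content-Type: text/html; charset="
                    + MIME_LABEL[wire_encoding])
          return header, body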
   
   This architecture has the following benefits:
   
   Conceptual Simplicity
          
          The model is very clear conceptually: a document is created in
          the local encoding, which is then converted into a commonly 
          understood form. Client-side processing occurs using this
          representation.
          
   Ease of client implementation
          
          The number of clients far exceeds that of servers. Thus,
           forcing clients to deal with a large number of character sets
           and encodings requires more effort, and the lag in updating
           them will be greater. Imagine, for example, that someone
           invents a new character set encoding. In order for support to
           be added to clients, one
          would have to update all clients that could potentially access
          such data. Compare this to having to update only the servers
          working with that encoding directly. In addition, because
          Unicode and UCS-2 are already well defined, it is possible to
          write an SGML parser, display subsystem, and data manipulation
          functions optimised for that representation, realising
          significant performance gains.
          
           While initial implementors will need to write the code for
           handling Unicode, and the encodings thereof, it is expected
           that freely distributable libraries for such things will
           appear.
          
   MIME and ASCII compatibility
          
           UTF-7 and UTF-8 are largely ASCII compatible. In addition,
           UTF-7 was designed as a method for encoding Unicode in MIME,
           so it is well suited to WWW and MIME use.
          
    HTML compatibility
          
           Recent drafts of the HTML specification state that a MIME
           charset parameter will override the default ISO-8859-1, so this
          presents no problem. In addition, it should not require large
          changes to current HTML parsers.
          
   Truly multilingual
          
          As noted, this proposal uses Unicode. All languages defined
          within ISO-10646 can be used within a single document. SGML
          (HTML) parsers will work with UCS-2 directly, so different
          languages would be treated identically at the parsing level.
          
   While the cost of performing translations from the local encoding to
   one of the Unicode encodings might appear prohibitive, it is believed
   that intelligent servers will cache translated documents in a manner
   similar to current proxy caches. Between this, and the fact that most
   WWW documents are small, the performance hit should not be overly
   significant.
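
   Such a cache might be as simple as the following sketch
   (hypothetical; keyed by document and wire encoding):

      cache = {}

      def translate_cached(path, raw, local_encoding, wire_encoding):
          key = (path, wire_encoding)
          if key not in cache:
              cache[key] = raw.decode(local_encoding).encode(wire_encoding)
          return cache[key]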
   
    3.2.2 Extension to the basic architecture
   
   Requiring that *all* non-ASCII documents be converted to Unicode is
   probably a very poor idea as it would incur significant overhead.
   Instead, HTTP clients can indicate encoding preferences via the
   Accept: field in the request header. For example:

      Accept: text/html; charset=iso-2022-jp
      Accept: text/html; charset=unicode-1-1-ucs-2
      Accept: text/*;    charset=unicode-1-1-utf-7
   
   If a server is able to deliver the document in one of the preferred
   encodings, it should do so. This will allow clients and servers
   sharing a common local encoding to transfer documents without the
   overhead of Unicode translation. Note that most encodings will need
   additional encoding to strictly conform to the text/* MIME types.
   
   Assuming that all clients are indeed able to parse UTF-7, UTF-8, and
   UCS-2, the server should default to delivering multilingual documents
   in one of these encodings, as this provides the greatest probability
   that the client receives something it can meaningfully process.
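
   A sketch of the negotiation logic (hypothetical; a real server would
   also compare charset tokens case-insensitively):

      def choose_charset(client_prefs, server_has):
          """Pick a charset the client lists, else fall back to a
          Unicode encoding all clients are assumed to parse."""
          for wanted in client_prefs:          # client preference order
              if wanted in server_has:
                  return wanted
          return "unicode-1-1-utf-7"

      print(choose_charset(["iso-2022-jp", "unicode-1-1-ucs-2"],
                           {"iso-2022-jp", "euc-jp"}))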
   
    3.2.3 Accept-charset
   
   While the charset parameter is sufficient for implementing the
   extended architecture, it requires that for each MIME type in which
   character set encoding negotiation is desired, a charset parameter
   must be defined. Also, the server must be able to parse all the type
   specifications, and make meaningful decisions based upon them. As the
   number of deliverable types increases, so does the complexity of the
   server format negotiation subsystem.
   
   It seems desirable to be able to say to the server "for all data in
   which charset is meaningful, send it to me encoded as xxxxx", as this
   would tend to simplify decoding the data into the application's
   internal character representation. This can be accomplished by
   wildcarding the Accept: field, but a somewhat cleaner alternative
   would be to have an additional field, Accept-charset:, which sets the
   default encoding; requests like:

     Accept: text/html; charset=iso-2022-jp

   could be used to decide the next best encoding if that of 
   Accept-charset: cannot be delivered. Such cases would probably be 
   uncommon if one assumes that multilingual data will be sent as Unicode.
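
   A request using the proposed field might then look like the
   following (the header name and exact syntax are, of course, open to
   discussion):

     GET /doc.html HTTP/1.0
     Accept: text/html; charset=iso-2022-jp
     Accept-charset: unicode-1-1-utf-7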
   
    3.2.4 Presentational hints for Unicode
    
   While Unicode certainly serves as an excellent lowest common
   denominator for multilingual documents, systems using Unicode require
   more information than that contained in the character codes
   themselves. Probably the best known example of this is the Han
   unification used in Unicode. Unicode defines codes for characters that
   are shared between Chinese, Korean, and Japanese, but the glyph images
   used in each language are different. Hence, we need to know the
   language in which the character code occurred in order to display it
   "correctly". Another interesting case occurs in corporate names in
   Japan. It is common for Japanese corporations to use a slightly
   different glyph image for the characters that make up the company name
   as a way of distinguishing themselves. Again, we need extra
   information to map the base code onto the correct glyph image. Such
   data are referred to as presentational hints within this document.
   
   Given that a conversion from the local character set to Unicode is
   being performed by the server, and that this conversion is automatic,
   it seems possible for the conversion process to automatically include
   presentation hints in the converted output. Applications that
   understand the hints can use them to improve the conversion resolution
   where necessary, while other applications can simply ignore them, or
   remove them from the data stream. (Strictly speaking, presentational
   hints are not necessary as in most (perhaps all) cases, the text will
   be legible, even if the glyph image is not quite correct. Rather, they
   are desirable for top quality, and especially typographic quality,
   output.)
   
   The problem of representing presentational hints is a difficult one.
   Obviously, it is better to represent such data as tags, rather than as
   codes, and HTML 3.0 includes a <LANG> tag in the DTD specifically for
   this.
   
   However, high-level tag use (e.g. defining them in a DTD) fails for
   the following reasons:
     1. It is not transparent. The application processing the data stream
        must be able to parse the tags, even if it cannot do anything
        with them. This necessarily complicates the parser.
    2. There are probably a huge number of presentation hints that could
       be used, and the list is dynamic as societal trends tend to alter
       languages. Good examples can be found by comparing almost any
       current written form of a language to that used 100 years ago.
       Some languages have even changed dramatically in the last 50
       years.
       
   This argues for a low-level tag which is basically transparent to
   anything parsing the input data stream. This in turn implies that the
   presentational hints either take effect before the parser, or that
   they can be manipulated unambiguously as data (or that they can
   unambiguously be removed from the data stream).
   
   This paper will not attempt to define a format for presentation hints.
   Rather, three methods are outlined below, and it is hoped that
   subsequent discussion will lead to a decision as to which is most
   applicable to the WWW.
   
      Method 1: Code-based presentation hints
   
   Here, codes from the Private Use Area are allocated to represent
   presentational hints. The advantages of this method are that hints and
   data can be treated identically, and that hints can be removed
   transparently. The disadvantages are that it stops other applications
   from using the Private Use Area, and also, the Private Use Area has a
   limited range. In addition, this "pollutes" the character set with
   non-character data.
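
   A sketch of method 1 (the code points U+E000 and U+E001 are chosen
   arbitrarily from the Private Use Area for illustration):

      HINT_START, HINT_END = u"\ue000", u"\ue001"

      def strip_hints(text):
          """Applications that ignore hints can remove them
          transparently, leaving plain character data."""
          out, in_hint = [], False
          for ch in text:
              if ch == HINT_START:
                  in_hint = True
              elif ch == HINT_END:
                  in_hint = False
              elif not in_hint:
                  out.append(ch)
          return "".join(out)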
   
      Method 2: Encoding-based presentation hints
   
   Here, an encoding is used which has space for presentation hints. An
   example of this is the ICODE encoding proposed by Masataka Ohta which
   uses 21 bits. Mr. Ohta also defines an encoding called IUTF which is
   upwardly compatible with UTF-2. This method has all of the advantages
   of method 1, but would require at least 21 bits of storage per
   character.
   
      Method 3: Tag-based presentation hints
   
   Here, tags are defined to represent presentational hints. One tag
   might potentially serve multiple purposes (for example, a LANG tag can
   serve to specify any language). The key difference between this
   method, and the high-level tag method is that tag interpretation here
   occurs before the application proper sees the data. As such, the tags
   can be removed transparently. This method would require some API
   support for handling tags, probably based on callbacks. In addition,
   it would appear that one code from the Private Use Area will be
   needed in order to make tag identification completely unambiguous
   for all data streams.
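
   A sketch of such a callback interface (hypothetical; U+E000 again
   stands in for the single tag-marking code, and hints are assumed to
   be well formed):

      HINT_MARK = u"\ue000"

      def filter_hints(stream, on_hint):
          """Fire a callback for each hint tag; pass all other data
          through untouched, before the application proper sees it."""
          out, i = [], 0
          while i < len(stream):
              if stream[i] == HINT_MARK:
                  j = stream.index(HINT_MARK, i + 1)  # closing mark
                  on_hint(stream[i + 1:j])            # e.g. "lang=ja"
                  i = j + 1
              else:
                  out.append(stream[i])
                  i += 1
          return "".join(out)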
   
4. Summary
   
   To summarise, this document proposes the following:
    1. All servers of multilingual documents should be able to convert
       documents from their local encoding into UCS-2, UTF-8, and UTF-7.
    2. All clients should be able to at least parse such data.
     3. Clients and servers should also be able to transfer data directly
        if they share a common local encoding.
    4. A method should be decided upon which allows presentational hints
       to be inserted into the data stream to aid in glyph image
       disambiguation.
       
   It is believed that given such an architecture, the World Wide Web
   will become truly multilingual, and truly, a World Wide Web.
   
5. Discussion
   
   The following are a few notes on recent developments which have some
   bearing on the contents of this document.
   
  5.1 NOTES ON RELAXED CONTENT PARSING IN HTTP
  
   A recent development is that the HTTP Working Group has basically
   decided that HTTP will not require strict MIME conformance for textual
   data types. In effect, the recent decision says that the parsing of
   the message data should be done in accordance with the specified
   character set encoding. This thereby allows multilingual servers to
   send any encoding the client claims to understand, including UCS-2,
   the 16 bit encoding of Unicode. This should simplify data processing
   enormously.
   
  5.2 EXTENDED REFERENCE CONCRETE SYNTAX
   
   Recently, a proposal for an extended reference concrete syntax for
   SGML was sent to SGML Open. In this ERCS, the SGML declaration defines
   that the BASESET is "ISO 10646:199?//CHARSET UCS-2//EN", and that the
   DESCSET parameter gives 16 bits for each character.
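
   The relevant fragment of such an SGML declaration would look
   something like this (a sketch based on the parameters quoted above):

     CHARSET
     BASESET "ISO 10646:199?//CHARSET UCS-2//EN"
     DESCSET 0 65536 0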
   
  5.3 ISOLATING APPLICATIONS FROM LANGUAGE AND CHARACTER REPRESENTATION ISSUES
   
   The author believes that most applications, and especially those based
   on SGML, are basically independent of the underlying language and
   character representations. Obviously, characters need to be assigned
   to sets for parsing purposes (like LCNMSTRT for example), but most
   applications should never need to know anything more about a character
   until the glyph image, or information about the glyph image, is
   required. In other words, most parsers and processors need not be
   language-aware, but a text display system probably must be. This
   proposal emphasises this as much as possible by providing a uniform
   character code stream for the application to process.
   
  5.4 UNICODE DATA: DATA MANIPULATION AND FONT HANDLING
       
   For an excellent look at the issues involved, and one possible
   solution to them, it is well worth reading the papers about the Plan 9
   system from Bell Laboratories.
   
6. Bibliography

    The Plan 9 Papers
    ftp://research.att.com/
   
    East Asian Character Set Issues:
    A Proposal For An Extended Reference Concrete Syntax
    Rick Jelliffe
    Allette Systems
    Sydney, Australia
   
     The SGML Handbook
     Charles F. Goldfarb
     Oxford University Press
     ISBN 0-19-853737-9
    
    The Unicode Standard, Version 1.1
    Version 1.0, Volume 1, ISBN 0-201-56788-1
    Version 1.0, Volume 2, ISBN 0-201-60845-6
   
    Using Unicode with MIME
    D. Goldsmith
    http://ds.internic.net/rfc/rfc1641
   
    UTF-7: A Mail Safe Transformation Format of Unicode
    D. Goldsmith and M. Davis
    http://ds.internic.net/rfc/rfc1642
   
    MIME (Multipurpose Internet Mail Extensions) Part 1
    N. Borenstein and N. Freed
    http://ds.internic.net/rfc/rfc1521.ps
   
    MIME (Multipurpose Internet Mail Extensions) Part 2
    K. Moore
    http://ds.internic.net/rfc/rfc1522.txt
   
    Hypertext Transfer Protocol -- HTTP/1.0
    T. Berners-Lee, R. T. Fielding, H. Frystyk Nielsen
    ftp://ds.internic.net/internet-drafts/draft-fielding-http-spec-01.txt
    
_______________________________________________________________________________

   This document in no way reflects the opinions of Electronic Book
   Technologies. All opinions contained herein are solely those of the
   author. 
_______________________________________________________________________________
