- From: Pieter van Zee <piet@hpcvusm.cv.hp.com>
- Date: Thu, 23 Jun 1994 13:09:52 -0700
- To: www-html@www0.cern.ch
I've included an excerpt from a private e-mail conversation that relates to the question of charset encoding.

------------------------------------------------------------

> ... queried why one needed to specify the charset on each element.
> Wouldn't it be sufficient to specify the ISO 2022 mechanism at the
> MIME level and leave it to the escape mechanism to specify shifts
> between character sets?
>

I'll restate for clarity: My objective is to support multi-lingual content, i.e. to move away from the assumption that the entire content of an HTML file is in a single charset. Such documents are quite useful, e.g. cross-language dictionaries, newspapers, academic publications, etc. The proposal of putting charset on each element is one way to do this. Because there are many charsets that are appropriate for any given lang value, the charset is one way to uniquely identify the encoding.

Assuming we agree that HTML documents need to support multi-lingual content, let's discuss how this might occur. I ran the following by our i18n guru to verify my comments.

The phrase "specifying the ISO 2022 mechanism at the MIME level" isn't exactly clear to me. I'll take it to mean that whenever an HTML document is encapsulated as a MIME object for transport, the document must use ISO 2022 encoding for its content. Let's generalize and call this:

  Strategy (a): an HTML document has only ISO 2022-encoded content.

And my proposal is:

  Strategy (b): every HTML element has optional LANG and CHARSET attributes which specify the locale of the element's data.

In other words, an HTML document uses 7-bit ASCII for markup but may use any charset for content, and charset is specified in two ways: (i) an optional default charset for the document, and (ii) an optional charset attribute on every element that overrides the document default.

What are the relative merits and pitfalls?
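To make the override rule in strategy (b) concrete, here is a minimal sketch in Python. The function names are hypothetical, and the fallback to ASCII when neither charset is given is an assumption of this sketch, not something the proposal specifies.

```python
# Sketch of strategy (b): a per-element charset overrides an optional
# document default. All names here are hypothetical, for illustration only.
from typing import Optional

MARKUP_CHARSET = "ascii"  # the proposal keeps markup in 7-bit ASCII


def effective_charset(element_charset: Optional[str],
                      document_charset: Optional[str]) -> str:
    """Return the charset that governs an element's content."""
    if element_charset:        # (ii) the attribute on the element wins
        return element_charset
    if document_charset:       # (i) otherwise the document default applies
        return document_charset
    return MARKUP_CHARSET      # assumption: fall back to ASCII


def decode_content(raw: bytes,
                   element_charset: Optional[str] = None,
                   document_charset: Optional[str] = None) -> str:
    """Decode an element's raw bytes with its effective charset."""
    return raw.decode(effective_charset(element_charset, document_charset))


# A bilingual document: the default is Latin-1, one element carries EUC-JP.
print(effective_charset(None, "iso-8859-1"))      # document default applies
print(effective_charset("euc_jp", "iso-8859-1"))  # element attribute overrides
```

With this rule, a Japanese element inside a Latin-1 document is decoded by calling decode_content(raw, "euc_jp", "iso-8859-1"), and every unmarked element falls back to the document default.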
The short answer is that we can achieve the same end result with either strategy: (a) the LANG attribute plus ISO-2022 encoding of the content, or (b) LANG and CHARSET attributes on elements and content in that charset.

The longer answer is that the difference in effort for someone coding up and maintaining a parser, viewer, or translator is substantial. The effort differential arises because the ISO-2022 approach isn't well suited to leveraging the existing operating system infrastructure to support development. I guess I'll contend that we want to avoid making it hard for developers if reasonable alternatives exist.

Basically, with strategy (a), every program must know how to parse an ISO-2022 byte stream and map that to something meaningful on its platform. This means developing, on a per-program basis, lots of tables and code to parse the byte stream and use the tables appropriately, for example to load fonts directly through the X11R4 mechanisms. To support any new encodings or font sets, the tables and/or the program must be revised. Note that this approach is also problematic for PC-based clients.

Also, although the ISO-2022 mechanism supports baseline charset specifications, it does not support higher-level specifications that combine two or more baseline charsets. These aggregate charsets, such as Japanese SJIS and EUC, are the charsets that users are exposed to and which have OS infrastructure support.

With strategy (b), several advantages accrue. Because the LANG and CHARSET attributes can be combined to create a string suitable for setlocale(3C), a developer can then leverage all the infrastructure code inside the multibyte(3C) and X11R5 libraries, which already know how to parse byte strings in a given locale and work with font sets (families) to allow an aggregate charset to be rendered in full. Also, because ISO-2022 parsing and translation doesn't have to occur, there is a performance gain.
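As a rough illustration of combining LANG and CHARSET into a setlocale string, the sketch below uses Python's locale module, which wraps the C setlocale call. It assumes that LANG carries a value of the form language_TERRITORY (such as ja_JP) and that the locale name follows the common language[_TERRITORY][.codeset] convention. The helper names are hypothetical, and codeset spellings such as eucJP vary from platform to platform.

```python
import locale
from typing import Optional


def locale_name(lang: str, charset: Optional[str] = None) -> str:
    """Join LANG and CHARSET as language[_TERRITORY][.codeset]."""
    return f"{lang}.{charset}" if charset else lang


def try_setlocale(lang: str, charset: Optional[str] = None) -> bool:
    """Ask the OS for the locale; report whether it is installed."""
    try:
        locale.setlocale(locale.LC_ALL, locale_name(lang, charset))
    except locale.Error:
        return False   # the platform has no such locale
    return True


print(locale_name("ja_JP", "eucJP"))   # ja_JP.eucJP
if try_setlocale("ja_JP", "eucJP"):
    print("OS locale support is now active for this element")
else:
    print("locale not installed on this system")
```

Whether the call succeeds depends entirely on which locales the host system has installed, which is the point of the argument: the program carries no charset tables of its own.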
In addition, other locale-specific OS capabilities can be accessed, such as string collation and monetary, number, and time formats. Further, because these capabilities are in the OS and not the program, the program automatically benefits from infrastructure revisions and new capabilities. This seems especially useful given the evolving standards on both PC and workstation platforms.

Finally, it seems to me that strategy (b) is a superset of strategy (a). Using strategy (b), for example, one could specify that the default charset of the document is ISO-2022 and achieve strategy (a) with no further effort, while strategy (a) does not accommodate strategy (b) at all.

Using HTTP format negotiation and an appropriately equipped server, one could imagine servers that translate documents from one encoding to another (as best as possible) according to the capabilities of the viewer.

Piet van Zee
piet@cv.hp.com
Received on Thursday, 23 June 1994 22:10:04 UTC