Date: Tue, 26 Nov 1996 19:13:23 +0100
From: "Dirk.vanGulik" <Dirk.vanGulik@jrc.it>
Message-Id: <9611261813.AA08437@jrc.it>
To: firstname.lastname@example.org
Subject: HTML - i18n / NCR & charsets
Cc: email@example.com

A small bit of text on i18n-HTML and a possible problem: numerical character references are taken as references into Unicode rather than into the charset announced in HTTP, and there is no out-of-band signalling of this break with current practice.

As HTML is often transported over HTTP, the current proposal for an internationalized version of HTML conflicts in several ways with widespread existing practice around 'out-of-HTML-band' charset information communicated at the HTTP level, or with the default latin1 assumption.

In the HTTP header, a resource sent out can be labeled with a charset. This label is not part of the document stream, but is sent separately in the MIME header of HTTP. If no charset is defined in this way, latin1 is to be assumed.

In the actual world, people have taken to using so-called numerical glyph/character references within their HTML documents, such as &#160;, which are simply indexes into the 'defined' character set. In the i18n proposal these numerical references are taken to be indexes into the Unicode set, so-called 'codepoints'. This is regardless of the character set announced in the header (or in an http-equiv in the actual body). Currently, most of these numerical references are intended by their authors to be indexes into latin1 or, if a charset is announced in the HTTP header by the server, an index into that set.

Effectively, HTML has been upgraded to a new and better version, which most certainly addresses, and has solved, some of the issues related to internationalized publishing. Although the i18n proposal is most certainly the way to go, and superior in every respect, it does break some widespread current practice.
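The break can be made concrete with a small sketch (Python is used purely for illustration; the helper names are mine, not part of any proposal). For latin1 documents the two readings happen to coincide, which is why the breakage is mostly confined to other announced charsets:

```python
# Sketch contrasting the two readings of a numerical character
# reference such as &#149;. Helper names are illustrative only.

def ncr_as_charset_index(n, charset):
    # Pre-i18n practice: the number is an index into the announced charset.
    return bytes([n]).decode(charset)

def ncr_as_unicode_codepoint(n):
    # HTML-i18n proposal: the number is always a Unicode codepoint.
    return chr(n)

# Under latin1 the two readings agree for every index 0..255 ...
assert ncr_as_charset_index(233, "iso-8859-1") == ncr_as_unicode_codepoint(233)

# ... but a bullet written as &#149; against a windows-1252 document breaks:
print(ncr_as_charset_index(149, "windows-1252"))   # bullet, U+2022
print(repr(ncr_as_unicode_codepoint(149)))         # '\x95', a C1 control
```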
I acknowledge that the cases where it breaks practice are few and far between, and mainly concern just a few pi-font symbols such as the bullet, but the principle is just as important. I also realize that there is a 'Godel' problem, in that the actual message cannot know about its own charset representation, and that the Content-type announcement of the charset in the HTTP header is therefore dubious when it comes to NCRs.

Some possible solutions:

1. An extended Content-type header is used.

   Content-type: text/html.i18n
   Content-type: text/html-i18n

2. An additional attribute next to the charset is used.

   Content-type: text/html; charset=iso-8859-1; ncr=iso-104..

3. An additional (level) attribute on the text/html is used.

   Content-type: text/html; level=2; charset=iso8859-1
   Content-type: text/html; version=2.0/i; charset=iso8859-1

4. An additional DTD specifier in the HTML is insisted upon.

   <!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 2.0i//EN">

5. An additional header is added to signal that the site is internationalised.

   Content-Quality: i18n/v1.02

Please note that the effect accomplished by each of the above techniques is similar; they serve to inform the receiving end about the way any in-line numerical character references are to be treated. Option 1 is by far the easiest to implement, and some of the deployed server and browser code is able to treat this as an 'html' resource with an 'i18n' flavouring.

If HTML-i18n is to go ahead without any signalling of the change in the NCRs' target charset (i.e. into Unicode rather than the announced charset), then IMHO this should at least be mentioned in the draft, as it breaks existing, widespread practice which, prior to this i18n draft, could not be signalled as 'wrong' or 'illegal'.

Dw.