HTML - i18n / NCR & charsets

Dirk.vanGulik (Dirk.vanGulik@jrc.it)
Tue, 26 Nov 1996 19:13:23 +0100


Date: Tue, 26 Nov 1996 19:13:23 +0100
From: "Dirk.vanGulik" <Dirk.vanGulik@jrc.it>
Message-Id: <9611261813.AA08437@ jrc.it>
To: www-html@w3.org
Subject: HTML - i18n / NCR & charsets
Cc: dirkx@elec.jrc.it

A short note on i18n-HTML and a possible problem with it: numerical
character references (NCRs) are defined as indexes into Unicode
rather than into the charset announced in HTTP, and this break with
current practice is not signalled out-of-band.

As HTML is often transported using HTTP, the current proposal
for an internationalized version of HTML conflicts in several ways
with widespread existing practice: the charset information that is
communicated 'out-of-HTML-band' at the HTTP level, or the default
latin1 assumption.

In the HTTP header, a resource sent out can be labeled with a charset.
This label is not part of the document stream, but is sent separately in
the MIME header of HTTP. If no charset is defined in such a way,
latin1 is to be assumed.
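
As a rough illustration of the labelling described above (a sketch only,
in Python, using the standard email.message module; the helper name
announced_charset is made up for this note), a receiver might work out
the announced charset like this:

```python
from email.message import Message

def announced_charset(content_type, default="iso-8859-1"):
    """Return the charset label of a Content-Type value, or the latin1 default."""
    msg = Message()
    msg["Content-Type"] = content_type
    # get_content_charset() returns the lower-cased charset parameter,
    # or the fallback value when no charset parameter is present.
    return msg.get_content_charset(default)

print(announced_charset("text/html; charset=ISO-8859-7"))
print(announced_charset("text/html"))
```

The point is only that the label lives outside the document stream; the
body itself is never inspected here.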

In the real world people have taken to using so-called Numerical
Glyph/Character References within their HTML documents, such as &#160;,
which are simply indexes into the 'defined' character set.

In the i18n proposal these numerical references are taken to be
indexes into the Unicode set, so-called 'code points', regardless
of the character set announced in the header (or in an http-equiv
in the actual body).

Currently most of these numerical references are intended by their
authors to be indexes into latin1 or, if a charset is announced in
the HTTP header by the server, as an index into that set.
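
To make the difference concrete, here is a small Python sketch of the
two readings of an NCR. The function name expand_ncrs is invented for
this note, it only handles decimal references, and the 'current
practice' branch assumes a single-byte charset. The Greek and
windows-1252 values are my own examples, not taken from the draft.

```python
import re

NCR = re.compile(r"&#([0-9]+);")

def expand_ncrs(text, charset=None):
    """Expand decimal NCRs in text.

    charset=None  -> the number is a Unicode code point (i18n proposal).
    charset given -> the number is a byte value in that charset
                     (current practice; single-byte charsets, 0..255 only).
    """
    def repl(match):
        n = int(match.group(1))
        if charset is None:
            return chr(n)
        # bytes([n]) raises ValueError for n > 255: not a single-byte index.
        return bytes([n]).decode(charset)
    return NCR.sub(repl, text)

# latin1 and Unicode agree on 0..255, so &#160; is unaffected.
same = expand_ncrs("&#160;", "iso-8859-1") == expand_ncrs("&#160;")

# With another announced charset the two readings diverge.
greek = expand_ncrs("&#225;", "iso-8859-7")   # GREEK SMALL LETTER ALPHA
uni = expand_ncrs("&#225;")                   # LATIN SMALL LETTER A WITH ACUTE
bullet = expand_ncrs("&#149;", "cp1252")      # BULLET in windows-1252
ctrl = expand_ncrs("&#149;")                  # a C1 control character in Unicode
```

So a document that is correct under one reading can silently change
meaning under the other, without any byte of it being altered.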

Effectively HTML has been upgraded to a new and better version, which
most certainly addresses, and has solved, some of the issues related
to internationalized publishing.

Although the i18n proposal is most certainly the way to go, and superior
in every respect, it does break some widespread current practice.

  I acknowledge that the cases where it breaks practice are few and far
  between, and mainly concern just a few pi-font symbols such as the
  bullet, but the principle is just as important. I also realize that
  there is a 'Gödel' problem in that the actual message cannot know about
  the charset representation, and that the content-type announcement
  of the charset in the HTTP header is thus dubious when it comes to NCRs.

Some possible solutions are proposed:

1. An extended Content-type header is used.
	Content-type: text/html.i18n
	Content-type: text/html-i18n

2. An additional attribute to the charset is used
	Content-type: text/html; charset=iso-8859-1; ncr=iso-104..

3. An additional (level) attribute to the text/html is used.
	Content-type: text/html; level=2; charset=iso8859-1
	Content-type: text/html; version=2.0/i; charset=iso8859-1

4. An additional DTD specifier in the HTML is insisted upon.
	<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 2.0i//EN">

5. An additional header is added to signal that the site 
   is internationalised.
	Content-Quality: i18n/v1.02

Please note that the effect accomplished by each of the above techniques
is similar; they serve to inform the receiving end about the way any
in-line numerical character references are to be treated.
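
A sketch of how a receiving end could act on such a signal, again in
Python and purely hypothetical: none of these labels is registered
anywhere, the function name ncr_target is invented, and only the
media-type form of option 1 and the level=2 form of option 3 are
covered.

```python
from email.message import Message

# Hypothetical media types from option 1 above.
I18N_TYPES = {"text/html.i18n", "text/html-i18n"}

def ncr_target(content_type):
    """Return 'unicode' if the header signals i18n NCRs, else the charset."""
    msg = Message()
    msg["Content-Type"] = content_type
    if msg.get_content_type() in I18N_TYPES:   # option 1
        return "unicode"
    if msg.get_param("level") == "2":          # option 3, level form
        return "unicode"
    # No signal: current practice, NCRs index the announced charset
    # (latin1 when nothing is announced).
    return msg.get_content_charset("iso-8859-1")

print(ncr_target("text/html-i18n"))
print(ncr_target("text/html; level=2; charset=iso8859-1"))
print(ncr_target("text/html; charset=iso-8859-7"))
```

Whichever signal is chosen, the decision is made from the header alone,
before any NCR in the body is expanded.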

Option number 1 is by far the easiest to implement, and some of
the deployed server and browser code is able to treat this as
an 'html' resource with an 'i18n' flavouring.

If HTML-i18n is to go ahead without any signalling of the change in the
NCRs' target charset (i.e. to Unicode rather than the announced
charset), then IMHO this should at least be mentioned in the draft,
as it breaks existing, widespread practice, which prior to this
i18n draft could not be labelled as 'wrong' or 'illegal'.

Dw.