Re: HTML - i18n / NCR & charsets from Misha Wolf on 1996-11-27 (www-international@w3.org from October to December 1996)

From: Misha Wolf <MISHA.WOLF@reuters.com>
Date: Tue, 26 Nov 1996 19:39:14 -0500 (EST)
To: www-html <www-html@w3.org>, www-international <www-international@w3.org>, Unicode <unicode@unicode.org>
Message-Id: <6814391926111996/A24242/RE6/11ABD4E70E00*@MHS>
Indeed, ISO 8859-1 is a *strict* subset of Unicode, hence there are *no* 
differences between the two.

Microsoft's Windows Code Page 1252 (often called Windows Latin 1) has 
characters in the range 80-9F (decimal 128-159), unlike either the ISO 
8859-X family of standards or Unicode.

The NBSP is at A0 (decimal 160) and so presents no problems.  WCP 1252 
has a bullet at 95 (decimal 149), not (as far as I can see) at decimal 
143.  The numeric character reference &#149; is illegal.
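The difference between the two 8-bit encodings can be checked directly. The following is only a small sketch using Python's standard codecs ("cp1252" and "latin-1" are Python's codec names, not anything defined in this thread); it decodes the single byte 95 hex under both encodings.

```python
import unicodedata

# Byte 0x95 (decimal 149) under the two 8-bit encodings discussed above.
raw = bytes([0x95])

as_cp1252 = raw.decode("cp1252")    # Windows Code Page 1252
as_latin1 = raw.decode("latin-1")   # ISO 8859-1

print(f"cp1252:  U+{ord(as_cp1252):04X}")   # U+2022, the bullet
print(f"latin-1: U+{ord(as_latin1):04X}")   # U+0095, a C1 control code

# In ISO 8859-1 (and Unicode) the position is a control code, not a glyph.
print(unicodedata.category(as_latin1))      # Cc

# The NBSP at A0 is the same byte in both encodings.
nbsp_same = "\u00a0".encode("cp1252") == "\u00a0".encode("latin-1")
print(nbsp_same)                            # True
```

Read as a Unicode code point, 149 therefore names a control code rather than a bullet, which is why the reference only "works" on systems that treat it as a WCP 1252 byte.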

Chris Wendt, from Microsoft, agreed at Seville that the use of illegal 
numeric character references was unfortunate and asked for suggestions.
The consensus was that entity names should be used instead.  As entity 
names do not (appear to) exist for most of Microsoft's extra chars, it 
was suggested that some enterprising person write them up in an RFC.
I believe there was at least one volunteer: Chris Lilley of W3C.

Misha

---

There are just a few differences; mainly in the empty block which
has the funny chars such as the bullet (143) and non-breaking-space
(160) to name the popular offenders.

---

Hmmm...Is there actually a difference in the first 256 codes of Unicode
and ISO8859-1? I thought they were identical over that range?

---

A short note on i18n-html and possible problems with numeric
character references (NCRs) being indexes into Unicode
rather than into the charset announced in HTTP, and with the lack
of out-of-band signalling of this break with current practice.

As HTML is often transported using HTTP, the current proposal
for an internationalized version of HTML causes several
conflicts with widespread existing practice and with charset
information communicated 'out-of-HTML-band' at the HTTP level, or
with the default latin1 assumption.

In the HTTP header, a resource sent out can be labeled with a charset.
This label is not part of the document stream, but is sent separately in
the MIME header of HTTP. If no charset is defined in such a way, 
latin1 is to be assumed. 
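As a rough illustration of that rule, the sketch below reads the charset label from a Content-Type value and falls back to latin1 when none is given. The function name declared_charset is made up for this example; the parsing relies on Python's standard email.message module.

```python
from email.message import Message


def declared_charset(content_type_value, default="iso-8859-1"):
    """Return the charset parameter of a Content-Type value.

    Falls back to `default` (latin1) when no charset is announced,
    as described above. `declared_charset` is a hypothetical helper.
    """
    msg = Message()
    msg["Content-Type"] = content_type_value
    return msg.get_param("charset", failobj=default)


print(declared_charset("text/html; charset=koi8-r"))
print(declared_charset("text/html"))
```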

In practice, people have taken to using so-called numerical
glyph/character references within their HTML documents, such as &#160;,
which are simply indexes into the 'defined' character set.

In the i18n proposal these numerical references are taken to be
indexes into the Unicode set, so-called 'code points'. This is regardless
of the character set announced in the header (or in an http-equiv
in the actual body).

Currently most of these numerical references are intended by their
authors to be indexes into latin1 or, if a charset is announced in 
the HTTP header by the server, as an index into that set. 
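The two readings can be put side by side. This is only a sketch under simplifying assumptions: the helper resolve_ncrs and its parameter names are invented for illustration, only decimal references are handled, and the "charset" reading assumes a single-byte charset and a number in the range 0-255.

```python
import re

NCR = re.compile(r"&#([0-9]+);")


def resolve_ncrs(text, interpretation="unicode", charset="latin-1"):
    """Replace decimal NCRs in `text` (hypothetical helper).

    interpretation="unicode": the number is a Unicode code point
                              (the i18n proposal's reading).
    interpretation="charset": the number is a byte value in `charset`
                              (the reading many authors intend today).
    """
    def repl(match):
        n = int(match.group(1))
        if interpretation == "unicode":
            return chr(n)
        # Single-byte charsets only; bytes([n]) raises ValueError for n > 255.
        return bytes([n]).decode(charset)

    return NCR.sub(repl, text)


# For latin1 both readings agree, e.g. for the NBSP:
agree = resolve_ncrs("&#160;") == resolve_ncrs("&#160;", "charset", "latin-1")

# For other announced charsets they can differ, e.g. 149 under cp1252:
legacy = resolve_ncrs("&#149;", "charset", "cp1252")   # bullet, U+2022
i18n = resolve_ncrs("&#149;", "unicode")               # control code, U+0095
```

Documents that only use references in the latin1 range under a latin1 label are unaffected; the break shows up when the announced (or assumed) charset is something else.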

Effectively HTML has been upgraded to a new and better version, which
most certainly addresses, and has solved, some of the issues related
to internationalized publishing.

Although the i18n proposal is most certainly the way to go, and superior
in every respect, it does break some widespread current practice. 

  I acknowledge that the cases where it breaks practice are few and far 
  between, and mainly concern just a few pi-font symbols such as the bullet,
  but the principle is just as important. Also I do realize that there
  is a 'Gödel' problem in that the actual message cannot know about the
  charset representation; and that thus the content-type announcement
  of the charset in the HTTP header is dubious when it comes to NCRs.

Some possible solutions are proposed:

1. An extended Content-type header is used.
	Content-type: text/html.i18n
	Content-type: text/html-i18n

2. An additional attribute to the charset is used
	Content-type: text/html; charset=iso-8859-1; ncr=iso-104..

3. An additional (level) attribute to the text/html is used.
	Content-type: text/html; level=2; charset=iso8859-1
	Content-type: text/html; version=2.0/i; charset=iso8859-1

4. An additional DTD specifier in the HTML is insisted upon.
	<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 2.0i//EN">

5. An additional header is added to signal that the site 
   is internationalised.
	Content-Quality: i18n/v1.02

Please note that the effects accomplished by each of the above techniques 
are similar; they serve to inform the receiving end about the way any
in-line numerical character references are to be treated.
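To make the receiving end's side concrete, here is a sketch of how a client might act on options 1 and 2. None of this is deployed or standardised: the media types and the ncr parameter are the proposals listed above, the function ncr_reading and the value "x-example" are made up for the example, and the parsing uses Python's standard email.message module.

```python
from email.message import Message

# Option 1: hypothetical media types proposed above.
I18N_TYPES = {"text/html.i18n", "text/html-i18n"}


def ncr_reading(content_type_value):
    """Decide how NCRs should be read for a given Content-Type value.

    Returns "unicode" for the proposed i18n media types, the value of
    the proposed `ncr` parameter if present (option 2), and otherwise
    the announced charset, defaulting to latin1 (current practice).
    """
    msg = Message()
    msg["Content-Type"] = content_type_value
    if msg.get_content_type() in I18N_TYPES:
        return "unicode"
    ncr = msg.get_param("ncr")
    if ncr is not None:
        return ncr
    return msg.get_param("charset", failobj="iso-8859-1")


print(ncr_reading("text/html-i18n"))
print(ncr_reading("text/html; charset=iso-8859-1; ncr=x-example"))
print(ncr_reading("text/html; charset=koi8-r"))
```

Options 3 to 5 would follow the same pattern, with the check moved to a level or version parameter, the DOCTYPE line, or a separate header.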

Option number 1 is by far the easiest to implement; and some of
the deployed server and browser code is able to treat this as
an 'html' resource with an 'i18n' flavouring.

If HTML-i18n is to go ahead, without any signalling about the NCRs'
target charset change (i.e. to Unicode rather than the announced
charset), then IMHO this should at least be mentioned in the draft,
as it breaks existing, widespread practice, which prior to this
i18n draft could not be signalled as 'wrong' or 'illegal'.

Dw.
Received on Tuesday, 26 November 1996 14:40:19 UTC