Re: HTML - i18n / NCR & charsets from Martin J. Duerst on 1996-11-28 (www-html@w3.org from November 1996)

From: Martin J. Duerst <mduerst@ifi.unizh.ch>
Date: Thu, 28 Nov 1996 12:07:22 +0100 (MET)
To: Misha Wolf <MISHA.WOLF@reuters.com>
cc: www-html <www-html@w3.org>, www-international <www-international@w3.org>, Unicode <unicode@unicode.org>
Message-ID: <Pine.SUN.3.95.961128105343.1006B-100000@enoshima>

On Tue, 26 Nov 1996, Misha Wolf wrote:

> Indeed, ISO 8859-1 is a *strict* subset of Unicode, hence there are *no* 
> differences between the two.
> 
> Microsoft's Windows Code Page 1252 (often called Windows Latin 1) has 
> characters in the range 80-9F (decimal 128-159), unlike either the ISO 
> 8859-X family of standards or Unicode.
> 
> The NBSP is at A0 (decimal 160) and so presents no problems.  WCP 1252 
> has a bullet at 95 (decimal 149), not (as far as I can see) at decimal 
> 143.  The numeric character reference &#149; is illegal.
> 
> Chris Wendt, from Microsoft, agreed at Seville that the use of illegal 
> numeric character references was unfortunate and asked for suggestions.
> The consensus was that entity names should be used instead.  As entity 
> names do not (appear to) exist for most of Microsoft's extra chars, it 
> was suggested that some enterprising person write them up in an RFC.
> I believe there was at least one volunteer: Chris Lilley of W3C.

I was there, but don't remember this part of the discussion.
Defining entity names for things such as "..." may not be such
a bad idea.
However, a few related facts should be kept in mind before
actually doing this:

- Using 8-bit data directly and correctly labeling the page as
	being in Windows Code Page 1252 encoding is an existing
	solution (to the extent that browsers support CP 1252, and
	with the caveat that starting to use all kinds of
	proprietary encodings is not really ideal).
- Using the correct numeric character reference is also a
	solution. As this uses decimal values beyond 255,
	and I have not yet heard of any pages using such values
	for anything other than Unicode, it should not cause
	compatibility problems. It works on all browsers
	that support this part of the i18n spec.
- When we developed the i18n draft, we were repeatedly asked
	by various parties to include more entities. The
	requests covered all kinds of areas. We decided to complete
	Latin-1, but not to go beyond it, so as not to delay our
	work further. I guess if anybody starts to work on
	additional character entities, (s)he won't be able to stop
	with the few characters that are in CP 1252. The list may
	also quickly become too long to be feasible as a
	single list.
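
To illustrate the second point above, here is a minimal sketch (in Python, not part of the original discussion) of how an illegal reference such as &#149; could be mapped to the correct Unicode-based one. The function names cp1252_ncr and fix_illegal_ncrs are hypothetical, and the sketch assumes that a decimal reference in the range 128-159 was meant as a CP 1252 byte value; it relies only on Python's standard cp1252 codec.

```python
import re


def cp1252_ncr(byte_value: int) -> str:
    """Return the decimal numeric character reference for one CP 1252 byte.

    Raises UnicodeDecodeError for the few byte values that CP 1252
    leaves undefined (for example 0x81).
    """
    char = bytes([byte_value]).decode("cp1252")
    return f"&#{ord(char)};"


def fix_illegal_ncrs(html: str) -> str:
    """Rewrite decimal references in 128-159 to their Unicode code points.

    Assumption: such references were intended as CP 1252 byte values.
    References outside that range, or with no CP 1252 mapping, are kept.
    """
    def replace(match: re.Match) -> str:
        value = int(match.group(1))
        if 128 <= value <= 159:
            try:
                return cp1252_ncr(value)
            except UnicodeDecodeError:
                return match.group(0)
        return match.group(0)

    return re.sub(r"&#(\d+);", replace, html)


print(cp1252_ncr(0x95))                      # bullet      -> &#8226;
print(cp1252_ncr(0x85))                      # "..." char  -> &#8230;
print(fix_illegal_ncrs("a &#149; b &#160;"))  # -> a &#8226; b &#160;
```

The NBSP reference &#160; is left alone because decimal 160 is already the correct Unicode value, as noted in the quoted message.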

Regards,	Martin.

Received on Thursday, 28 November 1996 06:09:41 UTC