Re: HTML - i18n / NCR & charsets

Misha Wolf (MISHA.WOLF@reuters.com)
Fri, 29 Nov 1996 21:00:33 -0500 (EST)


In-Reply-To: <Pine.SUN.3.95.961128105343.1006B-100000@enoshima>
To: www-html <www-html@w3.org>, www-international <www-international@w3.org>,
        Unicode <unicode@unicode.org>
Message-Id: <2633002129111996/A67204/RE6/11ABED402000*@MHS>

On Tue, 26 Nov 1996, Misha Wolf wrote:

> Indeed, ISO 8859-1 is a *strict* subset of Unicode, hence there are *no* 
> differences between the two.
> 
> Microsoft's Windows Code Page 1252 (often called Windows Latin 1) has 
> characters in the range 80-9F (decimal 128-159), unlike either the ISO 
> 8859-X family of standards or Unicode.
> 
> The NBSP is at A0 (decimal 160) and so presents no problems.  WCP 1252 
> has a bullet at 95 (decimal 149), not (as far as I can see) at decimal 
> 143.  The numeric character reference &#149; is illegal.
> 
> Chris Wendt, from Microsoft, agreed at Seville that the use of illegal 
> numeric character references was unfortunate and asked for suggestions.
> The consensus was that entity names should be used instead.  As entity 
> names do not (appear to) exist for most of Microsoft's extra chars, it 
> was suggested that some enterprising person write them up in an RFC.
> I believe there was at least one volunteer: Chris Lilley of W3C.

On Thu, 28 Nov 1996, Martin Duerst wrote:

> I was there, but don't remember this part of the discussion.
> Defining entity names for things such as "..." may not be that
> bad an idea.
> However, one has to be aware of a few related facts before
> actually doing this:
> 
> - Using 8-bit data directly and correctly labeling the page as
> 	being in Windows Code Page 1252 encoding is an existing
> 	solution (as far as browsers support CP 1252, and as
> 	far as starting to use all kinds of proprietary encodings
> 	is not really ideal).
> - Using the correct numeric character reference is also a
> 	solution. As this uses decimal values beyond 255,
> 	and I have not yet heard of any pages using such values
> 	for something else than Unicode, it should not cause
> 	compatibility problems. It works on all browsers
> 	that support this part of the i18n spec.
> - When we developed the i18n draft, we were repeatedly asked
> 	from various parties to include more entities. This
> 	included all kinds of areas. We decided to complete
> 	Latin-1, but not to go beyond it to not delay our work
> 	further. I guess if anybody starts to work on additional
> 	character entities, (s)he won't be able to stop with
> 	the few characters that are in CP 1252. The list may
> 	quickly become so long as to not be feasible as a
> 	single list, also.
> 
> Regards,	Martin.

I take back my advocacy of the use of entity names to represent 
characters in the illegal range 80-9F.  As Martin implies, this 
approach would be difficult to deploy and would require growing 
tables of entity names in both browsers and authoring tools.  I 
agree with his view that the best, simplest and cheapest way to 
solve this is to:

1.  Implement support for Unicode-based numeric character 
    references in browsers and authoring tools.

2.  Start using the correct Unicode numeric character references 
    to represent all characters which would otherwise be illegal.

Note that point 1 does not require the application to support 
Unicode fully.  For instance, an application could recognise 
the correct Unicode numeric character references for characters 
that appear in the various Windows Code Pages in the 80-9F area, 
without actually processing them as Unicode characters.
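
As a rough illustration of the point above, here is a minimal Python 
sketch of a table-driven lookup that recognises only those Unicode 
numeric character references whose targets sit in the 80-9F area of 
Code Page 1252, and leaves every other reference untouched.  The names 
CP1252_HIGH and ncr_to_cp1252 are invented for this example, and the 
table is taken from Python's "cp1252" codec, which may not match the 
1996 version of the code page exactly.

```python
import re

# Hypothetical lookup table: Unicode code point -> CP1252 byte,
# restricted to the 80-9F area.  Built from Python's own codec.
CP1252_HIGH = {}
for byte in range(0x80, 0xA0):
    try:
        ch = bytes([byte]).decode("cp1252")
    except UnicodeDecodeError:
        continue  # position left undefined by the codec
    CP1252_HIGH[ord(ch)] = byte

# Decimal numeric character references such as "&#8220;"
NCR = re.compile(r"&#([0-9]+);")

def ncr_to_cp1252(markup: str) -> bytes:
    """Replace recognised decimal NCRs with single CP1252 bytes.

    References not found in the table are passed through unchanged,
    so no general Unicode processing is needed.
    """
    out = bytearray()
    pos = 0
    for m in NCR.finditer(markup):
        out += markup[pos:m.start()].encode("cp1252")
        code = int(m.group(1))
        if code in CP1252_HIGH:
            out.append(CP1252_HIGH[code])
        else:
            out += m.group(0).encode("ascii")
        pos = m.end()
    out += markup[pos:].encode("cp1252")
    return bytes(out)
```

The application here never handles a Unicode string beyond the lookup 
key: it only needs a small fixed table per code page.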

Take the example of the so-called smart quotes.  In Windows 
Code Page 1252 (Windows Latin 1), the LEFT DOUBLE QUOTATION MARK 
appears at position 93 (decimal 147).  As the numeric character 
reference "&#147;" would be illegal, an authoring tool should 
use the Unicode character U+201C, represented, in decimal, as 
"&#8220;".  A browser would recognise "&#8220;" as representing 
the LEFT DOUBLE QUOTATION MARK and could convert it, for display 
purposes, to any encoding it liked, e.g. to character 93 in Windows 
Code Page 1252.
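
The authoring-tool side of the smart-quotes example might look roughly 
like the following Python sketch.  The function name cp1252_to_markup 
is made up for illustration, and the sketch assumes the input bytes 
really are in Code Page 1252 as defined by Python's "cp1252" codec.

```python
def cp1252_to_markup(data: bytes) -> str:
    """Turn CP1252 text into markup using Unicode decimal NCRs.

    Characters above U+00FF (which is where the 80-9F area of
    CP1252 maps to) are written as "&#NNNN;"; all other characters
    are kept as they are.
    """
    text = data.decode("cp1252")
    parts = []
    for ch in text:
        code = ord(ch)
        if code > 0xFF:
            parts.append(f"&#{code};")  # e.g. U+201C -> "&#8220;"
        else:
            parts.append(ch)
    return "".join(parts)

# Byte 93 hex (LEFT DOUBLE QUOTATION MARK in CP1252) becomes the
# legal reference "&#8220;" rather than the illegal "&#147;".
print(cp1252_to_markup(b"\x93quoted\x94"))
```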

Regards to the turkeys,
Misha