Date: Thu, 28 Nov 1996 12:07:22 +0100 (MET)
From: "Martin J. Duerst" <firstname.lastname@example.org>
To: Misha Wolf <MISHA.WOLF@reuters.com>
cc: www-html <email@example.com>, www-international <firstname.lastname@example.org>
Subject: Re: HTML - i18n / NCR & charsets
In-Reply-To: <6814391926111996/A24242/RE6/11ABD4E70E00*@MHS>
Message-ID: <Pine.SUN.3.95.961128105343.1006B-100000@enoshima>

On Tue, 26 Nov 1996, Misha Wolf wrote:

> Indeed, ISO 8859-1 is a *strict* subset of Unicode, hence there are *no*
> differences between the two.
>
> Microsoft's Windows Code Page 1252 (often called Windows Latin 1) has
> characters in the range 80-9F (decimal 128-159), unlike either the ISO
> 8859-X family of standards or Unicode.
>
> The NBSP is at A0 (decimal 160) and so presents no problems. WCP 1252
> has a bullet at 95 (decimal 149), not (as far as I can see) at decimal
> 143. The numeric character reference &#149; is illegal.
>
> Chris Wendt, from Microsoft, agreed at Seville that the use of illegal
> numeric character references was unfortunate and asked for suggestions.
> The consensus was that entity names should be used instead. As entity
> names do not (appear to) exist for most of Microsoft's extra chars, it
> was suggested that some enterprising person write them up in an RFC.
> I believe there was at least one volunteer: Chris Lilley of W3C.

I was there, but don't remember this part of the discussion.

Defining entity names for things such as "..." may not be that bad
an idea. However, one has to be aware of a few related facts before
actually doing this:

- Using 8-bit data directly and correctly labeling the page as being
  in Windows Code Page 1252 encoding is an existing solution (to the
  extent that browsers support CP 1252, and bearing in mind that
  starting to use all kinds of proprietary encodings is not really
  ideal).

- Using the correct numeric character reference is also a solution.
  As this uses decimal values beyond 255, and I have not yet heard of
  any pages using such values for anything other than Unicode, it
  should not cause compatibility problems. It works on all browsers
  that support this part of the i18n spec.

- When we developed the i18n draft, we were repeatedly asked by
  various parties to include more entities, covering all kinds of
  areas. We decided to complete Latin-1, but not to go beyond it, so
  as not to delay our work further. I guess that anybody who starts
  to work on additional character entities won't be able to stop with
  the few characters that are in CP 1252. The list may also quickly
  become too long to be feasible as a single list.

Regards, Martin.
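
As an illustration of the second option in the list above (using the correct numeric character reference), here is a minimal Python sketch. It assumes that an illegal reference in the range 128-159 was meant as a CP 1252 byte, and it relies on Python's built-in "cp1252" codec for the mapping to Unicode. The function names cp1252_to_unicode_ncr and fix_illegal_ncrs are hypothetical, chosen only for this example.

```python
import re
from typing import Optional


def cp1252_to_unicode_ncr(code: int) -> Optional[str]:
    """Read `code` as a CP 1252 byte and return a Unicode-based decimal NCR.

    Returns None when Python's cp1252 codec has no character at that
    position (for example decimal 143).
    """
    try:
        char = bytes([code]).decode("cp1252")
    except UnicodeDecodeError:
        return None
    return f"&#{ord(char)};"


def fix_illegal_ncrs(html: str) -> str:
    """Rewrite decimal NCRs in 128-159 to their Unicode code points.

    Assumption: such references were intended as CP 1252 characters.
    References outside that range, or with no CP 1252 mapping, are kept.
    """
    def repl(match: "re.Match[str]") -> str:
        code = int(match.group(1))
        if 128 <= code <= 159:
            fixed = cp1252_to_unicode_ncr(code)
            if fixed is not None:
                return fixed
        return match.group(0)

    return re.sub(r"&#(\d+);", repl, html)


if __name__ == "__main__":
    # The CP 1252 bullet (decimal 149) becomes a reference to U+2022.
    print(fix_illegal_ncrs("item &#149; next&#160;item"))
```

The resulting references use values beyond 255, so they fall under the case described above: they should only be read as Unicode code points.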