Re: HTML - i18n / NCR & charsets from Dirk.vanGulik on 1996-11-27 (www-html@w3.org from November 1996)

From: Dirk.vanGulik <Dirk.vanGulik@jrc.it>
Date: Wed, 27 Nov 1996 15:20:21 +0100
To: masinter@parc.xerox.com, www-html@w3.org, www-international@w3.org
Message-Id: <9611271420.AA10224@ jrc.it>
Larry Masinter <masinter@parc.xerox.com>, Wed, 27 Nov 1996 
03:57:36 PST wrote:

> # Some possible solutions are proposed:
> 
> If people have old documents with illegal numeric character references
> in them, they should change them to not use illegal numeric character
> references. All of your proposed solutions are inferior.

I quite agree they are inferior, if and when people turn out to be
prepared to make those changes.
 
> If people want to remain 'bugward compatible' with old browsers, they
> can use content negotiation based on the user agent string:

This was not the issue I was worried about; if, on the server side,
publishers are willing to update their documents, they do not need the
above 'bugward' switches so much. Nor do the browser vendors, who
put a lot of heuristics into their software to guess.

What I am after are those publishers who cannot be bothered to change,
who mainly 'think' that they publish for their in-house world, and thus
cause a lot of heuristics/tricks/escape hatches to be added to
the browser & search engine code. With a little extra effort,
those pages using proper NCRs indexing into Unicode CP could be
marked unambiguously; this puts no additional strain on vendors
willing to i18n their products, and such a distinction gives a
higher quality.

> I often use a mac-based web browser & find the windows codepage
> characters really annoying since they don't display properly anyway.

It is extremely annoying, but in a world where one vendor dominates
the desktops there is little incentive to change. 

Furthermore, looking at the various sets that the British, German, Dutch,
French, Italian and Spanish standards organizations have, which all point
to iso-8859/1 and all claim that that is what they are, I do notice that
in quite a few one finds the W,w with ^, the ..., tm, percentiel and
multiplication/bullet dot in the undefined range. Doubtless this is wrong
from the ISO point of view; but in the national standards I can access from
here, as well as in quite a few of the major editors admitted to the European
market, I find rather misleading alphabet listings in the appendixes. And
the DOS code pages are even worse.
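
To make the "undefined range" concrete, here is a minimal sketch. It
assumes the Windows code page in question is cp1252, which is my guess
and not something stated above. The helper name describe() is
hypothetical. It shows that the same byte is a control code when read
as iso-8859-1 but a printable character when read as cp1252.

```python
import unicodedata

# A few bytes from the 0x80-0x9F range. ISO-8859-1 leaves this range
# to control codes; cp1252 (assumed here) assigns printable characters.
SAMPLE_BYTES = [0x85, 0x89, 0x95, 0x99]


def describe(byte_value):
    """Return (Unicode category under latin-1, character name under cp1252)."""
    raw = bytes([byte_value])
    as_latin1 = raw.decode("latin-1")
    as_cp1252 = raw.decode("cp1252")
    return unicodedata.category(as_latin1), unicodedata.name(as_cp1252)


for b in SAMPLE_BYTES:
    latin1_category, cp1252_name = describe(b)
    print(f"0x{b:02X}: latin-1 category {latin1_category}, cp1252 {cp1252_name}")
```

Category "Cc" means a control character, which is why a browser that
follows iso-8859-1 strictly has nothing sensible to display for these
bytes.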

This behaviour is quite widespread; of the 170K pages harvested around the
EU servers and web directory servers:

	20K ignored because too small/too few recognized words
	 3K signaled charset not latin1
	 2K signaled a lang/dialect
	11K had a better spellcheck score when the
	    charset announcement was changed to latin2,...
	37K had &#num references
	34K had 'illegal' references (i.e. not in the unicode
	    valid range) according to html-2/rfc and i18n draft
	 2K had symbolic references.
	 9K had references which, when the guessed/set charset
	    was taken into account rather than latin1, would
	    improve the spellcheck score.
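
As an illustration of the kind of check behind the 'illegal' count, a
rough sketch follows. It is not the script used for the survey. It
assumes 'illegal' means a decimal &#num reference in the range 128-159,
and the names NCR_PATTERN and classify_ncrs() are made up for this
example.

```python
import re

# Decimal numeric character references such as &#233; (the trailing
# semicolon is optional here, to tolerate sloppy markup).
NCR_PATTERN = re.compile(r"&#([0-9]+);?")


def classify_ncrs(html_text):
    """Split the NCR values found in html_text into (legal, suspect) lists."""
    legal, suspect = [], []
    for match in NCR_PATTERN.finditer(html_text):
        code = int(match.group(1))
        # Assumption: 128-159 is the range with no printable character
        # in iso-8859-1, so a reference into it is treated as suspect.
        if 128 <= code <= 159:
            suspect.append(code)
        else:
            legal.append(code)
    return legal, suspect


page = "Caf&#233; &#153; price &#149; 10&#137;"
legal, suspect = classify_ncrs(page)
print("legal:", legal)
print("suspect:", suspect)
```

A page with a non-empty suspect list is one where the NCRs were most
likely written against the platform code page rather than Unicode.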

This survey is heavily biased towards European projects/countries.
But such an established usage habit in one of the target areas
of i18n, which for most people does not matter because their
Windows box does it OK, is going to give a hell of a legacy
problem unless you add something to signal proper NCR use.

Dw.

> # If HTML-i18n is to go ahead without any signaling about the NCRs'
> # target charset change (i.e. in Unicode rather than the announced
> # charset), then IMHO this should at least be mentioned in the draft,
> # as it breaks existing, widespread practice, which prior to this
> # i18n draft could not be signalled as 'wrong' or 'illegal'.
Received on Wednesday, 27 November 1996 09:20:43 UTC