Date: Wed, 27 Nov 1996 15:20:21 +0100
From: "Dirk.vanGulik" <Dirk.vanGulik@jrc.it>
Message-Id: <9611271420.AA10224@ jrc.it>
To: firstname.lastname@example.org, email@example.com, firstname.lastname@example.org
Subject: Re: HTML - i18n / NCR & charsets

Larry Masinter <email@example.com>, Wed, 27 Nov 1996 03:57:36 PST wrote:

> # Some possible solutions are proposed:
>
> If people have old documents with illegal numeric character references
> in them, they should change them to not use illegal numeric character
> references. All of your proposed solutions are inferior.

I quite agree they are inferior, if and when people turn out to be
prepared to make those changes.

> If people want to remain 'bugward compatible' with old browsers, they
> can use content negotiation based on the user agent string:

This was not the issue I was worried about; if, on the server side,
publishers are willing to update their documents, they do not need
the above 'bugward' switches so much. Nor do the browser vendors, who
put a lot of heuristics into their software to guess.

What I am after are those publishers who cannot be bothered to change;
who mainly 'think' that they publish for their in-house world, and
thus cause a lot of heuristics/tricks/escape-hatches to be added to the
browser & search engine code.

With a little extra effort, those pages using proper NCRs indexing into
Unicode CP could be marked unambiguously; this puts no additional strain
on vendors willing to i18n their products, and such a distinction gives a
higher quality.

> I often use a mac-based web browser & find the windows codepage
> characters really annoying since they don't display properly anyway.

It is extremely annoying, but in a world where one vendor dominates the
desktops there is little incentive to change.

Furthermore, looking at the various sets that the British, German, Dutch,
French, Italian and Spanish standards organizations have, which all point
to iso-8859/1 and all claim that that is what they are, I do notice that in
quite a few one finds the W,w with ^, the ..., tm, percentile and
multiplication/bullet dot in the undefined range. Doubtlessly this is wrong
from the ISO point of view; but in the national standards I can access from
here, as well as in quite a few of the major editors admitted to the
European market, I find rather misleading alphabet listings in the
appendices. And the DOS code pages are even worse.

This behaviour is quite widespread; of the 170K pages harvested around the
EU servers and web directory servers:

   20K  ignored because too small/too few recognized words
    3K  signaled charset not latin1
    2K  signaled a lang/dialect
   11K  had a better spellcheck score when the charset announcement
        was changed to latin2,...
   37K  had &#num references
   34K  had 'illegal' references (i.e. not in the unicode valid range)
        according to html-2/rfc and i18n draft
    2K  had symbolic references.
    9K  had references which, when the guessed/set charset was taken
        into account, rather than latin1 would improve the spellcheck
        score.

This survey is heavily biased towards European projects/countries. But such
an established usage habit in one of the target areas of i18n, which for
most people does not matter because their windows box does it OK, is going
to give a hell of a legacy problem unless you add something to signal
proper NCR use.

Dw.

> # If HTML-i18n is to go ahead, without any signaling about the NCRs
> # target charset change (i.e in Unicode rather than the announced
> # charset); then IMHO this should at least be mentioned in the draft
> # as it breaks existing, widespread, practice, which prior to this
> # i18n draft could not be signalled as 'wrong' or 'illegal'.
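
For illustration only, here is a minimal Python sketch of how numeric character references of the kind counted above might be told apart. It is not the tooling used for the survey. The names classify_ncrs and reinterpret_as_cp1252 are hypothetical, and the sketch assumes that an 'illegal' reference means a decimal NCR in the range 128-159, which Unicode reserves for C1 control characters but which the Windows-1252 code page uses for printable characters such as the trademark sign.

```python
import re

# Decimal numeric character references only, e.g. "&#153;".
# Assumption: references always end in ";" (SGML allows other terminators).
NCR_PATTERN = re.compile(r"&#([0-9]+);")


def classify_ncrs(html):
    """Count NCRs that fall in the C1 control range versus all others."""
    counts = {"c1_range": 0, "other": 0}
    for digits in NCR_PATTERN.findall(html):
        code = int(digits)
        if 0x80 <= code <= 0x9F:
            # Control characters in Unicode; printable in Windows-1252.
            counts["c1_range"] += 1
        else:
            counts["other"] += 1
    return counts


def reinterpret_as_cp1252(code):
    """Return the character a Windows-1252 author probably meant, or None.

    Only codes in 128-159 are reinterpreted; a few of those bytes are
    undefined in Python's cp1252 codec, and these also yield None.
    """
    if not 0x80 <= code <= 0x9F:
        return None
    try:
        return bytes([code]).decode("cp1252")
    except UnicodeDecodeError:
        return None


if __name__ == "__main__":
    sample = "Caf&#233; Acme&#153; and so on&#133; Acme&#8482;"
    print(classify_ncrs(sample))
    print(repr(reinterpret_as_cp1252(153)))
```

A page where most references land in the c1_range bucket was probably authored against the Windows code page rather than Unicode, which is the kind of guess a browser or indexer is otherwise forced to make by heuristics.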