Re: HTML - i18n / NCR & charsets from Martin J. Duerst on 1996-11-28 (www-international@w3.org from October to December 1996)

From: Martin J. Duerst <mduerst@ifi.unizh.ch>
Date: Thu, 28 Nov 1996 12:41:31 +0100 (MET)
To: "Dirk.vanGulik" <Dirk.vanGulik@jrc.it>
cc: masinter@parc.xerox.com, www-html@w3.org, www-international@w3.org
Message-ID: <Pine.SUN.3.95.961128121304.1006C-100000@enoshima>
On Wed, 27 Nov 1996, Dirk.vanGulik wrote:

> What I am after are those publishers who cannot be bothered to change;
> who mainly 'think' that they publish for their in-house world, and thus
> cause a lot of heuristics/tricks/escape-hatches to be added to
> the browser & search engine code. With a little extra effort
> those pages using proper NCRs indexing into Unicode CP could be
> marked unambiguously; this puts no additional strain on vendors
> willing to i18n their products; and such a distinction gives a
> higher quality.

The main problem today is that "charset", i.e. document encoding,
is not properly labeled. So search engines have to guess anyway.
Adding, for each "charset" to be guessed, one or two additional
variants:
- Uses numeric character references correctly or based on "charset".
- Uses character entity references correctly or based on "charset".
will not add much complication. For the guessing, the main problem
is how to assign feasibility values using heuristics such as
spell checkers, and not how to transform an incoming document
to various assumed character interpretations.
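
The two variants per guessed "charset" can be sketched as follows. This is
only an illustration under stated assumptions: the function names
(expand_ncrs, candidates) are hypothetical, only decimal &#num; references
are handled, and the scoring step (spell checking) is left out entirely.

```python
import re

# Decimal numeric character references such as &#177; (hypothetical helper).
NCR = re.compile(r"&#([0-9]+);")

def expand_ncrs(text, charset, standard=True):
    """Expand decimal NCRs in `text`.

    standard=True  reads the number as a Unicode code point (correct use).
    standard=False reads it as a byte value in `charset` (nonstandard use).
    """
    def repl(match):
        n = int(match.group(1))
        if standard:
            # Leave references beyond the Unicode range untouched.
            return chr(n) if n <= 0x10FFFF else match.group(0)
        if n > 255:
            # Cannot be a single byte; leave the reference as written.
            return match.group(0)
        return bytes([n]).decode(charset, errors="replace")
    return NCR.sub(repl, text)

def candidates(raw, charsets):
    """Yield (charset, standard, text) for every interpretation to be scored."""
    for cs in charsets:
        text = raw.decode(cs, errors="replace")
        for standard in (True, False):
            yield cs, standard, expand_ncrs(text, cs, standard)
```

For example, &#177; read correctly is the plus-minus sign (U+00B1), while
read as an iso-8859-2 byte it is a-ogonek (U+0105). A guesser would hand each
candidate text to its heuristics and keep the best-scoring one.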


For documents labeled "iso-8859-1" but actually being something
else (be it iso-8859-X or CP 1252), we can't introduce a second
"charset" parameter. Either they got it right, or they lie.
Or would you propose to have something like:

Content-Type: text/html; charset=iso8859-1
Content-Real-Charset: iso8859-2

It obviously wouldn't help. For CP 1252 there is, in addition,
no practical problem, as explained in previous messages.


For the transitory period, I would suggest that browser
makers add a little checkmark option to their "document encoding"
menu, entitled: Nonstandard interpretation of numeric character references
and character entity references (maybe shortened :-).

That will help readers to view nonstandard things, but won't give
the impression that it is the way to go, and won't put the burden
on good network citizens.


> Furthermore, looking at the various sets the British, German, Dutch,
> French, Italian and Spanish standards organizations have; which all point
> to iso-8859/1 and all claim that that is what they are, I do notice that
> in quite a few one finds the W,w with ^, the ..., tm, percentiel and
> multiplication/bullet dot in the undefined range. Doubtlessly this is wrong
> from the ISO point of view; but in the national standards I can access from
> here as well as in quite a few of the major editors admitted to the European
> market I find rather misleading alphabet listings in the appendixes. And
> the DOS code pages are even worse.

I hope you are speaking about the web pages of these organizations,
and not about the actual standards they issue. I have not heard
of any national standard that violates C1. Of course, these organisations
use imperfect tools, or human beings, to create their web pages, and
there things may go wrong, not only for "charset" and numeric character
references.


> This behaviour is quite widespread; of the 170K pages harvested around the
> EU servers and web directory servers
> 
> 	20K ignored because too small/too few recognized words
> 	 3K signaled charset not latin1
> 	 2K signaled a lang/dialect
> 	11K had a better spellcheck score when the 
> 	    charset announcement was changed to latin2,...
> 	37K had &#num references
>  	34K had 'illegal' references (i.e. not in the unicode 
>             valid range) according to html-2/rfc and i18n draft

How many had &#num references outside the 'illegal' range?

> 	 2K had symbolic references.
> 	 9K had references which, when the guessed/set charset 
> 	    was taken into account, rather than latin1 would
> 	    improve the spellcheck score.

How many of these remain if you ignore the CP1252 case, for
which there is an easy solution for tolerant browsers?
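
One way a tolerant browser could handle the CP1252 case is sketched below.
This is an assumption about what the "easy solution" looks like, not
something spelled out in this message: numeric references in the range
128-159 carry no useful characters when read as Unicode (they fall on C1
controls), so they can be read as CP 1252 bytes without conflict. The
function name tolerant_ncr is hypothetical.

```python
def tolerant_ncr(n):
    """Map the number of a numeric character reference to a character.

    Assumed tolerant behaviour: values 128-159 are read as CP 1252 bytes,
    everything else as a Unicode code point.
    """
    if 0x80 <= n <= 0x9F:
        try:
            return bytes([n]).decode("cp1252")
        except UnicodeDecodeError:
            # A few CP 1252 positions (e.g. 0x81) are unassigned;
            # fall back to the plain code point.
            return chr(n)
    return chr(n)
```

With this mapping, &#153; yields the trademark sign and &#133; the ellipsis,
while references such as &#233; are unaffected.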

> This survey is heavily biased towards european projects/countries.

It's probably very much a European problem. I don't know any
Japanese pages that would use &#num; or &uuml; or such.
They use binary values directly.


> But such an established usage habit in one of the target areas
> of i18n, which for most people does not matter, cause their
> windows box does it OK, is going to give a hell of a legacy
> problem unless you add something to signal proper NCR use.

As I said, add an option to the browser. Why should it be
more difficult for the user to distinguish unlabeled iso-8859-2
pages from iso-8859-1 pages than to distinguish nonconforming
iso-8859-2 pages from those that do the right thing?


Regards,	Martin.
Received on Thursday, 28 November 1996 06:41:52 UTC