
Re: Accept-Charset support

From: Drazen Kacar <Drazen.Kacar@public.srce.hr>
Date: Thu, 5 Dec 1996 00:09:09 +0100 (MET)
Message-Id: <199612042309.AAA21094@jagor.srce.hr>
To: erik@netscape.com
Cc: Alan_Barrett/DUB/Lotus.LOTUSINT@crd.lotus.com, www-international@w3.org, bobj@netscape.com, wjs@netscape.com, Chris.Lilley@sophia.inria.fr, Ed_Batutis/CAM/Lotus@crd.lotus.com
Erik van der Poel wrote:

> How about using a more compact representation of Accept-Charset. E.g.
> bit masks corresponding to the number in the charset registry. This
> would omit the "q" parameter, but I'm not sure this is needed in the
> Accept-Charset case anyway. (It's probably needed for Accept-Language.)

I'd say it's mandatory for accept-language with transparent negotiation
(unless the quality value is 1.0, which is the default). Whether it's
needed for accept-charset or not is up to the information provider to
decide. I'd say yes.
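To make the q discussion concrete, here is a minimal sketch (in modern Python, with a function name of my own choosing, not any real server's API) of how a server might rank the charsets in such a header. Per HTTP, a charset listed without a q parameter defaults to quality 1.0:

```python
def parse_accept_charset(header):
    """Parse an Accept-Charset header into (charset, q) pairs,
    sorted by descending quality. A missing q defaults to 1.0."""
    prefs = []
    for item in header.split(","):
        parts = item.strip().split(";")
        charset = parts[0].strip().lower()
        q = 1.0  # absent q parameter means q=1 in HTTP
        for param in parts[1:]:
            name, _, value = param.strip().partition("=")
            if name.strip() == "q":
                q = float(value)
        prefs.append((charset, q))
    return sorted(prefs, key=lambda p: -p[1])

print(parse_accept_charset("iso-8859-2, utf-8;q=0.8, *;q=0.1"))
# [('iso-8859-2', 1.0), ('utf-8', 0.8), ('*', 0.1)]
```

A bit-mask representation would lose exactly this ordering information, which is the point of the q parameter.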

> > (1) If the user, through the UI, says they want to "Request Multi-Lingual
> > Documents" then the browser should send:-
> 
> I don't think we should have UI for the Accept-Charset. Think about
> novice users. Will they understand it?

Yes. Perhaps people who live in the Latin 1 world won't, but everything
works for them anyway. I live in the Latin 2 world and I have a reasonable
technical background, so I can hardly be called a novice user. But I can
tell you how it looks to novice users.

I'll take Usenet as an example; the web is even more confusing, because
there is no interaction with the person who set up the page.

My native language needs 5 letters from the Latin 2 code page; the rest is
in US-ASCII. Latin 2 on Unix means ISO 8859-2, and Unix hosts were
connected before anything else. Some people used ISO 8859-2 and some used
ASCII approximations. Then Windows came. Latin 2 on Windows today means
the windows-1250 code page; at the time, that CP was not registered with
IANA and didn't mean anything on the Internet. 1250 has the characters we
need, but some of them are not at the same positions as in ISO 8859-2.
NSN 2.0 was unaware of this and was showing iso-8859-2 documents with the
1250 code page. So our novice user sees some very strange characters and
is completely clueless about what's going on. Then he tries to post
something with the national characters in his local configuration and gets
a flame or two back, because Usenet is 7 bit and he did not send a charset
parameter in the Content-Type header, nor did he encode his post. At this
point we have a frustrated novice user.
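The mismatch is easy to demonstrate (a sketch in modern Python; the codec names are the registered IANA ones). The Croatian letter "š", for example, sits at byte 0xB9 in ISO 8859-2 but at 0x9A in windows-1250, so bytes labeled with one charset and displayed with the other turn into unrelated letters:

```python
text = "š"  # one of the five letters outside US-ASCII

latin2 = text.encode("iso-8859-2")    # b'\xb9'
cp1250 = text.encode("windows-1250")  # b'\x9a'
assert latin2 != cp1250               # same letter, different byte

# The NSN 2.0-style confusion: ISO 8859-2 bytes interpreted
# through the windows-1250 code page give a different letter.
garbled = latin2.decode("windows-1250")
print(garbled)  # 'ą' -- not 'š'
```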

A slight digression here. It didn't start with flaming. Unix people were
helpful at first, trying to explain the problems and to find solutions.
As a result, some of the Windows people installed ISO 8859-2. The majority
didn't. Those who didn't were unable to read or write the ISO code page.
Their GUIs were useless. Even if they telneted to a Unix host, they still
had 1250 on the terminal emulator, so it was the same. They had their own
experience, of which Unix people did not know much. National characters
were in use long before it dawned on someone at Microsoft that they were
needed. People usually used the 7 bit ISO 646 code page. There were screen
fonts, printer fonts, keyboard drivers, and everything worked. Then
Microsoft decided to ship the IBM 852 code page with DOS. Some people
switched, but not many. Then Windows arrived and Microsoft shipped CP1250.
Again some people switched. There was incompatibility on the same platform
and people were actively fighting for "their standard". Computer magazines
were pretty bad, and they still are. The situation, as they understood it,
called for one standard which had to be enforced upon everyone. With their
background on Microsoft platforms, they were completely unaware of MIME,
of the little addition to Content-Type in HTTP, and of the localization
already present on Unix. Unix people, on the other hand, got tired of the
same questions, of flames from those who were reading about "Unix
lunatics", of people who would not RTFM when kindly directed to it. The
war started. ISO 8859-2 won on Usenet.

On the web, CP1250 won the majority of pages, largely because there were
no authoring tools which knew about the difference between ISO 8859-2 and
CP1250.

There are languages that need the Latin 2 code page but don't suffer
because of this. People who use them might be ignorant of code page
problems. But those who have the problems know about charset issues. There
can't be novice users. Not after the first post with the wrong charset, or
after viewing the first web page in another charset.

Now, why is accept-charset needed? Take NSN 3.0, which can translate
ISO 8859-2 to windows-1250. What happens when there is no Latin 2 font
available? NSN uses ISO 8859-1. Do you know how that looks? If you can
understand any Latin 1 language that needs accented characters, you can
get a picture: take a page in French, German or whichever and display it
with the Latin 2 code page. That's how it looks. You can read it if you
really have to, but it's not pleasant in the least.
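The reverse of that experiment can be run directly (a Python sketch; this is my illustration, not NSN's actual code path). French text encoded as ISO 8859-1 but displayed through the Latin 2 code page keeps the ASCII letters intact while the accented ones drift:

```python
# A French phrase stored as ISO 8859-1, then displayed as if it
# were ISO 8859-2 -- the mirror image of the substitution NSN makes.
phrase = "très à propos"
stored = phrase.encode("iso-8859-1")
seen = stored.decode("iso-8859-2")
print(seen)  # 'trčs ŕ propos' -- readable if you have to, not pleasant
```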

Should I mention search engines? Or proxies? Better not. :)

Some servers convert code pages on the fly. This takes resources, but it's
the only way to ensure that the information is readable. Not nicely
presented, not cool, just plain readable without a headache.
That's why.
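A sketch of the kind of on-the-fly conversion such a server performs (Python again; the function names and the list of available charsets are my assumptions, not any particular server's interface): pick the best charset the client accepts and can be produced, then re-encode the stored document for it.

```python
def negotiate_charset(accept_header,
                      available=("iso-8859-2", "windows-1250", "utf-8")):
    """Pick the highest-q charset from Accept-Charset that the
    server can produce; fall back to the first available one."""
    prefs = []
    for item in accept_header.split(","):
        name, _, params = item.strip().partition(";")
        q = 1.0  # missing q means q=1
        if params.strip().startswith("q="):
            q = float(params.strip()[2:])
        prefs.append((name.strip().lower(), q))
    prefs.sort(key=lambda p: -p[1])
    for charset, q in prefs:
        if q > 0 and charset in available:
            return charset
    return available[0]

def transcode(stored_bytes, stored_charset, accept_header):
    """Re-encode a stored document for the requesting client."""
    target = negotiate_charset(accept_header)
    return stored_bytes.decode(stored_charset).encode(target), target

doc = "š đ ž".encode("iso-8859-2")
body, charset = transcode(doc, "iso-8859-2", "windows-1250, utf-8;q=0.5")
print(charset)  # 'windows-1250'
```

The decode/encode round trip is the resource cost mentioned above; the payoff is that the bytes on the wire match what the client says it can actually display.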

-- 
Life is a sexually transmitted disease.

dave@fly.cc.fer.hr
dave@zemris.fer.hr
Received on Wednesday, 4 December 1996 18:09:44 GMT
