Re: Comments on Charmod PR publications from Martin Duerst on 2005-02-09 (www-i18n-comments@w3.org from February 2005)

From: Martin Duerst <duerst@w3.org>
Date: Wed, 09 Feb 2005 14:59:19 +0900
To: Bjoern Hoehrmann <derhoermi@gmx.net>
Cc: www-i18n-comments@w3.org, w3c-i18n-ig@w3.org (I18N IG, for archiving only), member-i18n-core@w3.org, Chris Lilley <chris@w3.org>
Message-Id: <6.0.0.20.2.20050209143445.106bbf98@localhost>

Hello Bjoern,

At 09:48 05/02/09, Bjoern Hoehrmann wrote:
 >* Martin Duerst wrote:
 >>It is just the mention of iso-8859-1 that is crucial in this context,
 >>as it was most often misused. People put up a page in an arbitrary
 >>8-bit encoding, labeled it as iso-8859-1, and constructed a font that
 >>made things look right. So using iso-8859-1 was explicitly part of
 >>the misuse, and trying to avoid mentioning it just obscures the issue.
 >
 >Maybe you can cite an example web page and a freely available font that
 >demonstrates the misuse you have in mind?

These days, these examples are fortunately getting rarer, and aren't
advertised as much anymore, so it's difficult to find them, but
here is an old page that explains this:
http://www.fedu.uec.ac.jp/ThaiMac/thaibrowser.html

These days, pages explaining things correctly, and doing things correctly,
abound, so it's very difficult to find pages that don't. But believe
me, there was a lot of this stuff around. I even once had to diagnose
a file I got from a colleague; it turned out to be Thai text interpreted
as iso-8859-1 and from there converted to UTF-8.
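
The kind of damage described above can be reproduced in a few lines.
This is only a minimal sketch, assuming the original file was in
TIS-620 (the message does not say which Thai encoding was involved).
The function names garble and repair are made up for illustration.

```python
def garble(text, real_encoding="tis_620"):
    """Simulate the misuse: Thai bytes read as iso-8859-1, then saved as UTF-8."""
    raw = text.encode(real_encoding)        # the bytes actually on disk
    mislabeled = raw.decode("iso-8859-1")   # wrong label: each byte becomes a Latin-1 character
    return mislabeled.encode("utf-8")       # the "conversion" bakes the mistake in


def repair(utf8_bytes, real_encoding="tis_620"):
    """Undo the damage, assuming we have guessed the real encoding correctly."""
    mislabeled = utf8_bytes.decode("utf-8")
    raw = mislabeled.encode("iso-8859-1")   # recover the original byte values
    return raw.decode(real_encoding)


if __name__ == "__main__":
    damaged = garble("\u0e44\u0e17\u0e22")  # the word "Thai" written in Thai script
    print(damaged.decode("utf-8"))          # accented Latin letters, not Thai
    print(repair(damaged))                  # the Thai text again
```

The repair only works because iso-8859-1 maps every byte to exactly one
character, so no information is lost in the wrong decoding step.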

I have cc'ed Chris, maybe he can point us to more examples.

 >Do you mean that it matters
 >that the web page is encoded using ISO-8859-1?

In theory, it doesn't. In practice, that was the encoding available
on all browsers, so that's what everybody misused.

 >That would be weird as
 >HTML/XHTML require that text processing happens essentially independent
 >of the character encoding.

The whole thing is weird. That's why we are prohibiting it :-).

 >So, as far as I understand the comment in
 >the current document, it refers to a font that is defined in terms of
 >ISO-8859-1; maybe you can cite font technology that enables such mis-
 >use?

It's very easy. You take a font editor, take a font made to cover
the repertoire encoded by iso-8859-1, and change the accented Latin
characters and so on to something else, e.g. Thai. Any font technology
enables such misuse. Font technology has no way to check whether
the glyph e.g. for codepoint U+00F6 is what people might expect
in that font for an o-Umlaut or not.
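
Why nothing in the font can catch this is perhaps easiest to see with
a toy model. The sketch below treats a font as a plain table from code
point to glyph, which is a heavy simplification of a real character
map. It assumes the Thai glyphs were dropped into the slots matching
TIS-620 byte values; that layout is an assumption for illustration, as
are the names build_hacked_font and render.

```python
import unicodedata


def build_hacked_font():
    """Map Latin-1 code points U+00A1..U+00FF to the names of Thai glyphs."""
    font = {}
    for byte in range(0xA1, 0x100):
        try:
            thai = bytes([byte]).decode("tis_620")
        except UnicodeDecodeError:
            continue  # this byte value is not assigned in TIS-620
        # In iso-8859-1, byte value and code point are the same number,
        # so the key doubles as the code point the browser asks for.
        font[byte] = unicodedata.name(thai)
    return font


def render(text, font):
    """Look up each character; the table has no idea what the glyph should be."""
    return [font.get(ord(ch), "(normal glyph)") for ch in text]


HACKED_FONT = build_hacked_font()

if __name__ == "__main__":
    # The page says U+00F6 (o-Umlaut); the font hands back something else.
    print(render("\u00f6", HACKED_FONT))
```

The lookup succeeds for U+00F6 and returns a Thai glyph; the table
holds a shape and nothing that says which character it should depict.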

 >What I do not understand so far is why a character encoding is of
 >any significance in this context.

As I said, theoretically, any character encoding would do, but
in practice, it was iso-8859-1 that got misused.

 >>If you have any ideas of how to express things with mentioning
 >>iso-8859-1 (and again, not being overly complicated), that would
 >>be appreciated.
 >
 >Well, to me the current text does not make any sense, so I can't really
 >make a suggestion that involves ISO-8859-1. The conformance requirement
 >now only discusses code points and coded character sets, not character
 >encodings, so the requirement and the mention of ISO-8859-1 seem quite
 >orthogonal to each other.

In theory, yes, they are orthogonal. But in practice (mostly past
practice, fortunately), iso-8859-1 is very relevant, and it's much
easier for somebody who knew these kinds of misuse to recognise
what we are talking about in the way it's described now.

You seem to be fortunate to have come to Web internationalization
at a time when such misuses were already less frequent.

Maybe we could change

"This prohibits the construction of fonts that misuse e.g. iso-8859-1
to represent different scripts, characters, or symbols than what is
actually encoded in iso-8859-1."

to something like

"This prohibits the formerly frequent construction of fonts that misused
e.g. iso-8859-1 to represent different scripts, characters, or symbols
than what was actually encoded in iso-8859-1."

Would that help?

Regards,     Martin.

Received on Wednesday, 9 February 2005 05:59:56 UTC