Re: (MAITS.496) html, http, urls and internationalisation

> 
> on my business card (there's a c-cedilla in there, for those who lost
> the 8th bit), how do you know what charset my business card uses?
> What bit pattern do you send my server to fetch that document?

I don't know which character encoding you have, but I know it is
a c-cedilla - and then I can send you a c-cedilla encoded in my
charset and labelled with my charset, and you can then translate
it into your internal charset. It all works on the abstract
character level, as it should.
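
The exchange described above can be sketched roughly as follows. This
is only an illustration, assuming Python 3 and its built-in codecs;
the function name transcode and the choice of cp437 as the receiver's
internal charset are hypothetical and not part of any URL or HTTP
specification.

```python
# Sketch: exchange a character by labelled encoding, not by raw bit pattern.
# `transcode` and the cp437 "internal charset" are illustrative assumptions.

def transcode(data: bytes, sender_charset: str, local_charset: str) -> bytes:
    """Re-encode labelled octets into the receiver's own charset."""
    text = data.decode(sender_charset)   # octets -> abstract characters
    return text.encode(local_charset)    # abstract characters -> local octets

# The sender encodes c-cedilla (U+00E7) in ISO-8859-1 and labels it as such.
sent = "\u00e7".encode("iso-8859-1")

# The receiver uses the label to map the octets into its own charset.
received = transcode(sent, "iso-8859-1", "cp437")

print(sent.hex(), "->", received.hex())
```

The octet values differ on each side, but both stand for the same
abstract character, because the label travels with the data.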

> There are not that many ways out of this problem: either URLs contain
> an indication of their charset, or the whole world agrees on a single
> implicit charset.  Today, the world *has* agreed on the charset for
> octets values < 127, and this must be taken into account for a wider
> solution.

Which one is that, and which encoding form is it? Is it ISO-8859-1,
which is the HTML default, or ISO 10646 in one of its many forms:
UCS-2, UCS-4, UTF-16, UTF-8 or UTF-7? Or is it one of the many
other standards that are used today on the web, such as the other
8859 parts?
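
To make the question concrete, here is a small sketch of what the
c-cedilla looks like on the wire in several of these forms. It
assumes Python 3.8 or later; Python's utf-16-be and utf-32-be codecs
stand in for UCS-2 and UCS-4, which is an assumption that holds for
this character but not for every code point.

```python
# Sketch: one abstract character, several octet sequences.
# utf-16-be / utf-32-be are used as stand-ins for UCS-2 / UCS-4 (assumption).

CEDILLA = "\u00e7"  # c-cedilla

FORMS = {
    "ISO-8859-1": "iso-8859-1",
    "UCS-2 / UTF-16 (big-endian)": "utf-16-be",
    "UCS-4 (big-endian)": "utf-32-be",
    "UTF-8": "utf-8",
    "UTF-7": "utf-7",
}

# Encode the same character once per encoding form.
octets = {name: CEDILLA.encode(codec) for name, codec in FORMS.items()}

for name, data in octets.items():
    print(f"{name:28} {data.hex(' ')}")
```

Every form yields a different octet sequence, so an unlabelled
non-ASCII octet in a URL cannot be interpreted without some prior
agreement.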

> Personally, I like the implicit UTF-8 idea: any non-ASCII character
> must be sent to a server as its UTF-8 encoding, either URL-encoded
> (the %XX hack) or not.  A server receiving a non-ASCII octet (or its
> URL-encoding) must interpret it as part of the UTF-8 encoding of a
> character.  No ambiguity, no need to tag the charset, and good
> compatibility with today's situation (ASCII only, in practice).

I use 8859-1 all the time here, and many other pages in Europe
contain 8859-1 characters. So please say "iso-8859-1" instead of
"ASCII" - this is actually also what the HTML standard specifies.
The uses of 8859-1 and UTF-8 are in conflict, as both use
the 8th bit.
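
The conflict can be shown with a short sketch, assuming Python 3 and
the standard urllib.parse.quote function; the variable names are only
illustrative.

```python
# Sketch: the same character URL-encoded under two charsets, and what
# happens when octets from one are read as the other.
from urllib.parse import quote

CEDILLA = "\u00e7"  # c-cedilla

# The %XX form depends entirely on which charset is assumed.
as_latin1 = quote(CEDILLA, encoding="iso-8859-1")
as_utf8 = quote(CEDILLA, encoding="utf-8")

# UTF-8 octets read as ISO-8859-1 decode without error, but to the
# wrong characters.
misread = CEDILLA.encode("utf-8").decode("iso-8859-1")

# ISO-8859-1 octets read as UTF-8 are not even well-formed here.
try:
    CEDILLA.encode("iso-8859-1").decode("utf-8")
    latin1_is_valid_utf8 = True
except UnicodeDecodeError:
    latin1_is_valid_utf8 = False

print(as_latin1, as_utf8, repr(misread), latin1_is_valid_utf8)
```

A server that silently assumes one charset will therefore either
reject or misinterpret requests written in the other.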
> 
> >I think that just using some kind of UCS would make it hard
> >when we have an environment where the html is in 8859-1 - that
> >would be mixing apples and oranges and thus very hard to maintain.
> 
> For one thing, there is plenty of HTML *not* in 8859-1.  Apart from
> that, most software (including HTML parsers) recognize only ASCII as
> syntax-significant, either passing 8-bit characters untouched and
> uninterpreted or chopping off the 8th bit, damaging 8859-1 just as
> surely as UTF-8.

I have been running in an 8-bit clean environment for years,
and do not recognize your description of an environment that
damages 8-bit characters.

I do agree that there are a lot of pages out there in 8-bit
charsets other than iso-8859-1. For those pages too, a standard
saying that URLs are always encoded in some fixed charset, say
UTF-8, would be mixing apples and oranges. That would be placing
requirements on URL writing at the wrong level of abstraction;
URLs are specified with abstract characters, so that they can be
written in newspapers, on business cards etc. We do not need to
know or specify the encoding (charset).

Keld

Received on Sunday, 28 January 1996 14:18:47 UTC