Re: Text on character encoding: single platform vs web from Martin Duerst on 2005-10-11 (public-i18n-geo@w3.org from October 2005)

From: Martin Duerst <duerst@it.aoyama.ac.jp>
Date: Tue, 11 Oct 2005 15:08:25 +0900
To: Jeremy Carroll <jjc@hplb.hpl.hp.com>, public-i18n-geo@w3.org
Message-Id: <6.0.0.20.2.20051011125549.0951c1e0@localhost>

Hello Jeremy,

I think this text could be very helpful with a bit of work.

However, I think the problem is that it assumes that each platform
has exactly one platform encoding, and that only that one should ever
be used. The reality is a bit different: most platforms had a single
encoding some time ago, and programming languages such as Java
picked this up. However, on Unix platforms, it has long been
possible to change the 'platform encoding' very easily (a single
'set' or 'setenv' command), even for the same user (and different
users could of course have different settings). Of course, you then
have to be a bit careful when exchanging files (or even when using
your own files).
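
To make the point about a changeable 'platform encoding' concrete,
here is a minimal Java sketch. The class and method names are made up
for illustration. It prints whatever default charset the JVM reports
(which depends on the JVM version and its environment) and then decodes
the same two bytes under two different encodings:

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical demo class; not part of Jena or of any library.
class PlatformEncodingDemo {

    /** Decodes the given bytes using an explicitly named charset. */
    public static String decode(byte[] bytes, Charset charset) {
        return new String(bytes, charset);
    }

    public static void main(String[] args) {
        // Whatever the JVM considers the default; this can differ between
        // machines, users, and even sessions of the same user.
        System.out.println("Default charset: " + Charset.defaultCharset());

        // U+00E9 (e with acute accent) is the two bytes 0xC3 0xA9 in UTF-8.
        byte[] utf8Bytes = "\u00e9".getBytes(StandardCharsets.UTF_8);

        // Decoded as UTF-8, the bytes give back the original character.
        System.out.println(decode(utf8Bytes, StandardCharsets.UTF_8));

        // Decoded as Latin-1, each byte becomes its own character,
        // so the same file content shows up as two wrong characters.
        System.out.println(decode(utf8Bytes, StandardCharsets.ISO_8859_1));
    }
}
```

The bytes never change; only the assumption about their encoding does,
which is why files written under one setting need care under another.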

When Java became popular and spread around the world, this model
was still quite 'en vogue', and it was also able to absorb, as a
special case, the very early behavior of using Latin-1 as an
external encoding. So Java adopted this model, to an extent that
may seem like overkill these days. Local platforms still have a
'platform encoding' or something similar, and that platform
encoding is still used for a range of operations, but on average
they contain more data in different encodings than they did a
while ago. For some encodings (in particular UTF-8), that's not
such a problem, because heuristic detection works very well.
Even something as simple as Microsoft Windows Notepad can handle
UTF-8 these days.
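
As a rough illustration of why detection works well for UTF-8: a
strict decoder rejects most byte sequences that are not UTF-8, so
"does it decode cleanly?" is already a useful heuristic. The sketch
below assumes that this simple test is enough; the class and method
names are hypothetical, and real detectors may do more (for example,
look for a byte order mark).

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

// Hypothetical helper; a sketch of one possible heuristic, nothing more.
class Utf8Sniffer {

    /** Returns true if the bytes form a well-formed UTF-8 sequence. */
    public static boolean looksLikeUtf8(byte[] bytes) {
        try {
            StandardCharsets.UTF_8.newDecoder()
                // Report errors instead of silently substituting U+FFFD.
                .onMalformedInput(CodingErrorAction.REPORT)
                .onUnmappableCharacter(CodingErrorAction.REPORT)
                .decode(ByteBuffer.wrap(bytes));
            return true;
        } catch (CharacterCodingException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        byte[] utf8 = "caf\u00e9".getBytes(StandardCharsets.UTF_8);
        byte[] latin1 = "caf\u00e9".getBytes(StandardCharsets.ISO_8859_1);
        System.out.println(looksLikeUtf8(utf8));
        System.out.println(looksLikeUtf8(latin1));
    }
}
```

Pure ASCII data also passes this check, since ASCII is a subset of
UTF-8, so the test only tells you that UTF-8 is a plausible reading.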

And then there are of course systems (in particular many recent
Linux distributions) where UTF-8 is the platform encoding!

Hope this helps,     Martin.

At 20:47 05/10/07, Jeremy Carroll wrote:
 >
 >
 >Today, I am reviewing some documentation that I wrote for the Jena
 >Semantic Web platform. I wondered whether the text below might be helpful
 >to GEO. It would need some work to remove the Java and Jena specifics, and
 >to generalize from RDF/XML to Web content in general (or maybe just XML
 >files), but the underlying issue (use Unicode on the Web, even though your
 >platform may have a different default encoding) seems a fundamental one.
 >There is (always?) an encoding issue at the boundary between local files
 >and Web content.
 >
 >In the Jena team, we got this wrong initially, causing us substantial
 >costs when we had to migrate (we had been shipping incorrect code for quite
 >a while, and so had a migration problem).
 >
 >Feel free to reuse, modify or ignore this documentation.
 >
 >Jeremy
 >
 >
 >Here is the text:
 >
 >
 ><h2><a name="encoding">2. Character Encoding Issues</a></h2>
 ><p>
 >The easiest way to not read or understand this section
 >is always to use InputStreams and OutputStreams with Jena,
 >and to never use Readers and Writers. If you do this, Jena
 >will do the right thing, for the vast majority of users.
 >If you have legacy code that uses Readers and Writers,
 >or you have special needs with respect to encodings, then
 >this section may be helpful.
 >The last part of this section summarizes the character encodings
 >supported by Jena.
 ></p>
 ><p>
 >Character encoding is the way that characters are mapped to bytes, shorts
 >or ints. There are many different character encodings.
 >Within Jena, character encodings are important in their relationship
 >to Web content, particularly RDF/XML files, which cannot be understood
 >without knowing the character encoding, and in relationship to Java,
 >which provides support for many character encodings.
 ></p>
 ><p>The Java approach to encodings is designed for ease of use on a single
 >machine that uses a single encoding, often a one-byte
 >encoding, e.g. for European languages, which do not need thousands
 >of different characters.</p>
 ><p>The XML approach is designed for the Web, which uses
 >multiple encodings, some of which must cover thousands
 >of characters.</p>
 ><p>
 >On the Web, XML files, including RDF/XML files, are by default encoded
 >in "UTF-8" (Unicode). This is always a good choice for creating content,
 >and is the one used by Jena by default. Other encodings can be used,
 >but may be less interoperable. Other encodings should be named using the
 >canonical name registered at
 ><a href="http://www.iana.org/assignments/character-sets">IANA</a>,
 >but other systems are under no obligation to support any of them, other than
 >UTF-8 and UTF-16.
 ></p>
 ><p>
 >Within Java, encodings appear primarily with the
 >InputStreamReader and OutputStreamWriter classes,
 >which convert between bytes and characters using a named encoding,
 >and with their subclasses,
 >FileReader and
 >FileWriter, which convert between bytes in the file
 >and characters using the
 >default encoding of the platform.
 >It is not possible to change the encoding used by a Reader or Writer
 >while it is being used.
 >The default encoding of the platform depends on a large range
 >of factors.
 >This default encoding may be useful for communicating with other
 >programs on the same platform. However, the default encoding
 >is sometimes not registered at IANA, and so Jena application developers
 >should use UTF-8, rather than the default encoding, for Web content.
 ></p>
 >
 >
 >
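
For the quoted paragraph on InputStreamReader and OutputStreamWriter,
naming the encoding explicitly, instead of relying on the platform
default as FileReader and FileWriter do, might look roughly like the
sketch below. It uses only standard java.io classes, not Jena, and the
class and method names are hypothetical.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.OutputStreamWriter;
import java.io.Reader;
import java.io.UncheckedIOException;
import java.io.Writer;
import java.nio.charset.StandardCharsets;

// Hypothetical demo class; shows explicit UTF-8 at the byte/char boundary.
class ExplicitEncodingDemo {

    /** Encodes text to bytes through a Writer with a named encoding. */
    public static byte[] writeUtf8(String text) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        try (Writer w = new OutputStreamWriter(out, StandardCharsets.UTF_8)) {
            w.write(text);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        // The writer is closed (and so flushed) before the bytes are read.
        return out.toByteArray();
    }

    /** Decodes a byte stream through a Reader with a named encoding. */
    public static String readUtf8(InputStream in) {
        StringBuilder sb = new StringBuilder();
        try (Reader r = new InputStreamReader(in, StandardCharsets.UTF_8)) {
            char[] buf = new char[1024];
            int n;
            while ((n = r.read(buf)) != -1) {
                sb.append(buf, 0, n);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        byte[] bytes = writeUtf8("caf\u00e9");
        System.out.println(bytes.length + " bytes");
        System.out.println(readUtf8(new ByteArrayInputStream(bytes)));
    }
}
```

The result is the same on every platform, because neither direction
consults the default encoding.
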
Received on Tuesday, 11 October 2005 06:37:54 UTC