Text on character encoding: single platform vs web

Today, I am reviewing some documentation that I wrote for the Jena 
Semantic Web platform. I wondered whether the text below might be 
helpful to GEO. It would need some work to remove the Java and Jena 
specifics, and to generalize from RDF/XML to Web content in general (or 
maybe just XML files), but the underlying issue (use Unicode on the 
Web, even though your platform may have a different default encoding) seems a 
fundamental one. There is (always?) an encoding issue at the boundary 
between local files and Web content.

In the Jena team, we got this wrong initially, which cost us 
substantially when we had to migrate (we had been shipping incorrect code for 
quite a while, and so had a migration problem).

Feel free to reuse, modify or ignore this documentation.

Jeremy


Here is the text:


<h2><a name="encoding">2. Character Encoding Issues</a></h2>
<p>
The easiest way to avoid having to read or understand this section
is to always use InputStreams and OutputStreams with Jena,
and never to use Readers and Writers. If you do this, Jena
will do the right thing for the vast majority of users.
If you have legacy code that uses Readers and Writers,
or you have special needs with respect to encodings, then
this section may be helpful.
The last part of this section summarizes the character encodings
supported by Jena.
</p>
<p>
Character encoding is the way that characters are mapped to bytes, shorts
or ints. There are many different character encodings.
Within Jena, character encodings are important in their relationship
to Web content, particularly RDF/XML files, which cannot be understood
without knowing the character encoding, and in relationship to Java,
which provides support for many character encodings.
</p>
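<p>
As a rough illustration of the mapping from characters to bytes, the sketch
below encodes one character in three encodings and prints the resulting bytes.
The class name <code>EncodingBytes</code> and the helper <code>hex</code> are
made up for this example and are not part of Jena.
</p>

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

class EncodingBytes {
    // Encode the text with the given charset and render the bytes as hex.
    static String hex(String text, Charset cs) {
        StringBuilder sb = new StringBuilder();
        for (byte b : text.getBytes(cs)) {
            sb.append(String.format("%02X ", b & 0xFF));
        }
        return sb.toString().trim();
    }

    public static void main(String[] args) {
        String text = "\u00E9"; // LATIN SMALL LETTER E WITH ACUTE
        System.out.println("ISO-8859-1: " + hex(text, StandardCharsets.ISO_8859_1)); // E9
        System.out.println("UTF-8:      " + hex(text, StandardCharsets.UTF_8));      // C3 A9
        System.out.println("UTF-16BE:   " + hex(text, StandardCharsets.UTF_16BE));   // 00 E9
    }
}
```

<p>
The same character yields one byte in ISO-8859-1 and two bytes in each of the
other two encodings, so the bytes alone cannot be interpreted without knowing
which encoding produced them.
</p>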
<p>The Java approach to encodings is designed for ease of use on a single
machine, which uses a single encoding, often a one-byte
encoding, e.g. for European languages which do not need thousands
of different characters.</p>
<p>The XML approach is designed for the Web, which uses
multiple encodings, some of which cover thousands
of characters.</p>
<p>
On the Web, XML files, including RDF/XML files, are by default encoded
in UTF-8 (an encoding of Unicode). This is always a good choice for creating content,
and is the one used by Jena by default. Other encodings can be used,
but may be less interoperable. Other encodings should be named using the
canonical name registered at
<a href="http://www.iana.org/assignments/character-sets">IANA</a>,
but other systems are under no obligation to support any encoding other than
UTF-8 and UTF-16.
</p>
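<p>
The following sketch, which uses only the XML parser bundled with the JDK and
not Jena, may help to show why handing raw bytes to an XML parser works well:
the parser reads the encoding declaration itself. The class
<code>XmlEncodingDemo</code>, its helpers and the <code>greeting</code> element
are invented for this illustration.
</p>

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.Charset;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;

class XmlEncodingDemo {
    // Build a tiny XML document whose declaration names the encoding
    // that is also used to turn the text into bytes.
    static byte[] xmlBytes(String ianaName) {
        String xml = "<?xml version=\"1.0\" encoding=\"" + ianaName + "\"?>"
                + "<greeting>caf\u00E9</greeting>";
        return xml.getBytes(Charset.forName(ianaName));
    }

    // Parse from bytes (not from a Reader), so the parser decides how to decode.
    static String rootText(byte[] bytes) {
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new ByteArrayInputStream(bytes));
            return doc.getDocumentElement().getTextContent();
        } catch (Exception e) {
            throw new IllegalStateException("could not parse XML", e);
        }
    }

    public static void main(String[] args) {
        for (String name : new String[] {"UTF-8", "UTF-16", "ISO-8859-1"}) {
            byte[] bytes = xmlBytes(name);
            System.out.println(name + ": " + bytes.length + " bytes, text matches: "
                    + rootText(bytes).equals("caf\u00E9"));
        }
    }
}
```

<p>
The byte sequences differ between the three documents, yet the parsed text is
the same in each case, because the declared encoding and the actual encoding
agree.
</p>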
<p>
Within Java, encodings appear primarily with the
InputStreamReader and OutputStreamWriter classes,
which convert between bytes and characters using a named encoding,
and with their subclasses,
FileReader and
FileWriter, which convert between bytes in the file
and characters using the
default encoding of the platform.
It is not possible to change the encoding used by a Reader or Writer
while it is being used.
The default encoding of the platform depends on a wide range
of factors, such as the operating system and locale settings.
This default encoding may be useful for communicating with other
programs on the same platform. However, the default encoding
is sometimes not registered at IANA, so Jena application developers
should not use the default encoding for Web content, but should use UTF-8.
</p>
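<p>
A minimal sketch of the difference between naming the encoding and getting it
wrong is given below. It uses in-memory streams rather than files, and the
class <code>ReaderEncodingDemo</code> with its <code>encode</code> and
<code>decode</code> helpers is hypothetical, written only for this example.
</p>

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.OutputStreamWriter;
import java.io.Reader;
import java.io.UncheckedIOException;
import java.io.Writer;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

class ReaderEncodingDemo {
    // Encode text through an OutputStreamWriter that uses a named charset.
    static byte[] encode(String text, Charset cs) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        try (Writer w = new OutputStreamWriter(out, cs)) {
            w.write(text);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return out.toByteArray();
    }

    // Decode bytes through an InputStreamReader that uses a named charset.
    static String decode(byte[] bytes, Charset cs) {
        try (Reader r = new InputStreamReader(new ByteArrayInputStream(bytes), cs)) {
            StringBuilder sb = new StringBuilder();
            int c;
            while ((c = r.read()) != -1) {
                sb.append((char) c);
            }
            return sb.toString();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        byte[] webBytes = encode("caf\u00E9", StandardCharsets.UTF_8);

        // What FileReader and FileWriter would silently pick on this machine.
        System.out.println("Platform default: " + Charset.defaultCharset());

        // Decoding with the encoding that was actually used round-trips.
        System.out.println(decode(webBytes, StandardCharsets.UTF_8));

        // Decoding with a different one-byte encoding garbles the accent.
        System.out.println(decode(webBytes, StandardCharsets.ISO_8859_1));
    }
}
```

<p>
The last line prints two characters in place of the accented letter, which is
the kind of silent corruption that can occur when Web content in UTF-8 is read
through a Reader using a different platform default.
</p>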

Received on Friday, 7 October 2005 11:47:50 UTC