- From: Jeremy Carroll <jjc@hplb.hpl.hp.com>
- Date: Fri, 07 Oct 2005 12:47:37 +0100
- To: public-i18n-geo@w3.org
Today, I am reviewing some documentation that I wrote for the Jena Semantic Web platform. I wondered whether the text below might be helpful to GEO. It would need some work to remove the Java and Jena specifics, and to generalize from RDF/XML to Web content in general (or maybe just XML files), but the underlying issue (use Unicode on the Web, even though your platform may have a different default encoding) seems a fundamental one. There is (always?) an encoding issue at the boundary between local files and Web content.

In the Jena team, we got this wrong initially, which caused us substantial costs when we had to migrate (we had been shipping incorrect code for quite a while, and so had a migration problem).

Feel free to reuse, modify or ignore this documentation.

Jeremy

Here is the text:

<h2><a name="encoding">2. Character Encoding Issues</a></h2>
<p> The easiest way to avoid having to read or understand this section is always to use InputStreams and OutputStreams with Jena, and never to use Readers and Writers. If you do this, Jena will do the right thing for the vast majority of users. If you have legacy code that uses Readers and Writers, or you have special needs with respect to encodings, then this section may be helpful. The last part of this section summarizes the character encodings supported by Jena. </p>
<p> A character encoding is the way that characters are mapped to bytes, shorts or ints. There are many different character encodings. Within Jena, character encodings matter in two ways: in relation to Web content, particularly RDF/XML files, which cannot be understood without knowing the character encoding, and in relation to Java, which provides support for many character encodings. </p>
<p>The Java approach to encodings is designed for ease of use on a single machine, which uses a single encoding. This is often a one-byte encoding, e.g. for European languages, which do not need thousands of different characters.</p>
<p>The XML approach is designed for the Web, which uses multiple encodings, some of which require thousands of characters.</p>
<p> On the Web, XML files, including RDF/XML files, are by default encoded in UTF-8 (an encoding of Unicode). This is always a good choice for creating content, and is the one used by Jena by default. Other encodings can be used, but may be less interoperable. Other encodings should be named using the canonical name registered at <a href="http://www.iana.org/assignments/character-sets">IANA</a>, but other systems have no obligation to support any of these, other than UTF-8 and UTF-16. </p>
<p> Within Java, encodings appear primarily with the InputStreamReader and OutputStreamWriter classes, which convert between bytes and characters using a named encoding, and with their subclasses, FileReader and FileWriter, which convert between bytes in the file and characters using the default encoding of the platform. It is not possible to change the encoding used by a Reader or Writer while it is being used. The default encoding of the platform depends on a large range of factors. This default encoding may be useful for communicating with other programs on the same platform. Sometimes the default encoding is not registered at IANA, and so Jena application developers should not use the default encoding for Web content, but should use UTF-8 instead. </p>
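To make the last point concrete, here is a minimal sketch of naming the encoding explicitly on InputStreamReader and OutputStreamWriter instead of relying on the platform default. It is only an illustration and not part of Jena: the class name EncodingSketch and its helper methods are hypothetical, and it assumes Java 7 or later (for StandardCharsets and try-with-resources).

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.OutputStreamWriter;
import java.io.Reader;
import java.io.UncheckedIOException;
import java.io.Writer;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical helper class, for illustration only.
class EncodingSketch {

    // Characters -> bytes, through an OutputStreamWriter with a named encoding.
    static byte[] encode(String text, Charset charset) {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (Writer writer = new OutputStreamWriter(bytes, charset)) {
            writer.write(text);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return bytes.toByteArray();
    }

    // Bytes -> characters, through an InputStreamReader with a named encoding.
    static String decode(byte[] data, Charset charset) {
        StringBuilder text = new StringBuilder();
        try (Reader reader = new InputStreamReader(new ByteArrayInputStream(data), charset)) {
            int c;
            while ((c = reader.read()) != -1) {
                text.append((char) c);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return text.toString();
    }

    public static void main(String[] args) {
        String sample = "caf\u00e9"; // "café": the last character is outside ASCII

        byte[] utf8 = encode(sample, StandardCharsets.UTF_8);

        // Decoding with the same named encoding recovers the text.
        System.out.println(decode(utf8, StandardCharsets.UTF_8).equals(sample));

        // Decoding the same bytes with a different encoding does not.
        System.out.println(decode(utf8, StandardCharsets.ISO_8859_1).equals(sample));

        // What FileReader and FileWriter would silently use; varies by machine.
        System.out.println("Platform default: " + Charset.defaultCharset());
    }
}
```

The same pattern applies to files: wrapping a FileInputStream or FileOutputStream in an InputStreamReader or OutputStreamWriter with an explicit UTF-8 charset avoids the platform-default behaviour of FileReader and FileWriter described above.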
Received on Friday, 7 October 2005 11:47:50 UTC