Re: Creating Japanese Document in memory from keshlam@us.ibm.com on 2000-01-05 (www-dom@w3.org from January to March 2000)

From: <keshlam@us.ibm.com>
Date: Tue, 4 Jan 2000 21:22:09 -0500
To: "H.Ozawa" <h-ozawa@hitachi-system.co.jp>
cc: www-dom@w3.org
Message-ID: <8525685D.000CFA1C.00@D51MTA03.pok.ibm.com>

Creating the document in memory shouldn't be a problem. All strings in the
DOM, by definition, are expressed in UTF-16, which can represent Japanese
characters directly.
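
A minimal sketch of the in-memory case follows. It assumes a JAXP
implementation (javax.xml.parsers) is available to create the empty
Document; that bootstrap step is not part of the DOM itself. The class
name and the element and attribute names are invented for illustration.

```java
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.w3c.dom.Document;
import org.w3c.dom.Element;

final class JapaneseDomSketch {
    // The word "nihongo" (three kanji), written as Unicode escapes so the
    // encoding of this source file does not matter.
    static final String NIHONGO = "\u65E5\u672C\u8A9E";

    // Builds a one-element document whose text content is Japanese.
    static Document build() throws ParserConfigurationException {
        // Assumption: JAXP is used to obtain an empty Document.
        Document doc = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .newDocument();
        Element root = doc.createElement("greeting"); // hypothetical name
        root.setAttribute("lang", "ja");              // hypothetical attribute
        root.appendChild(doc.createTextNode(NIHONGO));
        doc.appendChild(root);
        return doc;
    }

    public static void main(String[] args) throws Exception {
        Document doc = build();
        String text = doc.getDocumentElement().getFirstChild().getNodeValue();
        System.out.println(text.equals(NIHONGO)); // true
        System.out.println(text.length());        // 3
    }
}
```

No encoding is named anywhere in this sketch: the text lives in the tree
as ordinary DOM strings, and the question of encoding only arises once
the tree is written out.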

As you point out, writing that document out and reading it back in are
somewhat more complicated. The serializer and parser have to understand how
to translate between UTF-16 and your preferred encoding, and you have to
figure out how to tell them which encoding to use.
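
To make the translation step concrete, here is a rough sketch of what
"translating between UTF-16 and your preferred encoding" amounts to at
the byte level. It uses only java.nio.charset and the two encodings every
Java runtime is required to support; the class and method names are
hypothetical, and a real serializer does considerably more than this.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

final class TranscodeSketch {
    // Serializer side: UTF-16 string in, bytes in the external encoding out.
    static byte[] encode(String text, Charset cs) {
        return text.getBytes(cs);
    }

    // Parser side: bytes in, UTF-16 string out. The caller must name the
    // same encoding the writer used.
    static String decode(byte[] bytes, Charset cs) {
        return new String(bytes, cs);
    }

    public static void main(String[] args) {
        String s = "\u65E5\u672C\u8A9E"; // three kanji

        byte[] utf8 = encode(s, StandardCharsets.UTF_8);
        byte[] utf16 = encode(s, StandardCharsets.UTF_16BE);
        System.out.println(utf8.length);  // 9
        System.out.println(utf16.length); // 6

        // Same charset on both sides: the text survives.
        System.out.println(decode(utf8, StandardCharsets.UTF_8).equals(s)); // true

        // Wrong charset on the reading side: the text is corrupted.
        System.out.println(decode(utf8, StandardCharsets.ISO_8859_1).equals(s)); // false
    }
}
```

The last line is the failure mode to worry about: the bytes are fine, but
the reader was told the wrong encoding.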

>Thus, to change document encoding, I would only have to change
>setEncoding() method parameter instead of adding new procedures

Unfortunately, setEncoding() is not part of the standardized DOM API.

The standard DOM does not have any representation of the XML Declaration
(<?xml?>), and so does not store the encoding. Some tools express this as a
Processing Instruction, but the XML specification and the Infoset both say
that this isn't really the right answer.

Some parsers make the encoding name available as a separate piece of
information, and some serializers accept the encoding as a parameter along
with the top-level DOM node; that's probably a better design than the PI
approach.
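
As one illustration of the "encoding as a parameter" design, the sketch
below passes the encoding name to the serializer alongside the node. It
assumes a JAXP implementation that provides javax.xml.transform; that API
is not part of the standardized DOM, so this is exactly the kind of
nonstandard route described above. The class and method names are
hypothetical, and whether a given encoding such as Shift_JIS works
depends on the runtime and the parser.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.nio.charset.Charset;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.Node;

final class EncodingRoundTrip {
    // Serializes a node; the encoding travels as a parameter, not in the tree.
    static byte[] serialize(Node node, String encoding) throws Exception {
        Transformer t = TransformerFactory.newInstance().newTransformer();
        t.setOutputProperty(OutputKeys.ENCODING, encoding);
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        t.transform(new DOMSource(node), new StreamResult(out));
        return out.toByteArray();
    }

    // Parses bytes back into a DOM; the parser is left to detect the
    // encoding from the serialized document itself.
    static Document parse(byte[] xml) throws Exception {
        return DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .parse(new ByteArrayInputStream(xml));
    }

    // Writes the text out in the given encoding and reads it back in.
    static String roundTrip(String text, String encoding) throws Exception {
        Document doc = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .newDocument();
        Element root = doc.createElement("doc"); // hypothetical name
        root.appendChild(doc.createTextNode(text));
        doc.appendChild(root);
        Document reread = parse(serialize(doc, encoding));
        return reread.getDocumentElement().getFirstChild().getNodeValue();
    }

    public static void main(String[] args) throws Exception {
        String s = "\u65E5\u672C\u8A9E"; // three kanji
        // Fall back to UTF-8 if this runtime lacks the Japanese charset.
        String enc = Charset.isSupported("Shift_JIS") ? "Shift_JIS" : "UTF-8";
        System.out.println(enc + " preserved text: " + roundTrip(s, enc).equals(s));
    }
}
```

With this shape, switching the output encoding means changing one
argument to serialize(); nothing in the document tree has to change.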

We're aware that this is probably an oversight in the DOM. It's on our Open
Issues list for future DOM development, and I expect it will be addressed
as part of the DOM Level 3 Serialization chapter.

Meanwhile, I'm afraid you're stuck with nonportable solutions... and with
hunting for parsers that support the encodings you want to use.

(Obligatory marketing: Have you tried IBM's XML4J, or the Apache parser
based on that code? Since the first version of that parser was written by a
group in our Tokyo research center, I would be very surprised if it didn't
include support for Japanese documents!)

______________________________________
Joe Kesselman  / IBM Research

Received on Tuesday, 4 January 2000 21:22:29 UTC