Re: Creating Japanese Document in memory from David Brownell on 2000-01-05 (www-dom@w3.org from January to March 2000)

From: David Brownell <david-b@pacbell.net>
Date: Tue, 04 Jan 2000 19:47:12 -0800
To: "H.Ozawa" <h-ozawa@hitachi-system.co.jp>
Cc: www-dom@w3.org
Message-ID: <3872BEC0.BAB9770E@pacbell.net>

Hello,

Your description didn't make it quite clear what you're doing, so
it's hard to say just what was going wrong.  For example, exactly
which encoding declaration was used?  I understand there are quite a
few encodings used in Japan, not all of which are widely supported.

I've seen ones like this work pretty consistently:

	<?xml version='1.0' encoding='EUC-JP'?>

"H.Ozawa" wrote:
>
> Problem arises because most parsers do not treat 'encoding' attribute as
> part of the <?xml?>.

Note that parsing isn't a DOM issue; DOM just represents documents
in memory.  And the difference between a DOM document with Japanese
text (or tags, or attributes, etc) and one with, say, English ones
is just the contents of some strings, ones which the DOM won't have
much reason to look at after the tree model is created.  (The strings
are invariably going to be held as Unicode, normally UTF-16.)
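
To make that concrete, here is a minimal sketch that builds a small
document with Japanese names entirely in memory.  It assumes Python's
xml.dom.minidom as a stand-in DOM implementation (not one of the
parsers discussed in this thread), and the element and attribute
names are made up for illustration.

```python
from xml.dom.minidom import getDOMImplementation

# Create an empty document whose root element has a Japanese tag name.
impl = getDOMImplementation()
doc = impl.createDocument(None, "文書", None)
root = doc.documentElement

# Add a child element with a Japanese name, attribute, and text.
item = doc.createElement("項目")
item.setAttribute("種類", "例")
item.appendChild(doc.createTextNode("日本語のテキスト"))
root.appendChild(item)

# Every name and value is an ordinary Unicode string held by the tree;
# no encoding declaration has been involved at any point.
tag_name = root.tagName
text = item.firstChild.data
```

Nothing here names an encoding, because none is needed until the
tree is written out as bytes.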

See my XML.com reviews of XML parsers, linked from the bottom of

	http://home.pacbell.net/david-b/xml/

These show parsers which handle Japanese encodings. Look at the very
last section of the "Full Test Results" for any parser, and you'll see
that many do a good job of parsing the XML documents (in Japanese)
provided by Fuji Xerox.  I think the parsers provided by corporations
passed these tests pretty consistently, and the others didn't.  Sun's,
as one example, handled those Japanese test cases with no trouble.

> As a concrete example:
> 1. MS' s parser
>     I can't loadXML document containing Japanese tag names. I'm also
> unable to specify encoding in the document
>     because the document isn't loaded yet.

Which Microsoft parser?  Their Java parser isn't really worth
looking at; almost any other parser is miles ahead.  But the
IE5 "MSXML.DLL" is much better, even though it's got DTD troubles,
and I've observed it to load documents with Japanese encodings.

> 2. Oracle's parser
>     After createDocument(), I can't immediately issue setEncoding()
> method. I have to issue the method after creating
>     a dummy node. Encoding is necessary to load Japanese XML documents
> but encoding can not be specified on a null
>     document.

Again, it's not clear what you're doing.  The tests reported above
show that there were some problems with an older release of Oracle's
Java parser (2.0.0.2) and some Japanese encodings, but not with others.
I suspect Oracle has a much more up-to-date version than that.

> It would be nice if all attributes of processing instructions are
> REQUIRED to be treated as part of a PI node itself.

But an XML declaration, or a text declaration, isn't a PI.  And
even if it were modeled as one in DOM, that wouldn't help anything
happening in a parser -- since by the time you get a DOM model of
an XML document into memory, it's been parsed.
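
As a rough illustration, assuming Python's xml.dom.minidom (again
only a stand-in, and the "target" PI and "root" element are
hypothetical), a parsed document exposes a real PI as a child node
but not the XML declaration:

```python
from xml.dom.minidom import parseString

# A document with an XML declaration, one real processing instruction,
# and an empty root element.
source = b"<?xml version='1.0' encoding='UTF-8'?><?target data?><root/>"
doc = parseString(source)

# List the node names of the document's children after parsing.
child_names = [node.nodeName for node in doc.childNodes]
```

With this implementation the list holds the PI and the root element
only; the declaration was consumed by the parser before the tree
existed.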

> Thus, to change document encoding, I would only have to change
> setEncoding() method parameter instead of adding new procedures.

Transcoding documents isn't DOM functionality, and neither is any
sort of setEncoding() method.  So I think you can see why I'm
puzzled about exactly what you're doing, and what's going wrong!
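
Where an encoding does come into play is when a tree gets written
out, and that's up to each implementation.  A small sketch, assuming
Python's xml.dom.minidom and its implementation-specific toxml()
method (not part of DOM, and not the Oracle or Microsoft APIs):

```python
from xml.dom.minidom import getDOMImplementation

# Build a tiny in-memory document with a Japanese root element and text.
doc = getDOMImplementation().createDocument(None, "文書", None)
doc.documentElement.appendChild(doc.createTextNode("日本語"))

# Write the same tree out as bytes in two encodings.  The tree itself
# is unchanged; only the serialized bytes and the declaration differ.
as_euc_jp = doc.toxml(encoding="EUC-JP")
as_utf8 = doc.toxml(encoding="UTF-8")
```

Changing the output encoding means changing one argument to the
serializer, without touching the document.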

- Dave

Received on Tuesday, 4 January 2000 22:47:09 UTC