W3C home > Mailing lists > Public > www-dom@w3.org > July to September 2000

Re: a question about entities and special characters

From: <keshlam@us.ibm.com>
Date: Tue, 8 Aug 2000 08:41:27 -0400
To: www-dom@w3.org
Message-ID: <85256935.0045B427.00@D51MTA03.pok.ibm.com>
(Lauren: FAQ?)


>I'm German, so I use a lot of special characters like &auml; in my
>pages. Are all these characters single nodes in the DOM?

This depends on whether you've asked your parser to retain Entity Reference
nodes or not.

These names are defined by your DTD rather than by XML, so they _are_
entities. If you've asked for Entity Reference nodes to be created in the
DOM, that means that each of these takes two nodes ---  one for the Entity
Reference, and one for the Text node containing the character.

If you've told the DOM generator (parser or other code) to "expand" the
entity references, then the Text node would appear... and then would be
merged with adjacent Text nodes when a normalize() operation is done. Note
that the DOM requires that XML processors -- parsers -- normalize documents
before they are presented to the user; other applications may not do so.

If you want to avoid the inefficiency of using two nodes for a single
character, there are ways to do so. One is to use a numeric character
reference (such as &#34;) to encode the character, rather than a declared
entity; these are always converted directly into characters. An easier
solution would be to use an encoding that directly supports the language
that you are writing in, so you can simply type the intended characters and
have the XML parser read them correctly; check whether your XML parser
supports the encoding used for German.

> And when I create a TextNode, can I use those entities there?

All DOM API calls operate in terms of UTF-16 character strings. These can
directly represent most characters in most languages, which would allow you
to say
   document.createTextNode("mäh")
When you write the document out as XML syntax, your serializer is
responsible for checking the output encoding and deciding whether it can
write the character directly or must use a numeric character reference. Do
_not_ write
   document.createTextNode("m&auml;h")
... that puts the ampersand into the document's contents, and is equivalent
to writing
     m&38;auml;h
in your XML file.

 An Entity node declares an entity; an Entity Reference node is a reference
to that entity. If you really want to create an entity reference (&auml;)
in your document and have it written out that way, you must create an
Entity Reference node
     document.createEntityReference("auml")
and insert that. You don't have to create the children of the
EntityReference;  they are copied automatically from the Entity previously
stored in the DocumentType. Note that in DOM Level 2 there is still no
portable way to manually declare new Entities; we expect to address that in
DOM Level 3's "Content Model" chapter.


Hope that helps...

______________________________________
Joe Kesselman  / IBM Research
Received on Tuesday, 8 August 2000 08:41:46 GMT

This archive was generated by hypermail 2.2.0+W3C-0.50 : Friday, 22 June 2012 06:13:47 GMT