Re: HTML entities.

Let me try to clarify how Amaya encodes characters when saving
a document. First, a short reminder about encoding and XML may
be useful.

us-ascii is a 7-bit code that represents 95 printable characters
(positions 32 through 126 decimal).

iso-8859-1 is an 8-bit code that represents the same characters as
us-ascii, at the same positions, plus 96 additional printable characters
(positions 160 through 255 decimal).

utf-8 is a variable-length encoding scheme for the Universal Character
Set - UCS (ISO 10646 aka Unicode). UCS represents thousands of
characters. Note that the characters from us-ascii have the same
positions in UCS as in us-ascii (and therefore as in iso-8859-1).
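
To make the difference between the three encodings concrete, here is a
small Python sketch (Python is used only for illustration, it has nothing
to do with Amaya's code). The helper name encodable and the three sample
characters are my own choices. It tries to encode 'A', e-acute and the
Greek alpha in each charset and records the resulting bytes, or None when
the charset has no such character.

```python
# Sample characters: 'A' (U+0041), e-acute (U+00E9), Greek alpha (U+03B1).
samples = ["A", "\u00e9", "\u03b1"]
charsets = ["us-ascii", "iso-8859-1", "utf-8"]


def encodable(ch, charset):
    """Return the bytes for ch in charset, or None if it is not available."""
    try:
        return ch.encode(charset)
    except UnicodeEncodeError:
        return None


# One row per character, one column per charset.
table = {ch: {cs: encodable(ch, cs) for cs in charsets} for ch in samples}

for ch, row in table.items():
    print(f"U+{ord(ch):04X}", row)
```

'A' is a single identical byte in all three encodings, e-acute is missing
from us-ascii, and alpha can only be written directly in utf-8.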

If an XML document contains a character that is not in the character
set (charset) of its encoding, a special representation is required
that uses only the characters available in that charset. XML offers
two such representations of characters:

1 - Character references represent the position (in decimal or hexadecimal)
    of the character in the UCS. For example, the greek letter alpha is
    represented as
      &#x3B1;   (hexadecimal) or
      &#945;    (decimal)

2 - Entity references use a name to represent some content (in this case,
    a character).  For example, the Greek letter alpha may be represented as
      &alpha;
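
The two representations above can be derived mechanically from the
character's position in UCS. A minimal Python sketch, assuming the HTML 4
entity names that ship with Python's standard html.entities module stand
in for the names a DTD would define (variable names are mine):

```python
from html.entities import codepoint2name  # HTML 4 entity names by code point

alpha = "\u03b1"          # Greek small letter alpha
code_point = ord(alpha)   # its position in UCS

hex_ref = f"&#x{code_point:X};"                  # hexadecimal character reference
dec_ref = f"&#{code_point};"                     # decimal character reference
entity_ref = f"&{codepoint2name[code_point]};"   # entity reference (needs a DTD)

print(hex_ref, dec_ref, entity_ref)
```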

Character references may be used in any XML document, but entity references
are allowed only if the document itself contains a means to resolve the name.
Entity name resolution is provided by the document type declaration, which
refers to the DTD where names and their associated contents are defined.
In practice, this means that you can use entity references only if the
<!DOCTYPE ...> is present and refers to a DTD that defines the names you use.
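
This rule is easy to observe with any XML parser. The sketch below uses
Python's standard xml.etree.ElementTree, not Amaya, and the helper
try_parse is hypothetical. It parses the same one-element document three
times: with a character reference, with an entity reference and no
DOCTYPE, and with an entity reference whose name is declared in a DOCTYPE
(here in an internal subset, to keep the example self-contained).

```python
import xml.etree.ElementTree as ET


def try_parse(source):
    """Return the text of the root element, or an error string."""
    try:
        return ET.fromstring(source).text
    except ET.ParseError as err:
        return f"error: {err}"


# A character reference needs no declaration at all.
char_ref = try_parse("<p>&#x3B1;</p>")

# An entity reference without a DOCTYPE cannot be resolved.
bare_entity = try_parse("<p>&alpha;</p>")

# The same entity reference works once the DOCTYPE declares the name.
declared = try_parse('<!DOCTYPE p [<!ENTITY alpha "&#x3B1;">]><p>&alpha;</p>')

print(repr(char_ref), repr(bare_entity), repr(declared))
```

The first and third documents yield the letter alpha; the second one is
rejected by the parser as not well-formed.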

With that in mind, it is easier to understand how Amaya works.

By default, Amaya preserves the initial encoding of the document,
that is, the encoding that was associated with the document at loading
time. You can check this encoding with the command
File/Document_information (Charset field). The Save command saves
the document with that encoding, while the Save_As command allows you
to choose another encoding (Charset field).

When saving a document (Save or Save_As commands), all characters
that are available in the charset of the encoding are written directly
in that encoding. Only the other characters are written using character
or entity references. The choice between these two options is made
according to the DOCTYPE. If there is a DOCTYPE that refers to a DTD that
defines a name for the character, an entity reference (i.e. a name) is
used; otherwise Amaya generates a character reference in hexadecimal.
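
The rule just described can be sketched in a few lines of Python. This is
only an illustration of the principle, not Amaya's actual code: the
function name escape_for_charset and the flag dtd_defines_names are
hypothetical, the HTML 4 names from html.entities stand in for "the names
defined by the DTD", and escaping of markup characters such as < and & is
deliberately left out.

```python
from html.entities import codepoint2name  # stand-in for the DTD's entity names


def escape_for_charset(text, charset, dtd_defines_names):
    """Write each character directly if the charset has it, else as a reference."""
    out = []
    for ch in text:
        try:
            ch.encode(charset)          # available in the charset: keep as is
            out.append(ch)
        except UnicodeEncodeError:
            name = codepoint2name.get(ord(ch)) if dtd_defines_names else None
            if name is not None:
                out.append(f"&{name};")          # entity reference
            else:
                out.append(f"&#x{ord(ch):X};")   # hexadecimal character reference
    return "".join(out)


text = "caf\u00e9 \u03b1"   # "cafe" with e-acute, a space, Greek alpha
print(escape_for_charset(text, "us-ascii", True))
print(escape_for_charset(text, "us-ascii", False))
print(escape_for_charset(text, "iso-8859-1", True))
print(escape_for_charset(text, "utf-8", True))
```

With us-ascii both non-ASCII characters become references (names when the
DTD defines them, hexadecimal otherwise), with iso-8859-1 only alpha does,
and with utf-8 nothing needs a reference.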

Note that the command File/Change_Document_Type allows you to add,
change, or remove the DOCTYPE of a document at any time. This
allows you to generate either character references or entity references.

This is the basic principle, but it seems that a few bugs make Amaya behave
a bit differently. Irène has already fixed a bug recently. She is now checking
the whole process thoroughly.

Vincent.

On Fri, 25 Nov 2005 12:16:36 +0200 "Thomas Jedenfelt" <thomas_jedenfelt_1@operamail.com> wrote:
> 
> 
> It seems to me that Amaya 8.8.1 generates Hexadecimal entities for [Symbols] and [Internationalization] characters using Charset ISO-8859-1, UTF-8 and US-ASCII.
> 
> Amaya does not, it seems, generate numeric or name entities for [ISO 8859-1 characters], code position 161 through 255, using Charset US-ASCII. (Will be fixed in next Amaya release, according to Irène/INRIA, 24 Nov.)
> 
> (As previously mentioned, when using Charset ISO-8859-1 it is my Wish that Amaya will generate Decimal character references for characters of code position 161 through 8364, for the purpose of accurately transferring HTML-files over the Internet.)
> 
> Reference:
> 
> [ISO 8859-1 characters]
> http://www.w3.org/TR/html401/sgml/entities.html#h-24.2.1
> 
> [Symbols]
> http://www.w3.org/TR/html401/sgml/entities.html#h-24.3.1
> 
> [Internationalization] (i18n)
> http://www.w3.org/TR/html401/sgml/entities.html#h-24.4.1
> 
> Regards,
> Thomas Jedenfelt
> 
> 
> ----- Original Message -----
> From: "Peter Kerr"
> Date: Fri, 25 Nov 2005
> > 
> > I've followed this subject with interest as our own institution
> > comes to grips with MS Code Pages  :-(
> > 
> > If I understand correctly what is happening,
> > 
> > Amaya will use named entities for US-ASCII charset,
> > hex entities for ISO-8859, and
> > numerical entities for UTF-8 or UTF-16
> > 
> > Perhaps this should be made clear in the documentation,
> > that the charset and/or encoding declaration defines the entity format.
> > That seems logical enough behaviour to me.
> > 
> > 
> > Peter Kerr                       Mail.app 2.0.3 (734)
> > Snr Technician
> > School of Music,  University of Auckland,  New Zealand
> 
> 


--------------
Vincent Quint                       INRIA Rhône-Alpes
INRIA                               ZIRST
e-mail: Vincent.Quint@inria.fr      655 avenue de l'Europe
Tel.: +33 4 76 61 53 62             Montbonnot
Fax:  +33 4 76 61 52 07             38334 Saint Ismier Cedex
                                    France

Received on Friday, 25 November 2005 15:51:24 UTC