- From: Thomas Jedenfelt <thomas_jedenfelt_1@operamail.com>
- Date: Fri, 02 Dec 2005 11:34:48 +0200
- To: www-amaya@w3.org
Many thanks to Vincent for his detailed explanation of the character reference issue and Amaya's behaviour.

I now understand that a character representation is used for those characters that are not included in the chosen character encoding (e.g. us-ascii, iso-8859-1). Depending on the chosen DTD, the representation is either a numeric character reference or a character entity reference (the entity must be defined in the DTD).

Due to a bug, which is about to be fixed, Amaya does not always convert these characters correctly.

My wish is that Amaya will use numeric character references, because they seem to be the most reliable.

Thanks,
Thomas Jedenfelt

P.S. I have written a page on this subject. It seems that there are only 78 characters that are safely transferred over the Internet and displayed on various computers. For those with an interest, I welcome corrections and comments.
http://hem.bredband.net/thojed/webtech/char_representation.htm

----- Original Message -----
From: "Vincent Quint" <Vincent.Quint@inrialpes.fr>
Date: Fri, 25 Nov 2005
>
> Let me try to clarify how Amaya encodes characters when saving
> a document. First, a short reminder about encoding and XML may
> be useful.
>
> us-ascii is a 7-bit code that represents 95 printable characters
> (positions 32 through 126 decimal).
>
> iso-8859-1 is an 8-bit code that represents the same characters as
> us-ascii, at the same positions, plus 96 additional printable characters
> (positions 160 through 255 decimal).
>
> utf-8 is a variable-length encoding scheme for the Universal Character
> Set - UCS (ISO 10646, aka Unicode). UCS represents thousands of
> characters. Note that the characters from us-ascii have the same
> positions in UCS as in us-ascii (and therefore as in iso-8859-1).
>
> If an XML document contains a character that is not available in the
> character set (charset) available with the encoding, a special
> representation is required that uses only the characters available
> with this charset.
> XML offers two such representations of characters:
>
> 1 - Character references represent the position (in decimal or hexadecimal)
> of the character in the UCS. For example, the Greek letter alpha is
> represented as
> &#x3b1; (hexadecimal) or
> &#945; (decimal)
>
> 2 - Entity references use a name to represent a content (in that case
> a character). For example, the Greek letter alpha may be represented as
> &alpha;
>
> Character references may be used in any XML document, but entity references
> are allowed only if the document itself contains a means to resolve the name.
> Entity name resolution is provided by the document type declaration, which
> refers to the DTD where names and their associated contents are defined.
> Practically, this means that you can use entity references only if the
> <!DOCTYPE ...> is present and refers to a DTD that defines the names you use.
>
> With that in mind, it is easier to understand how Amaya works.
>
> By default, Amaya preserves the initial encoding of the document,
> that is, the encoding that was associated with the document at loading
> time. You can check this encoding with the command
> File/Document_information (Charset field). The Save command saves
> the document with that encoding, while the Save_As command allows you
> to choose another encoding (Charset field).
>
> When saving a document (Save or Save_As commands), all characters
> that are available in the charset of the encoding are just written using the
> encoding. Only the other characters are written using character or entity
> references. The choice between these two options is made according to
> the DOCTYPE. If there is a DOCTYPE that refers to a DTD that defines a
> name for the character, an entity reference is used (i.e. a name); otherwise
> Amaya generates a character reference in hexadecimal.
>
> Note that the command File/Change_Document_Type allows you to associate,
> to change or to remove the DOCTYPE of a document at any time.
> This allows you to generate either character references or entity references.
>
> This is the basic principle, but it seems that a few bugs make Amaya behave
> a bit differently. Irène has already fixed a bug recently. She is now checking
> the whole process thoroughly.
>
> Vincent.
> --------------
> Vincent Quint INRIA Rhône-Alpes
> INRIA ZIRST
>
> On Fri, 25 Nov 2005
> "Thomas Jedenfelt" wrote:
> >
> > It seems to me that Amaya 8.8.1 generates Hexadecimal entities
> > for [Symbols] and [Internationalization] characters using Charset
> > ISO-8859-1, UTF-8 and US-ASCII.
> >
> > Amaya does not, it seems, generate numeric or name entities for
> > [ISO 8859-1 characters], code position 161 through 255, using
> > Charset US-ASCII. (Will be fixed in next Amaya release, according
> > to Irène/INRIA, 24 Nov.)
> >
> > (As previously mentioned, when using Charset ISO-8859-1 it is my
> > wish that Amaya will generate Decimal character references for
> > characters of code position 161 through 8364, for the purpose of
> > accurately transferring HTML-files over the Internet.)
> >
> > Reference:
> >
> > [ISO 8859-1 characters]
> > http://www.w3.org/TR/html401/sgml/entities.html#h-24.2.1
> >
> > [Symbols]
> > http://www.w3.org/TR/html401/sgml/entities.html#h-24.3.1
> >
> > [Internationalization] (i18n)
> > http://www.w3.org/TR/html401/sgml/entities.html#h-24.4.1
> >
> > Regards,
> > Thomas Jedenfelt
> >
> >
> > ----- Original Message -----
> > From: "Peter Kerr"
> > Date: Fri, 25 Nov 2005
> > >
> > > I've followed this subject with interest as our own institution
> > > comes to grips with MS Code Pages :-(
> > >
> > > If I understand correctly what is happening,
> > >
> > > Amaya will use named entities for US-ASCII charset,
> > > hex entities for ISO-8859, and
> > > numerical entities for UTF-8 or UTF-16
> > >
> > > Perhaps this should be made clear in the documentation,
> > > that the charset and/or encoding declaration defines the entity format.
> > > That seems logical enough behaviour to me.
> > >
> > > Peter Kerr Mail.app 2.0.3 (734)
> > > Snr Technician
> > > School of Music, University of Auckland, New Zealand
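
For readers who want to experiment, the saving rule described in Vincent's message can be sketched in a few lines of Python. This is only an illustration of the principle and not Amaya's code: the function name `serialize` and the `dtd_entities` mapping (standing in for "the names defined by the DTD that the DOCTYPE refers to") are assumptions made for the example. Escaping of markup characters such as `&` and `<` is left out.

```python
def serialize(text, charset, dtd_entities=None):
    """Encode text for saving. Hypothetical helper, not Amaya's implementation.

    dtd_entities maps a character to an entity name that the document's DTD
    is assumed to define. None stands for a document without a DOCTYPE.
    """
    dtd_entities = dtd_entities or {}
    pieces = []
    for ch in text:
        try:
            ch.encode(charset)
            pieces.append(ch)  # available in the charset: written as is
        except UnicodeEncodeError:
            if ch in dtd_entities:
                # The DTD defines a name for the character: entity reference.
                pieces.append("&" + dtd_entities[ch] + ";")
            else:
                # No usable name: hexadecimal character reference.
                pieces.append("&#x%x;" % ord(ch))
    return "".join(pieces).encode(charset)


sample = "\u03b1\u00e9"  # Greek alpha followed by e with acute accent

print(serialize(sample, "us-ascii"))                       # b'&#x3b1;&#xe9;'
print(serialize(sample, "iso-8859-1"))                     # b'&#x3b1;\xe9'
print(serialize(sample, "us-ascii", {"\u03b1": "alpha"}))  # b'&alpha;&#xe9;'
```

If decimal character references are all that is needed, Python's built-in error handler gives them directly: `sample.encode("us-ascii", errors="xmlcharrefreplace")` returns `b'&#945;&#233;'`.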
Received on Friday, 2 December 2005 09:34:54 UTC