RE: non-sgml characters from Jon Hanna on 2002-07-15 (w3c-wai-ig@w3.org from July to September 2002)

From: Jon Hanna <jon@spin.ie>
Date: Mon, 15 Jul 2002 17:23:33 +0100
To: <w3c-wai-ig@w3.org>
Message-ID: <NDBBLCBLIMDOPKMOPHLHAEIGEDAA.jon@spin.ie>
>      I too have been looking for a standard set of icons.
> Additionally, though, since I am blind I am really looking for a
> table that would be like:
> description:code - that is, two text columns.

Are http://www.w3.org/TR/xhtml1/DTD/xhtml-lat1.ent
http://www.w3.org/TR/xhtml1/DTD/xhtml-symbol.ent
and http://www.w3.org/TR/xhtml1/DTD/xhtml-special.ent readable to you?

> I am unfamiliar with the methods to do this so probably confuse things.
> There are "entities" in HTML such as &sup for superscript that
> JFW 4.01 reads very well (JFW says "superscript").

&sup; means "superset of", a mathematical symbol that looks a bit like
a capital U or a chicken-wire nail on its side. If JFW is reading that as
"superscript" it is in error.

There are entities &sup1; &sup2; and &sup3; for "superscript digit 1",
"superscript digit 2" and "superscript digit 3" respectively. Is that what
you are referring to?
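
To check what these entity names stand for, here is a minimal sketch. It
assumes Python 3 and its standard html.entities and unicodedata modules,
neither of which comes from this thread, and describe_entity is a made-up
helper name. It prints each entity in the description:code form asked for
above.

```python
import unicodedata
from html.entities import name2codepoint


def describe_entity(name: str) -> str:
    """Return 'description:code' for an HTML entity name (hypothetical helper)."""
    code_point = name2codepoint[name]  # UCS number behind the entity name
    description = unicodedata.name(chr(code_point)).lower()
    return f"{description}:&{name};"


# &sup; is a set-theory symbol; only &sup1; to &sup3; are superscripts.
for entity in ("sup", "sup1", "sup2", "sup3"):
    print(describe_entity(entity))
```

The same loop could be run over every key in name2codepoint to build a full
two-column text table.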

> Apparently there is also unicode and SGML.

Okay, there seems to be some confusion here. SGML and XML both define rules
of syntax (amongst other things) that are used by other applications. HTML
up to and including HTML 4.01 is an SGML application. Its successor, XHTML,
is an XML application.

XML does not define any character set, character encoding, or any other way
of defining a relationship between a collection of bits and a character.
What it *does* do is, firstly, use one of these encodings, in that it is
written as text and hence must be written in some sort of character set,
and, secondly, define mechanisms for the author of the document to express
characters that are outside of the character set being used, illegal at the
current position, or simply more convenient for the author to express that
way.

Now while XML can be encoded in any character set (there are problems with
the term "character set", but this mail is looking like it's going to be
long already so I'll skip that for now) that contains symbols for < > / and
at least one of " and ' it should always be thought of as being in the
Universal Character Set (UCS).

In other words if one XML document is encoded in UTF-16BE, another in
UTF-16LE, another in UTF-8 and another in the Windows Western character set
then to indicate a trademark sign directly the first could use the byte
values 33 followed by 34, the second bytes 34 then 33, the third bytes of
value 226, then 132, then 162, and the last a single byte of value 153.
However all of these different ways of encoding the character are just
different conventions for the server indicating to the client that it means
UCS character 8482 - and hence that it means a trademark symbol. Once the
bytes are loaded into the browser it has had the character 8482 communicated
to it and how this happened doesn't matter, just like it doesn't matter if
you are reading the number 8482 in this mail on a screen, printed page,
Braille reader, or it is read to you by a screen reader; what matters is
that I have gotten 8482 from here to there.
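
The byte values above can be checked with a short sketch. It assumes Python 3
and that "cp1252" is the codec name for the Windows Western character set;
the names TRADEMARK, ENCODINGS and byte_values are made up for illustration.

```python
TRADEMARK = "\u2122"  # UCS character 8482, the trademark sign

# Codec names are Python's; cp1252 stands in for "Windows Western".
ENCODINGS = ("utf-16-be", "utf-16-le", "utf-8", "cp1252")


def byte_values(text: str, encoding: str) -> list[int]:
    """Return the decimal byte values used to write text in an encoding."""
    return list(text.encode(encoding))


for encoding in ENCODINGS:
    values = byte_values(TRADEMARK, encoding)
    # Decoding the bytes again recovers the same UCS character every time.
    decoded = bytes(values).decode(encoding)
    print(encoding, values, ord(decoded))
```

Each line printed shows different bytes but ends with the same number, 8482.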

Numeric character references are, as I said, a way of indicating a code
point that it is impossible or inconvenient to express in the character set
that is being used to encode the document. Because the author is here stating
a code point in code it MUST always be the UCS value that is used,
therefore you would use &#8482; or &#x2122; to mean the trademark symbol
(these are the same number, but the latter uses hexadecimal, the former
decimal) no matter what character set was used to encode the document.
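
As a small illustration, assuming Python 3 and its standard html module (not
something this thread relies on), both spellings resolve to the same
character, and the reference itself can be written in plain ASCII. The
variable names are made up for the example.

```python
import html

# Decimal and hexadecimal references to the same UCS value.
decimal_form = html.unescape("&#8482;")
hex_form = html.unescape("&#x2122;")

# Hexadecimal 2122 and decimal 8482 are the same number.
same_number = int("2122", 16)

# The reference uses only ASCII characters, so it survives any encoding
# that contains those characters, even one with no trademark sign.
ascii_bytes = "&#8482;".encode("ascii")

print(decimal_form == hex_form, same_number, ascii_bytes)
```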

Hence &#153; is completely wrong. However because there is no printable
character at position 153 (UCS reserves it for a control code) it is so
completely wrong that a browser can realise that the author must have made a
mistake and try to guess what the author meant to do, which is why it works
on some browsers.
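
A sketch of that guessing, assuming Python 3: its standard html.unescape
function happens to apply the same kind of repair that lenient browsers do,
which is a property of that library and not something XML requires. The
variable names are made up for the example.

```python
import html
import unicodedata

# UCS position 153 is a control code, not a printable character.
strict_character = chr(153)
category = unicodedata.category(strict_character)  # "Cc" means control

# Byte 153 only means "trademark" inside the Windows Western encoding.
windows_reading = bytes([153]).decode("cp1252")

# A lenient parser guesses the author meant the Windows byte value.
lenient_reading = html.unescape("&#153;")

print(category, ord(windows_reading), ord(lenient_reading))
```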

The named entities like &trade; are another feature of XML (and SGML before
it). An author of a DTD can define various named entities that are to be
replaced by something else when they are encountered. The DTD for HTML points
to the three files I gave URLs for above. In one of these there is the code:

<!ENTITY trade    "&#8482;">

which means that whenever a browser comes across &trade; it should replace
it with &#8482; which as we noted above means the trademark symbol.
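
The replacement can be watched with a minimal sketch, assuming Python 3 and
its standard xml.etree.ElementTree parser. The tiny document below, including
the word "Acme" and the inline DOCTYPE, is invented for illustration; a real
XHTML page would pull the declaration in from the DTD files instead.

```python
import xml.etree.ElementTree as ET

# The entity is declared in an internal DTD subset so the example is
# self-contained; the declaration is the one quoted above.
DOCUMENT = '<!DOCTYPE p [<!ENTITY trade "&#8482;">]><p>Acme&trade;</p>'

paragraph = ET.fromstring(DOCUMENT)

# The parser has already swapped &trade; for UCS character 8482.
last_character = paragraph.text[-1]
print(paragraph.text, ord(last_character))
```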
Received on Monday, 15 July 2002 12:23:26 UTC