- From: Jon Hanna <jon@spin.ie>
- Date: Mon, 15 Jul 2002 13:24:31 +0100
- To: <w3c-wai-ig@w3.org>
> I am no expert, and hopefully someone much more knowledgeable will answer
> to you as well, but from what I understand, HTML documents are made up
> of 8-bit characters from the ISO 8859 Latin-1 character set.
HTML documents can be encoded in any character set. Latin-1 has the
advantage of being code-point compatible with both ASCII and Unicode, and
hence was once used as the default. UTF-8 has the advantage of being
code-point compatible with ASCII while being capable of directly encoding
all Unicode code-points, so it was chosen as one of the defaults for XML,
and therefore for HTML when it became XHTML in 1999. UTF-16 is the other
default; XML can safely have two defaults because they are easy to tell
apart from the first couple of bytes.
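To make the compatibility claims above concrete, here is a rough Python
sketch. The codec names and the BOM constants come from the standard
library; guess_xml_default_encoding is a hypothetical helper of my own that
only looks for a byte-order mark, which is far less than a real XML parser
does.

```python
import codecs

text = "caf\u00e9"  # U+00E9 lies within the Latin-1 range

# Latin-1: each byte value equals the Unicode code point of its character.
latin1_bytes = text.encode("latin-1")
assert list(latin1_bytes) == [ord(ch) for ch in text]

# ASCII-only text yields identical bytes in ASCII, Latin-1 and UTF-8.
plain = "<html>"
assert plain.encode("ascii") == plain.encode("latin-1") == plain.encode("utf-8")

# Outside ASCII the two diverge: UTF-8 spends two bytes on U+00E9.
utf8_bytes = text.encode("utf-8")
assert len(latin1_bytes) == 4 and len(utf8_bytes) == 5


def guess_xml_default_encoding(data: bytes) -> str:
    """Hypothetical helper: choose between XML's two default encodings.

    Assumption: only the byte-order mark is checked. A document that
    starts with a UTF-16 BOM is treated as UTF-16, anything else as UTF-8.
    """
    if data.startswith((codecs.BOM_UTF16_BE, codecs.BOM_UTF16_LE)):
        return "utf-16"
    return "utf-8"


utf16_doc = codecs.BOM_UTF16_LE + "<a/>".encode("utf-16-le")
utf8_doc = "<a/>".encode("utf-8")
print(guess_xml_default_encoding(utf16_doc))
print(guess_xml_default_encoding(utf8_doc))
```

The "utf-16" codec in Python reads the byte-order mark itself, so the
guessed name can be passed straight to bytes.decode.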
One obvious disadvantage of Latin-1 here would be that it has no TM glyph :)
One way to think of this is to think of the Unicode code-point as a sort of
Platonic form, with its encodings in various sets as a more "physical" (as
physical as you can get with a bunch of bits) reality of that form.
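A small Python sketch of that idea, taking the trade mark sign as the
"form". I am assuming here that Python's cp1252 codec is a fair stand-in
for the Windows character set; the variable names are just for
illustration.

```python
TRADE_MARK = "\u2122"  # the abstract code point U+2122

# The same code point has a different byte-level "reality" in each encoding.
encodings = {
    name: TRADE_MARK.encode(name)
    for name in ("utf-8", "utf-16-be", "cp1252")
}

# Every one of those byte sequences decodes back to the same code point.
for name, raw in encodings.items():
    assert raw.decode(name) == TRADE_MARK

# Latin-1 has no slot for this character, so encoding fails.
try:
    TRADE_MARK.encode("latin-1")
except UnicodeEncodeError:
    latin1_has_trade_mark = False
else:
    latin1_has_trade_mark = True

for name, raw in encodings.items():
    print(name, raw.hex())
print("latin-1 can encode it:", latin1_has_trade_mark)
```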
> ISO SGML entity definitions are used to include characters which are
> missing from the character set or which would otherwise be confused with
> markup elements and the formal symbol for a TM sign is,
>
> &#153; (&trade;)
>
> The one for registered trademark (R) is,
>
> &#174; (&reg;)
>
> And so on...
>
http://www.w3.org/TR/xhtml1/DTD/xhtml-lat1.ent
http://www.w3.org/TR/xhtml1/DTD/xhtml-symbol.ent
and http://www.w3.org/TR/xhtml1/DTD/xhtml-special.ent are the entity sets
referenced by the DTDs. They define &trade; as:
<!ENTITY trade "&#8482;"> <!-- trade mark sign, U+2122 ISOnum -->
They agree with you on &reg; though (I'm guessing you are going by the
Windows charset, which agrees with Unicode and Latin-1 on this one):
<!ENTITY reg "&#174;"> <!-- registered sign = registered trade mark sign,
U+00AE ISOnum -->
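The difference between the two numbers can be checked with a few lines of
Python. This is only a sketch: I am assuming cp1252 is the Windows charset
in question, and the variable names are mine. html.unescape and
unicodedata.category are standard-library functions.

```python
import html
import unicodedata

# The named entities resolve to the code points given in the .ent files.
trade = html.unescape("&trade;")
reg = html.unescape("&reg;")
assert trade == chr(8482) == "\u2122"
assert reg == chr(174) == "\u00ae"

# Byte 0x99 is the trade mark sign only in the Windows charset (assumed
# here to be cp1252); as a Unicode code point, 153 is a control character.
windows_trade = b"\x99".decode("cp1252")
category_of_153 = unicodedata.category(chr(153))

# Byte 0xAE means the registered sign in cp1252 and Latin-1 alike, and 174
# is also its Unicode code point, which is why all three agree.
windows_reg = b"\xae".decode("cp1252")
latin1_reg = b"\xae".decode("latin-1")

print(f"trade: U+{ord(trade):04X}, reg: U+{ord(reg):04X}")
print("category of code point 153:", category_of_153)
```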
Received on Monday, 15 July 2002 08:24:31 UTC