RE: non-sgml characters

> I am no expert, and hopefully someone much more knowledgeable will answer
> to you as well, but from what I understand, HTML documents are made up
> of 8-bit characters from the ISO 8859 Latin-1 character set.

HTML documents can be encoded in any character set. Latin-1 has the
advantage of being code-point compatible with both ASCII and Unicode (its
256 values are the first 256 Unicode code points), and hence was once used
as the default. UTF-8 has the advantage of being byte-compatible with ASCII
and capable of directly encoding every Unicode code point, and hence was
chosen as one of the defaults for XML, and so for HTML when it became XHTML
in 1999 (UTF-16 is the other default; XML can safely have two defaults
because they are easy to tell apart from the first couple of bytes).
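
To make the "first couple of bytes" point concrete, here is a minimal
sketch in Python. The function name sniff_xml_encoding is my own invention,
and the logic only distinguishes the two XML defaults; real parsers also
look at UTF-32 signatures and at the encoding declaration, which this
deliberately ignores.

```python
import codecs


def sniff_xml_encoding(data: bytes) -> str:
    """Guess which of the two XML default encodings a document uses.

    Hypothetical helper: it looks only at the leading bytes and
    assumes the document is either UTF-8 or UTF-16.
    """
    # A byte order mark settles the question immediately.
    if data.startswith(codecs.BOM_UTF8):
        return "utf-8"
    # Without a BOM, "<?" encoded as UTF-16 has a zero byte next to
    # each ASCII character, which UTF-8 never produces for "<?".
    if data.startswith(codecs.BOM_UTF16_LE) or data.startswith(b"<\x00?\x00"):
        return "utf-16-le"
    if data.startswith(codecs.BOM_UTF16_BE) or data.startswith(b"\x00<\x00?"):
        return "utf-16-be"
    # Anything else is treated as UTF-8, the other default.
    return "utf-8"


declaration = '<?xml version="1.0"?>'
print(sniff_xml_encoding(declaration.encode("utf-8")))
print(sniff_xml_encoding(declaration.encode("utf-16-be")))
```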

One obvious disadvantage of Latin-1 here would be that it has no TM glyph :)
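
This is easy to check for yourself. The snippet below is only a sketch
using Python's built-in codecs; can_encode is a helper name I made up for
the illustration.

```python
TM = "\u2122"  # TRADE MARK SIGN


def can_encode(text: str, encoding: str) -> bool:
    """Return True if every character of text exists in the encoding."""
    try:
        text.encode(encoding)
    except UnicodeEncodeError:
        return False
    return True


# Latin-1 has no slot for U+2122; UTF-8 and the Windows code page do.
for name in ("latin-1", "utf-8", "cp1252"):
    print(name, can_encode(TM, name))
```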

One way to think of this is to think of the Unicode code-point as a sort of
Platonic form, with its encodings in various character sets as a more
"physical" (as physical as you can get with a bunch of bits) reality of
that form.
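
As a rough illustration of that idea (the variable names are mine, and the
list of encodings is just a sample), the registered sign is one code point
but several different byte sequences, and each of them decodes back to the
same code point:

```python
REG = "\u00ae"  # REGISTERED SIGN, code point 174

encodings = ["latin-1", "cp1252", "utf-8", "utf-16-be"]

# One abstract character, several concrete byte sequences.
encoded = {name: REG.encode(name) for name in encodings}

for name, raw in encoded.items():
    # Decoding with the matching encoding recovers the same code point.
    print(name, raw.hex(), hex(ord(raw.decode(name))))
```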

> ISO SGML entity definitions are used to include characters which are
> missing from the character set or which would otherwise be confused with
> markup elements and the formal symbol for a TM sign is,
>
>     &#153;    (™)
>
> The one for registered trademark (R) is,
>
>     &#174;     (®)
>
> And so on...
>
http://www.w3.org/TR/xhtml1/DTD/xhtml-lat1.ent
http://www.w3.org/TR/xhtml1/DTD/xhtml-symbol.ent
and http://www.w3.org/TR/xhtml1/DTD/xhtml-special.ent are the entity sets
referenced by the XHTML DTDs. They define &trade; as:
<!ENTITY trade    "&#8482;"> <!-- trade mark sign, U+2122 ISOnum -->
They agree with you on &reg; though (I'm guessing you are going by the
Windows charset, which agrees with Unicode and Latin-1 on this one):
<!ENTITY reg    "&#174;"> <!-- registered sign = registered trade mark sign,
                                  U+00AE ISOnum -->
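
If you want to check the numbers, here is a small sketch. I am assuming
Python's html.entities table (which mirrors the HTML 4 entity sets) as a
stand-in for the .ent files, and cp1252 as "the Windows charset".

```python
import unicodedata
from html.entities import name2codepoint

# Code points the named entities stand for.
trade = name2codepoint["trade"]
reg = name2codepoint["reg"]
print(trade, unicodedata.name(chr(trade)))
print(reg, unicodedata.name(chr(reg)))

# In the Windows code page, byte 174 is the registered sign, the same
# number as its Unicode code point...
windows_reg = bytes([174]).decode("cp1252")

# ...but the trade mark sign sits at byte 153, whereas Unicode code
# point 153 is a C1 control character (category "Cc").
windows_tm = bytes([153]).decode("cp1252")
c1_category = unicodedata.category(chr(153))
print(windows_reg == chr(reg), windows_tm == chr(trade), c1_category)
```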

Received on Monday, 15 July 2002 08:24:31 UTC