
Re: [VE][139] New Error Message Suggestion

From: Terje Bless <link@pobox.com>
Date: Thu, 20 May 2004 04:54:21 +0200
To: www-validator@w3.org
Cc: scott.godin@comcast.net
Message-ID: <b02010203-1033-FF3FF098AA0811D8B4B40030657B83E8@[193.157.66.23]>

-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1

Scott R. Godin <scott.godin@comcast.net> wrote:

The short version: instead of typing "–" when you want an en-dash, type
"&ndash;" and you're good to go. The long version is below if you're so
inclined. :-)



>On Wed, 19 May 2004, Bjoern Hoehrmann wrote:
>
>>* Scott R. Godin wrote:
>>>I'm afraid I don't understand.. you're saying that the entity &ndash;
>>>is not part of ISO-8859-1 ?
>>
>>The character &ndash; represents is not, please have a look at
>><http://ppewww.ph.gla.ac.uk/~flavell/iso8859/isotable.html>.
>
>Thanks Bjoern, appreciate your responses.
>
>I'm mildly surprised that it is not, considering it's a bog-standard
>typographical mark.
>
>I've experimented with the validator and decided on windows-1250 as per
>your earlier suggestion, […]

Actually, I think you may have missed one point Björn made in his original
response.

Björn Höhrmann wrote:
>You actually don't use a character reference but rather the octet 0x96
>which is U+0096 or &#150; in ISO-8859-1, the encoding declared for that
>document. Either change the declaration to ...charset=Windows-1252... or
>properly escape the character.

The main point here isn't that the typographical en-dash is missing from the
ISO-8859-1 Character Repertoire (it is, but that's a separate issue). The point
is that you are including the en-dash literally in your document instead of
properly escaping it.

For HTML and XHTML documents, the raw bits and bytes of the file get
interpreted according to the Character Encoding specified in the HTTP headers
(or possibly through the use of a <meta> charset specification). If you insert
the literal byte 0x96 (which is the en-dash in Windows-1252, but in ISO-8859-1
is only the C1 control code U+0096, not a printable character) and label your
document as ISO-8859-1, you'll end up with a garbage character (and the
Validator will warn you of this).
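
To make the difference concrete, here is a small Python sketch (my own
illustration, not anything the Validator runs) that decodes the same byte
under both labels. The sample bytes and the helper name decode_as are made up
for the example.

```python
import unicodedata

# Hypothetical sample: the bytes of "10 - 20" with a literal 0x96 as the dash.
raw = b"10 \x96 20"

def decode_as(data: bytes, charset: str) -> str:
    """Interpret the raw bytes according to the declared character encoding."""
    return data.decode(charset)

as_cp1252 = decode_as(raw, "cp1252")      # labelled as Windows-1252
as_latin1 = decode_as(raw, "iso-8859-1")  # labelled as ISO-8859-1

# Under Windows-1252 the byte is the en-dash, U+2013.
print(f"cp1252 : U+{ord(as_cp1252[3]):04X}")
# Under ISO-8859-1 the same byte is U+0096, a control code (category "Cc").
print(f"latin-1: U+{ord(as_latin1[3]):04X}",
      unicodedata.category(as_latin1[3]))
```

Same bytes, different label, different character: that mismatch is what the
Validator is complaining about.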


What Björn suggested was that you /either/ switch the Character Encoding
labeling to reflect your use of Windows-1252; **or properly escape the
character!**


In HTML or XHTML documents, in addition to the Character Encoding of the
“physical” file being served, there is an abstract “Universal Character
Repertoire” to which all documents can refer, regardless of their physical
encoding. This is of course the UNICODE Character Repertoire, in which the
en-dash has the code point U+2013.

To refer to characters from this repertoire in your document you would use
either a Numeric Character Reference (&#8211;) or a Named Entity Reference
(&ndash;).

Instead of entering the en-dash directly from the keyboard (or the specific
byte value 0x96 if you're doing this programmatically) you would include the
literal text string "&ndash;" or "&#8211;". A browser, when parsing this file,
will interpret this string and display the referenced character instead.
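
If you generate your pages programmatically, the escaping can be done for you.
The following is a minimal Python sketch under my own assumptions: the function
name escape_non_ascii is hypothetical, it only touches characters outside
US-ASCII, and it leaves markup characters such as "&" and "<" alone.

```python
import html
from html.entities import codepoint2name

def escape_non_ascii(text: str, named: bool = True) -> str:
    """Replace every non-ASCII character with an HTML reference.

    With named=True a named entity reference is used where HTML defines
    one; otherwise a decimal numeric character reference is emitted.
    """
    out = []
    for ch in text:
        cp = ord(ch)
        if cp < 128:
            out.append(ch)                        # plain ASCII passes through
        elif named and cp in codepoint2name:
            out.append(f"&{codepoint2name[cp]};")  # e.g. &ndash;
        else:
            out.append(f"&#{cp};")                 # e.g. &#8211;
    return "".join(out)

source = "pages 10\u201320"                # contains a real en-dash, U+2013
print(escape_non_ascii(source))            # named entity reference
print(escape_non_ascii(source, named=False))  # numeric character reference

# A parser does the reverse; html.unescape stands in for the browser here.
assert html.unescape(escape_non_ascii(source)) == source
```

The result is pure US-ASCII, so it comes out the same whether the document is
labelled ISO-8859-1 or Windows-1252.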


A few of the more useful typographical characters:

Latin 1 | CP1252 | UNICODE |   HTML   | Char | Description
- --------|--------|---------|----------|------|-------------------------
  ----  |  0x96  |  U+2013 | &ndash;  |  –   | en-dash
  ----  |  0x97  |  U+2014 | &mdash;  |  —   | em-dash
  ----  |  0x8B  |  U+2039 | &lsaquo; |  ‹   | Left Single Angle Quote
  ----  |  0x9B  |  U+203A | &rsaquo; |  ›   | Right Single Angle Quote
  0xAB  |  0xAB  |  U+00AB | &laquo;  |  «   | Left Double Angle Quote
  0xBB  |  0xBB  |  U+00BB | &raquo;  |  »   | Right Double Angle Quote
  ----  |  0x93  |  U+201C | &ldquo;  |  “   | Left Double Quote
  ----  |  0x94  |  U+201D | &rdquo;  |  ”   | Right Double Quote

As you can see, of these only the left and right double angled quotation marks
are actually present in ISO-8859-1 (ISO Latin 1), and the rest have different
positions in UNICODE compared to Windows CP1252. They are all available as
named entity references though, or by numeric reference to their code point.
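
If you'd rather not trust a hand-typed table, the columns can be recomputed
from the characters themselves. This is only a sketch using Python's bundled
codecs and entity list; the helper name describe is hypothetical.

```python
from html.entities import codepoint2name

def describe(ch: str) -> tuple:
    """Return (Latin 1 byte, CP1252 byte, code point, entity) for one character."""
    cp = ord(ch)
    try:
        latin1 = f"0x{ch.encode('iso-8859-1')[0]:02X}"
    except UnicodeEncodeError:
        latin1 = "----"                    # not in the ISO-8859-1 repertoire
    cp1252 = f"0x{ch.encode('cp1252')[0]:02X}"
    return (latin1, cp1252, f"U+{cp:04X}", f"&{codepoint2name[cp]};")

# en-dash, em-dash, single and double angle quotes, curly double quotes
chars = "\u2013\u2014\u2039\u203a\u00ab\u00bb\u201c\u201d"

for ch in chars:
    print(" | ".join(describe(ch)))
```

Only the two double angle quotes should come back with a Latin 1 byte; for
everything else the first column stays empty.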

You can also find a quite useful table at
<http://www.natural-innovations.com/wa/doc-charset.html>, or by wandering
around for a while on Alan J. Flavell's excellent pages at
<http://ppewww.ph.gla.ac.uk/%7Eflavell/iso8859/>.

- -- 
By definition there is _no_way_ any problem can be my fault. Any problems
you think you can find in my code are in your imagination. If you continue
with such derranged imaginings then I may be forced to perform corrective
brain surgery... with an axe!                            -- Stephen Harris

-----BEGIN PGP SIGNATURE-----
Version: PGP SDK 3.0.3

iQA/AwUBQKwd2qPyPrIkdfXsEQLCUgCg+E5DTlZt60S7AwWINsO0FdV7OrIAnipq
KQmzM0LanV3JIkN5Z45ayffK
=fYUr
-----END PGP SIGNATURE-----
Received on Wednesday, 19 May 2004 22:54:36 GMT
