- From: Terje Bless <link@pobox.com>
- Date: Thu, 20 May 2004 04:54:21 +0200
- To: www-validator@w3.org
- Cc: scott.godin@comcast.net
-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1

Scott R. Godin <scott.godin@comcast.net> wrote:

The short version: instead of typing "–" when you want an en-dash, type
"&ndash;" and you're good to go. The long version is below if you're so
inclined. :-)

>On Wed, 19 May 2004, Bjoern Hoehrmann wrote:
>
>>* Scott R. Godin wrote:
>>>I'm afraid I don't understand.. you're saying that the entity &ndash;
>>>is not part of ISO-8859-1 ?
>>
>>The character &ndash; represents is not, please have a look at
>><http://ppewww.ph.gla.ac.uk/~flavell/iso8859/isotable.html>.
>
>Thanks Bjoern, appreciate your responses.
>
>I'm mildly surprised that it is not, considering it's a bog-standard
>typographical mark.
>
>I've experimented with the validator and decided on windows-1250 as per
>your earlier suggestion, […]

Actually, I think you may have missed one point Björn made in his original
response.

Björn Höhrmann wrote:

>You actually don't use a character reference but rather the octet 0x96
>which is U+0096 or &#150; in ISO-8859-1, the encoding declared for that
>document. Either change the declaration to ...charset=Windows-1252... or
>properly escape the character.

The main point here isn't that the typographical en-dash is missing from
the ISO-8859-1 Character Repertoire (it is missing, but that's a separate
issue). The point is that you are including the en-dash literally in your
document instead of properly encoding it.

For HTML and XHTML documents, the raw bits and bytes of the file get
interpreted according to the Character Encoding specified in the HTTP
headers (or possibly through the use of a <meta> charset specification). If
you insert the literal byte 0x96 (which is the en-dash in Windows-1252, but
falls in the C1 control range of ISO-8859-1, where no printable character is
assigned) and label your document as ISO-8859-1, you'll end up with a
garbage character (and the Validator will warn you of this).
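
To make the byte-level difference concrete, here is a minimal Python sketch.
It is not part of the original exchange: the sample bytes, the variable
names, and the helper to_ascii_with_references are all made up for
illustration. It decodes the same byte 0x96 under both encodings, then shows
one possible way to escape the character so the file no longer depends on
the declared charset.

```python
import html

# Hypothetical sample: "10-20" written with an en-dash that a Windows-1252
# editor saved as the single byte 0x96.
raw = b"10\x9620"

as_cp1252 = raw.decode("cp1252")      # what the author meant
as_latin1 = raw.decode("iso-8859-1")  # what the declared charset says


def to_ascii_with_references(text: str) -> str:
    """Replace every non-ASCII character with a decimal character reference."""
    return text.encode("ascii", "xmlcharrefreplace").decode("ascii")


print(hex(ord(as_cp1252[2])))               # 0x2013, the en-dash
print(hex(ord(as_latin1[2])))               # 0x96, a C1 control character
print(to_ascii_with_references(as_cp1252))  # 10&#8211;20

# A parser resolves the named and the numeric reference to the same character.
print(html.unescape("10&ndash;20") == html.unescape("10&#8211;20"))  # True
```

Once the text is pure ASCII plus references, it reads the same whether the
server labels it ISO-8859-1 or Windows-1252, which is why escaping sidesteps
the labeling problem entirely.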

What Björn suggested was that you /either/ switch the Character Encoding
labeling to reflect your use of Windows-1252; **or properly escape the
character!**

In HTML or XHTML documents, in addition to the Character Encoding of the
“physical” file being served, there is an abstract “Universal Character
Repertoire” to which all documents can refer, regardless of their physical
encoding. This is of course the UNICODE Character Repertoire, in which the
en-dash has the code point U+2013.

To refer to characters from this repertoire in your document you would use
either a Character Reference (&#8211;) or a Named Entity Reference
(&ndash;). Instead of entering the en-dash directly from the keyboard (or
writing the specific byte value 0x96 if you're doing this programmatically),
you would include the literal text string "&#8211;" or "&ndash;". A browser,
when parsing this file, will interpret this string and display the
referenced character instead.

A few of the more useful typographical characters:

Latin 1 | CP1252 | UNICODE | HTML     | Char | Description
--------|--------|---------|----------|------|-------------------------
  ----  |  0x96  | U+2013  | &ndash;  |  –   | en-dash
  ----  |  0x97  | U+2014  | &mdash;  |  —   | em-dash
  ----  |  0x8B  | U+2039  | &lsaquo; |  ‹   | Left Single Angle Quote
  ----  |  0x9B  | U+203A  | &rsaquo; |  ›   | Right Single Angle Quote
  0xAB  |  0xAB  | U+00AB  | &laquo;  |  «   | Left Double Angle Quote
  0xBB  |  0xBB  | U+00BB  | &raquo;  |  »   | Right Double Angle Quote
  ----  |  0x93  | U+201C  | &ldquo;  |  “   | Left Double Quote
  ----  |  0x94  | U+201D  | &rdquo;  |  ”   | Right Double Quote

As you can see, of these only the left and right double angled quotation
marks are actually present in ISO-8859-1 (ISO Latin 1), and the rest have
different positions in UNICODE compared to Windows CP1252. They are all
available as named entity references though, or by numeric reference to
their code point.

You can also find a quite useful table at
<http://www.natural-innovations.com/wa/doc-charset.html>, or by wandering
around for a while on Alan J.
Flavell's excellent pages at
<http://ppewww.ph.gla.ac.uk/%7Eflavell/iso8859/>.

-- 
By definition there is _no_way_ any problem can be my fault. Any problems
you think you can find in my code are in your imagination. If you continue
with such deranged imaginings then I may be forced to perform corrective
brain surgery... with an axe!                            -- Stephen Harris

-----BEGIN PGP SIGNATURE-----
Version: PGP SDK 3.0.3

iQA/AwUBQKwd2qPyPrIkdfXsEQLCUgCg+E5DTlZt60S7AwWINsO0FdV7OrIAnipq
KQmzM0LanV3JIkN5Z45ayffK
=fYUr
-----END PGP SIGNATURE-----

Received on Wednesday, 19 May 2004 22:54:36 UTC