Re: Utf-8 problem from Jukka K. Korpela on 2007-04-11 (www-validator@w3.org from April 2007)

From: Jukka K. Korpela <jkorpela@cs.tut.fi>
Date: Wed, 11 Apr 2007 09:25:01 +0300 (EEST)
To: www-validator@w3.org
cc: laura brewer <lauralee1341@msn.com>
Message-ID: <Pine.GSO.4.64.0704110852250.23830@mustatilhi.cs.tut.fi>
On Wed, 11 Apr 2007, Rui del-Negro wrote:

> My guess is the pages were created in a word processor, with "smart quotes" / 
> "smart apostrophes" enabled. These are the sections with the problems:

It indeed seems to be the "smart quotes" or "smart apostrophes" that cause 
the problem, pragmatically speaking. However, the page as a whole was 
probably not created in a word processor, since it also contains ASCII 
apostrophes and ASCII quotation marks (i.e., ' and "). I guess _parts_ of 
the text were written using MS Word, which converted, on input, the ASCII 
characters into "smart" or "curly" punctuation characters, and these parts 
were cut & pasted into a web page editor.

The problem is that the page declares UTF-8 encoding but the "smart" 
characters are in windows-1252 (Windows Latin 1) encoding.

Note that the "smart" punctuation characters are not smart at all in 
the present situation: they appear as small rectangles on IE and as 
question marks in black lozenges on Firefox. Those are the browsers' ways 
of indicating malformed character data.

Technically, e.g. the "smart" apostrophe in windows-1252 encoding (as on 
the page) is octet 92 (hexadecimal), which cannot occur in a utf-8 
datastream except as a continuation octet, i.e. after a lead octet in the 
range C2..F4 or after another continuation octet. Here it appears 
after an ASCII character, in the range 0..7F. This technical babbling is 
meant to prove that the data is malformed even at the character level when 
declared as utf-8, so it's reasonable that the validator _only_ reports 
the character level issues. I think I need to warn you that after fixing 
these issues, there will be 130 error messages about markup errors, so 
there's still some work to be done.
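
If you want to check this yourself, here is a minimal Python sketch. The 
sample bytes are made up for illustration (they are not copied from the 
page), and the helper name decodes_as is just my own. It assumes the 
"smart" apostrophe is stored as the windows-1252 octet 92 (hexadecimal), 
which corresponds to U+2019.

```python
# Hypothetical sample: the word "don't" with a windows-1252 "smart"
# apostrophe (octet 0x92) placed directly after the ASCII letter "n".
raw = b"don\x92t"


def decodes_as(data, encoding):
    """Return True if the bytes are well-formed in the given encoding."""
    try:
        data.decode(encoding)
        return True
    except UnicodeDecodeError:
        return False


# 0x92 is a continuation octet with no lead octet before it,
# so a strict UTF-8 decoder rejects the data.
as_utf8 = decodes_as(raw, "utf-8")

# The same bytes are well-formed windows-1252; 0x92 maps to U+2019.
as_cp1252 = decodes_as(raw, "windows-1252")
apostrophe = raw.decode("windows-1252")[3]
```

A strict decoder refusing the data is the programmatic counterpart of the 
rectangles and question marks that the browsers show.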

You could use the validator's extended interface
http://validator.w3.org/detailed.html
so that you explicitly specify the encoding. Similarly, in a browser, you 
can use View/Encoding to make the browser use the windows-1252 (often 
called "Western European, Windows" or something like that) encoding, and 
this would make the "smart" punctuation characters appear. These 
overrides, however, are not a _solution_.

The solution is to make the server declare, in its HTTP headers, the 
encoding actually used. (You should then remove or modify the <meta> 
tag to avoid confusion when someone sees it in the source code.) If you 
cannot change the HTTP headers that the server sends, you can change utf-8 
in the meta tag to windows-1252. This works too since currently the HTTP 
headers do not specify the encoding.

Alternatively, you could change the "smart" punctuation characters to
utf-8 representation, if you have a suitable editor, or to character 
references like &#8217; or entity references like &rsquo;, which work no 
matter what the document's character encoding is.
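
If no suitable editor is at hand, the conversion can also be scripted. 
The following is only a sketch, assuming the affected text really is 
windows-1252; the sample bytes and variable names are invented for 
illustration. It shows both alternatives: re-encoding as utf-8 and 
replacing non-ASCII characters with numeric character references.

```python
# Hypothetical windows-1252 input: a smart apostrophe (0x92) and
# smart double quotes (0x93, 0x94) among ASCII text.
raw = b"It\x92s a \x93test\x94"

# Interpret the bytes in the encoding they were actually written in.
text = raw.decode("windows-1252")

# Alternative 1: store the same characters as utf-8, so that the
# page's utf-8 declaration becomes true.
utf8_bytes = text.encode("utf-8")

# Alternative 2: keep the file pure ASCII and write every non-ASCII
# character as a numeric character reference such as &#8217;.
char_refs = text.encode("ascii", "xmlcharrefreplace").decode("ascii")
```

The second form survives any later mix-up about the declared encoding, 
at the cost of less readable source text.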

This is a long reply but I'd also like to remark that the page contains 
some constructs that violate the compatibility guidelines in the XHTML 1.0 
specification (Appendix C) but are not reported by the validator, since 
they do not violate the formalized XHTML syntax. In the document, there 
are "self-closing" tags like <a name="..." /> which should be <a 
name="..."></a> (or replaced by the use of an id="..." attribute for a 
suitable element, to be more modern) and there are separate end tags for 
empty elements like <img ...></img>. These violate, in opposite ways, the 
rule that the <tagname ... /> syntax should be used if and only if the 
element has EMPTY declared content, i.e. if the element _cannot_ have any 
content according to syntax rules.

-- 
Jukka "Yucca" Korpela, http://www.cs.tut.fi/~jkorpela/
Received on Wednesday, 11 April 2007 07:03:12 UTC