- From: Jukka K. Korpela <jkorpela@cs.tut.fi>
- Date: Wed, 11 Apr 2007 09:25:01 +0300 (EEST)
- To: www-validator@w3.org
- cc: laura brewer <lauralee1341@msn.com>
On Wed, 11 Apr 2007, Rui del-Negro wrote:

> My guess it the pages were created in a word processor, with "smart quotes" /
> "smart apostrophes" enabled. These are the sections with the problems:

It indeed seems to be the "smart quotes" or "smart apostrophes" that cause the problem, pragmatically speaking. However, the page as a whole was probably not created in a word processor, since it also contains ASCII apostrophes and ASCII quotation marks (i.e., ' and "). I guess _parts_ of the text were written using MS Word, which converted, on input, the ASCII characters into "smart" or "curly" punctuation characters, and these parts were cut & pasted into a web page editor.

The problem is that the page declares UTF-8 encoding but the "smart" characters are in windows-1252 (Windows Latin 1) encoding. Note that the "smart" punctuation characters are not smart at all in the present situation: they appear as small rectangles on IE and as question marks in black lozenges on Firefox. Those are the browsers' ways of indicating malformed character data.

Technically, e.g. the "smart" apostrophe in windows-1252 encoding (as on the page) is octet 92 (hexadecimal), which can occur in a utf-8 datastream only as a continuation octet of a multi-octet sequence, i.e. after a lead octet (in the range C2..F4) or after another continuation octet (in the range 80..BF). Here it appears after an ASCII character, in the range 0..7F. This technical babbling is meant to prove that the data is malformed even at the character level when declared as utf-8, so it's reasonable that the validator _only_ reports the character level issues.

I think I need to warn you that after fixing these issues, there will be 130 error messages about markup errors, so there's still some work to be done.

You could use the validator's extended interface http://validator.w3.org/detailed.html so that you explicitly specify the encoding.
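
For anyone who wants to check the octet-level argument, here is a minimal Python sketch. It is only an illustration: the sample string and the helper name diagnose are made up for this example and are not taken from the page in question. It assumes the offending octet is 0x92, which is the windows-1252 code for the right single quotation mark (U+2019).

```python
def diagnose(data: bytes) -> str:
    """Report whether the octets are well-formed UTF-8 (hypothetical helper)."""
    try:
        data.decode("utf-8")
        return "valid utf-8"
    except UnicodeDecodeError as err:
        # err.start is the offset of the first offending octet
        return f"invalid utf-8 at offset {err.start}: octet 0x{data[err.start]:02X}"


# Made-up sample: "don't" with a windows-1252 "smart" apostrophe (octet 92 hex)
sample = b"don\x92t"

# Declared as utf-8, the data is malformed: 0x92 follows an ASCII octet
print(diagnose(sample))

# Interpreted as windows-1252, the same octets decode without trouble
text = sample.decode("cp1252")

# Two possible repairs: re-encode as real UTF-8, or use character references
utf8 = text.encode("utf-8")
refs = text.encode("ascii", errors="xmlcharrefreplace")
print(utf8, refs)
```

The first print shows why a UTF-8 decoder gives up at the apostrophe; the last two values correspond to storing the character in genuine UTF-8 or as a numeric character reference.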
In a browser, you can likewise use View/Encoding to make the browser use the windows-1252 (often called "Western European, Windows" or something like that) encoding, and this would make the "smart" punctuation characters appear. These overrides, however, are not a _solution_.

The solution is to change the encoding declared by the server to match the encoding actually used. (You should then remove or modify the <meta> tag to avoid confusion when someone sees it in the source code.) If you cannot change the HTTP headers that the server sends, you can change utf-8 in the meta tag to windows-1252. This works too since currently the HTTP headers do not specify the encoding.

Alternatively, you could change the "smart" punctuation characters to utf-8 representation, if you have a suitable editor, or to character references like &#8217; or entity references like &rsquo;, which work no matter what the document's character encoding is.

This is a long reply but I'd also like to remark that the page contains some constructs that violate the compatibility guidelines in the XHTML 1.0 specification (Appendix C) but are not reported by the validator, since they do not violate the formalized XHTML syntax. In the document, there are "self-closing" tags like <a name="..." /> which should be <a name="..."></a> (or replaced by the use of an id="..." attribute for a suitable element, to be more modern) and there are separate end tags for empty elements like <img ...></img>. These violate, in opposite ways, the rule that the <tagname ... /> syntax should be used if and only if the element has EMPTY declared content, i.e. if the element _cannot_ have any content according to syntax rules.

-- 
Jukka "Yucca" Korpela, http://www.cs.tut.fi/~jkorpela/
Received on Wednesday, 11 April 2007 07:03:12 UTC