Wrong handling of non-ASCII characters

From: Jukka K. Korpela <jkorpela@cs.tut.fi>
Date: Sun, 30 Oct 2005 15:02:43 +0200 (EET)
To: www-validator@w3.org
Cc: Naturally Naomi <naturallynaomi@yahoo.com>
Message-ID: <Pine.GSO.4.63.0510301441550.10081@korppi.cs.tut.fi>
On Sun, 30 Oct 2005, Jukka K. Korpela wrote:

> I created a trivial test document
> http://www.cs.tut.fi/~jkorpela/test/nbsp.html
> that has a <ul> element with one <li> element inside it but
> with a no-break space before the <li> tag. Here's what the
> W3C validator says:
>
> 1. Error Line 5 column 0: start tag for "LI" omitted, but its declaration
> does not permit this.
> /strong>?<li></li>
>
> There's something very strange in the report's source.

I was able to reduce the problem to an even more trivial case:
- write a document in ISO-8859-1 encoding
- declare HTML 4.01 Strict DOCTYPE
- use a body part of <body>é</body> (or with any other non-ASCII
   character inside the body)
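
The steps above can be sketched as a small script that builds such a test page. This is only an illustration: the function name, the title text, the meta element and the choice of "é" as the non-ASCII character are assumptions, not the contents of the actual test document.

```python
# Hypothetical helper for building the reduced test case described above.
# Everything except the DOCTYPE and the <body> line is an assumption.
DOCTYPE = ('<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01//EN" '
           '"http://www.w3.org/TR/html4/strict.dtd">')


def build_test_document(char="\u00e9"):
    """Return the test page as ISO-8859-1 encoded bytes."""
    lines = [
        DOCTYPE,
        '<meta http-equiv="Content-Type" '
        'content="text/html; charset=iso-8859-1">',
        "<title>Test</title>",
        # Character data directly inside <body> is what triggers the
        # "character data is not allowed here" message in HTML 4.01 Strict.
        "<body>" + char + "</body>",
    ]
    return "\n".join(lines).encode("iso-8859-1")


doc = build_test_document()
```

In ISO-8859-1 the character "é" is the single octet E9, so the resulting document contains exactly one non-ASCII octet.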

The validator reports "character data is not allowed here",
which is correct, but shows the element oddly:

<body>/strong>?</body>

If I manually change the encoding of the report page to ISO-8859-1, I get:

<body>Ã©</body>

This is still wrong, but I guess we can now see what goes wrong.
Here's the source of the error message page (viewed as if it were
Latin 1):

       <li class="msg_err">
<span class="err_type">Error</span>
         <em>Line 4 column 6</em>:
         <span class="msg">character data is not allowed 
here</span>.<pre><code class="input">&#60;body&#62;<strong title="Position 
where error was detected.">Ã</strong>©&#60;/body&#62;</code></pre>

Thus, the validator has added <strong> markup in a manner that breaks
a sequence of two octets that is meant to be the UTF-8 representation
of a single character ("é" in this case). This produces the octet pair
C3 3C (looks like Ã< if interpreted as ISO-8859-1), and the rest is
a mess.
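
The effect can be reproduced in a few lines. The sketch below is not the validator's code; it only assumes that the highlight markup is wrapped around a single octet of UTF-8 data at the reported column, and the function name is hypothetical.

```python
# Illustration only: wrapping ONE OCTET of UTF-8 data in highlight markup.
MARK_OPEN = b'<strong title="Position where error was detected.">'
MARK_CLOSE = b"</strong>"


def highlight_at_byte(data, offset):
    """Naively wrap the single octet at `offset` in <strong> markup."""
    return (data[:offset] + MARK_OPEN + data[offset:offset + 1]
            + MARK_CLOSE + data[offset + 1:])


line = "<body>\u00e9</body>".encode("utf-8")   # "é" is the octets C3 A9
broken = highlight_at_byte(line, 6)            # octet 6 is the C3

# The C3 is now followed directly by the "<" of "</strong>".
print(b"\xc3\x3c" in broken)                   # True

# Read as ISO-8859-1, the two halves of "é" show up as separate characters.
print(broken.decode("iso-8859-1"))
# <body><strong title="Position where error was detected.">Ã</strong>©</body>

# Read as UTF-8, the data is no longer well-formed.
try:
    broken.decode("utf-8")
    is_valid_utf8 = True
except UnicodeDecodeError:
    is_valid_utf8 = False
print(is_valid_utf8)                           # False
```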

Deactivating the generation of <strong> markup to highlight the point
of error would be a quick fix.
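
As an alternative to switching the highlighting off, the markup could be placed on character boundaries instead of octet boundaries. The following is a minimal sketch under the assumption that the data is well-formed UTF-8; the function names are hypothetical and nothing here is taken from the validator itself.

```python
# Sketch: widen a byte offset to the whole UTF-8 sequence before wrapping it.
MARK_OPEN = b'<strong title="Position where error was detected.">'
MARK_CLOSE = b"</strong>"


def char_span(data, offset):
    """Return (start, end) of the UTF-8 sequence that contains `offset`.

    Continuation octets have the bit pattern 10xxxxxx, so we step back
    over them to find the lead octet and forward over them to find the end.
    """
    start = offset
    while 0 < start < len(data) and (data[start] & 0xC0) == 0x80:
        start -= 1
    end = start + 1
    while end < len(data) and (data[end] & 0xC0) == 0x80:
        end += 1
    return start, end


def highlight_char(data, offset):
    """Wrap the whole character at `offset` in <strong> markup."""
    start, end = char_span(data, offset)
    return (data[:start] + MARK_OPEN + data[start:end]
            + MARK_CLOSE + data[end:])


line = "<body>\u00e9</body>".encode("utf-8")
fixed = highlight_char(line, 6)
print(fixed.decode("utf-8"))
# <body><strong title="Position where error was detected.">é</strong></body>
```

The same result is obtained for offset 7 (the A9 octet), since both offsets fall inside the same two-octet sequence.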

-- 
Jukka "Yucca" Korpela, http://www.cs.tut.fi/~jkorpela/
Received on Sunday, 30 October 2005 13:02:52 GMT
