Wrong handling of non-ASCII characters

On Sun, 30 Oct 2005, Jukka K. Korpela wrote:

> I created a trivial test document
> http://www.cs.tut.fi/~jkorpela/test/nbsp.html
> that has a <ul> element with one <li> element inside it but
> with a no-break space before the <li> tag. Here's what the
> W3C validator says:
>
> 1. Error Line 5 column 0: start tag for "LI" omitted, but its declaration
> does not permit this.
> ¼/strong>?<li></li>
>
> There's something very strange in the report's source.

I was able to reduce the problem to an even more trivial case:
- write a document in ISO-8859-1 encoding
- declare HTML 4.01 Strict DOCTYPE
- use a body part of <body>é</body> (or with any non-ASCII
   character inside the body)
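
For anyone who wants to reproduce this, here is a minimal sketch that builds such a document as ISO-8859-1 octets. The exact markup is an assumption (the original test file is not shown here); the layout is chosen only so that the non-ASCII character lands on line 4, right after <body>. The name build_test_document is hypothetical.

```python
# Assumed test document: HTML 4.01 Strict DOCTYPE, a title, and a body
# containing a single non-ASCII character (U+00E9, "é").
DOC = (
    '<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01//EN"\n'
    '  "http://www.w3.org/TR/html4/strict.dtd">\n'
    '<title>Test</title>\n'
    '<body>\u00e9</body>\n'
)


def build_test_document() -> bytes:
    """Return the document encoded as ISO-8859-1, where "é" is the single octet E9."""
    return DOC.encode("iso-8859-1")
```

Writing the result to a file opened in binary mode ("wb") keeps the octets exactly as encoded.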

The validator reports "character data is not allowed here",
which is correct, but shows the element oddly:

<body>ü/strong>?</body>

If I manually change the encoding of the report page to ISO-8859-1, I get:

<body>é</body>

This is still wrong, but I guess we can now see what goes wrong.
Here's the source of the error message page (viewed as if it were
Latin 1):

       <li class="msg_err">
<span class="err_type">Error</span>
         <em>Line 4 column 6</em>:
         <span class="msg">character data is not allowed 
here</span>.<pre><code class="input">&#60;body&#62;<strong title="Position 
where error was detected.">Ã</strong>©&#60;/body&#62;</code></pre>

Thus, the validator has added <strong> markup in a manner that breaks
a sequence of two octets that is meant to be the UTF-8 representation
of a single character ("é" in this case). This produces the octet pair
C3 3C (looks like Ã< if interpreted as ISO-8859-1), and the rest is
a mess.
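
The effect can be modelled in a few lines. This is only a sketch of what I suspect happens, not the validator's actual code: it assumes the highlight is inserted around a single octet of the UTF-8 output rather than around a whole character. The function name highlight_naive is made up, and the escaping of "<" and ">" in the source line is left out.

```python
def highlight_naive(text: str, offset: int) -> bytes:
    """Wrap ONE OCTET at `offset` in <strong>, ignoring character boundaries."""
    data = text.encode("utf-8")
    return (
        data[:offset]
        + b"<strong>"
        + data[offset:offset + 1]   # only the lead octet of a multi-octet character
        + b"</strong>"
        + data[offset + 1:]         # continuation octet is left stranded here
    )


broken = highlight_naive("<body>\u00e9</body>", 6)

# The lead octet C3 is now followed by "<" (3C) instead of its continuation A9.
assert b"\xc3\x3c" in broken

# Read as ISO-8859-1, the octets show up as "Ã" and "©" split by the end tag.
print(broken.decode("iso-8859-1"))
```

The result is no longer valid UTF-8, so a browser that honours the declared encoding has to guess at what to display.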

Deactivating the generation of <strong> markup to highlight the point
of error would be a quick fix.
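
A less drastic alternative might be to widen the highlighted span to whole characters before inserting the markup. The following is a sketch under the assumption that the report is assembled from UTF-8 octets and an octet offset; char_span and highlight_safe are hypothetical names, and escaping of markup characters is again omitted.

```python
def char_span(data: bytes, pos: int) -> tuple[int, int]:
    """Return (start, end) of the UTF-8 character that contains octet `pos`."""
    start = pos
    # Continuation octets have the bit pattern 10xxxxxx; back up to the lead octet.
    while start > 0 and (data[start] & 0xC0) == 0x80:
        start -= 1
    end = start + 1
    # Extend forward over the continuation octets belonging to this character.
    while end < len(data) and (data[end] & 0xC0) == 0x80:
        end += 1
    return start, end


def highlight_safe(data: bytes, pos: int) -> bytes:
    """Wrap the whole character at `pos` in <strong>, never splitting a sequence."""
    start, end = char_span(data, pos)
    return data[:start] + b"<strong>" + data[start:end] + b"</strong>" + data[end:]
```

With this, an offset pointing at either octet of "é" yields the same output, and that output is still valid UTF-8.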

-- 
Jukka "Yucca" Korpela, http://www.cs.tut.fi/~jkorpela/

Received on Sunday, 30 October 2005 13:02:52 UTC