Re: Invalid validation report!

Dag Øystein Johansen <dag.o.johansen@gmail.com> wrote:

>Result for check.htm: "Sorry, I am unable to validate this document
>because on line 1022 it contained one or more bytes that I cannot
>interpret as utf-8 (in other words, the bytes found are not valid values
>in the specified Character Encoding)."

This is an instance of a known bug in the Validator (actually, it
demonstrates two separate symptoms of the same bug). To recreate:

  * Validate the page in question.
  * Save the validation results page to a file.
  * Upload the file to the validator.

In the original page there is a markup error close to the text that reads «med
et snitt på 22 minutter per kamp» (line 499).

One symptom of the bug in question is that the markup error is reported at an
incorrect character offset; it's reported as being close to the word “på”, but
should have been reported earlier.

The other symptom is that the validator places the error position inside the
multi-byte sequence that encodes the character “å” in “på” above. Since it
inserts markup at that position, that is, between the bytes of a single
character's multi-byte sequence, the resulting page will contain an invalid
multi-byte sequence.
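
To illustrate the second symptom, here is a minimal Python sketch (not the
Validator's actual code; the marker string and the helper name is_valid_utf8
are made up for the example). It splits the UTF-8 encoding of the sample text
in the middle of “å” and inserts markup there:

```python
text = "med et snitt på 22 minutter per kamp"
data = text.encode("utf-8")

# "å" (U+00E5) is encoded as the two bytes 0xC3 0xA5.
# Pick the byte offset that falls between those two bytes.
split = data.index("å".encode("utf-8")) + 1

# Hypothetical error marker; the real result page uses its own markup.
marker = b"<strong>"
broken = data[:split] + marker + data[split:]


def is_valid_utf8(raw: bytes) -> bool:
    """Return True if raw decodes cleanly as UTF-8."""
    try:
        raw.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False


print(is_valid_utf8(data))    # original text
print(is_valid_utf8(broken))  # markup inserted mid-character
```

The original bytes decode cleanly; the modified ones do not, because the lead
byte 0xC3 is now followed by “<” instead of its continuation byte.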


These are both symptoms of the validator internally converting documents to
UTF-8, but operating with byte semantics instead of character semantics.
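
As a rough sketch of the difference (again in Python, and only an
illustration under my own assumptions, not how the Validator is implemented),
the hypothetical helper below maps a byte offset in UTF-8 data to a character
offset, backing up to the start of the character if the byte offset falls in
the middle of a multi-byte sequence:

```python
def byte_to_char_offset(data: bytes, byte_offset: int) -> int:
    """Map a byte offset in UTF-8 data to a character offset.

    If byte_offset points at a continuation byte (bit pattern 10xxxxxx),
    step back to the lead byte of that character first.
    """
    while 0 < byte_offset < len(data) and (data[byte_offset] & 0xC0) == 0x80:
        byte_offset -= 1
    return len(data[:byte_offset].decode("utf-8"))


text = "snitt på 22"
data = text.encode("utf-8")

# The same position has different offsets under the two semantics.
print(text.index(" 22"), data.index(b" 22"))

# Byte offset 8 is the second byte of "å"; it maps to the character "å".
print(byte_to_char_offset(data, 8))
```

Every non-ASCII character before a given position pushes the byte offset
further ahead of the character offset, which is why the two diverge in text
like the Norwegian sample above.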

This bug should be fixed in the next major revision (whenever we switch to using
character semantics).


This may or may not explain the errors you originally spotted.


Thanks for your feedback on this!


-- 
“It's not the mere technical details of inserting the live round into the
 chamber, pointing the weapon at one's foot, and pulling the trigger, but
 rather, it's about the advisability of doing that in the first place.”
                                             -- Alan J. Flavell on ciwah

Received on Tuesday, 29 November 2005 14:50:13 UTC