Faulty character data in US-ASCII encoded document reported inadequately

When a page that is declared to be US-ASCII encoded but actually contains
bytes outside the US-ASCII range is submitted to the validator, it reports:
“Sorry, I am unable to validate this document because on line *1* it
contained one or more bytes that I cannot interpret as us-ascii (in other
words, the bytes found are not valid values in the specified Character
Encoding). Please check both the content of the file and the character
encoding indication.

The error was: Modification of a read-only value attempted”


The report is correct in the sense that it properly indicates the type of
error, but it is incorrect and unhelpful in that it always refers to line 1,
regardless of the location of the (first) erroneous byte. In general,
the validator reports this class of errors with a correct line number
reference and with information about the offending byte (in hexadecimal),
which helps a lot. The last line of the message probably reflects some
internal error in the validator.

A simple example:

<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01//EN"
   "http://www.w3.org/TR/html4/strict.dtd">
<title>Ascii</title>
Intentionally non-Ascii: ü.

Also available at
http://jkorpela.fi/test/ascii.htma
(served with the HTTP header  Content-Type: text/html; charset=us-ascii)
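
To illustrate the kind of report that would help for the example above,
here is a minimal Python sketch. It is not the validator's code, and the
function name first_non_ascii is made up for this illustration. It assumes
that the "ü" is stored as the single byte 0xFC (ISO-8859-1); the actual
bytes in the test file may differ.

```python
def first_non_ascii(data: bytes):
    """Return (line, column, byte value) of the first byte above 0x7F, or None."""
    line, col = 1, 0
    for b in data:
        col += 1
        if b > 0x7F:          # outside the 7-bit US-ASCII range
            return line, col, b
        if b == 0x0A:         # LF ends a line; restart the column count
            line, col = line + 1, 0
    return None


# The example document, with "ü" assumed to be the single byte 0xFC.
doc = (
    b'<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01//EN"\n'
    b'   "http://www.w3.org/TR/html4/strict.dtd">\n'
    b"<title>Ascii</title>\n"
    b"Intentionally non-Ascii: \xfc.\n"
)

hit = first_non_ascii(doc)
if hit:
    line, col, value = hit
    print(f"line {line}, column {col}: byte 0x{value:02X} is not valid US-ASCII")
```

Under that assumption, the sketch points at line 4 and byte 0xFC rather
than at line 1.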

Just in case the problem looks irrelevant: there can be reasons to use
US-ASCII and declare it for an HTML document, for example because the
document also needs to be processed with software that can only handle
US-ASCII. Since HTML provides ways to represent all character data using
just US-ASCII at the character encoding level, it should be supported. And
the validator would be a valuable tool in checking, among other things, that
the data is indeed just US-ASCII, with useful information about the first
occurrence when it is not.

Jukka “Yucca” Korpela

Received on Thursday, 22 August 2019 13:18:35 UTC