Possible bug in HTML 4.01 transitional Validator logic from David Bryant on 2003-03-03 (www-validator@w3.org from March 2003)

From: David Bryant <davidbryant@att.net>
Date: Mon, 03 Mar 2003 11:25:18 -0700
To: www-validator@w3.org
Message-ID: <3E639E0E.8050408@att.net>
Summary: The string </ is always interpreted as the start
of an HTML end tag, even when it's inside a scripted string
constant.

-----------------------------------------------------------

Hi! I'm new to this list. I live in Denver, Colorado.

I've been using the markup validation service extensively to
check my HTML coding. You can see my pages at

http://davidbryant.home.att.net

if you want to.

Every page on my site starts with

<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN"
   "http://www.w3.org/TR/html4/loose.dtd">

Anyway, I was recently checking a page entitled "prog105.htm",
which you can find on my site, and the validator reported a
number of errors that really should not have been errors.

This page displays inside a frameset. I'm using the bottom
16% of the screen as a display area for definitions, and I use
JavaScript with the document.write method to display the
requested definition for certain words in that bottom frame.
The information to be displayed is (was) encoded like this:

<script ...>
... some code omitted ...
var dfnhead = '<font size="2" face="Arial,Helvetica,sans-serif"><b>';
var dfntail = '<br></font>';
var dfn = new Array();  // definitions
dfn[0] = "Assembly Language:</b> A computer language consisting
of <i>mnemonics</i> that translate directly into individual
machine instructions, <i>macro instructions</i> that generate
one or many individual assembly language statements, and
<i>assembler directives</i> that control the assembly process.";
  ... (more array entries) ...

and the logic to display the requested definition looks
like this:

top.frames[1].document.write(dfnhead + dfn[x] + dfntail);
   ... more code omitted ...
</script>

where x is a variable passed at the time this particular
function is invoked (when the user clicks on a link).

So anyway, the page was displaying beautifully, but it would
not validate on your service because the parser identified the
</b> and </i> strings inside the JavaScript code quoted above
as unmatched HTML end tags. Yet it didn't complain about
the <font ...> or the </font> tags embedded inside the variables
dfnhead and dfntail. And it apparently didn't even see any of
the <i> tags: every one of them is paired with a </i> tag in
the definitions as I coded them, and it still said all the
</i> tags were unmatched.

I don't need advice on how to fix this. I already fooled the
parser by breaking up my simple long strings into several
shorter strings and then concatenating them later with my
JavaScript code. Now the page validates OK. But I still think
the parser contains a bug.
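
For anyone hitting the same errors, here is a minimal sketch of that kind of workaround. The message does not show exactly how the strings were split, so the variable names (original, split, escaped) and the shortened definition text are hypothetical. The second variant, writing <\/ instead of </, is the form suggested in Appendix B.3.2 of the HTML 4.01 specification; it is an alternative, not what the message describes.

```javascript
// Hypothetical, shortened definition text; the real array entries are longer.
// This is the form the validator complained about.
var original = "Assembly Language:</b> A language made of <i>mnemonics</i>.";

// Variant 1 (assumed shape of the split-and-concatenate workaround):
// break each end tag so the source never contains "</" followed by a letter.
var split = "Assembly Language:<" + "/b> A language made of <i>mnemonics<" + "/i>.";

// Variant 2: escape the slash. Inside a JavaScript string literal,
// "\/" is simply "/", so the resulting string is unchanged.
var escaped = "Assembly Language:<\/b> A language made of <i>mnemonics<\/i>.";

// Both variants build exactly the same string as the original.
console.log(split === original && escaped === original); // true
```

Either way the browser receives the same markup through document.write; only the script's source text changes.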

Why does the parser look at stuff in between <script> and
</script> tags at all? There isn't going to be any real HTML
in between those two tags, anyway. And if there is some HTML
embedded inside scripted code, your parser would have to
emulate the scripting interpreter and actually generate all
the possible strings of output that eventually become HTML
when a user interacts with the script before it could perform
a real validation check. That sounds like a very tall order,
especially considering that there are several different
scripting languages in use today.

Just curious. Thanks for your time!  dcb
Received on Monday, 3 March 2003 13:50:00 UTC