Re: Validator error from Michael[tm] Smith on 2014-11-10 (www-validator@w3.org from November 2014)

From: Michael[tm] Smith <mike@w3.org>
Date: Mon, 10 Nov 2014 12:07:24 +0900
To: "Jukka K. Korpela" <jkorpela@cs.tut.fi>, Roman Grinyov <w3lifer@gmail.com>
Cc: www-validator@w3.org
Message-ID: <20141110030724.GQ4173@jay.w3.org>

Hi Jukka,

> Date: Mon, 27 Oct 2014 00:29:57 +0200
> From: "Jukka K. Korpela" <jkorpela@cs.tut.fi>
> Archived-At: <http://www.w3.org/mid/544D75E5.9030602@cs.tut.fi>
...
> The culprit appears to be on line 48:
> 
>   &lt;p&gt;10 строка | &w3 &lt;/p&gt;
> 
> Validating this line in isolation, with a minimal document around it,
> results in a correct message that points to the “&w3” construct.
> 
> The bug in the validator is that it does not report this properly at
> all in the given context but instead flags completely correct
> character references *before* it as erroneous.
> 
> The bug is reproducible at http://validator.nu too.

Thanks for examining this, and thanks to Roman for reporting it. It's
definitely a bug.

The message in this case is coming from the HTML parser, but I can't
reproduce it in "View source" in Firefox (which uses the same HTML parser):

  view-source:http://websnippets.ru/article.php?id=30
  (mouse over the "&w3")

...so it seems to be a problem specific to the validator's usage of the
HTML parser.

This is minimally reproducible with the following document:

  <!doctype html><title>test</title>&gt;<textarea>&w3</textarea>

  http://validator.w3.org/nu/?showsource=yes&doc=data%3Atext%2Fhtml%3Bcharset%3Dutf-8%2C%3C%2521doctype%2520html%3E%3Ctitle%3Etest%3C%252Ftitle%3E%2526gt%253B%3Ctextarea%3E%2526w3%3C%252Ftextarea%3E
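For anyone who wants to try other variations of this test case, here is a
rough Python sketch that builds the same kind of validator URL from a
document string. The function name `validator_url` is made up for
illustration, and the choice of which characters to leave unescaped is
inferred from the URLs in this message, not taken from any validator
documentation.

```python
from urllib.parse import quote

# Base of the checker URL used in this message.
VALIDATOR_BASE = "http://validator.w3.org/nu/?showsource=yes&doc="


def validator_url(html: str) -> str:
    """Wrap `html` in a data: URI and pass it as the `doc` parameter.

    Hypothetical helper. The document is percent-encoded twice: once
    inside the data: URI (leaving "<" and ">" alone, as the URLs in this
    message do), and once more as a query-string value.
    """
    inner = quote(html, safe="<>")
    data_uri = "data:text/html;charset=utf-8," + inner
    return VALIDATOR_BASE + quote(data_uri, safe="")


# The textarea test case from this message.
print(validator_url(
    "<!doctype html><title>test</title>&gt;<textarea>&w3</textarea>"
))
```

Swapping the `textarea` for some other element in the string passed to the
function gives a URL for comparing the two behaviors.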

If you replace the `textarea` with a `span` or whatever, you can't reproduce
it. That makes some sense because `textarea` elements have a special code
path in the parser, along with `title` elements.

So I kinda expect the core problem here is, the validator code isn't
passing on line-number info correctly to the parser when processing
`textarea` and `title` elements. Here's an even more minimal case:

  <!doctype html><title>&w3</title>

  http://validator.w3.org/nu/?showsource=yes&doc=data%3Atext%2Fhtml%3Bcharset%3Dutf-8%2C%3C%2521doctype%2520html%3E%3Ctitle%3E%2526w3%3C%252Ftitle%3E

For that case, the validator just reports "Error: & did not start a
character reference. (& probably should have been escaped as &amp;.)",
without reporting line+col numbers at all or flagging the position.

So I think the root cause of the problem Roman ran into is that the
validator doesn't have any line-number info to report in this case, and
then the parser's character-reference reporting isn't getting
re-initialized correctly, so it reports the position of the last character
reference it checked that did have line+col numbers.
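
To make that guess concrete, here is a toy Python model of the suspected
failure mode. It is only an illustration of the hypothesis: `ToyReporter`
and everything in it are invented for this sketch, the positions are
arbitrary, and none of it is taken from the validator or parser source.

```python
from dataclasses import dataclass
from typing import Optional, Tuple

# (line, column), or None when no position info is available.
Position = Optional[Tuple[int, int]]


@dataclass
class ToyReporter:
    """Hypothetical stand-in for error reporting; not the real code."""

    reset_between_checks: bool
    last_position: Position = None

    def check(self, position: Position, is_error: bool) -> Optional[str]:
        """Record the position of a character reference and report it if bad."""
        if self.reset_between_checks:
            # The behavior one would want: forget the previous position.
            self.last_position = None
        if position is not None:
            self.last_position = position
        if not is_error:
            return None
        if self.last_position is None:
            return "error (no position)"
        line, col = self.last_position
        return f"error at line {line}, column {col}"


# Suspected behavior: a valid reference with a known position, followed
# by a bad one (inside textarea/title) with no position info.
stale = ToyReporter(reset_between_checks=False)
stale.check((1, 35), is_error=False)       # a valid "&gt;"
print(stale.check(None, is_error=True))    # blames the earlier "&gt;"

# With a reset, the bad reference is at least not pinned on the wrong spot.
fresh = ToyReporter(reset_between_checks=True)
fresh.check((1, 35), is_error=False)
print(fresh.check(None, is_error=True))
```

In this model, a document with no earlier character reference yields an
error with no position at all, which matches what the `title`-only test
case shows.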

Anyway, I've filed a bug http://bugzilla.validator.nu/show_bug.cgi?id=1010
and I'll try to make some time soon to investigate the code around this.

  --Mike

-- 
Michael[tm] Smith https://people.w3.org/mike

Received on Monday, 10 November 2014 03:07:26 UTC