Re: Let's fix, not remove, the fuzzy matching feature from Ville Skyttä on 2009-12-04 (www-validator@w3.org from December 2009)

From: Ville Skyttä <ville.skytta@iki.fi>
Date: Fri, 4 Dec 2009 19:22:08 +0200
To: www-validator@w3.org
Cc: Olivier Thereaux <ot@artbeat.me>
Message-Id: <200912041922.09524.ville.skytta@iki.fi>

On Thursday 03 December 2009, Olivier Thereaux wrote:
> Hi Ville, hi everyone,

Hi Olivier, nice to hear from you,

> One thing that surprised me however, was the line stating:
> [[  Removed feature: the "fuzzy matching" feature introduced in 0.8.5
> has been removed because it produced too many confusing and invalid
> suggestions.  ]]
> This sounds like a case of throwing the baby with the bathwater. Is
> there any way we could work together to help fix/improve the feature?

I think it'll require quite a bit of work, and I can't personally promise to 
be available for that.  If someone wants to spend time doing the initial fixes 
and be available for maintaining the feature, I have no problem with that.  
But I'm not at all convinced that this is feasible.

What needs to be done is that validator needs to know the possible, _valid_ 
choices it can suggest for each suspected misspelling.  I don't think it can 
ever work well enough if it uses simple flat lists of tag and attribute names 
that can be valid when they occur somewhere in a document, which is how the 
feature was implemented.

For example, consider <p HREF="foo">foo</p> in an XHTML document (easy to test 
with direct fragment input with XHTML 1.0 Strict).  Validator suggests 'Did 
you mean "href"', which is just as bogus as the original.  If it doesn't know 
exactly what attributes are valid for <p> in XHTML 1.0 Transitional but just 
looks up the closest match for HREF from its flat list, it will always 
continue to give bad suggestions.  
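
A minimal Python sketch of that difference, not the validator's actual code: 
the tables and function names below are hypothetical and heavily abridged, and 
difflib's similarity ratio merely stands in for whatever matching the feature 
really used.

```python
from difflib import get_close_matches

# Hypothetical, heavily abridged tables; real ones would be derived from the DTD.
FLAT_ATTRIBUTES = ["href", "id", "class", "style", "title", "lang"]
ATTRIBUTES_BY_ELEMENT = {
    "a": ["href", "id", "class", "style", "title", "lang"],
    "p": ["id", "class", "style", "title", "lang"],
}

def suggest_flat(attr):
    """Closest matches from one flat list, ignoring the element."""
    return get_close_matches(attr.lower(), FLAT_ATTRIBUTES, n=2, cutoff=0.6)

def suggest_for_element(element, attr):
    """Closest matches among attributes that are valid on this element only."""
    candidates = ATTRIBUTES_BY_ELEMENT.get(element, [])
    return get_close_matches(attr.lower(), candidates, n=2, cutoff=0.6)
```

With these tables, suggest_flat("HREF") offers "href", while 
suggest_for_element("p", "HREF") offers nothing, which is the better outcome 
for <p>.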

Similarly, <foo/> alone, again using direct fragment input with XHTML 1.0, 
results in 'Did you mean "tfoot" or "form"'.  The tfoot suggestion is bogus as 
it can't occur outside of a table, but as validator again uses a flat list of 
tag names that are valid somewhere without any context, it just suggests it.  
(The "form" suggestion here is a better one, but that's just luck.)

Similarly, the error message for <objetc> in an HTML 3.2 document is 'element 
"OBJETC" undefined. Did you mean "object"?', but the object element cannot 
occur anywhere in an HTML 3.2 document.

I think the only way to fix this properly is to make validator know the valid 
possibilities at each position of a document where it is about to make the 
suggestions, and use only those.  It doesn't necessarily need to know /all/ 
possible valid alternatives for every position, but the ones it ends up 
suggesting must be valid.

It should also take other things into account.  For example, given <a 
href="..." ref="..."> it should not include "href" in its fix suggestions for 
"ref", because the href attribute was already specified.  Ditto for 
<table><the/><thead/></table>: it shouldn't suggest "thead" for "the" because 
thead is already there (later), and there can be only one in a table.
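
As a rough illustration of the attribute case, here is a Python sketch under 
stated assumptions: the attribute list is a hypothetical, abridged one for 
<a>, suggest_attribute is an invented helper, and difflib is only a stand-in 
matcher.

```python
from difflib import get_close_matches

# Hypothetical, abridged list of attributes allowed on <a>.
A_ATTRIBUTES = ["href", "rel", "rev", "name", "id", "class", "title"]

def suggest_attribute(misspelled, allowed, already_present):
    """Suggest close matches, skipping attributes the tag already has."""
    taken = {name.lower() for name in already_present}
    candidates = [name for name in allowed if name not in taken]
    return get_close_matches(misspelled.lower(), candidates, n=2, cutoff=0.6)

# <a href="..." ref="...">: "href" is taken, so it is never offered for "ref".
suggestions = suggest_attribute("ref", A_ATTRIBUTES, already_present=["href"])
```

The same kind of filter could apply to elements that may occur only once in a 
parent, such as thead, but that requires seeing all of the parent's children 
(including later ones) before suggesting anything.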

Some kind of distance threshold should be applied to suggestions as well 
(preferably taking semantic similarity into account, not just raw string 
distance), so that validator doesn't suggest something completely different 
from what was written in the document.  Gems like this one simply cannot 
happen if you ask me (even if the suggestions were valid, which they obviously 
aren't in this case): <html xmlns="http://www.w3.org/1999/xhtml"> in an HTML 
3.2 document yields 'Attribute "XMLNS" is not a valid attribute. Did you mean 
"onmouseup" or "onmouseover"?'

So we'd need these lists - one for each doctype for which this feature is 
supported - and quite probably some code changes, at least for element name 
suggestions, so there's enough context to look up the valid alternatives 
from.  I have a feeling that I'm still missing some things that would need to 
be done for the feature to work acceptably, but these are already enough for 
me personally to consider trying to do it not worth my time in the foreseeable 
future.
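
To make the shape of such lists concrete, here is a Python sketch keyed by 
doctype and then by parent element.  Everything in it is an assumption for 
illustration: the table is heavily abridged, suggest_element is an invented 
name, and difflib stands in for the real matcher.

```python
from difflib import get_close_matches

# Hypothetical, heavily abridged content models: doctype -> parent -> children.
ALLOWED_CHILDREN = {
    "HTML 3.2": {
        "div": ["p", "form", "table", "applet"],
        "table": ["caption", "tr"],
    },
    "XHTML 1.0 Strict": {
        "div": ["p", "form", "table", "object"],
        "table": ["caption", "thead", "tfoot", "tbody", "tr"],
    },
}

def suggest_element(doctype, parent, name, cutoff=0.6):
    """Suggest only elements that may occur inside parent in this doctype."""
    candidates = ALLOWED_CHILDREN.get(doctype, {}).get(parent, [])
    return get_close_matches(name.lower(), candidates, n=2, cutoff=cutoff)
```

With this table, "objetc" inside a div gets "object" in XHTML 1.0 Strict but 
no suggestion in HTML 3.2, and "tfoot" is offered for "foo" only when the 
parent is a table.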

> Were there other bugs reported?

Yes, there have been more than a few posts on this list about it, and IIRC 
some reports in Bugzilla too.

> Even if the feature, as it is, may not be perfect, I strongly believe
> that removing it goes strikingly against the effort made in the past
> years to make the validator more usable by newcomers to HTML (more
> suggestion, more help, fewer harsh messages) and it would hurt to remove
> it without trying to improve it, or replace it.

I've tried (and managed to fix a few cases), but those are just scratches on 
the surface of a bigger pile of problems in my opinion.  My strong opinion is 
that if there's a possibility for the feature to give an incorrect suggestion, 
its net effect is worse than if the feature did not exist at all, especially 
for newcomers.  Because of that, and because I seem to be the only one working 
on the validator these days, my call was to remove the feature, and I think 
it's the right one until patches that fix the fundamental issues prove me 
wrong :).  In any case, I also think that validator 0.8.6 should be released 
(soon) without this feature; it can always be brought back later if someone 
gets it to work.
Received on Friday, 4 December 2009 17:22:45 UTC