Re: misidentified language from Michael[tm] Smith on 2016-08-16 (www-validator@w3.org from August 2016)

From: Michael[tm] Smith <mike@w3.org>
Date: Tue, 16 Aug 2016 18:40:48 +0900
To: Erin O'Kelley Muck <erin@rubyslipper.com>
Cc: www-validator@w3.org
Message-ID: <20160816094048.6iqhet2eo3xniy6z@sideshowbarker.net>

Erin O'Kelley Muck <erin@rubyslipper.com>, 2016-08-11 10:29 -0700:
> Archived-At: <http://www.w3.org/mid/2D86FD8C7EDC45B681D15AC39E118E10@DESKTOPHF9LC9J>
> 
> Hello,
> 
> The HTML checker has misidentified my website as being in German, but it is written in English.
> 
> http://dev.lauriesager.com/index.html

Thanks for taking time to report it, and sorry it’s reporting the wrong language.

> I have added the <html lang=”en”> tag, and wanted to report this issue to
> see if there is anything else I can do to fix it?

There’s unfortunately nothing you can do for now to fix it from your side,
but I will work on getting it fixed in the validator asap.

Assuming that http://dev.lauriesager.com/index.html (which I can’t get to)
is essentially the same as http://lauriesager.com/index.html, the problem is
that the validator sees the text content of that page as consisting of about
700 characters, but of those 700 characters the only part the language
detector within the validator sees is this:

  Hecht Holzshu Galgon McIntyre Istel Rodriguez Lambert Dancin Lane Bernard
  Kliejunas-Lubliner Kaegi Day

The language detector doesn’t produce accurate results for text that’s less
than 200 characters or so (especially if the text isn’t actual prose
sentences), and the above text is only about 100 characters. And the language
detector doesn’t know anything about words or names; instead it basically
uses data on the frequency of certain combinations of letters in particular
languages, and in this case it is seeing combinations of letters that make
it guess that it looks more like German than anything else.
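
As a rough illustration of what "combinations of letters" means here, the
sketch below counts overlapping three-letter sequences (trigrams) in a piece
of text. It is not the detector's actual code: the choice of trigrams and the
names TrigramSketch and trigramCounts are assumptions made for this example.
A real detector would compare such counts against per-language frequency
data, which is omitted here.

```java
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;

// Hypothetical illustration only; not taken from the validator sources.
class TrigramSketch {

    // Count overlapping three-letter sequences in the lowercased text,
    // skipping any window that contains a non-letter (space, hyphen, ...).
    static Map<String, Integer> trigramCounts(String text) {
        Map<String, Integer> counts = new HashMap<>();
        String s = text.toLowerCase(Locale.ROOT);
        for (int i = 0; i + 3 <= s.length(); i++) {
            String gram = s.substring(i, i + 3);
            if (gram.chars().allMatch(Character::isLetter)) {
                counts.merge(gram, 1, Integer::sum);
            }
        }
        return counts;
    }

    public static void main(String[] args) {
        // Prints the trigram counts for two of the names from the page.
        System.out.println(trigramCounts("Hecht Holzshu"));
    }
}
```

With only a handful of names as input, a profile like this contains very few
trigrams, which is why a guess based on it is unreliable.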

Because the language detector doesn’t operate well on short amounts of
text, I have the validator configured to only try language detection if the
text is longer than a certain minimum number of characters (currently 512
characters), but the problem is that the count also includes newlines and
some other whitespace characters that rightly should be ignored.
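
A minimal sketch of the kind of check described above, under stated
assumptions: it simply leaves all whitespace out of the count, which is a
simplification (a real fix might still count the spaces between words). The
names MinLengthSketch, countNonWhitespace, and longEnough are hypothetical,
and whether the threshold comparison is ">" or ">=" is a guess; only the
value 512 comes from the message.

```java
// Hypothetical illustration only; not taken from the validator sources.
class MinLengthSketch {

    // Threshold mentioned in the message.
    static final int MIN_CHARS = 512;

    // Count the characters that are not whitespace (newlines, tabs, spaces).
    static int countNonWhitespace(String text) {
        int n = 0;
        for (int i = 0; i < text.length(); i++) {
            if (!Character.isWhitespace(text.charAt(i))) {
                n++;
            }
        }
        return n;
    }

    // Assumption: detection is attempted once the count reaches MIN_CHARS.
    static boolean longEnough(String text) {
        return countNonWhitespace(text) >= MIN_CHARS;
    }

    public static void main(String[] args) {
        String sample = "Hecht\n   Holzshu\n\tGalgon";
        System.out.println(countNonWhitespace(sample) + " significant characters");
        System.out.println("long enough: " + longEnough(sample));
    }
}
```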

Anyway, I’ll try to figure out soon how to make the checker a bit smarter
about that, but in the meantime you can safely ignore that warning.

  —Mike

P.S. If anybody on this mailing list has insight into dealing with
inter-element whitespace in SAX, the specific problem I have is in this code:

  https://github.com/validator/validator/blob/master/src/nu/validator/xml/LanguageDetectingXMLReaderWrapper.java#L171

I would like to be able to make that ContentHandler.characters method
ignore inter-element whitespace. I know about
ContentHandler.ignorableWhitespace, but as far as I understand, that does
not really help me in this case.
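
One possible approach, sketched under assumptions and not taken from the
validator code: a handler whose characters method drops any chunk that
consists only of whitespace. The class name NonWhitespaceCollector and its
collect helper are hypothetical. Note the caveat that SAX parsers are allowed
to split contiguous text into several characters calls, so a whitespace-only
chunk is not guaranteed to be inter-element whitespace; this is a heuristic.

```java
import java.io.StringReader;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical illustration only; not taken from the validator sources.
class NonWhitespaceCollector extends DefaultHandler {

    private final StringBuilder text = new StringBuilder();

    @Override
    public void characters(char[] ch, int start, int length) {
        // Heuristic: ignore chunks made up entirely of whitespace, which is
        // what typically appears between elements in indented markup.
        for (int i = start; i < start + length; i++) {
            if (!Character.isWhitespace(ch[i])) {
                text.append(ch, start, length);
                return;
            }
        }
    }

    String collected() {
        return text.toString();
    }

    // Parse an XML string and return the text that survived the filter.
    static String collect(String xml) {
        NonWhitespaceCollector handler = new NonWhitespaceCollector();
        try {
            SAXParserFactory.newInstance().newSAXParser()
                    .parse(new InputSource(new StringReader(xml)), handler);
        } catch (Exception e) {
            throw new IllegalStateException("could not parse sample XML", e);
        }
        return handler.collected();
    }

    public static void main(String[] args) {
        String xml = "<a>\n  <b>Hecht</b>\n  <c>Holzshu</c>\n</a>";
        System.out.println(collect(xml).length() + " characters kept");
    }
}
```

The length of the collected text, rather than the raw character count, could
then be compared against the minimum before language detection is attempted.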

-- 
Michael[tm] Smith https://people.w3.org/mike

Received on Tuesday, 16 August 2016 09:42:08 UTC