- From: Michael[tm] Smith <mike@w3.org>
- Date: Tue, 16 Aug 2016 18:40:48 +0900
- To: Erin O'Kelley Muck <erin@rubyslipper.com>
- Cc: www-validator@w3.org
- Message-ID: <20160816094048.6iqhet2eo3xniy6z@sideshowbarker.net>
Erin O'Kelley Muck <erin@rubyslipper.com>, 2016-08-11 10:29 -0700:
> Archived-At: <http://www.w3.org/mid/2D86FD8C7EDC45B681D15AC39E118E10@DESKTOPHF9LC9J>
>
> Hello,
>
> The HTML checker has misidentified my website as being in German, but it is written in English.
>
> http://dev.lauriesager.com/index.html

Thanks for taking time to report it, and sorry it’s reporting the wrong language.

> I have added the <html lang=”en”> tag, and wanted to report this issue to
> see if there is anything else I can do to fix it?

There’s unfortunately nothing you can do for now to fix it from your side, but I will work on getting it fixed in the validator asap.

Assuming that http://dev.lauriesager.com/index.html (which I can’t get to) is essentially the same as http://lauriesager.com/index.html the problem is the validator sees the text content of that page as consisting of about 700 characters, but of those 700 characters the only part the language detector within the validator sees is this:

  Hecht Holzshu Galgon McIntyre Istel Rodriguez Lambert Dancin Lane Bernard Kliejunas-Lubliner Kaegi Day

The language detector doesn’t produce accurate results for text that’s less than 200 characters or so, especially if the text isn’t actual prose sentences, and the above text is only about 100 characters.

And the language detector doesn’t know anything about words or names; instead it basically uses data on the frequency of certain combinations of letters in particular languages, and in this case it is seeing combinations of letters that make it guess that the text looks more like German than anything else.

Because the language detector does not operate well on short amounts of text, I have the validator configured to only try language detection if the text is longer than a certain minimum number of characters (currently 512 characters), but the problem is that the count also includes newlines and some other whitespace characters that rightly should be ignored.

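To make the threshold problem concrete, here is a minimal Java sketch of a SAX handler that keeps two tallies: every character delivered to characters(), and only the non-whitespace ones. It is an illustration under stated assumptions, not the validator's code: the class name NonWhitespaceCounter, the helper count(), and the constant MIN_CHARS are hypothetical, and only the 512 figure comes from the message above. It also drops all whitespace, including spaces between words, which is cruder than ignoring inter-element whitespace alone.

```java
import java.io.StringReader;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical illustration; not the validator's actual class.
class NonWhitespaceCounter extends DefaultHandler {
    // 512 is the minimum quoted in the message; the constant name is made up.
    static final int MIN_CHARS = 512;

    private int total;          // every char reported through characters()
    private int nonWhitespace;  // only chars that are not whitespace

    @Override
    public void characters(char[] ch, int start, int length) {
        total += length;
        for (int i = start; i < start + length; i++) {
            if (!Character.isWhitespace(ch[i])) {
                nonWhitespace++;
            }
        }
    }

    int total() { return total; }

    int nonWhitespace() { return nonWhitespace; }

    // Compare the threshold against the whitespace-free count only.
    boolean longEnoughForDetection() { return nonWhitespace >= MIN_CHARS; }

    // Parse an XML string with the JDK's default SAX parser and return the tallies.
    static NonWhitespaceCounter count(String xml) {
        NonWhitespaceCounter handler = new NonWhitespaceCounter();
        try {
            SAXParserFactory.newInstance().newSAXParser()
                    .parse(new InputSource(new StringReader(xml)), handler);
        } catch (Exception e) {
            throw new IllegalStateException("parse failed", e);
        }
        return handler;
    }

    public static void main(String[] args) {
        String xml = "<html>\n  <body>\n    <p>Hecht Holzshu</p>\n  </body>\n</html>";
        NonWhitespaceCounter c = count(xml);
        System.out.println("total=" + c.total()
                + " nonWhitespace=" + c.nonWhitespace());
    }
}
```

For the small document in main, the indentation and newlines between tags reach characters() as ordinary text because no DTD is in play, so the raw total is roughly double the count of letters that a language detector could actually use.
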
Anyway, I’ll try to figure out soon how to make the checker a bit smarter about this, but in the meantime you can safely ignore that warning.

--Mike

P.S. If anybody on this mailing list has insight into dealing with inter-element whitespace in SAX, the specific problem I have is in this code:

https://github.com/validator/validator/blob/master/src/nu/validator/xml/LanguageDetectingXMLReaderWrapper.java#L171

I would like to be able to make that ContentHandler.characters method ignore inter-element whitespace. I know about ContentHandler.ignorableWhitespace but as far as I understand that does not really help me in this case.

-- 
Michael[tm] Smith https://people.w3.org/mike
Received on Tuesday, 16 August 2016 09:42:08 UTC