Re: Validation of 'http://ahangama.com/election/whatnext-s.htm' from Jukka K. Korpela on 2016-11-21 (www-validator@w3.org from November 2016)

From: Jukka K. Korpela <jkorpela@cs.tut.fi>
Date: Mon, 21 Nov 2016 15:11:17 +0200
To: www-validator@w3.org
Message-ID: <3baab449-68e5-ab04-eb95-02a160adf9bd@cs.tut.fi>

21.11.2016, 3:19, JC Ahangama wrote:

> That page is written in Romanized Singhala, and rendered in the native
> script using an Orthographic Smartfont.

The page http://ahangama.com/election/whatnext-s.htm appears to be 
written in Sinhala, using the Sinhala alphabet (script), code Sinh, but 
using a technique based on “font trickery”: a special 8-bit font, 
containing ASCII in the lower range (0..0x7F) and Sinhala letters 
(perhaps in the same order as in the Sinhala block in Unicode) in the 
upper range. This trickery is entirely based on the assumption that 
browsers will use that special font. These days, the assumption can be 
satisfied more often than in the old days, as you can use @font-face to 
embed it, reaching near (but not quite) 100% coverage.
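
To make the two ranges concrete, here is a minimal Python sketch. It 
assumes the conventional split of an 8-bit code into an ASCII half 
(0..0x7F) and an upper half (0x80..0xFF); the function name slot_half 
is made up for this illustration, and nothing here reflects how the 
actual font is built.

```python
ASCII_MAX = 0x7F      # top of the lower (ASCII) half of an 8-bit code
EIGHT_BIT_MAX = 0xFF  # top of the upper half


def slot_half(ch):
    """Return 'lower', 'upper', or 'outside' for a character's code point."""
    cp = ord(ch)
    if cp <= ASCII_MAX:
        return "lower"
    if cp <= EIGHT_BIT_MAX:
        return "upper"
    return "outside"


# Two ASCII letters, three Latin-1 letters, and one real Sinhala letter
# (U+0D85), which cannot fit into any 8-bit slot.
for ch in "aL\u00f0\u00fe\u00e6\u0d85":
    print(f"U+{ord(ch):04X} {slot_half(ch)}")
```

A character from the Sinhala block in Unicode falls outside both halves, 
which is why an 8-bit scheme has to borrow the slots of other characters.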

> I feel the meta data, lang='si-Latn' conveys the correct information.

It does not, and neither does the element <meta charset="utf-8">. First, 
there is no defined Latin (Roman) writing system for Sinhala, I’m 
afraid, so lang="si-Latn" is misleading. Second, the data is not in fact 
UTF-8 encoded. Interpreted as UTF-8 data, its <body> content starts with
“[2016-11-14] (akuru loku karanna bravsara kavuLuvee ðakuNu agin 
allaagena mehi æþi”
I don’t think any official or unofficial writing system for Sinhala uses 
Icelandic letters “ð” and “þ”.
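
A rough way to see what software sees in that text is to ask Unicode 
which script each letter belongs to. The following Python sketch uses 
the standard unicodedata module; taking the first word of the character 
name as the script is a simplification chosen for this example, and the 
helper names are hypothetical.

```python
import unicodedata

# The start of the <body> content as quoted above (date prefix omitted).
SAMPLE = ("akuru loku karanna bravsara kavuLuvee \u00f0akuNu agin "
          "allaagena mehi \u00e6\u00fei")


def script_summary(text):
    """Count letters per script, guessed from the Unicode character name."""
    counts = {}
    for ch in text:
        if not ch.isalpha():
            continue
        # Names look like "LATIN SMALL LETTER ETH" or "SINHALA LETTER AYANNA".
        script = unicodedata.name(ch, "UNKNOWN").split()[0]
        counts[script] = counts.get(script, 0) + 1
    return counts


def in_sinhala_block(ch):
    """True if the character lies in the Sinhala block, U+0D80..U+0DFF."""
    return 0x0D80 <= ord(ch) <= 0x0DFF


print(script_summary(SAMPLE))
print(any(in_sinhala_block(ch) for ch in SAMPLE))
```

Every letter in the sample is reported as Latin, and none lies in the 
Sinhala block, so a language guesser has nothing Sinhala to work with.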

> However, I have not registered this notation. Please help me to do it
> properly.

I don’t think that’s the solution.

The options as I see them are:

1) Keep doing what you have done and ignore the warning. It is, after 
all, just a warning message from an experimental checker, caused by 
experimental language-guessing, which is known to guess wrong rather 
often (though here the reason is that the content, interpreted according 
to the metadata of the document, is not in any human language, and the 
guesser just makes a wild guess).

2) Switch to using UTF-8 encoded Sinhala characters (and use just 
lang="si"). This is nontrivial, as most of the page content needs to be 
recoded. If you think you need to embed a font for them, try and find a 
Unicode font that contains them (properly assigned to the correct code 
points).
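
As a sketch of what option 2 amounts to, the Python fragment below 
builds a tiny page whose text consists of real Sinhala code points and 
serializes it as UTF-8. The word used is "Sinhala" written in Sinhala 
script (U+0DC3 U+0DD2 U+0D82 U+0DC4 U+0DBD); the function name 
build_page and the page skeleton are assumptions for illustration, not 
a conversion of the actual page.

```python
# "Sinhala" in Sinhala script, written with escapes so that the source
# file itself stays pure ASCII.
SINHALA_WORD = "\u0dc3\u0dd2\u0d82\u0dc4\u0dbd"


def build_page(body_text):
    """Return a minimal HTML document declaring lang="si" and UTF-8."""
    return (
        "<!DOCTYPE html>\n"
        '<html lang="si">\n'
        '<head><meta charset="utf-8"><title>Example</title></head>\n'
        f"<body><p>{body_text}</p></body>\n"
        "</html>\n"
    )


page_bytes = build_page(SINHALA_WORD).encode("utf-8")

# Each character in the Sinhala block takes three bytes in UTF-8.
print(len(SINHALA_WORD), len(SINHALA_WORD.encode("utf-8")))  # 5 15

# The bytes decode back to the same text, so the declared encoding is true.
assert page_bytes.decode("utf-8") == build_page(SINHALA_WORD)
```

With content like this, the lang="si" and charset declarations describe 
the data as it really is, and any Unicode font with Sinhala coverage 
can render it.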

Yucca

Received on Monday, 21 November 2016 13:11:49 UTC