Re: Encoding hachek from Jukka K. Korpela on 2015-01-03 (www-validator@w3.org from January 2015)

From: Jukka K. Korpela <jkorpela@cs.tut.fi>
Date: Sat, 03 Jan 2015 17:14:55 +0200
To: webmaster@succulent-plant.com, www-validator@w3.org
Message-ID: <54A8076F.6010505@cs.tut.fi>

2015-01-02, 14:11, Richard J. Hodgkiss wrote:

> I'm trying to get an author's name with a hachek to validate e.g.
> Fri&ccaron;

The problem reported by the validator in HTML5 mode is not caused by the 
háček or by the entity reference &ccaron; used to denote it. The cause 
is the presence of invisible characters near (before) the name.

> as described here:
> http://www.w3schools.com/charsets/ref_utf_latin_extended_a.asp

Generally, the w3schools.com site is unreliable, though to a lesser 
degree than it used to be. It is not affiliated with the W3C in any way. 
This particular page is misleading in two ways. It does not mention the 
limitations in support for the character entities or references, and it 
suggests their use as the primary way.

> However, I can't get it to validate in either HTML 4.01 or HTML 5 using the
> validator although the character displays just fine.
> Fri&#269; doesn't validate either in either HTML 4.01 or HTML 5, although
> it also displays correctly.

The &#269; reference as such validates in both HTML versions, but 
completely different problems nearby may easily create a different 
impression. The &ccaron; reference is valid in HTML5 but not in any 
earlier version. More importantly in practice, browser support for 
&ccaron; is still limited: no support in IE 9 or earlier (the reference 
is rendered literally, not as č), and there are also old versions of 
other browsers in use that lack the support. Thus, &#269; is much safer. 
On the other hand, you can write č as such, since the page is UTF-8 
encoded. How you enter it depends on the authoring environment, but it’s 
surely possible.
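
To make the three spellings concrete, here is a minimal Python sketch 
(Python is my choice here, not something the page in question uses). It 
only shows that &ccaron;, &#269; and a literal č all denote the same 
character, U+010D, and how text can be converted to the safer numeric 
form. The function names are made up for illustration.

```python
import html

NAMED = "Fri&ccaron;"    # named reference, defined in HTML5 only
NUMERIC = "Fri&#269;"    # decimal numeric reference, valid in HTML 4.01 and HTML5
LITERAL = "Fri\u010d"    # the character itself, U+010D (decimal 269)


def decode_references(text):
    """Resolve character references using Python's HTML5 entity table."""
    return html.unescape(text)


def to_numeric_references(text):
    """Replace every non-ASCII character with a decimal numeric reference."""
    return text.encode("ascii", "xmlcharrefreplace").decode("ascii")


print(decode_references(NAMED) == decode_references(NUMERIC) == LITERAL)
print(to_numeric_references(LITERAL))
```

Note that html.unescape follows the HTML5 rules, so it says nothing 
about how an old browser would treat &ccaron;.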

> http://www.succulent-plant.com/families/cactaceae.html

The issue that causes even HTML5 validation to fail is quite independent 
of the č character. Looking at source code line 501 in a suitable editor 
shows that the data contains

F<NULL>r<NULL>i<NULL>&ccaron;

where I have denoted by <NULL> the character U+0000 NULL, i.e. the 
Unicode character with code number 0. It is an invisible control 
character, which is normally ignored by browsers, but it is declared 
invalid in HTML, possibly because it might cause problems in some 
software that processes HTML documents, possibly just because there is 
no possible use for it in HTML.

If you get rid of the NULL characters, the page validates OK as HTML5. You 
can probably achieve this in your authoring program simply by selecting 
the characters from F to & and deleting them and typing Fri instead.
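
If retyping is impractical, the NULL characters can also be located and 
removed programmatically. The following Python sketch is only one way to 
do it; the helper names and the sample string are hypothetical, and the 
sample merely imitates the pattern seen on line 501.

```python
def find_nulls(text):
    """Return 1-based (line, column) pairs for every U+0000 in text."""
    positions = []
    for line_no, line in enumerate(text.split("\n"), start=1):
        for col, ch in enumerate(line, start=1):
            if ch == "\x00":
                positions.append((line_no, col))
    return positions


def strip_nulls(text):
    """Remove every U+0000 character, leaving everything else untouched."""
    return text.replace("\x00", "")


# Hypothetical sample imitating the data described above.
sample = "<p>F\x00r\x00i\x00&ccaron;</p>\n"

print(find_nulls(sample))
print(strip_nulls(sample))
```

For a real file, read it with open(path, encoding="utf-8", newline="") 
so that line breaks are preserved, and check the output of find_nulls 
before writing the stripped text back.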

(The most probable cause of those characters is that a string was 
copied, with copy and paste or programmatically, from UTF-16 encoded 
data. In UTF-16, each basic Latin letter is represented as two bytes, 
the second byte being all zeros. If such data is then interpreted as 
UTF-8, the second byte appears as the NULL character. UTF-16 is, among 
other things, the internal encoding of most character data in Windows.)
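
The effect is easy to reproduce. This Python sketch assumes 
little-endian UTF-16, which is what Windows normally uses; the actual 
origin of the data on the page is of course only a guess.

```python
# Encode "Fri" as little-endian UTF-16: each letter becomes two bytes,
# the letter's ASCII code followed by a zero byte.
utf16_bytes = "Fri".encode("utf-16-le")

# Misinterpret the same bytes as UTF-8: every zero byte decodes as U+0000.
misread = utf16_bytes.decode("utf-8")

# Decoding with the right encoding gives the original text back.
correct = utf16_bytes.decode("utf-16-le")

print(utf16_bytes)
print(repr(misread))
print(correct)
```

The misread string has the same F, NULL, r, NULL, i, NULL pattern as 
the data on line 501.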

Yucca
Received on Saturday, 3 January 2015 15:15:25 UTC