- From: Jukka K. Korpela <jkorpela@cs.tut.fi>
- Date: Sat, 03 Jan 2015 17:14:55 +0200
- To: webmaster@succulent-plant.com, www-validator@w3.org
2015-01-02, 14:11, Richard J. Hodgkiss wrote:

> I'm trying to get an author's name with a hachek to validate e.g.
> Fri&ccaron;

The problem reported by the validator in HTML5 mode is not caused by the háček or by the entity reference &ccaron; used to denote it. The cause is the presence of invisible characters near (before) the name.

> as described here:
> http://www.w3schools.com/charsets/ref_utf_latin_extended_a.asp

Generally, the w3schools.com site is unreliable, though to a lesser degree than it used to be. It is not affiliated with the W3C in any way. This particular page is misleading in two ways. It does not mention the limitations on support for the character entities or references, and it suggests their use as the primary way.

> However, I can't get it to validate in either HTML 4.01 or HTML 5 using the
> validator although the character displays just fine.
> ||Fri&#269; doesn't validate either in either HTML 4.01 or HTML 5, although
> it also displays correctly.

The &#269; reference as such validates in both HTML versions, but completely different problems nearby may easily create a different impression.

The &ccaron; reference is valid in HTML5 but not in any earlier version. More importantly in practice, browser support for &ccaron; is still limited: no support in IE 9 or earlier (the reference is rendered literally, not as č), and there are also old versions of other browsers in use that lack the support. Thus, &#269; is much safer.

On the other hand, you can write č as such, since the page is UTF-8 encoded. How you enter it depends on the authoring environment, but it’s surely possible.

> http://www.succulent-plant.com/families/cactaceae.html

The issue that causes even HTML5 validation to fail is quite independent of the č character. Looking at source code line 501 in a suitable editor shows that the data contains F<NULL>r<NULL>i<NULL>&ccaron; where I have denoted by <NULL> the character U+0000 NULL, i.e. the Unicode character with code number 0.
It is an invisible control character, which is normally ignored by browsers, but it is declared invalid in HTML, possibly because it might cause problems in some software that processes HTML documents, possibly just because there is no possible use for it in HTML.

If you get rid of the NULL characters, the page validates OK as HTML5. You can probably achieve this in your authoring program simply by selecting the characters from F to & and deleting them and typing Fri& instead.

(The most probable cause of those characters is that a string was copied, with copy and paste or programmatically, from UTF-16 encoded data. In UTF-16, in the little-endian byte order used on Windows, each basic Latin letter is represented as two bytes, the second byte being all zeros. If such data is then interpreted as UTF-8, the second byte appears as the NULL character. UTF-16 is, among other things, the internal encoding of most character data in Windows.)

Yucca
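
P.S. The UTF-16 explanation can be checked with a few lines of Python. This is only a sketch under stated assumptions: the function names are made up for illustration, little-endian UTF-16 is assumed, and the sample string is a stand-in for the real page data, which may differ in detail. It shows how NULL characters appear when UTF-16 bytes are read as UTF-8, how to find the affected lines, and how to remove the NULLs.

```python
# Sketch only: all names here are illustrative, not part of any existing tool.

def utf16_as_utf8(text):
    """Encode text as little-endian UTF-16, then (wrongly) decode the bytes as UTF-8.

    Meant for basic Latin input, where every second byte is zero.
    """
    return text.encode("utf-16-le").decode("utf-8")


def lines_with_nul(source):
    """Return the 1-based numbers of the lines that contain U+0000."""
    return [n for n, line in enumerate(source.split("\n"), start=1)
            if "\x00" in line]


def strip_nul(source):
    """Remove every U+0000 character from the text."""
    return source.replace("\x00", "")


# Assumed sample: the letters came from UTF-16 data, the reference was typed normally.
damaged = utf16_as_utf8("Fri") + "&ccaron;"
page = "<p>ok</p>\n<p>" + damaged + "</p>\n"

print(repr(damaged))          # shows the NULL after each letter
print(lines_with_nul(page))   # line numbers to inspect in an editor
print(strip_nul(damaged))     # the string as it should have been
```

For a real file, one would read it with open(path, encoding="utf-8") first and write the result of strip_nul back, preferably after making a backup copy.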
Received on Saturday, 3 January 2015 15:15:25 UTC