Re: Documentation of Character Coding from Jukka K. Korpela on 2005-10-20 (www-validator@w3.org from October 2005)

From: Jukka K. Korpela <jkorpela@cs.tut.fi>
Date: Thu, 20 Oct 2005 11:04:03 +0300 (EEST)
To: Meg Crockett <meg_crockett@yahoo.com>
Cc: www-validator@w3.org
Message-ID: <Pine.GSO.4.63.0510201037270.21037@korppi.cs.tut.fi>

On Wed, 19 Oct 2005, Meg Crockett wrote:

> I can only get a tentative because it cannot
> find character encoding.

The message about "tentatively valid" is very confusing, and also 
incorrect. A more appropriate wording would be that the document is valid 
when interpreted in the encoding such-and-such. The fact that the encoding 
should be specified for a document on the Web is external to the question 
of validation. When the encoding has not been specified, the validator 
cannot know whether it has correctly interpreted the task given to it.

I have no idea why the validator thinks it does not know the encoding of
an _XHTML_ document, or more exactly an XML document, since for XML,
the encoding is defined by the specifications anyway: in the absence of
any declaration, it is UTF-8 or UTF-16. (It might not be the encoding
that the author really meant, but that's a different story.) For good
old HTML, or SGML, the situation is different.

> But I simply cannot
> understand your documentation.  I don't want long
> descriptions, or slide shows.  I want about 3 examples
> which are exactly how they should appear in the code
> including brackets.

Then you have really misunderstood the documentation. The FAQ entry at
http://validator.w3.org/docs/help.html#faq-charset
tries to tell you that this is _not_ a matter of adding some tags into 
your document. The correct way to handle the issue is to make the server 
(or, in the case of validation by file submit, the browser) specify the 
encoding, and this is inevitably a server-specific issue. The answer might 
be very simple (it usually is), or somewhat complicated, or even in the 
negative (you cannot do it because the server admin prevents it).
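
To make this concrete: what the server sends is an HTTP header such as
"Content-Type: text/html; charset=iso-8859-1". The following Python sketch
shows one way to pull the charset parameter out of such a header value.
The function name charset_from_content_type is made up for this
illustration, and the parsing is deliberately simplified; it does not
cover every quoting rule that HTTP allows.

```python
def charset_from_content_type(value):
    """Return the charset parameter of a Content-Type value, or None."""
    # The media type comes first; parameters follow, separated by ";".
    parts = value.split(";")
    for param in parts[1:]:
        name, sep, val = param.partition("=")
        if sep and name.strip().lower() == "charset":
            # Drop optional surrounding quotes and normalise the case.
            return val.strip().strip('"').lower()
    return None


print(charset_from_content_type("text/html; charset=ISO-8859-1"))
print(charset_from_content_type("text/html"))
```

The header value itself could be obtained, for example, with
urllib.request.urlopen(url).headers.get("Content-Type"). A result of None
corresponds to the case where the server has not specified any encoding.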

I'm pretty sure that the documentation intentionally avoids saying the 
following. I can understand the reasons behind this, and I mostly agree.
But people will find this information anyway, so maybe it would be better 
to include it, since then you could accompany it with warnings and 
caveats.

Add the following to your <head> element:

<meta http-equiv="Content-Type" content="text/html;charset=iso-8859-1">

Replace iso-8859-1 by whichever encoding you are using. Replace the
closing ">" by " />" if (and only if) you declare an XHTML document type;
in that case, also start your document with

<?xml version="1.0" encoding="iso-8859-1" ?>
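
As a rough illustration of what the XML declaration does, the following
Python sketch feeds the same bytes to the standard library's XML parser
with and without the declaration. The helper name parse_text and the
sample documents are invented for this example; a real XHTML page would
of course be longer.

```python
import xml.etree.ElementTree as ET

# The byte 0xE9 is "e with acute accent" in ISO-8859-1,
# but it is not valid UTF-8 on its own.
declared = b'<?xml version="1.0" encoding="iso-8859-1" ?><p>caf\xe9</p>'
undeclared = b'<p>caf\xe9</p>'


def parse_text(xml_bytes):
    """Return the root element's text, or None if the parser rejects the input."""
    try:
        return ET.fromstring(xml_bytes).text
    except ET.ParseError:
        return None


print(parse_text(declared) is not None)    # parsed using the declared encoding
print(parse_text(undeclared) is not None)  # rejected: the parser assumes UTF-8
```

Without the declaration, the parser falls back to the default that the
XML specification defines, and bytes in another encoding are rejected.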

> Then I'd like about 2 other
> examples of how you might slightly modify these if
> your situation were different than the first three.
> Then a table to use to find all the likely differences
> so you could tell what to put in if you live in outer
> Mongolia and have some obsure system, or whatever
> other considerations one needs to include.

Well, you need to know the encoding you are using. The validator cannot
really tell you such things; _you_ need to tell _it_ the encoding. If you
need help finding out what encoding your authoring tool produces,
then you could look at its documentation.
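
If the documentation does not help, a crude check is to see whether the
bytes of the file can be decoded in a candidate encoding at all. The
Python sketch below does just that; the function name decodes_as and the
sample bytes are assumptions made for this illustration. Note that a
successful decode proves nothing by itself: ISO-8859-1, for one, accepts
any sequence of bytes, so this can only rule encodings out.

```python
def decodes_as(data, encoding):
    """Return True if the bytes can be decoded in the given encoding."""
    try:
        data.decode(encoding)
    except (UnicodeDecodeError, LookupError):
        # LookupError covers encoding names that Python does not know.
        return False
    return True


sample = b"caf\xe9"  # "cafe" with an acute accent, saved as ISO-8859-1
print(decodes_as(sample, "utf-8"))
print(decodes_as(sample, "iso-8859-1"))
```

For a real page, the bytes would be read from the file in binary mode,
for example with open(path, "rb").read().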

> I can't help but think that more people would validate
> their pages if this character encoding lack of
> documentation were not such a huge hurtle.

I doubt that. Most of the commonly used authoring tools actually spit
out a <meta> tag that makes the validator happy, though the information
in it could actually be wrong sometimes.

> I cannot
> believe character encoding is really that complicated.

It's actually much more complicated. However, once you've found out which 
encoding you are using and how to check that your server sends the right 
information about it, it's very simple. The problem is that the simple 
answers to these questions vary by the author's situation.

-- 
Jukka "Yucca" Korpela, http://www.cs.tut.fi/~jkorpela/
Received on Thursday, 20 October 2005 08:04:09 UTC