
Re: Problem in publishing multilingual HTML document on web in UTF-8 encoding

From: आशीष शुक्ला \ <wahjava@gmail.com>
Date: Fri, 2 Jun 2006 14:10:51 +0530
Message-ID: <d9a03f10606020140i7038da24t6b02313fd25bf81d@mail.gmail.com>
To: "W3C HTML Mailing List" <www-html@w3.org>
Hi,

On 6/2/06, David Woolley <david@djwhome.demon.co.uk> wrote:
> You've failed to specify what you think the problem is, so I've
> had to try and analyze from the thread you referenced.
Thanks for reading the thread. Before I explain what I mean, let's
take an example:

-- begin example --
You're browsing a book collection, and in the English section you
come across a book that is not in English, i.e. it is misplaced in
the English section. So how do you interpret the contents of the
book? You have three choices:

1. Assume it is English, whether you understand it or not.
2. Check its cover page; maybe the author has mentioned the language of the book.
3. Use your intelligence to guess the language of the book.

Personally, I'd go for the 2nd choice: if the author has mentioned
the language of the book, I'll prefer that over assuming it is an
English-language book just because it is placed in the English
section.

And this is not just for this special case (where a book is misplaced
in the wrong section and I somehow detect that it is misplaced): I'll
check the cover page every time to see whether the author has
explicitly specified the language of the book.
-- end example --

So, the problem I encountered is similar to the one above. I'm
hosting a website on a web server where I have no way to influence
the HTTP headers. The web server always sends my UTF-8 HTML document
as an ISO-8859-1 document, i.e. it says so in the "Content-Type" HTTP
header. As the author of the document, I've properly tagged my
document and followed the guidelines (given in the HTML
specification) for specifying the character set used by my document.

But a web server that doesn't have any autodetection support, or is
not able to detect the document's encoding (probably because the
document's encoding doesn't have any special markers in the header),
sends the document as a default (ISO-8859-1) encoded document. And the
UA (user agent), instead of inspecting the document's "Content-Type"
<meta> tag (if there is one in the document), where the author might
have placed the proper character set information, follows the web
server's response (as specified in the HTML 4.01 specification, which
is incorrect in this case) and displays the document improperly.
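
To make the symptom concrete, here is a minimal Python sketch of what
happens to the text. The page content and the helper name decode_as
are made up for illustration; the only point is that the same UTF-8
bytes come out garbled when decoded with the charset from the HTTP
header, and come out correctly when decoded with the charset from the
<meta> tag.

```python
# Hypothetical page: a UTF-8 document that declares its charset in a <meta> tag.
html = (
    '<meta http-equiv="Content-Type" content="text/html; charset=UTF-8">'
    "<p>आशीष</p>"
)
body = html.encode("utf-8")  # the bytes the server actually sends


def decode_as(data: bytes, charset: str) -> str:
    """Decode the response body the way a UA would for a given charset."""
    return data.decode(charset)


# What the UA shows when it trusts "Content-Type: text/html; charset=ISO-8859-1".
garbled = decode_as(body, "iso-8859-1")

# What the UA would show if it used the charset from the <meta> tag.
correct = decode_as(body, "utf-8")

print(repr(garbled))
print(correct)
```

Nothing is lost on the wire: the bytes are the same in both cases, and
only the label used to interpret them differs.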

So, as a document author, I've followed all the guidelines, but since
I don't have any control over the web server, my document looks
horrible when served from a web server that is not able to detect my
document's character set properly. This means that a document author
has to be a webmaster as well.

Specifying the character encoding (HTML 4.01 specification)
http://www.w3.org/TR/html401/charset.html#idx-character_encoding-7

At the above URL there is a priority list, followed by conforming UAs
in determining a document's character encoding. In my opinion, this
priority list needs to be modified.
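
As a rough sketch of that priority list (not a real UA
implementation), the Python below checks the HTTP "charset" parameter
first and the <meta> declaration second, as HTML 4.01 specifies. The
third priority in the spec, the charset attribute on an element that
links to the document, is left out for brevity. The function names,
the regular expressions, and the "order" parameter are my own
assumptions for illustration; swapping the order is only one possible
reading of the modification suggested here.

```python
import re

# Simplified patterns; a real UA parses headers and markup far more carefully.
CHARSET_RE = re.compile(r'charset\s*=\s*["\']?([A-Za-z0-9._:-]+)', re.IGNORECASE)
META_RE = re.compile(
    rb'<meta[^>]+http-equiv\s*=\s*["\']?content-type["\']?[^>]*>',
    re.IGNORECASE,
)


def charset_from_header(content_type):
    """Return the charset parameter of a Content-Type value, or None."""
    if not content_type:
        return None
    match = CHARSET_RE.search(content_type)
    return match.group(1).lower() if match else None


def charset_from_meta(body):
    """Return the charset declared in a <meta http-equiv> tag, or None."""
    match = META_RE.search(body[:1024])  # only look near the start of the document
    if not match:
        return None
    return charset_from_header(match.group(0).decode("ascii", "replace"))


def determine_charset(content_type, body, order=("http", "meta")):
    """Pick a charset by consulting the sources in the given priority order."""
    sources = {
        "http": charset_from_header(content_type),
        "meta": charset_from_meta(body),
    }
    for name in order:
        if sources[name]:
            return sources[name]
    return None


page = b'<meta http-equiv="Content-Type" content="text/html; charset=UTF-8">'
header = "text/html; charset=ISO-8859-1"

# HTML 4.01 order: the server's header wins, even when it is wrong.
print(determine_charset(header, page))
# Hypothetical modified order: the author's <meta> declaration wins.
print(determine_charset(header, page, order=("meta", "http")))
```

With the HTML 4.01 order the server's label is used even though the
document says otherwise; with the swapped order the author's
declaration is used.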

That's all I want to say.

Sorry for my poor English grammar.

Thanks for reading this mail.
Ashish Shukla
--
Ashish Shukla "Wah Java !!"
आशीष शुक्ला

  ,= ,-_-. =.
 ((_/)o o(\_))
  `-'(. .)`-'
      \_/

My blah, blah, blah at http://wahjava.blogspot.com/
My webpages at http://www.geocities.com/wah_java_dotnet/

My GPG Fingerprint: BBA9 AD7D BA71 61EB BE46 8CF5 E44A C663 A03F 4261

My GPG keys at
http://keyserv.nic-se.se:11371/pks/lookup?op=get&search=0xA03F4261
--
All that looks C00L is not necessarily validable.

                              -- Ashish Shukla "Wah Java !!"
Received on Friday, 2 June 2006 08:40:59 GMT
