Re: XHTML, Japanese text, non SGML character error message

Marty Cawthon <mrc@ChipChat.com> wrote:

>   When the XHTML validator checks a document (a 'forum' page) 
> that I am working on it reports error messages for some Japanese text:
> "non SGML character 130".
		(snip)
>   It may be that the document does contain non SGML characters, in which case
> I will appreciate a pointer to learn more to help me make the characters so that
> they conform to XHTML.  Or it may be a bug in the validator when examining
> documents containing Japanese text.

That's not a bug; it happens because you didn't send correct charset
information.  To validate Japanese documents correctly, you MUST
explicitly specify the character encoding of your documents.

In this case, the server only sends

   Content-Type: text/html

for <http://www.koga.org/letters.htm>, without a charset parameter,
so the validator assumes that the character encoding of the document
is ISO-8859-1, according to the HTTP/1.1 spec.  Section 3.7.1 of the
HTTP/1.1 spec [1] says:

   The "charset" parameter is used with some media types to define the
   character set (section 3.4) of the data. When no explicit charset
   parameter is provided by the sender, media subtypes of the "text"
   type are defined to have a default charset value of "ISO-8859-1" when
   received via HTTP. Data in character sets other than "ISO-8859-1" or
   its subsets MUST be labeled with an appropriate charset value. See
   section 3.4.1 for compatibility problems.

The document is actually encoded in Shift_JIS, so the validator read
the Shift_JIS bytes as ISO-8859-1 and generated misleading error
messages.  If the server sends

   Content-Type: text/html; charset=Shift_JIS

then the validator works fine.
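
As a rough illustration of the above, here is a minimal Python sketch of
where "character 130" comes from.  It is only a sketch, not the validator's
actual code: the function name effective_charset and the sample character
are my own assumptions, and the fallback simply follows the HTTP/1.1 default
quoted above.

```python
from email.message import Message

# Default for text/* media types received via HTTP (RFC 2616, section 3.7.1).
HTTP_DEFAULT_CHARSET = "iso-8859-1"


def effective_charset(content_type: str) -> str:
    """Return the charset a receiver would assume for a Content-Type value.

    Hypothetical helper for illustration only.
    """
    msg = Message()
    msg["Content-Type"] = content_type
    charset = msg.get_content_charset()  # lowercased, or None if absent
    return charset if charset else HTTP_DEFAULT_CHARSET


# HIRAGANA LETTER A, encoded as Shift_JIS: two bytes, the first is 0x82 (130).
raw = "\u3042".encode("shift_jis")

# Without a charset parameter the bytes are read as ISO-8859-1, where
# byte 130 is a C1 control character rather than part of a Japanese letter.
misread = raw.decode(effective_charset("text/html"))

# With the correct label, the same bytes decode to the intended character.
correct = raw.decode(effective_charset("text/html; charset=Shift_JIS"))
```

Here raw[0] is 130, which is the byte the validator complained about when
it treated the document as ISO-8859-1.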

Though you should specify charset information via the HTTP Content-Type
header as described above, the validator also recognizes equivalent
information inside the document, namely the meta element.  But in this
case, that is also wrong.  The document includes the following line:

   <meta http-equiv="Content-Type" content="text/html; charset=SJIS-JP" />

but "SJIS-JP" is not a registered charset name.  "Shift_JIS" is the
correct name.  Check the charset registry [2] for more detail.
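
A quick local sanity check for a charset label can be sketched in Python as
below.  Please note the assumption: this asks Python's own codec table, which
is not the IANA registry [2], so it is only an approximation (Python accepts
some aliases that are not registered names).  The helper name
is_known_charset is hypothetical.

```python
import codecs


def is_known_charset(name: str) -> bool:
    """Return True if Python has a codec for this charset label.

    Assumption: Python's codec aliases stand in for the IANA registry here;
    the registry itself remains the authoritative list.
    """
    try:
        codecs.lookup(name)
    except LookupError:
        return False
    return True


registered_label = is_known_charset("Shift_JIS")
unregistered_label = is_known_charset("SJIS-JP")
```

With a standard Python installation, the first lookup should succeed and the
second should fail, matching the registry in this particular case.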

Also, since an XHTML document is an XML document, you MUST
include the following XML declaration at the beginning of your
document.

   <?xml version="1.0" encoding="Shift_JIS" ?>
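
To check that a file really starts with such a declaration, something like
the following Python sketch could be used.  It is a simplified illustration
under my own assumptions: the function name declared_encoding is
hypothetical, the pattern only handles ASCII-compatible encodings such as
Shift_JIS, and it is not a full XML parser.

```python
import re

# Matches an XML declaration at the very start of the document and captures
# the value of the optional encoding pseudo-attribute.
_XML_DECL = re.compile(
    rb'^<\?xml\s+version\s*=\s*["\'][^"\']+["\']'
    rb'(?:\s+encoding\s*=\s*["\']([A-Za-z][A-Za-z0-9._-]*)["\'])?'
)


def declared_encoding(document: bytes):
    """Return the encoding named in the XML declaration, or None.

    Hypothetical helper: returns None when there is no declaration or when
    the declaration has no encoding pseudo-attribute.
    """
    match = _XML_DECL.match(document)
    if match is None or match.group(1) is None:
        return None
    return match.group(1).decode("ascii")


sample = b'<?xml version="1.0" encoding="Shift_JIS" ?>\n<html></html>'
found = declared_encoding(sample)
```

The declaration has to be the very first thing in the file, so the pattern
deliberately does not skip leading whitespace.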

Hope this helps.  

[1] http://www.ietf.org/rfc/rfc2616.txt
[2] ftp://ftp.isi.edu/in-notes/iana/assignments/character-sets

Regards,
-- 
Masayasu Ishikawa / mimasa@w3.org
W3C - World Wide Web Consortium

Received on Wednesday, 18 August 1999 11:53:04 UTC