Re: validator.w3.org and utf-8 (fwd) from Martin J. Duerst on 1998-08-07 (www-validator@w3.org from August 1998)

From: Martin J. Duerst <duerst@w3.org>
Date: Fri, 07 Aug 1998 15:36:55 +0900
To: Gerald Oskoboiny <gerald@w3.org>
Cc: www-validator@w3.org
Message-Id: <199808070810.RAA02504@sh.w3.mag.keio.ac.jp>

At 15:38 98/08/06 -0400, Gerald Oskoboiny wrote:
> Here are some good tips from Alan Flavell...

Yes indeed.

> ---------- Forwarded message ----------
> From: "Alan J. Flavell" <flavell@mail.cern.ch>
> Date: Thu, 6 Aug 1998 13:42:01 +0200 (METDST)
> To: Gerald Oskoboiny <gerald@w3.org>
> Cc: Andreas Prilop <nhtcwenz@rrzn-user.uni-hannover.de>
> Subject: Re: validator.w3.org and utf-8
> 
> On Wed, 5 Aug 1998, Gerald Oskoboiny wrote:
> 
> [...]
> 
> > I'm definitely interested in fixing this bug, but I'm afraid I
> > don't know a lot about i18n issues myself, so I need advice from
> > others.
> 
> Well, I think the place that advice is needed would be on the actual
> mechanics of informing the SP software of what charset it should be
> working in, and then devising a way to pick that off the HTTP
> transaction and feed it to the validator.

Picking it off the HTTP transaction should be rather easy. The
problem is that the charset can also turn up inside the document,
in the "<META>" construct. That may mean something like a
recursive call of SP :-).
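
As a rough illustration of the two places the charset can come from,
here is a minimal Python sketch. It is not what the validator does:
instead of a recursive SP call it uses a plain regular-expression
pre-scan of the first few kilobytes, and the function names, the 4096
byte limit, and the iso-8859-1 fallback are all assumptions made for
the example. The HTTP header is given priority over "<META>".

```python
import re
from email.message import Message

# Hypothetical helper names; nothing here comes from the validator itself.
META_RE = re.compile(
    rb'<meta[^>]+charset\s*=\s*["\']?([A-Za-z0-9_.:-]+)', re.IGNORECASE
)

def charset_from_content_type(value):
    """Return the charset parameter of a Content-Type value, or None."""
    if not value:
        return None
    msg = Message()
    msg["Content-Type"] = value
    return msg.get_content_charset()  # lowercased, None if absent

def charset_from_meta(body, limit=4096):
    """Pre-scan the start of the raw bytes for a <META ... charset=...>."""
    match = META_RE.search(body[:limit])
    return match.group(1).decode("ascii").lower() if match else None

def detect_charset(content_type, body, default="iso-8859-1"):
    """HTTP header first, then <META>, then an assumed default."""
    return (
        charset_from_content_type(content_type)
        or charset_from_meta(body)
        or default
    )
```

For example, detect_charset("text/html", body) falls through to the
"<META>" scan, and only uses the default when neither source names a
charset.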


> I have the impression that at
> the moment it isn't doing anything at all of that nature, meaning it
> processes every charset as if it were iso-8859-1.  But that's only my
> hunch from the outside; I have looked superficially at SP, but I haven't
> looked at all at the setup that your online validator is using. 

That's my hunch, too, not from looking at SP, but from looking at
who complains about the validator, and who doesn't :-).

And it indeed correctly (for iso-8859-1) masks out the range from
0x80 to 0x9F. So it at least prevents pages from containing
Microsoft-specific non-iso-8859-1 characters. But on the other
hand, that means that, as is, it doesn't work on UTF-8, which
uses bytes in that range. Try e.g.
http://www.unicode.org/unicode/iuc10/x-yi.html;
http://www.unicode.org/unicode/iuc10/languages.html has
a few more pages in all kinds of charsets.
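
To make the conflict concrete, here is a small Python sketch. The
helper name c1_bytes and the sample characters are chosen for the
example only. It shows that a check on the 0x80 to 0x9F byte range
catches a Windows-specific curly quote, but also catches a perfectly
valid UTF-8 encoded Hebrew letter of the kind a Yiddish page would
contain.

```python
def c1_bytes(data):
    """Return the bytes of `data` that fall in the range 0x80-0x9F."""
    return [b for b in data if 0x80 <= b <= 0x9F]

# A left curly quote as Windows code page 1252 encodes it: one byte, 0x93.
windows_quote = "\u201c".encode("cp1252")

# HEBREW LETTER ALEF (U+05D0) in UTF-8: two bytes, 0xD7 0x90.
utf8_alef = "\u05d0".encode("utf-8")

# Plain iso-8859-1 text such as "é" (0xE9) is not affected.
latin1_e_acute = "\u00e9".encode("iso-8859-1")

print(c1_bytes(windows_quote))   # [147], i.e. 0x93
print(c1_bytes(utf8_alef))       # [144], i.e. 0x90
print(c1_bytes(latin1_e_acute))  # []
```

So a byte-level test cannot tell the two cases apart; the parser has to
know the encoding before it can decide which characters are legal.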


> http://www.jclark.com/sp/charset.htm  is somewhat baffling to the
> non-SGML-guru like myself.  I _think_ he is saying that one needs to
> turn on SP_CHARSET_FIXED and use the default SP_SYSTEM_CHARSET
> which is Unicode; then specify the encoding of the incoming document via
> SP_ENCODING.  But I could very well have got that wrong, and I don't
> understand the BCTF issue at all.

This is also my interpretation. In addition, I think BCTF is
irrelevant here. It only matters if you want to pass something
through the parser without ever converting to Unicode, while
still working correctly on characters rather than on bytes.
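
Under that interpretation, the setup might look roughly like the
following Python sketch. It only builds the environment for a child
process. The variable names are the ones from charset.htm as quoted
above; the value "YES" for SP_CHARSET_FIXED and the encoding name
"utf-8" are my reading of that page and should be checked against it,
and the function name sp_environment is made up for the example.

```python
import os

def sp_environment(encoding, base=None):
    """Build an environment for running SP with a fixed Unicode system
    character set and an explicit document encoding (assumed settings)."""
    env = dict(os.environ if base is None else base)
    # Assumption: "YES" turns on the fixed system character set.
    env["SP_CHARSET_FIXED"] = "YES"
    # Leave SP_SYSTEM_CHARSET unset so that SP's default applies.
    env.pop("SP_SYSTEM_CHARSET", None)
    # The encoding of the incoming document, e.g. taken from HTTP.
    env["SP_ENCODING"] = encoding
    return env
```

The result could then be passed as the env argument when starting the
parser as a subprocess, with the encoding taken from the HTTP
transaction.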


> And presumably then there is the question of distinguishing between the
> encoding of the SGML declarations and DTDs, on the one hand, and the
> encoding of the HTML document to be validated, on the other. 

Yes, but this is rather academic, probably with the exception of
UTF-16.


> > I have (just now) sent you an invitation to join the
> > www-validator mailing list, where I'd like to discuss this
> > further.
> 
> I'm interested, but I'm sure there are others who are much more
> technically competent to address this problem.  I'm only dabbling. 

I guess we are all dabbling, and the more people we have doing
a bit of dabbling, the better.


Regards,   Martin.
Received on Friday, 7 August 1998 04:09:51 UTC