Re: validator.w3.org and utf-8 (fwd)

From: Alan J. Flavell <flavell@a5.ph.gla.ac.uk>
Date: Sun, 23 Aug 1998 19:46:50 -0400 (EDT)
To: www-validator@w3.org
Message-ID: <Pine.OSF.3.96.980824001640.15000A-100000@a5.ph.gla.ac.uk>

Please excuse the unorthodox style; I'm handling this from the
web archive...

> Well, I think the place that advice is needed would be on the actual
> mechanics of informing the SP software of what charset it should be
> working in, and then devising a way to pick that off the HTTP
> transaction and feed it to the validator.

> To pick it off the HTTP transaction should be rather easy. But
> the problem is that it can also turn up inside the document,
> in the "<META>" construct. 

That's true, unfortunately.  It's a pity that most HTML authors have
convinced themselves that they cannot tell their server to create
proper HTTP headers.  This may be true for some, but AFAIK all of the
people I've advised have found to their surprise (and in at least one
case to the surprise of his server admin ;-) that a simple AddType
directive in a .htaccess file worked wonders...  However, in a
practical sense, it's true that the validator has to be prepared for
this case.  You _could_ provide a pull-down or radio button for
choosing the encoding by hand; it would be better than nothing.
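
For what it's worth, here is a rough Python sketch of one possible
order of preference: the HTTP header first, then a <META> in the
document, then whatever the user picked by hand.  All the function
names are made up for illustration, the ordering is my assumption, and
none of this is the validator's actual code.

```python
import re
from typing import Optional

# Hypothetical helpers for illustration only.
_CHARSET_PARAM = re.compile(r'charset\s*=\s*["\']?([A-Za-z0-9._:-]+)',
                            re.IGNORECASE)
_META_TAG = re.compile(rb'<meta\b[^>]*>', re.IGNORECASE)
_HTTP_EQUIV = re.compile(r'http-equiv\s*=\s*["\']?content-type',
                         re.IGNORECASE)


def charset_from_content_type(value: Optional[str]) -> Optional[str]:
    """Return the charset parameter of a Content-Type value, lowercased."""
    if not value:
        return None
    match = _CHARSET_PARAM.search(value)
    return match.group(1).lower() if match else None


def charset_from_meta(body: bytes, scan_limit: int = 4096) -> Optional[str]:
    """Look for <META HTTP-EQUIV="Content-Type" ...> near the document start.

    Only the first scan_limit bytes are searched, and the tag is read as
    ASCII, which is safe for any encoding that has ASCII as a subset.
    """
    for tag in _META_TAG.findall(body[:scan_limit]):
        text = tag.decode("ascii", errors="replace")
        if _HTTP_EQUIV.search(text):
            found = charset_from_content_type(text)
            if found:
                return found
    return None


def pick_charset(content_type: Optional[str], body: bytes,
                 user_choice: Optional[str] = None) -> Optional[str]:
    """Assumed preference: HTTP header, then META, then the user's choice."""
    return (charset_from_content_type(content_type)
            or charset_from_meta(body)
            or user_choice)
```

With this ordering the pull-down only matters when neither the server
nor the document says anything; one could equally argue for letting it
override both.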

> I have looked superficially at SP, but I haven't
> looked at all at the setup that your online validator is using.

I downloaded the Win95 version of SP and played around, and verified
(as was only to be expected, after all) that it behaves correctly. 
Setting the appropriate environment variable SP_ENCODING to the
document encoding did the trick.  And, as the DTDs are all confined
to US-ASCII, which is a proper subset of almost all of the codings
under review, it's going to work (well, I don't suppose we had any
doubt about that).  I think in practice that the choice of encoding
for Unicode on the WWW is going to fall on utf-8, don't you?
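
The subset property is easy to check for oneself.  A small Python
sketch follows; the list of encodings is my own choice for
illustration (Python codec names, not SP's), and the DTD fragment is
just a sample line.

```python
# Encodings picked for illustration; this is not SP's list of supported names.
ASCII_COMPATIBLE = ["utf-8", "iso-8859-1", "iso-8859-5", "koi8-r", "euc-jp"]

# Sample ASCII-only declaration of the kind found in the HTML DTDs.
DTD_FRAGMENT = "<!ELEMENT P - O (%inline;)*>"


def is_ascii_transparent(text: str, encoding: str) -> bool:
    """True if ASCII-only text has the same bytes in `encoding` as in US-ASCII.

    `text` must be pure ASCII; anything else raises UnicodeEncodeError.
    """
    reference = text.encode("ascii")
    return (text.encode(encoding) == reference
            and reference.decode(encoding) == text)


if __name__ == "__main__":
    for name in ASCII_COMPATIBLE:
        print(name, is_ascii_transparent(DTD_FRAGMENT, name))
    # A 16-bit encoding is one of the exceptions covered by "almost all".
    print("utf-16-le", is_ascii_transparent(DTD_FRAGMENT, "utf-16-le"))
```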

The only problem that I noticed was that there is no support for any
of the commonly used Cyrillic encodings.  The documentation implies
that it would support iso-8859-5, though I'm told that nobody actually
uses that; it supports neither koi8-r (Russian de facto code) nor
ECMA-Cyrillic/iso-ir-111 (non-Russian usage).  (This isn't my field;
I'm only reporting what I'm told.)  Bear in mind that koi8-r, at
least, uses the range 128-159 for displayable characters, and to add
variety, it has its no-break space in a different place!
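
To make the koi8-r point concrete, here is a small Python sketch that
uses Python's own codec tables (not SP's) to compare what bytes
128-159 decode to, and where the no-break space lives.  The helper
names are invented for the purpose.

```python
import unicodedata

# Bytes 128-159: control codes in the iso-8859 family.
HIGH_RANGE = bytes(range(128, 160))


def control_count(encoding: str) -> int:
    """Count how many of bytes 128-159 decode to control characters (Cc)."""
    decoded = HIGH_RANGE.decode(encoding)
    return sum(1 for ch in decoded if unicodedata.category(ch) == "Cc")


def nbsp_byte(encoding: str) -> bytes:
    """Return the byte that encodes U+00A0 NO-BREAK SPACE in `encoding`."""
    return "\u00a0".encode(encoding)


if __name__ == "__main__":
    for name in ("iso-8859-1", "iso-8859-5", "koi8-r"):
        print(name, control_count(name), nbsp_byte(name).hex())
```

Any tool that treats 128-159 as non-displayable, or looks for the
no-break space at 160, will therefore trip over koi8-r input.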

> http://www.jclark.com/sp/charset.htm  is somewhat baffling to the
> non-SGML-guru like myself.  I _think_ he is saying that one needs to
> turn on SP_CHARSET_FIXED and use the default SP_SYSTEM_CHARSET
> which is Unicode; then specify the encoding of the incoming document via
> SP_ENCODING.  But I could very well have got that wrong, and I don't
> understand the BCTF issue at all.

OK, in HTML usage it's necessary to set SP_CHARSET_FIXED on (=1 etc.),
and set SP_ENCODING to the input coding.  With tools like sgmlnorm or
spam, the -b (BCTF) command line option can then be used to specify
the desired output encoding, but this would seem irrelevant to the
validator.  Or would it?  Well, if the server's own output consists
only of ASCII, the subset property comes into play again, and the
server can send the same as it usually sends, and can advertise the
result as being in whatever encoding the original document claimed to
be, no?  Then any inclusions from the original document will come out
right in the end.  I think.
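
A sketch of how a CGI wrapper might set this up in Python, purely as
an illustration: SP_CHARSET_FIXED and SP_ENCODING are the variables
discussed above, but the wrapper functions are hypothetical, the
"nsgmls -s" command line is my assumption about how the parser would
be run, and the names SP accepts for SP_ENCODING may not coincide with
HTTP charset labels (check charset.htm).

```python
import os
import subprocess
from typing import Dict, Mapping, Optional, Sequence


def sp_environment(document_encoding: str,
                   base: Optional[Mapping[str, str]] = None) -> Dict[str, str]:
    """Return a copy of the environment with SP's charset variables set."""
    env = dict(os.environ if base is None else base)
    env["SP_CHARSET_FIXED"] = "1"            # "on (=1 etc.)"
    env["SP_ENCODING"] = document_encoding   # encoding of the input document
    return env


def run_parser(path: str, document_encoding: str,
               command: Sequence[str] = ("nsgmls", "-s"),
               ) -> "subprocess.CompletedProcess[bytes]":
    """Run an SP parser on `path` (assumed command line, see lead-in).

    Catalog and DTD setup are deliberately left out of this sketch.
    """
    return subprocess.run([*command, path],
                          env=sp_environment(document_encoding),
                          capture_output=True, check=False)
```

The environment is copied rather than modified in place, so two
requests with different encodings cannot disturb each other.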

> I'm interested, but I'm sure there are others who are much more
> technically competent to address this problem.  I'm only dabbling.

OK, I dabbled a bit more.  HTH ;-)
