Re: validator.w3.org and utf-8 (fwd) from Martin J. Duerst on 1998-08-25 (www-validator@w3.org from August 1998)

From: Martin J. Duerst <duerst@w3.org>
Date: Tue, 25 Aug 1998 15:57:19 +0900
To: A.Flavell@physics.gla.ac.uk
Cc: www-validator@w3.org
Message-Id: <199808250655.PAA25964@sh.w3.mag.keio.ac.jp>
At 19:46 98/08/23 -0400, Alan J. Flavell wrote:

> Martin:
> > To pick it off the HTTP transaction should be rather easy. But
> > the problem is that it can also turn up inside the document,
> > in the "<META>" construct. 
> 
> That's true, unfortunately.  It's a pity that most HTML authors have
> convinced themselves that they cannot tell their server to create
> proper HTTP headers.  This may be true for some, but AFAIK all of the
> people I've advised have found to their surprise (and in at least one
> case to the surprise of his server admin ;-) that a simple AddType
> directive in a .htaccess file worked wonders...  however, in a
> practical sense, it's true that the validator has to be prepared for
> this case.  You _could_ provide a pull-down or radio button, though,
> it would be better than nothing.

I think we will have to provide a pull-down menu anyway (radio buttons
will take too much space if the list grows), because the validator
can be used to check pages of others, where you have no control over
the server.

But I also found that the validator actually keeps the whole file
in memory, and checks it to find doctypes. Please have a look
at the code that starts with:

# do several loops of increasing lengths to avoid iterating over
# the whole file if possible.
#
# these heuristics could be improved a lot.

That code takes much more time than looking for and analyzing META,
I guess. So how about something like this, for example:


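# Join the lines into one string; join() needs an explicit separator.
$file = join('', @file);
# Match against $file, using m{...}x so that "/" needs no escaping and
# the pattern can span lines. Either attribute order is accepted.
if ($file =~ m{<META\s+(?:
      http-equiv\s*=\s*["']?content-type["']?\s+
      content=["']text\s*/\s*html;\s*charset\s*=([0-9a-zA-Z\-]+)\s*["']
    |
      content=["']text\s*/\s*html;\s*charset\s*=([0-9a-zA-Z\-]+)\s*["']\s+
      http-equiv\s*=\s*["']?content-type["']?
    )\s*>}ix) {
    # Only one of the two alternatives can have captured a value.
    $metaCharset = defined $1 ? $1 : $2;
}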


This is probably still a bit buggy, but mostly would do the job, I guess.
Any improvements welcome.
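
To go with the META check, here is a rough sketch of the HTTP side, which
should be the easy part. The names charset_from_content_type and
effective_charset are made up for illustration and are not in the
validator. Letting the HTTP header win over META is an assumption of the
sketch (it follows the priority order given in HTML 4.0).

```perl
use strict;

# Hypothetical helper: extract the charset parameter from a Content-Type
# header value such as "text/html; charset=utf-8". Returns undef if absent.
sub charset_from_content_type {
    my ($value) = @_;
    return undef unless defined $value;
    if ($value =~ /;\s*charset\s*=\s*["']?([0-9A-Za-z_.:\-]+)/i) {
        return lc $1;
    }
    return undef;
}

# Assumption: prefer the HTTP header, fall back to the charset from META.
sub effective_charset {
    my ($http_content_type, $meta_charset) = @_;
    my $charset = charset_from_content_type($http_content_type);
    return $charset if defined $charset;
    return defined $meta_charset ? lc $meta_charset : undef;
}
```

If neither source gives a charset, effective_charset returns undef, and
that is where the pull-down menu would have to take over.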



> Me:
> > I have looked superficially at SP, but I haven't
> > looked at all at the setup that your online validator is using.
> 
> I downloaded the Win95 version of SP and played around, and verified
> (as was only to be expected, after all) that it behaves correctly. 
> Setting the appropriate environment variable SP_ENCODING to the
> document encoding did the trick.  And, as the DTD's are all confined
> to US-ASCII, which is a proper subset of almost all of the codings
> under review, it's going to work (well, I don't suppose we had any
> doubt about that).  I think in practice that the choice of encoding
> for unicode on the WWW is going to fall on utf-8, don't you?

I guess there will be quite a lot of UTF-8. But I think there will also
be some UTF-16. For that, it looks like we would need a separate
copy of the DTD with a NUL byte next to each ASCII character, or actually
two copies to deal with the byte-order (endianness) problem of UTF-16.
That can be done.
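
A minimal sketch of how the two UTF-16 copies could be produced from the
ASCII DTD, assuming the DTD really is pure US-ASCII. The helper name
ascii_to_utf16 is hypothetical, and no byte-order mark is added; whether
SP wants one is something I have not checked.

```perl
use strict;

# Hypothetical helper: re-encode a US-ASCII string as UTF-16.
# $order is 'be' (big-endian) or 'le' (little-endian).
sub ascii_to_utf16 {
    my ($ascii, $order) = @_;
    die "not pure US-ASCII\n" if $ascii =~ /[^\x00-\x7F]/;
    # 'n' packs 16-bit big-endian values, 'v' packs 16-bit little-endian.
    my $template = $order eq 'le' ? 'v*' : 'n*';
    return pack($template, unpack('C*', $ascii));
}
```

Each ASCII byte becomes a two-byte unit with the NUL byte first
(big-endian) or second (little-endian), so the two DTD copies differ only
in byte order.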



> The only problem that I noticed was that there is no support for any
> of the commonly used Cyrillic encodings.  The documentation implies
> that it would support iso-8859-5, but I'm told that nobody actually
> uses that, but it doesn't support koi8-r (Russian de facto code) nor
> ECMA-Cyrillic/iso-ir-111 (non-Russian usage).  (This isn't my field,
> I'm only reporting what I'm told).   Bear in mind that koi8-r, at
> least, uses the range 128-159 for displayable characters, and to add
> variety, it has its no-break space in a different place!

Is the non-break space a problem? As for the encodings used for Russian,
I agree. I guess it shouldn't be too much of a problem to add these
to SP.


> Me:
> > http://www.jclark.com/sp/charset.htm  is somewhat baffling to the
> > non-SGML-guru like myself.  I _think_ he is saying that one needs to
> > turn on SP_CHARSET_FIXED and use the default SP_SYSTEM_CHARSET
> > which is Unicode; then specify the encoding of the incoming document via
> > SP_ENCODING.  But I could very well have got that wrong, and I don't
> > understand the BCTF issue at all.
> 
> OK, in HTML usage it's necessary to set SP_CHARSET_FIXED on (=1 etc.),
> and set SP_ENCODING to the input coding.  With tools like sgmlnorm or
> spam, the -b (BCTF) command line option can then be used to specify
> the desired output encoding, but this would seem irrelevant to the
> validator.  Or would it?

It's not completely irrelevant, depending on how the error messages
that the validator sends back are composed. If the actual HTML text
comes from SP, we have to convert back (or clearly label the page
sent out as UTF-8 or whatever).
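
For the input side, a rough sketch of setting the two SP variables
mentioned above before the parser is started. The helper name
sp_environment and the variable $document_path are hypothetical, and the
sketch simply assumes that SP accepts the encoding name it is given.

```perl
use strict;

# Hypothetical helper: return a copy of the given environment with the two
# SP variables from this thread set for the input encoding.
sub sp_environment {
    my ($encoding, %base) = @_;
    my %env = %base;
    $env{SP_CHARSET_FIXED} = 1;          # turned on (=1), as described above
    $env{SP_ENCODING}      = $encoding;  # encoding of the incoming document
    return %env;
}

# Usage sketch (not run here); nsgmls is SP's parser, -s suppresses output:
# {
#     local %ENV = sp_environment('utf-8', %ENV);
#     system('nsgmls', '-s', $document_path);
# }
```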


> Well, if the server's own output consists
> only of ASCII, the subset property comes into play again, and the
> server can send the same as it usually sends, and can advertise the
> result as being in whatever encoding the original document claimed to
> be, no?  Then any inclusions from the original document will come out
> right in the end.  I think.

Yes, this is one possibility.
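
As a sketch of that possibility: the validator could label its result
page with the charset the document claimed. This only works if the
validator's own text is pure ASCII and the claimed encoding is a superset
of ASCII (so not UTF-16). The helper name result_header and the fallback
argument are assumptions for illustration.

```perl
use strict;

# Hypothetical helper: build the CGI response header for the result page.
# Falls back to $default when the claimed charset is missing or contains
# characters that do not belong in a charset name.
sub result_header {
    my ($claimed, $default) = @_;
    my $charset = $default;
    if (defined $claimed && $claimed =~ /^[0-9A-Za-z][0-9A-Za-z_.:\-]*\z/) {
        $charset = lc $claimed;
    }
    return "Content-Type: text/html; charset=$charset\n\n";
}
```

The check on the claimed value is there so that a strange charset string
taken from the document cannot inject extra header lines.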



Regards,   Martin.
Received on Tuesday, 25 August 1998 03:02:26 UTC