- From: Alan J. Flavell <flavell@a5.ph.gla.ac.uk>
- Date: Sun, 23 Aug 1998 19:46:50 -0400 (EDT)
- To: www-validator@w3.org
Please excuse the unorthodox style, I'm handling this from the web archive...

Me:
> Well, I think the place that advice is needed would be on the actual
> mechanics of informing the SP software of what charset it should be
> working in, and then devising a way to pick that off the HTTP
> transaction and feed it to the validator.

Martin:
> To pick it off the HTTP transaction should be rather easy. But
> the problem is that it can also turn up inside the document,
> in the "<META>" construct.

That's true, unfortunately. It's a pity that most HTML authors have convinced themselves that they cannot tell their server to create proper HTTP headers. This may be true for some, but AFAIK all of the people I've advised have found to their surprise (and in at least one case to the surprise of his server admin ;-) that a simple AddType directive in a .htaccess file worked wonders... However, in a practical sense, it's true that the validator has to be prepared for this case. You _could_ provide a pull-down or radio button, though; it would be better than nothing.

Me:
> I have looked superficially at SP, but I haven't
> looked at all at the setup that your online validator is using.

I downloaded the Win95 version of SP and played around, and verified (as was only to be expected, after all) that it behaves correctly. Setting the appropriate environment variable SP_ENCODING to the document encoding did the trick. And, as the DTDs are all confined to US-ASCII, which is a proper subset of almost all of the encodings under review, it's going to work (well, I don't suppose we had any doubt about that). I think in practice that the choice of encoding for Unicode on the WWW is going to fall on utf-8, don't you?

The only problem that I noticed was that there is no support for any of the commonly used Cyrillic encodings.
The documentation implies that it would support iso-8859-5, but I'm told that nobody actually uses that; meanwhile it doesn't support koi8-r (the Russian de facto code) nor ECMA-Cyrillic/iso-ir-111 (non-Russian usage). (This isn't my field; I'm only reporting what I'm told.) Bear in mind that koi8-r, at least, uses the range 128-159 for displayable characters, and, to add variety, it has its no-break space in a different place!

Me:
> http://www.jclark.com/sp/charset.htm is somewhat baffling to the
> non-SGML-guru like myself. I _think_ he is saying that one needs to
> turn on SP_CHARSET_FIXED and use the default SP_SYSTEM_CHARSET
> which is Unicode; then specify the encoding of the incoming document via
> SP_ENCODING. But I could very well have got that wrong, and I don't
> understand the BCTF issue at all.

OK, in HTML usage it's necessary to set SP_CHARSET_FIXED on (=1 etc.), and to set SP_ENCODING to the input encoding. With tools like sgmlnorm or spam, the -b (BCTF) command line option can then be used to specify the desired output encoding, but this would seem irrelevant to the validator. Or would it? Well, if the server's own output consists only of ASCII, the subset property comes into play again: the server can send the same bytes as it usually sends, and can advertise the result as being in whatever encoding the original document claimed to be, no? Then any inclusions from the original document will come out right in the end. I think.

Me:
> I'm interested, but I'm sure there are others who are much more
> technically competent to address this problem. I'm only dabbling.

OK, I dabbled a bit more. HTH ;-)
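[For the curious: the .htaccess trick mentioned above amounts to something like the following, assuming an Apache server; the koi8-r charset and the .html extension here are just examples, not a recommendation:]

```apache
# In the directory's .htaccess (the server's AllowOverride must permit it):
# label .html files with an explicit charset in the Content-Type header.
AddType "text/html; charset=koi8-r" .html
```

[The server then sends "Content-Type: text/html; charset=koi8-r", so the validator can pick the charset straight off the HTTP transaction without having to look inside the document at all.]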
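[The detection order the discussion implies -- trust the HTTP header first, fall back to the META construct, then to a default -- could be sketched as below. This is only an illustrative sketch: the function names are mine, not the validator's, and the regular expressions are deliberately naive.]

```python
import re

def charset_from_content_type(header):
    """Extract the charset= parameter from a Content-Type header value."""
    m = re.search(r"charset\s*=\s*\"?([A-Za-z0-9._-]+)\"?", header, re.I)
    return m.group(1).lower() if m else None

def charset_from_meta(html):
    """Look for <META HTTP-EQUIV="Content-Type" CONTENT="...charset=...">."""
    m = re.search(
        r"<meta[^>]+http-equiv\s*=\s*[\"']?content-type[\"']?[^>]*"
        r"content\s*=\s*[\"'][^\"']*charset\s*=\s*([A-Za-z0-9._-]+)",
        html, re.I)
    return m.group(1).lower() if m else None

def pick_charset(content_type, html, default="iso-8859-1"):
    """HTTP header wins; the META construct is only a fallback."""
    return (charset_from_content_type(content_type)
            or charset_from_meta(html)
            or default)

def sp_environment(charset):
    """The SP settings discussed above, as an environment for e.g. nsgmls."""
    return {"SP_CHARSET_FIXED": "1", "SP_ENCODING": charset}
```

[The result would then be handed to SP via the environment, per the SP_CHARSET_FIXED / SP_ENCODING settings described above.]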
Received on Monday, 24 August 1998 12:01:21 UTC