- From: Gerald Oskoboiny <gerald@w3.org>
- Date: Fri, 7 Aug 1998 01:50:28 -0400 (EDT)
- To: www-validator@w3.org
Here are some further ideas on the i18n issues from Martin Dürst:

> As for I18N bugs, the problem is that the validator, by assuming the
> input is something like iso-8859-1, works on bytes instead of working
> on characters, and ignores any character encoding issues. For some
> encodings (those where bytes in the 0x00-0x7F range always represent
> ASCII characters), this works. For the other encodings, among others
> two out of three Japanese encodings and the one used in Taiwan, this
> fails, because these can contain bytes that the validator sees as
> syntactically relevant characters such as "<", although they are
> part of two-byte sequences that denote Japanese or Chinese characters.
>
> As a consequence, some truly valid HTML does not pass the validator.
>
> To make this work, the following steps should be taken:
>
> 1) Detect the character encoding ("charset") in the HTTP header or
>    the META construct (see
>    http://www.w3.org/TR/REC-html40/charset.html#h-5.2.2).
>    This may involve "reloading" of the document if the META is used (the
>    browsers have to do this, too).
>
> 2) Issue a warning and fall back to a safe assumption in case the
>    "charset" information is not present. This warning should point
>    to information such as the HTML 4.0 chapter above.
>
> 3) Once the "charset" is known (or assumed), convert from that "charset"
>    to Unicode. During that conversion, issue warnings for anything that
>    is not allowed in that "charset" (the most frequent case being CP 1252
>    declared as ISO-8859-1). It may be possible at this stage to take some
>    shortcuts (i.e. not really convert to Unicode, just move everything
>    that is not ASCII to the 0x80-0xFF range), but ideally, we wouldn't
>    do that. Conversion means that we need a conversion library. We will
>    need that also for other projects (the CSS validator should use it,
>    but probably doesn't, Amaya will need it,...). It probably makes sense
>    to work on a common W3C conversion library (if we don't find something
>    that meets our needs). I would want to contribute significantly to
>    such a library (have written one before), but would need some advice
>    in particular in the area of portability (Unix/Windows), connected
>    with dynamic linking (try to avoid that every process that needs
>    a (sometimes huge) conversion table loads it into memory separately).
>    Henrik / Daniel V., any comments?
>
> 4) Have the validator itself run in Unicode (if we use UTF-8, that
>    may not be too difficult).
>
> I hope this is enough information for the next few steps. To help more,
> I would need more information on how exactly the validator is working.

-- 
Gerald Oskoboiny <gerald@w3.org>  +1 617 253 2920
System Administrator, W3C  http://www.w3.org/People/Gerald/
World Wide Web Consortium, MIT Laboratory for Computer Science
545 Technology Square, Room NE43-353  Cambridge MA 02139 USA
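
For illustration only, steps 1 to 3 of the quoted list might be sketched in Python roughly as follows. Nothing here comes from the validator's source: the names `detect_charset`, `decode_document` and `FALLBACK` are hypothetical, ISO-8859-1 as the fallback is an assumption, and Python's built-in codecs stand in for the conversion library discussed in step 3.

```python
import re

# Hypothetical names throughout; this is a sketch, not the validator's code.
HEADER_RE = re.compile(r'charset\s*=\s*"?([A-Za-z0-9._:-]+)', re.IGNORECASE)
META_RE = re.compile(rb'<meta[^>]+charset\s*=\s*["\']?([A-Za-z0-9._:-]+)',
                     re.IGNORECASE)
FALLBACK = "iso-8859-1"  # assumed "safety assumption" for step 2
LATIN1_NAMES = {"iso-8859-1", "iso8859-1", "latin-1", "latin1"}


def detect_charset(content_type, body):
    """Steps 1 and 2: HTTP header first, then META, else warn and fall back.

    The META scan looks at raw bytes, so it only works when the start of
    the document is in an ASCII-compatible encoding.
    """
    warnings = []
    match = HEADER_RE.search(content_type or "")
    if match:
        return match.group(1).lower(), warnings
    match = META_RE.search(body[:1024])
    if match:
        return match.group(1).decode("ascii").lower(), warnings
    warnings.append("no charset declared; assuming " + FALLBACK)
    return FALLBACK, warnings


def decode_document(body, charset):
    """Step 3: convert bytes to Unicode text, collecting warnings."""
    warnings = []
    try:
        text = body.decode(charset)
    except LookupError:
        # Python has no codec registered under this charset name.
        warnings.append("unknown charset %r; assuming %s" % (charset, FALLBACK))
        charset = FALLBACK
        text = body.decode(FALLBACK)
    except UnicodeDecodeError as err:
        warnings.append("byte 0x%02X at offset %d is not valid in %s"
                        % (body[err.start], err.start, charset))
        text = body.decode(charset, errors="replace")
    # Bytes 0x80-0x9F decode to C1 control characters under ISO-8859-1;
    # in practice they usually mean the document is really windows-1252.
    if charset.lower() in LATIN1_NAMES and any(
            0x80 <= ord(ch) <= 0x9F for ch in text):
        warnings.append("C1 control characters found; "
                        "document is probably windows-1252 (CP 1252)")
    return text, warnings
```

Once the text is decoded, markup characters can be located on characters rather than bytes. For example, `"<p>手</p>"` encoded as ISO-2022-JP contains an extra 0x3C byte inside the two-byte sequence for the kanji, which a byte-oriented scan would mistake for "<"; after `decode_document` only the two real "<" characters remain.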
Received on Friday, 7 August 1998 01:50:05 UTC