- From: Gerald Oskoboiny <gerald@w3.org>
- Date: Fri, 7 Aug 1998 01:50:28 -0400 (EDT)
- To: www-validator@w3.org
Here are some further ideas on the i18n issues from Martin Dürst:
> As for I18N bugs, the problem is that the validator, by assuming the
> input is something like iso-8859-1, works on bytes instead of working
> on characters, and ignores any character encoding issues. For some
> encodings (those where bytes in the 0x00-0x7F range always represent
> ASCII characters), this works. For the other encodings, among them
> two out of three Japanese encodings and the one used in Taiwan, this
> fails, because these can contain bytes that the validator sees as
> syntactically relevant characters such as "<", although they are
> part of two-byte sequences that denote Japanese or Chinese characters.
>
> As a consequence, some truly valid HTML does not pass the validator.
>
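To make the failure concrete, here is a small sketch in Python. It assumes
ISO-2022-JP as the example encoding (my choice, not named above), and the
helper find_misleading_char is hypothetical. It looks for a kanji whose
encoded form contains the byte 0x3C, which a byte-oriented parser would
read as "<":

```python
def find_misleading_char(encoding="iso2022_jp", needle=b"<"):
    """Return (char, raw_bytes) for the first CJK ideograph whose encoded
    form contains `needle`, or None if there is no such character."""
    for cp in range(0x4E00, 0xA000):       # CJK Unified Ideographs block
        ch = chr(cp)
        try:
            raw = ch.encode(encoding)
        except UnicodeEncodeError:
            continue                       # not representable in this charset
        if needle in raw:
            return ch, raw
    return None


ch, raw = find_misleading_char()
print(repr(ch), raw)

# The byte 0x3C is present in the encoded form ...
print(b"<" in raw)
# ... but the decoded text is a single ideograph, with no "<" in it.
print("<" in raw.decode("iso2022_jp"))
```

A parser that scans the raw bytes sees a tag start here. One that scans the
decoded characters does not.
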
> To make this work, the following steps should be taken:
>
> 1) Detect the character encoding ("charset") in the HTTP header or
> the META construct (see
> http://www.w3.org/TR/REC-html40/charset.html#h-5.2.2).
> This may involve "reloading" of the document if the META is used (the
> browsers have to do this, too).
>
> 2) Issue a warning and fall back to a safe assumption in case the
> "charset" information is not present. This warning should point
> to information such as the HTML 4.0 section cited above.
>
> 3) Once the "charset" is known (or assumed), convert from that "charset"
> to Unicode. During that conversion, issue warnings for byte sequences that
> are not allowed in that "charset" (the most frequent case being CP 1252
> declared as ISO-8859-1). It may be possible at this stage to take some
> shortcuts (i.e. not really convert to Unicode, but just map everything
> that is not ASCII into the 0x80-0xFF range), but ideally, we wouldn't
> do that. Conversion means that we need a conversion library. We will
> need that for other projects as well (the CSS validator should use it,
> but probably doesn't; Amaya will need it; ...). It probably makes sense
> to work on a common W3C conversion library (if we don't find something
> that meets our needs). I would like to contribute significantly to
> such a library (I have written one before), but would need some advice,
> in particular on portability (Unix/Windows) in connection with
> dynamic linking (so that not every process that needs a sometimes
> huge conversion table loads it into memory separately).
> Henrik / Daniel V., any comments?
>
> 4) Have the validator itself run in Unicode (if we use UTF-8, that
> may not be too difficult).
>
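Steps 1 and 2 could look roughly like the following Python sketch. The
function names, the 1024-byte scan window, and the iso-8859-1 fallback are
assumptions made for illustration, not how the validator works today. The
META scan runs on raw bytes and so assumes that everything before the META
element is plain ASCII:

```python
import re
from email.message import Message

# Loose pattern for <meta ... charset=NAME ...>; illustrative only,
# not a full HTML parser.
META_CHARSET = re.compile(
    rb'''<meta[^>]+charset\s*=\s*["']?([A-Za-z0-9._:-]+)''', re.IGNORECASE)


def charset_from_header(content_type):
    """Extract the charset parameter from a Content-Type header value."""
    msg = Message()
    msg["Content-Type"] = content_type
    return msg.get_content_charset()       # lowercased name, or None


def detect_charset(content_type, body, default="iso-8859-1"):
    """Return (charset, warnings). The HTTP header wins over META."""
    warnings = []
    charset = charset_from_header(content_type) if content_type else None
    if charset is None:
        match = META_CHARSET.search(body[:1024])
        if match:
            charset = match.group(1).decode("ascii").lower()
    if charset is None:
        warnings.append(
            "No charset declared in the HTTP header or in META; assuming "
            + default + ". See "
            "http://www.w3.org/TR/REC-html40/charset.html#h-5.2.2")
        charset = default
    return charset, warnings


print(detect_charset("text/html; charset=Shift_JIS", b""))
print(detect_charset("text/html", b"<html><head><title>x</title>"))
```

The second call finds no declaration, so it returns the fallback together
with one warning that points at the HTML 4.0 section.
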
> I hope this is enough information for the next few steps. To help more,
> I would need more information on how exactly the validator is working.
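For steps 3 and 4, a minimal sketch, assuming Python's built-in codecs
stand in for the conversion library discussed above. The function
to_unicode and the wording of its warnings are hypothetical:

```python
import codecs


def to_unicode(raw, charset):
    """Decode `raw` using `charset`; return (text, warnings)."""
    warnings = []
    codec = codecs.lookup(charset)         # LookupError if charset is unknown

    # Bytes 0x80-0x9F are C1 control codes in ISO-8859-1. Finding them in a
    # document declared as ISO-8859-1 usually means it is really CP 1252.
    if codec.name == codecs.lookup("iso-8859-1").name:
        c1 = sorted({b for b in raw if 0x80 <= b <= 0x9F})
        if c1:
            warnings.append(
                "bytes " + ", ".join("0x%02X" % b for b in c1)
                + " are control codes in ISO-8859-1; "
                "the document may really be CP 1252")

    try:
        text = raw.decode(codec.name)
    except UnicodeDecodeError as exc:
        warnings.append("invalid byte sequence for %s at offset %d"
                        % (charset, exc.start))
        text = raw.decode(codec.name, errors="replace")
    return text, warnings


text, warns = to_unicode(b"\x93quoted\x94", "ISO-8859-1")
print(warns)

# After decoding, markup detection works on characters, not bytes.
text, warns = to_unicode("\u65e5\u672c\u8a9e<p>".encode("iso2022_jp"),
                         "iso-2022-jp")
print(text.count("<"), warns)
```

Once the text is decoded, any later parsing step only ever sees characters,
so a "<" in the text is always a real "<".
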
--
Gerald Oskoboiny <gerald@w3.org> +1 617 253 2920
System Administrator, W3C http://www.w3.org/People/Gerald/
World Wide Web Consortium, MIT Laboratory for Computer Science
545 Technology Square, Room NE43-353 Cambridge MA 02139 USA
Received on Friday, 7 August 1998 01:50:05 UTC