Re: [WMVS] Some initial thoughts on code M12N...

* Martin Duerst wrote:
>That may well be for some of the things you are working on, in the
>validator core. I guess it is less the case for charset detection/
>conversion.

Well, I need the functionality for a number of things other than the
Validator, so I am going to update the HTML::Encoding module on CPAN
that I wrote in 2001. I have not yet decided whether it will do
transcoding; if it won't, there will probably be some other module to
do it (just using Encode or Text::Iconv would not suffice, as those
might fail, so there would likely be trial-and-error code somewhere).
Any work you do in this regard is thus likely to overlap with mine in
some way. What do you suggest we do to avoid duplicating effort here?
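To sketch the trial-and-error idea (in Python rather than Perl, and
with a candidate list and ordering that are my own assumptions, not
anything HTML::Encoding actually does):

```python
def try_decode(data, candidates=("utf-8", "iso-8859-2", "iso-8859-1")):
    """Attempt each candidate encoding in turn and return the first
    that decodes cleanly, together with the decoded text.

    Note: ISO-8859-1 assigns a character to every byte value, so with
    it last in the list this function effectively never fails; a real
    detector would need a smarter policy than mere decodability."""
    for name in candidates:
        try:
            return name, data.decode(name)
        except UnicodeDecodeError:
            continue
    raise ValueError("no candidate encoding could decode the input")
```

The point is only that a single call to a conversion library is not
enough: something has to drive the retries and decide which candidate
wins when several would succeed.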

>Also, having several people go out and create their modules, and
>then find out that while these modules may all make sense one way
>or another, they just don't fit together at all, is a problem, too.

That's why we try to discuss this on the list / wiki / #validator.
But I am not sure how these might fail to fit together...

>Are you talking about HTTP? Or about <meta>?
>In the case of HTTP, that should be dealt with before calling
>the charset detection/conversion module.

Interesting. It would seem to me that passing an HTTP::Response object
would be the most natural way to interact with the module (unless, of
course, there is no HTTP::Response object that could be passed), hence
such processing seems clearly part of the detection code, especially
considering that it is non-trivial, as is easily demonstrated by

  http://www.bjoernsworld.de/temp/iso-8859-2-with-escaped-charset.html

My tests suggest that IE/Win, Amaya, Mozilla, Safari and the Validator
all have different opinions on what the encoding of that document is
supposed to be; the Validator in particular is unable to process the
response (which, unlike some of the examples I gave before, is
perfectly legal). A similar example would be

  http://www.bjoernsworld.de/temp/complicated-content-type.html

Again perfectly legal, yet the Validator fails to process it properly.
These are of course rare cases and not generally relevant, but our
users expect the Validator to get everything right, and so do I.
I do not have any statistics on how common these cases are.
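Part of what makes these cases hard is that the charset parameter of a
Content-Type header may be a quoted-string, so a naive regular
expression is not enough. A Python sketch that delegates to a real
header parser (this is not how the Validator does it, just an
illustration of letting a parser handle the quoting rules):

```python
from email.message import Message

def charset_from_content_type(value):
    """Extract the charset parameter from a Content-Type header value.

    Using the email.message parser means quoted-string parameters
    like charset="iso-8859-2" are handled for us."""
    msg = Message()
    msg["Content-Type"] = value
    # Returns the charset lower-cased, or None if there is no
    # charset parameter at all.
    return msg.get_content_charset()
```

Even this stdlib parser does not implement every quoted-pair corner of
the HTTP grammar, which is exactly why such documents are useful test
cases.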

>For all this stuff, this is the biggest issue. And although
>much of error reporting with respect to charset detection/conversion
>is handled quite specially, it would be good to have a general
>solution. I can immagine several ways to do this:

The first problem would be to figure out which errors are actually
detected and/or reported by which part of the code. You might, for
example, write the actual Detector so that it stops right after the
XML declaration in your example and does not worry about any <meta>
element at all, or it might find both and just report them back to
the application, which could then compare them if it cares and do
whatever it likes. Again, for your example, you might have to deal
with users who consider ISO-8859-1 and ISO_8859-1 not to be in
conflict, while others might. Just as people tend to disagree about
what the encoding of the document http://www.example.org/

  ...
  Content-Type: text/html

  <!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN"
    "http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd">
  <html xmlns="http://www.w3.org/1999/xhtml"><head>
  <meta http-equiv = "Content-Type" content =
    "text/html;charset=iso-8859-2" />
  <title></title></head><body><p>...</p></body></html>

would be.
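The ISO-8859-1 vs. ISO_8859-1 question comes down to how loosely you
compare charset names. A Python sketch of a loose comparison (ignoring
case and punctuation, in the spirit of common charset-alias matching;
a stricter policy could require exact equality, and a fuller one would
also ignore leading zeroes in digit runs):

```python
import re

def charsets_conflict(a, b):
    """Return True if the two charset names should be treated as
    conflicting under a loose comparison: case and any character
    outside [a-z0-9] are ignored, so ISO-8859-1 and ISO_8859-1 are
    considered the same encoding."""
    def norm(name):
        return re.sub(r"[^a-z0-9]", "", name.lower())
    return norm(a) != norm(b)
```

Which policy the Detector adopts, or whether it simply reports both
names and leaves the decision to the application, is exactly the kind
of interface question that needs settling up front.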

>All this would leave the responsibility for the error messages with
>the module client, which I think is probably the right thing to do.

Agreed.

Received on Friday, 24 September 2004 17:57:33 UTC