
Re: [WMVS] Some initial thoughts on code M12N...

From: Martin Duerst <duerst@w3.org>
Date: Tue, 28 Sep 2004 14:31:24 +0900
Message-Id: <>
To: Bjoern Hoehrmann <derhoermi@gmx.net>
Cc: public-qa-dev@w3.org

At 19:56 04/09/24 +0200, Bjoern Hoehrmann wrote:

>* Martin Duerst wrote:
> >That may well be for some of the things you are working on, in the
> >validator core. I guess it is less the case for charset detection/
> >conversion.
>Well, I need the functionality for a number of other things than the

What other things are these?

>and I am thus going to update the HTML::Encoding module on
>CPAN that I wrote in 2001; I haven't yet decided whether it will do
>transcoding but in case it won't there'll probably be some other module
>to do it (just using Encode or Text::Iconv would not suffice as that
>might fail so there would likely be trial and error code somewhere).

What kind of trials and errors would this include? Are you thinking
about some encoding detection heuristics, or something else?

>any work you might do in this regard likely conflicts with my work in
>some way. What do you suggest we can do to avoid making duplicate
>efforts here?

Okay. I wasn't aware that you were working on this.

> >Also, having several people go out and create their modules, and
> >then find out that while these modules may all make sense one way
> >or another, they just don't fit together at all, is a problem, too.
>That's why we try to discuss this on the list / wiki / #validator.
>But I am not sure how these might fail to fit together...
> >Are you talking about HTTP? Or about <meta>?
> >In the case of HTTP, that should be dealt with before calling
> >the charset detection/conversion module.
>Interesting. It would seem to me that passing a HTTP::Response object
>would be the most natural way to interact with the module (unless of
>course there is no HTTP::Response object that could be passed) hence
>such processing seems clearly part of the detection code. Especially
>considering that this is non-trivial as easily demonstrated by
>   http://www.bjoernsworld.de/temp/iso-8859-2-with-escaped-charset.html

This is definitely non-trivial. But the question is whether this is
a non-triviality of HTTP or of the charset detection algorithm.
In my view, it is very clearly a non-triviality of HTTP (or MIME,
if you want). So it should be handled by the HTTP side, not by
the charset detection side. The IETF definition of 'charset'
does not include backslash escaping.

>My tests suggests that IE/Win, Amaya, Mozilla, Safari and the Validator
>all have a different opinion on what the encoding of that document is
>supposed to be, the Validator in particular is unable to process the
>response (which is perfectly legal unlike some of the examples I gave

It may be perfectly legal, but it's also perfectly weird.
The validator's response isn't very kind or easy to understand,
but I don't mind it saying 'there is something wrong here'.
Except for testing a borderline case, there is no reason for
anybody to ever do this. And I don't think we should invest too
much time in getting this kind of case fixed, one way or the
other (with a nicer error message or actual unescaping). If one
of the two falls out of some changes we make to the code anyway,
that's nice, but this case just isn't worth much of our time.

>A similar example would be
>   http://www.bjoernsworld.de/temp/complicated-content-type.html
>Again generally legal yet the Validator fails to process it properly.

Again an HTTP parameter syntax problem, not a charset detection
problem. There should be a method on the HTTP::Response object
that gives back the 'charset' parameter of the 'Content-Type'
header, unescaped. It's easy to see that it belongs there
(or to some third place), and not in the charset detection
code, because it's useful for other parameters and potentially
other headers, but not specific to charset detection.
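Such a method would essentially do the following (a Python sketch of the HTTP-side extraction; the regex and function name are illustrative, not from the validator or LWP):

```python
import re

# Hypothetical sketch: find the charset parameter in a Content-Type
# value and undo quoted-string backslash escaping on the HTTP side,
# before the result ever reaches the charset detection code.
def charset_from_content_type(value):
    m = re.search(r'charset\s*=\s*("(?:\\.|[^"\\])*"|[^;\s]*)', value, re.I)
    if not m:
        return None
    token = m.group(1)
    if token.startswith('"') and token.endswith('"'):
        # Inside a quoted-string, a backslash escapes the next character.
        token = re.sub(r'\\(.)', r'\1', token[1:-1])
    return token or None
```

With that in place, the escaped-charset test case above reduces to ordinary parameter parsing, and the detection module only ever sees a plain charset name.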

>These are of course rare cases and not generally relevant,

Glad to see you agree. Sorry, but I told you so above.

>yet our
>users expect from the Validator to get everything right, and so do I.

I agree that this should be the goal. But the wish for the
perfect now is the enemy of the good soon (such as release often).

>I do not have any statistics on how common these cases are.

Very, very uncommon; too uncommon for statistics to be useful.
My personal guess is that the two cases above are the only ones,
and that if there are others, they are also meta-instances
(cases made up to prove a point, not actually observed in the
wild).

> >For all this stuff, this is the biggest issue. And although
> >much of error reporting with respect to charset detection/conversion
> >is handled quite specially, it would be good to have a general
> >solution. I can imagine several ways to do this:
>The first problem would be to figure out which errors are actually
>detected and/or reported by which part of the code. You might, for
>example, write the actual Detector so that it stops right after the
>XML declaration in your example and not worry about any <meta>
>element at all or it might find both and just report them back to
>the application which could then compare them if it cares and do
>whatever it likes.

The second variant would probably be the right thing, at least
for the validator. Other uses, in essence, wouldn't look at
what's reported back from <meta>, so for them the result would
be the same, just with a little inefficiency. That inefficiency
could probably be reduced by making sure we only check things
that can be HTML for <meta>.
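That second variant could be sketched like this (Python, illustrative only; the real code would be Perl, and the regexes are simplified stand-ins for proper parsing):

```python
import re

# Hypothetical sketch: the detector reports every declared encoding
# it finds, and the application compares them if it cares.
def declared_encodings(raw):
    found = {}
    m = re.match(rb'<\?xml[^>]*encoding=["\']([A-Za-z0-9._-]+)["\']', raw)
    if m:
        found["xml-declaration"] = m.group(1).decode("ascii")
    # Only sniff a bounded prefix for <meta>, to limit the inefficiency.
    m = re.search(rb'charset=["\']?([A-Za-z0-9._-]+)', raw[:1024])
    if m:
        found["meta"] = m.group(1).decode("ascii")
    return found
```

The validator could then warn when the two entries conflict, while a client that only wants one answer simply picks the highest-priority entry and ignores the rest.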

>Again, for your example, you might have to deal
>with users who consider ISO-8859-1 and ISO_8859-1 not in conflict
>while others might do that.

Yes, getting more knowledge of aliases and stuff into such
a module would probably be something to do.
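The comparison itself is simple once a table of aliases is available; for instance, Python's codec registry already normalizes both spellings to the same canonical name (a sketch of the idea, not of any existing validator code):

```python
import codecs

# Compare two charset labels by resolving both through the codec
# registry's alias table and comparing the canonical names.
def same_charset(a, b):
    try:
        return codecs.lookup(a).name == codecs.lookup(b).name
    except LookupError:
        # An unknown label cannot be confirmed equal to anything.
        return False
```

A Perl module would need an equivalent alias table (e.g. the IANA registry names) to decide that ISO-8859-1 and ISO_8859-1 are not in conflict.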

>Just like people tend to disagree what
>the encoding of a document http://www.example.org/
>   ...
>   Content-Type: text/html
>   <!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN"
>     "http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd">
>   <html xmlns="http://www.w3.org/1999/xhtml"><head>
>   <meta http-equiv = "Content-Type" content =
>     "text/html;charset=iso-8859-2" />
>   <title></title></head><body><p>...</p></body></html>
>would be.

I know some people might claim that this is iso-8859-1,
but they would definitely be wrong; if not because of the
specs, then because of all the implementations out there.
Or did you mean something else?
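The precedence at play in this example could be sketched as follows (illustrative only, and deliberately ignoring BOMs and XML declarations; the function name is made up):

```python
# Hypothetical sketch: the HTTP header in the example carries no
# charset parameter, so honouring the <meta> declaration gives
# iso-8859-2; RFC 2616's ISO-8859-1 default for text/* is only a
# last resort.
def effective_charset(http_charset, meta_charset):
    if http_charset:      # an explicit HTTP charset parameter wins
        return http_charset
    if meta_charset:      # otherwise honour the in-document <meta>
        return meta_charset
    return "iso-8859-1"   # RFC 2616 default for text/* media types
```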

> >All this would leave the responsibility for the error messages with
> >the module client, which I think is probably the right thing to do.

Very good that we agree at least on this.

Regards,    Martin.
Received on Tuesday, 28 September 2004 05:56:00 UTC
