Re: [WMVS] Some initial thoughts on code M12N...

Hello Bjoern,

At 19:20 04/09/21 +0200, Bjoern Hoehrmann wrote:
>* Terje Bless wrote:
> >The idea is that (and let me just use the charset stuff as an example
> >here) you start to build an external, standalone, module that
> >reimplements all the functionality we need for the Validator. This
> >module is brought to about Beta quality and ideally is generic enough
> >to be released standalone on CPAN.
>
>I strongly agree here, especially from a management perspective. We are
>all interested in M12N and have some ideas for it, but these tend to be
>not too well expressed and coordinated, and might even conflict, if we
>all try to do this inside check we would quickly run into problems.

That may well be the case for some of the things you are working on,
in the validator core. I guess it is less the case for charset
detection/conversion.

Also, having several people go out and create their modules, and
then find out that while these modules may all make sense one way
or another, they just don't fit together at all, is a problem, too.


>Development independent of check would have the following benefits:
>
>   * general purpose re-usable code, easier to create new services for
>     similar tasks, http://qa-dev.w3.org/~bjoern/appendix-c/validator/
>     for example does not deal with 'charset' parameters in HTTP headers
>     and supports only a limited set of encodings just because there is
>     no module that makes that easy

Definitely a good goal. But the goal of a module that can be used
independently and the method of reaching that goal can be quite
independent.


>   * proper documentation, helps among other things expectation manage-
>     ment, you should know from the documentation what the code is
>     supposed to do and probably how it achieves that (revealing bugs
>     without looking at the code...)

Proper documentation is always a good idea.

>   * broad review of code, it's easier to go through smaller packages
>     looking for bugs, shortcomings, etc. it is thus more likely to
>     happen; also, it eases platform independent code as there are the
>     CPAN testers who grab modules from CPAN and run their test suites
>     on their system, informing you of failures, etc.
>
>   * easier for outsiders to contribute patches, etc. because they don't
>     have to figure out all of check first (installing the Validator to
>     test changes which would require installing a web server, etc)
>
>   * easier for insiders to focus on their code, you would be generally
>     responsible for your modules and can work on them as it suits you
>     best without figuring out code from others in check, or need to
>     discuss changes with others before implementing them, etc.

Well, yes, except that there are actually interactions, in particular
as the code and the functionality look now.


>   * proper test suite close to the relevant code, for example, no need
>     to test the doctype detection code through screen scraping the HTML
>     results document of the Validator if the code is elsewhere and has
>     its own test suite
>
>   * avoids duplicating code across check and checklink, etc. if there
>     is a bug you only need to fix it in one place rather than many

Definitely. Please stop trying to convince me that a module is a good
thing; I already agree. What we seem to disagree on is how to get there.
I don't want to force you to use my way to get there for what you are
doing. But I don't want you to force me to work in a way that doesn't
fit my work style and the problem at hand (as long as, of course,
that doesn't cause disruption).


>The downsides are that it might take longer for changes to get applied
>to the release version and that it requires more work (you'd have to
>write documentation, test suites, ask for feedback on relevant mailing
>lists, think about module names and interfaces, ...) in fact, a lot more
>work than just hacking some bits of the code on some rainy afternoon,

Sorry, but you get me wrong. That's not what I propose. That's just
where I propose to start, because for the issues I'm looking at, that
looks like it will help understand things better.


>but I think it is certainly worth doing. As far as I am concerned it is
>much simpler to develop a stand-alone module than messing with `check`,
>http://www.w3.org/mid/41573fd6.153598893@smtp.bjoern.hoehrmann.de for
>example had complete code with 70+ tests and some documentation in about
>an hour. Trying to figure out how these things work in check today along
>with discovering bugs, reporting them, and trying to build on top of
>that would have taken much longer.

If that's the best way for you to work, please go ahead! I'm sure
it will at times be the best way to work for me.


>I would go even further than Terje and say that we should avoid making
>"improvements" inside check and rather make these improvements available
>through new, external modules and stabilize these modules so that the
>code in check could be replaced ASAP and only then benefit from these
>changes.
>
> >>HTTP::Charset is even smaller and more boring than XML::Charset: just
> >>look at the content type.
> >
> >Don't be fooled by the off-the-cuff name of the module; our charset
> >code does a _lot_ more than just look at the Content-Type. Maybe a
> >better name would be "HTTP::Charset::Heuristic", which would do all
> >the charset determination rules we use in "check" today, plus have
> >options to allow, e.g., a <meta> element to override the Content-Type
> >(which we don't currently do) etc.
>
>HTTP:: would be a bad namespace then...

Agreed.


>I also disagree with Martin,
>even just extracting the charset is not just one line of code, you
>have to deal with cases such as
>
>   Content-Type: text/html
>   Content-Type: text/html;charset=iso-8859-1
>
>or
>
>   Content-Type: text/html;charset=iso-8859-1
>   Content-Type: text/html;charset=utf-8
>
>or
>
>   Content-Type: text/html;note="charset='iso-8859-1'";charset=utf-8
>
>or
>
>   Content-Type: text/html
>    ;charset=
>    utf-8
>
>or
>
>   Content-Type: text/html;charset="utf-8'
>
>or
>
>   Content-Type: text/html;version="...";charset=iso-8859-1
>
>and so on, that's certainly something that can go into its own .pm.

Are you talking about HTTP? Or about <meta>?
In the case of HTTP, that should be dealt with before calling
the charset detection/conversion module. Also, I wonder
how frequent all the cases above are. I haven't heard about
the weird cases (e.g. double info) in the wild, but I guess
there are a few.
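
To make the cases above concrete, here is a minimal sketch of what
extracting the charset parameter involves, wherever that code ends up
living. The helper name charset_params and the wording of its problem
notes are assumptions made for illustration; none of this is taken
from check. It unfolds continuation lines, honours quoted strings, and
notes malformed or conflicting values instead of guessing:

```perl
use strict;
use warnings;

# Hypothetical helper (not taken from check): collects every charset
# parameter from a list of Content-Type header values and notes problems.
sub charset_params {
    my @headers = @_;
    my (@charsets, @problems);
    for my $header (@headers) {
        my $value = $header;
        $value =~ s/\r?\n[ \t]+/ /g;    # unfold continuation lines
        $value =~ s/^[^;]*//;           # drop the media type itself
        # Consume one ";name=" at a time, then the value that follows it.
        while ($value =~ /\G\s*;\s*([^\s=;]+)\s*=\s*/gc) {
            my $name = lc $1;
            my $param;
            if ($value =~ /\G"((?:[^"\\]|\\.)*)"/gc) {
                ($param = $1) =~ s/\\(.)/$1/g;    # undo backslash escapes
            }
            elsif ($value =~ /\G([^\s;"]+)/gc) {
                $param = $1;
            }
            else {
                push @problems, "malformed value for '$name' parameter";
                last;
            }
            push @charsets, lc $param if $name eq 'charset';
        }
    }
    my %seen;
    my @distinct = grep { !$seen{$_}++ } @charsets;
    push @problems, "conflicting charsets: @distinct" if @distinct > 1;
    return (\@distinct, \@problems);
}

# Two of the awkward headers quoted above.
my ($found, $problems) = charset_params(
    'text/html;note="charset=\'iso-8859-1\'";charset=utf-8',
    qq{text/html;charset="utf-8'},
);
print "charsets: @$found\n";
print "problem: $_\n" for @$problems;
```

The sketch only collects values and problem notes; which value wins
when several are present is left to the caller.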


>Specifically if you add additional complexity such as reporting the
>flaws in headers as those above back to the application so it can
>report these to the user.

For all this stuff, this is the biggest issue. And although
much of error reporting with respect to charset detection/conversion
is handled quite specially, it would be good to have a general
solution. I can imagine several ways to do this:

The detection/conversion module exposes a series of parameters
or methods that can be used by the calling part after calling
the actual 'workhorse' method, e.g., in pseudocode:
$converted = Charset::DetectConvert->work($input, ...);
if ($converted->ConflictMetaXML) {
     # produce error message saying that
     # $converted->CharsetMeta and $converted->CharsetXML conflict
}

Another user (such as checklink) that does not care about these
details would simply leave out the 'if' part (because CharsetXML has
precedence, except of course if the data isn't actually XML).
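
A runnable sketch of this first variant might look as follows. The
package and method names are the ones from the pseudocode and are
hypothetical; the extra Charset method and the two regular expressions
are assumptions of mine and are deliberately naive. The sketch only
detects; it does not convert anything:

```perl
use strict;
use warnings;

package Charset::DetectConvert;   # hypothetical name from the pseudocode

sub work {
    my ($class, $input) = @_;
    my %found;
    # Encoding named in the XML declaration, if there is one.
    ($found{xml}) =
        $input =~ /^<\?xml[^>]*\bencoding\s*=\s*["']([^"']+)["']/;
    # Encoding named in a <meta> element (naive pattern, sketch only).
    ($found{meta}) =
        $input =~ /<meta[^>]*\bcharset\s*=\s*["']?([\w.:-]+)/i;
    for (values %found) { $_ = lc $_ if defined $_ }
    return bless \%found, $class;
}

sub CharsetXML  { $_[0]{xml} }
sub CharsetMeta { $_[0]{meta} }

# The XML declaration takes precedence over <meta>.
sub Charset { $_[0]{xml} // $_[0]{meta} }

sub ConflictMetaXML {
    my $self = shift;
    return defined $self->{xml}
        && defined $self->{meta}
        && $self->{xml} ne $self->{meta};
}

package main;

my $input = <<'END';
<?xml version="1.0" encoding="UTF-8"?>
<html><head>
<meta http-equiv="Content-Type" content="text/html; charset=iso-8859-1"/>
</head></html>
END

my $converted = Charset::DetectConvert->work($input);
if ($converted->ConflictMetaXML) {
    printf "<meta> says %s but the XML declaration says %s\n",
        $converted->CharsetMeta, $converted->CharsetXML;
}
print "using ", $converted->Charset, "\n";
```

A client like checklink would call work() and Charset() and never look
at ConflictMetaXML.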

Other ways of doing it would be registered callbacks (register a method
for 'ConflictMetaXML', which gets called when there is such a conflict)
and subclassing (have an empty ConflictMetaXML method, which is overwritten
by a subclass that wants special behavior such as error messages).
The disadvantage in both of these cases would be that the client has
no control over the order in which the errors are dealt with.
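
Both of these variants can be sketched in a few lines. The class names,
the _notify helper, and the work() method that takes already-extracted
values are all hypothetical and exist only to show the control flow:

```perl
use strict;
use warnings;

package Charset::Detector;        # hypothetical class name

sub new {
    my ($class, %callbacks) = @_;
    return bless { callbacks => \%callbacks }, $class;
}

# Default handler does nothing; a subclass may override it.
sub ConflictMetaXML { return }

# Called by the detection code at the moment a problem is noticed:
# a registered callback wins, otherwise the method of that name runs.
sub _notify {
    my ($self, $event, @details) = @_;
    my $callback = $self->{callbacks}{$event};
    return $callback ? $callback->(@details) : $self->$event(@details);
}

# Stand-in for the real work: takes values that were already extracted.
sub work {
    my ($self, %seen) = @_;
    $self->_notify(ConflictMetaXML => @seen{qw(meta xml)})
        if defined $seen{meta}
        && defined $seen{xml}
        && $seen{meta} ne $seen{xml};
    return $seen{xml} // $seen{meta};
}

package Charset::Detector::Verbose;   # the subclassing variant
our @ISA = ('Charset::Detector');

sub ConflictMetaXML {
    my ($self, $meta, $xml) = @_;
    push @{ $self->{messages} },
        "<meta> says $meta but the XML declaration says $xml";
    return;
}

package main;

# Variant 1: registered callback.
my @log;
my $with_callback = Charset::Detector->new(
    ConflictMetaXML => sub { push @log, "conflict: @_" },
);
$with_callback->work(meta => 'iso-8859-1', xml => 'utf-8');

# Variant 2: subclass with its own handler.
my $subclassed = Charset::Detector::Verbose->new;
$subclassed->work(meta => 'iso-8859-1', xml => 'utf-8');

print "$_\n" for @log, @{ $subclassed->{messages} };
```

In both variants the handler runs from inside work(), at whatever point
the module happens to notice the problem, which is where the ordering
concern comes from.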

All this would leave the responsibility for the error messages with
the module client, which I think is probably the right thing to do.

The other way would be to have the module come up with a series of
messages that it would pass to the caller. This would be closer
to how things work for the actual validation. But it would make
the interface quite a bit more complicated.
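
For comparison, a sketch of that last variant. The function name, the
message fields (level, code, text), and the precedence order HTTP, then
XML declaration, then <meta> are assumptions made for the example, not
a description of what check does today:

```perl
use strict;
use warnings;

# Hypothetical interface: returns the chosen charset and a list of
# message records that the caller can format, filter, or ignore.
sub detect_with_messages {
    my (%seen) = @_;    # already-extracted values: http, xml, meta
    my @messages;

    # Assumed precedence order, highest first.
    my @sources = grep { defined $seen{$_} } qw(http xml meta);
    my $chosen  = @sources ? $seen{ $sources[0] } : undef;

    push @messages, {
        level => 'warning',
        code  => 'no-charset',
        text  => 'No character encoding information found',
    } unless @sources;

    # Every lower-priority source that disagrees yields one message.
    for my $source (@sources[1 .. $#sources]) {
        next if $seen{$source} eq $chosen;
        push @messages, {
            level => 'error',
            code  => "conflict-$sources[0]-$source",
            text  => "$source says $seen{$source}, "
                   . "but $sources[0] says $chosen",
        };
    }
    return ($chosen, \@messages);
}

my ($charset, $messages) = detect_with_messages(
    http => 'utf-8',
    meta => 'iso-8859-1',
);
print "using $charset\n";
print "[$_->{level}] $_->{code}: $_->{text}\n" for @$messages;
```

Here the module owns the wording of the messages, and the caller has to
know the record format, which is the extra interface complexity
mentioned above.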


Any comments?     Regards,    Martin.

Received on Wednesday, 22 September 2004 09:29:31 UTC