Re: [WMVS] Some initial thoughts on code M12N...

-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1

Martin Duerst <duerst@w3.org> wrote:

>Hello Terje,
>Some comments on an older message of yours:

Thanks!


>Moving towards modularization is a good idea, I think.

I suspect that's about the limit of what we all agree on as yet. :-)

In this and in all the below, be aware that there is no absolute consensus
reached as yet and in some things I am in the minority when it comes to the
specifics. Some issues are sufficiently contested to fall into the
«controversial» category, and others are merely undecided as yet.

I'll try to reflect the consensus below, but since it's intermixed with
subjective statements I may not always succeed.


>Having that stuff in a separate location would be okay by me. But I
>think that I would want to move gradually from reorganizing the code a
>bit more inside 'check', to working out data structures inside 'check',
>to having packages inside 'check', to actually moving the code into its
>own file/module. At least that's what I think would work best for the
>charset detection/conversion part, and for me. Different strategies may
>work better for other parts of the code, and for other people.

Right, charset stuff may have specific issues that make one model more
suitable than others. The above is in the general context...


The idea is that — and let me just use the charset stuff as an example here —
you start to build an external, standalone, module that reimplements all the
functionality we need for the Validator. This module is brought to about Beta
quality and ideally is generic enough to be released standalone on CPAN.

While this happens, the charset code in WMVS stays more or less the same. If
the development process of the module reveals evolutionary improvements that
can be made in terms of datastructures — or error handling, or whatever — in
the current code, then that part of it gets implemented as makes sense in
light of current release plans and such.

Only when the CPAN module is considered mature enough, and release plans for
WMVS suggest it, do we rip out all the charset code in “check” and replace it
with new code that uses the CPAN module.


This has several advantages. One of which is that it forces us to treat the
module as a “Black Box”, which tends to improve the interface and make it more
general and encourages better documentation of the interface. Another is that
it doesn't destabilize the WMVS code while the CPAN module is being developed
(which is important as any given module, in this case the charset stuff, will
be one among several such modules) and keeps multiple parallel developments
of M12N features from contaminating each other.

The downsides will be there too, of course, but IMO they are not all that bad.
The main downsides I can see are that it burdens the developer with dealing
with two sets of code for the same task until we make the switch, that it
forces discipline in determining what improvements go in immediately and what
will have to wait until we're ready to make the switch, and finally, that it
takes longer before we can take advantage of the new features in WMVS.

That there is overhead associated with developing and maintaining a CPAN
module instead of just using an internal, application-specific, module I do
not count among the downsides; simply because I think it is desirable to
produce reusable code modules that third parties can take advantage of.


I'm not certain how well this reflects the consensus of the group (the most
vehement feedback has been on other issues), but if you view this as a point
on a scale you might extrapolate the two extremes and the likely range along
which relevant opinions will tend to fall.


>I generally agree, but in many cases, moving the code around cannot be
>done without noticing things that can be cleaned up, and holding back on
>things that seem obvious to fix can be really stressy.

I agree 100%; but this needs to be a tradeoff with other development concerns,
such as the risk of destabilizing the code and throwing off our release
schedule. Some new features and improvements are natural to add immediately
while doing the M12N, but IMO the first round should be very cautious about
adding any improvement that would push back the point at which we reach
M12N.

IOW, IMO, it's a judgement call; and there is room for disagreement on the
exact weighting in the tradeoffs to be made.


>I think the first steps towards a module would be:
>- Move (in terms of location inside 'check') the charset
>  detection/conversion/checking code together to one place.
>- Move (in terms of when it's called) production of error/warning
>  messages towards the end of the whole detection/conversion/checking
>  code.
>- Narrow the interface, by passing a smaller set of parameters
>  rather than just $File.

Yes, this sounds like a good approach to internal refactoring in preparation
for actually switching to an M12N codebase. It's the sort of thing I've been doing
— with variable success :-) — over the last 5+ releases, and which we, IMO,
should definitely continue to do on the road ahead.
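
To make the last two quoted steps concrete, here is a minimal sketch of a
narrowed interface that collects diagnostics instead of printing them as it
goes. It is written in Python purely for illustration (the real code would be
Perl), and the names CharsetResult and check_charset are hypothetical; nothing
like them exists in «check» today.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class CharsetResult:
    """Everything the caller needs back; no shared $File-style state."""
    charset: str
    text: Optional[str] = None
    messages: List[str] = field(default_factory=list)


def check_charset(raw: bytes, declared: Optional[str]) -> CharsetResult:
    """Decode raw bytes, collecting diagnostics rather than emitting them.

    Assumption for this sketch: fall back to utf-8 when nothing is declared.
    """
    result = CharsetResult(charset=declared or "utf-8")
    if declared is None:
        result.messages.append("warning: no charset declared; assuming utf-8")
    try:
        result.text = raw.decode(result.charset)
    except LookupError:
        # Python does not know this encoding name at all.
        result.messages.append(f"error: unknown charset {result.charset!r}")
    except UnicodeDecodeError as exc:
        result.messages.append(
            f"error: invalid {result.charset} byte sequence at offset {exc.start}"
        )
    return result
```

The caller decides when and how to report result.messages, which is what lets
the message production move to the end of the whole
detection/conversion/checking pass.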


>>* XML::Charset
>>- Detect charset of XML docs using the algorithm from Appendix F.
>
>As I said, that's probably too small for a module.

I disagree. As an internal module it's overkill and should probably just be a
utility function somewhere. But in the context of general reusable code, it
makes perfect sense as a standalone module. It's the type of thing that would
get reused in other modules as well as applications; and would prevent people
reimplementing this again and again.
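
As a rough idea of how little, and yet how fiddly, the code is, here is a
sketch of the Appendix F byte-sniffing in Python (for illustration only; the
module itself would be Perl). The function name sniff_xml_charset and the
returned labels are my own invention, and the sketch leaves out the unusual
UCS-4 octet orders that Appendix F also lists.

```python
import re

# Byte-order marks. Longer prefixes come first so that FF FE 00 00
# (UCS-4 LE) is not mistaken for FF FE (UTF-16 LE).
_WITH_BOM = [
    (b"\x00\x00\xfe\xff", "UCS-4BE"),
    (b"\xff\xfe\x00\x00", "UCS-4LE"),
    (b"\xef\xbb\xbf", "UTF-8"),
    (b"\xfe\xff", "UTF-16BE"),
    (b"\xff\xfe", "UTF-16LE"),
]

# No BOM: the first bytes of "<?xm" (or "<") as they appear in each family.
_WITHOUT_BOM = [
    (b"\x00\x00\x00\x3c", "UCS-4BE"),
    (b"\x3c\x00\x00\x00", "UCS-4LE"),
    (b"\x00\x3c\x00\x3f", "UTF-16BE"),
    (b"\x3c\x00\x3f\x00", "UTF-16LE"),
    (b"\x4c\x6f\xa7\x94", "EBCDIC"),
]

# encoding="..." pseudo-attribute of an ASCII-compatible XML declaration.
_DECL = re.compile(
    rb"""^<\?xml[^>]*?\sencoding\s*=\s*["']([A-Za-z][A-Za-z0-9._-]*)["']"""
)


def sniff_xml_charset(data: bytes) -> str:
    """Return an encoding (or encoding family) label for an XML entity."""
    for prefix, name in _WITH_BOM + _WITHOUT_BOM:
        if data.startswith(prefix):
            return name
    if data.startswith(b"<?xm"):
        # ASCII-compatible encoding: the declaration itself is readable.
        match = _DECL.match(data[:1024])
        if match:
            return match.group(1).decode("ascii")
    # No BOM and no encoding declaration: XML says UTF-8.
    return "UTF-8"
```

For the UTF-16, UCS-4 and EBCDIC families a complete module would go on to
decode the declaration in the detected family and read its encoding name; the
sketch stops at the family.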


>HTTP::Charset is even smaller and more boring than XML::Charset: just
>look at the content type.

Don't be fooled by the off-the-cuff name of the module; our charset code does
a _lot_ more than just look at the Content-Type. Maybe a better name would be
«HTTP::Charset::Heuristic», which would do all the charset determination rules
we use in «check» today, plus have options to allow, e.g., a <meta> element to
override the Content-Type (which we don't currently do) etc.
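
To show the kind of interface I mean (and only that; this is not a
transcription of the rules in «check»), here is a simplified Python sketch.
The function guess_charset, its meta_overrides_http option and the iso-8859-1
default are all assumptions made for the example.

```python
import re
from typing import Optional, Tuple

# charset parameter of a Content-Type header value.
_CT_CHARSET = re.compile(r'charset\s*=\s*"?([^\s;"]+)', re.IGNORECASE)
# charset inside a <meta> element, either the http-equiv or the short form.
_META_CHARSET = re.compile(
    rb'<meta[^>]+charset\s*=\s*["\']?([A-Za-z0-9._:-]+)', re.IGNORECASE
)


def guess_charset(
    content_type: Optional[str],
    body: bytes,
    meta_overrides_http: bool = False,
    default: str = "iso-8859-1",
) -> Tuple[str, str]:
    """Return (charset, source), source being 'http', 'meta' or 'default'."""
    http = None
    if content_type:
        match = _CT_CHARSET.search(content_type)
        if match:
            http = match.group(1).lower()

    meta = None
    # Only look near the top of the document (arbitrary 2048-byte window).
    match = _META_CHARSET.search(body[:2048])
    if match:
        meta = match.group(1).decode("ascii").lower()

    if meta_overrides_http and meta:
        return meta, "meta"
    if http:
        return http, "http"
    if meta:
        return meta, "meta"
    return default, "default"
```

Returning the source along with the charset lets the caller warn when, say,
the HTTP header and the <meta> element disagree.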


>And yes, a 'DWIM' module is indeed what we should end up with, because
>that's what others will be able to use.

The problem with a DWIM module is that it often fails to get the «WIM» bit but
forces the «D» part; a lower level interface makes sure people who know better
than us (our code) what they want are able to do it without the module getting
in the way.

So DWIMmery is good; but all things in moderation. :-)


>'charlint' isn't a product of the I18N WG, it's just written by me. It's
>hopelessly slow because it takes about 10' to load in the Unicode
>Character Database file, but if we manage to have this 'preloaded' in
>memory, that problem will go away.

Let's look at how much of Charlint makes sense to implement in WMVS. Charset
issues are _hard_; the more of them we can flag and help users with the
better.


>Looks like for you, charset handling would rather go into 'fetch' than
>into 'parse'?

Yes. But possibly this would happen by having W3C::UA subclass LWP::UA _and_
W3C::Charset. The reason for this is that it makes sure that when a document
is returned to you, it'll be in UTF-8 and ready to be parsed further.
Discovering/Determining metadata and preprocessing the document does not
belong in the “parsing” portion of the code (IMO, YMMV, etc., etc.).
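
In rough outline, and in Python rather than Perl purely for illustration, the
shape would be something like the sketch below. BaseAgent is a stand-in that
returns canned responses instead of doing HTTP, and all three class names are
hypothetical; none of this is LWP's or any existing module's API.

```python
from typing import Dict, Tuple


class CharsetMixin:
    """Charset knowledge, kept separate from fetching."""

    def to_utf8(self, raw: bytes, charset: str) -> bytes:
        # Transcode: decode from the source charset, re-encode as UTF-8.
        return raw.decode(charset).encode("utf-8")


class BaseAgent:
    """Stand-in for a real user agent; serves canned (charset, bytes) pairs."""

    def __init__(self, pages: Dict[str, Tuple[str, bytes]]) -> None:
        self.pages = pages

    def get(self, url: str) -> Tuple[str, bytes]:
        return self.pages[url]


class TranscodingAgent(BaseAgent, CharsetMixin):
    """Fetches a document and always hands it back as UTF-8 bytes."""

    def get_utf8(self, url: str) -> bytes:
        charset, raw = self.get(url)
        return self.to_utf8(raw, charset)


agent = TranscodingAgent({"http://example.org/": ("iso-8859-1", b"caf\xe9")})
document = agent.get_utf8("http://example.org/")
```

Whatever parses the document afterwards never has to know what charset it
arrived in.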



>BTW, I guess you mean 'transcoding' rather than 'transliteration'.

Remind me… What's the definition of each of those again?


>As I tried to say above, at least for me personally, and for the case of
>charset handling, I disagree. I have some general ideas as outlined
>above, but from here onwards, my guess is that I'll make more progress
>working on the actual code than doing more planning only.

Iff you follow the outline I gave up at the beginning of this message, the
“planning” would consist of developing the module, including POD, and
releasing it on CPAN; the interface adjustments to make it fit into WMVS would
happen along the way, including any moderate tweaks to current code to prepare
the way.

But that way we also get to hash out the interfaces and stuff before actually
implementing it __in_the_validator__, without letting formalisms such as
design-up-front etc. keep actual code from getting written.


>But I feel ready for some coding if somebody tells
>me where that should happen (which CVS branch,...).

Currently, the «validator-0_6_0-branch» is in maintenance mode; no new stuff
goes in there apart from any minor bugfixes, and even they are questionable
(there may never be another 0.6.x release).

The current code in HEAD is on its way to becoming a 0.7.0 release. I'd
intended to make a «validator-0_7-branch» to release/maintain the 0.7.x series
of releases from — freeing up HEAD for further work towards 0.8.0 etc. — but
due to branchophobia among the rest of the gang we've ditched that idea.

I guess this means we'll move right into 0.8.0 development once 0.7.0 is
released — including holding back other development while 0.7.0 is baking
— and will probably have to go back and make a branch from a release tag,
retroactively, if it turns out we need more releases in the 0.7.x series.

The intent, as I understand it, is that more frequent moderate releases (e.g.
0.7.0->0.8.0) will obviate the need for minor releases (e.g. 0.7.0->0.7.1) in
between major releases (0.7->0.8).


So…

Any changes that are small and necessary for 0.7.0 should go in HEAD; anything
else will have to happen outside CVS or in a dedicated branch (e.g.
«validator-m12n-charset-branch»).



>I'll also want to use checklink as a second user of the eventual module.

Sounds like a good idea, but make sure you coordinate with Ville before
touching checklink; he's the one shepherding checklink and I know he's done
quite a bit of thinking on checklink's architecture and future direction.




HTH, -link
- -- 
When I decide that the situation is unacceptable for me, I'll simply fork
the tree.   I do _not_ appreciate being enlisted into anyone's holy wars,
so unless you _really_ want to go _way_ up in my  personal shitlist don't
play politics in my vicinity.                   -- Alexander Viro on lkml

-----BEGIN PGP SIGNATURE-----
Version: PGP SDK 3.0.3

iQA/AwUBQVBA+aPyPrIkdfXsEQIB3ACeKL+BWw+d4DnkuiV7/LxFLla7DXQAn33R
JT2aciQcUZqwWcJ+EprXF74q
=P1qb
-----END PGP SIGNATURE-----

Received on Tuesday, 21 September 2004 14:55:59 UTC