- From: Terje Bless <link@pobox.com>
- Date: Tue, 21 Sep 2004 16:55:54 +0200
- To: Martin Duerst <duerst@w3.org>
- cc: QA Dev <public-qa-dev@w3.org>
Martin Duerst <duerst@w3.org> wrote:

>Hello Terje,
>Some comments on an older message of yours:

Thanks!

>Moving towards modularization is a good idea, I think.

I suspect that's about the limit of what we all agree on as yet. :-)

In this, and in all the below, be aware that there is no absolute
consensus reached as yet, and on some things I am in the minority when
it comes to the specifics. Some issues are sufficiently contested to
fall into the «controversial» category, and others are merely undecided
as yet. I'll try to reflect the consensus below, but since it's
intermixed with subjective statements I may not always succeed.

>Having that stuff in a separate location would be okay by me. But I
>think that I would want to move gradually from reorganizing the code a
>bit more inside 'check', to working out data structures inside 'check',
>to having packages inside 'check', to actually moving the code into its
>own file/module. At least that's what I think would work best for the
>charset detection/conversion part, and for me. Different strategies may
>work better for other parts of the code, and for other people.

Right, the charset stuff may have specific issues that make one model
more suitable than others. The above is in the general context...

The idea is that -- and let me just use the charset stuff as an example
here -- you start to build an external, standalone module that
reimplements all the functionality we need for the Validator. This
module is brought to about Beta quality and is, ideally, generic enough
to be released standalone on CPAN.

While this happens, the charset code in WMVS stays more or less the
same. If the development process of the module reveals evolutionary
improvements that can be made in the current code in terms of data
structures -- or error handling, or whatever -- then that part gets
implemented as makes sense in light of current release plans and such.
Only when the CPAN module is considered mature enough, and release
plans for WMVS suggest it, do we rip out all the charset code in
"check" and replace it with new code that uses the CPAN module.

This has several advantages. One is that it forces us to treat the
module as a "Black Box", which tends to improve the interface, make it
more general, and encourage better documentation of the interface (a
rough sketch of the kind of interface I mean follows below). Another
is that it doesn't destabilize the WMVS code while the CPAN module is
being developed -- which is important, as any given module, in this
case the charset stuff, will be one among several such modules -- and
it keeps multiple parallel developments of M12N features from
contaminating each other.

The downsides will be there too, of course, but IMO they are not all
that bad. The main downsides I can see are that it burdens the
developer with dealing with two sets of code for the same task until we
make the switch; that it forces discipline in determining which
improvements go in immediately and which will have to wait until we're
ready to make the switch; and, finally, that it takes longer before we
can take advantage of the new features in WMVS. That there is overhead
associated with developing and maintaining a CPAN module, instead of
just using an internal, application-specific module, I do not count
among the downsides; simply because I think it is desirable to produce
reusable code modules that third parties can take advantage of.
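As a rough sketch of the «Black Box» interface idea -- the package
name, options and methods below are purely hypothetical, not existing
WMVS or CPAN code -- the module would expose only a narrow surface and
keep all the heuristics behind it:

    # Hypothetical standalone module: a narrow, documented interface
    # that neither knows nor cares about the Validator's internals.
    package W3C::I18N::Charset;    # name illustrative only

    use strict;
    use warnings;

    # The constructor takes only what the task needs, not the whole
    # $File hash that "check" passes around internally.
    sub new {
        my ($class, %opt) = @_;
        my $self = {
            content_type => $opt{content_type},  # e.g. "text/html; charset=utf-8"
            bytes        => $opt{bytes},         # raw octets of the document
        };
        return bless $self, $class;
    }

    # Returns the charset the module settled on; warnings about
    # conflicting declarations and the like would be reported alongside.
    sub detect {
        my $self = shift;
        # Detection logic lives here, behind the interface; callers only
        # ever see the result, never the heuristics.
        die "not implemented in this sketch";
    }

    1;

The point is only the shape: a caller constructs it from a couple of
well-defined inputs and asks one question, so the module can be
documented, tested and released independently of "check".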
I'm not certain how well this reflects the consensus of the group (the
most vehement feedback has been on other issues), but if you view this
as a point on a scale, you might extrapolate the two extremes and the
likely range along which the relevant opinions will tend to fall.

>I generally agree, but in many cases, moving the code around cannot be
>done without noticing things that can be cleaned up, and holding back on
>things that seem obvious to fix can be really stressy.

I agree 100%; but this needs to be traded off against other development
concerns, such as the risk of destabilizing the code and throwing off
our release schedule. Some new features and improvements are natural to
add immediately while doing the M12N, but IMO the first round should be
very cautious about anything in that vein that would trade the time
before we can reach M12N against adding improvements. IOW, IMO, it's a
judgement call; and there is room for disagreement on the exact
weighting in the tradeoffs to be made.

>I think the first steps towards a module would be:
>- Move (in terms of location inside 'check') the charset
>  detection/conversion/checking code together to one place.
>- Move (in terms of when it's called) production of error/warning
>  messages towards the end of the whole detection/conversion/checking
>  code.
>- Narrow the interface, by passing a smaller set of parameters
>  rather than just $File.

Yes, this sounds like a good approach to internal refactoring in
preparation for actually switching to an M12N codebase. It's the sort
of thing I've been doing -- with variable success :-) -- over the last
5+ releases, and which we, IMO, should definitely continue to do on the
road ahead.

>>* XML::Charset
>>  - Detect charset of XML docs using the algorithm from Appendix F.
>
>As I said, that's probably too small for a module.

I disagree. As an internal module it's overkill and should probably
just be a utility function somewhere. But in the context of general
reusable code, it makes perfect sense as a standalone module. It's the
type of thing that would get reused in other modules as well as in
applications, and it would keep people from reimplementing this again
and again.

>HTTP::Charset is even smaller and more boring than XML::Charset: just
>look at the content type.

Don't be fooled by the off-the-cuff name of the module; our charset
code does a _lot_ more than just look at the Content-Type. Maybe a
better name would be «HTTP::Charset::Heuristic», which would apply all
the charset determination rules we use in «check» today, plus have
options to allow, e.g., a <meta> element to override the Content-Type
(which we don't currently do), etc.; a rough sketch of that kind of
precedence follows further down.

>And yes, a 'DWIM' module is indeed what we should end up with, because
>that's what others will be able to use.

The problem with a DWIM module is that it often fails to get the «WIM»
bit but forces the «D» part; a lower-level interface makes sure that
people who know better than us (our code) what they want are able to do
it without the module getting in the way. So DWIMmery is good; but all
things in moderation. :-)

>'charlint' isn't a product of the I18N WG, it's just written by me. It's
>hopelessly slow because it takes about 10' to load in the Unicode
>Character Database file, but if we manage to have this 'preloaded' in
>memory, that problem will go away.

Let's look at how much of Charlint makes sense to implement in WMVS.
Charset issues are _hard_; the more of them we can flag and help users
with, the better.
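To make the XML::Charset and HTTP::Charset::Heuristic points a little
more concrete, here is a rough, hypothetical sketch of the two pieces:
BOM/declaration sniffing along the lines of XML 1.0 Appendix F (only
part of the detection table is shown), and the kind of precedence such
a heuristic could apply. None of this is the actual validator code;
the names and the exact precedence are illustrative assumptions only.

    use strict;
    use warnings;

    # Appendix F style sniffing that a hypothetical XML::Charset could
    # wrap (only a subset of the detection table is handled here):
    sub sniff_xml_encoding {
        my ($bytes) = @_;

        # With a Byte Order Mark:
        return 'UTF-8'    if $bytes =~ /^\xEF\xBB\xBF/;
        return 'UTF-16BE' if $bytes =~ /^\xFE\xFF/;
        return 'UTF-16LE' if $bytes =~ /^\xFF\xFE/;

        # No BOM: if the document starts with "<?xm" in an ASCII-compatible
        # family, read the encoding pseudo-attribute from the declaration.
        if (substr($bytes, 0, 4) eq '<?xm') {
            return $1
                if $bytes =~ /^<\?xml[^>]*encoding\s*=\s*["']([A-Za-z0-9._-]+)["']/;
            return 'UTF-8';    # no declaration means UTF-8
        }
        return undef;          # let the caller fall back to other metadata
    }

    # The kind of precedence a hypothetical HTTP::Charset::Heuristic
    # might apply; the <meta> override would be optional, as noted above.
    sub guess_charset {
        my ($content_type, $bytes, $meta_charset) = @_;
        return $1 if defined $content_type
                  && $content_type =~ /charset\s*=\s*"?([^\s";]+)"?/i;
        my $sniffed = sniff_xml_encoding($bytes);
        return $sniffed if defined $sniffed;
        return $meta_charset;  # may be undef; the caller decides what then
    }

As a standalone module, the first function is small but exactly the
sort of thing people keep reimplementing; the second is where the
«check» rules and any DWIM options would live.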
>Looks like for you, charset handling would rather go into 'fetch' than
>into 'parse'?

Yes. But possibly this would happen by having W3C::UA subclass LWP::UA
_and_ W3C::Charset. The reason for this is that it makes sure that when
a document comes back to you, it'll be in UTF-8 and ready to be parsed
further (a rough sketch of what that could look like is appended at the
end of this message). Discovering/determining metadata and
preprocessing the document does not belong in the "parsing" portion of
the code (IMO, YMMV, etc., etc.).

>BTW, I guess you mean 'transcoding' rather than 'transliteration'.

Remind me... What's the definition of each of those again?

>As I tried to say above, at least for me personally, and for the case of
>charset handling, I disagree. I have some general ideas as outlined
>above, but from here onwards, my guess is that I'll make more progress
>working on the actual code than doing more planning only.

Iff you follow the outline I gave at the beginning of this message, the
"planning" would consist of developing the module, including POD, and
releasing it on CPAN; the interface adjustments to make it fit into
WMVS would happen along the way, including any moderate tweaks to the
current code to prepare the way. That way we also get to hash out the
interfaces and such before actually implementing it
__in_the_validator__, while still not keeping actual code from getting
written because of formalisms such as design-up-front etc.

>But I feel ready for some coding if somebody tells
>me where that should happen (which CVS branch,...).

Currently, the «validator-0_6_0-branch» is in maintenance mode; no new
stuff goes in there apart from minor bugfixes, and even those are
questionable (there may never be another 0.6.x release). The current
code in HEAD is on its way to becoming a 0.7.0 release.

I'd intended to make a «validator-0_7-branch» to release/maintain the
0.7.x series of releases from -- freeing up HEAD for further work
towards 0.8.0 etc. -- but due to branchophobia among the rest of the
gang we've ditched that idea. I guess this means we'll move right into
0.8.0 development once 0.7.0 is released -- including holding back
other development while 0.7.0 is baking -- and we will probably have to
go back and make a branch from a release tag, retroactively, if it
turns out we need more releases in the 0.7.x series. The intent, as I
understand it, is that more frequent moderate releases (e.g. 0.7.0 ->
0.8.0) will obviate the need for minor releases (e.g. 0.7.0 -> 0.7.1)
in between major releases (0.7 -> 0.8).

So... Any changes that are small and necessary for 0.7.0 should go in
HEAD; anything else will have to happen outside CVS or in a dedicated
branch (e.g. «validator-m12n-charset-branch»).

>I'll also want to use checklink as a second user of the eventual module.

Sounds like a good idea, but make sure you coordinate with Ville before
touching checklink; he's the one shepherding checklink, and I know he's
done quite a bit of thinking about checklink's architecture and future
direction.

HTH,
-link

--
When I decide that the situation is unacceptable for me, I'll simply
fork the tree. I do _not_ appreciate being enlisted into anyone's holy
wars, so unless you _really_ want to go _way_ up in my personal
shitlist don't play politics in my vicinity.
                                          -- Alexander Viro on lkml
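As mentioned above, a minimal sketch of the W3C::UA idea: an
LWP::UserAgent subclass that hands documents back already decoded, so
the parsing code never sees raw octets. The package name, the method
name, and the trivial charset guess are hypothetical stand-ins for what
the real charset module would do.

    package W3C::UserAgent;    # name illustrative only

    use strict;
    use warnings;
    use base 'LWP::UserAgent';
    use Encode qw(decode);

    # Fetch a document and return its body as a decoded Perl string,
    # so everything downstream can assume a single representation.
    sub fetch_decoded {
        my ($self, $uri) = @_;
        my $res = $self->get($uri);
        die 'GET ' . $uri . ' failed: ' . $res->status_line
            unless $res->is_success;

        # In the real thing the charset module's heuristics would make
        # this decision; here we just trust the Content-Type header and
        # fall back to ISO-8859-1.
        my $charset = 'ISO-8859-1';
        $charset = $1
            if ($res->header('Content-Type') || '')
               =~ /charset\s*=\s*"?([^\s";]+)"?/i;

        return decode($charset, $res->content);
    }

    1;

A caller would then do something like
my $doc = W3C::UserAgent->new->fetch_decoded($uri); and never have to
care about the original encoding of the document again.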
Received on Tuesday, 21 September 2004 14:55:59 UTC