Re: [WMVS] Some initial thoughts on code M12N...

At 16:55 04/09/21 +0200, Terje Bless wrote:
>-----BEGIN PGP SIGNED MESSAGE-----
>Hash: SHA1
>
>Martin Duerst <duerst@w3.org> wrote:
>
> >Hello Terje,
> >Some comments on an older message of yours:
>
>Thanks!
>
>
> >Moving towards modularization is a good idea, I think.
>
>I suspect that's about the limit of what we all agree on as yet. :-)
>
>In this and in all the below, be aware that there is no absolute consensus
>reached as yet and in some things I am in the minority when it comes to the
>specifics. Some issues are sufficiently contested to fall into the
>"controversial" category, and others are merely undecided as yet.
>
>I'll try to reflect the consensus below, but since it's intermixed with
>subjective statements I may not always succeed.

okay.


> >Having that stuff in a separate location would be okay by me. But I
> >think that I would want to move gradually from reorganizing the code a
> >bit more inside 'check', to working out data structures inside 'check',
> >to having packages inside 'check', to actually moving the code into its
> >own file/module. At least that's what I think would work best for the
> >charset detection/conversion part, and for me. Different strategies may
> >work better for other parts of the code, and for other people.
>
>Right, charset stuff may have specific issues that make one model more
>suitable than others. The above is in the general context...
>
>
>The idea is that -- and let me just use the charset stuff as an example
>here -- you start to build an external, standalone, module that reimplements
>all the functionality we need for the Validator. This module is brought to
>about Beta quality and ideally is generic enough to be released standalone
>on CPAN.
>
>While this happens, the charset code in WMVS stays more or less the same. If
>the development process of the module reveals evolutionary improvements that
>can be made in terms of datastructures -- or error handling, or whatever -- in
>the current code, then that part of it gets implemented as makes sense in
>light of current release plans and such.
>
>Only when the CPAN module is considered mature enough, and release plans for
>WMVS suggest it, do we rip out all the charset code in 'check' and replace it
>with new code that uses the CPAN module.

I understand that idea. But for the moment, all I can come up with
are incremental improvements to WMVS anyway. Without sorting things
out a bit better in the current code, it's very difficult to see what
the final interface and functionality for the module should look like.
Also, there is a strong interaction with 'templatization'. A lot of
taking the functionality apart is about sorting out actual functionality
and error reporting. And that's exactly what we need for templatization,
too.


>This has several advantages. One of which is that it forces us to treat the
>module as a 'Black Box' which tends to improve the interface and make it more
>general

Well, you can cheat on an external module, too, in particular in Perl,
and you can have a very clean internal module (once you know how it
should look!).


>and encourages better documentation of the interface. Another is that
>it doesn't destabilize the WMVS code while the CPAN module is being developed
>-- which is important as any given module, in this case the charset stuff,
>will be one among several such modules -- and avoids multiple parallel
>developments of M12N features from contaminating each other.

I definitely don't want to do anything that interferes with operation,
or other developments.


>The downsides will be there too, of course, but IMO they are not all that bad.

The biggest downside, in my case, is that starting in a vacuum before
having a chance to sort out a few issues in actual existing code
will mean that I'm just programming out in the blue, which will
not be very productive.


>The main downside I can see is that it burdens the developer with dealing with
>two sets of code for the same task until we make the switch,

Yes. At some stages, that may be necessary. But it is more difficult
for people like me, who do stuff only occasionally, than for others
who are more regularly involved.


>that it forces
>discipline in determining what improvements go in immediately and what will
>have to wait until we're ready to make the switch, and finally, that it takes
>longer before we can take advantage of the new features in WMVS.

Just a thought: In some way, I see "doing branches without branches".


>That there is overhead associated with developing and maintaining a CPAN
>module instead of just using an internal, application-specific, module I do
>not count among the downsides; simply because I think it is desirable to
>produce reusable code modules that third parties can take advantage of.

I agree with that. Having a CPAN module in the end is the right thing.
But it may not be the point to start with.


> >I think the first steps towards a module would be:
> >- Move (in terms of location inside 'check') the charset
> >  detection/conversion/checking code together to one place.
> >- Move (in terms of when it's called) production of error/warning
> >  messages towards the end of the whole detection/conversion/checking
> >  code.
> >- Narrow the interface, by passing a smaller set of parameters
> >  rather than just $File.
>
>Yes, this sounds like a good approach to internal refactoring in preparation
>to actually switch to a M12N codebase. It's the sort of thing I've been doing
>-- with variable success :-) -- over the last 5+ releases, and which we, IMO,
>should definitely continue to do on the road ahead.

Okay, very good.
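
As a rough illustration of the last two steps: the sketch below is Python rather than the validator's Perl, and `CharsetResult` and `decode_document` are hypothetical names, not anything that exists in 'check'. The function takes only the raw bytes and a charset label instead of the whole $File, and it collects messages for the caller to report at the end instead of emitting them on the spot.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Tuple


@dataclass
class CharsetResult:
    """Outcome of the charset step; messages are reported later by the caller."""
    charset: Optional[str] = None
    text: Optional[str] = None
    messages: List[Tuple[str, str]] = field(default_factory=list)


def decode_document(raw: bytes, charset: Optional[str]) -> CharsetResult:
    """Decode `raw` using `charset`, recording problems instead of printing them."""
    result = CharsetResult(charset=charset)
    if not charset:
        result.messages.append(("error", "no character encoding found"))
        return result
    try:
        result.text = raw.decode(charset)
    except LookupError:
        # Python raises LookupError for an encoding name it does not know.
        result.messages.append(("error", f"unknown encoding: {charset}"))
    except UnicodeDecodeError as exc:
        result.messages.append(("error", f"invalid byte at offset {exc.start}"))
    return result
```

Separating the result from the reporting in this way is also what templatization would need, since the template only has to walk `messages`.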


> >>* XML::Charset
> >>- Detect charset of XML docs using the algorithm from Appendix F.
> >
> >As I said, that's probably too small for a module.
>
>I disagree. As an internal module it's overkill and should probably just be a
>utility function somewhere. But in the context of general reusable code, it
>makes perfect sense as a standalone module. It's the type of thing that would
>get reused in other modules as well as applications; and would prevent people
>reimplementing this again and again.

I think it shouldn't be a module of its own. But it may well be
exposed as one function of a module. Would that work for you?
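
To show how small that one function could be, here is a minimal sketch of the Appendix F idea in Python (not the validator's Perl). The name `sniff_xml_encoding` and the returned labels are my own assumptions, and the unusual UCS-4 byte orders from the appendix are left out. It checks for a byte-order mark, then for the byte patterns of a leading "<?", and otherwise reads the encoding declaration or falls back to UTF-8.

```python
import re

# Byte-order marks; 4-byte marks come first so UTF-32 is not mistaken for UTF-16.
_BOMS = [
    (b"\x00\x00\xfe\xff", "UTF-32BE"),
    (b"\xff\xfe\x00\x00", "UTF-32LE"),
    (b"\xef\xbb\xbf", "UTF-8"),
    (b"\xfe\xff", "UTF-16BE"),
    (b"\xff\xfe", "UTF-16LE"),
]

# How the start of "<?xml" looks in various encodings when there is no BOM.
_PREFIXES = [
    (b"\x00\x00\x00\x3c", "UTF-32BE"),
    (b"\x3c\x00\x00\x00", "UTF-32LE"),
    (b"\x00\x3c\x00\x3f", "UTF-16BE"),
    (b"\x3c\x00\x3f\x00", "UTF-16LE"),
    (b"\x4c\x6f\xa7\x94", "EBCDIC"),
]

# Encoding declaration in an ASCII-compatible XML declaration.
_DECL = re.compile(
    rb"""^<\?xml[^>]*?encoding\s*=\s*["']([A-Za-z][A-Za-z0-9._-]*)["']"""
)


def sniff_xml_encoding(data: bytes) -> str:
    """Guess the encoding (or encoding family) of an XML document from its first bytes."""
    for bom, name in _BOMS:
        if data.startswith(bom):
            return name
    for prefix, name in _PREFIXES:
        if data.startswith(prefix):
            return name
    match = _DECL.match(data)
    if match:
        return match.group(1).decode("ascii")
    # No BOM, no recognizable pattern, no declaration: XML defaults to UTF-8.
    return "UTF-8"
```

For the UTF-16, UTF-32 and EBCDIC cases the result is only a family; a real implementation would still have to decode the declaration in that family and read the declared encoding.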


> >HTTP::Charset is even smaller and more boring than XML::Charset: just
> >look at the content type.
>
>Don't be fooled by the off-the-cuff name of the module; our charset code does
>a _lot_ more than just look at the Content-Type. Maybe a better name would be
>"HTTP::Charset::Heuristic", which would do all the charset determination rules
>we use in "check" today, plus have options to allow, e.g., a <meta> element to
>override the Content-Type (which we don't currently do) etc.

As Bjoern has already confirmed, this shouldn't have to do with HTTP.
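
A transport-neutral version could look roughly like the Python sketch below. Everything in it is an assumption for illustration: the name `pick_charset`, the source labels, and the default order are mine and do not describe what 'check' does today. The caller supplies whatever charset labels it found, from HTTP or anywhere else, and the order is an option, so letting a <meta> element override the transport is just a different `order`.

```python
# Hypothetical default precedence: transport-level label first, then the XML
# declaration, then a <meta> element.
DEFAULT_ORDER = ("transport", "xml_declaration", "meta")


def pick_charset(candidates, order=DEFAULT_ORDER):
    """Return (charset, source) for the first source in `order` that has a value.

    `candidates` maps a source name to the charset label found there (or None).
    Returns (None, None) when no source supplied a label.
    """
    for source in order:
        value = candidates.get(source)
        if value:
            return value.strip().lower(), source
    return None, None
```

For example, `pick_charset(found, order=("meta", "transport"))` would give the <meta> element priority without the function knowing anything about HTTP.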



> >And yes, a 'DWIM' module is indeed what we should end up with, because
> >that's what others will be able to use.
>
>The problem with a DWIM module is that it often fails to get the "WIM" bit but
>forces the "D" part; a lower level interface makes sure people who know better
>than us (our code) what they want are able to do it without the module getting
>in the way.
>
>So DWIMmery is good; but all things in moderation. :-)

Of course a module should have options and the like.


> >'charlint' isn't a product of the I18N WG, it's just written by me. It's
> >hopelessly slow because it takes about 10' to load in the Unicode
> >Character Database file, but if we manage to have this 'preloaded' in
> >memory, that problem will go away.
>
>Let's look at how much of Charlint makes sense to implement in WMVS. Charset
>issues are _hard_; the more of them we can flag and help users with the
>better.

Yes. I think we should put this off for a later stage.


> >Looks like for you, charset handling would rather go into 'fetch' than
> >into 'parse'?
>
>Yes. But possibly this would happen by having W3C::UA subclass LWP::UA _and_
>W3C::Charset. The reason for this is that it makes sure when you get returned
>a document it'll be in UTF-8 and ready to be parsed further.
>Discovering/Determining metadata and preprocessing the document does not
>belong in the 'parsing' portion of the code (IMO, YMMV, etc., etc.).

Would be fine by me.


> >BTW, I guess you mean 'transcoding' rather than 'transliteration'.
>
>Remind me... What's the definition of each of those again?

Changing from one (character) encoding to another: transcoding.

Writing something in a different script than the original one
(e.g. Latin instead of Cyrillic): transliteration.
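
A small Python sketch of the difference; the function names and the three-letter mapping table are made up for the example, and a real transliteration table would be much larger. Transcoding changes the bytes but not the characters; transliteration changes the characters themselves.

```python
def transcode(data: bytes, source: str, target: str = "utf-8") -> bytes:
    """Same characters, different byte representation."""
    return data.decode(source).encode(target)


# Toy table covering only the letters of the example word.
CYRILLIC_TO_LATIN = {"М": "M", "и": "i", "р": "r"}


def transliterate(text: str, table=CYRILLIC_TO_LATIN) -> str:
    """Different characters: rewrite the text in another script."""
    return "".join(table.get(ch, ch) for ch in text)


# "Мир" as ISO-8859-5 bytes.
iso_bytes = b"\xbc\xd8\xe0"
utf8_bytes = transcode(iso_bytes, "iso-8859-5")   # still the word "Мир"
latin_text = transliterate(utf8_bytes.decode("utf-8"))  # now "Mir"
```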


> >But I feel ready for some coding if somebody tells
> >me where that should happen (which CVS branch,...).
>
>Currently, the "validator-0_6_0-branch" is in maintenance mode; no new stuff
>goes in there apart from any minor bugfixes, and even they are questionable
>(there may never be another 0.6.x release).

Okay, ditched the 0_6_0 checkout.


>The current code in HEAD is on its way to becoming a 0.7.0 release. I'd
>intended to make a "validator-0_7-branch" to release/maintain the 0.7.x series
>of releases from -- freeing up HEAD for further work towards 0.8.0 etc. -- but
>due to branchofobia among the rest of the gang we've ditched that idea.
>
>I guess this means we'll move right into 0.8.0 development once 0.7.0 is
>released -- including holding back other development while 0.7.0 is baking
>-- and will probably have to go back and make a branch from a release tag,
>retroactively, if it turns out we need more releases in the 0.7.x series.
>
>The intent, as I understand it, is that more frequent moderate releases (e.g.
>0.7.0->0.8.0) will obviate the need for minor releases (e.g. 0.7.0->0.7.1) in
>between major releases (0.7->0.8).
>
>
>So...
>
>Any changes that are small and necessary for 0.7.0 should go in HEAD; anything
>else will have to happen outside CVS or in a dedicated branch (e.g.
>"validator-m12n-charset-branch").

Okay. Am I correct that moving as much of the message text as possible
out of 'check' is part of the 0.7.0 release?


Regards,    Martin.

Received on Wednesday, 22 September 2004 01:00:37 UTC