Re: [WMVS] Some initial thoughts on code M12N...

Hello Terje,

Some comments on an older message of yours:

At 19:21 04/08/09 +0200, Terje Bless wrote:

>-----BEGIN PGP SIGNED MESSAGE-----
>Hash: SHA1
>
>Hi all,
>
>Since we seem to be at a stable point in the release branch, and HEAD is
>merged and more-or-less ready for 0.7 to begin, we should probably begin doing
>a little planning for Modularizing the Markup Validator. I know Bjoern has been
>doing quite a bit of thinking on this subject that I hope he will share with
>us; but to get the ball rolling, here are a few random thoughts from me on
>where I think we should be going and how to get there.
>
>
>Chapter 1: The Characters
>
>Currently the code - modulo supporting files like templates and configuration
>files - is in one monolithic block; all the code exists within 'check' in a
>single namespace. It's been reasonably well "internally modularized" - if
>you'll forgive me for inventing terminology :-) - by moving things into
>subroutines and having subroutines treated as pseudo-objects, but I think
>we've pretty much reached the limits of what is possible with the current
>architecture (and I use that term loosely).
>
>I think the only sane way forward is to split the lump of functionality into
>self-contained modules - generic enough to be released on CPAN - of which the
>Markup Validator will only be one consumer. This allows core/backend
>development to be decoupled from front-end development, and will also make it
>far easier to produce multiple front-ends (e.g. commandline, native GUI,
>WebService, etc.) and pluggable backends (e.g. multiple XML Processors).
>
>Especially that last point - pluggable backends - makes me believe that most,
>if not all, the modules should be OO rather than just a bunch of externalized
>subroutines. The current code is somewhat prepared for this by passing data
>and context with all subroutine calls (i.e. the equivalent of the implicit
>object in a postulated OO module).
>
>
>I think all of the above is uncontroversial judging by our discussions of this
>in the bi-weekly IRC meetings, non?

Moving towards modularization is a good idea, I think.


>Chapter 2: The Branch Situation
>
>Originally I was set on doing this by using CVS's branches, but since I
>register overwhelming scepticism to branches - :-) - let me lay out the whys
>and the hows...
>
>Breaking out the subroutines means modifying the majority of the code. In
>0.6.7 we have about 950 lines of main code and 2500 lines of subroutines; in
>HEAD the numbers are about 850 for main and 2000 for subs. The number of calls
>from main into a subroutine is fairly large. In other words, even in the best
>case scenario this stuff is intrusive and disruptive.
>
>This means, IMO, that we have three plausible ways to get there. We can try to
>do this on HEAD - without using any branches - and push back any 0.7 release
>until we're stable again; we can do this on a single parallel branch that will
>eventually replace - instead of merge with - the HEAD/0.7 code; or we can
>split out m12n completely, in a separate CVS module even.
>
>I think the first option will clearly cause too long a period of instability
>where we can't make any new releases (modulo extremely small bugfixes back in
>the 0.6.x branch). The second option has similar problems while retaining the
>problematic branch usage.
>
>So I'm thinking that the smart way to go about this is to start building the
>modules as a completely separate deliverable from the main Validator code.

Having that stuff in a separate location would be okay by me.
But I think that I would want to move gradually from reorganizing
the code a bit more inside 'check', to working out data structures
inside 'check', to having packages inside 'check', to actually moving
the code into its own file/module. At least that's what I think would
work best for the charset detection/conversion part, and for me.
Different strategies may work better for other parts of the code,
and for other people.
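
To make the "packages inside 'check'" step concrete, here is a rough sketch;
the package name and fields are only invented for illustration:

    # Intermediate stage: a package declared inside 'check' itself, so that
    # the code can later move to its own .pm file without changing its
    # callers much.
    package W3C::Validator::Charset;   # name is just an example

    sub new {
        my ($class, %args) = @_;
        my $self = {
            Bytes       => $args{Bytes},        # raw document octets
            HTTPCharset => $args{HTTPCharset},  # charset from Content-Type, if any
        };
        return bless $self, $class;
    }

    sub http_charset { return $_[0]->{HTTPCharset} }

    package main;   # back to the rest of 'check'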


>Chapter 3: Where The Devil Resides
>
>As for implementation details I have some loose thoughts accumulated over the
>last couple of years that only slightly reflect the input from Bjoern and Nick
>on IRC. This would benefit from your input before we get to anything solid to
>work by, and will probably need to be modified once we actually start
>implementing it.
>
>
>My thought has always been that we should start by modularizing the code and
>featureset we already have, in lieu of adding significant new features in the
>M12N code.

I generally agree, but in many cases, moving the code around cannot be
done without noticing things that can be cleaned up, and holding back
on things that seem obvious to fix can be really stressful.


>Similar approaches can probably be taken with several functional parts - the
>XML Appendix F charset detection code, say - if we choose to, but there will
>probably still be a goodly chunk of the code that needs integrating wholesale
>instead of piecemeal. Probably this will coincide with Validator-specific code
>that is a poor match for releasing onto CPAN.

I think the general functionality that makes most sense to make into a module
is "here is a string of bytes, and some meta-information (HTTP charset,...),
convert this into a string of characters." Anything smaller than that might
also be exposed by such a module, but shouldn't be the main usage pattern.
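
A minimal sketch of that pattern, assuming the only meta-information is an
optional charset from HTTP (the real module would of course also look at the
BOM, the XML declaration, <meta>, and so on):

    use strict;
    use warnings;
    use Encode qw(decode);

    # "Bytes plus meta-information in, characters out" in its simplest form.
    sub bytes_to_characters {
        my ($bytes, $http_charset) = @_;
        my $charset = $http_charset || 'UTF-8';   # stand-in for real detection
        return decode($charset, $bytes, Encode::FB_CROAK);
    }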

With respect to modularization, the main places where things get difficult
for the validator are:
- The validator has a lot of different checks, errors, and warnings, at
   different steps in the process. For some part of this, the best thing
   may be to hand back a list of detected charsets, and have the validator
   main part do the comparisons and messages (but not figuring out what
   the actual encoding is, which should be done by the module, maybe with
   some configuration options). The 'list of detected charsets' would be
   close to what's currently under File->{Charset}, which I think is also
   a good start for the core data for a charset detection/conversion module.
   Probably adding a few more cases as explicit variables will help clear
   things up (e.g. making Override and Fallback separate variables rather
   than having a flag to distinguish them); see the sketch after this list.
- There are some dependencies between charset detection and media type/
   DTD detection:
   - charset detection is different for text/html, text/[foo+]xml, and so on.
   - Currently, both doctype and <meta> charset are extracted in &preparse.
     If stuff is split up, we'll do something like 'preparse' twice, but
     it might work out because getting the doctype can probably be done
     much more easily, which leaves only charset for &preparse.
   - There are some interactions between charset detection and other
     massaging/preparations of the byte stream, in particular
     &normalize_newlines. But that should work out just as part of the
     overall module functionality. More complicated is the fact that from
     some point onwards, we are working on the data as an array of lines,
     which in the general case may not be what a user is expecting as the
     result of the module.
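
To make the 'list of detected charsets' idea a bit more concrete, here is a
rough sketch of the core data, grown out of what is now under File->{Charset}
(the field names are only examples, nothing is decided):

    my $charset_info = {
        HTTP     => 'iso-8859-1',  # from the Content-Type header, if any
        BOM      => undef,         # from a byte order mark, if present
        XMLDecl  => undef,         # from <?xml ... encoding="..."?>
        Meta     => 'utf-8',       # from <meta http-equiv="Content-Type" ...>
        Override => undef,         # explicit user override, as its own field
        Fallback => 'utf-8',       # what to assume if nothing is declared
        Use      => 'utf-8',       # the encoding actually used for conversion
    };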

I think the first steps towards a module would be:
- Move (in terms of location inside 'check') the charset
   detection/conversion/checking code together to one place.
- Move (in terms of when it's called) production of error/warning messages
   towards the end of the whole detection/conversion/checking code.
- Narrow the interface, by passing a smaller set of parameters
   rather than just $File (see the sketch after this list).
- ...
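
For the narrowed interface, the call could look roughly like this (only a
sketch; the function name and the $File fields are assumptions):

    # Instead of passing the whole $File, pass only what is really needed:
    my ($characters, $charset_info) = detect_and_convert_charset(
        Bytes       => $File->{Bytes},
        HTTPCharset => $File->{Charset}{HTTP},
        ContentType => $File->{ContentType},
    );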


>Apart from the SGML Parser - which is a whole project in itself; even
>discounting the addition of XML Processors, the API "Architecting", and any
>addon features this makes possible (e.g. attribute checking, Appendix C
>checking, etc.) - a few bits that seem to be good candidates for M12N and
>release on CPAN (not necessarily as the standalone modules suggested below):
>
>* XML::Charset
>   - Detect charset of XML docs using the algorithm from Appendix F.

As I said, that's probably too small for a module.
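
For reference, the Appendix F part really is small; a simplified sketch of the
first-bytes sniffing (ignoring the UCS-4 and EBCDIC families, and the real
thing still has to read the encoding declaration afterwards):

    use strict;
    use warnings;

    # Simplified XML 1.0 Appendix F detection from the first few bytes.
    sub sniff_xml_encoding {
        my ($bytes) = @_;
        return 'UTF-8'    if $bytes =~ /^\xEF\xBB\xBF/;      # UTF-8 BOM
        return 'UTF-16BE' if $bytes =~ /^\xFE\xFF/;          # UTF-16 BOM, big-endian
        return 'UTF-16LE' if $bytes =~ /^\xFF\xFE/;          # UTF-16 BOM, little-endian
        return 'UTF-16BE' if $bytes =~ /^\x00\x3C\x00\x3F/;  # '<?' without a BOM
        return 'UTF-16LE' if $bytes =~ /^\x3C\x00\x3F\x00/;
        return 'UTF-8'    if $bytes =~ /^\x3C\x3F\x78\x6D/;  # '<?xm', ASCII-compatible
        return undef;     # nothing detected; use the declaration or default
    }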


>* IP::ACL
>   - Determine whether an IP is "allowed" based on configured criteria.
>
>* HTTP::Auth::Proxy (or CGI::Auth::Proxy)
>   - Proxy HTTP Auth requests.
>
>* HTML::Wrap
>   - Markup-aware line-wrapping (maybe, probably not).
>
>* HTML::DOCTYPE(::Munge)
>   - Some parts of our DOCTYPE munging code in the detect/override subs.
>
>* HTML::Outline
>   - Our "outline" code.
>
>* HTML::Parsetree
>   - Iff we improve the parsetree code before we get started on M12N,
>     it would probably be a useful candidate for CPAN.
>
>* HTTP::Charset
>   - Companion to XML::Charset? Our overall charset determination code
>     to find final charset for a doc according to all the relevant specs.
>     Could also be wrapped up in a DWIM module to get you UTF-8 from
>     arbitrary input, but obeying all the relevant specs?

HTTP::Charset is even smaller and more boring than XML::Charset:
just look at the content type. I think in general, it's much better
to separate HTTP download (or whatever other download) and
charset detection (with Content-Type or the 'charset' extracted
from Content-Type, or whatever corresponding info from another
protocol, and the byte stream, as inputs) into separate modules.
And yes, a 'DWIM' module is indeed what we should end up with,
because that's what others will be able to use.
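
In other words, the HTTP side boils down to something like this (a sketch; a
real implementation should respect the full Content-Type parameter syntax
rather than use a quick regular expression):

    use strict;
    use warnings;

    # Pull the charset parameter out of a Content-Type header value, e.g.
    # 'text/html; charset=ISO-8859-1' yields 'iso-8859-1'.
    sub charset_from_content_type {
        my ($content_type) = @_;
        return undef unless defined $content_type;
        if ($content_type =~ /;\s*charset\s*=\s*"?([^";\s]+)"?/i) {
            return lc $1;
        }
        return undef;
    }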


>* "Foo"::Charlint
>   - Our Charlint subset, or a variant of I18N WG Charlint code?

'charlint' isn't a product of the I18N WG, it's just written by me.
It's hopelessly slow because it takes about 10 minutes to load the
Unicode Character Database file, but if we manage to have this
'preloaded' in memory, that problem will go away.


>Chapter 4: Deeper and Deeper
>
>In the end, I want a minimal implementation - for instance the implementation
>distributed with the module, or an example in the module's "eg/*" directory -
>to look like this:
>
>     % perl
>     use W3C::Markup::Validator;
>
>     my $val = new W3C::Markup::Validator;
>     $val->validate('http://www.example.com/')
>       or die "Validating " . $val->uri->as_string . " failed!\n";
>     print $val->results;
>     ^D
>     http://www.example.com/ is not Valid:
>     Line 1, column 0: End of document in prolog.
>     %
>
>For the code deployed on v.w3.org, we'll probably want to use a lower level
>interface, e.g.:
>
>     #!/usr/bin/perl -Tw
>
>     use W3C::MWS;
>     use W3C::UA;
>
>     my $p = W3C::MWS->new(
>                           Parser => 'OpenSP',
>                           Mode   => 'XML',
>                          );
>
>     my $ua = new W3C::UA;
>
>     my $r = $p->parse($ua->fetch('http://www.example.com/'));

The DWIM charset module could be called from 'parse' or from 'fetch',
depending on whether we want to pass bytes or characters from one
to the other. I guess I'd find handling it in 'parse' more natural.
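
Just to spell out the two options against Terje's example above (method names
taken from his sketch; nothing here is decided):

    # Option A: 'fetch' already returns decoded characters.
    my $doc = $ua->fetch('http://www.example.com/');  # charset handled in the UA
    my $r_a = $p->parse($doc);

    # Option B: 'fetch' returns bytes plus metadata, and 'parse' decodes.
    my $resp = $ua->fetch('http://www.example.com/');
    my $r_b  = $p->parse($resp);                       # charset handled in parse()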


>     if ($r->status->is_ok) {
>       &report_meta($r->info); # Report metadata; the URI, DTD, charset...
>       &report_valid($r); # Or $r->report_valid? Maybe...
>     } else {
>       &report_errors($r->errors->as_string);
>       &report_invalid;
>     }
>...
>
>etc.; with as much as possible of the niggling details contained down in the
>modules. e.g. W3C::UA - our LWP::UserAgent wrapper class - will handle
>auth-proxying transparently, and maintain redirect info; possibly also do
>charset detection, checking, and transliteration transparently.

Looks like for you, charset handling would rather go into 'fetch' than
into 'parse'? BTW, I guess you mean 'transcoding' rather than 
'transliteration'.


>While looking at this top-down, I see the OpenSP interface mainly providing
>the ESIS, the list of errors, and possibly a DOM-ish datastructure constructed
>internally from the ESIS or the parsetree (or both) - including *->as_string
>type functions - but this could possibly be by way of lower-level subclasses
>that speak some kind of SAX or DOM, and which can use both OpenSP and one or
>more XML::* modules (which I think is the direction Bjoern is taking here).

Does OpenSP work on characters? Currently, we are handing it a byte stream
in UTF-8, and telling it to take it as UTF-8.
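
If it does want octets, the handoff from a charset module would just be a
re-encode, e.g. (a sketch; $parser stands in for whatever the OpenSP wrapper
ends up being called):

    use Encode qw(encode);

    # Re-encode the already-decoded character string as UTF-8 octets before
    # handing it to the parser, and tell the parser it is UTF-8.
    my $octets = encode('UTF-8', $characters);
    $parser->parse($octets);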


>Chapter 6: Stumbling Into Apotheosis
>
>So how to get started on this...
>
>Well, first I need to see y'all's feedback on this stuff. The above is
>somewhat of a braindump so most of it is likely to see some level of change as
>soon as I hear from you. And since I think we would benefit from as much
>design-upfront/code-to-spec development as we can be bothered with here, we
>should probably try to hash out as much as possible on the list before we try
>to nail down any specific development plans.
>
>IOW, we should probably arrive at a list of some of the main modules, as well
>as pseudo-POD for them, before we start coding.

As I tried to say above, at least for me personally, and for the case
of charset handling, I disagree. I have some general ideas as outlined
above, but from here onwards, my guess is that I'll make more progress
working on the actual code than doing more planning only. As said above,
your mileage may vary. But I feel ready for some coding if somebody tells
me where that should happen (which CVS branch,...). I'll also want to
use checklink as a second user of the eventual module. Having two users
for a module is always a good crosscheck that you got some of the
interface right.


Regards,    Martin.

Received on Tuesday, 21 September 2004 08:12:32 UTC