[WMVS] Some initial thoughts on code M12N...

Hi all,

Since we seem to be at a stable point in the release branch, and HEAD is
merged and more-or-less ready for 0.7 to begin, we should probably begin doing
a little planning for Modularizing the Markup Validator. I know Björn has been
doing quite a bit of thinking on this subject that I hope he will share with
us; but to get the ball rolling, here are a few random thoughts from me on
where I think we should be going and how to get there.


Chapter 1: The Characters

Currently the code — modulo supporting files like templates and configuration
files — is in one monolithic block; all the code exists within «check» in a
single namespace. It's been reasonably well «internally modularized» — if
you'll forgive me for inventing terminology :-) — by moving things into
subroutines and having subroutines treated as pseudo-objects, but I think
we've pretty much reached the limits of what is possible with the current
architecture (and I use that term loosely).

I think the only sane way forward is to split the lump of functionality into
self-contained modules — generic enough to be released on CPAN — of which the
Markup Validator will only be one consumer. This allows core/backend
development to be decoupled from front-end development, and will also make it
far easier to produce multiple front-ends (e.g. commandline, native GUI,
WebService, etc.) and pluggable backends (e.g. multiple XML Processors).

Especially that last point — pluggable backends — makes me believe that most,
if not all, of the modules should be OO rather than just a bunch of externalized
subroutines. The current code is somewhat prepared for this by passing data
and context with all subroutine calls (i.e. the equivalent of the implicit
object in a postulated OO module).
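
To make that concrete, here is a rough sketch — all names are hypothetical,
nothing that exists in «check» today — of how one of the current
context-passing pseudo-objects could turn into a real object:

    # Today (roughly): shared state is passed explicitly to every sub,
    #   e.g.  add_warning($File, 'some message');
    # After M12N (hypothetical names): the context *is* the object.
    package W3C::Markup::Validator::Document;
    use strict;

    sub new {
        my ($class, %args) = @_;
        my $self = { uri => $args{uri}, warnings => [] };
        return bless $self, $class;
    }

    sub add_warning {
        my ($self, $message) = @_;
        push @{ $self->{warnings} }, $message;
    }

    sub warnings { @{ $_[0]->{warnings} } }

    1;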


I think all of the above is uncontroversial judging by our discussions of this
in the bi-weekly IRC meetings, non?


Chapter 2: The Branch Situation

Originally I was set on doing this using CVS branches, but since I'm
registering overwhelming scepticism towards branches — :-) — let me lay out
the whys and the hows...

Breaking out the subroutines means modifying the majority of the code. In
0.6.7 we have about 950 lines of main code and 2500 lines of subroutines; in
HEAD the numbers are about 850 for main and 2000 for subs. The number of calls
from main into a subroutine is fairly large. In other words, even in the best
case scenario this stuff is intrusive and disruptive.

This means, IMO, that we have three plausible ways to get there. We can try to
do this on HEAD — without using any branches — and push any 0.7 release back
until we're stable again; we can do this on a single parallel branch that will
eventually replace — instead of merge with — the HEAD/0.7 code; or we can
split out m12n completely, in a separate CVS module even.

I think the first option will clearly cause too long a period of instability
where we can't make any new releases (modulo extremely small bugfixes back in
the 0.6.x branch). The second option has similar problems while retaining the
problematic branch usage.

So I'm thinking that the smart way to go about this is to start building the
modules as a completely separate deliverable from the main Validator code.

We set up «W3C::Markup::Validator::*» in «$CVSROOT/perl/modules/W3C/…» and
build a complete set of backend modules there; completely separated from main
Validator code. In this CVS module we have a copy of «check» that makes use of
the modules instead of internal subs, just to provide smoketest functionality
(in addition to a «t/*» directory of tests, hopefully!).
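
For that «t/*» directory, even something this small would be a useful first
smoketest (the module name being the hypothetical one above):

    use strict;
    use Test::More tests => 2;

    # Does the postulated top-level module even compile and load?
    BEGIN { use_ok('W3C::Markup::Validator') }

    # Can we construct the basic object?
    my $val = W3C::Markup::Validator->new;
    isa_ok($val, 'W3C::Markup::Validator');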

Once that is in a state we're more or less happy with, we install those
modules on v.w3.org — or qa-dev, or whatever — and start modifying the
mainline Markup Validator to make use of the modules instead of its own
internal subroutines. Possibly all in one fell swoop, but possibly little by
little as we become comfortable with deploying the modules in production.

This switchover can happen on a single branch, or on a side-branch created
specifically for the purpose, depending on what makes sense and
what people are comfortable with at that point. My own preference would be to
integrate the M12N version on HEAD while we're busy stabilizing a 0.7 or 0.8
release on a release branch, but that's a different topic.


Chapter 3: Where The Devil Resides

As for implementation details, I have some loose thoughts accumulated over the
last couple of years that only slightly reflect the input from Björn and Nick
on IRC. This would benefit from your input before we get to anything solid to
work by, and will probably need to be modified once we actually start
implementing it.


My thought has always been that we should start by modularizing the code and
featureset we already have, rather than adding significant new features in the
M12N code. E.g. we should modularize the SGML Parser backend — with both
OpenSP and jjc's SP implemented as a proof of concept (maybe) — without adding
any real XML Processors; this is to avoid changing too much and getting into
stabilization and integration issues.

I may have to revise that in light of input from Björn — as his ideas may make
the issue moot for at least the Parser backend and interface — but I'll stick
to that as a baseline for now.

And the main implementation decision is what to do with the SGML Parser.

A while back I hacked up a Perl module to interface with OpenSP using C++ and
XS calling into OpenSP's Generic API (with much help from Nick and a few other
kind folks). This is out on CPAN — «SGML::Parser::OpenSP» — and the code is
currently published on SF.net <http://sf.net/projects/spo/>. The code is not
in a state where it's actually usable for anything, but recently Björn has
been doing some hacking on this and seems to have found a way to save it.

In either case, I think this is a good place to start the process: build
a workable version of this module — at least to the point of replicating the
parts of «onsgmls» that we actually use — and design the backend parser API on
top of that.

Since our current use of «onsgmls» is fairly simple — grotty complicated code,
but very simple use of OpenSP — we could even conceivably adopt
SGML::Parser::OpenSP to replace the calls to «onsgmls» in the production
Markup Validator without waiting for any kind of complete M12N to happen;
about the only requirement would be to be comfortable that the S::P::O code is
stable and without major bugs (and feature complete for the bits we need).

This would have a beneficial impact on performance and memory use on v.w.o,
and would give us one way of integrating Sean's «sgmlid» code fairly quickly.
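
To illustrate what that minimal drop-in use might look like — the handler
style and the method names here are guesses at where S::P::O could end up, not
its current state, and «report_error» stands in for whatever reporting path
«check» uses — something along these lines:

    #!/usr/bin/perl -w
    use strict;

    # Tiny handler that just collects error events as they come in.
    package MyErrorHandler;
    sub new    { bless { errors => [] }, shift }
    sub error  { my ($self, $err) = @_; push @{ $self->{errors} }, $err }
    sub errors { @{ $_[0]->{errors} } }

    package main;
    use SGML::Parser::OpenSP;   # interface guessed; the module is still in flux

    my $handler = MyErrorHandler->new;
    my $parser  = SGML::Parser::OpenSP->new;
    $parser->handler($handler);                # assuming a handler-object style
    $parser->parse('/tmp/fetched-document.html');

    # Feed each collected error into the same reporting path that today
    # scrapes the «onsgmls» output line by line (errors treated as plain
    # strings for the purpose of this sketch).
    print "$_\n" for $handler->errors;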

OTOH, this would probably only be one (side)step on the way to where we really
want to be; something which involves at the very least a more complete API to
OpenSP, and possibly a full-fledged backend processor API along the lines
Björn has been working on.

Similar approaches can probably be taken with several functional parts — the
XML Appendix F charset detection code, say — if we choose to, but there will
probably still be a goodly chunk of the code that needs integrating wholesale
instead of piecemeal. That chunk will probably coincide with the
Validator-specific code that is a poor fit for releasing on CPAN.


Apart from the SGML Parser — which is a whole project in itself, even
discounting the addition of XML Processors, the API «Architecting», and any
addon features this makes possible (e.g. attribute checking, Appendix C
checking, etc.) — here are a few bits that seem to be good candidates for M12N
and release on CPAN (not necessarily as the standalone modules suggested below):

* XML::Charset
  - Detect charset of XML docs using the algorithm from Appendix F
    (see the rough sketch after this list).

* IP::ACL
  - Determine whether an IP is «allowed» based on configured criteria.

* HTTP::Auth::Proxy (or CGI::Auth::Proxy)
  - Proxy HTTP Auth requests.

* HTML::Wrap
  - Markup-aware line-wrapping (maybe, probably not).

* HTML::DOCTYPE(::Munge)
  - Some parts of our DOCTYPE munging code in the detect/override subs.

* HTML::Outline
  - Our «outline» code.

* HTML::Parsetree
  - Iff we improve the parsetree code before we get started on M12N,
    it would probably be a useful candidate for CPAN.

* HTTP::Charset
  - Companion to XML::Charset? Our overall charset determination code
    to find final charset for a doc according to all the relevant specs.
    Could also be wrapped up in a DWIM module to get you UTF-8 from
    arbitrary input, but obeying all the relevant specs?

* «Foo»::Charlint
  - Our Charlint subset, or a variant of I18N WG Charlint code?
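
As an illustration of how small the core of an «XML::Charset» could be, here
is a rough sketch of the first-pass byte sniffing from Appendix F (the module
name and interface are just the suggestion above; the UCS-4 cases are glossed
over, and the encoding declaration still has to be read to refine the guess):

    # Hypothetical XML::Charset from the list above: first-pass encoding
    # detection from the initial four octets, per XML 1.0 Appendix F.
    package XML::Charset;
    use strict;

    sub guess {
        my ($bytes) = @_;
        my $start = substr($bytes, 0, 4);

        return 'UTF-8'    if $start =~ /^\xEF\xBB\xBF/;     # UTF-8 BOM
        return 'UTF-16BE' if $start =~ /^\xFE\xFF/;         # UTF-16 BOM, big-endian
        return 'UTF-16LE' if $start =~ /^\xFF\xFE/;         # UTF-16 BOM, little-endian
        return 'UTF-8'    if $start eq "\x3C\x3F\x78\x6D";  # '<?xm', ASCII family
        return 'UTF-16BE' if $start eq "\x00\x3C\x00\x3F";  # '<?' in UTF-16BE
        return 'UTF-16LE' if $start eq "\x3C\x00\x3F\x00";  # '<?' in UTF-16LE
        return 'EBCDIC'   if $start eq "\x4C\x6F\xA7\x94";  # '<?xm' in EBCDIC
        return 'UTF-8';                                     # fallback per Appendix F
    }

    1;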


Some of these are probably dead ends, and there are probably other bits that
are good candidates; and there is of course much code here that needs to be
internal (not released to CPAN, as such) modules.


Chapter 4: Deeper and Deeper

In the end, I want a minimal implementation — for instance the one distributed
with the module, or an example in the module's «eg/*» directory — to look like
this:

    % perl
    use W3C::Markup::Validator;
    
    my $val = new W3C::Markup::Validator;
    $val->validate('http://www.example.com/')
      or die 'Validating ' . $val->uri->as_string . " failed\n";
    print $val->results;
    ^D
    http://www.example.com/ is not Valid:
    Line 1, column 0: End of document in prolog.
    %

For the code deployed on v.w3.org, we'll probably want to use a lower level
interface. e.g.

    #!/usr/bin/perl -Tw
    
    use W3C::MWS;
    use W3C::UA;
    
    my $p = W3C::MWS->new(
                          Parser => 'OpenSP',
                          Mode   => 'XML',
                         );
    
    my $ua = new W3C::UA;
    
    my $r = $p->parse($ua->fetch('http://www.example.com/'));
    
    if ($r->status->is_ok) {
      &report_meta($r->info); # Report metadata; the URI, DTD, charset...
      &report_valid($r); # Or $r->report_valid? Maybe...
    } else {
      &report_errors($r->errors->as_string);
      &report_invalid;
    }
…

etc.; with as much as possible of the niggling details contained down in the
modules. e.g. W3C::UA — our LWP::UserAgent wrapper class — will handle
auth-proxying transparently, and maintain redirect info; possibly also do
charset detection, checking, and transliteration transparently.
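
A first, very thin cut of that wrapper could look something like this (the
auth-proxy and charset pieces are elided, and the extra methods are just the
names postulated above, nothing that exists yet):

    # Hypothetical W3C::UA: a thin wrapper around LWP::UserAgent.
    package W3C::UA;
    use strict;
    use base 'LWP::UserAgent';

    sub new {
        my ($class, %args) = @_;
        my $self = $class->SUPER::new(%args);
        $self->agent('W3C_Validator-M12N ');            # identify ourselves
        $self->protocols_allowed(['http', 'https']);    # no file:// surprises
        return $self;
    }

    # Remember every redirect hop so the front-end can report both the
    # requested URI and the one that actually got validated.
    sub redirect_ok {
        my $self = shift;
        my ($request) = @_;
        push @{ $self->{_w3c_redirects} ||= [] }, $request->uri->as_string;
        return $self->SUPER::redirect_ok(@_);
    }

    sub redirects { @{ $_[0]->{_w3c_redirects} || [] } }

    1;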


While looking at this top-down, I see the OpenSP interface mainly providing
the ESIS, the list of errors, and possibly a DOM-ish datastructure constructed
internally from the ESIS or the parsetree (or both) — including *->as_string
type functions — but this could possibly be by way of lower-level subclasses
that speak some kind of SAX or DOM, and which can use both OpenSP and one or
more XML::* modules (which I think is the direction Björn is taking here).


Chapter 5: Stay Mobile!

One key reason for the extra levels of abstraction (extra layers of
subclasses) here is that the actual code in what is «check» today gets reduced
to as little as possible. Ideally this should be so small that we can maintain
«check.cgi», «check.fcgi», «check.pm» (mod_perl), «check.php», «check.asp»,
«check.pl» (commandline), «check.exe» (native GUI), etc., all with _only_ the
code specific to their function, and with config options to tweak the output
generation (i.e. select PHP templates instead of HTML::Template templates for
the PHP version).

This lets us hit more targets in the various CGI and CGI-alike environments;
and it allows any GUI versions to focus on the GUI parts of it instead of
getting bogged down in the gray area towards the core code.
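
Just to show how thin such a front-end could get, a commandline «check.pl»
might boil down to little more than this (written against the hypothetical
W3C::Markup::Validator interface from Chapter 4; «is_valid» is yet another
made-up method):

    #!/usr/bin/perl -w
    use strict;
    use W3C::Markup::Validator;   # the high-level module postulated in Chapter 4

    my $uri = shift @ARGV
        or die "Usage: check.pl <URI>\n";

    my $val = W3C::Markup::Validator->new;
    $val->validate($uri)
        or die 'Validating ' . $val->uri->as_string . " failed\n";

    # All the real work happens down in the modules; the front-end only
    # decides how to present the results.
    print $val->results;
    exit($val->is_valid ? 0 : 1);   # is_valid: hypothetical convenience method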


Also, more levels of abstraction make it easier to do stuff like insert
Appendix C and attribute checking (or even XML Schema) into the chain without
affecting the front-end code. The tradeoff will be complexity, and possibly
some cost in performance and API flexibility.



Chapter 6: Stumbling Into Apotheosis

So how to get started on this...

Well, first I need to see y'all's feedback on this stuff. The above is
somewhat of a braindump so most of it is likely to see some level of change as
soon as I hear from you. And since I think we would benefit from as much
design-upfront/code-to-spec development as we can be bothered with here, we
should probably try to hash out as much as possible on the list before we try
to nail down any specific development plans.

IOW, we should probably arrive at a list of some of the main modules, as well
as pseudo-POD for them, before we start coding.
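
To make «pseudo-POD» concrete, something as small as this — for one of the
candidate modules from Chapter 3 — would already do the job:

    =head1 NAME

    XML::Charset - detect the character encoding of XML documents

    =head1 SYNOPSIS

      use XML::Charset;
      my $charset = XML::Charset::guess($first_bytes);

    =head1 DESCRIPTION

    Implements the encoding detection algorithm from Appendix F of the
    XML 1.0 Recommendation: BOM sniffing, first-four-octet detection,
    and reading the encoding declaration.

    =head1 TODO

    Everything. This is pseudo-POD.

    =cut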

Once we have that we can get started; and I think we should probably do it in
a two-pronged fashion.

First off we should work on SGML::Parser::OpenSP — or possibly whatever
subclass will sit in front of it (and an XML Processor) — separately. Maybe on
SF.net where S::P::O lives; possibly migrated to dev.w3.org and qa-dev.

Second, we should create «$CVSROOT/perl/modules/W3C/MVS/» and begin populating
it. Probably it will make sense to start populating it top-down to get a
framework, and then add in other bits standalone for integration when we get
our feet under us. We could go the other way, but I fear that'd lead to
fragmentation that we'd fight with later on.


But most importantly, I think we should be running normal development on 0.7
(and maybe even 0.8 and 0.9) in parallel with this effort. I do not want to
stop mainline development for production while we try to implement any big
radical changes to the architecture.

Yes, this will likely mean M12N won't bear fruit until later than it could
have if we gave it absolute priority.


Chapter 7: All Good Things…

Version Roadmap:

0.7.0 — Current HEAD. Has Template code and new Config facility.

0.8.0 — Make use of new Config stuff to add warnings about mismatches
        or suboptimal usage of DOCTYPEs, Content-Types, charsets etc.

        Possible target for landing minimal SGML::Parser::OpenSP ?
        Possible target for landing minimal I18N facility ?

0.9.0 — Refinement of 0.8 features. What else?

        Possible target for landing minimal SGML::Parser::OpenSP ?
        Possible target for landing minimal I18N facility ?

1.0.0 — Landing M12N (of some level; maybe not complete)?

1.1.0 — Addition of stuff like Attribute checks, Appendix C?

1.2.0 — Addition of minimal Schema Validation?


2.0.0 — Full Schema Validation? Replaces DTD Validation?
        Where are we time-wise at this point? What's happened in the
        world up to this point?


Maybe something that looks completely different... :-)

In any case. I think 1.0 is a good place to land M12N; or rather, I think when
we land M12N is a good place to declare status 1.0.

I also think we have stuff that fits into 0.7/0.8/0.9 that we should aim to
do. Possibly this should include making use of S::P::O — in the minimalistic
«just replace the fork() of “onsgmls”» sense — for performance and a clean way
to integrate Sean's «sgmlid» code (the latter because it fits in nicely with
the other new checking considered for 0.7-0.9).

The first step to that is to get 0.7.0 cleaned up, stabilized, and released.
We can probably do without a lot of feature additions for 0.7.0.


This gives us a window from 0.7.0 through 0.9.x for preparing the M12N code in
parallel with production.


__END__


As mentioned, this is a braindump. I hope you will all chime in with your
thoughts on this; Björn especially has been shooting his mouth off, so I'll
expect a good long dissertation from him! :-)

Everyone else please do speak up at whatever length and level of detail you
have time for! A «Sounds good» is ok; a «that's stoopid» comment probably
requires a little more explanation; but do please speak up!


I'm not currently married to any of the stuff above, but it is the result of
thinking about this for quite some time, so I may take some convincing to go
with a different approach (unless you code it; he who writes the code makes
the rules!).





