- From: Terje Bless <link@pobox.com>
- Date: Mon, 9 Aug 2004 19:21:07 +0200
- To: QA Dev <public-qa-dev@w3.org>
Hi all,

Since we seem to be at a stable point in the release branch, and HEAD is merged and more-or-less ready for 0.7 to begin, we should probably begin doing a little planning for Modularizing the Markup Validator. I know Björn has been doing quite a bit of thinking on this subject that I hope he will share with us; but to get the ball rolling, here are a few random thoughts from me on where I think we should be going and how to get there.

Chapter 1: The Characters

Currently the code — modulo supporting files like templates and configuration files — is one monolithic block; all the code lives within «check» in a single namespace. It's been reasonably well «internally modularized» — if you'll forgive me for inventing terminology :-) — by moving things into subroutines and having subroutines treated as pseudo-objects, but I think we've pretty much reached the limits of what is possible with the current architecture (and I use that term loosely).

I think the only sane way forward is to split the lump of functionality into self-contained modules — generic enough to be released on CPAN — of which the Markup Validator will be only one consumer. This decouples core/backend development from front-end development, and will also make it far easier to produce multiple front-ends (e.g. commandline, native GUI, WebService, etc.) and pluggable backends (e.g. multiple XML Processors).

Especially that last point — pluggable backends — makes me believe that most, if not all, of the modules should be OO rather than just a bunch of externalized subroutines. The current code is somewhat prepared for this by passing data and context with all subroutine calls (i.e. the equivalent of the implicit object in a postulated OO module). I think all of the above is uncontroversial judging by our discussions of this in the bi-weekly IRC meetings, non?

Chapter 2: The Branch Situation

Originally I was set on doing this by using CVS branches, but since I register overwhelming scepticism towards branches — :-) — let me lay out the whys and the hows...

Breaking out the subroutines means modifying the majority of the code. In 0.6.7 we have about 950 lines of main code and 2500 lines of subroutines; in HEAD the numbers are about 850 for main and 2000 for subs. The number of calls from main into a subroutine is fairly large. In other words, even in the best-case scenario this stuff is intrusive and disruptive.

This means, IMO, that we have three plausible ways to get there. We can try to do this on HEAD — without using any branches — and push any 0.7 release until we're stable again; we can do this on a single parallel branch that will eventually replace — instead of merge with — the HEAD/0.7 code; or we can split out m12n completely, in a separate CVS module even.

I think the first option will clearly cause too long a period of instability where we can't make any new releases (modulo extremely small bugfixes back in the 0.6.x branch). The second option has similar problems while retaining the problematic branch usage. So I'm thinking that the smart way to go about this is to start building the modules as a completely separate deliverable from the main Validator code. We set up «W3C::Markup::Validator::*» in «$CVSROOT/perl/modules/W3C/…» and build a complete set of backend modules there, completely separated from the main Validator code.
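To make the OO point from Chapter 1 concrete before going on: the kind of module I imagine living under that hierarchy would turn today's context-hash-and-subroutine pattern into instance data and methods. A minimal sketch, where all the names ($File, parse_doc, W3C::Markup::Validator::Parser) are hypothetical and purely illustrative:

    # Today, inside «check»: pseudo-objects; a context hash is
    # threaded through every subroutine call.
    my $File = { URI => $uri, Charset => $charset };
    parse_doc($File);

    # After M12N: a real object carrying the same context as
    # instance data, consumable by any front-end.
    use W3C::Markup::Validator::Parser;    # hypothetical module
    my $parser = W3C::Markup::Validator::Parser->new(charset => $charset);
    my $result = $parser->parse($uri);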
In this CVS module we have a copy of «check» that makes use of the modules instead of internal subs, just to provide smoketest functionality (in addition to a «t/*» directory of tests, hopefully!). Once that is in a state we're more or less happy with, we install those modules on v.w3.org — or qa-dev, or whatever — and start modifying the mainline Markup Validator to make use of the modules instead of its own internal subroutines. Possibly all in one fell swoop, but possibly little by little as we become comfortable with deploying the modules in production. This switchover can happen on a single branch or it can happen on a side-branch specifically for the purpose, depending on what makes sense and what people are comfortable with at that point. My own preference would be to integrate the M12N version on HEAD while we're busy stabilizing a 0.7 or 0.8 release on a release branch, but that's a different topic.

Chapter 3: Where The Devil Resides

As for implementation details, I have some loose thoughts accumulated over the last couple of years that only slightly reflect the input from Björn and Nick on IRC. This would benefit from your input before we get to anything solid to work by, and will probably need to be modified once we actually start implementing it.

My thought has always been that we should start by modularizing the code and featureset we already have, rather than adding significant new features in the M12N code. e.g. we should modularize the SGML Parser backend — with both OpenSP and jjc's SP implemented, as proof of concept (maybe) — without adding any real XML Processors; this to avoid changing too much and getting into stabilization and integration issues. I may have to revise that in light of input from Björn — as his ideas may make the issue moot for at least the Parser backend and interface — but I'll stick to that as a baseline for now.

And the main implementation decision is what to do with the SGML Parser. A while back I hacked up a Perl module to interface with OpenSP using C++ and XS calling into OpenSP's Generic API (with much help from Nick and a few other kind folks). This is out on CPAN — «SGML::Parser::OpenSP» — and the code is currently published on SF.net <http://sf.net/projects/spo/>. The code is not in a state where it's actually useable for anything, but recently Björn has been doing some hacking on this and seems to have found a way to save it.

In either case, I think this is a good place to start on the process. Build a workable version of this module — at least to the point of replicating the parts of «onsgmls» that we actually use — and design the backend parser API on top of that. Since our current use of «onsgmls» is fairly simple — grotty complicated code, but very simple use of OpenSP — we could even conceivably adopt SGML::Parser::OpenSP to replace the calls to «onsgmls» in the production Markup Validator without waiting for any kind of complete M12N to happen; about the only requirement would be to be comfortable that the S::P::O code is stable and without major bugs (and feature complete for the bits we need). This would have a beneficial impact on performance and memory use on v.w.o, and would give us one way of integrating Sean's «sgmlid» code fairly quickly. OTOH, this would probably only be one (side)step on the way to where we really want to be; something which involves at the very least a more complete API to OpenSP, and possibly a full-fledged backend processor API along the lines Björn has been working on.
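To make the «just replace the fork of onsgmls» idea concrete, here is a rough sketch of what that first step might look like. Treat the handler-object style and the method names (new, handler, parse) as assumptions rather than a settled interface; the S::P::O API is still very much in flux:

    # Sketch: collect errors via SGML::Parser::OpenSP instead of
    # forking «onsgmls» and scraping its output.
    use SGML::Parser::OpenSP;

    package ErrorCollector;
    sub new   { return bless { errors => [] }, shift }
    sub error { my ($self, $err) = @_; push @{ $self->{errors} }, $err }

    package main;
    my $parser  = SGML::Parser::OpenSP->new;
    my $handler = ErrorCollector->new;
    $parser->handler($handler);            # parse events/errors go to $handler
    $parser->parse('/path/to/document.html');
    report_errors(@{ $handler->{errors} });  # hypothetical reporting sub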
Similar approaches can probably be taken with several functional parts — the XML Appendix F charset detection code, say — if we choose to, but there will probably still be a goodly chunk of the code that needs integrating wholesale instead of piecemeal. Probably this will coincide with Validator-specific code that is a poor match for releasing onto CPAN.

Apart from the SGML Parser — which is a whole project in itself; even discounting the addition of XML Processors, the API «Architecting», and any addon features this makes possible (e.g. attribute checking, Appendix C checking, etc.) — a few bits seem to be good candidates for M12N and release on CPAN (not necessarily as the standalone modules suggested below):

* XML::Charset - Detect the charset of XML docs using the algorithm from Appendix F.
* IP::ACL - Determine whether an IP is «allowed» based on configured criteria.
* HTTP::Auth::Proxy (or CGI::Auth::Proxy) - Proxy HTTP Auth requests.
* HTML::Wrap - Markup-aware line-wrapping (maybe, probably not).
* HTML::DOCTYPE(::Munge) - Some parts of our DOCTYPE munging code in the detect/override subs.
* HTML::Outline - Our «outline» code.
* HTML::Parsetree - Iff we improve the parsetree code before we get started on M12N, it would probably be a useful candidate for CPAN.
* HTTP::Charset - Companion to XML::Charset? Our overall charset determination code to find the final charset for a doc according to all the relevant specs. Could also be wrapped up in a DWIM module to get you UTF-8 from arbitrary input, but obeying all the relevant specs?
* «Foo»::Charlint - Our Charlint subset, or a variant of the I18N WG Charlint code?

Some of these are probably dead ends, and there are probably other bits that are good candidates; and there is of course much code here that needs to be internal modules (not released to CPAN, as such).

Chapter 4: Deeper and Deeper

In the end, I want a minimal implementation — for instance the implementation distributed with the module, or as an example in the module's «eg/*» directory — to look like this:

    % perl
    use W3C::Markup::Validator;
    my $val = new W3C::Markup::Validator;
    $val->validate('http://www.example.com/')
        or die 'Validating ' . $val->uri->as_string
             . ' failed: ' . $val->errors->as_string . "\n";
    print $val->results;
    ^D
    http://www.example.com/ is not Valid:
    Line 1, column 0: End of document in prolog.
    %

For the code deployed on v.w3.org, we'll probably want to use a lower-level interface. e.g.

    #!/usr/bin/perl -Tw
    use W3C::MWS;
    use W3C::UA;

    my $p = W3C::MWS->new(
        Parser => 'OpenSP',
        Mode   => 'XML',
    );
    my $ua = new W3C::UA;
    my $r  = $p->parse($ua->fetch('http://www.example.com/'));

    if ($r->status->is_ok) {
        &report_meta($r->info);  # Report metadata; the URI, DTD, charset...
        &report_valid($r);       # Or $r->report_valid? Maybe...
    }
    else {
        &report_errors($r->errors->as_string);
        &report_invalid;
    }

… etc.; with as much as possible of the niggling details contained down in the modules. e.g. W3C::UA — our LWP::UserAgent wrapper class — will handle auth-proxying transparently, and maintain redirect info; possibly also do charset detection, checking, and transliteration transparently.
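Since charset detection comes up twice above (the XML::Charset candidate, and W3C::UA doing detection transparently), here is a rough sketch of the Appendix F byte-sniffing that XML::Charset would wrap. The function name is made up, and only the common byte patterns are shown:

    # Guess an XML document's encoding family from its first four
    # octets, per the (non-normative) Appendix F of XML 1.0.
    sub xml_sniff_charset {
        my $bytes = shift;    # first four octets of the entity
        my @magic = (
            [ "\xEF\xBB\xBF"     => 'UTF-8'    ],  # UTF-8 BOM
            [ "\x00\x3C\x00\x3F" => 'UTF-16BE' ],  # '<?' without BOM
            [ "\x3C\x00\x3F\x00" => 'UTF-16LE' ],  # '<?' without BOM
            [ "\xFE\xFF"         => 'UTF-16BE' ],  # BOM
            [ "\xFF\xFE"         => 'UTF-16LE' ],  # BOM
            [ "\x3C\x3F\x78\x6D" => 'UTF-8'    ],  # '<?xm': UTF-8 or some
        );                                         # other ASCII superset;
        for my $candidate (@magic) {               # read the encoding decl
            my ($prefix, $charset) = @$candidate;
            return $charset if index($bytes, $prefix) == 0;
        }
        return undef;  # fall back to the encoding decl or HTTP charset
    }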
While looking at this top-down, I see the OpenSP interface mainly providing the ESIS, the list of errors, and possibly a DOM-ish datastructure constructed internally from the ESIS or the parsetree (or both) — including *->as_string type functions — but this could possibly be by way of lower-level subclasses that speak some kind of SAX or DOM, and which can use both OpenSP and one or more XML::* modules (which I think is the direction Björn is taking here).

Chapter 5: Stay Mobile!

One key reason for the extra levels of abstraction (extra layers of subclasses) here is so that the actual code in what is «check» today is reduced to as little as possible. Ideally this should be so small that we can maintain «check.cgi», «check.fcgi», «check.pm» (mod_perl), «check.php», «check.asp», «check.pl» (commandline), «check.exe» (native GUI), etc. — all with _only_ the code specific to their function, and with config options to tweak the output generation (i.e. select PHP templates instead of HTML::Template templates for the PHP version). This lets us hit more targets in the various CGI and CGI-alike environments, and it allows any GUI versions to focus on the GUI parts instead of getting bogged down in the gray area towards the core code. Also, more levels of abstraction make it easier to do stuff like insert Appendix C and attribute checking (or even XML Schema) into the chain without affecting the front-end code. The tradeoff will be complexity, and possibly performance and API flexibility.

Chapter 6: Stumbling Into Apotheosis

So how to get started on this... Well, first I need to see y'all's feedback on this stuff. The above is somewhat of a braindump, so most of it is likely to see some level of change as soon as I hear from you. And since I think we would benefit from as much design-upfront/code-to-spec development as we can be bothered with here, we should probably try to hash out as much as possible on the list before we try to nail down any specific development plans. IOW, we should probably arrive at a list of some of the main modules, as well as pseudo-POD for them, before we start coding (a stub example of what I mean by pseudo-POD follows at the end of this chapter).

Once we have that we can get started; and I think we should probably do it in a two-pronged fashion. First off, we should work on SGML::Parser::OpenSP — or possibly whatever subclass will sit in front of it (and an XML Processor) — separately. Maybe on SF.net where S::P::O lives; possibly migrated to dev.w3.org and qa-dev. Second, we should create «$CVSROOT/perl/modules/W3C/MVS/» and begin populating it. Probably it will make sense to start populating it top-down to get a framework, and then add in other bits standalone for integration when we get our feet under us. We could go the other way, but I fear that'd lead to fragmentation that we'd fight with later on.

But most importantly, I think we should be running normal development on 0.7 (and maybe even 0.8 and 0.9) in parallel with this effort. I do not want to stop mainline development for production while we try to implement any big radical changes to the architecture. Yes, this will likely mean M12N won't bear fruit until later than it could have if we gave it absolute priority.
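Here is that pseudo-POD stub. Everything in it (module name, methods, behaviour) is hypothetical, just to show the level of detail I think we should pin down before coding:

    =head1 NAME

    W3C::Markup::Validator::Parser - backend parser interface (hypothetical)

    =head1 SYNOPSIS

        my $p = W3C::Markup::Validator::Parser->new(Backend => 'OpenSP');
        my $r = $p->parse($document);
        print $_->as_string for $r->errors;

    =head1 DESCRIPTION

    Wraps a concrete SGML/XML processor (initially OpenSP by way of
    SGML::Parser::OpenSP) behind the backend API, so that «check» and
    the other front-ends never talk to a parser directly.

    =cut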
Chapter 7: All Good Things…

Version Roadmap:

0.7.0 — Current HEAD. Has Template code and the new Config facility.
0.8.0 — Make use of the new Config stuff to add warnings about mismatches or suboptimal usage of DOCTYPEs, Content-Types, charsets, etc. Possible target for landing minimal SGML::Parser::OpenSP? Possible target for landing a minimal I18N facility?
0.9.0 — Refinement of 0.8 features. What else? Possible target for landing minimal SGML::Parser::OpenSP? Possible target for landing a minimal I18N facility?
1.0.0 — Landing M12N (at some level; maybe not complete)?
1.1.0 — Addition of stuff like Attribute checks, Appendix C?
1.2.0 — Addition of minimal Schema Validation?
2.0.0 — Full Schema Validation? Replaces DTD Validation? Where are we time-wise at this point? What's happened in the world up to this point? Maybe something that looks completely different... :-)

In any case, I think 1.0 is a good place to land M12N; or rather, I think when we land M12N is a good place to declare status 1.0. I also think we have stuff that fits into 0.7/0.8/0.9 that we should aim to do. Possibly this should include making use of S::P::O — in the minimalistic «just replace the fork() of “onsgmls”» sense — for performance and as a clean way to integrate Sean's «sgmlid» code (the latter because it fits in nicely with the other new checking considered for 0.7-0.9).

The first step to that is to get 0.7.0 cleaned up, stabilized, and released. We can probably do without a lot of feature additions for 0.7.0. This gives us a window from 0.7.0 through 0.9.x for preparing the M12N code in parallel with production.

__END__

As mentioned, this is a braindump. I hope you will all chime in with your thoughts on this; especially Björn has been shooting his mouth off, so I'll expect a good long dissertation from him! :-) Everyone else, please do speak up at whatever length and level of detail you have time for! A «Sounds good» is ok; a «that's stoopid» comment probably requires a little more explanation; but do please speak up! I'm not currently married to any of the stuff above, but it is the result of thinking about this for quite some time, so I may take convincing for different approaches (unless you code it; he who writes the code makes the rules!).

--
Editor's note: in the last update, we noted that Larry Wall would "vomment" on existing RFCs. Some took that to be a cross between "vomit" and "comment." We are unsure of whether it was a subconscious slip or a typographical error. We are also unsure of whether or not to regret the error. -- use.perl.org