- From: Henri Sivonen <hsivonen@iki.fi>
- Date: Fri, 7 Dec 2007 13:42:56 +0200
- To: olivier Thereaux <ot@w3.org>
- Cc: Tools dev list <public-qa-dev@w3.org>
Hi,

On Dec 6, 2007, at 09:19, olivier Thereaux wrote:

> * Installation *
>
> I find the building mechanism you adopted rather fascinating. The fact that the build script goes and fetches all dependencies and files automatically, and starts the servlet, is great. A downside of this is that the number of dependencies downloaded is huge! About half a gigabyte, with a number of jars present in multiple instances (ant, xerces-impl). Have you thought of a way to keep the number of jars to a minimum, perhaps by renaming them and keeping them all in a single directory?

I considered it very briefly and figured that by downloading from the original distribution sites I don't need to consider what legal or maintenance obligations I'd have if I distributed the third-party code myself. For example, I don't need to find out which packages would require me to find the complete corresponding source code and distribute that, too.

I did consider using Maven, but I figured that I'd run into trouble if even one of the dependencies weren't pre-Mavenized for me. Also, at the time I had zero experience with Maven. I now have some, and, indeed, it is the case that if I were using Maven for the whole thing, I'd have to take care of packaging some of the dependencies for Maven myself.

The size issue itself I hadn't considered. So far, I haven't hit any disk limits on any of the systems on which I've run the build script. Is the on-disk footprint of the dependencies directory a problem? Or the download size? With the disks and bandwidth available today, is it a problem that is worth addressing?

> * Opensourciness *
>
> In a discussion with Mike we were wondering if the tool could be distributed as open source. I've seen licenses for the html5parser, but apparently not for the whole. We were also wondering if the dependencies were all OSS-friendly.

Disclaimer: IANAL, TINLA.
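For readers curious about the mechanics: each entry in the build script's dependencyPackages list pairs a download URL with an MD5 checksum. A minimal sketch of what such a fetch-and-verify step might look like (function names are made up for illustration; the real build script's logic differs in details):

```python
import hashlib
import urllib.request

def md5_matches(data, expected_md5):
    """Check downloaded bytes against the expected MD5 checksum."""
    return hashlib.md5(data).hexdigest() == expected_md5

def fetch_dependency(url, expected_md5, dest):
    """Download one dependency from its original distribution site
    and refuse to keep it if the checksum does not match."""
    data = urllib.request.urlopen(url).read()
    if not md5_matches(data, expected_md5):
        raise ValueError("MD5 mismatch for " + url)
    with open(dest, "wb") as f:
        f.write(data)
```

Downloading from the upstream sites, as the email explains, sidesteps the redistribution obligations that shipping the jars directly would incur.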
There is currently one data file that the software pulls in from the network at runtime (yeah, that's in itself bad) that I think might not be Open Source: the IANA language tag registry. I contacted the IANA and got a permission to distribute the registry file as-is. (I currently don't distribute it in any form.) I followed up asking about modifications but never got a reply. Therefore, the IANA language tag registry is potentially a non-Open Source invariant section.

I could modify the software to read any designated file that is in the language tag registry format. However, the software performs RFC 4646 language tag validation only if the data file is equivalent to the IANA registry file.

I think accepting even one invariant section is a slippery slope and a potential problem considering inclusion in software distributions that don't like slippery slopes. I don't really know what I should do about the IANA registry file. The ideal solution would be for the IETF to relax their licensing terms. I think using copyright to enforce the integrity of normative files is the wrong way to go. I think the IETF (and the W3C for that matter) should release their normative stuff Free as in Free Software and only require modified versions to carry a notice that they are modified versions.

Other than that, everything above the JDK should be Open Source as defined by OSI and Free Software as defined by the FSF. I am also pretty sure that all the *runtime* dependencies are also Free Software as defined by Debian. However, some packages used by Validator.nu have build/test-time dependencies that may not be Free Software in the Debian sense. Disclaimer: IANDD.

If you choose to use IcedTea on Linux, you should be able to make the stack from JDK and downwards Open Source, too. (I haven't tried running Validator.nu on IcedTea, but I have every reason to believe that it would run. I haven't tried gcj/Classpath, either, but I'm less confident about that option Just Working.)
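For context, the registry mentioned above is a plain-text file of records separated by `%%` lines, each record being a set of `Key: Value` fields with indented continuation lines. A minimal parser sketch of that format (illustrative only; this is not the code Validator.nu uses):

```python
def parse_registry(text):
    """Parse language-subtag-registry-style records: records are
    separated by '%%' lines, fields look like 'Key: Value', and
    lines starting with whitespace continue the previous field."""
    records = []
    fields = {}
    key = None
    for line in text.splitlines():
        if line == "%%":
            if fields:
                records.append(fields)
            fields, key = {}, None
        elif line[:1].isspace() and key:
            fields[key] += " " + line.strip()
        elif ": " in line:
            key, value = line.split(": ", 1)
            fields[key] = value
    if fields:
        records.append(fields)
    return records
```

Being able to read any file in this format is the easy part; as noted above, the validation is only meaningful against data equivalent to the IANA registry itself.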
As far as I can tell, one way to analyze Debian-compatibility is to analyze GPLv3-compatibility. Here's a not-legal-advice-I-am-not-a-lawyer overview I wrote earlier.

Quoting the build script:

> dependencyPackages = [
> ("http://www.nic.funet.fi/pub/mirrors/apache.org/commons/codec/binaries/commons-codec-1.3.zip", "c30c769e07339390862907504ff4b300"),

Apache License 2.0 => GPLv3-compatible
(JUnit needed for building tests for source; not needed at runtime.)

> ("http://www.nic.funet.fi/pub/mirrors/apache.org/jakarta/httpcomponents/commons-httpclient-3.x/binary/commons-httpclient-3.1.zip", "1752a2dc65e2fb03d4e762a8e7a1db49"),

Apache License 2.0 => GPLv3-compatible
(JUnit needed for building tests for source; not needed at runtime.)

> ("http://www.nic.funet.fi/pub/mirrors/apache.org/commons/logging/binaries/commons-logging-1.1.zip", "cc4d307492a48e27fbfeeb04d59c6578"),

Apache License 2.0 => GPLv3-compatible
(Weird build-time deps that Validator.nu doesn't need at runtime.)

> ("http://download.icu-project.org/files/icu4j/3.6.1/icu4j_3_6_1.jar", "f5ffe0784a9e4c414f42d88e7f6ecefd"),

X Consortium-style => GPLv3-compatible

> ("http://download.icu-project.org/files/icu4j/3.6.1/icu4j-charsets_3_6_1.jar", "0c8485bc3846fb8f243ed393f3f5b7f9"),

X Consortium-style => GPLv3-compatible

> ("http://belnet.dl.sourceforge.net/sourceforge/jena/Jena-2.5.2.zip", "cd9c74f58b7175e56e3512443c84fcf8"),

3-clause BSD => GPLv3-compatible
(Build-time deps are Apache License 2.0, PD and a MIT-style one-off Sun license; not needed at runtime.)
Validator.nu only needs the IRI subpackage from Jena.

> ("http://dist.codehaus.org/jetty/jetty-6.1.5/jetty-6.1.5.zip", "c05153e639810c0d28a602366c69a632"),

Jetty itself is under Apache License 2.0 => GPLv3-compatible
Be warned, though, that some of the optional bits that Validator.nu does not use are under LGPL or the *CDDL*. Note that the servlet API jar that comes with Jetty and that Validator.nu needs is claimed to have CDDL bits in it.
The FSF may have a GPLv3-compatible replacement file available in GNU Classpath. I bet Debian has figured out a way to deal with the servlet API jar by now.

> ("http://mirror.eunet.fi/apache/logging/log4j/1.2.14/logging-log4j-1.2.14.zip", "6c4f8da1fed407798ea0ad7984fe60db"),

Apache License 2.0 => GPLv3-compatible
(CDDL build-time deps that Validator.nu doesn't need at runtime if you don't use logging to SMTP.)

> ("http://mirror.eunet.fi/apache/xml/xerces-j/Xerces-J-bin.2.9.0.zip", "a3aece3feb68be6d319072b85ad06023"),

Apache License 2.0 + W3C Software Notice => GPLv3-compatible

> ("http://ftp.mozilla.org/pub/mozilla.org/js/rhino1_6R5.zip", "c93b6d0bb8ba83c3760efeb30525728a"),

MPL 1.1 or GPLv2 or later => GPLv3-compatible

> ("http://download.berlios.de/jsontools/jsontools-core-1.5.jar", "1f242910350f28d1ac4014928075becd"),

LGPL 2.1 or later => GPLv3-compatible
(Only used for HTML parser tests. Validator.nu doesn't need this at runtime.)

> ("http://hsivonen.iki.fi/code/antlr.jar", "9d2e9848c52204275c72d9d6e79f307c"),

Public Domain => GPLv3-compatible
(Only used for HTML parser tests. Validator.nu doesn't need this at runtime.)

> ("http://www.cafeconleche.org/XOM/xom-1.1.jar", "6b5e76db86d7ae32a451ffdb6fce0764"),

LGPL 2.1 (*no* "or later" option) => OK for Debian as standalone but *not* GPLv3-compatible
(Validator.nu doesn't need this at runtime. It is part of the extended feature set of the HTML parser and can be severed without harm to Validator.nu.)

> ("http://www.slf4j.org/dist/slf4j-1.4.3.zip", "5671faa7d5aecbd06d62cf91f990f80a"),

MIT => GPLv3-compatible
(Build-time deps include the concrete logging system you want to use--log4j in this case.)
> ("http://www.nic.funet.fi/pub/mirrors/apache.org/commons/fileupload/binaries/commons-fileupload-1.2-bin.zip", "6fbe6112ebb87a9087da8ca1f8d8fd6a"),

Apache License 2.0 => GPLv3-compatible

> ("http://mirror.eunet.fi/apache/xml/xalan-j/xalan-j_2_7_0-bin.zip", "ec42adbc83eb0d1354f73a600e274afe"),

Apache License 2.0 => GPLv3-compatible
It may have some Apache License 1.1 & MIT-style bits, but I don't grok what role those have. :-( I'm not sure what the GPLv3-compat status of Apache License 1.1 is.

> ("http://mirror.eunet.fi/apache/ant/binaries/apache-ant-1.7.0-bin.zip", "ac30ce5b07b0018d65203fbc680968f5"),

Apache License 2.0 => GPLv3-compatible
You only need the core Ant plus the Launcher to allow oNVDL+Xalan to run. However, Ant as a whole is likely to have *insane* build-time deps. Fortunately, Ant is already in Debian.

> ("http://surfnet.dl.sourceforge.net/sourceforge/iso-relax/isorelax.20041111.zip", "10381903828d30e36252910679fcbab6"),

MIT => GPLv3-compatible

> ("http://ovh.dl.sourceforge.net/sourceforge/junit/junit-4.4.jar", "f852bbb2bbe0471cef8e5b833cb36078"),

CPL 1.0 => *not* GPLv3-compatible. Aargh! Fortunately, this is not a run-time dep. An older and sufficient version of JUnit is already in Debian. I have to suspect this is against Debian's own policy, but that's not my fight. In any case, you could satisfy the build-time JUnit deps with stubs.

Note that I have to re-introduce a build-time dependency on MPL 1.0 code in order to sync with upstream. Debian could address this with a one-class stub or with a quick patch in the future.
> moduleNames = [
> "build",

MIT => GPLv3-compatible

> "syntax",

MIT => GPLv3-compatible

> "util",

MIT => GPLv3-compatible

> "htmlparser",

MIT => GPLv3-compatible

> "xmlparser",

GPLv2 or later with Library Exception => GPLv3-compatible

> "onvdl",

3-clause BSD => GPLv3-compatible

> "validator",

MIT => GPLv3-compatible

Quoting olivier Thereaux again:

> * Running as servlet *
>
> I found the class that runs the validator per se (nu.validator.servlet.Main) which indeed works nicely, but it uses its own standalone server (it wraps around jetty?). It would be nice to have a way to run this as a servlet from an existing jetty/tomcat/jigsaw (I'm particularly interested in running an instance on jigsaw). Is that possible? I couldn't find any doc on this yet.

nu.validator.servlet.Main makes the software more debuggable and helps me avoid XML situps. The Main class does these things:

1) Initializes log4j.
2) Installs the servlet.
3) Installs servlet filters.
4) Runs Jetty.

The servlet class (nu.validator.servlet.VerifierServlet) itself takes care of its initialization in a static {} block, so it should be fairly easy to load it in jigsaw with whatever mechanism jigsaw uses, if you can manage to get jigsaw to initialize log4j first and if jigsaw doesn't do crazy class unloading and reloading that would cause the static {} block to run repeatedly.

To make the servlet work easily in the configurations that I've used it in, it does some quick and dirty sniffing of its deployment context in order to dispatch between the HTML5 facet and the generic facet. I'd be happy to make this sniffing less dirty if you have suggestions about what to do.

The servlet filters are optional. They implement compression of output (a filter from Jetty) and support for form-based file upload and <textarea>-based input.

> * All dynamic *
>
> I see that the interface is dynamic (nu.validator.servlet.FormEmitter) Would it make sense to have this as a static document?
> Having the main interface be a dynamic resource would be costly for a high-traffic service.

I have considered caching the front page in RAM, but so far generating it dynamically hasn't been a real problem, so I have postponed it as premature optimization.

> Also, the lack of static documents served along with the servlet means that stylesheets are hardcoded to pointing to http://hsivonen.iki.fi, which is subpar.

I have deliberately decided against serving any files from the disk through the servlet to avoid security holes and to avoid doing work that is best left to Apache. I think serving style sheets and scripts from an Apache instance responding at another host name is quite adequate.

The hard-coded network locations are indeed uncool. That's not the only instance. There are:

* the URL of the style sheet
* the URL of the JavaScript file
* the URL of the about page
* the URL of the WHATWG HTML5 spec (twice)
* the URL of the IANA language tag registry
* the URL of the WHATWG wiki page for microsyntax descriptions

When I finish writing this email, I'll start parametrizing these.

> * Code review / doc-a-thon *
>
> Have you done code reviews of the validator in the past?

The (X)HTML5 schema has gotten code review from fantasai. The Java code has not been subjected to review (before the look you are taking now).

> A webcast or teleconference-based intro would be very cool, it would be a great way to present the features and quite likely help write doc on the fly / interest people in participating in code. I'm sure we could organize something. Would you be interested?

I'm interested in showing people around the code and/or answering questions over a telecon. However, I think I won't be able to write docs in real time during a telecon without wasting other people's time.

> * SVG validation *
>
> knowing that validator.nu had RNG and nvdl capabilities, I was particularly interested in seeing how it worked with SVG. I haven't had time to extensively test with a lot of SVG.
I haven't tested the SVG schema a lot, either. Moreover, I haven't tested the NVDL part of oNVDL at all yet.

> I noticed it only has one SVG schema.

It's the one from Relaxed without any improvements or testing on my part. Eventually, I'd like to improve SVG validation.

> I wonder if it would be possible to preparse documents with SVG media type and look for version and baseprofile attributes on the root element, switching between SVG tiny, basic and full based on that. I'm not sure if NVDL could already do that.

Currently, the automatic schema selection for XML documents is based on the root element namespace. This is implemented in nu.validator.servlet.BufferingRootNamespaceSniffer. Since the SVG baseProfile and version attributes occur on the root element if present, it would be relatively easy to make the sniffer sniff those attributes as well and to extend the schema selection mechanism to label schemas not only with the namespace URI but with other data as well.

I think SVG profiling is a spec misfeature, though, so I haven't been particularly keen on developing support for it. However, I can see why a W3C installation might want to support it, so I'd accept a [suitably licensed] patch. Ideally, I'd like to see a single unified SVG spec whose latest version the validator would track. That is, I'd prefer an "SVG5" schema over profiles.

> I tried validating an SVG document with a number of foreign namespace content in it (typical sodipodi/inkscape output) and found that the validator.nu complained about these. Is it on purpose?

It's on purpose in the sense that, by design, RELAX NG schemas are white lists--not black lists. It isn't on purpose in the sense that I haven't really reviewed the SVG schema yet.

> I've heard a lot of arguments in favor of dropping anything in namespaces not known by the validator, and/or using nvdl to validate foreign namespace fragments. Is that something validator.nu can do, is it planned?
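As an aside, the root-element sniffing described above can be sketched in a few lines. This is an illustrative Python sketch, not the actual Java sniffer (nu.validator.servlet.BufferingRootNamespaceSniffer); the schema labels and the function name are made up:

```python
import xml.etree.ElementTree as ET

SVG_NS = "http://www.w3.org/2000/svg"

def pick_svg_schema(document):
    """Sniff the root element's namespace and its baseProfile
    attribute, mapping them to a (hypothetical) schema label."""
    root = ET.fromstring(document)
    if root.tag.startswith("{"):
        ns = root.tag[1:].partition("}")[0]
    else:
        ns = ""
    if ns != SVG_NS:
        return None  # not SVG; fall back to namespace-based selection
    profile = root.get("baseProfile", "full")
    return "svg-" + profile  # e.g. svg-tiny, svg-basic, svg-full
```

The point is only that the version/baseProfile data is right there on the root element, so extending the existing namespace sniffing to cover it would be a small change.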
> I'm certain this would be a fantastic tool for the adoption of SVG.

It might be that the NVDL part of oNVDL allows something like this already. I haven't investigated yet.

I have considered this issue, though, and I am planning on punching a hole in the XHTML5 schema for embedded RDF, for example. I guess the SVG schema should get that hole, too. I have also considered an option to filter out unknown namespaces, but I'm not sure if it is good to open such "anything goes" holes in a validator.

Thank you for your comments.

--
Henri Sivonen
hsivonen@iki.fi
http://hsivonen.iki.fi/
Received on Friday, 7 December 2007 11:43:34 UTC