Roll call: Bjoern, Nick[niq] (half here), Ville[scop], Yan, Karl, Yves, Olivier[yod], Dom, Terje[xover] (arrived later)

last meeting: http://www.w3.org/mid/C798D705-8925-11D8-AEFA-000393A63FC8@w3.org

** Agenda 1 - checklink and robots **

[00:45:48:] I understood OT has been testing 3.9.3-dev a bit, what about others?
[00:46:49:] * bjoern_ fwiw, did not test checklink...
[00:47:22:] * yod happy with the new feature, with the reservation that I wonder whether it should ignore the robots protocol for non-recursive mode
[00:47:58:] yod: I have a feeling that could be a bit hairy
[00:48:07:] (to implement, that is)
[00:48:34:] scop: because of different UA/RobotUA classes?
[00:48:58:] * yod would like to know others' gut feeling about that too, beyond the implementation issue
[00:49:10:] yod: yep, might be possible to work around that though by directly accessing RobotRules, dunno
[00:49:45:] * bjoern_ thinks that link checkers should ignore at least Disallow: *...
[00:50:29:] nope
[00:50:36:] it's after all just a HEAD, and following robots.txt makes link checkers less useful
[00:50:59:] niq?
[00:51:15:] should display "forbidden by robot rules" with a link to a howto on changing that to allow the link checker
[00:51:57:] that'll work with the default implementation of RobotRules
[00:52:03:] I think in non-recursive mode, the link checker is hardly a robot
[00:52:09:] it's merely a browser
[00:52:27:] indeed
[00:52:29:] it is. And it falls straight into ban-me traps
[00:52:50:] and it subjects webservers to rapid-fire
[00:53:01:] how so?
[00:53:13:] hmm... actually, what I mean is a bit more precise: the link checker should not fail when the primary URI is excluded by robots rules
[00:53:28:] that too
[00:53:30:] ... only when checked URIs inside the page fall under these rules
[00:53:53:] * niq thinks it should
[00:54:12:] I would respect robots.txt... if someone put up a robots.txt with Disallow, it's because they have reasons for it, and this same person in charge will also have the possibility to tweak a configuration to let the link checker through if needed. The UserAgent string, for example
[00:54:17:] otherwise it's open to various attacks, like pointing it at a bad-crawler-trap page directly
[00:54:31:] well, as an author, if I want my links checked and the link checker says I should test manually, I would open the link in my browser, which results in much more traffic than the link checker causes (HEAD vs GET, style sheets, images, ...)
[00:54:34:] karlcow++
[00:55:08:] niq/karlcow++
[00:55:27:] bjoern_: as an author is one thing, but an online robot can be pointed at a third-party webserver, including in a malicious attack
[00:56:13:] http://qa-dev.w3.org/wlc/checklink?uri=http%3A%2F%2Fkoti.welho.com%2Fvskytta%2Ft.html
[00:56:47:] what's missing is the link to a howto describing how to allow the link checker to access the site
[00:57:05:] yep
[00:57:12:] <__Yves> well, people usually edit robots.txt once and for all and use User-Agent *
[00:57:30:] I want to validate external links (internal ones never break) and I cannot change the robots.txt of a foreign server.
[00:57:34:] scop++
[00:57:45:] I would use my own link checker that does not honor robots.txt instead
[00:58:21:] fine. so that can fall straight into a ban-me tarpit and start generating 403s on every page
[00:58:23:] yes bjoern_: but you can't force people if they don't want to; it really is their choice
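[A rough sketch (not checklink's actual code) of the workaround scop mentions above: consult WWW::RobotRules directly instead of going through LWP::RobotUA, so the checker can report "forbidden by robot rules" rather than silently fail. The "W3C-checklink" agent token and the howto pointer are assumptions.]

    #!/usr/bin/perl
    # Sketch: check robots rules by hand so the result can be *reported*
    # rather than turned into a hard failure.
    use strict;
    use warnings;
    use LWP::UserAgent;
    use WWW::RobotRules;
    use URI;

    my $agent = 'W3C-checklink';    # assumed UA token
    my $ua    = LWP::UserAgent->new(agent => $agent);
    my $rules = WWW::RobotRules->new($agent);

    # Returns true if $url may be visited according to the site's
    # /robots.txt (no readable robots.txt means everything is allowed).
    sub allowed_by_robots {
        my ($url) = @_;
        my $u = URI->new($url);
        my $robots_url = $u->scheme . '://' . $u->host_port . '/robots.txt';
        my $res = $ua->get($robots_url);
        return 1 unless $res->is_success;
        $rules->parse($robots_url, $res->decoded_content);
        return $rules->allowed($url);
    }

    my $link = shift @ARGV or die "usage: $0 <uri>\n";
    if (allowed_by_robots($link)) {
        my $res = $ua->head($link);    # just a HEAD, as noted above
        print "$link: ", $res->status_line, "\n";
    } else {
        # Report instead of failing, with a pointer to a (hypothetical) howto.
        print "$link: forbidden by robot rules, see the howto to allow the checker\n";
    }

[Holding the rules separately like this would also make dom's proposal at [00:53:13:] easy: fetch the primary URI unconditionally and apply the check only to links found inside the page.]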
[00:58:24:] * yod agrees at least with Dom's point about not stopping when the (checked) page is disallowed
[00:58:31:] And typically you use robots.txt for things you don't want to show up on search engines...
[00:59:23:] * yod would like to make a distinction between recursive and non-recursive mode
[00:59:36:] I don't think there is any disagreement about recursive mode, is there?
[00:59:45:] (bjoern?)
[00:59:50:] We could limit the number of pages/host to prevent malicious use
[01:00:42:] even one page could get the checker banned automatically from a site
[01:01:41:] # of pages/host is not too different from "full" robots.txt "compliance", it also produces unsatisfactory results from the author's POV
[01:02:05:] http://www.robotstxt.org/wc/exclusion.html#robotstxt
[01:02:43:] btw, fwiw, LWP does not support the "revised internet-draft" version of the spec
[01:03:10:] ok, the "spec" is clear
[01:03:23:] it's for all robots
[01:03:32:] it's for retrieved documents
[01:03:41:] no mention of indexing.
[01:03:46:] HEAD is not retrieval
[01:04:08:] [[ Robots are often used for maintenance and indexing purposes, by ]]
[01:04:25:] yep
[01:04:33:] maintenance ;) for example
[01:04:42:] <__Yves> HEAD retrieves meta-information, so it is partly retrieval
[01:04:49:] the "spec" talks about "visiting"
[01:04:59:] it says "Note that these instructions apply to any HTTP method on a URL."
[01:05:00:] <__Yves> GET retrieves data and metadata, so not only the content
[01:05:10:] * yod waiting in a corner for the spec bashing to start
[01:05:19:] ahaha
[01:05:33:] It's only a draft...
[01:05:34:] :)
[01:06:01:] <__Yves> if ever crawlers would be willing to start using OPTIONS * :)
[01:06:25:] yeah...
[01:06:40:] and means in web servers to configure OPTIONS...
[01:06:49:] * yod thinks... that we need to find a way to make checklink behave, and that robots.txt is the mechanism
[01:06:59:] * yod would be happy with:
[01:07:47:] 1 - inviting people to be nicer to checklink in their robots.txt
[01:08:15:] 2 - an option (not available in recursive mode?) to ignore the protocol
[01:08:31:] 2--
[01:08:43:] with the default being to follow it, and a note on responsibility + other "behave" mechanisms (timer?)
[01:09:12:] the trouble is, any such option is an open invitation to the malicious
[01:09:24:] there's already the 1 sec delay, not bound to robots.txt as such
[01:09:59:] yeah, I was thinking of increasing the delay when not following robots.txt
[01:10:07:] tell that to a slashdotted site
[01:11:58:] is recursive mode limited to the host of the original page URI?
[01:12:07:] As for the malicious ones... Checklink is an open source Perl program... a truly malicious geek will reactivate whatever he wants anyway. So I think the options can be minimal.
[01:12:22:] bjoern_: host + base URI
[01:12:44:] so it follows only "internal" links?
[01:12:58:] <__Yves> malicious ones don't need that to do a DoS
[01:13:02:] no, the restriction is for *documents*, not links
[01:13:28:] <__Yves> bet more on a user fumbling with a config than on someone wanting to do evil things
[01:13:32:] I mean, if I have a link on x.org to www.w3.org, would it follow links on www.w3.org?
[01:14:04:] depends on the definition of "follow", but yes, it would do the "link checking" on them, i.e. HEAD
[01:14:49:] why?
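[For contrast, a minimal sketch of the polite default, using LWP's stock robot class: LWP::RobotUA honors /robots.txt on its own and enforces a per-host delay. The 1-second delay mentioned above maps to 1/60 of a minute, since delay() counts minutes; the agent token and contact address are placeholders.]

    #!/usr/bin/perl
    # Default "behave" mechanisms: LWP::RobotUA fetches and honors
    # /robots.txt by itself and rate-limits requests per host.
    use strict;
    use warnings;
    use LWP::RobotUA;

    my $ua = LWP::RobotUA->new(
        agent => 'W3C-checklink',            # assumed UA token
        from  => 'link-checker@example.org', # placeholder contact address
    );
    $ua->delay(1/60);    # delay() is in minutes; 1/60 min = the 1 sec delay

    my $link = shift @ARGV or die "usage: $0 <uri>\n";
    my $res  = $ua->head($link);
    # For a URI disallowed by robots.txt, LWP::RobotUA answers with a
    # synthetic 403 ("Forbidden by robots.txt") without contacting the
    # server at all.
    print $res->status_line, "\n";

[An "ignore robots.txt" option as in yod's point 2 would essentially amount to swapping this class back for a plain LWP::UserAgent.]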
[01:14:55:] * dom__ wonders how a bot is supposed to react when robots.txt itself is forbidden from being visited by the robots.txt file
[01:15:09:] * dom__ knows he's looking for trouble :)
[01:15:41:] dom__: http://www.robotstxt.org/wc/norobots-rfc.html section 3.1
[01:15:44:] The bot would be ashamed and hide in the corner of the server...
[01:16:18:] * xover arrives...
[01:16:28:] <__Yves> dom: and you can make it forget this using a Cache-Control: no-cache, no-store
[01:16:28:] what do we do with ?
[01:16:45:] bjoern_: why what? /me lost...
[01:17:09:] Why it would check links on foreign sites in recursive mode
[01:17:49:] well, it is a link checker? note, that is not the same as recursing offsite
[01:17:59:] unhandled ATM
[01:18:42:] oops, misread, it does not check links *on* foreign sites. it does check links *to* foreign sites
[01:20:19:] ok
[01:22:19:] well, we don't seem to have an agreement on that
[01:22:45:] what do we do then?
[01:22:54:] launch 3.9.3 beta
[01:22:59:] get feedback
[01:23:03:] decide what to do
[01:23:13:] yod++
[01:23:18:] (I think the current behavior is fine, although I'd prefer the way I proposed)
[01:24:08:] I think this discussion had interesting points, will re-use that to steer feedback when we go to beta
[01:24:30:] speaking of which, please play with the instance on qa-dev, as well as the markup validator there
[01:24:43:] they have the latest LWP, which we need to try
[01:24:59:] [closing this item]

(later)
[02:22:29:] btw, first cut at documenting the /robots.txt stuff for checklink up @ http://qa-dev.w3.org/wlc/checklink?uri=http%3A%2F%2Fkoti.welho.com%2Fvskytta%2Ft.html
[02:22:36:] wording improvements welcome

** Agenda 2 - CSS validator - progress and priorities **

[01:25:58:] dodji updated libcroco CVS, I am going to have a look at that
[01:26:07:] no progress on the CSS schema
[01:26:28:] <__Yves> ok, so I recently closed some issues, partly by fixing the grammar (which is really thin) and by upgrading JavaCC
[01:27:06:] <__Yves> would be nice to have a test suite (that can act as a regression TS as well)
[01:27:24:] __Yves, I can look at the bugs and prioritize them to some extent
[01:27:29:] <__Yves> also a list of "needs to be fixed in priority" would be nice :)
[01:27:52:] <__Yves> bjoern_: well, what may have a high priority for me might not have the same for others
[01:28:18:] <__Yves> so guidance from users and people interacting with users is welcome :)
[01:29:02:] Well, I would probably give highest priority to those which most users complained about...
[01:29:16:] btw http://www.w3.org/Bugs/Public/buglist.cgi?product=CSSValidator
[01:29:40:] <__Yves> yep, saw this, I found the .not bug there (and fixed it)
[01:29:59:] P2 is quite crowded
[01:30:50:] * scop notes that P2 is the default in Bugzilla
[01:31:07:] <__Yves> yeah, and P1 is for a URI that has moved...
[01:31:24:] http://www.w3.org/Bugs/Public/show_bug.cgi?id=337
[01:31:40:] It should probably be closed as invalid
[01:32:12:] <__Yves> yes
[01:32:23:] <__Yves> so only P2 bugs remain
[01:33:04:] <__Yves> (if the MIME type is good, there is no reason it wouldn't work, regardless of the URI)
[01:33:05:] there is a P5, http://www.w3.org/Bugs/Public/show_bug.cgi?id=399, which should probably have higher priority
[01:33:08:] so ACTION: bjoern to modify priorities in CSSValidator's Bugzilla
[01:33:23:] and ACTION: Yves to fix bugs
[01:33:24:] :)
[01:33:32:] <__Yves> yeah :)
[01:33:34:] what do we do re test suite?
[01:33:42:] <__Yves> ACTION yod to start a test suite :)
[01:33:47:] !!!
[01:34:06:] I have not touched test suites for a while, my bad
[01:34:16:] <__Yves> bjoern: I have a set of files used to test some bugs, they can be used for regression tests, but not more
[01:34:29:] Yves: send that list to me
[01:34:29:] <__Yves> and we perhaps need more than that (from regular stuff to corner cases)
[01:34:37:] <__Yves> and this works also for the markup validator
[01:34:51:] I'll try to work on that within the next 2 weeks
[01:34:51:] <__Yves> (including weird encoding corner cases)
[01:35:07:] I also have a number of test pages/style sheets, a number of them linked from Bugzilla...
[01:35:07:] <__Yves> yod: remind me so that I won't forget
[01:35:18:] I will...
[01:35:35:] ACTION: Yves and Bjoern send olivier a list of "test" case URIs for the CSS validator
[01:35:48:] now I know I will remind you
[01:36:12:] on a related (to the CSS validator) note, the Spanish Office is motivated to handle translation of interfaces and errors
[01:36:38:] I will (tomorrow, I think) work on a plan for translations and maintenance thereof
[01:37:16:] anything else on the CSS validator?
[01:37:30:] should be straightforward for the CSS validator
[01:37:42:] bjoern_: I think so
[01:37:45:] I would like information from sijtsche/plh/whoever on how much CSS3 is supposed to be implemented
[01:37:49:] <__Yves> that should be it (note that with the new JavaCC, performance improved)
[01:38:02:] <__Yves> yeah, so do I, and information on support for other profiles
[01:38:29:] bjoern_: would you like to start a mail thread about it on qa-dev?
[01:38:36:] There are lots of things I am not sure about, whether they are unimplemented or broken...
[01:38:42:] or w-v-c if you prefer
[01:39:18:] I would prefer if you sent them a mail summarizing what's implemented / what they implemented / something like that
[01:39:30:] cc'ing w-v-c/qa-dev
[01:39:31:] fine
[01:39:33:] I will
[01:40:07:] oh, and probably w3c-css-wg
[01:40:10:] ACTION: olivier contact PLH/Sijtsche and ask them what is implemented / to what extent (esp. CSS3)
[01:40:58:] (btw, Bert has an ongoing action item to make sure CSS 2.1 is supported in the CSS validator...)
[01:40:41:] [closing item]

** Agenda 3 - Markup Validator **

[01:42:22:] Markup Validator: not much feedback on 0.6.5b2, beyond style issues
[01:42:41:] bjoern animating interesting discussions
[01:42:58:] without much luck, as I expected...
[01:43:15:] there were answers... from the usual suspects
[01:43:45:] There wasn't much feedback on previous betas either (not counting my comments); it seems we have a general feedback issue
[01:44:08:] Well, this beta was pretty much low-profile
[01:44:19:] compared to others, which were announced much more broadly
[01:44:47:] which did not yield much feedback either
[01:44:55:] add a [...] with a [...]
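[On the test-suite actions above, a minimal sketch of what a regression harness for Yves' bug-test files might look like. Everything here is hypothetical: the check_css wrapper command, the tests/ directory layout, and the .expected files; the same scheme would cover the markup validator (including the weird-encoding corner cases) by swapping the command.]

    #!/usr/bin/perl
    # Hypothetical regression harness: run each .css test file through a
    # validator wrapper and diff its output against a stored expected result.
    use strict;
    use warnings;

    my $validator = 'check_css';    # hypothetical CLI wrapper around the validator
    my @failures;

    for my $css (glob 'tests/*.css') {
        (my $expected = $css) =~ s/\.css$/.expected/;
        unless (-e $expected) {
            warn "no expected output for $css, skipping\n";
            next;
        }
        my $got = `$validator $css 2>&1`;
        open my $fh, '<', $expected or die "cannot read $expected: $!\n";
        my $want = do { local $/; <$fh> };
        close $fh;
        push @failures, $css if $got ne $want;
    }

    if (@failures) {
        print "FAIL: $_\n" for @failures;
        exit 1;
    }
    print "all regression tests passed\n";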