- From: Yves Lafon <ylafon@w3.org>
- Date: Wed, 2 Feb 2011 12:53:41 -0500 (EST)
- To: www-tag@w3.org
- cc: Ted Guild <ted@w3.org>
Scalability of URI Access to Resources

The issue was raised after discussion of the problems generated by unnecessary dereferencing of DTD URIs on the W3C Web site. Since the original report of the issue, a few documents have been published: [1] on how to build a local cache, [2] pointing to the use of HTTP caching in libraries and programs, and, before this report, [3] on the use of local resolvers.

At that time, the solution used to handle the traffic had two steps: 1/ sending HTTP 503 to small abusers, 2/ firewalling big abusers. The other part of the solution was the use of catalogs to move the resolution of those popular DTDs to the clients (a minimal local-resolver sketch is appended after the signature). This was done in the CSS Validator, then in many other tools, even browsers. All in all, with education and outreach done by many people, the issue was contained, but it still placed a significant load on W3C's Web infrastructure.

In 2010, W3C changed the way those DTDs are served. A load balancer (LB) is now in charge of dispatching requests to the servers, and there is now a special rule for DTDs, schemas and entities. The small abusers are no longer greeted with a 503, and the big abusers are still firewalled (5 minutes first, then 24 hours), but the main difference is that every hit on a DTD/schema/entity is now served after a 30-second delay, making those URIs more expensive for clients to retrieve. Still, unnecessary retrievals of DTD URIs on the W3C site generate lots of unwanted traffic.

However, one thing is missing from that picture: the publication of catalogs as a statement from the publisher (W3C in this case) that the association between URIs and the attached content is cast in stone, so that local resolvers can rely on that content always being the same. The publication of such catalogs is still being evaluated; it could become part of the W3C Recommendation publication process, catalogs could be generated automatically based on frequency of access and stability of the documents, etc.

This issue is one of the multiple facets of dereferencing URIs. A resource can become inaccessible because of its popularity, because it disappears from the authoritative site, because of network issues, denial-of-service attacks, DNS hijacking, etc. The issue described here is one of the many aspects of resilience.

One way of expressing information about available mirrored resources can be found in draft-bryan-metalinkhttp-19 [4]. It allows the identification of mirrors, and ways to retrieve content from alternate locations, possibly using alternate protocols (FTP, P2P), while ensuring that the mirrored content is genuine via the use of cryptographic hashes (a rough sketch of that mechanism is also appended below). It does not solve W3C's DTD issue, as such documents are small: a GET retrieves the whole DTD in a few packets, too fast to make contacting other servers worthwhile. However, the description of mirrored resources might be a good start for other cases, like when an authoritative source disappears while mirrored, or when the authoritative source is under attack.

For the F2F, we should discuss whether we want to address only this issue or make it part of a bigger one related to resilience or URI dereferencing.

Cheers,

PS: Dear trackbot, this is ISSUE-58

[1] http://www.w3.org/QA/2008/09/caching_xml_data_at_install_ti.html
[2] http://www.w3.org/blog/systeam/2008/02/08/w3c_s_excessive_dtd_traffic
[3] http://nwalsh.com/docs/articles/xml2003/
[4] http://tools.ietf.org/html/draft-bryan-metalinkhttp-19

--
Baroula que barouleras, au tiéu toujou t'entourneras.

~~Yves
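
A minimal sketch of the catalog / local-resolver approach mentioned above, assuming Python with lxml and assuming local copies of the XHTML 1.0 DTDs have been installed at the purely illustrative paths below; the point is simply that the parser never dereferences the w3.org DTD URIs:

    from lxml import etree

    # Local copies of two frequently requested W3C DTDs; the filesystem
    # paths are purely illustrative, not part of any W3C distribution.
    LOCAL_DTDS = {
        "http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd":
            "/usr/local/share/xml/xhtml1-strict.dtd",
        "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd":
            "/usr/local/share/xml/xhtml1-transitional.dtd",
    }

    class LocalDTDResolver(etree.Resolver):
        """Serve well-known DTDs from disk instead of hitting w3.org."""
        def resolve(self, system_url, public_id, context):
            local_path = LOCAL_DTDS.get(system_url)
            if local_path is not None:
                return self.resolve_filename(local_path, context)
            return None  # fall through to lxml's default resolution

    parser = etree.XMLParser(load_dtd=True, no_network=True)
    parser.resolvers.add(LocalDTDResolver())

    # doc = etree.parse("page.xhtml", parser)  # the DTD is read locally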
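
And a rough sketch of the Metalink/HTTP mechanism as I read the draft [4]: mirrors advertised through Link headers with rel=duplicate, and an instance digest (Digest: SHA-256=...) used to check that whatever copy is finally retrieved is genuine. The parsing below is illustrative, not a complete implementation of the draft:

    import base64
    import hashlib
    import re
    import urllib.request

    def fetch_with_mirrors(url):
        """Metalink/HTTP-style retrieval (sketch): read rel=duplicate Link
        headers and a Digest header from the origin response, and fall
        back to a mirror whose content matches the advertised hash."""
        with urllib.request.urlopen(url) as resp:
            headers = resp.headers
            body = resp.read()

        # Mirrors are advertised as:  Link: <uri>; rel=duplicate
        link_values = ",".join(headers.get_all("Link") or [])
        mirrors = re.findall(r'<([^>]+)>\s*;\s*rel="?duplicate"?', link_values)

        # Instance digest, e.g.:  Digest: SHA-256=<base64 of the raw hash>
        expected = None
        for value in headers.get_all("Digest") or []:
            m = re.search(r"SHA-256=([A-Za-z0-9+/=]+)", value)
            if m:
                expected = base64.b64decode(m.group(1))

        if expected is None or hashlib.sha256(body).digest() == expected:
            return body  # no digest to check, or the origin copy is genuine

        # Origin copy failed verification: try the advertised mirrors.
        for mirror in mirrors:
            with urllib.request.urlopen(mirror) as alt:
                data = alt.read()
            if hashlib.sha256(data).digest() == expected:
                return data
        raise IOError("no copy matching the advertised digest was found")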
Received on Wednesday, 2 February 2011 17:53:43 UTC