scalabilityOfURIAccess-58

Scalability of URI Access to Resources

This issue was raised following the discussion generated by
unnecessary dereferencing of DTD URIs on the W3C Web site.

Since the original report of the issue, a few documents have been
published: [1] on how to build a local cache, [2] pointing to the use of
HTTP caching in libraries and programs, and, before this report, [3] on
the use of local resolvers.
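
As a rough sketch of the local-cache approach [1] describes (not the
code from that article; the cache location and names below are
invented), a client could keep a copy of each DTD on disk and only hit
the network on a cache miss:

  # Hypothetical sketch: keep a local copy of each DTD, hit the network
  # only on a cache miss. Cache location and names are invented.
  import hashlib
  import os
  import urllib.request

  CACHE_DIR = os.path.expanduser("~/.dtd-cache")

  def fetch_cached(uri):
      os.makedirs(CACHE_DIR, exist_ok=True)
      path = os.path.join(CACHE_DIR, hashlib.sha256(uri.encode()).hexdigest())
      if os.path.exists(path):
          with open(path, "rb") as f:
              return f.read()
      with urllib.request.urlopen(uri) as resp:
          data = resp.read()
      with open(path, "wb") as f:
          f.write(data)
      return data

  dtd = fetch_cached("http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd")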

At that time, the solution used to handle the traffic was in two steps:
1/ sending HTTP 503 to small abusers
2/ firewalling big abusers.
The other part of the solution was the use of Catalogs to move the
resolution of those popular DTDs onto the clients. This was done in the
CSS Validator, then in many other tools, even browsers.
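
For illustration, here is roughly what that client-side resolution
looks like in code (this is not how the CSS Validator does it; the
URI-to-file mapping and paths below are invented):

  # Sketch: resolve well-known DTD URIs to local files via lxml's
  # resolver hook; the mapping and file paths are illustrative only.
  from lxml import etree

  LOCAL_COPIES = {
      "http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd":
          "/usr/share/xml/xhtml1-strict.dtd",
  }

  class LocalDTDResolver(etree.Resolver):
      def resolve(self, system_url, public_id, context):
          local = LOCAL_COPIES.get(system_url)
          if local is not None:
              # Serve the DTD from disk instead of hitting w3.org.
              return self.resolve_filename(local, context)
          return None  # fall back to default resolution

  parser = etree.XMLParser(load_dtd=True)
  parser.resolvers.add(LocalDTDResolver())
  doc = etree.parse("page.xhtml", parser)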

All in all, with education and outreach done by many people, the issue
was contained, but it was still taking its toll on W3C's Web
infrastructure.

In 2010, W3C changed the way those DTDs are served. A load balancer (LB)
is now in charge of dispatching requests to the servers, with a special
rule for DTDs, schemas and entities. The small abusers are no longer
greeted with a 503, and the big abusers are still firewalled (5 minutes
first, then 24 hours), but the main difference is that every hit on a
DTD/schema/entity is now served after a 30-second delay, making those
URIs more expensive for clients to retrieve. Still, unnecessary
retrievals of DTD URIs on the W3C site generate lots of unwanted
traffic.
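
A lot of that traffic is avoidable on the client side in the first
place; for instance (an illustrative parser setup, not tied to any
particular tool), a client that does not need DTD validation can avoid
fetching the external DTD at all:

  # Illustrative: parse without dereferencing the external DTD.
  from lxml import etree

  parser = etree.XMLParser(load_dtd=False, resolve_entities=False,
                           no_network=True)
  doc = etree.parse("page.xhtml", parser)

When validation is needed, a local resolver or catalog like the one
above keeps the fetch off the network.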

However, one thing is missing from that picture: the publication of
Catalogs as a statement from the publisher (W3C in this case) that the
association between URIs and their content is cast in stone, so that
local resolvers can rely on that content always being the same.

The publication of such catalogs is still being evaluated: it could be
made part of the W3C Recommendation publication process, the catalogs
could be generated automatically based on frequency of access and
stability of the documents, etc.

This issue is one of the multiple facets of dereferencing URIs.
A resource can become inaccessible because of its popularity, because it
disappears from the authoritative site, because of network issues,
denial-of-service attacks, DNS hijacking, etc. The issue described here
is one of the many aspects of resilience.

One way of expressing information about available mirrored resources
can be found in draft-bryan-metalinkhttp-19 [4]. It allows the
identification of mirrors, and provides ways to retrieve content from
alternate locations, possibly using alternate protocols (FTP, P2P),
while ensuring that the mirrored content is genuine via the use of
cryptographic hashes.
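
In that draft, mirrors are advertised with Link headers (rel=duplicate)
and the payload is covered by an instance digest; a naive client sketch
(the URL is invented and the header parsing deliberately simplistic)
could look like this:

  # Naive client sketch for Metalink/HTTP-style headers: mirrors come
  # from Link: <...>; rel=duplicate, integrity from the Digest header.
  import base64
  import hashlib
  import re
  import urllib.request

  def find_mirrors(url):
      req = urllib.request.Request(url, method="HEAD")
      with urllib.request.urlopen(req) as resp:
          links = resp.headers.get_all("Link") or []
          digest = resp.headers.get("Digest", "")
      mirrors = []
      for link in links:
          m = re.search(r'<([^>]+)>\s*;\s*rel="?duplicate"?', link)
          if m:
              mirrors.append(m.group(1))
      return mirrors, digest

  def sha256_matches(data, digest_header):
      # Digest: SHA-256=<base64 of the raw hash>, RFC 3230 style
      expected = digest_header.split("SHA-256=", 1)[-1]
      return base64.b64encode(hashlib.sha256(data).digest()).decode() == expected

  mirrors, digest = find_mirrors("http://example.org/some/resource")
  if mirrors:
      with urllib.request.urlopen(mirrors[0]) as resp:
          data = resp.read()
      print("genuine copy" if sha256_matches(data, digest) else "digest mismatch")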

It does not solve W3C's DTD issue, as such documents are small: a GET
retrieves the whole DTD in a few packets, too fast to make contacting
other servers worthwhile. However, the description of mirrored
resources might be a good start for other cases, such as when an
authoritative source disappears while it is mirrored, or when the
authoritative source is under attack.

For the F2F, we should discuss whether we want to address only this
issue or make it part of a bigger one related to resilience or URI
dereferencing.
Cheers,

PS: Dear trackbot, this is ISSUE-58

[1] http://www.w3.org/QA/2008/09/caching_xml_data_at_install_ti.html
[2] http://www.w3.org/blog/systeam/2008/02/08/w3c_s_excessive_dtd_traffic
[3] http://nwalsh.com/docs/articles/xml2003/
[4] http://tools.ietf.org/html/draft-bryan-metalinkhttp-19

-- 
Baroula que barouleras, au tiéu toujou t'entourneras.

         ~~Yves

Received on Wednesday, 2 February 2011 17:53:43 UTC