- From: Yves Lafon <ylafon@w3.org>
- Date: Wed, 2 Feb 2011 12:53:41 -0500 (EST)
- To: www-tag@w3.org
- cc: Ted Guild <ted@w3.org>
Scalability of URI Access to Resources
This issue was raised after discussion of the problems generated by
unnecessary dereferencing of DTD URIs on the W3C website.
Since the issue was originally reported, a few documents have been
published: [1] on how to build a local cache, [2] pointing to the use of
HTTP caching in libraries and programs, and, predating this report,
[3] on the use of local resolvers.
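(For illustration only, and far less complete than the documents above:
a minimal on-disk cache for DTD fetches could look like the Python
sketch below; the cache location and function name are invented for the
example.)

    import hashlib
    import os
    import urllib.request

    CACHE_DIR = os.path.expanduser("~/.dtd-cache")  # hypothetical location

    def fetch_dtd(uri):
        """Return the content behind `uri`, hitting the network at most once."""
        os.makedirs(CACHE_DIR, exist_ok=True)
        # Key the entry on a hash of the URI so any DTD/entity URI maps
        # to a flat file name.
        path = os.path.join(CACHE_DIR, hashlib.sha1(uri.encode()).hexdigest())
        if os.path.exists(path):
            with open(path, "rb") as f:
                return f.read()
        with urllib.request.urlopen(uri) as resp:
            data = resp.read()
        with open(path, "wb") as f:
            f.write(data)
        return data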
At that time, the solution used to handle the traffic had two steps:
1/ sending HTTP 503 to small abusers,
2/ firewalling big abusers.
The other part of the solution was the use of Catalogs to move the
resolution of those popular DTDs onto the clients. This was done in the
CSS Validator, then in many other tools, even browsers.
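As an illustration of the same idea in code (the OASIS XML Catalog
format used by those tools is itself XML; the local paths below are
only examples), a Python/lxml resolver keeping the resolution on the
client could look like:

    from lxml import etree

    # Example mappings; in practice a real XML Catalog file carries these.
    CATALOG = {
        "http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd":
            "/usr/share/xml/xhtml1-strict.dtd",
        "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd":
            "/usr/share/xml/xhtml1-transitional.dtd",
    }

    class LocalDTDResolver(etree.Resolver):
        """Serve well-known DTDs from disk instead of hitting www.w3.org."""
        def resolve(self, url, public_id, context):
            local = CATALOG.get(url)
            if local is not None:
                return self.resolve_filename(local, context)
            return None  # unknown URI: fall back to default resolution

    parser = etree.XMLParser(load_dtd=True, no_network=True)
    parser.resolvers.add(LocalDTDResolver())
    # etree.parse("page.xhtml", parser) now loads the DTDs listed above
    # without any network access.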
All in all, with education and outreach done by many people, the issue
was contained, but it still took its toll on W3C's Web infrastructure.
In 2010, W3C changed the way those DTDs are served. A load balancer (LB)
is now in charge of dispatching requests to servers, and there is now
a special rule for DTDs, schemas and entities. Small abusers are no
longer greeted with a 503 and big abusers are still firewalled
(5 minutes first, then 24 hours), but the main difference is that every
hit on a DTD/schema/entity is now served after a 30-second delay, making
those URIs more expensive for clients to retrieve. Still, unnecessary
retrievals of DTD URIs on the W3C site generate lots of unwanted
traffic.
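(To make that policy concrete, here is an illustrative sketch only,
with a made-up abuse threshold, and not the actual LB configuration:)

    import time

    DELAY_SECONDS = 30        # every DTD/schema/entity hit is slowed down
    BLOCK_FIRST = 5 * 60      # first block: 5 minutes
    BLOCK_REPEAT = 24 * 3600  # repeat offenders: 24 hours
    ABUSE_THRESHOLD = 1000    # hypothetical request count marking a "big abuser"

    hits = {}     # client IP -> request count
    blocked = {}  # client IP -> (unblock time, offence count)

    def handle_dtd_request(client_ip, serve):
        now = time.time()
        until, offences = blocked.get(client_ip, (0, 0))
        if now < until:
            return None                    # firewalled: drop the request
        hits[client_ip] = hits.get(client_ip, 0) + 1
        if hits[client_ip] > ABUSE_THRESHOLD:
            duration = BLOCK_FIRST if offences == 0 else BLOCK_REPEAT
            blocked[client_ip] = (now + duration, offences + 1)
            return None
        time.sleep(DELAY_SECONDS)          # make the URI expensive to fetch
        return serve()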
However, one thing is missing from that picture: the publication of
Catalogs as a statement from the publisher (W3C in this case) that the
association between the URIs and the attached content is cast in stone,
so that local resolvers can rely on that content always being the same.
The publication of such catalogs is still being evaluated: it could
become part of the W3C Recommendation publication process, the catalogs
could be generated automatically based on frequency of access and
stability of the documents, etc.
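As a rough sketch of what such automatic generation could look like
(the hit threshold is invented, the input mappings are assumed to come
from access logs and the document store, and the output follows the
OASIS XML Catalogs <system> entry form):

    from xml.sax.saxutils import quoteattr

    OASIS_NS = "urn:oasis:names:tc:entity:xmlns:xml:catalog"

    def generate_catalog(access_counts, local_copies, min_hits=10000):
        """Emit an XML catalog for the most frequently requested documents.
        `access_counts` maps URI -> hit count, `local_copies` maps
        URI -> local file; both are assumed inputs for this sketch."""
        lines = ['<?xml version="1.0"?>',
                 '<catalog xmlns=%s>' % quoteattr(OASIS_NS)]
        for uri, hits in sorted(access_counts.items(), key=lambda kv: -kv[1]):
            if hits >= min_hits and uri in local_copies:
                lines.append('  <system systemId=%s uri=%s/>'
                             % (quoteattr(uri), quoteattr(local_copies[uri])))
        lines.append('</catalog>')
        return "\n".join(lines)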
This issue is one of the multiple facets of dereferencing URIs.
A resource can become inaccessible because of its popularity, because it
disappears from the authoritative site, because of network issues,
denial-of-service attacks, DNS hijacking, etc. The issue described here
is one of the many aspects of resilience.
One way of expressing information about available mirrored resources can
be found in draft-bryan-metalinkhttp-19 [4]. It allows the
identification of mirrors, and provides ways to retrieve content from
alternate locations, possibly using alternate protocols (FTP, P2P),
while ensuring that the mirrored content is genuine via the use of
cryptographic hashes.
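(As a rough sketch of the client side, assuming a mirror URI advertised
via a Link: <...>; rel=duplicate header and a SHA-256 instance digest
carried in a Digest: header; the function and parameter names are not
taken from the draft:)

    import base64
    import hashlib
    import urllib.request

    def fetch_with_mirror_check(url, mirror_url, expected_sha256_b64):
        """Try the origin, then a mirror, and only accept bytes whose
        SHA-256 digest matches the advertised one."""
        for candidate in (url, mirror_url):
            try:
                with urllib.request.urlopen(candidate, timeout=10) as resp:
                    data = resp.read()
            except OSError:
                continue
            digest = base64.b64encode(hashlib.sha256(data).digest()).decode()
            if digest == expected_sha256_b64:
                return data  # genuine copy, whichever server it came from
        raise RuntimeError("no server returned content matching the digest")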
Metalink does not solve W3C's DTD issue, as such documents are small: a
GET retrieves the whole DTD in a few packets, too fast for contacting
other servers to be worthwhile. However, the description of mirrored
resources might be a good start for other cases, such as when an
authoritative source disappears while mirrored, or when the
authoritative source is under attack.
For the F2F, we should discuss whether we want to address only this
issue or make it part of a bigger one related to resilience or URI
dereferencing.
Cheers,
PS: Dear trackbot, this is ISSUE-58
[1] http://www.w3.org/QA/2008/09/caching_xml_data_at_install_ti.html
[2] http://www.w3.org/blog/systeam/2008/02/08/w3c_s_excessive_dtd_traffic
[3] http://nwalsh.com/docs/articles/xml2003/
[4] http://tools.ietf.org/html/draft-bryan-metalinkhttp-19
--
Roam as you may, you will always return to your own home. (Provençal proverb)
~~Yves