- From: Adrien de Croy <adrien@qbik.com>
- Date: Sat, 23 May 2009 12:00:34 +1200
- To: Jamie Lokier <jamie@shareable.org>
- CC: David Morris <dwm@xpasc.com>, HTTP Working Group <ietf-http-wg@w3.org>
Jamie Lokier wrote:
> David Morris wrote:
>>>> Since URIs can be arbitrarily long, yet database fields aren't good with
>>>> this, I'd presume it's common practice to look up based on some hash
>>>> value. Is this approach used? Is there any industry-standard hashing
>>>> method, e.g. MD5 of method + URI(normalised) + querystring?
>>>>
>>> I doubt it. Why would you do that? I don't think it's normal to use
>>> a URI to select an application and pass the querystring verbatim to a
>>> database, or at least it's not a good idea :-)
>>>
>> Why not? This is a caching-related question where the URI is part of
>> the cache key. Since I've not implemented such a cache, I can't speak
>> to what I have done, but a hash such as MD5 seems reasonable, in
>> particular if followed by an exact-match comparison with a value stored
>> in a blob, etc.
>>
> Ah, your question was about how to implement a cache.

yep, sorry if that wasn't clear

> There's lots of ways. Hashing the URI is one; then that could look up
> in a big hash table, or a file in a directory, or a multi-level directory
> tree.
>
> Or it could look up in a database like DB or TDB. There are lots of
> key-value databases which are happy with arbitrary-length key strings,
> or which have a fairly big limit and don't pad them.

We are on a Windows platform, and we wish the cache index to be shareable
by multiple proxies. This really directs us to a client-server SQL DBMS.
Tests so far haven't shown any problem with performance (depending on the
DBMS), and the DB itself can then run on any platform.

Most DBs will allow arbitrary-length text fields, but there's wide
variation in support for indexing on them or searching in them. We want to
allow the customer to choose the DBMS (using ODBC), so we need to cater for
a wide range, which necessitates a lowest-common-functionality approach
(e.g. the MS Jet engine).
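The hashed-key scheme under discussion can be sketched as follows. This is a minimal illustration, not taken from any particular cache implementation; the function name and key layout are my own assumptions:

```python
import hashlib

def cache_key(method: str, normalised_uri: str) -> str:
    """Derive a fixed-length key from an arbitrarily long URI.

    Sketch of the approach discussed above: an MD5 digest fits in a
    short, indexable DB column regardless of URI length. The full URI
    is stored alongside the hash so that a candidate hit can be
    confirmed with an exact byte-for-byte comparison, guarding against
    hash collisions.
    """
    material = method.upper() + " " + normalised_uri
    return hashlib.md5(material.encode("utf-8")).hexdigest()

# Always a 32-character hex string, however long the URI is.
key = cache_key("GET", "http://example.com/page?id=42")
```

A lookup would then be something like `SELECT uri, body FROM cache WHERE key_hash = ?`, followed by comparing the stored URI exactly against the requested one before treating the row as a hit.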
We used to not cache anything with a querystring, but there are a lot of
origin servers that mark responses to such requests as cacheable, and more
and more sites use query strings to define the resource for pretty much all
pages in a site via a single base URI. So not caching responses to requests
that had a querystring seems increasingly inefficient.

As for normalising, the example in RFC 2616 s3.2.3 on URI comparison implies
that the URI should be normalised, and that %xx should be treated the same
as the equivalent character. Has this been found to be problematic in
practice?

Also, many agents have a max limit on a URI, as has been discussed on this
list, so maybe another approach is to use a long varchar field and let the
DB do the hashing, just limit the URI length, and not cache anything with
excessively long URIs... it then seems a bit arbitrary where you choose the
cutoff for what is deemed an over-size URI.

Thanks

Adrien

> Some key-value databases hash internally, and some of them use B-trees
> or other data structures. If it's on disk, a B-tree might be good
> because it'll preserve locality among similar URIs.
>
> I've used a multi-level multi-key tree structure, in order to handle
> ETags properly with different Vary on the same URI.
>
> Using an SQL database sounds like a way to make your cache
> unnecessarily slow, and not a good fit for the problem.
>
> Be careful when normalising that you don't convert %xx of any
> "sensitive" characters, as it can change the meaning of the URI.
> Since any escaped character could be meaningfully distinct from its
> unescaped form to an application, that might mean don't convert any
> %xx at all.
>
> -- Jamie

--
Adrien de Croy - WinGate Proxy Server - http://www.wingate.com
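The caveat above about converting %xx escapes can be sketched as a conservative normaliser that decodes only the RFC 3986 "unreserved" characters and leaves everything else escaped. This is an illustration under that assumption, not a prescription from either RFC; the names are mine:

```python
import re

# RFC 3986 "unreserved" set: decoding these never changes URI meaning.
UNRESERVED = set(
    "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789-._~"
)

def normalise_escapes(uri: str) -> str:
    """Decode %xx only when the encoded octet is an unreserved character.

    Reserved or otherwise "sensitive" characters (/, ?, &, =, %, ...)
    stay percent-encoded, since unescaping them can change how an
    application interprets the URI. For the escapes we keep, the hex
    digits are upper-cased so equivalent URIs compare equal.
    """
    def repl(m):
        ch = chr(int(m.group(1), 16))
        if ch in UNRESERVED:
            return ch
        return "%" + m.group(1).upper()
    return re.sub(r"%([0-9A-Fa-f]{2})", repl, uri)

# %7E (~) is unreserved, so it is decoded; %2F (/) is kept escaped.
normalised = normalise_escapes("http://example.com/%7Euser?q=a%2fb")
```

Applied before hashing, this gives one canonical cache key per equivalent URI without collapsing escapes that could be semantically distinct.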
Received on Friday, 22 May 2009 23:58:01 UTC