Re: querystring part of cache key

Jamie Lokier wrote:
> David Morris wrote:
>   
>>>> Since URIs can be arbitrarily long, yet database fields aren't good with
>>>> this, I'd presume it's common practise to look up based on some hash
>>>> value.  Is this approach used?  Is there any industry-standard hashing
>>>> method, e.g. MD5 of method+URI(normalised) + querystring ?
>>>>         
>>> I doubt it.  Why would you do that?  I don't think it's normal to use
>>> a URI to select an application and pass the querystring verbatim to a
>>> database, or at least it's not a good idea :-)
>>>       
>> Why not? .. this is a caching related question where the URI is part of
>> the cache key ... since I've not implemented such a cache, I can't speak 
>> to what I have done, but a hash such as MD5 seems reasonable ... in 
>> particular if followed by an exact match comparison with a value stored
>> in a blob, etc.
>>     
>
> Ah, your question was about how to implement a cache.
>   
yep, sorry if that wasn't clear

> There's lots of ways.  Hashing the URI is one, then that could look up
> in a big hash table or a file in a directory, or multi-level directory tree.
>
> Or it could look up in a database like DB or TDB.  There are lots of
> key-value databases which are happy with arbitrary length key strings,
> or which have a fairly big limit and don't pad them.
>   
We are on a Windows platform, and we want the cache index to be 
shareable by multiple proxies.

This really directs us to a client-server SQL DBMS.  Tests so far 
haven't shown any problem with performance (depending on the DBMS).  The 
DB itself then can run on any platform.

Most DBs will allow arbitrary length text fields, but there's a wide 
variation in support for indexing on them or searching in them.  We want 
to allow the customer to choose the DBMS (using ODBC) so we need to 
cater for a wide range, which necessitates a lowest common functionality 
approach (e.g. the MS Jet engine).
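
For what it's worth, a minimal sketch of the hash-plus-exact-match lookup
discussed above might look like the following.  It is Python, with sqlite3
standing in for whatever DBMS sits behind ODBC.  The table and column names
(cache_index, key_hash, etc.) and the choice to hash "METHOD URI" are
assumptions for illustration only, not a description of any existing
implementation.

```python
import hashlib
import sqlite3

def cache_key(method: str, uri: str) -> str:
    """Fixed-width key: MD5 over 'METHOD URI' (an assumed key layout)."""
    data = f"{method.upper()} {uri}".encode("utf-8")
    return hashlib.md5(data).hexdigest()  # always 32 hex characters

# sqlite3 is only a stand-in for the customer's DBMS.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE cache_index ("
    " key_hash CHAR(32) NOT NULL,"   # short, fixed width, easy to index
    " method VARCHAR(16) NOT NULL,"
    " uri TEXT NOT NULL,"            # full URI, not indexed
    " body BLOB)"
)
conn.execute("CREATE INDEX idx_key_hash ON cache_index (key_hash)")

def store(method: str, uri: str, body: bytes) -> None:
    conn.execute(
        "INSERT INTO cache_index (key_hash, method, uri, body)"
        " VALUES (?, ?, ?, ?)",
        (cache_key(method, uri), method.upper(), uri, body),
    )

def lookup(method: str, uri: str):
    """Find candidates by hash, then confirm with an exact comparison."""
    rows = conn.execute(
        "SELECT method, uri, body FROM cache_index WHERE key_hash = ?",
        (cache_key(method, uri),),
    )
    for stored_method, stored_uri, body in rows:
        if stored_method == method.upper() and stored_uri == uri:
            return body
    return None  # no entry, or only hash collisions
```

The index only ever sees the 32-character digest, so this should work even
where the DBMS can't index long text fields; the exact comparison on the
stored URI guards against hash collisions.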

We used to not cache anything with a querystring, but a lot of 
origin servers mark responses to such requests as cacheable, and 
more and more sites use query strings to define the resource for 
pretty much every page on the site via a single base URI.  So not caching 
responses to requests that have a querystring seems increasingly inefficient.

As for normalising, the example in RFC2616 s3.2.3 on URI comparison 
implies that the URI should be normalised, and that %xx should be 
treated the same as the equivalent character.  Has this been found to be 
problematic in practice?
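
To make the question concrete, here is a rough sketch of a conservative
normaliser in Python.  It lower-cases the scheme and host, drops an empty or
default port, turns an empty path into "/", and decodes %xx only when the
result is an "unreserved" character.  Using the RFC 3986 unreserved set is
my assumption (RFC 2616 words it as "other than reserved and unsafe"), and
the function name is made up for the example.

```python
import re
from urllib.parse import urlsplit, urlunsplit

# RFC 3986 "unreserved" characters: safe to decode without changing meaning.
UNRESERVED = set(
    "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789-._~"
)
DEFAULT_PORTS = {"http": 80, "https": 443}
_PCT = re.compile(r"%([0-9A-Fa-f]{2})")

def _fix_escapes(text: str) -> str:
    """Decode %xx for unreserved characters; upper-case all other escapes."""
    def repl(match):
        ch = chr(int(match.group(1), 16))
        return ch if ch in UNRESERVED else "%" + match.group(1).upper()
    return _PCT.sub(repl, text)

def normalise(uri: str) -> str:
    parts = urlsplit(uri)
    scheme = parts.scheme.lower()
    host = (parts.hostname or "").lower()
    port = parts.port  # None when the port is absent or empty
    if port is None or port == DEFAULT_PORTS.get(scheme):
        netloc = host
    else:
        netloc = f"{host}:{port}"
    path = _fix_escapes(parts.path) or "/"
    query = _fix_escapes(parts.query)
    # Sketch only: userinfo and fragments are dropped.
    return urlunsplit((scheme, netloc, path, query, ""))
```

With this, the three URIs the RFC lists as equivalent (the ~smith/home.html
ones) come out identical, while something like %2F or %26 is left escaped
rather than being turned into "/" or "&".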

Also, as has been discussed on this list, many agents impose a maximum 
URI length, so another approach might be to use a long varchar field, 
let the DB do the hashing, limit the URI length, and not cache anything 
with an excessively long URI... though it then seems a bit arbitrary 
where you put the cutoff for what counts as an over-size URI.
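
The length check itself would be trivial, something like the Python below.
The 2048 figure is purely a placeholder I picked for the example, not a
limit taken from any spec or agent.

```python
MAX_URI_LENGTH = 2048  # hypothetical cutoff; the right value is the open question

def is_cacheable_uri(uri: str, max_length: int = MAX_URI_LENGTH) -> bool:
    """Return True if the URI is short enough to be stored as a cache key."""
    # Measure encoded bytes, since that is what a varchar column has to hold.
    return len(uri.encode("utf-8")) <= max_length
```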

Thanks

Adrien

> Some key-value databases hash internally, and some of them use B-trees
> or other data structures.  If it's on disk, a B-tree might be good
> because it'll preserve locality among similar URIs.
>
> I've used a multi-level multi-key tree structure, in order to handle
> Etags properly with different Vary on the same URI.
>
> Using an SQL database sounds like a way to make your cache
> unnecessarily slow, and not a good fit for the problem.
>
> Be careful when normalising that you don't convert %xx of any
> "sensitive" characters, as it can change the meaning of the URI.
> Since any escaped characters could be meaningfully distinct from its
> unescaped form to an application, that might mean don't convert any
> %xx at all.
>
> -- Jamie
>
>   
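
On the multi-level structure for handling Vary: the sketch below is only my
guess at the general idea (URI first, then the values of the request headers
named by Vary), written in Python.  The class and method names are invented
for the example and it is not Jamie's code.

```python
from typing import Dict, Iterable, Optional, Tuple

VariantKey = Tuple[Tuple[str, str], ...]

class VaryCache:
    """Two-level index: URI, then the request-header values named by Vary."""

    def __init__(self) -> None:
        # uri -> (sorted Vary header names, {variant key -> (etag, body)})
        self._entries: Dict[
            str, Tuple[Tuple[str, ...], Dict[VariantKey, Tuple[str, bytes]]]
        ] = {}

    @staticmethod
    def _variant_key(names: Tuple[str, ...], headers: Dict[str, str]) -> VariantKey:
        # Header names are case-insensitive; a missing header counts as "".
        lowered = {k.lower(): v for k, v in headers.items()}
        return tuple((n, lowered.get(n, "")) for n in names)

    def store(self, uri: str, vary: Iterable[str],
              request_headers: Dict[str, str], etag: str, body: bytes) -> None:
        names = tuple(sorted(n.lower() for n in vary))
        old_names, variants = self._entries.get(uri, (names, {}))
        if old_names != names:
            variants = {}  # Vary changed: old variants no longer comparable
        variants[self._variant_key(names, request_headers)] = (etag, body)
        self._entries[uri] = (names, variants)

    def lookup(self, uri: str,
               request_headers: Dict[str, str]) -> Optional[Tuple[str, bytes]]:
        entry = self._entries.get(uri)
        if entry is None:
            return None
        names, variants = entry
        return variants.get(self._variant_key(names, request_headers))
```

A request for the same URI with a different Accept-Encoding, say, then
selects a different stored variant (and ETag) instead of overwriting the
first one.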

-- 
Adrien de Croy - WinGate Proxy Server - http://www.wingate.com

Received on Friday, 22 May 2009 23:58:01 UTC