
Re: HTTP Spec: PUT without data transfer, since hash of data is known to server

From: Thomas Güttler <guettliml@thomas-guettler.de>
Date: Thu, 8 Oct 2015 19:47:22 +0200
To: w3c-dist-auth@w3.org
Message-ID: <5616AC2A.3050403@thomas-guettler.de>
Am 07.10.2015 um 16:26 schrieb Ed McClanahan:
> Hmm... HTTP PATCH sounds like a problem then. Imagine that a previous PUT
> of some other resource included said hash. A later PATCH modifies a portion
> of that old resource. In order to be able to reference the new content of
> that old resource, a new hash for the entire resource needs to be
> recalculated. Not very practical for small PATCHes to large resources...

Yes, a small PATCH to a big resource would result in a re-calculation
of the hash sum. This re-calculation would need to scan the whole resource,
although only a small part has changed. That's true.
But that's life; I see no problem there. At least in my environment PATCH is hardly used.
Mostly I see whole files being uploaded and downloaded.

> Still, it seems HTTP PATCH also provides an elegant solution. Using PATCH,
> the payload could be a simple "the data for my new resource has this hash"
> rather than the data itself. The HTTP server could accept or reject the
> PATCH request based upon whether or not it has seen this hash before. If
> rejected, the client just does the normal PUT with unique data anyway.

I am not sure I follow your thoughts.

Do you want to use PATCH to implement uploads without data transfer, or
do you want to use "sending data without transfer" for PATCH, too?

From the RFC:

 The PATCH method requests that a set of changes described in the
 request entity be applied to the resource identified by the Request-URI.

AFAIK you can only PATCH existing resources. My idea is to PUT new
resources. The same approach could be used for PATCH, but I would like to
handle that later.
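To make the PUT idea concrete, here is a hedged sketch of the flow I have in mind, with the server reduced to an in-memory object. All names here (Server, put_by_hash, put_full, the SHA-256 choice) are illustrative assumptions, not part of any existing HTTP mechanism: the client first offers only the hash, and falls back to a normal PUT with the full body if the server does not know the data yet.

```python
import hashlib

class Server:
    """Toy stand-in for an HTTP server with content-addressed storage."""
    def __init__(self):
        self.store = {}        # hash -> data (deduplicated blobs)
        self.resources = {}    # path -> hash (what each URI points to)

    def put_by_hash(self, path, digest):
        """Hash-only PUT: succeeds only if the data is already known."""
        if digest in self.store:
            self.resources[path] = digest
            return True
        return False           # client must send the data itself

    def put_full(self, path, data):
        """Normal PUT with the full request body."""
        digest = hashlib.sha256(data).hexdigest()
        self.store[digest] = data
        self.resources[path] = digest

def upload(server, path, data):
    digest = hashlib.sha256(data).hexdigest()
    if not server.put_by_hash(path, digest):
        server.put_full(path, data)   # fall back to a regular upload

srv = Server()
upload(srv, "/a.txt", b"hello")  # first upload transfers the body
upload(srv, "/b.txt", b"hello")  # second one needs only the hash
```

Note that the second upload stores nothing new: both URIs end up pointing at the same blob, which is the whole point of the proposal.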

> Going further, some sort of rsync like HTTP PATCH payload could be used
> where blocks of the resource to be loaded are individually hashed. The
> PATCH response could be "OK, I have these blocks but not those". A
> subsequent PATCH could upload only those blocks that contain new data.

I would like to keep it simple during the first step and focus on whole-file uploads only.

> I would like to add that hashes aren't perfect - most notably MD5. False
> positives would seemingly be a problem. Some scheme might be needed to be
> able to detect false positives.

Yes, I know. Client and server need to agree on a hash method somehow.
If both want MD5, they can use it, but I would not offer it if I were
writing a server.

> Finally, there is definitely a security question. The best example of it
> was once described to me this way:
> 
> 1) I work at a company that archives the form letters containing all job
> offers differing only by the employee's name and salary.
> 
> 2) I want to know John Smith's salary (i.e. I know his name but not his
> salary).
> 
> 3) I compose a series of form letter offers each with John Smith's name but
> with varying salaries.
> 
> 4) I try this dedupe-able PUT/PATCH operation for each such offer letter.
> 
> 5) My HTTP client reports which one is dedupe-able.
> 
> The result of #5 reveals John Smith's salary. Oops!


Yes, that's a security concern.

This could be a solution: if the data with the same hash value comes
from a different area (e.g. a different user), the server answers "I have
the data for this hash sum" only if that data was uploaded twice or more.
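A minimal sketch of that mitigation, with invented names (DedupStore, threshold) chosen purely for illustration: the server counts full uploads per hash and only confirms deduplication once the same data has arrived independently at least twice, so a single private document cannot be probed by guessing.

```python
import hashlib

class DedupStore:
    """Toy server-side store that hides hashes of singleton uploads."""
    def __init__(self, threshold=2):
        self.counts = {}          # hash -> number of full uploads seen
        self.threshold = threshold

    def full_upload(self, data):
        """Record a complete (body-carrying) upload of this data."""
        digest = hashlib.sha256(data).hexdigest()
        self.counts[digest] = self.counts.get(digest, 0) + 1

    def is_dedupable(self, digest):
        """Confirm the hash only once enough independent copies exist."""
        return self.counts.get(digest, 0) >= self.threshold

store = DedupStore()
secret = b"offer letter: John Smith, salary ..."
store.full_upload(secret)
digest = hashlib.sha256(secret).hexdigest()
store.is_dedupable(digest)   # False: one copy only, probing learns nothing
store.full_upload(secret)
store.is_dedupable(digest)   # True once a second copy has been uploaded
```

With a threshold of two, the attacker in the salary example learns nothing from their guessed letters, because each guess exists on the server at most once.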

I can't answer next week.

I was told this list is the wrong one, since my topic is about HTTP,
not WebDAV.

I will write to the HTTP list in the week of 19 October.

I hope to see/read you there.

Thank you for reading and your interest in this topic.

Regards,
  Thomas Güttler


-- 
http://www.thomas-guettler.de/
Received on Thursday, 8 October 2015 17:47:50 UTC
