Re: Transmitting deltas [was Re: Efficient RDF Interchange, Re: Zippy]

On 2/18/14, 8:00 PM, David Booth wrote:
> On 02/18/2014 01:20 PM, Stephen Williams wrote:
>> [ . . . ]
>> Has anyone been working on compact, efficient binary representation of
>> RDF/N-Quads or similar?  Chunking / deltas?
>> Does anyone want to work on these problems?
>
> I am interested in transmitting deltas, though so far I have only been casually thinking about the problem and looking around a 
> little.  I would be most interested in a solution that is parameterized by the delta algorithm, so that it could be used with any 
> data and any delta algorithm -- not just RDF.

There is a tradeoff between processing time/memory and generality. There are a number of algorithms that compute deltas over
arbitrary bytes, but they can't take advantage of the scanning and encoding shortcuts that are possible with a carefully selected
infoset encoding.  Additionally, for certain types of data / encoding, especially bit-packed data like EXI or the output of certain
compression methods, any change can change every following bit.
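
A small illustration of that last point, as a sketch rather than anything EXI-specific: the same list of integers is packed once with one byte per value and once with a variable-length bit code (Elias gamma, chosen here only because it is short to write; EXI itself uses different codes). The function names and sample values are made up for the example.

```python
def gamma(n: int) -> str:
    """Elias gamma code for n >= 1: (len - 1) zeros, then n in binary."""
    b = bin(n)[2:]
    return "0" * (len(b) - 1) + b

def pack_bits(values) -> bytes:
    """Concatenate the variable-length codes and pad to whole bytes."""
    bits = "".join(gamma(v) for v in values)
    bits += "0" * (-len(bits) % 8)
    return bytes(int(bits[i:i + 8], 2) for i in range(0, len(bits), 8))

def pack_fixed(values) -> bytes:
    """One byte per value, so a changed value touches exactly one byte."""
    return bytes(values)

def differing_bytes(a: bytes, b: bytes) -> int:
    """Count positions that differ, plus any difference in length."""
    return sum(x != y for x, y in zip(a, b)) + abs(len(a) - len(b))

old = [3, 1, 4, 1, 5, 9, 2, 6, 5, 3, 5, 8, 9, 7, 9, 3]
new = list(old)
new[1] = 2  # one small change near the start; its code grows from 1 bit to 3

fixed_diff = differing_bytes(pack_fixed(old), pack_fixed(new))
packed_diff = differing_bytes(pack_bits(old), pack_bits(new))
print("fixed-width bytes changed:", fixed_diff)
print("bit-packed bytes changed: ", packed_diff)
```

In the fixed-width form the change stays local. In the bit-packed form everything after the changed value is shifted by two bits, so a byte-oriented delta algorithm sees a change that runs to the end of the data unless it understands the encoding.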

>
> FYI, in 2002 RFC 3229 attempted to address the problem of transmitting deltas, but to my mind not very satisfactorily, partly 
> because it seemed too complicated and partly because it required a new HTTP header:
> http://tools.ietf.org/html/rfc3229
> AFAICT others also have not found it satisfactory, because I have not seen any uptake.

Yes, that is a well thought out general solution.

>
> FWIW, in thinking about the problem, one way I considered approaching the problem was to use the HTTP Content-Encoding header to 
> indicate a delta-encoding.  But one issue is that the ETag is computed *after* the Content-Encoding is applied, and hence is only 
> an ETag for the delta -- not for the original content.   I do not want to lose the ability to receive the ETag for the original 
> content.  Hence, the original ETag would have to be somehow bundled with the delta.  Also, I don't know if an approach based on 
> the Content-Encoding header would work well in terms of current HTTP library implementations.  Maybe someone who is more 
> knowledgeable about HTTP libraries would know.

I'm much more interested in the data than the way the HTTP protocol is used, but it would be great to come to a consensus on that.
The ETag / If-* conditional request approach isn't bad for some situations.  Referring to the base ETag when returning a delta
would be clear.
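
To make the "refer to the base ETag" idea concrete, here is a minimal in-memory sketch, not a real HTTP server. The class name, the hash-based ETag scheme, and the use of a unified text diff as the delta format are all assumptions made for illustration. The 226 status and the Delta-Base header name follow RFC 3229 as far as I can tell. The ETag on the delta response names the full current content, not the delta, so the client does not lose it.

```python
import difflib
import hashlib

def etag_of(body: str) -> str:
    """Content-derived ETag (an assumption; servers pick their own scheme)."""
    return '"' + hashlib.sha256(body.encode("utf-8")).hexdigest()[:16] + '"'

class DeltaServer:
    """Toy resource that keeps old versions so it can diff against them."""

    def __init__(self, body: str):
        self.versions = {}   # etag -> body, for every version served so far
        self.current = None
        self.update(body)

    def update(self, body: str) -> None:
        self.current = etag_of(body)
        self.versions[self.current] = body

    def get(self, if_none_match=None):
        """Return (status, headers, body) for a conditional GET."""
        latest = self.versions[self.current]
        if if_none_match == self.current:
            return 304, {"ETag": self.current}, ""
        base = self.versions.get(if_none_match)
        if base is None:
            # Unknown or missing base: fall back to the full representation.
            return 200, {"ETag": self.current}, latest
        delta = "".join(difflib.unified_diff(
            base.splitlines(keepends=True),
            latest.splitlines(keepends=True)))
        # ETag identifies the full current content; Delta-Base names the base.
        headers = {"ETag": self.current, "Delta-Base": if_none_match}
        return 226, headers, delta

server = DeltaServer("<a> <b> <c> .\n")
_, h1, _ = server.get()
server.update("<a> <b> <c> .\n<a> <b> <d> .\n")
status, h2, delta = server.get(if_none_match=h1["ETag"])
print(status, h2)
print(delta)
```

The server has to retain (or be able to reconstruct) every base version it is willing to diff against, which is one of the real costs of this approach.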

> Another possible approach that came to mind is to use the HTTP PATCH method: http://tools.ietf.org/html/rfc5789
> That's normally for a request, not a response.  Still, I was thinking that it might be possible to adapt it to work for responses.

It is definitely odd that HTTP PATCH only solves the request/put problem while apparently doing nothing to help responses.
There are a number of circumstances where each side ends up sending information that is largely redundant when simplistic methods
are used.  The Editing document is a good start, but its view is much narrower than the breadth of real-world situations and problems.
http://www.w3.org/1999/04/Editing/

These are the main update cases that come to mind for me:

 1. Publishing a database/knowledgebase that receives various updates but is essentially read-only, or read-only for many
    clients: Wikipedia, etc.
 2. Updating state / objects with possibly interleaved "competitors".  Sometimes you want to fail if the state changed; other
    times it is far better to apply some delta, like increment or add.  However, many of those cases can be turned into
    append-and-compute transactions, such as a bank account: you don't need to update a total, because the total is a summary
    computation.
 3. Systems with complex semantics: these can't reasonably support a direct equivalent of 'increment', so they fall back to
    'optimistic locking' (no reservation, test/set/retry) or use a smart server that can manage the semantics.
 4. Merging changes similar to git, supporting text or a type of subtree / object / field / value update with some type of
    conflict-free boundaries.  SCM systems assume non-overlapping == independence.  Other options might be dependency trees or
    other chunky scope.
 5. Realtime collaborative editing of documents, spreadsheets, chat, etc.
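
Cases 2 and 3 can be sketched side by side. This is a single-process toy with hypothetical class and function names, not a statement about any particular system: the first half is test/set/retry on a version number, the second is an append-only ledger whose balance is computed rather than updated.

```python
class VersionedStore:
    """Case 3 style: optimistic locking via test/set/retry on a version."""

    def __init__(self, value):
        self.value, self.version = value, 0

    def read(self):
        return self.value, self.version

    def compare_and_set(self, expected_version, new_value) -> bool:
        if expected_version != self.version:
            return False   # a competitor updated first; the caller must retry
        self.value, self.version = new_value, self.version + 1
        return True

def update_with_retry(store, fn, max_tries=5) -> bool:
    """Re-read and re-apply fn until the conditional write succeeds."""
    for _ in range(max_tries):
        value, version = store.read()
        if store.compare_and_set(version, fn(value)):
            return True
    return False

class Ledger:
    """Case 2 style: append transactions; the balance is a summary."""

    def __init__(self):
        self.entries = []

    def append(self, amount) -> None:
        self.entries.append(amount)   # appends never conflict with each other

    def balance(self):
        return sum(self.entries)

store = VersionedStore(100)
_, stale = store.read()                            # a slow client reads
update_with_retry(store, lambda v: v + 10)         # a competitor updates
stale_write_ok = store.compare_and_set(stale, 0)   # the slow client is rejected

ledger = Ledger()
for amount in (100, 10, -30):
    ledger.append(amount)
print(store.read(), stale_write_ok, ledger.balance())
```

The conditional write plays the same role as an If-Match request on an ETag, while the ledger needs no precondition at all because no client ever overwrites another client's data.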


>
> Also, I have not yet found any standard "diff" media type except for JSON.  See MNot's blog post:
> http://www.mnot.net/blog/2012/09/05/patch
> Does anyone know of any?
>
> Here's a PATCH media type for XML:
> http://tools.ietf.org/html/draft-wilde-xml-patch-08

This is exactly the kind of thing that I defined for EXI, only with patches to encoded data.

>
> David Booth

sdw

-- 
Stephen D. Williams sdw@lig.net stephendwilliams@gmail.com LinkedIn: http://sdw.st/in
V:650-450-UNIX (8649) V:866.SDW.UNIX V:703.371.9362 F:703.995.0407
AIM:sdw Skype:StephenDWilliams Yahoo:sdwlignet Resume: http://sdw.st/gres
Personal: http://sdw.st facebook.com/sdwlig twitter.com/scienteer

Received on Wednesday, 19 February 2014 09:34:18 UTC