Re: lossless paging & volunteering for the army from Sandro Hawke on 2014-02-21 (public-ldp@w3.org from February 2014)

From: Sandro Hawke <sandro@w3.org>
Date: Fri, 21 Feb 2014 06:54:09 -0500
To: "henry.story@bblfish.net" <henry.story@bblfish.net>
CC: Steve K Speicher <sspeiche@gmail.com>,public-ldp <public-ldp@w3.org>
Message-ID: <03e42f55-12f3-4e93-bac6-ada9b612f14c@email.android.com>
On February 21, 2014 4:24:57 AM EST, "henry.story@bblfish.net" <henry.story@bblfish.net> wrote:
>I was thinking that there is an additional reason to have lossless
>paging
>which is best explained with the "Volunteering for the army" counter
>use-case. [1]
>
>Given the recent resolution to name ldp:Container what was previously
>known
>as the ldp:IndirectContainer [2] a client doing a GET on an LDPC will
>not know
>if it can POST to the container without the danger of that action
>volunteering 
>its owner to the army -- unless it has read through all pages of the
>container, which
>may be close to impossible as argued by Sandro.
>
>For example a GET on a container may return
>
>[[
><> a ldp:Container;
>
><?firstPage>
>   a ldp-paging:Page;
>   ldp-paging:pageOf <>;
>   ldp-paging:containerSortCriteria (<#SortValueAscending>).
>]]
>
>   which would lead through  pages and pages, and finally somewhere 
>in the middle you would get
>
>[[
> <>
>   ldp:membershipResource <http://army.us/#usf>;
>   ldp:hasMemberRelation army:hasVolunteer;
>   ldp:insertedContentRelation dc:author .
>]]
>
>The advantage of Sandro's paging proposal which has a notion of stable
>paging, 
>is that one could perhaps specify in such a paging proposal that the
>first page be the
>one that contain all the information about the LDPC such as the 
>ldp:hasMemberRelation, ...
>
>Otherwise without that should one not for simple clients at least allow
>the header to contain a rel=type ldp:SimpleContainer, so that a client
>could be sure that it need not worry about what the consequences of its
>POSTing to the container are beyone the publication of the document?
>

I've been fretting over this, too.    The problem with the solution you state here is that it doesn't generalize to arbitrary graphs.    I'd really really like a paging solution that does.    The sorting I proposed isn't too bad; it would mean the members configuration triples would be either at the beginning or the ending of the series of pages, never in the middle.   But that's still kind of lame.

Maybe a simpler solution is to have something like rel=meta from the container to an ldp-rs which only contains the most important stuff.   Not sure if we can precisely define that, but for containers maybe it would just be defined as the empty-container triples.    To me that seems a little cleaner than what's in the spec now using prefer, since prefer isn't guaranteed to even get to the server.

    - Sandro



>
>Henry
>
>[1] http://lists.w3.org/Archives/Public/public-ldp-wg/2013Nov/0022.html
>[2] http://www.w3.org/2013/meeting/ldp/2014-02-17#line0173
>
>
>On 20 Feb 2014, at 02:39, Sandro Hawke <sandro@w3.org> wrote:
>
>> On 02/19/2014 05:49 PM, Steve Speicher wrote:
>>> 
>>> 
>>> 
>>> On Sun, Feb 16, 2014 at 2:49 PM, Sandro Hawke <sandro@w3.org> wrote:
>>> Sorry I didn't propose this months ago.   It's taken me some time to
>see this solution clearly.   Thanks to Eric and Andrei for being
>sounding boards.   They haven't seen this particular framing, though.
>>> 
>>> This is my proposal for "stable paging", a way to prevent the "lossy
>paging" problem in the currect spec and highlighted in RFC-5005.  I
>believe it can be                 implemented efficiently, and isn't
>prohibitively complicated.  I'd like to see it in the LC draft, marked
>"at risk" in case it turns out to be harder to implement               
> than I think.
>>> 
>>> static paging
>>> 
>>> First, let me define something simpler, which I'll call "static
>paging".  With static paging, the server embeds the etag of the paged
>resource into the paging URLs.  When it responds to the paging URLs, it
>checks to make sure the etag is still the same.  If it's not, then the
>server responds with a 410 GONE.  Now the client knows it needs to
>re-initiate paging.
>>> 
>>> With static paging, the client gets a perfect view of the resource,
>with no loss, no corruption.  And it's trivial to implement.  The
>problem is that resources                 that are large enough to
>require paging are likely to change frequently, so a client would be
>unable to follow PREV/NEXT links very many hops before the resource
>changed and the client had to start again.  It might be it could never
>get more than 10 pages in before it had to start again.  That doesn't
>seem acceptable.
>>> 
>>> stable paging definition
>>> 
>>> With "stable paging" I propose a looser requirement.  We allow the
>graph to change during paging, but each triple which could
>theoretically ever be in the graph is assigned to a particular page.
>This results in certain guarantees for the client.  Specifically, if
>the client traverses all the pages with stable paging, (1) every triple
>it sees must have been in the graph at some point during the traversal,
>and (2) every triple it does not see must have been absent from the
>graph at some point during the traversal.  In other words, the error is
>limited to exactly those triples that changed during the traversal. 
>(With lossy paging, the client may miss triples that were always in the
>graph.)
>>> 
>>> To be very clear, what you mean by "exactly those triples that
>changed" meaning only those triples whose object position has changed
>(subject and predicate have not changed).  So it could be a delete +
>insert of a triple with same subject and predicate is also a "triple
>that changed" and not a "delete, then an insert", correct?  I guess you
>could argue there is no difference from the client's perspective.
>>>  
>> 
>> In all of this, I'm thinking of general RDF graphs, so I'm not
>treating any position in the triple as special.
>> 
>> At the point paging starts, the paged resource is an RDF graph
>containing the set of triples G0.
>> 
>> As time goes on, and the client slowly flips through the pages, the
>paged resource might happen to be updated.  After the first change, it
>contains the set of triples G1.  Then, after another change, it
>contains the set of triples G2.   Etc, up to Gn, when the client does
>its last page GET.    (Of course, the server doesn't know this is the
>last one, unless it decides to cut the client off and start replying
>410 GONE.)
>> 
>> Let's define union(Gi, Gj, ...) and intersection(Gi, Gj, ...) to be
>the set union and set intersection, respectively, of the set of triples
>in each given graph.
>> 
>> Now, we can say: to do lossless paging, the server MUST construct the
>pages such that the union of all the pages is a superset of
>intersection(G0, ... Gn) and a subset of union (G0, ... Gn).
>> 
>> In other words, the client will see every triple that is there the
>whole time the traversal is happening, and it MAY see triples that are
>there part of the time the traversal is happening.
>> 
>> Again, I think this is fairly intuitive behavior -- if triples are
>added while you're reading, it shouldn't surprise you that you might
>ore might not see them (depending exactly when it happened, and where
>you were in your reading).
>> 
>> This isn't nice as static paging, where the client simply sees G0,
>but it's much cheaper.   Compared to static paging, lossless paging can
>produce inconsistent views of the data, as some triples of a structure
>are seen but not others.   With paging containers, though, since each
>triple is independent, that's not a problem.
>> 
>> Contrast this with lossy paging, as ATOM, where if triples are added
>while you're reading, COMPLETELY UNRELATED TRIPLES might be omitted
>from what you see.    The guarantee there is just that what you'll see
>is a subset of union (G0, ... Gn).   But it could be any random subset.
>> 
>> BTW, instead of "single-page resource", I suggest the term "subset
>page" or "subset page resource" or "subset resource".   To me a
>"single-page resource" sounds like a resource that doesn't happen to be
>paged.
>> 
>>> 
>>> directory example 
>>> 
>>> Let's consider the common case: we're paging an LDPC which has a
>sort order.  (In fact, I expect this will be the only case where paging
>will be implemented by most LDP servers.  When else would it ever
>actually be useful?)  Maybe it's a directory of employees, sorted by
>the foaf:name of each person.  The FIRST page has the folks at the
>beginning of the alphabet; the LAST page has the folks at the end.
>>> 
>>> It's a big directory, and people are being added and removed all the
>time.
>>> 
>>> With lossy paging, and a server that naively just puts 100 people
>per page, if a name were deleted during forward paging or added during
>reverse paging, the client would simply, silently miss a person.
>Displayed to a user this might be okay, but if the client is the
>payroll system, or an acl system, or a spam filter lookup, or an
>emergency alert system, missing even a single person is inexcusable.
>Static paging might be nice, but isn't practical.  But stable paging is
>pretty good.  With stable paging we might still miss a person, but only
>if that person were added or deleted during paging.  That seems like
>perfectly reasonable behavior, about what people would expect.   
>>> 
>>> I guess I'm confused by a couple of the statements here, namely: 
>>> "With stable paging we might still miss a person, but only if that
>person were added or deleted during paging"
>>> 
>>> Isn't that the same argument you used why lossy paging is bad? 
>>> "if a name were deleted during forward paging or added during
>reverse paging, the client would simply, silently miss a person."
>>> 
>>> Perhaps your point was more on the silent failure part (more on that
>below).  I believe you meant with stable paging it will not miss a
>person but I wanted to make sure.
>>> 
>> 
>> Hopefully I made that clear above.  
>> 
>>>  
>>> So, how does stable paging work?
>>> 
>>> Instead of just putting 100 names per page, with stable paging the
>server creates a function mapping names to pages independent of what
>else is in the graph.  It might be the first page has names starting
>with A, the second has names starting with B, etc.  Or it might be the
>first is "" through "Anderson", the second is "Anderson" through
>"Aubrey", etc.  The paging function can be created by looking at the
>database, but once a page has been served to a client and certain
>triples included/excluded from it because of some boundary test, that
>boundary MUST NOT change.
>>> 
>>> Each paging URL then includes an identifier for that paging
>function. As with static paging, the server can, at any time, give up
>on a particular paging function and answer 410 GONE for those URLs. 
>With the directory, I imagine one would only do that when the directory
>changes size significantly.  I can also imagine one would have two
>paging functions, one for big pages and one for small pages, and the
>server would pick one based on a client PREFER header.
>>> 
>>> So after the ("Anderson" < page <= "Aubrey") is served to the
>client, say "Arthur" is inserted.  My client (perhaps after waiting for
>some period) requests ("Aubrey" < page <= "Brook"), the server now will
>have to replay the paging function to the entire paged resource and see
>if the page sizes are consistent with the boundaries defined (i.e.
>nothing has changed).  The "Arthur" change is detected, then 410 (or
>some indicator) is sent.  This right?   So there is a decent cost
>involved with determining for each page, if paging needs to be
>reinitiated.
>> 
>> In the implementation I was proposing here, I was assuming page sizes
>didn't need to be exact.   If a few adds or deletes happen, the page
>sizes will be off a bit, and that will be okay.
>> 
>> I think the implementation I mentioned earlier today [1], where you
>put the boundary values in the URL, is better than this one in ever
>case.   So, no need to worry about the logic here.   The core of my
>proposal is the definition of lossless paging above, and that we
>require it of LDP servers if they're going to do paging.
>> 
>>        -- Sandro
>> 
>> [1]
>http://lists.w3.org/Archives/Public/public-ldp-wg/2014Feb/0058.html
>> 
>>> 
>>> (apologies for replying in reverse order on these, I should have
>comprehended this first).
>>> 
>>> - Steve Speicher
>>> 
>>> 
>>> microblogging example
>>> 
>>> In the application Andrei and I are working on (microblogging), the
>server will have a container of postings on each "channel".  It will
>keep this sorted by the date of posting, so the LAST page will be the
>most recent postings.  It's fine if some system posts every 10 seconds
>(3 million posts per year) because the clients just jump to the LAST
>page, and maybe follow a few PREV links, depending how far back they
>want to go.  Lossy paging would result in postings not being shown to
>some people in some circumstance, which is likely to be unacceptable. 
>For us, the paging function will just be constructed by partitioning
>the range of dates present, and re-partitioning if a lot of postings
>are inserted into the middle or deleted.  In the normal case of
>postings being added "now", we're just adding to the end, so no
>re-partitioning is needed.
>>> 
>>> conclusion
>>> 
>>> I know this is a little complicated, but without it, it seems to me
>paging is worthless, or even harmful, at least for the applications I'm
>thinking of.  I'm willing to help draft text for the spec if people
>think this is reasonable to include.
>>> 
>>>       -- Sandro
>>> 
>>> 
>> 
>
>Social Web Architect
>http://bblfish.net/
Received on Friday, 21 February 2014 11:54:20 UTC