Re: stable paging from Steve Speicher on 2014-02-19 (public-ldp@w3.org from February 2014)

From: Steve Speicher <sspeiche@gmail.com>
Date: Wed, 19 Feb 2014 17:49:49 -0500
To: Sandro Hawke <sandro@w3.org>
Cc: public-ldp <public-ldp@w3.org>
Message-ID: <CAOUJ7Jq2dfxy1oBCmfbSo9+HJHbxnv1ckibe8nKfG=MwsPOxEA@mail.gmail.com>
On Sun, Feb 16, 2014 at 2:49 PM, Sandro Hawke <sandro@w3.org> wrote:

>  Sorry I didn't propose this months ago.   It's taken me some time to see
> this solution clearly.   Thanks to Eric and Andrei for being sounding
> boards.   They haven't seen this particular framing, though.
>
> This is my proposal for "stable paging", a way to prevent the "lossy
> paging" problem in the currect spec and highlighted in RFC-5005.  I believe
> it can be implemented efficiently, and isn't prohibitively complicated.
> I'd like to see it in the LC draft, marked "at risk" in case it turns out
> to be harder to implement than I think.
>
> *static paging*
>
> First, let me define something simpler, which I'll call "static paging".
> With static paging, the server embeds the etag of the paged resource into
> the paging URLs.  When it responds to the paging URLs, it checks to make
> sure the etag is still the same.  If it's not, then the server responds
> with a 410 GONE.  Now the client knows it needs to re-initiate paging.
>
> With static paging, the client gets a perfect view of the resource, with
> no loss, no corruption.  And it's trivial to implement.  The problem is
> that resources that are large enough to require paging are likely to change
> frequently, so a client would be unable to follow PREV/NEXT links very many
> hops before the resource changed and the client had to start again.  It
> might be it could never get more than 10 pages in before it had to start
> again.  That doesn't seem acceptable.
>
> *stable paging definition*
>
> With "stable paging" I propose a looser requirement.  We allow the graph
> to change during paging, but each triple which could theoretically ever be
> in the graph is assigned to a particular page. This results in certain
> guarantees for the client.  Specifically, if the client traverses all the
> pages with stable paging, (1) every triple it sees must have been in the
> graph at some point during the traversal, and (2) every triple it does not
> see must have been absent from the graph at some point during the
> traversal.  In other words, the error is limited to exactly those triples
> that changed during the traversal.  (With lossy paging, the client may miss
> triples that were always in the graph.)
>

To be very clear, what you mean by "exactly those triples that changed"
meaning only those triples whose object position has changed (subject and
predicate have not changed).  So it could be a delete + insert of a triple
with same subject and predicate is also a "triple that changed" and not a
"delete, then an insert", correct?  I guess you could argue there is no
difference from the client's perspective.


>
> *directory example *
>
> Let's consider the common case: we're paging an LDPC which has a sort
> order.  (In fact, I expect this will be the only case where paging will be
> implemented by most LDP servers.  When else would it ever actually be
> useful?)  Maybe it's a directory of employees, sorted by the foaf:name of
> each person.  The FIRST page has the folks at the beginning of the
> alphabet; the LAST page has the folks at the end.
>
> It's a big directory, and people are being added and removed all the time.
>
> With lossy paging, and a server that naively just puts 100 people per
> page, if a name were deleted during forward paging or added during reverse
> paging, the client would simply, silently miss a person. Displayed to a
> user this might be okay, but if the client is the payroll system, or an acl
> system, or a spam filter lookup, or an emergency alert system, missing even
> a single person is inexcusable. Static paging might be nice, but isn't
> practical.  But stable paging is pretty good.  With stable paging we might
> still miss a person, but only if that person were added or deleted during
> paging.  That seems like perfectly reasonable behavior, about what people
> would expect.
>
> I guess I'm confused by a couple of the statements here, namely:
"With stable paging we might still miss a person, but only if that person
were added or deleted during paging"

Isn't that the same argument you used why lossy paging is bad?
"if a name were deleted during forward paging or added during reverse
paging, the client would simply, silently miss a person."

Perhaps your point was more on the silent failure part (more on that
below).  I believe you meant with stable paging it will not miss a person
but I wanted to make sure.



> So, how does stable paging work?
>
> Instead of just putting 100 names per page, with stable paging the server
> creates a function mapping names to pages independent of what else is in
> the graph.  It might be the first page has names starting with A, the
> second has names starting with B, etc.  Or it might be the first is ""
> through "Anderson", the second is "Anderson" through "Aubrey", etc.  The
> paging function can be created by looking at the database, but once a page
> has been served to a client and certain triples included/excluded from it
> because of some boundary test, that boundary MUST NOT change.
>
> Each paging URL then includes an identifier for that paging function. As
> with static paging, the server can, at any time, give up on a particular
> paging function and answer 410 GONE for those URLs.  With the directory, I
> imagine one would only do that when the directory changes size
> significantly.  I can also imagine one would have two paging functions, one
> for big pages and one for small pages, and the server would pick one based
> on a client PREFER header.
>

So after the ("Anderson" < page <= "Aubrey") is served to the client, say
"Arthur" is inserted.  My client (perhaps after waiting for some period)
requests ("Aubrey" < page <= "Brook"), the server now will have to replay
the paging function to the entire paged resource and see if the page sizes
are consistent with the boundaries defined (i.e. nothing has changed).  The
"Arthur" change is detected, then 410 (or some indicator) is sent.  This
right?   So there is a decent cost involved with determining for each page,
if paging needs to be reinitiated.

(apologies for replying in reverse order on these, I should have
comprehended this first).

- Steve Speicher


> *microblogging example*
>
> In the application Andrei and I are working on (microblogging), the server
> will have a container of postings on each "channel".  It will keep this
> sorted by the date of posting, so the LAST page will be the most recent
> postings.  It's fine if some system posts every 10 seconds (3 million posts
> per year) because the clients just jump to the LAST page, and maybe follow
> a few PREV links, depending how far back they want to go.  Lossy paging
> would result in postings not being shown to some people in some
> circumstance, which is likely to be unacceptable.  For us, the paging
> function will just be constructed by partitioning the range of dates
> present, and re-partitioning if a lot of postings are inserted into the
> middle or deleted.  In the normal case of postings being added "now", we're
> just adding to the end, so no re-partitioning is needed.
>

> *conclusion*
>
> I know this is a little complicated, but without it, it seems to me paging
> is worthless, or even harmful, at least for the applications I'm thinking
> of.  I'm willing to help draft text for the spec if people think this is
> reasonable to include.
>
>       -- Sandro
>
>
Received on Wednesday, 19 February 2014 22:50:17 UTC