stable paging

From: Sandro Hawke <sandro@w3.org>
Date: Sun, 16 Feb 2014 14:50:50 -0500
Message-ID: <5301169A.60402@w3.org>
To: Linked Data Platform WG <public-ldp-wg@w3.org>

Sorry I didn't propose this months ago.   It's taken me some time to see 
this solution clearly.   Thanks to Eric and Andrei for being sounding 
boards.   They haven't seen this particular framing, though.

This is my proposal for "stable paging", a way to prevent the "lossy 
paging" problem that exists in the current spec and is highlighted in 
RFC 5005.  I believe it can be implemented efficiently, and isn't 
prohibitively complicated.  I'd like to see it in the LC draft, marked 
"at risk" in case it turns out to be harder to implement than I think.

*static paging*

First, let me define something simpler, which I'll call "static 
paging".  With static paging, the server embeds the ETag of the paged 
resource into the paging URLs.  When it responds to the paging URLs, it 
checks to make sure the ETag is still the same.  If it's not, then the 
server responds with a 410 GONE.  Now the client knows it needs to 
re-initiate paging.

With static paging, the client gets a perfect view of the resource, 
with no loss and no corruption.  And it's trivial to implement.  The 
problem is that resources large enough to require paging are likely to 
change frequently, so a client would be unable to follow PREV/NEXT 
links very many hops before the resource changed and it had to start 
again.  It might never get more than 10 pages in before having to 
restart.  That doesn't seem acceptable.
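A minimal sketch of the static-paging check, using a toy in-memory 
store.  The names here (Store, serve_page) and the ETag values are 
illustrative, not from any LDP implementation:

```python
class Store:
    """Toy in-memory store: one resource's items, plus an ETag that
    changes whenever the items change (illustrative only)."""
    def __init__(self, items, page_size=2):
        self.items, self.page_size, self.etag = list(items), page_size, "v1"

    def current_etag(self, resource):
        return self.etag

    def page_for(self, resource, n):
        i = n * self.page_size
        return self.items[i:i + self.page_size]

    def update(self, items):
        self.items, self.etag = list(items), "v2"


def serve_page(store, resource, etag_in_url, page_number):
    """Static paging: refuse to serve a page whose URL embeds a stale ETag."""
    if store.current_etag(resource) != etag_in_url:
        return 410, None          # 410 GONE: client must re-initiate paging
    return 200, store.page_for(resource, page_number)
```

The key point is that the check costs one ETag comparison per request; 
the price is that any change at all invalidates every outstanding 
paging URL.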

*stable paging definition*

With "stable paging" I propose a looser requirement.  We allow the graph 
to change during paging, but each triple which could theoretically ever 
be in the graph is assigned to a particular page. This results in 
certain guarantees for the client.  Specifically, if the client 
traverses all the pages with stable paging, (1) every triple it sees 
must have been in the graph at some point during the traversal, and (2) 
every triple it does not see must have been absent from the graph at 
some point during the traversal.  In other words, the error is limited 
to exactly those triples that changed during the traversal.  (With lossy 
paging, the client may miss triples that were always in the graph.)

*directory example*

Let's consider the common case: we're paging an LDPC which has a sort 
order.  (In fact, I expect this will be the only case where paging will 
be implemented by most LDP servers.  When else would it ever actually be 
useful?)  Maybe it's a directory of employees, sorted by the foaf:name 
of each person.  The FIRST page has the folks at the beginning of the 
alphabet; the LAST page has the folks at the end.

It's a big directory, and people are being added and removed all the time.

With lossy paging, and a server that naively puts 100 people per page, 
if a name were deleted during forward paging or added during reverse 
paging, the client would simply, silently miss a person.  Displayed to 
a user, this might be okay, but if the client is a payroll system, an 
ACL system, a spam-filter lookup, or an emergency alert system, missing 
even a single person is inexcusable.  Static paging might be nice, but 
isn't practical.  Stable paging, though, is pretty good: we might still 
miss a person, but only if that person were added or deleted during 
paging.  That seems like perfectly reasonable behavior, about what 
people would expect.

So, how does stable paging work?

Instead of just putting 100 names per page, with stable paging the 
server creates a function mapping names to pages, independent of what 
else is in the graph.  It might be that the first page has names 
starting with A, the second has names starting with B, and so on.  Or 
the first might be "" through "Anderson", the second "Anderson" through 
"Aubrey", etc.  The paging function can be created by looking at the 
database, but once a page has been served to a client, with certain 
triples included or excluded because of some boundary test, that 
boundary MUST NOT change.
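Such a paging function might be sketched as a frozen, sorted list of 
boundary names plus a binary search.  The boundary values below are 
illustrative; the point is only that they never change once served:

```python
import bisect

# Frozen page boundaries, perhaps chosen by sampling the database once.
# With bisect_right, page 0 holds names sorting strictly below
# "Anderson"; "Anderson" itself starts page 1, and so on.
BOUNDARIES = ["Anderson", "Aubrey", "Baker"]

def page_of(name):
    """Map a foaf:name to its page number under the frozen boundaries.
    The answer depends only on the name, never on the graph's current
    contents -- which is what makes the paging stable."""
    return bisect.bisect_right(BOUNDARIES, name)
```

A page may grow or shrink as the directory changes, but a given name 
can never silently migrate to a different page mid-traversal.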

Each paging URL then includes an identifier for that paging function.  
As with static paging, the server can, at any time, give up on a 
particular paging function and answer 410 GONE for those URLs.  With 
the directory, I imagine one would only do that when the directory 
changes size significantly.  I can also imagine having two paging 
functions, one for big pages and one for small pages, with the server 
picking one based on the client's Prefer header.
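One way a server might track which paging functions are still live is 
a simple registry keyed by the identifier embedded in the paging URLs. 
All names here (register, retire, lookup, the "big"/"small" ids) are 
hypothetical:

```python
PAGING_FUNCTIONS = {}          # fn_id -> frozen boundary list

def register(fn_id, boundaries):
    """Publish a paging function; its id goes into the paging URLs."""
    PAGING_FUNCTIONS[fn_id] = list(boundaries)

def retire(fn_id):
    """Give up on a paging function, e.g. after the resource resizes
    significantly.  Its URLs will now answer 410 GONE."""
    PAGING_FUNCTIONS.pop(fn_id, None)

def lookup(fn_id):
    """Resolve the paging-function id found in a requested paging URL."""
    fn = PAGING_FUNCTIONS.get(fn_id)
    return (410, None) if fn is None else (200, fn)
```

The big-page and small-page variants from the text would just be two 
entries in this registry, with the server choosing which id to embed 
based on the client's Prefer header.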

*microblogging example*

In the application Andrei and I are working on (microblogging), the 
server will have a container of postings on each "channel".  It will 
keep this sorted by the date of posting, so the LAST page will be the 
most recent postings.  It's fine if some system posts every 10 seconds 
(3 million posts per year), because the clients just jump to the LAST 
page, and maybe follow a few PREV links, depending on how far back they 
want to go.  Lossy paging would result in postings not being shown to 
some people in some circumstances, which is likely to be unacceptable.  
For us, the paging function will just be constructed by partitioning 
the range of dates present, and re-partitioning if a lot of postings 
are inserted into the middle or deleted.  In the normal case of 
postings being added "now", we're just adding to the end, so no 
re-partitioning is needed.
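The append-mostly case might be sketched like this: timestamp 
boundaries are frozen once served, and posts arriving "now" fill the 
open last page, starting a fresh page when it gets full.  DatePager 
and its page cap are illustrative, not our actual implementation:

```python
class DatePager:
    """Date-partitioned stable paging: boundaries never move once
    frozen, so late inserts land in their original (frozen) page."""
    def __init__(self, page_cap=100):
        self.page_cap = page_cap
        self.boundaries = []      # ascending timestamps, frozen once served
        self.open_count = 0       # posts on the open last page

    def place(self, ts):
        """Assign a posting timestamp to a page number."""
        for i, b in enumerate(self.boundaries):
            if ts < b:
                return i          # late insert into an already-frozen page
        # Post belongs on the open last page.
        self.open_count += 1
        if self.open_count > self.page_cap:
            self.boundaries.append(ts)   # freeze a new boundary
            self.open_count = 1          # this post starts the new page
        return len(self.boundaries)
```

Only a burst of mid-range inserts or deletes would force a full 
re-partition (and a 410 for the old paging function); the steady 
append-at-the-end case never touches existing boundaries.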

*conclusion*

I know this is a little complicated, but without it, it seems to me 
paging is worthless, or even harmful, at least for the applications I'm 
thinking of.  I'm willing to help draft text for the spec if people 
think this is reasonable to include.

       -- Sandro
Received on Sunday, 16 February 2014 19:50:57 UTC
