W3C home > Mailing lists > Public > public-ldp@w3.org > February 2014

Re: lossless paging & volunteering for the army

From: <henry.story@bblfish.net>
Date: Mon, 24 Feb 2014 12:22:20 +0100
Cc: Sandro Hawke <sandro@w3.org>, Steve K Speicher <sspeiche@gmail.com>, public-ldp <public-ldp@w3.org>
Message-Id: <89CCE268-0966-41F4-A2FA-3E8FFD583657@bblfish.net>
To: Soiland-Reyes Stian <soiland-reyes@cs.manchester.ac.uk>

On 24 Feb 2014, at 10:16, Stian Soiland-Reyes <soiland-reyes@cs.manchester.ac.uk> wrote:

> Why can't the member relations simply be required to be present on
> EVERY page?

That's an interesting answer.

> Would there be so many member relations that they also
> need to be paged?

Not more than three at present.
 

> 
> On 21 February 2014 11:54, Sandro Hawke <sandro@w3.org> wrote:
>> On February 21, 2014 4:24:57 AM EST, "henry.story@bblfish.net" <henry.story@bblfish.net> wrote:
>>> I was thinking that there is an additional reason to have lossless
>>> paging
>>> which is best explained with the "Volunteering for the army" counter
>>> use-case. [1]
>>> 
>>> Given the recent resolution to name ldp:Container what was previously
>>> known
>>> as the ldp:IndirectContainer [2] a client doing a GET on an LDPC will
>>> not know
>>> if it can POST to the container without the danger of that action
>>> volunteering
>>> its owner to the army -- unless it has read through all pages of the
>>> container, which
>>> may be close to impossible as argued by Sandro.
>>> 
>>> For example a GET on a container may return
>>> 
>>> [[
>>> <> a ldp:Container;
>>> 
>>> <?firstPage>
>>>  a ldp-paging:Page;
>>>  ldp-paging:pageOf <>;
>>>  ldp-paging:containerSortCriteria (<#SortValueAscending>).
>>> ]]
>>> 
>>>  which would lead through  pages and pages, and finally somewhere
>>> in the middle you would get
>>> 
>>> [[
>>> <>
>>>  ldp:membershipResource <http://army.us/#usf>;
>>>  ldp:hasMemberRelation army:hasVolunteer;
>>>  ldp:insertedContentRelation dc:author .
>>> ]]
>>> 
>>> The advantage of Sandro's paging proposal which has a notion of stable
>>> paging,
>>> is that one could perhaps specify in such a paging proposal that the
>>> first page be the
>>> one that contain all the information about the LDPC such as the
>>> ldp:hasMemberRelation, ...
>>> 
>>> Otherwise without that should one not for simple clients at least allow
>>> the header to contain a rel=type ldp:SimpleContainer, so that a client
>>> could be sure that it need not worry about what the consequences of its
>>> POSTing to the container are beyone the publication of the document?
>>> 
>> 
>> I've been fretting over this, too.    The problem with the solution you state here is that it doesn't generalize to arbitrary graphs.    I'd really really like a paging solution that does.    The sorting I proposed isn't too bad; it would mean the members configuration triples would be either at the beginning or the ending of the series of pages, never in the middle.   But that's still kind of lame.
>> 
>> Maybe a simpler solution is to have something like rel=meta from the container to an ldp-rs which only contains the most important stuff.   Not sure if we can precisely define that, but for containers maybe it would just be defined as the empty-container triples.    To me that seems a little cleaner than what's in the spec now using prefer, since prefer isn't guaranteed to even get to the server.
>> 
>>    - Sandro
>> 
>> 
>> 
>>> 
>>> Henry
>>> 
>>> [1] http://lists.w3.org/Archives/Public/public-ldp-wg/2013Nov/0022.html
>>> [2] http://www.w3.org/2013/meeting/ldp/2014-02-17#line0173
>>> 
>>> 
>>> On 20 Feb 2014, at 02:39, Sandro Hawke <sandro@w3.org> wrote:
>>> 
>>>> On 02/19/2014 05:49 PM, Steve Speicher wrote:
>>>>> 
>>>>> 
>>>>> 
>>>>> On Sun, Feb 16, 2014 at 2:49 PM, Sandro Hawke <sandro@w3.org> wrote:
>>>>> Sorry I didn't propose this months ago.   It's taken me some time to
>>> see this solution clearly.   Thanks to Eric and Andrei for being
>>> sounding boards.   They haven't seen this particular framing, though.
>>>>> 
>>>>> This is my proposal for "stable paging", a way to prevent the "lossy
>>> paging" problem in the currect spec and highlighted in RFC-5005.  I
>>> believe it can be                 implemented efficiently, and isn't
>>> prohibitively complicated.  I'd like to see it in the LC draft, marked
>>> "at risk" in case it turns out to be harder to implement
>>> than I think.
>>>>> 
>>>>> static paging
>>>>> 
>>>>> First, let me define something simpler, which I'll call "static
>>> paging".  With static paging, the server embeds the etag of the paged
>>> resource into the paging URLs.  When it responds to the paging URLs, it
>>> checks to make sure the etag is still the same.  If it's not, then the
>>> server responds with a 410 GONE.  Now the client knows it needs to
>>> re-initiate paging.
>>>>> 
>>>>> With static paging, the client gets a perfect view of the resource,
>>> with no loss, no corruption.  And it's trivial to implement.  The
>>> problem is that resources                 that are large enough to
>>> require paging are likely to change frequently, so a client would be
>>> unable to follow PREV/NEXT links very many hops before the resource
>>> changed and the client had to start again.  It might be it could never
>>> get more than 10 pages in before it had to start again.  That doesn't
>>> seem acceptable.
>>>>> 
>>>>> stable paging definition
>>>>> 
>>>>> With "stable paging" I propose a looser requirement.  We allow the
>>> graph to change during paging, but each triple which could
>>> theoretically ever be in the graph is assigned to a particular page.
>>> This results in certain guarantees for the client.  Specifically, if
>>> the client traverses all the pages with stable paging, (1) every triple
>>> it sees must have been in the graph at some point during the traversal,
>>> and (2) every triple it does not see must have been absent from the
>>> graph at some point during the traversal.  In other words, the error is
>>> limited to exactly those triples that changed during the traversal.
>>> (With lossy paging, the client may miss triples that were always in the
>>> graph.)
>>>>> 
>>>>> To be very clear, what you mean by "exactly those triples that
>>> changed" meaning only those triples whose object position has changed
>>> (subject and predicate have not changed).  So it could be a delete +
>>> insert of a triple with same subject and predicate is also a "triple
>>> that changed" and not a "delete, then an insert", correct?  I guess you
>>> could argue there is no difference from the client's perspective.
>>>>> 
>>>> 
>>>> In all of this, I'm thinking of general RDF graphs, so I'm not
>>> treating any position in the triple as special.
>>>> 
>>>> At the point paging starts, the paged resource is an RDF graph
>>> containing the set of triples G0.
>>>> 
>>>> As time goes on, and the client slowly flips through the pages, the
>>> paged resource might happen to be updated.  After the first change, it
>>> contains the set of triples G1.  Then, after another change, it
>>> contains the set of triples G2.   Etc, up to Gn, when the client does
>>> its last page GET.    (Of course, the server doesn't know this is the
>>> last one, unless it decides to cut the client off and start replying
>>> 410 GONE.)
>>>> 
>>>> Let's define union(Gi, Gj, ...) and intersection(Gi, Gj, ...) to be
>>> the set union and set intersection, respectively, of the set of triples
>>> in each given graph.
>>>> 
>>>> Now, we can say: to do lossless paging, the server MUST construct the
>>> pages such that the union of all the pages is a superset of
>>> intersection(G0, ... Gn) and a subset of union (G0, ... Gn).
>>>> 
>>>> In other words, the client will see every triple that is there the
>>> whole time the traversal is happening, and it MAY see triples that are
>>> there part of the time the traversal is happening.
>>>> 
>>>> Again, I think this is fairly intuitive behavior -- if triples are
>>> added while you're reading, it shouldn't surprise you that you might
>>> ore might not see them (depending exactly when it happened, and where
>>> you were in your reading).
>>>> 
>>>> This isn't nice as static paging, where the client simply sees G0,
>>> but it's much cheaper.   Compared to static paging, lossless paging can
>>> produce inconsistent views of the data, as some triples of a structure
>>> are seen but not others.   With paging containers, though, since each
>>> triple is independent, that's not a problem.
>>>> 
>>>> Contrast this with lossy paging, as ATOM, where if triples are added
>>> while you're reading, COMPLETELY UNRELATED TRIPLES might be omitted
>>> from what you see.    The guarantee there is just that what you'll see
>>> is a subset of union (G0, ... Gn).   But it could be any random subset.
>>>> 
>>>> BTW, instead of "single-page resource", I suggest the term "subset
>>> page" or "subset page resource" or "subset resource".   To me a
>>> "single-page resource" sounds like a resource that doesn't happen to be
>>> paged.
>>>> 
>>>>> 
>>>>> directory example
>>>>> 
>>>>> Let's consider the common case: we're paging an LDPC which has a
>>> sort order.  (In fact, I expect this will be the only case where paging
>>> will be implemented by most LDP servers.  When else would it ever
>>> actually be useful?)  Maybe it's a directory of employees, sorted by
>>> the foaf:name of each person.  The FIRST page has the folks at the
>>> beginning of the alphabet; the LAST page has the folks at the end.
>>>>> 
>>>>> It's a big directory, and people are being added and removed all the
>>> time.
>>>>> 
>>>>> With lossy paging, and a server that naively just puts 100 people
>>> per page, if a name were deleted during forward paging or added during
>>> reverse paging, the client would simply, silently miss a person.
>>> Displayed to a user this might be okay, but if the client is the
>>> payroll system, or an acl system, or a spam filter lookup, or an
>>> emergency alert system, missing even a single person is inexcusable.
>>> Static paging might be nice, but isn't practical.  But stable paging is
>>> pretty good.  With stable paging we might still miss a person, but only
>>> if that person were added or deleted during paging.  That seems like
>>> perfectly reasonable behavior, about what people would expect.
>>>>> 
>>>>> I guess I'm confused by a couple of the statements here, namely:
>>>>> "With stable paging we might still miss a person, but only if that
>>> person were added or deleted during paging"
>>>>> 
>>>>> Isn't that the same argument you used why lossy paging is bad?
>>>>> "if a name were deleted during forward paging or added during
>>> reverse paging, the client would simply, silently miss a person."
>>>>> 
>>>>> Perhaps your point was more on the silent failure part (more on that
>>> below).  I believe you meant with stable paging it will not miss a
>>> person but I wanted to make sure.
>>>>> 
>>>> 
>>>> Hopefully I made that clear above.
>>>> 
>>>>> 
>>>>> So, how does stable paging work?
>>>>> 
>>>>> Instead of just putting 100 names per page, with stable paging the
>>> server creates a function mapping names to pages independent of what
>>> else is in the graph.  It might be the first page has names starting
>>> with A, the second has names starting with B, etc.  Or it might be the
>>> first is "" through "Anderson", the second is "Anderson" through
>>> "Aubrey", etc.  The paging function can be created by looking at the
>>> database, but once a page has been served to a client and certain
>>> triples included/excluded from it because of some boundary test, that
>>> boundary MUST NOT change.
>>>>> 
>>>>> Each paging URL then includes an identifier for that paging
>>> function. As with static paging, the server can, at any time, give up
>>> on a particular paging function and answer 410 GONE for those URLs.
>>> With the directory, I imagine one would only do that when the directory
>>> changes size significantly.  I can also imagine one would have two
>>> paging functions, one for big pages and one for small pages, and the
>>> server would pick one based on a client PREFER header.
>>>>> 
>>>>> So after the ("Anderson" < page <= "Aubrey") is served to the
>>> client, say "Arthur" is inserted.  My client (perhaps after waiting for
>>> some period) requests ("Aubrey" < page <= "Brook"), the server now will
>>> have to replay the paging function to the entire paged resource and see
>>> if the page sizes are consistent with the boundaries defined (i.e.
>>> nothing has changed).  The "Arthur" change is detected, then 410 (or
>>> some indicator) is sent.  This right?   So there is a decent cost
>>> involved with determining for each page, if paging needs to be
>>> reinitiated.
>>>> 
>>>> In the implementation I was proposing here, I was assuming page sizes
>>> didn't need to be exact.   If a few adds or deletes happen, the page
>>> sizes will be off a bit, and that will be okay.
>>>> 
>>>> I think the implementation I mentioned earlier today [1], where you
>>> put the boundary values in the URL, is better than this one in ever
>>> case.   So, no need to worry about the logic here.   The core of my
>>> proposal is the definition of lossless paging above, and that we
>>> require it of LDP servers if they're going to do paging.
>>>> 
>>>>       -- Sandro
>>>> 
>>>> [1]
>>> http://lists.w3.org/Archives/Public/public-ldp-wg/2014Feb/0058.html
>>>> 
>>>>> 
>>>>> (apologies for replying in reverse order on these, I should have
>>> comprehended this first).
>>>>> 
>>>>> - Steve Speicher
>>>>> 
>>>>> 
>>>>> microblogging example
>>>>> 
>>>>> In the application Andrei and I are working on (microblogging), the
>>> server will have a container of postings on each "channel".  It will
>>> keep this sorted by the date of posting, so the LAST page will be the
>>> most recent postings.  It's fine if some system posts every 10 seconds
>>> (3 million posts per year) because the clients just jump to the LAST
>>> page, and maybe follow a few PREV links, depending how far back they
>>> want to go.  Lossy paging would result in postings not being shown to
>>> some people in some circumstance, which is likely to be unacceptable.
>>> For us, the paging function will just be constructed by partitioning
>>> the range of dates present, and re-partitioning if a lot of postings
>>> are inserted into the middle or deleted.  In the normal case of
>>> postings being added "now", we're just adding to the end, so no
>>> re-partitioning is needed.
>>>>> 
>>>>> conclusion
>>>>> 
>>>>> I know this is a little complicated, but without it, it seems to me
>>> paging is worthless, or even harmful, at least for the applications I'm
>>> thinking of.  I'm willing to help draft text for the spec if people
>>> think this is reasonable to include.
>>>>> 
>>>>>      -- Sandro
>>>>> 
>>>>> 
>>>> 
>>> 
>>> Social Web Architect
>>> http://bblfish.net/
>> 
>> 
>> 
> 
> 
> 
> -- 
> Stian Soiland-Reyes, myGrid team
> School of Computer Science
> The University of Manchester
> http://soiland-reyes.com/stian/work/ http://orcid.org/0000-0001-9842-9718

Social Web Architect
http://bblfish.net/
Received on Monday, 24 February 2014 11:23:27 UTC

This archive was generated by hypermail 2.3.1 : Tuesday, 6 January 2015 20:03:11 UTC