- From: <henry.story@bblfish.net>
- Date: Mon, 24 Feb 2014 12:22:20 +0100
- To: Soiland-Reyes Stian <soiland-reyes@cs.manchester.ac.uk>
- Cc: Sandro Hawke <sandro@w3.org>, Steve K Speicher <sspeiche@gmail.com>, public-ldp <public-ldp@w3.org>
On 24 Feb 2014, at 10:16, Stian Soiland-Reyes <soiland-reyes@cs.manchester.ac.uk> wrote:

> Why can't the member relations simply be required to be present on EVERY page?

That's an interesting answer.

> Would there be so many member relations that they also need to be paged?

Not more than three at present.

> On 21 February 2014 11:54, Sandro Hawke <sandro@w3.org> wrote:
>> On February 21, 2014 4:24:57 AM EST, "henry.story@bblfish.net" <henry.story@bblfish.net> wrote:
>>> I was thinking that there is an additional reason to have lossless paging, which is best explained with the "Volunteering for the army" counter use-case. [1]
>>>
>>> Given the recent resolution to name ldp:Container what was previously known as the ldp:IndirectContainer [2], a client doing a GET on an LDPC will not know whether it can POST to the container without the danger of that action volunteering its owner to the army -- unless it has read through all pages of the container, which may be close to impossible, as argued by Sandro.
>>>
>>> For example, a GET on a container may return
>>>
>>> [[
>>> <> a ldp:Container .
>>>
>>> <?firstPage>
>>>     a ldp-paging:Page;
>>>     ldp-paging:pageOf <>;
>>>     ldp-paging:containerSortCriteria (<#SortValueAscending>).
>>> ]]
>>>
>>> which would lead through pages and pages, and finally somewhere in the middle you would get
>>>
>>> [[
>>> <> ldp:membershipResource <http://army.us/#usf>;
>>>    ldp:hasMemberRelation army:hasVolunteer;
>>>    ldp:insertedContentRelation dc:author .
>>> ]]
>>>
>>> The advantage of Sandro's paging proposal, which has a notion of stable paging, is that one could perhaps specify in such a proposal that the first page be the one that contains all the information about the LDPC, such as the ldp:hasMemberRelation, ...
>>>
>>> Otherwise, without that, should one not at least allow, for simple clients, the header to contain a rel=type ldp:SimpleContainer, so that a client could be sure it need not worry about the consequences of its POSTing to the container beyond the publication of the document?
>>
>> I've been fretting over this, too.  The problem with the solution you state here is that it doesn't generalize to arbitrary graphs.  I'd really really like a paging solution that does.  The sorting I proposed isn't too bad; it would mean the membership configuration triples would be either at the beginning or the end of the series of pages, never in the middle.  But that's still kind of lame.
>>
>> Maybe a simpler solution is to have something like rel=meta from the container to an LDP-RS which only contains the most important stuff.  Not sure if we can precisely define that, but for containers maybe it would just be defined as the empty-container triples.  To me that seems a little cleaner than what's in the spec now using Prefer, since Prefer isn't guaranteed to even get to the server.
>>
>>    - Sandro
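Just to make that concrete (my own sketch, nothing agreed -- the rel name and the /meta URI are purely illustrative): the container's response would carry a link such as

[[
Link: <http://army.us/volunteers/meta>; rel="meta"
]]

and a GET on that small, unpaged meta resource would return only the empty-container triples, e.g.

[[
<http://army.us/volunteers/>
    a ldp:Container;
    ldp:membershipResource <http://army.us/#usf>;
    ldp:hasMemberRelation army:hasVolunteer;
    ldp:insertedContentRelation dc:author .
]]

so that even a simple client could see the consequences of a POST without walking any pages.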
>>> Henry
>>>
>>> [1] http://lists.w3.org/Archives/Public/public-ldp-wg/2013Nov/0022.html
>>> [2] http://www.w3.org/2013/meeting/ldp/2014-02-17#line0173
>>>
>>> On 20 Feb 2014, at 02:39, Sandro Hawke <sandro@w3.org> wrote:
>>>
>>>> On 02/19/2014 05:49 PM, Steve Speicher wrote:
>>>>>
>>>>> On Sun, Feb 16, 2014 at 2:49 PM, Sandro Hawke <sandro@w3.org> wrote:
>>>>> Sorry I didn't propose this months ago.  It's taken me some time to see this solution clearly.  Thanks to Eric and Andrei for being sounding boards.  They haven't seen this particular framing, though.
>>>>>
>>>>> This is my proposal for "stable paging", a way to prevent the "lossy paging" problem in the current spec, highlighted in RFC 5005.  I believe it can be implemented efficiently, and isn't prohibitively complicated.  I'd like to see it in the LC draft, marked "at risk" in case it turns out to be harder to implement than I think.
>>>>>
>>>>> static paging
>>>>>
>>>>> First, let me define something simpler, which I'll call "static paging".  With static paging, the server embeds the etag of the paged resource into the paging URLs.  When it responds to the paging URLs, it checks to make sure the etag is still the same.  If it's not, then the server responds with a 410 GONE.  Now the client knows it needs to re-initiate paging.
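To make sure I follow the mechanics, here is roughly how I picture a server doing that (just a sketch of my own, not proposal text; the URL scheme and names are invented for illustration):

[[
# Static paging sketch: page URLs carry the ETag the paged resource had when
# paging started; any later request against a stale ETag gets 410 Gone.

def page_url(container_path, etag, page_number):
    # e.g. /container?page=3&etag=abc123   (URL layout purely illustrative)
    return f"{container_path}?page={page_number}&etag={etag}"

def serve_page(etag_in_url, current_etag, page_triples):
    if etag_in_url != current_etag:
        return 410, None          # 410 GONE: the client must re-initiate paging
    return 200, page_triples
]]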
>>>>> With static paging, the client gets a perfect view of the resource, with no loss, no corruption.  And it's trivial to implement.  The problem is that resources that are large enough to require paging are likely to change frequently, so a client would be unable to follow PREV/NEXT links very many hops before the resource changed and the client had to start again.  It might be that it could never get more than 10 pages in before it had to start again.  That doesn't seem acceptable.
>>>>>
>>>>> stable paging definition
>>>>>
>>>>> With "stable paging" I propose a looser requirement.  We allow the graph to change during paging, but each triple which could theoretically ever be in the graph is assigned to a particular page.  This results in certain guarantees for the client.  Specifically, if the client traverses all the pages with stable paging, (1) every triple it sees must have been in the graph at some point during the traversal, and (2) every triple it does not see must have been absent from the graph at some point during the traversal.  In other words, the error is limited to exactly those triples that changed during the traversal.  (With lossy paging, the client may miss triples that were always in the graph.)
>>>>>
>>>>> To be very clear, what do you mean by "exactly those triples that changed" -- only those triples whose object position has changed (subject and predicate have not changed)?  So a delete + insert of a triple with the same subject and predicate is also a "triple that changed" and not a "delete, then an insert", correct?  I guess you could argue there is no difference from the client's perspective.
>>>>
>>>> In all of this, I'm thinking of general RDF graphs, so I'm not treating any position in the triple as special.
>>>>
>>>> At the point paging starts, the paged resource is an RDF graph containing the set of triples G0.
>>>>
>>>> As time goes on, and the client slowly flips through the pages, the paged resource might happen to be updated.  After the first change, it contains the set of triples G1.  Then, after another change, it contains the set of triples G2.  Etc., up to Gn, when the client does its last page GET.  (Of course, the server doesn't know this is the last one, unless it decides to cut the client off and start replying 410 GONE.)
>>>>
>>>> Let's define union(Gi, Gj, ...) and intersection(Gi, Gj, ...) to be the set union and set intersection, respectively, of the set of triples in each given graph.
>>>>
>>>> Now, we can say: to do lossless paging, the server MUST construct the pages such that the union of all the pages is a superset of intersection(G0, ... Gn) and a subset of union(G0, ... Gn).
>>>>
>>>> In other words, the client will see every triple that is there the whole time the traversal is happening, and it MAY see triples that are there part of the time the traversal is happening.
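Stated mechanically (my own restatement, not text from the proposal), the condition over the graph states G0 ... Gn and the pages the client actually fetched could be checked like this, treating each as a plain set of triples:

[[
def satisfies_lossless_paging(pages, graph_states):
    # pages: the set of triples served on each page the client fetched
    # graph_states: the sets of triples G0 ... Gn the resource held during traversal
    seen = set().union(*pages)                       # everything the client saw
    always_there = set.intersection(*graph_states)   # in the graph the whole time
    ever_there = set().union(*graph_states)          # in the graph at some point
    return always_there <= seen <= ever_there
]]

i.e. nothing that was there throughout may be missed, and nothing that was never there may show up.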
>>>> Again, I think this is fairly intuitive behavior -- if triples are added while you're reading, it shouldn't surprise you that you might or might not see them (depending exactly when it happened, and where you were in your reading).
>>>>
>>>> This isn't as nice as static paging, where the client simply sees G0, but it's much cheaper.  Compared to static paging, lossless paging can produce inconsistent views of the data, as some triples of a structure are seen but not others.  When paging containers, though, since each triple is independent, that's not a problem.
>>>>
>>>> Contrast this with lossy paging, as in Atom, where if triples are added while you're reading, COMPLETELY UNRELATED TRIPLES might be omitted from what you see.  The guarantee there is just that what you'll see is a subset of union(G0, ... Gn).  But it could be any random subset.
>>>>
>>>> BTW, instead of "single-page resource", I suggest the term "subset page" or "subset page resource" or "subset resource".  To me a "single-page resource" sounds like a resource that doesn't happen to be paged.
>>>>
>>>>> directory example
>>>>>
>>>>> Let's consider the common case: we're paging an LDPC which has a sort order.  (In fact, I expect this will be the only case where paging will be implemented by most LDP servers.  When else would it ever actually be useful?)  Maybe it's a directory of employees, sorted by the foaf:name of each person.  The FIRST page has the folks at the beginning of the alphabet; the LAST page has the folks at the end.
>>>>>
>>>>> It's a big directory, and people are being added and removed all the time.
>>>>>
>>>>> With lossy paging, and a server that naively just puts 100 people per page, if a name were deleted during forward paging or added during reverse paging, the client would simply, silently miss a person.  Displayed to a user this might be okay, but if the client is the payroll system, or an ACL system, or a spam filter lookup, or an emergency alert system, missing even a single person is inexcusable.  Static paging might be nice, but isn't practical.  But stable paging is pretty good.  With stable paging we might still miss a person, but only if that person were added or deleted during paging.  That seems like perfectly reasonable behavior, about what people would expect.
>>>>>
>>>>> I guess I'm confused by a couple of the statements here, namely:
>>>>> "With stable paging we might still miss a person, but only if that person were added or deleted during paging"
>>>>>
>>>>> Isn't that the same argument you used why lossy paging is bad?
>>>>> "if a name were deleted during forward paging or added during reverse paging, the client would simply, silently miss a person."
>>>>>
>>>>> Perhaps your point was more on the silent failure part (more on that below).  I believe you meant that with stable paging it will not miss a person, but I wanted to make sure.
>>>>
>>>> Hopefully I made that clear above.
>>>>
>>>>> So, how does stable paging work?
>>>>>
>>>>> Instead of just putting 100 names per page, with stable paging the server creates a function mapping names to pages, independent of what else is in the graph.  It might be that the first page has names starting with A, the second has names starting with B, etc.  Or it might be that the first is "" through "Anderson", the second is "Anderson" through "Aubrey", etc.  The paging function can be created by looking at the database, but once a page has been served to a client and certain triples included/excluded from it because of some boundary test, that boundary MUST NOT change.
>>>>>
>>>>> Each paging URL then includes an identifier for that paging function.  As with static paging, the server can, at any time, give up on a particular paging function and answer 410 GONE for those URLs.  With the directory, I imagine one would only do that when the directory changes size significantly.  I can also imagine one would have two paging functions, one for big pages and one for small pages, and the server would pick one based on a client PREFER header.
>>>>>
>>>>> So after the ("Anderson" < page <= "Aubrey") page is served to the client, say "Arthur" is inserted.  My client (perhaps after waiting for some period) requests ("Aubrey" < page <= "Brook"); the server now will have to replay the paging function over the entire paged resource and see if the page sizes are consistent with the boundaries defined (i.e. nothing has changed).  The "Arthur" change is detected, then 410 (or some indicator) is sent.  Is this right?  So there is a decent cost involved in determining, for each page, whether paging needs to be re-initiated.
>>>>
>>>> In the implementation I was proposing here, I was assuming page sizes didn't need to be exact.  If a few adds or deletes happen, the page sizes will be off a bit, and that will be okay.
>>>>
>>>> I think the implementation I mentioned earlier today [1], where you put the boundary values in the URL, is better than this one in every case.  So, no need to worry about the logic here.  The core of my proposal is the definition of lossless paging above, and that we require it of LDP servers if they're going to do paging.
>>>>
>>>>    -- Sandro
>>>>
>>>> [1] http://lists.w3.org/Archives/Public/public-ldp-wg/2014Feb/0058.html
>>>>
>>>>> (apologies for replying in reverse order on these, I should have comprehended this first).
>>>>>
>>>>> - Steve Speicher
>>>>>
>>>>> microblogging example
>>>>>
>>>>> In the application Andrei and I are working on (microblogging), the server will have a container of postings on each "channel".  It will keep this sorted by the date of posting, so the LAST page will be the most recent postings.  It's fine if some system posts every 10 seconds (3 million posts per year) because the clients just jump to the LAST page, and maybe follow a few PREV links, depending how far back they want to go.  Lossy paging would result in postings not being shown to some people in some circumstances, which is likely to be unacceptable.  For us, the paging function will just be constructed by partitioning the range of dates present, and re-partitioning if a lot of postings are inserted into the middle or deleted.  In the normal case of postings being added "now", we're just adding to the end, so no re-partitioning is needed.
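For what it's worth, here is the kind of paging function I would expect for these sorted cases (my own sketch, not from the proposal): pick boundary values once from a snapshot of the sort keys (names, or dates), freeze them, and let a triple's page depend only on where its key falls -- which also fits nicely with putting the boundary values in the page URLs, as in [1].

[[
import bisect

def make_boundaries(sort_keys, page_size=100):
    # Choose boundaries from a snapshot of the current keys; once pages have been
    # served, these values MUST NOT change (re-partitioning means new page URLs).
    keys = sorted(sort_keys)
    return [keys[i] for i in range(page_size, len(keys), page_size)]

def page_for(key, boundaries):
    # Every possible key maps to exactly one page, independent of what else is in
    # the graph, so additions and deletions only ever affect the triples that changed.
    return bisect.bisect_right(boundaries, key)
]]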
>>>>> conclusion
>>>>>
>>>>> I know this is a little complicated, but without it, it seems to me paging is worthless, or even harmful, at least for the applications I'm thinking of.  I'm willing to help draft text for the spec if people think this is reasonable to include.
>>>>>
>>>>>    -- Sandro
>>>
>>> Social Web Architect
>>> http://bblfish.net/
>
> --
> Stian Soiland-Reyes, myGrid team
> School of Computer Science
> The University of Manchester
> http://soiland-reyes.com/stian/work/  http://orcid.org/0000-0001-9842-9718

Social Web Architect
http://bblfish.net/
Received on Monday, 24 February 2014 11:23:27 UTC