Re: Why Range doesn't work for LDP "paging" (cf 2NN Contents-of-Related)

On 09/17/2014 12:14 AM, Amos Jeffries wrote:
> On 17/09/2014 6:20 a.m., Sandro Hawke wrote:
>> On 09/16/2014 04:10 AM, "Martin J. Dürst" wrote:
>>> Hello Sandro, others,
>>> On 2014/09/16 10:13, Sandro Hawke wrote:
>>>> Earlier today the LDP Working Group discussed the matter of
>>>> whether we could use range headers instead of separate page
>>>> URIs.  Use of Range headers was suggested on this list
>>>> recently.
>>>> Our conclusion was still "no", for the following reasons.
>>>> Please let us know if you see a good solution to any/all of
>>>> them:
>>>> 1.  We don't know how the server would initiate use of Range.
>>>> With our current separate-page design, the server can do a 303
>>>> redirect to the first page if it determines the representation
>>>> of the entire resource is too big.   The question here is what
>>>> to do when the client didn't anticipate this possibility.
>>>> True, the 303 isn't a great solution either, since unprepared
>>>> clients might not handle it well either. Perhaps one should
>>>> give a 4xx or 5xx when the client asks for a giant resource
>>>> without a range header...?   But there's no "representation
>>>> too big" code defined.
>>> Can't you still use a 303 if there's no indication that the
>>> client understands tuple ranges?
>> What Location would the 303 redirect to?   With Range, the
>> individual sub-parts wouldn't have their own URIs.
>> Maybe it would redirect to a page which explained that the resource
>> was too big, and gave some metadata, possibly including the first
>> few and last few elements.
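
A rough sketch of that fallback, with a hypothetical overview URI:

client:
  GET /big-container HTTP/1.1
server:
  HTTP/1.1 303 See Other
  Location: /big-container-overview
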
>>>> 2.  We don't know how we could do safe changes.  With our
>>>> current design, it's possible for the resource to change while
>>>> paging is happening, and the client ends up with a
>>>> representation whose inaccuracy is bounded by the extent of the
>>>> change.  The data is thus still usually perfectly usable.  (If
>>>> such a change is not acceptable, the client can of course
>>>> detect the change using etags and restart.)   This bounded
>>>> inaccuracy is a simple and practical concept with RDF (in a way it
>>>> isn't with arbitrary byte strings). Just using Range, a
>>>> deletion would often result in data unrelated to the change
>>>> being dropped from what the client sees.
>>> Why isn't this the case in your solution? In order to work, don't
>>> you essentially have to remember exactly how far the client read?
>>> If you have various clients, one that started before the first
>>> change, one after the first but before the second change, and so
>>> on, how is the server going to keep track of how far the client
>>> got?
>> You seem to be thinking that pages are numbered.
>> Instead one can use HATEOAS and embed a place marker in the next
>> and prev URIs.   If those place markers are data values instead of
>> indexes, then insert/delete are handled properly.
>> This is explained in:
>>>> I suppose perhaps one could use some kind of tombstones to
>>>> avoid this problem, not closing in gaps from deletion.
>>>> Basically, a client might ask for triples 0-9 and only get 3
>>>> triples because the others were deleted?  Does that make sense
>>>> with Range?   Is it okay to not have the elements be
>>>> contiguous?
>>> It definitely wouldn't make sense for byte ranges, but I think
>>> it should be okay if you define tuple ranges to work that way.
>> I appreciate that you think that.   Do you have any evidence that
>> there is consensus around that idea?  I can easily imagine other
>> people will come along who would have a big problem with
>> non-contiguous ranges.
> "contiguous" is optional. You are defining how the tuple range unit is
> syntaxed. The only restrictions HTTP places on it is that it conforms
> to token character set, and fetching a range tuple produces the same data.
> You can even specify two tuple types, one for contiguous and one for
> non-contiguous if you really have to.
> It is also relative to ETag. With each resource edit the ETag needs to
> be updated to signal the change. The HTTP infrastructure treats two
> range responses with identical ETag as being combinable into one
> response, in either storage or delivery to the client. If the ETags
> differ, the responses must be kept separate and fetched separately by
> the client.
>> It would be awkward if that happened after we re-did the spec to
>> use ranges.
>> Also, does anyone know the standardization route for making a range
>> type of RDF triples?   Does that have to be an RFC or can it be an
>> external spec, like media types?
> IETF review / RFC.

Thanks for the pointer.  I still can't tell if the text defining the new 
range type MUST be in an RFC or can be in a non-RFC formal open 
specification, as it can with media type and link type registrations.

I also don't know (forgive me) what "IETF review" means.   Who needs to 
be convinced, and how many days will it take?

>>>> 3.  Many of our usual RDF data systems don't support retrieval
>>>> of ranges by integer sequence numbers.   While some database
>>>> systems have an internal integer row number in every table that
>>>> could be used for Range, many others do not, and we don't know
>>>> of a straightforward and appropriate way to add it.
>>> So how are you going to implement paged views? I'd be surprised
>>> if there are no sequence numbers but each tuple has a page
>>> number.
>> As above.
>>>> 4.  Finally, there was some question as to whether the Web
>>>> infrastructure has any useful support for non-byte ranges. This
>>>> is perhaps not an objection, but it came up during the
>>>> discussion, and we'd be interested in any data people have on
>>>> this.
>>> By infrastructure, do you mean caches? I don't think there is
>>> much support yet, but I'm not an expert.
>> Caches, server stacks, clients stacks, deep packet inspectors, and
>> other things I probably don't know about.
> The infrastructure has mandatory support for the two failover actions:
> Either
>   ensure that non-byte Ranges are passed to the server and treated as
> non-cacheable
> Or,
>   that the Accept-Ranges header is pruned so that the server is not
> enticed into delivering non-byte ranges over infrastructure which
> would break processing.
> In my experience the first action is more widely available from the
> middleware infrastructure which either ignores Range entirely, or
> caches selectively what it can and lets the rest pass untouched.

Sounds reasonable.

>>>> Bottom line is we still think just using
>>>> rel=first/last/next/prev, among distinct resources, is a pretty
>>>> reasonable design.   And if we're doing that, it'd be nice to
>>>> have 2nn Contents-of-Related.
>>> Maybe this question has come up before: If you have 1M of tuples,
>>> and decide that you have to serve them in pages of 1K, how much
>>> efficiency do you gain by having the first download
>>> short-circuited, i.e. what's the efficiency gain of one roundtrip
>>> saved over 1000 roundtrips?
>> In this case, I'm just the messenger.   I'll have to ask about that
>> and get back to you.
>>> With a range-based design, various ranges can be downloaded in
>>> parallel,
>> Good point, I hadn't thought of that.   Still, why would that ever
>> be useful?
> Collapsing those 1000 round trips into just 2 with a pipeline, and
> greatly reducing the opportunity for any parallel editing to interfere
> with the server responses.

If it was okay to stream it all, we wouldn't be trying to send it in 
little chunks.

> Take the mythical foo range type, where each letter A to ZZ represents
> a block of data in the resource. In reality this could be a numeric
> chapter number or a row hash ID provided that sequence was predictable
> by the client.
> client:
>   GET / HTTP/1.1
>   Accept-Ranges:foo
> server:
>   HTTP/1.1 206 Partial
>   Range: foo=A/A-ZZ
>   ETag: "A-ZZ_hash"

This dialog does not appear correct.  You seem to be using Accept-Ranges 
as a request header, and allowing the server to specify the range in a 
response header.

As I read RFC 7233, Accept-Ranges is a response header, Range is a 
request header, and Content-Range is the corresponding response header.
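
For reference, RFC 7233's layout applied to Amos's hypothetical foo 
range unit (the A-ZZ blocks and the hash ETag are his example values) 
would look more like:

client:
  GET / HTTP/1.1
  Range: foo=A/A-ZZ
server:
  HTTP/1.1 206 Partial Content
  Accept-Ranges: foo
  Content-Range: foo A/A-ZZ
  ETag: "A-ZZ_hash"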

> client:
>   GET / HTTP/1.1
>   Accept-Ranges:foo
>   Range: foo=B/A-ZZ
>   ETag: "A-ZZ_hash"
>   GET / HTTP/1.1
>   Accept-Ranges:foo
>   Range: foo=C/A-ZZ
>   ETag: "A-ZZ_hash"
>   GET / HTTP/1.1
>   Accept-Ranges:foo
>   Range: foo=D/A-ZZ
>   ETag: "A-ZZ_hash"
>   ...
> The UI behaviour is that the first chapter/row/whatever is delivered
> immediately, signalling how many there are and that range-based support
> is working. The display or client processing can proceed incrementally
> like those update-on-scroll pages we see on some popular sites -
> without needing long-polling or WebSocket connections.

What's the advantage of asking for chapters 0, 1, and 2 in separate 
requests?   If the client knows it wants all three, why not ask for 0-2?
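
For instance, sticking with the hypothetical foo unit, a single request 
could cover the whole span, with If-Range guarding against edits 
happening mid-transfer:

client:
  GET / HTTP/1.1
  Range: foo=B-D/A-ZZ
  If-Range: "A-ZZ_hash"
server:
  HTTP/1.1 206 Partial Content
  Content-Range: foo B-D/A-ZZ
  ETag: "A-ZZ_hash"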

What does any of this have to do with long-polling or WebSockets? Those 
are techniques for notifying a client of new information.

      -- Sandro

>>> or the client can adjust ranges based on throughput,..., but with
>>> your rel=first/last/next/prev design, you seem to be much more
>>> constrained.
>> We do have a Prefer header for page size, so clients can adjust
>> that. I'd say there are different constraints.  With Range, the
>> server has less ability to negotiate, and there's no easy way to
>> offer metadata.
> Range has opportunities for metadata in the request/response message
> headers, in the multipart segment headers per-range within response
> payload, and again in the format of the data within those response
> payload segments.
> Amos

Received on Friday, 19 September 2014 17:20:59 UTC