Re: Why Range doesn't work for LDP "paging" (cf 2NN Contents-of-Related) from Amos Jeffries on 2014-09-17 (ietf-http-wg@w3.org from July to September 2014)

From: Amos Jeffries <squid3@treenet.co.nz>
Date: Wed, 17 Sep 2014 16:14:57 +1200
To: ietf-http-wg@w3.org
Message-ID: <54190AC1.4060201@treenet.co.nz>
-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1

On 17/09/2014 6:20 a.m., Sandro Hawke wrote:
> On 09/16/2014 04:10 AM, "Martin J. Dürst" wrote:
>> Hello Sandro, others,
>> 
>> On 2014/09/16 10:13, Sandro Hawke wrote:
>>> Earlier today the LDP Working Group discussed the matter of
>>> whether we could use range headers instead of separate page
>>> URIs.  Use of Range headers was suggested on this list
>>> recently.
>>> 
>>> Our conclusion was still "no", for the following reasons.
>>> Please let us know if you see a good solution to any/all of
>>> them:
>>> 
>>> 1.  We don't know how the server would initiate use of Range.
>>> With our current separate-page design, the server can do a 303
>>> redirect to the first page if it determines the representation
>>> of the entire resource is too big.   The question here is what
>>> to do when the client didn't anticipate this possibility.
>>> True, the 303 isn't a great solution either, since unprepared
>>> clients might not handle it well either. Perhaps one should
>>> give a 4xx or 5xx when the client asks for a giant resource
>>> without a range header...?   But there's no "representation
>>> too big" code defined.
>> 
>> Can't you still use a 303 if there's no indication that the
>> client understands tuple ranges?
>> 
> 
> What Location would the 303 redirect to?   With Range, the
> individual sub-parts wouldn't have their own URIs.
> 
> Maybe it would redirect to a page which explained that the resource
> was too big, and gave some metadata, possibly including the first
> few and last few elements.
> 
>>> 2.  We don't know how we could do safe changes.  With our
>>> current design, it's possible for the resource to change while
>>> paging is happening, and the client ends up with a
>>> representation whose inaccuracy is bounded by the extent of the
>>> change.  The data is thus still usually perfectly usable.  (If
>>> such a change is not acceptable, the client can of course
>>> detect the change using etags and restart.)   This bounded
>>> inaccuracy is a simple and practical concept with RDF (in a way it
>>> isn't with arbitrary byte strings). Just using Range, a
>>> deletion would often result in data unrelated to the change
>>> being dropped from what the client sees.
>> 
>> Why isn't this the case in your solution? In order to work, don't
>> you essentially have to remember exactly how far the client read?
>> If you have various clients, one that started before the first
>> change, one after the first but before the second change, and so
>> on, how is the server going to keep track of how far the client
>> got?
>> 
> 
> You seem to be thinking that pages are numbered.
> 
> Instead one can use HATEOAS and embed a place marker in the next
> and prev URIs.   If those place markers are data values instead of
> indexes, then insert/delete are handled properly.
> 
> This is explained in: http://www.w3.org/TR/ldp-paging/#ldpr-impl
> 
> 
>> 
>>> I suppose perhaps one could use some kind of tombstones to
>>> avoid this problem, not closing in gaps from deletion.
>>> Basically, a client might ask for triples 0-9 and only get 3
>>> triples because the others were deleted?  Does that make sense
>>> with Range?   Is it okay to not have the elements be
>>> contiguous?
>> 
>> It definitely wouldn't make sense for byte ranges, but I think
>> it should be okay if you define tuple ranges to work that way.
>> 
> 
> I appreciate that you think that.   Do you have any evidence that
> there is consensus around that idea?  I can easily imagine other
> people will come along who would have a big problem with
> non-contiguous ranges.

"contiguous" is optional. You are defining the syntax of the tuple range
unit. The only restrictions HTTP places on it are that it conforms to the
token character set, and that fetching a given range produces the same
data.

You can even specify two tuple types, one for contiguous and one for
non-contiguous if you really have to.

It is also relative to ETag. With each resource edit the ETag needs to
be updated to signal the change. The HTTP infrastructure treats two
range responses with identical ETag as being combinable into one
response, in either storage or delivery to the client. If the ETags
differ, the responses must be kept separate and fetched separately by
the client.
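
As a rough illustration of that rule, here is a minimal Python sketch of
a store that only ever combines range parts sharing one ETag. The class
and method names are made up for this example, and it is not how any
particular cache is implemented.

```python
from collections import defaultdict


class RangePartStore:
    """Hypothetical store for partial responses of a single resource."""

    def __init__(self):
        # One separate bucket of parts per ETag value.
        self._parts_by_etag = defaultdict(dict)

    def add(self, etag, unit_label, body):
        """Record one partial response under the ETag it was served with."""
        self._parts_by_etag[etag][unit_label] = body

    def combinable(self, etag):
        """Return only the parts that share this exact ETag."""
        return dict(self._parts_by_etag.get(etag, {}))


store = RangePartStore()
store.add('"v1"', "A", b"first block")
store.add('"v1"', "B", b"second block")
store.add('"v2"', "C", b"block fetched after an edit")
```

Here the "A" and "B" parts may be stitched together, while "C" stays in
its own bucket because its ETag differs.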


> 
> It would be awkward if that happened after we re-did the spec to
> use ranges.
> 
> Also, does anyone know the standardization route for making a range
> type of RDF triples?   Does that have to be an RFC or can it be an
> external spec, like media types?

http://tools.ietf.org/html/rfc7233#section-5.1

IETF review / RFC.

> 
>> 
>>> 3.  Many of our usual RDF data systems don't support retrieval
>>> of ranges by integer sequence numbers.   While some database
>>> systems have an internal integer row number in every table that
>>> could be used for Range, many others do not, and we don't know
>>> of a straightforward and appropriate way to add it.
>> 
>> So how are you going to implement paged views? I'd be surprised
>> if there are no sequence numbers but each tuple has a page
>> number.
>> 
> 
> As above.
> 
>> 
>>> 4.  Finally, there was some question as to whether the Web 
>>> infrastructure has any useful support for non-byte ranges. This
>>> is perhaps not an objection, but it came up during the
>>> discussion, and we'd be interested in any data people have on
>>> this.
>> 
>> By infrastructure, do you mean caches? I don't think there is
>> much support yet, but I'm not an expert.
>> 
> 
> Caches, server stacks, client stacks, deep packet inspectors, and
> other things I probably don't know about.

The infrastructure has mandatory support for two failover actions.
Either:
 ensure that non-byte Ranges are passed to the server and treated as
 non-cacheable,
Or:
 prune the Accept-Ranges header so that the server is not enticed into
 delivering non-byte ranges over infrastructure which will break
 processing.

In my experience the first action is more widely available from the
middleware infrastructure which either ignores Range entirely, or
caches selectively what it can and lets the rest pass untouched.
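
The two failover actions could be sketched roughly as below. This is
only an illustration in Python: the function names, the KNOWN_UNITS set
and the lower-cased header dictionaries are assumptions of the sketch,
not taken from any real proxy.

```python
KNOWN_UNITS = {"bytes"}  # range units this hypothetical intermediary understands


def range_unit(range_value):
    """Extract the unit from a Range value such as 'foo=B/A-ZZ'."""
    return range_value.split("=", 1)[0].strip().lower()


def may_cache(request_headers):
    """First action: unknown range units pass through but are not cached."""
    value = request_headers.get("range")
    return value is None or range_unit(value) in KNOWN_UNITS


def prune_accept_ranges(headers):
    """Second action: drop the units this intermediary cannot handle."""
    pruned = dict(headers)
    value = pruned.pop("accept-ranges", None)
    if value is not None:
        kept = [u.strip() for u in value.split(",")
                if u.strip().lower() in KNOWN_UNITS]
        if kept:
            pruned["accept-ranges"] = ", ".join(kept)
    return pruned
```

A message with "Range: foo=B/A-ZZ" would be forwarded untouched and
simply marked non-cacheable, which matches the more common behaviour
described above.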


> 
>> 
>>> Bottom line is we still think just using
>>> rel=first/last/next/prev, among distinct resources, is a pretty
>>> reasonable design.   And if we're doing that, it'd be nice to
>>> have 2nn Contents-of-Related.
>> 
>> Maybe this question has come up before: If you have 1M of tuples,
>> and decide that you have to serve them in pages of 1K, how much
>> efficiency do you gain by having the first download
>> short-circuited, i.e. what's the efficiency gain of one roundtrip
>> saved over 1000 roundtrips?
>> 
> 
> In this case, I'm just the messenger.   I'll have to ask about that
> and get back to you.
> 
>> With a range-based design, various ranges can be downloaded in
>> parallel,
> 
> Good point, I hadn't thought of that.   Still, why would that ever
> be useful?

Collapsing those 1000 round trips into just 2 with a pipeline, and
greatly reducing the opportunity for any parallel editing to interfere
with the server responses.

Take the mythical foo range type, where each letter A to ZZ represents
a block of data in the resource. In reality this could be a numeric
chapter number or a row hash ID provided that sequence was predictable
by the client.

client:
 GET / HTTP/1.1
 Accept-Ranges: foo

server:
 HTTP/1.1 206 Partial Content
 Content-Range: foo A/A-ZZ
 ETag: "A-ZZ_hash"

client:
 GET / HTTP/1.1
 Accept-Ranges: foo
 Range: foo=B/A-ZZ
 If-Range: "A-ZZ_hash"

 GET / HTTP/1.1
 Accept-Ranges: foo
 Range: foo=C/A-ZZ
 If-Range: "A-ZZ_hash"

 GET / HTTP/1.1
 Accept-Ranges: foo
 Range: foo=D/A-ZZ
 If-Range: "A-ZZ_hash"

 ...

The UI behaviour is that the first chapter/row/whatever is delivered
immediately, signalling how many there are and that range-based support
is working. The display or client processing can then proceed
incrementally, like those update-on-scroll pages we see on some popular
sites, without needing long-polling or WebSocket connections.
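
To make the pipelining step concrete, here is a minimal Python sketch
that builds the follow-up requests for the mythical foo unit so they can
be written to one connection in a single burst. The host name, the
helper names and the use of If-Range to carry the validator are
assumptions of this sketch; nothing is sent over the network.

```python
from itertools import product
from string import ascii_uppercase


def block_labels():
    """Yield A..Z, then AA..ZZ, the predictable sequence of block names."""
    for size in (1, 2):
        for letters in product(ascii_uppercase, repeat=size):
            yield "".join(letters)


def build_requests(host, etag, already_have=("A",)):
    """Return one request string per block that is still missing."""
    requests = []
    for label in block_labels():
        if label in already_have:
            continue
        requests.append(
            "GET / HTTP/1.1\r\n"
            f"Host: {host}\r\n"
            f"Range: foo={label}/A-ZZ\r\n"
            f"If-Range: {etag}\r\n"
            "\r\n"
        )
    return requests


# Everything after block A, ready to be sent back-to-back.
pipeline = "".join(build_requests("example.org", '"A-ZZ_hash"')).encode("ascii")
```

The client pays one round trip for block A and one more for the whole
pipeline, rather than one per block.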


> 
>> or the client can adjust ranges based on throughput,..., but with
>> your rel=first/last/next/prev design, you seem to be much more
>> constrained.
> 
> We do have a Prefer header of page size, so clients can adjust
> that. I'd say there are different constraints.  With Range, the
> server has less ability to negotiate, and there's no easy way to
> offer metadata.

Range has opportunities for metadata in the request/response message
headers, in the multipart segment headers per-range within response
payload, and again in the format of the data within those response
payload segments.
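
As a small illustration of the per-range segment headers, the Python
sketch below reads them from a multipart payload using the standard
email parser. The payload, the boundary and the foo unit are invented
for the example, and the Content-Type line is prepended by hand only so
that the parser can find the boundary.

```python
from email import message_from_bytes

# Hypothetical multipart payload carrying two blocks of the "foo" unit.
RAW = (
    b"Content-Type: multipart/byteranges; boundary=SEP\r\n"
    b"\r\n"
    b"--SEP\r\n"
    b"Content-Type: text/turtle\r\n"
    b"Content-Range: foo B/A-ZZ\r\n"
    b"\r\n"
    b"<s1> <p> <o1> .\r\n"
    b"--SEP\r\n"
    b"Content-Type: text/turtle\r\n"
    b"Content-Range: foo C/A-ZZ\r\n"
    b"\r\n"
    b"<s2> <p> <o2> .\r\n"
    b"--SEP--\r\n"
)


def part_metadata(raw):
    """Return (Content-Range, body text) for every segment in the payload."""
    message = message_from_bytes(raw)
    return [
        (part["Content-Range"], part.get_payload().strip())
        for part in message.get_payload()
    ]


segments = part_metadata(RAW)
```

Each segment carries its own headers, so per-range metadata can travel
next to the data it describes.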

Amos
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v2.0.22 (MingW32)

iQEcBAEBAgAGBQJUGQrBAAoJELJo5wb/XPRjyJ0H/2rA/zFe9sYm6NouZTZ8gBU+
W7OA6YqDq3kVCp+l9FV+5a2YVL0xW+DZC1mcHNrVnDbMOXKEQ568Dyuw0QDYXieR
NeeMLNpG4+UB18TKo4hs28R5pcgq4oXqo1IUTAg8vmhhAa2q1QMOEzvQQcDdjGMl
Ax+ZcmVQMl0w4E36D2m61T65fYr/gRWrgJ10r/CpwgINpVXd3DpE4Ikccr8E1j8h
Q9+wpwAyTLu5j+JFIU9kwlJMFEgxGnr4hG4crqufpx9dUkQX55HvNvSac1cu5UPh
MB9auHuTxAilfvLlL2imJuzpXShL2cKUgQIhAmzxKV2+mvab3xaCBOC4p9Quxnw=
=Zx48
-----END PGP SIGNATURE-----
Received on Wednesday, 17 September 2014 04:15:40 UTC