Re: Q about reconciliation query batch size


Curious: do you have a particular preference for how a service should
signal rate limiting?
Which methods do you see services use most often for that? HTTP client error
codes (4xx)? 206 Partial Content returned in response to a Range header field?
(I see Amazon, Google, etc. mostly use HTTP error responses, specifically
403 and 429.)
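
To make the 429 case concrete, here is a rough client-side sketch of honoring a 429 response with a Retry-After header. It is only an illustration under my own assumptions, not anything the spec or OpenRefine defines: the function name retry_delay, the one-second fallback, and the plain dict of headers keyed exactly as "Retry-After" are all made up for the example. Retry-After may carry either a number of seconds or an HTTP date, so both are handled.

```python
from datetime import datetime, timezone
from email.utils import parsedate_to_datetime


def retry_delay(status, headers, now=None, default=1.0):
    """Return seconds to wait before retrying, or None if not rate limited.

    Hypothetical helper: `headers` is assumed to be a plain dict keyed by
    the exact header name, and `default` is an arbitrary fallback delay.
    """
    if status != 429:
        return None  # only 429 Too Many Requests is treated as rate limiting
    value = headers.get("Retry-After")
    if value is None:
        return default  # server gave no hint, so fall back to the default
    value = value.strip()
    if value.isdigit():
        return float(value)  # delay-seconds form, e.g. "120"
    try:
        when = parsedate_to_datetime(value)  # HTTP-date form
    except (TypeError, ValueError):
        return default  # unparseable value, fall back to the default
    if when.tzinfo is None:
        when = when.replace(tzinfo=timezone.utc)
    if now is None:
        now = datetime.now(timezone.utc)
    # A date in the past means the client may retry immediately.
    return max(0.0, (when - now).total_seconds())
```

A client would call this after each response and sleep for the returned number of seconds before resending the same batch. Whether 403 should be treated the same way depends on the service, so the sketch leaves it out.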


On Thu, Jun 11, 2020 at 12:45 PM Tom Morris wrote:

> All of the currently defined limits are for controlling the number of
> responses sent by the server, rather than the requests sent by the client.
> While we could add "recommended batch size" and/or "maximum batch size" to
> the manifest, I'm not sure it would add a lot of value. As a practical
> matter, clients are going to choose a batch size which balances between
> amortizing request overhead/latency and responsiveness for progress
> reporting. They aren't motivated to use giant request sizes. In a DoS
> situation, the attacker isn't going to be respecting any advertised limits.
> Note that the server is always free to respond 413 Request Entity Too
> Large and all modern service frameworks have a configurable limit for
> this.
> The spec is also silent on whether you can send simultaneous requests in
> parallel, rate limits, etc. I think this would be a more valuable area to
> improve from the point of view of protecting services. The 429 Too Many
> Requests code and Retry-After: header provide a starting point, but it
> may be useful to make use of some extended headers in the X-RateLimit-*
> space.
> There are two resource buckets associated with large requests: space &
> time. Once you've accepted the request, the space is used up, but there are
> no requirements or guarantees on how quickly the request will be processed.
> If you want to meter work on a per-request basis and take longer to respond
> to bigger batches, that's completely within the service implementer's right
> to do.
> OpenRefine currently uses a fixed batch size of 10 and processes batches
> serially in a single-threaded fashion, which is inherently rate limiting,
> but it would be nice to improve the latency hiding and be able to have
> multiple requests in flight, while still being polite to reconciliation
> services.
> Tom
> On Thu, Jun 11, 2020 at 12:00 PM Ford, Kevin wrote:
>> Hello all:
>> I presume this is the best place to ask this question, which I’ve
>> harbored for years but which for a variety of reasons I’ve never had real
>> occasion to ask until now.
>> Do I understand correctly that there is no limit to, and no way to
>> enforce a limit of, the reconciliation query batch size?
>> This sentence from the documentation on the GitHub site – “OpenRefine
>> queries the reconciliation service in batch mode
>> on the first ten items of the column to be reconciled.” [1] - /might/
>> suggest the size of batches is 10, but I believe we’re to understand that
>> this particular call basically represents a test before the real, full
>> reconciliation kicks off.  Yes?
>> The “Note” under this section of the W3C specification work [2] seems to
>> make it abundantly clear that there is no restriction on the length of
>> query batches.
>> I didn’t see a clear way to do this via the service manifest.
>> If there is no limit on the size, is there a way for a service provider
>> to impose a limit?  If so, how?  If not, why not?
>> Assuming it is not possible to impose a limit, how does one protect a
>> service from becoming overwhelmed by one extremely large reconciliation
>> request or a number of big ones?  It seems that this opens up the service
>> to a DoS attack, but perhaps I am mistaken.  Even if that risk is perhaps
>> marginal, it still seems that a provider could nevertheless experience a
>> considerable performance penalty having to field requests with huge query
>> batch sizes.
>> I’m familiar in an academic sense with OpenRefine, but not whether it
>> might control the size of query batches to ensure a provider is not
>> overwhelmed.  That said, if this work is to become a more generic way to
>> provide reconciliation or suggest services to be used by software other
>> than OpenRefine, then it still seems this should be an
>> advertisable/controllable value since one cannot always count on the
>> client being responsible.
>> Yours,
>> Kevin
>> [1]
>> [2]
>> --
>> Kevin Ford
>> Library of Congress
>> Washington, DC

Received on Thursday, 11 June 2020 18:01:32 UTC