Re: Q about reconciliation query batch size

Tom,

Curious, do you yourself have a particular preference for how a service
should signal rate limiting?
What methods do you see services use most often for that? HTTP client error
codes (4xx)? 206 Partial Content returned after a Range header field is sent?
(I see Amazon, Google, etc. mostly use HTTP error responses, specifically
403 and 429.)

Thad
https://www.linkedin.com/in/thadguidry/


On Thu, Jun 11, 2020 at 12:45 PM Tom Morris <tfmorris@gmail.com> wrote:

> All of the currently defined limits are for controlling the number of
> responses sent by the server, rather than the requests sent by the client.
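>
> For instance, an individual query in a batch can carry a "limit" property
> to cap how many candidates the service returns for it, but nothing caps
> how many queries the client puts in the batch. A rough illustration of a
> two-query batch payload (query strings made up for the example):
>
>   {
>     "q0": { "query": "Library of Congress", "limit": 3 },
>     "q1": { "query": "OpenRefine", "limit": 3 }
>   }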
>
> While we could add "recommended batch size" and/or "maximum batch size" to
> the manifest, I'm not sure it would add a lot of value. As a practical
> matter, clients are going to choose a batch size which balances between
> amortizing request overhead/latency and responsiveness for progress
> reporting. They aren't motivated to use giant request sizes. In a DoS
> situation, the attacker isn't going to respect any advertised limits.
> Note that the server is always free to respond with 413 Request Entity Too
> Large, and all modern service frameworks have a configurable limit for
> this.
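>
> Purely as a sketch (the cap, route name, and numbers here are made up, not
> part of the spec or manifest), a service built on Flask, say, could enforce
> its own cap and answer oversized batches with 413:
>
>   import json
>   from flask import Flask, request, abort, jsonify
>
>   app = Flask(__name__)
>   app.config["MAX_CONTENT_LENGTH"] = 1024 * 1024  # byte-size cap; Flask answers 413 itself
>   MAX_QUERIES_PER_BATCH = 50                      # hypothetical per-batch cap
>
>   @app.route("/reconcile", methods=["POST"])
>   def reconcile():
>       queries = json.loads(request.form["queries"])  # multi-query payload
>       if len(queries) > MAX_QUERIES_PER_BATCH:
>           abort(413)                                 # Request Entity Too Large
>       return jsonify({key: {"result": []} for key in queries})  # stub results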
>
> The spec is also silent on whether you can send requests in parallel, rate
> limits, etc. I think this would be a more valuable area to improve from the
> point of view of protecting services. The 429 Too Many Requests code and
> Retry-After header provide a starting point, but it may also be useful to
> adopt some of the extended headers in the X-RateLimit-* space.
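>
> To sketch what honoring that would look like on the client side
> (hypothetical code, using Python and the requests library), roughly:
>
>   import json
>   import time
>   import requests
>
>   def post_batch(url, queries):
>       """Send one batch, backing off whenever the service says 429."""
>       while True:
>           resp = requests.post(url, data={"queries": json.dumps(queries)})
>           if resp.status_code != 429:
>               return resp
>           delay = resp.headers.get("Retry-After", "5")
>           # Retry-After may also be an HTTP-date; this sketch only handles
>           # the delay-seconds form and falls back to a conservative default.
>           time.sleep(int(delay) if delay.isdigit() else 5)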
>
> There are two resource buckets associated with large requests: space &
> time. Once you've accepted the request, the space is used up, but there are
> no requirements or guarantees on how quickly the request will be processed.
> If you want to meter work on a per-request basis and take longer to respond
> to bigger batches, that's completely within the service implementer's right
> to do.
>
> OpenRefine currently uses a fixed batch size of 10 and processes batches
> serially in a single-threaded fashion, which is inherently rate-limiting,
> but it would be nice to improve the latency hiding and be able to have
> multiple requests in flight, while still being polite to reconciliation
> services.
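>
> Purely as a sketch of what "multiple requests in flight, but still polite"
> could look like (not what OpenRefine does today; the cap of 3 is arbitrary):
>
>   from concurrent.futures import ThreadPoolExecutor
>
>   BATCH_SIZE = 10      # current OpenRefine default
>   MAX_IN_FLIGHT = 3    # hypothetical politeness cap
>
>   def reconcile_all(url, queries):
>       # Chunk into batches of 10 and key them q0, q1, ... (multi-query mode),
>       # then let post_batch (sketched above) handle any 429 back-off.
>       batches = [
>           {f"q{j}": q for j, q in enumerate(queries[i:i + BATCH_SIZE])}
>           for i in range(0, len(queries), BATCH_SIZE)
>       ]
>       with ThreadPoolExecutor(max_workers=MAX_IN_FLIGHT) as pool:
>           return list(pool.map(lambda batch: post_batch(url, batch), batches))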
>
> Tom
>
> On Thu, Jun 11, 2020 at 12:00 PM Ford, Kevin <kevinford@loc.gov> wrote:
>
>> Hello all:
>>
>>
>>
>> I presume this is the best place to ask this question, which I’ve
>> harbored for years but which for a variety of reasons I’ve never had real
>> occasion to ask until now.
>>
>>
>>
>> Do I understand correctly that there is no limit to, and no way to
>> enforce a limit on, the reconciliation query batch size?
>>
>>
>>
>> This sentence from the documentation on the GitHub site – “OpenRefine
>> queries the reconciliation service in batch mode
>> <https://github.com/OpenRefine/OpenRefine/wiki/Reconciliation-Service-API#multiple-query-mode>
>> on the first ten items of the column to be reconciled.” [1] – /might/
>> suggest the size of batches is 10, but I believe we’re to understand that
>> this particular call basically represents a test before the real, full
>> reconciliation kicks off. Yes?
>>
>>
>>
>> The “Note” under this section of the W3C specification work [2] seems to
>> make it abundantly clear that there is no restriction on the length of
>> query batches.
>>
>>
>>
>> I didn’t see a clear way to do this via the service manifest.
>>
>>
>>
>> If there is no limit on the size, is there a way for a service provider
>> to impose a limit?  If so, how?  If not, why not?
>>
>>
>>
>> Assuming it is not possible to impose a limit, how does one protect a
>> service from becoming overwhelmed by one extremely large reconciliation
>> request or a number of big ones?  It seems that this opens up the service
>> to a DoS attack, but perhaps I am mistaken.  Even if that risk is
>> marginal, it still seems that a provider could experience a
>> considerable performance penalty having to field requests with huge query
>> batch sizes.
>>
>>
>>
>> I’m familiar in an academic sense with OpenRefine, but I don't know
>> whether it controls the size of query batches to ensure a provider is not
>> overwhelmed.  That said, if this work is to become a more generic way to
>> provide reconciliation or suggest services to be used by software other
>> than OpenRefine, then it still seems this should be an
>> advertisable/controllable value, since one cannot always count on the
>> client being responsible.
>>
>>
>>
>> Yours,
>>
>> Kevin
>>
>>
>>
>> [1]
>> https://github.com/OpenRefine/OpenRefine/wiki/Reconciliation-Service-API#workflow-overview
>>
>> [2]
>> https://reconciliation-api.github.io/specs/0.1/#sending-reconciliation-queries-to-a-service
>>
>>
>>
>> --
>>
>> Kevin Ford
>>
>> Library of Congress
>>
>> Washington, DC
>>
>>
>>
>

Received on Thursday, 11 June 2020 18:01:32 UTC