Re: Q about reconciliation query batch size

All of the currently defined limits are for controlling the number of
responses sent by the server, rather than the requests sent by the client.

While we could add "recommended batch size" and/or "maximum batch size" to
the manifest, I'm not sure it would add a lot of value. As a practical
matter, clients are going to choose a batch size which balances between
amortizing request overhead/latency and responsiveness for progress
reporting. They aren't motivated to use giant request sizes. In a DoS
situation, the attacker isn't going to respect any advertised limits.
Note that the server is always free to respond with 413 Request Entity Too
Large, and all modern service frameworks have a configurable limit for this.
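
As a rough sketch of what that server-side check could look like (the
limits and the check_batch name are made up for illustration, and I'm
assuming the handler already has the decoded JSON value of the "queries"
parameter in hand):

```python
import json

MAX_BODY_BYTES = 1_000_000  # hypothetical cap on the raw payload size
MAX_QUERIES = 50            # hypothetical cap on queries per batch


def check_batch(body: bytes) -> tuple[int, str]:
    """Return an (HTTP status, reason) pair for a raw batch payload."""
    # Reject oversized payloads before spending any time parsing them.
    if len(body) > MAX_BODY_BYTES:
        return 413, "request body too large"
    try:
        queries = json.loads(body)
    except ValueError:  # covers both JSON and UTF-8 decoding errors
        return 400, "body is not valid JSON"
    # A batch is a JSON object keyed by query id (q0, q1, ...).
    if not isinstance(queries, dict):
        return 400, "expected a JSON object of queries"
    if len(queries) > MAX_QUERIES:
        return 413, f"batch has {len(queries)} queries, limit is {MAX_QUERIES}"
    return 200, "ok"
```

In practice the byte limit would normally be set in the framework or
reverse proxy configuration rather than in application code; only the
query-count check needs to know anything about reconciliation.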

The spec is also silent on whether clients may send requests in parallel,
on rate limits, etc. I think this would be a more valuable area to
improve from the point of view of protecting services. The 429 Too Many
Requests code and the Retry-After header provide a starting point, but it
may be useful to make use of some extended headers in the X-RateLimit-*
space.
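
For the client side of 429 handling, something along these lines might
do. This is only a sketch: retry_delay and its default fallback are my
own invention, it handles just the two standard Retry-After forms (a
number of seconds or an HTTP date), and it ignores the non-standard
X-RateLimit-* headers entirely.

```python
from datetime import datetime, timezone
from email.utils import parsedate_to_datetime
from typing import Optional


def retry_delay(status: int, retry_after: Optional[str],
                now: Optional[datetime] = None, default: float = 1.0) -> float:
    """Seconds a client should wait before retrying; 0.0 if not throttled."""
    if status != 429:
        return 0.0
    if retry_after is None:
        return default  # server gave no hint, fall back to a guess
    value = retry_after.strip()
    # Form 1: delay in whole seconds, e.g. "Retry-After: 120"
    if value.isascii() and value.isdigit():
        return float(value)
    # Form 2: HTTP date, e.g. "Retry-After: Thu, 11 Jun 2020 17:45:30 GMT"
    try:
        when = parsedate_to_datetime(value)
    except (TypeError, ValueError):  # exception type varies by Python version
        return default
    if when.tzinfo is None:
        when = when.replace(tzinfo=timezone.utc)
    now = now or datetime.now(timezone.utc)
    return max(0.0, (when - now).total_seconds())
```

A client would sleep for the returned number of seconds before resending
the same batch, ideally with a cap on the number of retries.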

There are two resource buckets associated with large requests: space and
time. Once you've accepted the request, the space is used up, but there are
no requirements or guarantees on how quickly the request will be processed.
If you want to meter work on a per-request basis and take longer to respond
to bigger batches, that's completely within the service implementer's
rights.

OpenRefine currently uses a fixed batch size of 10 and processes batches
serially in a single-threaded fashion, which is inherently rate-limiting,
but it would be nice to improve the latency hiding and be able to have
multiple requests in flight, while still being polite to reconciliation
services.
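
To illustrate what I mean by multiple requests in flight with a
politeness cap, here is a minimal sketch in Python. It is not how
OpenRefine is implemented; MAX_IN_FLIGHT, make_batches, reconcile_all and
the send_batch callable (which would do the actual HTTP POST and return
the response keyed by query id) are all hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Dict, List

BATCH_SIZE = 10    # the fixed batch size mentioned above
MAX_IN_FLIGHT = 3  # hypothetical cap on concurrent requests


def make_batches(queries: List[str], size: int = BATCH_SIZE) -> List[Dict[str, dict]]:
    """Split queries into batches, with ids q0, q1, ... unique across the run."""
    return [
        {f"q{i}": {"query": q}
         for i, q in enumerate(queries[start:start + size], start)}
        for start in range(0, len(queries), size)
    ]


def reconcile_all(queries: List[str],
                  send_batch: Callable[[Dict[str, dict]], Dict[str, dict]],
                  max_in_flight: int = MAX_IN_FLIGHT) -> Dict[str, dict]:
    """Send all batches with at most max_in_flight requests outstanding."""
    results: Dict[str, dict] = {}
    # The pool size bounds concurrency; map() yields responses in batch order.
    with ThreadPoolExecutor(max_workers=max_in_flight) as pool:
        for response in pool.map(send_batch, make_batches(queries)):
            results.update(response)
    return results
```

Setting max_in_flight to 1 gives back the current serial behavior, so the
cap could be lowered automatically when a service starts returning 429s.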

Tom

On Thu, Jun 11, 2020 at 12:00 PM Ford, Kevin <kevinford@loc.gov> wrote:

> Hello all:
>
>
>
> I presume this is the best place to ask this question, which I’ve harbored
> for years but which for a variety of reasons I’ve never had real occasion
> to ask until now.
>
>
>
> Do I understand correctly that there is no limit to, and no way to enforce
> a limit of, the reconciliation query batch size?
>
>
>
> This sentence from the documentation on the Github site – “OpenRefine
> queries the reconciliation service in batch mode
> <https://github.com/OpenRefine/OpenRefine/wiki/Reconciliation-Service-API#multiple-query-mode>
> on the first ten items of the column to be reconciled.” [1] -  /might/
> suggest the size of batches is 10, but I believe we’re to understand that
> this particular call basically represents a test before the real, full
> reconciliation kicks off.  Yes?
>
>
>
> The “Note” under this section of the W3C specification work [2] seems to
> make it abundantly clear that there is no restriction on the length of
> query batches.
>
>
>
> I didn’t see a clear way to do this via the service manifest.
>
>
>
> If there is no limit on the size, is there a way for a service provider to
> impose a limit?  If so, how?  If not, why not?
>
>
>
> Assuming it is not possible to impose a limit, how does one protect a
> service from becoming overwhelmed by one extremely large reconciliation
> request or a number of big ones?  It seems that this opens up the service
> to a DoS attack, but perhaps I am mistaken.  Even if that risk is perhaps
> marginal, it still seems that a provider could nevertheless experience a
> considerable performance penalty having to field requests with huge query
> batch sizes.
>
>
>
> I’m familiar in an academic sense with OpenRefine, but not whether it
> might control the size of query batches to ensure a provider is not
> overwhelmed.  That said, if this work is to become a more generic way to
> provide reconciliation or suggest services to be used by software other
> than OpenRefine, then it still seems this should be an
> advertisable/controllable value since one cannot always count on the
> client being responsible.
>
>
>
> Yours,
>
> Kevin
>
>
>
> [1]
> https://github.com/OpenRefine/OpenRefine/wiki/Reconciliation-Service-API#workflow-overview
>
> [2]
> https://reconciliation-api.github.io/specs/0.1/#sending-reconciliation-queries-to-a-service
>
>
>
> --
>
> Kevin Ford
>
> Library of Congress
>
> Washington, DC
>
>
>

Received on Thursday, 11 June 2020 17:45:10 UTC