Re: Q about reconciliation query batch size from Jeremy Jay on 2020-06-11 (public-reconciliation@w3.org from June 2020)

From: Jeremy Jay <jeremy@pbnjay.com>
Date: Thu, 11 Jun 2020 14:31:44 -0400
To: Thad Guidry <thadguidry@gmail.com>
Cc: Tom Morris <tfmorris@gmail.com>, "Ford, Kevin" <kevinford@loc.gov>, "public-reconciliation@w3.org" <public-reconciliation@w3.org>
Message-ID: <CAOT=ff-ZOZYEkL37V66T8qB5MFaTXVEMQL5P444N3NRkS01GPA@mail.gmail.com>
Forgive me if I missed this, but I don't believe there is a contract
requiring the service provider to respond to all queries.
For example, if a request contains 100 queries, the response may return
only the first 10.

The client would have to implement retry and completion to handle that.
I think OpenRefine currently treats a missing result as an empty match.
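
A minimal sketch of such a retry-and-completion loop, assuming the usual
batch shape (a JSON object mapping query ids to query objects, answered by
an object keyed by the same ids). The names reconcile_all and send_batch
are hypothetical; send_batch stands in for whatever HTTP call the client
already makes and is not part of any existing client or of the spec.

```python
from typing import Callable, Dict

Batch = Dict[str, dict]    # query id -> query object
Results = Dict[str, dict]  # query id -> result object


def reconcile_all(queries: Batch,
                  send_batch: Callable[[Batch], Results],
                  max_rounds: int = 5) -> Results:
    """Resend unanswered queries until every id is answered or we give up.

    send_batch is a hypothetical callable that posts one batch and returns
    the (possibly partial) response as a dict keyed by query id.
    """
    results: Results = {}
    pending = dict(queries)
    for _ in range(max_rounds):
        if not pending:
            break
        response = send_batch(pending)
        # Keep only answers for ids that are still outstanding.
        answered = {qid: r for qid, r in response.items() if qid in pending}
        if not answered:
            break  # no progress this round; stop instead of looping
        results.update(answered)
        for qid in answered:
            del pending[qid]
    return results
```

Ids missing from the returned dict were never answered, so a caller can
compute set(queries) - set(results) and report those as "not answered"
rather than "no match".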

Jeremy


On Thu, Jun 11, 2020 at 2:01 PM Thad Guidry <thadguidry@gmail.com> wrote:

> Tom,
>
> Curious: do you yourself have a particular preference for how a service
> should signal rate limiting?
> What methods do you see services use most often for that? HTTP client
> error codes (4xx)? 206 Partial Content returned after a Range header field
> is sent?
> (I see Amazon, Google, etc. mostly use HTTP error responses, specifically
> 403 and 429.)
>
> Thad
> https://www.linkedin.com/in/thadguidry/
>
>
> On Thu, Jun 11, 2020 at 12:45 PM Tom Morris <tfmorris@gmail.com> wrote:
>
>> All of the currently defined limits are for controlling the number of
>> responses sent by the server, rather than the requests sent by the client.
>>
>> While we could add "recommended batch size" and/or "maximum batch size"
>> to the manifest, I'm not sure it would add a lot of value. As a practical
>> matter, clients are going to choose a batch size which balances between
>> amortizing request overhead/latency and responsiveness for progress
>> reporting. They aren't motivated to use giant request sizes. In a DoS
>> situation, the attacker isn't going to be respecting any advertised limits.
>> Note that the server is always free to respond 413 Request Entity Too
>> Large and all modern service frameworks have a configurable limit for
>> this.
>>
>> The spec is also silent on whether you can send simultaneous requests in
>> parallel, rate limits, etc. I think this would be a more valuable area to
>> improve from the point of view of protecting services. The 429 Too Many
>> Requests code and Retry-After: header provide a starting point, but it
>> may be useful to make use of some extended headers in the X-RateLimit-*
>> space.
>>
>> There are two resource buckets associated with large requests: space &
>> time. Once you've accepted the request, the space is used up, but there are
>> no requirements or guarantees on how quickly the request will be processed.
>> If you want to meter work on a per-request basis and take longer to respond
>> to bigger batches, that's completely within the service implementer's right
>> to do.
>>
>> OpenRefine currently uses a fixed batch size of 10 and processes batches
>> serially in a single-threaded fashion, which is inherently rate limiting,
>> but it would be nice to improve the latency hiding and be able to have
>> multiple requests in flight, while still being polite to reconciliation
>> services.
>>
>> Tom
>>
>> On Thu, Jun 11, 2020 at 12:00 PM Ford, Kevin <kevinford@loc.gov> wrote:
>>
>>> Hello all:
>>>
>>> I presume this is the best place to ask this question, which I’ve
>>> harbored for years but which for a variety of reasons I’ve never had real
>>> occasion to ask until now.
>>>
>>> Do I understand correctly that there is no limit to, and no way to
>>> enforce a limit of, the reconciliation query batch size?
>>>
>>> This sentence from the documentation on the GitHub site – “OpenRefine
>>> queries the reconciliation service in batch mode
>>> <https://github.com/OpenRefine/OpenRefine/wiki/Reconciliation-Service-API#multiple-query-mode>
>>> on the first ten items of the column to be reconciled.” [1] - /might/
>>> suggest the size of batches is 10, but I believe we’re to understand that
>>> this particular call basically represents a test before the real, full
>>> reconciliation kicks off. Yes?
>>>
>>> The “Note” under this section of the W3C specification work [2] seems to
>>> make it abundantly clear that there is no restriction on the length of
>>> query batches.
>>>
>>> I didn’t see a clear way to do this via the service manifest.
>>>
>>> If there is no limit on the size, is there a way for a service provider
>>> to impose a limit?  If so, how?  If not, why not?
>>>
>>> Assuming it is not possible to impose a limit, how does one protect a
>>> service from becoming overwhelmed by one extremely large reconciliation
>>> request or a number of big ones?  It seems that this opens up the service
>>> to a DoS attack, but perhaps I am mistaken.  Even if that risk is perhaps
>>> marginal, it still seems that a provider could nevertheless experience a
>>> considerable performance penalty having to field requests with huge query
>>> batch sizes.
>>>
>>> I’m familiar in an academic sense with OpenRefine, but not whether it
>>> might control the size of query batches to ensure a provider is not
>>> overwhelmed.  That said, if this work is to become a more generic way to
>>> provide reconciliation or suggest services to be used by software other
>>> than OpenRefine, then it still seems this should be an
>>> advertisable/controllable value since one cannot always count on the
>>> client being responsible.
>>>
>>> Yours,
>>>
>>> Kevin
>>>
>>> [1]
>>> https://github.com/OpenRefine/OpenRefine/wiki/Reconciliation-Service-API#workflow-overview
>>>
>>> [2]
>>> https://reconciliation-api.github.io/specs/0.1/#sending-reconciliation-queries-to-a-service
>>>
>>> --
>>> Kevin Ford
>>> Library of Congress
>>> Washington, DC
>>>
>>
Received on Thursday, 11 June 2020 19:25:09 UTC