Re: I-D Action: draft-pauly-httpbis-geoip-hint-01.txt from Dustin Mitchell on 2024-10-28 (ietf-http-wg@w3.org from October to December 2024)

From: Dustin Mitchell <djmitche@google.com>
Date: Mon, 28 Oct 2024 12:56:47 -0400
To: Ted Hardie <ted.ietf@gmail.com>
Cc: David Schinazi <dschinazi.ietf@gmail.com>, Ben Schwartz <bemasc@meta.com>, Stephen Farrell <stephen.farrell@cs.tcd.ie>, Watson Ladd <watsonbladd@gmail.com>, "ietf-http-wg@w3.org" <ietf-http-wg@w3.org>
Message-ID: <CALMtyTSQssSiMGTFLaz85H5Wzm=hR3_CoAFXm2gwgX8JQ9UBKQ@mail.gmail.com>

Hi Ted --

The context of this proposal is technologies to reduce the utility of IP
addresses as a user fingerprint. The more fluid the pool of IP addresses
users can use, the more effective that can be. Considering only
connectivity, we could do this with a relatively small, globally shared
pool of IP addresses. But servers' association of IP address to location
makes this infeasible. So, breaking that reliance on the IP-to-location
association is a key goal of the proposal.

This also provides an opportunity to incrementally improve the situation
around tracking of users' location by making location an active signal under
the control of the client (rather than the client's ISP). As a baseline,
the proposal aims to provide no more information to a server than it would
receive without a privacy proxy. And, it leaves room for clients to further
obscure location, such as by rounding municipalities to the nearest major
metro.
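
To make the "rounding" idea concrete, here is a minimal sketch of a client
coarsening its own location before sharing it. Everything in it is an
assumption for illustration: the GeoHint type, the coarsen() function, and
the METRO_OF table are hypothetical names, and the draft does not specify
this logic.

```python
from dataclasses import dataclass
from typing import Optional

# Hypothetical lookup table: (country, region, municipality) -> major metro.
# A real client would need a maintained data source; this is illustrative.
METRO_OF = {
    ("US", "US-MA", "Cambridge"): "Boston",
    ("US", "US-MA", "Somerville"): "Boston",
    ("US", "US-CA", "Palo Alto"): "San Jose",
}


@dataclass(frozen=True)
class GeoHint:
    """Location the client is willing to share (hypothetical shape)."""
    country: str
    region: Optional[str] = None
    city: Optional[str] = None


def coarsen(hint: GeoHint) -> GeoHint:
    """Round a municipality to its nearest major metro.

    If the municipality is not in the table, the city is dropped entirely
    rather than sent as-is, so the result is never finer than the input.
    """
    if hint.city is None:
        return hint
    metro = METRO_OF.get((hint.country, hint.region, hint.city))
    if metro is None:
        return GeoHint(hint.country, hint.region)
    return GeoHint(hint.country, hint.region, metro)
```

Dropping an unknown city instead of passing it through is a design choice
in this sketch, picked so that a gap in the table fails toward less detail.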

This discussion has identified some issues with that baseline, particularly
in cases where a client's request might arrive at the server with one of
several source addresses - which one should be used to generate the client
hint? How does the client know which one the server sees?

The suggested approach, if I'm understanding correctly, would allow the
client some control over the geolocation associated with its IP address,
but not break that association. This would be fundamentally incompatible
with multiple clients sharing the same IP address. Footnote (2) suggests
instead associating the location with the connection ID (or presumably in
non-QUIC scenarios the TCP 4-tuple), but making that association using
out-of-band signalling. A slight tweak to instead use in-band signalling
puts that location information in the connection itself, perhaps in the
form of a client hint.
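
As a rough sketch of why in-band signalling tolerates shared addresses,
consider a server that prefers a location carried in the request itself and
only falls back to an IP-keyed lookup. The header name, the GEOIP_DB table,
and location_for_request() are placeholders I made up for illustration; the
draft's actual header name and value format are not reproduced here.

```python
from typing import Mapping, Optional

# Placeholder header name, NOT the one defined by the draft.
HINT_HEADER = "example-geo-hint"

# Hypothetical coarse geo-IP database keyed by source address.
# 192.0.2.10 stands in for a shared proxy egress address.
GEOIP_DB = {"192.0.2.10": "US"}


def location_for_request(source_ip: str,
                         headers: Mapping[str, str]) -> Optional[str]:
    """Pick a location for one request.

    Assumes header names were lower-cased by the caller. The in-band hint
    wins when present; otherwise fall back to the IP-keyed database.
    """
    hint = headers.get(HINT_HEADER)
    if hint:
        return hint
    return GEOIP_DB.get(source_ip)


# Two clients behind the same egress address can carry different hints,
# while a client that sends no hint still gets the coarse IP-based answer.
client_a = location_for_request("192.0.2.10", {HINT_HEADER: "US-MA"})
client_b = location_for_request("192.0.2.10", {HINT_HEADER: "CA-ON"})
no_hint = location_for_request("192.0.2.10", {})
```

Because the location travels with the request, nothing on the server is
keyed by the egress address, so two clients sharing that address do not
overwrite each other's location.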

Dustin


On Sat, Oct 26, 2024 at 10:40 AM Ted Hardie <ted.ietf@gmail.com> wrote:

> Hi David,
>
> A couple of replies in-line.
>
> On Sat, Oct 26, 2024 at 12:57 AM David Schinazi <dschinazi.ietf@gmail.com>
> wrote:
>
>> Hi Ted,
>>
>> While going back to the drawing board can be sad, I'm definitely open to
>> it. We have specific design requirements, but we're not wedded to any
>> particular solution. I'm not sure I understand your alternative proposal
>> though. In today's world, privacy proxies already publish their egress IPs
>> publicly along with the corresponding geos. (For example, Apple's is at
>> [1].) One issue is that everyone hasn't ingested that list, but that could
>> be solved over time.
>>
>
> First, the proposal I made is little more than a few hand waves, but I am
> willing to work something out in more detail if you and Tommy (or others)
> are interested.  I think anything in the space relies on a couple of
> assumptions about the willingness of different parties to change both their
> technology and their business practices, but I think a proposal without a
> client hint could work out to be simpler.
>
>
>> The other issue is that we'd like to reduce the granularity of this
>> published mapping. This has two advantages: first it saves the proxy
>> provider money now that IPv4 addresses are expensive, and second it
>> improves privacy - because now the egress IP has a more coarse geographic
>> mapping, and only the servers that request the client hint get access to
>> the more detailed location.
>>
>
> To be clear on this design goal, you wish to have a public view of the
> geo-location of the IP which is coarse and broadly available plus a
> different, detailed view of the geo-location which is made available only
> to the clients of the privacy proxy.  When the client cares to, it can
> provide this more detailed data.
>
> This seems at odds with what the document states as the method by which
> the client populates the data, which specifies only that it gets it from a
> geo-ip database.  It's pretty contrary to the basic privacy property I
> thought that you hoped it conferred, which was that sharing that
> information would be no worse than the publicly available data.  Now it
> appears it will be more detailed.  I've tried to account for that design
> goal in my sketch below, but I think the privacy properties depend a great
> deal on how much more detailed it is.
>
>
>> The browser can also now choose to refuse to send the client hint if it
>> determines that the server shouldn't have this information. Unless I'm
>> misunderstanding your proposal, it doesn't provide either of these two
>> advantages.
>>
>
> I think we need a whiteboard, but I'll lay out my thinking in a little
> more detail.
>
> A server advertises via .well-known that it supports the receipt of a
> source-originated geofeed in a specified format (which would be limited to
> avoid the lat/long issues and similar privacy issues).
>
> A proxy may test for that service and provide a geo-ip mapping for a
> single Egress IP, which is authenticated by a return routability check. (1)
> This would have a specified TTL.
>
> The server that has received that geo-location will place it in a local
> geo-ip database, which it consults when it wants to provide
> geography-specific resources.  (Whether it also consults other geo-ip data
> is not something that this proposal can control, but my guess is that it
> would sanity check the provided data against other data).
>
> A client indicates a desire to share more detailed geographic data with a
> particular service in its interaction with a particular proxy.  When that
> happens, the proxy updates the server's view of the geo-ip by associating
> that egress IP with the new data.  See (2) below for the concurrency issue,
> but the TTL for this would be very low and/or the proxy would reset it to
> the coarse version after the end of the session.
>
> There are definite trade-offs to this approach, but there are some
> advantages.  First, the client cannot be misconfigured to provide truly
> detailed location data via this.  I think this could happen with your
> proposal because the client may get multiple views of its geolocation if it
> uses multiple proxy services and some of those may be much more detailed
> than the Apple or Google services (yes, I am once again alluding to
> enterprises here). Second, this approach means servers get fresh, if
> coarse, geo-ip data from proxies which is valid even for clients that have
> not been updated to use the new hints.
>
> Again, my reason for sketching this out isn't to claim that this is the
> best approach.  It's to convince you that we need to have an architectural
> discussion before accepting this document.  I have learned a great deal
> about what you're trying to build, but I think there are other use cases
> with other risks that have to be considered before we standardize anything.
>
> regards,
>
> Ted Hardie
>
> (1) You could expand this to a range, but you would then need something
> like ACME to validate control of the range.
> (2) To handle concurrent connections via the same proxy to the same
> instance of the service with different locations or willingness to share
> location, you would need to disambiguate them via something like a
> connection ID.  That requires a lot more thought than this email contains,
> but there are some advantages (since you can trivially change the
> connection ID).
>
>
>
>> David
>>
>> [1] https://mask-api.icloud.com/egress-ip-ranges.csv
>>
>> On Fri, Oct 25, 2024 at 6:54 AM Ted Hardie <ted.ietf@gmail.com> wrote:
>>
>>> Thanks to Tommy for his previous comments; since this occurs later in
>>> the thread and addresses one of the points I made as well, I'm choosing to
>>> answer here, but I have read the full thread to this point.
>>>
>>> On Thu, Oct 24, 2024 at 9:54 PM David Schinazi <dschinazi.ietf@gmail.com>
>>> wrote:
>>>
>>>> Hi everyone,
>>>>
>>>> I'm realizing I've been using some terminology without defining it,
>>>> leading to some confusion. Let's create a distinction between two distinct
>>>> kinds of IP-hiding technologies.
>>>>
>>>> 1) privacy proxies. Examples of these include Google's IP Protection
>>>> and Apple's iCloud Private Relay. These are affiliated with a browser, and
>>>> integrated pretty tightly with that browser (and/or operating system). The
>>>> goal of these is to prevent websites from having access to the user's IP
>>>> address, because that represents a stable tracking identifier. However,
>>>> these privacy proxies do not try to hide the user's coarse location. They
>>>> look at the client's IP address, map that to a city (for Google, we map it
>>>> to the closest grouping of 500'000 people for example), and then the
>>>> privacy proxy picks an egress IP address that's registered to that city in
>>>> a public geofeed. While websites have lost the ability to see the client's
>>>> IP address, they can still access the client's coarse location. Note that
>>>> this coarseness is often configurable by the user.
>>>>
>>>>
>>> Combined with Tommy's answer, what we see is a problem with data known
>>> to the geo-ip database about the egress IP selected by the privacy proxy.
>>> If it is stale or wrong, the client gets a worse experience.  You want to
>>> improve that experience by having the privacy proxy select the location
>>> (based on its knowledge of source IP) rather than the server select it
>>> based on its geo-ip lookup of the egress IP.  This would presumably also
>>> allow the privacy proxies to use fewer egress IPs.
>>>
>>> The difficulty I have here is that your technical solution is in no way
>>> limited to that deployment.  As Ben's pointed out, there are a bunch of
>>> related deployments in which a standard VPN provider might want the same
>>> thing, and I am sure that once this is standardized we will see it used in
>>> places where there is no proxy in use at all (enterprises, for example,
>>> using DHCP location on the device to populate this and then give
>>> location-appropriate responses at service portals etc.).
>>>
>>> If we step back to the key issue, a completely different approach would
>>> be for a service to indicate its willingness to get crowd-sourced geofeeds
>>> from privacy proxies or other intermediaries.  Those intermediaries could
>>> test for that service and provide an up-to-date and appropriate geolocation
>>> for their egress IPs.  That sorts the issue of the geolocation being stale
>>> in a database by allowing for the creation of a local database that is
>>> correct, but leaves the rest of the system as it is.  That approach has its
>>> own technical issues (you'd need to manage authentication, for example by a
>>> return routability check), but the simple fact that there are completely
>>> different approaches is why I want to push us back to the architectural
>>> discussion.
>>>
>>> I'm sure that's not terribly welcome feedback given that this document
>>> has already been percolating for 2 years, but I think that there is ample
>>> evidence that folks would be willing to engage in the discussion if you
>>> wanted to set up a design-team mailing list and hash it out.
>>>
>>> Thanks again for your willingness to engage and on the improvements and
>>> comments to date.
>>>
>>> regards,
>>>
>>> Ted Hardie
>>>
>>
Received on Monday, 28 October 2024 16:57:05 UTC