Re: Addresses have no easy identity was Re: Blank Nodes Re: Toward easier RDF: a proposal from Joshua Shinavier on 2018-12-05 (semantic-web@w3.org from December 2018)

From: Joshua Shinavier <joshsh@uber.com>
Date: Tue, 4 Dec 2018 17:08:21 -0800
To: dave.e.reynolds@gmail.com
Cc: hugh@glasers.org, semantic-web@w3.org
Message-ID: <CAPc0Ous=Q2snXSV4ktktn+pm9Ms0z7KwYAmOxW9B4iKpuhBvfQ@mail.gmail.com>
Just to add another data point to the "addresses are hard" thread, at Uber
we have also invested quite some time into standardizing vocabulary around
addresses. Prior to standardization, there were many dozens of address
types in use within the company (and still are), most of which are of the
basic street/city/state/country/zip kind, similar to schema.org's
PostalAddress. After a great deal of discussion, we opted not to support
such a format as a standard. Most of the reasons for this boil down to
items on the page Thomas linked. Instead, we distinguish between structured
addresses (a bag of components which validate against any of a number of
black-box address schemas) and addresses for display. Google makes a
similar distinction in its Places API. Address validation, formatting,
normalization, etc. are API concerns that go well beyond the vocabulary
itself, requiring significant background knowledge. I would not be
optimistic about finding canonical identifiers for addresses, though
geocoded lat/lon is probably the next best thing.
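
To make the structured/display distinction above a little more concrete,
here is a minimal Python sketch. It is illustrative only, not our actual
vocabulary or API: the schema names (us_style, uk_style), the component
keys, and the classes StructuredAddress and DisplayAddress are all
hypothetical, and real validators would need far more background
knowledge than a key check.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Mapping, Tuple

# A "schema" here is a black box: it accepts or rejects a bag of components.
Schema = Callable[[Mapping[str, str]], bool]


def us_style(components: Mapping[str, str]) -> bool:
    # Hypothetical rule: street/city/state/zip must all be present.
    return {"street", "city", "state", "zip"} <= set(components)


def uk_style(components: Mapping[str, str]) -> bool:
    # Hypothetical rule: a postcode plus either a house number or a house name.
    return "postcode" in components and (
        "number" in components or "name" in components
    )


# Assumed registry of black-box schemas; a real one would be much larger.
SCHEMAS: Dict[str, Schema] = {"us_style": us_style, "uk_style": uk_style}


@dataclass(frozen=True)
class StructuredAddress:
    """A bag of components that may validate against any number of schemas."""

    components: Mapping[str, str]

    def matching_schemas(self) -> List[str]:
        return [name for name, accepts in SCHEMAS.items() if accepts(self.components)]


@dataclass(frozen=True)
class DisplayAddress:
    """Pre-formatted lines meant for humans; no structure is implied."""

    lines: Tuple[str, ...]

    def render(self) -> str:
        return "\n".join(self.lines)
```

Note that neither type carries an identifier: passing a schema says nothing
about whether two bags of components refer to the same place.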

Josh


On Tue, Dec 4, 2018 at 4:16 PM Dave Reynolds <dave.e.reynolds@gmail.com>
wrote:

> Hi Hugh,
>
> On 04/12/2018 22:48, Hugh Glaser wrote:
> > Thanks Dave.
> > Yes, I agree with all the detail.
> >
> > My interpretation is that you are confirming what I was saying - that
> > the general case is a nightmare.
>
> On that we are agreed :)
>
> > This is a problem of trying for a standard for the addresses - not only
> > is it fiendishly complicated, but no standard will ever satisfy all the
> > reasons you might want to identify something, such as an address.
> > I agree, which is why I was negative about trying to capture it
> > centrally.
> > On the other hand, SW people *are* representing addresses all the time,
> > using sufficient specificity for their purposes.
> > And others will be doing the same thing to the same level.
>
> Sure, *representing* addresses is just fine. It's *identifying*
> addresses that's hard.
>
> > And businesses in the UK find that the number/postcode pair is pretty
> > much all they need to deliver almost all online purchases.
>
> If you are only dealing with consumers, not other businesses, and mostly
> focus on houses in urban areas, and don't care about secondary addresses
> (saons - like flat number, unit number, floor etc), and if you only care
> about delivery (so there's a human at the other end interpreting the
> address) and if we can agree to differ on the semantics of "almost all"
> then that's possibly true.
>
> However, many businesses, even under those constraints, solve it by
> getting a human (the one placing an order) to do the matching. You use
> number/postcode to constrain and order the search on your (very
> expensive) master address list and get the user to pick the right one
> from the result list. *Then* you have an identifier.
>
> > It seems to me that you are concerned with the "global" solution -
>
> No, simply pointing out that matching real world entities is hard for
> domain specific reasons and no amount of RDF/OWL makes much difference
> to that.
>
> Actually, all I was really doing was sharing painfully gathered
> experience that in the UK, postcode + number is far from a nearly unique
> key for all addresses. Trust me on this. I've sacrificed a large part of
> the last three months to learning this lesson in great detail :(
>
> > I want to worry about a more local problem, and what small steps can be
> > taken to help people in common cases, so that SW & LD are more useful for
> > developers.
>
> I've lost track of how this thread about thing equality relates to the
> goal of making SW/LD/RDF easier. Which is why I opened with "I don't
> want to get embroiled in the main thread(s)" and just commented on the
> nature of addresses.
>
> [While URIs can be off-putting I don't think they are *that* much of a
> problem for developers. Even where they are a barrier it's the choice of
> namespace that's the challenge ("you mean we have to host a DNS domain
> and maintain it?"). In my experience most developers are very happy with
> the notion that some domains have "natural" composite keys that you can
> use to identify things and some domains you have to do work to create
> some (often human) process to manage your reference identifiers and then
> use those as keys. Once you have your keys, one way or another, then
> creating identifiers by combining some sort of namespace with an
> encoding/hash of the composite keys is bread and butter stuff, even
> outside of SW/LD.]
>
> Dave
>
>
> > Or are you saying that because specifying addresses as well as you would
> > like is so hard, we shouldn't bother trying to do something simpler and
> > useful for many purposes?
>
> > It is about URIs, and they aren't in the noise - they are the things
> > that people currently generate for themselves, and get little or no help
> > with that generation, or linking up.
> >
> >> On 4 Dec 2018, at 11:24, Dave Reynolds <dave.e.reynolds@gmail.com>
> >> wrote:
> >>
> >> I don't want to get embroiled in the main thread(s) but, just in case
> >> anyone is *really* dealing with UK addresses rather than using them as
> >> rhetorical examples, then ...
> >>
> >> On 03/12/2018 23:37, Anthony Moretti wrote:
> >>> I see your point Hugh, especially in your case because for UK
> >>> addresses consisting of only house number and postcode structural equality
> >>> is sufficient for address equality. Decentralized will work very well in
> >>> that case.
> >>
> >> Sadly that's a long way from being true. UK addresses within a postcode
> >> may be identified by house name, house name + number, business name (with no
> >> house name or number at all), any of those plus a secondary address etc
> >> etc. Even when there's a house "number" sometimes it's actually a number
> >> range not a single number and there's considerable ambiguity on how those
> >> ranges are expressed and what the "definitive" range for a given property
> >> really is.
> >>
> >> Identity of UK addresses is simply not something you can express in OWL
> >> or any logic close to it. You need an address reconciliation algorithm to
> >> map your address to a maintained identifier set such as a UPRN or UDPRN.
> >> The reconciliation process will have error rates that you will need to
> >> manage and recover from; there's no closed, guaranteed algorithm.
> >>
> >> Once you have the UPRN or UDPRN or whatever you can create URIs or
> >> some inverse functional property as you wish. Except that even then the
> >> official identifier schemes like that aren't perfect and have ... oddities
> >> ... in them that can still mess you up.
> >>
> >> Generating unique keys for resources based on hashing a few properties
> >> is all very well in simple cases but, at least in my experience, real world
> >> problems are nothing like that simple or clean. You need serious effort to
> >> create and maintain identifier schemes and to reconcile source data against
> >> those schemes. Details like URIs or bNodes seem to me rather down in the
> >> noise.
> >>
> >> Dave
> >>
> >>> On Mon, Dec 3, 2018 at 3:07 PM Nathan Rixham <nathan@webr3.org
> >>> <mailto:nathan@webr3.org>> wrote:
> >>>     Hugh, do you mean something like
> >>>     bnode.id = sha256(serialise(bnode))
> >>>     On Mon, 3 Dec 2018, 22:58 Hugh Glaser <hugh@glasers.org
> >>>     <mailto:hugh@glasers.org>> wrote:
> >>>         This is not directly about blank nodes, but is a reply to a
> >>>         message in the thread.
> >>>         I’m certainly agreeing that we should work towards common
> >>>         understanding of Thing equality.
> >>>         And addresses are a great place to start.
> >>>         In order for equality to be defined, I think that means you
> >>>         first need an idea of what an unambiguous address looks like.
> >>>         Having an oracle that defines what an unambiguous Thing looks
> >>>         like is one organisational structure, and it would be great if
> >>>         schema.org <http://schema.org> could lead the way.
> >>>         It particularly helps people who just want an off the shelf
> >>>         solution, especially if they have no knowledge of the Thing
> >>>         domain.
> >>>         However I (and perhaps David Booth) am after something more
> >>>         anarchic, that can function in a decentralised way (if I dare
> >>>         to
> >>>         use that term! :-) )
> >>>         For example, I might decide that I think that House Number and
> >>>         PostCode is enough.
> >>>         (UK people will know that this is a commonly-used way of
> >>>         choosing an address, although it may well not be satisfactory
> >>>         for some purposes, I’m sure.)
> >>>         That may well be sufficient for me to interwork with datasets
> >>>         from Companies House, the Land Registry and a bunch of other
> >>>         UK-based organisations, plus many other datasets.
> >>>         Having a simple standard way to create keys for such things
> >>>         facilitates that, without any standardisation process and all
> >>>         that entails in weaknesses and strengths of trying to get
> >>>         agreement on what an unambiguous address might look like on a
> >>>         world scale for all purposes.
> >>>         Just generating a URI, without needing to make any service
> >>>         calls
> >>>         (having found where they are and chosen the one you want and
> >>>         compromised on it, etc.) or anything seems to me a way of
> >>>         making
> >>>         all the interlinking so much more accessible for us all.
> >>>         It is even future proof:- using such a URI means that if it is
> >>>         about something new (UK postcodes change all the time :-(, and
> >>>         there are more dead ones than live ones), the oracle doesn’t
> >>>         tell me anything it didn’t have until I ask again.
> >>>         In a key-generating world, my new shiny key will slowly align
> >>>         with all the other key URIs as they get created.
> >>>         So yeah, all strength to anyone who wants to take on the
> >>>         central
> >>>         roles, but not at the expense of killing the anarchic solution,
> >>>         please.
> >>>         Cheers
> >>>          > On 3 Dec 2018, at 22:10, Anthony Moretti
> >>>         <anthony.moretti@gmail.com <mailto:anthony.moretti@gmail.com>>
> >>>         wrote:
> >>>          >
> >>>          > Cheers for agreeing William. On the topic of incomplete
> >>>          > blank
> >>>         nodes Henry I'd give them another type, the partial address
> >>>         example you give I'd give the type AddressComponent, or
> >>>         something to that effect. I could be wrong, but it's not a
> >>>         valid
> >>>         Address if it's a blank node and no other information in the
> >>>         graph completes it.
> >>>          >
> >>>          > Anthony
> >>>          >
> >>>          > On Mon, Dec 3, 2018 at 1:56 PM William Waites
> >>>         <wwaites@tardis.ed.ac.uk <mailto:wwaites@tardis.ed.ac.uk>>
> >>>          > wrote:
> >>>          > > standards like schema:PostalAddress should possibly define
> >>>         relevant
> >>>          > > operations like equality checking too.
> >>>          >
> >>>          > Exactly.
> >>>          >
> >>>          >
> >>
> >
>
>
Received on Wednesday, 5 December 2018 01:09:25 UTC