URI Colliding: was (RDF graph merging: How useful is it really? (was Re: Blank Nodes Re: Toward easier RDF: a proposal)) from Hugh Glaser on 2018-11-30 (semantic-web@w3.org from November 2018)

From: Hugh Glaser <hugh@glasers.org>
Date: Fri, 30 Nov 2018 17:22:12 +0000
To: David Booth <david@dbooth.org>
Cc: Semantic Web <semantic-web@w3.org>
Message-Id: <AFF85D6B-3D20-4157-AF96-7AD7133C661B@glasers.org>

Hi David,
This post is further to our little discussion in the Blank Nodes and Graph Merging sub-threads.
But I think we are now talking about causing URIs to collide.

tl;dr:
Wouldn’t it be wonderful if I could just automatically generate Wikidata URIs for lots of my data by applying a common function?

The rest:
I don’t know whether many other people (if any) do what you are suggesting, and what I already do: construct URIs for entities from a composition of chosen literals.

But if there are, it prompts some more observations.
Essentially, I get URI collision of these things by using my own standard normalisation and then hashing, but only within the datasets I process.
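
To make "normalise, then hash" concrete, here is a minimal sketch in Python. It is not the actual algorithm behind the URIs in this post: the normalisation rules, the use of SHA-1 (chosen only because the example identifiers further down are 40 hex digits in five groups of eight), the field separator, and the names normalise, composition_key and mint_uri are all assumptions made for illustration.

```python
import hashlib
import re
import unicodedata


def normalise(literal: str) -> str:
    """Hypothetical normalisation: NFKC, case-fold, drop punctuation, collapse spaces."""
    text = unicodedata.normalize("NFKC", literal).casefold()
    text = re.sub(r"[^\w\s]", " ", text)   # punctuation becomes a space
    return " ".join(text.split())          # collapse runs of whitespace


def composition_key(*literals: str) -> str:
    """Hash the normalised literals into a 40-hex-digit key, grouped in eights."""
    # Assumption: fields are joined with the ASCII unit separator before hashing.
    joined = "\x1f".join(normalise(lit) for lit in literals)
    digest = hashlib.sha1(joined.encode("utf-8")).hexdigest()
    return "-".join(digest[i:i + 8] for i in range(0, 40, 8))


def mint_uri(base: str, *literals: str) -> str:
    """Build a URI from a publisher-chosen base and the composition key."""
    return base + composition_key(*literals)


# Two differently formatted spellings of the same address collide.
a = mint_uri("https://example.org/id/", "10 Downing St.", "London")
b = mint_uri("https://example.org/id/", "10  downing st", "LONDON")
print(a == b)  # True
```

Anyone running the same normalisation and hash over the same literals gets the same key, which is all the collision depends on.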

However there is no reason why this should be restricted to my datasets alone.
If others were using exactly the same algorithm, then I would be able to get good collision when interacting with those datasets.
In fact, the URIs themselves wouldn’t need to be the same, just the composed part.
So, if I was using
https://data.glaser.com/id/1b2f49b7-cbd2794a-19009c30-0cb35d9b-75e5bc3e
I could have a high degree of confidence that this was the same postal address as
https://data.david.booth.org/place/1b2f49b7-cbd2794a-19009c30-0cb35d9b-75e5bc3e
if that was useful to me.
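
As a sketch of that check in Python: compare only the last path segment of the two URIs and ignore the host and the rest of the path. The helper names composition_part and same_composition are made up for illustration, and treating the final path segment as the composed part is an assumption based on the two example URIs above.

```python
from urllib.parse import urlparse


def composition_part(uri: str) -> str:
    """Return the last path segment, assumed here to be the composed key."""
    path = urlparse(uri).path
    return path.rstrip("/").rsplit("/", 1)[-1]


def same_composition(uri_a: str, uri_b: str) -> bool:
    """True when two URIs carry the same composed key, whatever their domains."""
    return composition_part(uri_a) == composition_part(uri_b)


mine = "https://data.glaser.com/id/1b2f49b7-cbd2794a-19009c30-0cb35d9b-75e5bc3e"
yours = "https://data.david.booth.org/place/1b2f49b7-cbd2794a-19009c30-0cb35d9b-75e5bc3e"
print(same_composition(mine, yours))  # True
```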

And of course, we could choose a common domain for URIs like that, if we wanted to, so the URIs themselves would collide.

Let’s take Arizona.
If I have a State or Area context, and see the string “Arizona” in my data, then I need to find or create a URI for it.
I can use Spotlight or some matching tool, or do it by hand.
Or I create a new URI for it, and then do the matching work.
But what if I just applied a common algorithm to it (with any required other data), and got back a URI I could use?
And I wouldn’t have to choose, and tie myself to, a particular dataset such as DBpedia, Wikidata, OpenCorporates, Companies House or whatever;
I would have my own URI if I wanted it. But other datasets would stand a chance of aligning with mine automatically, and mine with them.
And in fact those dataset maintainers, or other people, could publish the alignment between their identifiers and the “common composition” ones.
You wouldn’t get uniqueness, of course - the URI for “Arizona” would be different from the one for “AZ”.
But again, the alignments would be publishable; and because they would be universal, they would be worth people putting effort into and making available.
And then I would also get the benefit of having “Arizona” and “AZ” collide in my data, without any work on my part.
(Full disclosure: I happen to have a sameAs service way of gathering and publishing such stuff, but that isn’t the point, I think.
In fact in the first instance this is about making sameAs services redundant!)
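
A small Python sketch of how a published alignment could make “AZ” and “Arizona” collide. The key function is a deliberately simplified stand-in for the shared algorithm, and the ALIGNMENTS table is hypothetical data of the kind someone might publish; none of it comes from an existing service.

```python
import hashlib


def key(literal: str) -> str:
    """Simplified stand-in for the common algorithm: trim, case-fold, SHA-1."""
    return hashlib.sha1(literal.strip().casefold().encode("utf-8")).hexdigest()


# Hypothetical published alignment: variant key -> preferred key.
ALIGNMENTS = {key("AZ"): key("Arizona")}


def canonical(literal: str) -> str:
    """Follow published alignments from a literal's key to its preferred key."""
    current = key(literal)
    seen = set()
    while current in ALIGNMENTS and current not in seen:  # guard against cycles
        seen.add(current)
        current = ALIGNMENTS[current]
    return current


print(key("AZ") == key("Arizona"))              # False: different compositions
print(canonical("AZ") == canonical("Arizona"))  # True: aligned by the table
```

Because the keys do not depend on any one dataset, the same table is reusable by everyone who applies the same algorithm.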

Surprisingly, this might be really useful for people’s names (*not* people, to be clear).
Having a standard way of building a URI for a person’s name would be great.
For example, the normalisation process might include dropping honorifics and post-nominals, again in a standard way.
Then, quickly finding all the candidate Persons for alignment by choosing the ones that have the same name URI would make alignment much simpler.
And this would encompass the knowledge that “William Hill” is the same name as “Bill Hill”, even if they aren’t the same person, if we wanted.
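
A rough Python sketch of name URIs used as a blocking key for alignment. The honorific, post-nominal and nickname lists are tiny illustrative samples, and the base URI and the function names normalise_name, name_uri and candidates_by_name are invented for this example; a real standard would need far more careful rules.

```python
import hashlib
from collections import defaultdict

# Illustrative samples only; a shared standard would have to define these lists.
HONORIFICS = {"mr", "mrs", "ms", "dr", "prof", "sir"}
POST_NOMINALS = {"phd", "obe", "frs", "jr", "sr"}
NICKNAMES = {"bill": "william", "bob": "robert"}


def normalise_name(name: str) -> str:
    """Case-fold, strip honorifics and post-nominals, expand known nicknames."""
    tokens = [t.strip(".,").casefold() for t in name.split()]
    tokens = [t for t in tokens if t]
    while tokens and tokens[0] in HONORIFICS:
        tokens.pop(0)
    while tokens and tokens[-1] in POST_NOMINALS:
        tokens.pop()
    return " ".join(NICKNAMES.get(t, t) for t in tokens)


def name_uri(name: str, base: str = "https://example.org/name/") -> str:
    """URI for the *name* (not the person), built from the normalised form."""
    digest = hashlib.sha1(normalise_name(name).encode("utf-8")).hexdigest()
    return base + digest


def candidates_by_name(people: dict) -> dict:
    """Group person identifiers by name URI to get alignment candidates."""
    groups = defaultdict(list)
    for person_id, name in people.items():
        groups[name_uri(name)].append(person_id)
    return dict(groups)


people = {"p1": "Dr. William Hill, PhD", "p2": "Bill Hill", "p3": "Jane Doe"}
groups = candidates_by_name(people)
print(groups[name_uri("William Hill")])  # ['p1', 'p2']
```

Sharing a name URI only makes p1 and p2 candidates for the same person; deciding whether they really are the same person is still a separate step.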

Note that this is *not* about trying to create standard URIs - it is about trying to create a standard for creating them which will help a useful environment of similar URIs evolve.

It sort of feels like it is moving to a slightly higher level than we have at the moment.

Of course it may be that this has been proposed and discarded, or that someone is doing all this already, and I don’t know. :-)
Maybe the Q numbers in Wikidata are such things, but it doesn’t look like it to me.
Wouldn’t it be wonderful if I could just generate Wikidata URIs for lots of my data by applying a common function?
Without even hitting the Wikidata endpoint.

It feels like what we would be doing is similar to creating DB Keys, but on a global scale.

bNodes? - Well, if we did this, a lot of bNodes would disappear.
Also - people would be less inclined to use xsd:strings as proxies for the entities they represent, as they seem to now.
Which would bring all the goodness of being able to make statements about them, multi-lingual labels, etc.

I have no idea how many sorts of entities could usefully be done in this way - but even if it was just a few, such as addresses and names and organisations, then that may well be a big win for the community.
I would get a massive personal win if Wikidata and OpenCorporates did it :-)

Received on Friday, 30 November 2018 17:22:59 UTC