Re: URI Collapsing: was (RDF graph merging: How useful is it really? (was Re: Blank Nodes Re: Toward easier RDF: a proposal)) from David Booth on 2018-11-30 (semantic-web@w3.org from November 2018)

From: David Booth <david@dbooth.org>
Date: Fri, 30 Nov 2018 14:38:17 -0500
To: Hugh Glaser <hugh@glasers.org>
Cc: Semantic Web <semantic-web@w3.org>
Message-ID: <f9e9b9cb-1ebb-a4b8-f61a-14c9aeb9ce54@dbooth.org>
Hi Hugh,

I suggest we call this "URI *Collapsing*" instead of "Colliding", 
because the WebArch already defines "URI Collision" as "Using the same 
URI to directly identify different resources", which is not what we want:
https://www.w3.org/TR/webarch/#URI-collision

More below . . . .

On 11/30/18 12:22 PM, Hugh Glaser wrote:
> Hi David,
> This post is further to our little discussion in the Blank Nodes and Graph Merging sub-threads.
> But I think we are now talking about causing URIs to [collapse].
> 
> tl;dr:
> Wouldn’t it be wonderful if I could just automatically generate WikiData URIs for lots of my data by applying a common function?

Yes!  As long as they are intended to identify the same thing, of course.

> 
> The rest:
> I don’t know whether many other people (if any) do what you are suggesting (as I do), to construct URIs for entities from the composition of chosen literals.

I think it is very common to base a URI on a natural key (which might be 
a composite key).

> 
> But if there are, it prompts some more observations.
> Essentially, I get URI [collapsing] of these things by using my own standard normalisation and then hashing, but only within the datasets I process.
> 
> However there is no reason why this should be restricted to my datasets alone.
> If others were using exactly the same algorithm, then I would be able to get good [collapsing] when interacting with those datasets.
> In fact, they wouldn’t need to be the same URIs, just the composition.
> So, if I was using
> https://data.glaser.com/id/1b2f49b7-cbd2794a-19009c30-0cb35d9b-75e5bc3e
> I could have a high degree of confidence that this was the same postal address as
> https://data.david.booth.org/place/1b2f49b7-cbd2794a-19009c30-0cb35d9b-75e5bc3e
> if that was useful to me.
> 
> And of course, we could choose a common domain for URIs like that, if we wanted to, so the URIs themselves would collide.

Yes, *that* would be the most beneficial, I think.  That is what I am 
suggesting for auto-generated URIs based on keys of n-ary relations.
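
As a rough illustration of the "normalise, then hash" idea discussed above, here is a minimal Python sketch. Everything in it is an assumption made for illustration: the normalisation rules (Unicode NFKC, case folding, whitespace collapsing), the use of SHA-1 (the 40 hex digits in the example URIs are merely consistent with it), the function names, and the sample address. Hugh's actual algorithm is not given in this thread.

```python
import hashlib
import unicodedata


def normalise(value: str) -> str:
    """Hypothetical normalisation: NFKC, case folding, collapsed whitespace."""
    text = unicodedata.normalize("NFKC", value).casefold()
    return " ".join(text.split())


def key_digest(parts) -> str:
    """Hash a composite key into five hyphen-separated groups of 8 hex digits."""
    # Newline is a safe separator here: normalise() collapses all whitespace
    # to single spaces, so a newline can never occur inside a normalised part.
    composite = "\n".join(normalise(p) for p in parts)
    digest = hashlib.sha1(composite.encode("utf-8")).hexdigest()  # 40 hex digits
    return "-".join(digest[i:i + 8] for i in range(0, 40, 8))


def mint_uri(base: str, parts) -> str:
    """Append the key digest to a publisher-chosen (or shared) base URI."""
    return base + key_digest(parts)


# Two publishers, different base URIs, same (made-up) postal address.
mine = mint_uri("https://data.glaser.com/id/",
                ["1 Main  Street", "Phoenix", "ARIZONA"])
yours = mint_uri("https://data.david.booth.org/place/",
                 ["1 main street", " phoenix ", "Arizona"])
```

With this sketch, `mine` and `yours` differ only in their base; the final path segment is identical, which is the "same composition" case. Passing the same base to both calls would make the URIs themselves collapse.
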

> 
> Let’s take Arizona.
> If I have a State or Area context, and see the string “Arizona” in my data, then I need to find or create a URI for it.
> I can use Spotlight or some matching tool, or do it by hand.
> Or I create a new URI for it, and then do the matching work.
> But what if I just applied a common algorithm to it (with any required other data), and got back a URI I could use?
> And I wouldn’t have to choose and be tying myself into using a particular dataset, such as DBpedia, WikiData, OpenCorporates, Companies House or whatever;
> I would have my own URI if I wanted it. 

Yes, if there is a risk that you might be trying to identify a different 
thing than what other people are trying to identify (using those same 
keys), then you should use your own URI for it -- not a standard URI.

> But other datasets would stand a chance of aligning with mine automatically, and mine with them.
> And in fact those dataset maintainers, or other people, could publish the alignment between their identifiers and the “common composition” ones.

True.  Or they could just switch to using the "common composition" URIs.
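
One way a maintainer might publish such an alignment is as owl:sameAs statements in N-Triples. The sketch below is only an illustration: the choice of owl:sameAs as the linking predicate, the function name, and the "common composition" URI under example.org are all assumptions, not something agreed in this thread.

```python
OWL_SAME_AS = "http://www.w3.org/2002/07/owl#sameAs"


def alignment_triple(local_uri: str, common_uri: str) -> str:
    """Return one N-Triples line linking a local URI to a common one."""
    return f"<{local_uri}> <{OWL_SAME_AS}> <{common_uri}> ."


# Hypothetical common base; the hash segment is taken from Hugh's example.
line = alignment_triple(
    "https://data.glaser.com/id/1b2f49b7-cbd2794a-19009c30-0cb35d9b-75e5bc3e",
    "https://example.org/common/1b2f49b7-cbd2794a-19009c30-0cb35d9b-75e5bc3e",
)
```
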

> You wouldn’t get uniqueness, of course - the URI for “Arizona” would be different from the one for “AZ”.

Right, that's a different issue that requires different knowledge to 
collapse.

> But again, the alignments would be publishable; and because they would be universal, so worth people putting effort into and making available.
> And then I would also get the benefit of having “Arizona” and “AZ” [collapse] in my data, without any work on my part.
> (Full disclosure: I happen to have a sameAs service way of gathering and publishing such stuff, but that isn’t the point, I think.
> In fact in the first instance this is about making sameAs services redundant!)
> 
> Surprisingly, this might be really useful for people’s names (*not* people, to be clear).
> Having a standard way of building a URI for a person’s name would be great.
> For example the normalisation process might include losing honorifics and post-nominals; again, in a standard way.
> Then, quickly finding all the candidate Persons for alignment by choosing the ones that have the same name URI would make alignment much simpler.
> And this would encompass the knowledge that “William Hill” is the same name as “Bill Hill”, even if they aren’t the same person, if we wanted.

Interesting.  I can definitely see how that could be useful.  However, I 
would caution against automatically saying that "William Hill" is the 
same name as "Bill Hill", because there are communities where a child is 
traditionally named after a parent's nickname.   So if William's 
nickname was Bill, he might have a child whose formal name is actually 
Bill.  But this is a slight digression.
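
To make the name case concrete, here is a minimal Python sketch of a name-URI function that drops honorifics and post-nominals before hashing. The word lists, the base URI, the use of SHA-1, and the function names are all hypothetical. In line with the caution above, it deliberately does not map nicknames, so "William Hill" and "Bill Hill" get different name URIs; a nickname mapping would have to be a separate, optional step.

```python
import hashlib

# Hypothetical, deliberately tiny word lists; a real standard would need
# agreed (and probably per-language) lists.
HONORIFICS = {"mr", "mrs", "ms", "dr", "prof", "sir"}
POST_NOMINALS = {"phd", "md", "obe"}


def normalise_name(name: str) -> str:
    """Lower-case the name and strip leading honorifics / trailing post-nominals."""
    tokens = [t.strip(".,").casefold() for t in name.split()]
    tokens = [t for t in tokens if t]
    while tokens and tokens[0] in HONORIFICS:
        tokens.pop(0)
    while tokens and tokens[-1] in POST_NOMINALS:
        tokens.pop()
    return " ".join(tokens)


def name_uri(name: str, base: str = "https://example.org/name/") -> str:
    """Build a URI for the *name* (not the person) from its normalised form."""
    digest = hashlib.sha1(normalise_name(name).encode("utf-8")).hexdigest()
    return base + digest
```

Under these assumptions, `name_uri("Dr. William Hill, PhD")` and `name_uri("william hill")` are the same URI, so candidate Persons for alignment could be found by grouping on it.
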

> 
> Note that this is *not* about trying to create standard URIs - it is about trying to create a standard for creating them which will help a useful environment of similar URIs evolve.
> 
> It sort of feels like it is moving to a slightly higher level than we have at the moment.

Agreed.  And for ease-of-use, I think we need to bring RDF up to a 
higher level like this.

> 
> Of course it may be that this has been proposed and discarded, or that someone is doing all this already, and I don’t know. :-)
> Maybe the Q numbers in Wikidata are such things, but it doesn’t look like it to me.
> Wouldn’t it be wonderful if I could just generate WikiData URIs for lots of my data by applying a common function?
> Without even hitting the WikiData endpoint.
> 
> It feels like what we would be doing is similar to creating DB Keys, but on a global scale.

Yes, exactly.

> 
> bNodes? - Well, if we did this, a lot of bNodes would disappear.
> Also - people would be less inclined to use xsd:strings as proxies for the entities they represent, as they seem to now.
> Which would bring all the goodness of being able to make statements about them, multi-lingual labels, etc.
> 
> I have no idea how many sorts of entities could usefully be done in this way - but even if it was just a few, such as addresses and names and organisations, then that may well be a big win for the community.
> I would get a personal massive win if WikiData and OpenCorporates did it :-)

Great use case!

David Booth
Received on Friday, 30 November 2018 19:38:40 UTC