Re: Blank Nodes Re: Toward easier RDF: a proposal

> On 26 Nov 2018, at 00:31, David Booth <david@dbooth.org> wrote:
> 
> On 11/23/18 7:59 AM, Hugh Glaser wrote:
>> . . .
>> I find it is actually easier to generate URIs than try to manage Blank Nodes.
> 
> I am curious what conventions you use, in generating URIs.
Oh dear, now people will see some of the filth of the way I do things ;-)
I've never detailed it, but your curiosity should be assuaged.
I'm guessing others have detailed it somewhere, but this is my attempt, which may turn out to be quite long, and answer more than you were expecting:

Firstly, separate to the choice of algorithm, is the decision of how opaque or not you want the URIs to be.
This depends on things like:
do you want to hide the source of the data (there may be stuff in the data from which you can infer it came from, say, Wikidata);
do you want to hide the type - URIs often have indication of type somewhere in them;
do you think your users will try to guess URIs, and so get stuff wrong, and you need to stop that;
are you the sort of person who likes to obscure things.
So I decide on what the input string is (see below) and then decide what the encoding is.
If I want opaque, I use sha1 to do that - either the whole thing or the end part:
https://example.org/id/1b2f49b7-cbd2794a-19009c30-0cb35d9b-75e5bc3e
or
https://example.org/id/address/1b2f49b7-cbd2794a-19009c30-0cb35d9b-75e5bc3e
If not, I just use the (normalised) input string.
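
For anyone wanting to see the opaque/non-opaque choice in code, here is a minimal Python sketch, not the actual tooling used. It assumes the hashed form is the 40 hex digits of a SHA-1 digest split into five groups of eight (which matches the shape of the example URIs above). The names BASE, grouped_sha1 and mint_uri are made up for illustration.

```python
import hashlib

BASE = "https://example.org/id/"  # assumed base, modelled on the example.org URIs above


def grouped_sha1(text: str) -> str:
    """SHA-1 of the UTF-8 text as 40 hex digits, in five hyphen-separated groups of 8."""
    digest = hashlib.sha1(text.encode("utf-8")).hexdigest()
    return "-".join(digest[i:i + 8] for i in range(0, 40, 8))


def mint_uri(key: str, type_segment: str = "", opaque: bool = True) -> str:
    """Build a URI from an already-normalised key.

    type_segment (e.g. "address") is optional; leave it empty to hide the type.
    """
    prefix = BASE + (type_segment + "/" if type_segment else "")
    return prefix + (grouped_sha1(key) if opaque else key)
```

Because the hash depends only on the key, minting from the same key twice gives the same URI, which is what makes re-imports stable.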

Secondly, how to compute the input string (your curiosity) from the Resource (I'll call it) fodder:
Clearly, the task is to pick some part(s) of the fodder that will identify the Resource with the correct uniqueness.
By and large, I am only working with literals as the fodder, but it can be URIs sometimes.
1) If we are lucky, there is a field that does that - it could be an ID field (which may need a prefix on it to avoid clashing with other stuff), or it could even be a URI (which I will treat as a literal for these purposes).
2) Failing that, of course, I look to see if there is a combination of fields that will nicely identify the Resource.
3) It may be that the direct data is not sufficient, so I will need to fetch associated data from referenced resources. This happens in transforming to new RDF from source RDF, which I do a lot. Often the source RDF does not have satisfactory URIs for me (for a number of possible reasons), but I can follow my nose (FYN) and get more literals that will help.
3) Failing that, the fallback is to concatenate everything I can find, and probably hash it.
4) And finally, if all else fails, a random hash can be generated, but suffers from lack of reproducibility (below).
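
The cascade above can be sketched roughly as follows. This is only an illustration under assumptions: the record is a flat dict of literals, the function and parameter names are invented, the " - " separator is arbitrary, and the step of fetching associated data from referenced resources is left out because it depends entirely on the source.

```python
import uuid
from typing import Mapping, Sequence


def input_string(record: Mapping[str, str],
                 id_field: str = "id",
                 key_fields: Sequence[str] = (),
                 id_prefix: str = "") -> str:
    """Pick the fodder for a URI, trying the most reliable option first."""
    # A single identifying field, optionally prefixed to avoid clashes.
    if record.get(id_field):
        return id_prefix + record[id_field]
    # A chosen combination of fields, used only if all of them are present.
    if key_fields and all(record.get(f) for f in key_fields):
        return " - ".join(record[f] for f in key_fields)
    # Fallback: concatenate everything available, in sorted field order so
    # that re-importing the same record gives the same string.
    if record:
        return " - ".join(f"{k}={record[k]}" for k in sorted(record))
    # Last resort: random, so NOT reproducible across imports.
    return uuid.uuid4().hex
```

For example, input_string({"number": "123", "postcode": "PO56 2DF"}, key_fields=("number", "postcode")) gives "123 - PO56 2DF", which would then be normalised and possibly hashed.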

In choosing this, it can actually be really helpful - I get to understand what is really the essence of the Resources, and that may help me downstream.

The challenge is to pick the fields that will nicely identify the Resource uniquely, without using too much extra.
This is because I actually want these URIs to potentially be the same as URIs about the same thing from other input Resources - that way I get linkage without the overhead of co-reference analysis.
And I do want stability - I want to be able to re-import the same data and get the same URIs next time - even if some of the data has changed (acquisition date or age or whatever, are obvious ones).
In fact, we are in a precision/recall world here, but that would be another topic.

I then normalise the (combined) string to at least the legal URI characters, but usually more:
I also remove all diacritical marks (people are very bad at being consistent), and map to lower case.
And then I do any sha1 stuff that I want.
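
A minimal Python sketch of that normalisation step, assuming a deliberately strict rule (keep only a-z and 0-9, turn everything else into a hyphen); the function name normalise and the exact character set are assumptions, not a description of the real tools.

```python
import re
import unicodedata


def normalise(text: str) -> str:
    """Strip diacritics, lower-case, and replace anything outside a-z/0-9 with '-'."""
    # NFKD splits accented letters into base letter + combining mark.
    decomposed = unicodedata.normalize("NFKD", text)
    no_marks = "".join(c for c in decomposed if not unicodedata.combining(c))
    # Letters with no decomposition (e.g. "ø") are not a-z, so they become "-".
    return re.sub(r"[^a-z0-9]", "-", no_marks.lower())
```

With this rule, normalise("123 - PO56 2DF") gives "123---po56-2df" and normalise("Dvořák") gives "dvorak".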

I think we need examples.
Here is an example of an example:
:foo :hasAddress ["123, Acacia Avenue, PO56 2DF, Portsmouth, UK"]
That is, it is example input of a bunch of data going into the process.
So I want RDF for the address.
First I need a URI for the address.
If you understand UK postcodes (!), you will know that the pair (123, PO56 2DF) does the business.
So I might use https://example.org/id/address/123---po56-2df or https://example.org/id/52e4729b-9cfc4622-3c214b8a-ae2882fc-ae0c6299
Or https://example.org/id/address/123---po562df (https://example.org/id/0d26760e-5ca009ff-5d074f6f-8220bef1-80a828ee) if I am being a bit clever, and canonicalising the postcode better.
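
As a rough Python illustration of the "bit clever" variant: canonicalise the postcode by dropping whitespace before building the key, so that differently spaced copies of the same postcode give the same URI. The function names and the "---" separator are assumptions chosen to match the shape of the example above.

```python
import re


def canonical_postcode(postcode: str) -> str:
    """Drop all whitespace and lower-case, so 'PO56 2DF' and 'po562df' agree."""
    return re.sub(r"\s+", "", postcode).lower()


def address_key(number: str, postcode: str) -> str:
    """Key for an address from the (house number, postcode) pair."""
    return f"{number.strip().lower()}---{canonical_postcode(postcode)}"
```

Here address_key("123", "PO56 2DF") and address_key("123", "po56  2df") both give "123---po562df".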
So now I have the first URI - but we aren't finished.
I would never, never, ever, ever say
https://example.org/id/0d26760e-5ca009ff-5d074f6f-8220bef1-80a828ee :hasStreet "Acacia Avenue"
To me, this is the most fundamental error you can make in RDFisation - I could go on, but I won't. :-)
So I need a URI for the street.
Unfortunately, this benefits from yet further domain knowledge (well, you *do* need to use domain knowledge to RDFise things in practice).
The input fodder can't be "Acacia Avenue" - there are loads of those (apparently).
"Acacia Avenue" and "Portsmouth" or "123" won't do either.
It may be that "Acacia Avenue" and "PO56 2DF" is OK.
So I now get something like
https://example.org/id/0d26760e-5ca009ff-5d074f6f-8220bef1-80a828ee :hasStreet https://example.org/id/095345ff-6d79c9f6-ae867910-b29983e6-740a0aa1
Actually, this is still dangerous - there may be two different "Acacia Avenue"s in the same postcode.
(We are actually asking the question, are street names unique within a postcode in the UK - I happen to know the answer is "no", but would not have risked it even if I didn't - the potential gains of the co-referencing are outweighed by the risk of a mistake, for most applications.)
It's unlikely, but if I want precision over recall on the implied co-referencing going on, I can't risk it.
In this case, adding anything else, such as "123", is unlikely to help.
I would have to bring in more related stuff, such as the URI I have generated (or the labels) for :foo.
In this way the street URI becomes "the street which is part of foo's address", which is exactly what I wanted - I was only being loose in trying to get away with anything less :-)
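
The trade-off can be shown in a few lines of Python. This is a sketch with invented names: street_key with only street and postcode favours recall (more accidental co-reference), while passing an owner_key (the key or URI already minted for :foo) favours precision.

```python
def street_key(street: str, postcode: str, owner_key: str = "") -> str:
    """Input string for a street; add owner_key to scope it to one resource."""
    parts = [street, postcode]
    if owner_key:
        # "the street which is part of <owner>'s address"
        parts.append(owner_key)
    return " - ".join(p.strip() for p in parts)
```

Two addresses with the same street name and postcode share a key only when owner_key is left out.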
I now need to do the "Portsmouth".
Same process.
Would "Portsmouth" be unique? I doubt it. "Portsmouth" + "UK"?
Well, actually the question is, are towns unique in the universe:- clearly not. Nor in the UK. And we can guess that a street in the town wouldn't help.
But "The town that has postcode PO56 2DF in it" looks good to me.
Actually, if I am being really clever, "The town that has postcodes that start with PO in it" would be great -  it will mean that the co-referencing of towns between different addresses will happen rather than just if they are in the same postcode, but I need to do the work to pick apart the postcode.
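
Picking apart the postcode for this purpose might look like the Python sketch below. It assumes the part wanted is the postcode area, i.e. the one or two letters before the first digit; the function names and the key wording are hypothetical. Whether a postcode area really corresponds to a single town is a domain question the code does not settle.

```python
import re


def postcode_area(postcode: str) -> str:
    """Return the leading one or two letters of a UK postcode, upper-cased."""
    match = re.match(r"\s*([A-Za-z]{1,2})\d", postcode)
    if match is None:
        raise ValueError(f"cannot find a postcode area in {postcode!r}")
    return match.group(1).upper()


def town_key(postcode: str) -> str:
    """Input string meaning 'the town that has postcodes starting with <area>'."""
    return "town-with-postcode-area-" + postcode_area(postcode).lower()
```

With this, addresses in PO56 2DF and PO1 3AX get the same town key, so their towns co-refer without any later analysis.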

That's probably more than enough ;-)
One word of warning.
As you do this, you need to look at the output to sanity check your decisions.
It is easily done - resolving a few URIs will quickly show you if things have collided that you didn't want.
And some examples of mistakes:
You might have thought that combining composer and title was enough for music?
Look at: Johann Sebastian Bach, Prelude and Fugue in G minor
Or authors, paper title and year? Nope.
If that's all you've got, then you have to go the random route.
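
The sanity check can also be partly automated. A minimal Python sketch, assuming you keep a list of (minted URI, source record id) pairs during the import; find_collisions and the record ids are invented for illustration. It only lists candidates: a URI shared by several source records may be wanted co-reference or an unwanted collision, and a person still has to look.

```python
from collections import defaultdict
from typing import Dict, Hashable, Iterable, List, Tuple


def find_collisions(minted: Iterable[Tuple[str, Hashable]]) -> Dict[str, List[Hashable]]:
    """Map each URI to its distinct sources, keeping only URIs with more than one."""
    sources = defaultdict(set)
    for uri, source in minted:
        sources[uri].add(source)
    return {uri: sorted(s, key=repr) for uri, s in sources.items() if len(s) > 1}
```

For the music example, two different records both keyed on composer plus title would show up under one URI in the result.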

Note that I never go off and find out more stuff during the URI creation - that isn't usually an option - I get what is available, and have to simply work with that - that's the web, for the most part.
I can't go and find the BWV catalogue number, or whatever - if I had that I would probably be using it! (Possibly with the title - catalogue numbers can change, which is another mistake that may need to be avoided.)

I must stress that I always feel it is worth doing all this - I get proper URIs for all the topics of discourse, which is why I am using Linked Data in the first place.
Without these, I can't do anything with the address - it is a pretty much useless structure hanging off a bnode where I can't make any statements about the components.
I can't reliably attach a map; find out who lives in the same street; show you the Wikipedia page for the country; show you different language labels for the town, etc. etc.
Even if I manage to make the inferences later, I can't assert the statements that capture that inferential knowledge.

Maybe my applications are very different from other people's.
In general, I find myself importing data from sources that have information about "thing"s.
So I hardly ever (never?) feel the need for Bags, Collections, Lists, Trees or that sort of stuff.
If there are multiple values about a thing, then they are simply multiple triples with the same predicate.
If they need to be ordered in some way, it is likely that there is some reason for having the order which is surfaced in the data, in which case I represent the ordering by reflecting the reason. For example, Track order on a CD is a bunch of :hasTrack predicates where each track has its number. Children in a family have ages (and if you want the children to be in an order, then what was the knowledge you used to ascertain that order? - represent that instead, you probably will want it or even find it more useful.)
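
To make the track example concrete, here is a small Python sketch that emits plain triples as tuples rather than an RDF list. The track URI pattern and the predicates :trackNumber and :title are hypothetical; only :hasTrack comes from the text above.

```python
from typing import List, Tuple

Triple = Tuple[str, str, str]


def track_triples(cd: str, titles: List[str]) -> List[Triple]:
    """One :hasTrack triple per track; the order lives in each track's number."""
    triples: List[Triple] = []
    for number, title in enumerate(titles, start=1):
        track = f"{cd}/track/{number}"  # hypothetical URI pattern
        triples.append((cd, ":hasTrack", track))
        triples.append((track, ":trackNumber", str(number)))
        triples.append((track, ":title", title))
    return triples
```

A consumer that wants the running order sorts the tracks by their number; nothing depends on the order in which the triples happen to be stored.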

I am guessing this isn't what you wanted :-)
You were expecting some clever automagic thing.
In that case, you have to take option (3) or (4).
But I think that converting the source information into RDF is a task that gains value from some care, so I am willing to put that work in.
And of course, I have tools that help me do all this, by the way.

> 
> Thanks,

My absolute pleasure, David - thank you for asking - I hope you felt it was worth reading through to the end.

I need to pack now for the winter sun trip at sparrow-fart tomorrow - wahay!

> David Booth
> 

-- 
Hugh
023 8061 5652

Received on Monday, 26 November 2018 13:26:04 UTC