Re: Lists of tagged strings in RDF from Hugh Glaser on 2021-06-11 (semantic-web@w3.org from June 2021)

From: Hugh Glaser <hugh@glasers.org>
Date: Fri, 11 Jun 2021 12:36:48 +0100
To: mail@frensjan.nl
Cc: SW-forum <semantic-web@w3.org>
Message-Id: <10BD1588-95AA-4827-8806-FBF28E2AA948@glasers.org>
Thanks for the extended explanation - very interesting.

A question that springs to mind, looking at your examples:
"Is it helpful to distinguish givenName & familyName?"

If I understand your environment correctly, you are processing documents, and performing entity resolution.
So quite often all you will get is a sequence of names, and it is unpredictable whether it will be Bartók Béla or Béla Bartók, or how many of the family names a Spanish entity reference uses.

However, you do want to capture the intel of the exact labels that have been found "in the wild".
(I am guessing this will include misspellings too and possibly non-Latin scripts, if my experience is anything to go by - my system has pages of different Tchaikovsky renderings, especially when you include the patronymic!)

What you could do:
Shift the thinking a little - the knowledge you are capturing is more of labels that are used for the entity, not knowledge of the entity itself.
So record every different entity reference you find - simply in a string.
Also, if the context provides it, you can attach language tags to it, which may be useful to you.
This is very reliable, from the user's point of view - they get exactly the name they wanted.
(Semantic Scholar turns Eduard K. de Jong Frz into E. K. D. J. Frz!).
A quick skos:prefLabel will be helpful, should you need to choose one for display purposes.
For searching purposes and entity resolution, having the original data can be very helpful, especially in ranking possible results.
If you get more knowledge from elsewhere, such as given names and family names, they can be added as they come in.
If you are able to infer those from an entity reference, then that knowledge too can be added.
Again, localisation and internationalisation can be recorded too.
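
A rough sketch of what this could look like in Turtle. The ex: namespace, the ex:bartok IRI, the specific label strings and the use of rdfs:label for the raw strings are assumptions made for illustration only; any label property that suits the application would do.

```turtle
@prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#> .
@prefix skos: <http://www.w3.org/2004/02/skos/core#> .
@prefix ex:   <http://example.org/> .

# ex:bartok is a hypothetical IRI for the resolved entity.
ex:bartok
    # Every distinct entity reference, kept exactly as it was found.
    rdfs:label "Bartók Béla"@hu ,     # language tag taken from the context
               "Béla Bartók"@en ,
               "Bela Bartok" ,        # no language known: plain string
               "Bartok, B." ;
    # One label picked for display purposes.
    skos:prefLabel "Béla Bartók"@en .
```

Given names and family names that arrive later can be added as further statements about the same resource without touching any of these labels.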

You may feel that you end up with maintenance/synchronisation problems doing it this way.
But I think not.
The strings "from the wild" are just that - any more detailed names that are inferred or captured are additional knowledge, possibly based on a wide range of such strings.
Changing one does not require or even imply changing another.

Back to your original question :-)
By the time you have structured things like that, the ordering of given or family names becomes much less relevant.
You can record them just as what they are - names - you don't need the ordering, since you have plenty of examples elsewhere.
 foaf:givenName "José" ;
 foaf:givenName "Plácido" ;
 foaf:familyName "Domingo";
 foaf:familyName "Embil" .
Note that this way you avoid the possibly n-squared problem of all the different options, as you have listed in your latest posting.
If you really wanted, you could keep each category of names in a list, for ordering, but I would say it just complicates the searching use, and does not provide anything useful.
Of course you will find that "Plácido Domingo" is the most common string you get (in the wild), so you can promote that to skos:prefLabel, in curating your dataset (or a user will tell you their preferred label). You can record all sorts of metadata of frequency and sources as well, if you want.
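
Put together, a minimal sketch of such a record might look as follows. The ex:placido IRI, the ex: namespace and the use of rdfs:label for the strings from the wild are illustrative assumptions; foaf:givenName, foaf:familyName and skos:prefLabel are the existing FOAF and SKOS properties.

```turtle
@prefix foaf: <http://xmlns.com/foaf/0.1/> .
@prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#> .
@prefix skos: <http://www.w3.org/2004/02/skos/core#> .
@prefix ex:   <http://example.org/> .

# ex:placido is a hypothetical IRI for the entity.
ex:placido a foaf:Person ;
    # Name parts, unordered: one statement per part, no combinations.
    foaf:givenName  "José" , "Plácido" ;
    foaf:familyName "Domingo" , "Embil" ;
    # Strings exactly as found in the sources; these carry the ordering.
    rdfs:label "Plácido Domingo" ,
               "José Plácido Domingo" ,
               "José Plácido Domingo Embil" ;
    # The most common string, promoted for display.
    skos:prefLabel "Plácido Domingo" .
```

The four name parts are stated once each, while the variants that were actually observed live in the labels, so adding a newly seen variant is a single extra label.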

Just some thoughts - I hope that helps.
Hugh

> On 10 Jun 2021, at 21:43, Rumph, Frens Jan <mail@frensjan.nl> wrote:
> 
> Hello Hugh,
> 
> Thank you for your thoughts!
> 
> > When people move from an existing application in a programming language to using RDF, it can often seem that things don't move over easily and naturally; and indeed that can be the case.
> 
> I'm active in the area of gathering, processing and organising "intel". Think databases, searching for and matching of entities, etc. Most of the data modelling in the applications I'm working on maps very well to RDF; they are already expressed in either large triple tables or in virtually partitioned predicate tables with mostly primitive data (text, numbers, dates, etc.). There's the matter of provenance (all our statements are annotated with source information), but I won't dwell on e.g. RDF-star here.
>  
> > Others have commented many times on this list that RDF is neither a programming language nor a data structure description, so perhaps that is not surprising.
> 
> The reason for my interest in RDF is that our data model is already pretty closely aligned, and I'd like to tap into a richer ecosystem. I don't want 'it' to be like the java I already have, but it should give an idea on where I'm coming from. Let's say that we have some 'literals' that have structure as well as (some) semantics.
> 
> But the main blockers are in the area of person names but also addresses. For the latter we use a format similar to names ('annotated strings' / tagged lists of strings) somewhat similar to how Google Maps models them: https://developers.google.com/maps/documentation/geocoding/overview#GeocodingResponses.
> 
> The applications at hand are tasked with a lot of searching and entity resolution / matching. Some of the sources used actually have fields like first given name, second given name, first family name and second family name (in this case in a Spanish context). Another example is sources discerning between given (formal) names and a so-called roepnaam (a fairly Dutch concept). Obviously some sources don't have such high fidelity. My goal of the initial design was to a) not attempt to capture all nuances of person name cultures, but a fair amount and b) to not get stuck in an ever-growing but always incomplete set of name formats.
> 
> In any case, we want to support the notion of people going by various names; hence going beyond associating given and family names directly with a person. So Herman Iván would ideally be described as 
> 
> [ a :Person ;
>   :name [ :familyName "Herman" ; :givenName "Iván" ] ;
>   :name [ :givenName "Ivan" ; :familyName "Herman" ]
> ]
> 
> Sacha Baron Cohen would I guess ideally be described as:
> 
> [ a :Person ;
>   :name [ :givenName "Sacha" ; :familyName "Cohen" ] ;
>   :name [ :givenName "Sacha" ; :familyName "Baron Cohen" ] ;
>   :name [ :givenName "Sacha" ; :givenName "Noam" ; :familyName "Baron Cohen" ] ;
>   :name [ :nickName "Ali G" ] ;
>   ...
> ]
> 
> José Plácido Domingo Embil could be described in our systems as:
> 
> [ a :Person ;
>   :name [ :givenName "Plácido" ; :familyName "Domingo" ] ;
>   :name [ :givenName "José" ; :givenName "Plácido" ; :familyName "Domingo" ] ;
>   :name [ :givenName "José" ; :givenName "Plácido" ; :familyName "Domingo" ; :familyName "Embil" ]
> ]
> 
> Xi Jinping could be described in our systems as:
> 
> [ a :Person ;
>   :name [ :familyName "Xi" ; :givenName "Jinping" ] ;
>   :name [ :familyName "習" ; :givenName "近平" ] ;
> ]
> 
> Note that we're not all that interested in capturing what someone's actual name is, we're mostly interested in what someone goes by; i.e. what could be considered identifying. (I am painfully aware that the namespace in the Netherlands is a lot less crowded than in e.g. China). And we're solely dependent on what data and how much structure is available. And most of the time, there is a difference in what format 'seed' data is available, possible search formats and the structure of actual records that can be matched.
> 
> A final point of interest in the intended context of use is that application users are able to change name input provided earlier. So ideally there is no 'expansion' of earlier input in e.g. a format of unordered but annotated name elements and a full name string.
> 
> Thanks again, for thinking along with me. There are probably concessions to be made. Feedback like yours stresses my thinking; much appreciated!
> 
> Best regards,
> Frens Jan
> 
> 
> On Thu, Jun 10, 2021 at 9:07 PM Hugh Glaser <hugh@glasers.org> wrote:
> Hi Frens Jan,
> 
> Sorry to perhaps be a bit difficult here, rather than answer the question as put.
> 
> I read your posting with some unease.
> In general:
> When people move from an existing application in a programming language to using RDF, it can often seem that things don't move over easily and naturally; and indeed that can be the case.
> Others have commented many times on this list that RDF is neither a programming language nor a data structure description, so perhaps that is not surprising.
> 
> Without the specific set of ways in which you will be using the knowledge (rather than an abstract "well I want it to be like the Java I already have"), it is hard to suggest alternatives.
> > This allows reconstruction of the name into a string while at the same time expressing the components of the name. So it captures the roles of the elements of a name (e.g. given names, family names) *as well as* their order (given names aren't first everywhere). Also, it allows expressing multiple names. E.g. in multiple languages / scripts. Or even aliases used in different areas of the world.
> Since you talk about "given names", it seems to me that you could use
>         :givenNames "Frens Jan"
> 
> More specifically, you seem to want to tread an almost impossible line: capturing a small amount of knowledge about a person's name, without having anything extra.
> If you really want to be able to embrace the multi-cultural stuff of even just UK, HUN & ESP, for example, you need to think what you will do with people like
> Bartók Béla and our own Ivan Herman, who might also be known as Herman Ivan;
> José Plácido Domingo Embil;
> Pablo Ruiz Picasso;
> Sacha Noam Baron Cohen;
> 
> I actually have a feeling you can get away with 
> :givenNames
> :familyNames
> for quite a while, if you are lucky, but as I said, it will depend on the context of your application.
> 
> Good luck
> Hugh
> 
> 
> 
> > On 10 Jun 2021, at 18:37, Rumph, Frens Jan <mail@frensjan.nl> wrote:
> > 
> > Dear Christophe,
> > 
> > Thank you for the pointer. I wasn't aware of this ontology! There are some elements missing from the vocabulary, but it goes a long way. Knowing that others went down this route is somewhat reassuring.
> > 
> > As for the use of blank nodes: agreed, this is not necessary. Given the inability to delete them (with SPARQL) I am steering away from them anyway.
> > 
> > Best regards,
> > Frens Jan
> > 
> > On Thu, Jun 10, 2021 at 1:08 PM Christophe Debruyne <christophe.debruyne@gmail.com> wrote:
> > MADS (https://www.loc.gov/standards/mads/rdf/) provides you a way to represent parts of a name using a collection. A madsrdf:PersonalName has a madsrdf:elementList that refers to a list (thus keeping order). In that list, you can have various typed resources with a madsrdf:elementValue containing the literals.
> > The nodes do not necessarily have to be blank. So this looks like your second approach but using a vocabulary published by the Library of Congress.
> > With my best regards,
> > Christophe
> > 
> > On Thu, Jun 10, 2021 at 12:39 PM Martynas Jusevičius <martynas@atomgraph.com> wrote:
> > Why is the list syntax ( ) not satisfactory?
> > 
> > On Thu, 10 Jun 2021 at 12.07, Rumph, Frens Jan <mail@frensjan.nl> wrote:
> > Dear readers,
> > 
> > I am investigating transitioning an application to use RDF. One roadblock is how this application models names of persons. It supports straight-forward full names as a single string, but also supports decomposed names, e.g. person X has given name *Frens* followed by a second given name *Jan* followed by the family name *Rumph*.
> > 
> > Note that this is a crosspost of https://stackoverflow.com/questions/65982459/rdf-modelling-of-list-of-name-elements. I hope to get some more 
> > 
> > The data structure is something like:
> > 
> > ```java
> > enum Role {
> >    ...
> >    GIVEN_NAME,
> >    FAMILY_NAME,
> >    ...
> > }
> > 
> > record NameElement(Role role, String value) {}
> > 
> > record AnnotatedName(NameElement... elements) {}
> > ```
> > 
> > in order to be instantiated like:
> > 
> > ```java
> > var name = new AnnotatedName(
> >     new NameElement(GIVEN_NAME, "Frens"),
> >     new NameElement(GIVEN_NAME, "Jan"),
> >     new NameElement(FAMILY_NAME, "de Vries")
> > );
> > ```
> > 
> > This allows reconstruction of the name into a string while at the same time expressing the components of the name. So it captures the roles of the elements of a name (e.g. given names, family names) *as well as* their order (given names aren't first everywhere). Also, it allows expressing multiple names. E.g. in multiple languages / scripts. Or even aliases used in different areas of the world.
> > 
> > I have toyed around with some RDF constructs, but none are really satisfactory:
> > 
> > ```turtle
> > # list of strings misusing data types as tags
> > :frens :name ( "Frens"^^:givenName "Jan"^^:givenName "de Vries"^^:familyName ) .
> > 
> > # list of blank nodes
> > :frens :name ( [ :givenName "Frens" ]
> >                [ :givenName "Jan" ]
> >                [ :familyName "de Vries" ] ) .
> > 
> > # single blank node with unordered 'elements'
> > :frens :name [ a           :AnnotatedPersonName ;
> >                :fullName   "Frens Jan de Vries" ;
> >                :givenName  "Frens" ;
> >                :givenName  "Jan" ;
> >                :familyName "de Vries" ] .
> > ```
> > 
> > ---
> > 
> > **Existing ontologies for HD names?**
> > Is there an existing ontology that covers such 'high fidelity'? FOAF and vcard have some relevant properties, but aren't able to capture this level of semantics.
> > 
> > **Lists?** One major 'blocker' in migrating this approach to RDF is the notion of order that is used. If at all possible, I'd like to stay away from the List / Container swamp in RDF land ...
> > 
> > I'd be grateful for any thoughts on the matter!
> > 
> > Best regards,
> > Frens Jan
> 

-- 
Hugh
023 8061 5652
Received on Friday, 11 June 2021 11:37:43 UTC