Re: Lists of tagged strings in RDF from Rumph, Frens Jan on 2021-06-11 (semantic-web@w3.org from June 2021)

From: Rumph, Frens Jan <mail@frensjan.nl>
Date: Fri, 11 Jun 2021 21:44:57 +0200
To: Hugh Glaser <hugh@glasers.org>
Cc: SW-forum <semantic-web@w3.org>
Message-ID: <CAH3f1B8yZkZxv=ueV4gRkORcxzpDOdgs58Oz-2M-LvJcmnMSHg@mail.gmail.com>

Dear Hugh,

Thanks for the feedback; much appreciated.

> A question that springs to mind, looking at your examples: "Is it helpful
> to distinguish givenName & familyName?"


Definitely! We do have situations where we just have full name strings and
it's up to us to guess what's what. Name statistics, context, etc. come
into play then. But we also have cases in which we actually have 'high
fidelity' names in multiple sources that we can match.

> If I understand your environment correctly, you are processing documents,
> and performing entity resolution.


It may be 'documents' or sometimes structured records.

> So quite often all you will get is a sequence of names, and it is
> unpredictable whether it will be Bartók Béla or Béla Bartók, or how many of
> the family names a Spanish entity reference uses.


Usually we deal with either full name strings *or* records from systems in
which a particular naming culture is assumed. So the order is typically

> However, you do want to capture the intel of the exact labels that have
> been found "in the wild".


Capturing these names usually means either storing a full name string (if
found in text) or mapping fields from records to the 'annotated name string'
model that we are using right now.
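
As an illustration only, here is a minimal sketch of what such a field
mapping could look like. `Role` and `NameElement` follow the records from my
original post; `SpanishSourceRecord`, its four fields and `RecordMapper` are
hypothetical names made up for this example, assuming a source that delivers
first/second given name and first/second family name:

```java
import java.util.ArrayList;
import java.util.List;

// Demo entry point; declared first so the file can be run directly.
final class RecordMappingDemo {
    public static void main(String[] args) {
        SpanishSourceRecord rec =
            new SpanishSourceRecord("José", "Plácido", "Domingo", "Embil");
        System.out.println(RecordMapper.toElements(rec));
    }
}

enum Role { GIVEN_NAME, FAMILY_NAME }

record NameElement(Role role, String value) {}

// Hypothetical layout of a record from a source with a Spanish naming culture.
record SpanishSourceRecord(String firstGivenName, String secondGivenName,
                           String firstFamilyName, String secondFamilyName) {}

final class RecordMapper {
    // Maps the fields to tagged elements, in the order this source implies.
    static List<NameElement> toElements(SpanishSourceRecord rec) {
        List<NameElement> out = new ArrayList<>();
        add(out, Role.GIVEN_NAME, rec.firstGivenName());
        add(out, Role.GIVEN_NAME, rec.secondGivenName());
        add(out, Role.FAMILY_NAME, rec.firstFamilyName());
        add(out, Role.FAMILY_NAME, rec.secondFamilyName());
        return List.copyOf(out);
    }

    // Skips fields that the source left empty (null or whitespace only).
    private static void add(List<NameElement> out, Role role, String value) {
        if (value != null && !value.isBlank()) {
            out.add(new NameElement(role, value.strip()));
        }
    }
}
```

The order of the `add` calls is where the naming culture assumed by the
source ends up; a source with a different convention would get its own mapper.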

> What you could do: Shift the thinking a little - the knowledge you are
> capturing is more of labels that are used for the entity, not knowledge of
> the entity itself. So record every different entity reference you find -
> simply in a string. Also, if the context provides it, you can attach
> language tags to it, which may be useful to you.


In some contexts, we only have a named person entity, in others we have a
lot more information on an entity: location information, age, some
identifiers, etc. After (positive) matching we link these up with 'edges'
(akin to the skos match properties if the entity types linked are the same
or sometimes something more domain specific if applicable).

> This is very reliable, ... You may feel that you end up with
> maintenance/synchronisation problems doing it this way. But I think not.
> ... Changing one does not require or even imply changing another.


From a UX perspective, as a user who provides information and can later
change it, I would really appreciate not being forced to update an
'exploded' version of the original. E.g. if I enter "Frens (= given name)
Rumph (= family name)" and want to add "Jan" as a second given name, I
wouldn't like to be forced to add "Jan" to some unordered bag and also
change the full name string "Frens Rumph" to "Frens Jan Rumph".
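
A rough sketch of the editing behaviour I have in mind, based on the
`AnnotatedName` record from my original post. The helpers `withElementAt` and
`fullName` are hypothetical names for this example, and the record takes a
`List` here instead of varargs. The point is that the full name is derived
from the ordered elements, so a single edit is enough:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.stream.Collectors;

// Demo entry point; declared first so the file can be run directly.
final class NameEditDemo {
    public static void main(String[] args) {
        AnnotatedName name = new AnnotatedName(List.of(
            new NameElement(Role.GIVEN_NAME, "Frens"),
            new NameElement(Role.FAMILY_NAME, "Rumph")));
        // One edit: insert the second given name before the family name.
        AnnotatedName edited =
            name.withElementAt(1, new NameElement(Role.GIVEN_NAME, "Jan"));
        System.out.println(edited.fullName()); // prints "Frens Jan Rumph"
    }
}

enum Role { GIVEN_NAME, FAMILY_NAME }

record NameElement(Role role, String value) {}

// An ordered list of tagged elements; the full name is derived, never stored.
record AnnotatedName(List<NameElement> elements) {
    AnnotatedName {
        elements = List.copyOf(elements); // immutable defensive copy
    }

    // Returns a new name with the element inserted at the given position.
    AnnotatedName withElementAt(int index, NameElement element) {
        List<NameElement> copy = new ArrayList<>(elements);
        copy.add(index, element);
        return new AnnotatedName(copy);
    }

    // Joins the element values in their stored order, separated by spaces.
    String fullName() {
        return elements.stream()
            .map(NameElement::value)
            .collect(Collectors.joining(" "));
    }
}
```

With an unordered bag of elements plus a separately stored full name string,
the same change would need two coordinated updates.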

> Back to your original question :-) By the time ...


I get that the model developed is perhaps a bit wacky / an acquired taste.
However, it has stood the test of time so far :) Also note that the
applications targeted are not green field ideas! They are operational
systems, with non-trivial datasets and usage, applications built on top of
their structure and most importantly APIs built on top! A move to RDF is no
simple feat, let alone when compounded by a potential migration of APIs etc.

I appreciate your input and line of thinking. I'll chew on that. But I'm
also very much interested in maintaining the list-like concept as they have
done in https://www.loc.gov/standards/mads/rdf/.

Best regards,
Frens Jan


On Fri, Jun 11, 2021 at 1:36 PM Hugh Glaser <hugh@glasers.org> wrote:

> Thanks for the extended explanation - very interesting.
>
> A question that springs to mind, looking at your examples:
> "Is it helpful to distinguish givenName & familyName?"
>
> If I understand your environment correctly, you are processing documents,
> and performing entity resolution.
> So quite often all you will get is a sequence of names, and it is
> unpredictable whether it will be Bartók Béla or Béla Bartók, or how many of
> the family names a Spanish entity reference uses.
>
> However, you do want to capture the intel of the exact labels that have
> been found "in the wild".
> (I am guessing this will include misspellings too and possibly non-Latin
> scripts, if my experience is anything to go by - my system has pages of
> different Tchaikovsky renderings, especially when you include the
> patronymic!)
>
> What you could do:
> Shift the thinking a little - the knowledge you are capturing is more of
> labels that are used for the entity, not knowledge of the entity itself.
> So record every different entity reference you find - simply in a string.
> Also, if the context provides it, you can attach language tags to it,
> which may be useful to you.
> This is very reliable, from the user's point of view - they get exactly
> the name they wanted.
> (Semantic Scholar turns Eduard K. de Jong Frz into E. K. D. J. Frz!).
> A quick skos:prefLabel will be helpful, should you need to choose one for
> display purposes.
> For searching purposes and entity resolution, having the original data can
> be very helpful, especially in ranking possible results.
> If you get more knowledge from elsewhere, such as given names and family
> names, they can be added as they come in.
> If you are able to infer those from an entity reference, then that
> knowledge too can be added.
> Again, localisation and internationalisation can be recorded too.
>
> You may feel that you end up with maintenance/synchronisation problems
> doing it this way.
> But I think not.
> The strings "from the wild" are just that - any more detailed names that
> are inferred or captured are additional knowledge, possibly based on a wide
> range of such strings.
> Changing one does not require or even imply changing another.
>
> Back to your original question :-)
> By the time you have structured things like that, the ordering of given or
> family names becomes much less relevant.
> You can record them just as what they are - name - you don't need the
> ordering, since you have plenty of examples elsewhere.
>         foaf:givenName "José" ;
>         foaf:givenName "Plácido" ;
>         foaf:familyName "Domingo";
>         foaf:familyName "Embil" .
> Note that this way you avoid the possibly n-squared problem of all the
> different options, as you have listed in your latest posting.
> If you really wanted, you could keep each category of names in a list, for
> ordering, but I would say it just complicates the searching use, and does
> not provide anything useful.
> Of course you will find that "Plácido Domingo" is the most common string
> you get (in the wild), so you can promote that to skos:prefLabel, in
> curating your dataset (or a user will tell you their preferred label). You
> can record all sorts of metadata of frequency and sources as well, if you
> want.
>
> Just some thoughts - I hope that helps.
> Hugh
>
> > On 10 Jun 2021, at 21:43, Rumph, Frens Jan <mail@frensjan.nl> wrote:
> >
> > Hello Hugh,
> >
> > Thank you for your thoughts!
> >
> > When people move from an existing application in a programming language
> to using RDF, it can often seem that things don't move over easily and
> naturally; and indeed that can be the case.
> >
> > I'm active in the area of gathering, processing and organising "intel".
> Think databases, searching for and matching of entities, etc. Most of the
> data modelling in the applications I'm working on maps very well to RDF;
> they are already expressed in either large triple tables or in virtually
> partitioned predicate tables with mostly primitive data (text, numbers,
> dates, etc.). There's the matter of provenance (all our statements are
> annotated with source information), but I won't dwell on e.g. RDF-star here.
> >
> > Others have commented many times on this list, that RDF is neither a
> programming language nor a data structure description, so perhaps that is
> not surprising.
> >
> > The reason for my interest in RDF is that our data model is already
> pretty closely aligned, and I'd like to tap into a richer ecosystem. I
> don't want 'it' to be like the Java I already have, but it should give an
> idea on where I'm coming from. Let's say that we have some 'literals' that
> have structure as well as (some) semantics.
> >
> > But the main blockers are in the area of person names but also
> addresses. For the latter we use a format similar to names ('annotated
> strings' / tagged lists of strings) somewhat similar to how Google Maps
> models them:
> https://developers.google.com/maps/documentation/geocoding/overview#GeocodingResponses
> .
> >
> > The applications at hand are tasked with a lot of searching and entity
> resolution / matching. Some of the sources used actually have fields like
> first given name, second given name, first family name and second family
> name (in this case in a Spanish context). Another example is sources
> discerning between given (formal) names and a so-called roepnaam (a fairly
> Dutch concept). Obviously some sources don't have such high fidelity. My
> goal of the initial design was to a) not attempt to capture all nuances of
> person name cultures, but a fair amount and b) to not get stuck in an ever
> growing but always incomplete set of name formats.
> >
> > In any case, we want to support the notion of people going by various
> names; hence going beyond associating given and family names directly with
> a person. So Herman Iván would ideally be described as
> >
> > [ a :Person ;
> >   :name [ :familyName "Herman" ; :givenName "Iván" ] ;
> >   :name [ :givenName "Ivan" ; :familyName "Herman" ]
> > ]
> >
> > Sacha Baron Cohen would I guess ideally be described as:
> >
> > [ a :Person ;
> >   :name [ :givenName "Sacha" ; :familyName "Cohen" ] ;
> >   :name [ :givenName "Sacha" ; :familyName "Baron Cohen" ] ;
> >   :name [ :givenName "Sacha" ; :givenName "Noam" ; :familyName "Baron Cohen" ] ;
> >   :name [ :nickName "Ali G" ] ;
> >   ...
> > ]
> >
> > José Plácido Domingo Embil could be described in our systems as:
> >
> > [ a :Person ;
> >   :name [ :givenName "Plácido" ; :familyName "Domingo" ] ;
> >   :name [ :givenName "José" ; :givenName "Plácido" ; :familyName "Domingo" ] ;
> >   :name [ :givenName "José" ; :givenName "Plácido" ; :familyName "Domingo" ; :familyName "Embil" ]
> > ]
> >
> > Xi Jinping could be described in our systems as:
> >
> > [ a :Person ;
> >   :name [ :familyName "Xi" ; :givenName "Jinping" ] ;
> >   :name [ :familyName "習" ; :givenName "近平" ] ;
> > ]
> >
> > Note that we're not all that interested in capturing what someone's
> actual name is, we're mostly interested in what someone goes by; i.e. what
> could be considered identifying. (I am painfully aware that the namespace
> in the Netherlands is a lot less crowded than in e.g. China). And we're
> solely dependent on what data and how much structure is available. And most
> of the time, there is a difference in what format 'seed' data is available,
> possible search formats and the structure of actual records that can be
> matched.
> >
> > A final point of interest in the intended context of use is that
> application users are able to change name input provided earlier. So
> ideally there is no 'expansion' of earlier input in e.g. a format of
> unordered but annotated name elements and a full name string.
> >
> > Thanks again, for thinking along with me. There are probably concessions
> to be made. Feedback like yours stresses my thinking; much appreciated!
> >
> > Best regards,
> > Frens Jan
> >
> >
> > On Thu, Jun 10, 2021 at 9:07 PM Hugh Glaser <hugh@glasers.org> wrote:
> > Hi Frens Jan,
> >
> > Sorry to perhaps be a bit difficult here, rather than answer the
> question as put.
> >
> > I read your posting with some unease.
> > In general:
> > When people move from an existing application in a programming language
> to using RDF, it can often seem that things don't move over easily and
> naturally; and indeed that can be the case.
> > Others have commented many times on this list, that RDF is neither a
> programming language nor a data structure description, so perhaps that is
> not surprising.
> >
> > Without the specific set of ways in which you will be using the
> knowledge (rather than an abstract "well I want it to be like the Java I
> already have"), it is hard to suggest alternatives.
> > > This allows reconstruction of the name into a string while at the same
> time expressing the components of the name. So it captures the roles of the
> elements of a name (e.g. given names, family names) *as well as* their
> order (given names aren't first everywhere). Also, it allows expressing
> multiple names. E.g. in multiple languages / scripts. Or even aliases used
> in different areas of the world.
> > Since you talk about "given names", it seems to me that you could use
> >         :givenNames "Frens Jan"
> >
> > More specifically, you seem to want to tread an almost impossible line
> of small amount of the knowledge of a person's name, without having
> anything extra.
> > If you really want to be able to embrace the multi-cultural stuff of
> even just UK, HUN & ESP, for example, you need to think what you will do
> with people like
> > Bartók Béla and our own Ivan Herman, who might also be known as Herman
> Ivan;
> > José Plácido Domingo Embil;
> > Pablo Ruiz Picasso;
> > Sacha Noam Baron Cohen;
> >
> > I actually have a feeling you can get away with
> > :givenNames
> > :familyNames
> > for quite a while, if you are lucky, but as I said, it will depend on
> the context of your application.
> >
> > Good luck
> > Hugh
> >
> >
> >
> > > On 10 Jun 2021, at 18:37, Rumph, Frens Jan <mail@frensjan.nl> wrote:
> > >
> > > Dear Christophe,
> > >
> > > Thank you for the pointer. I wasn't aware of this ontology! There are
> some elements missing from the vocabulary, but it goes a long way. But
> knowing that others went down this route is somewhat reassuring.
> > >
> > > As for the use of blank nodes: agreed, this is not necessary. Given
> the inability to delete them (with SPARQL) I am steering away from them
> anyway.
> > >
> > > Best regards,
> > > Frens Jan
> > >
> > > On Thu, Jun 10, 2021 at 1:08 PM Christophe Debruyne <
> christophe.debruyne@gmail.com> wrote:
> > > MADS (https://www.loc.gov/standards/mads/rdf/) provides you a way to
> represent parts of a name using a collection. A madsrdf:PersonalName has a
> madsrdf:elementList that refers to a list (thus keeping order). In that
> list, you can have various typed resources with a madsrdf:elementValue
> containing the literals.
> > > The nodes do not necessarily have to be blank. So this looks like your
> second approach but using a vocabulary published by the Library of Congress.
> > > With my best regards,
> > > Christophe
> > >
> > > On Thu, Jun 10, 2021 at 12:39 PM Martynas Jusevičius <
> martynas@atomgraph.com> wrote:
> > > Why is the list syntax ( ) not satisfactory?
> > >
> > > On Thu, 10 Jun 2021 at 12.07, Rumph, Frens Jan <mail@frensjan.nl>
> wrote:
> > > Dear readers,
> > >
> > > I am investigating transitioning an application to use RDF. One
> roadblock is how this application models names of persons. It supports
> straight-forward full names as a single string, but also supports
> decomposed names, e.g. person X has given name *Frens* followed by a second
> given name *Jan* followed by the family name *Rumph*.
> > >
> > > Note that this is a crosspost of
> https://stackoverflow.com/questions/65982459/rdf-modelling-of-list-of-name-elements.
> I hope to get some more
> > >
> > > The data structure is something like:
> > >
> > > ```java
> > > enum Role {
> > >    ...
> > >    GIVEN_NAME,
> > >    FAMILY_NAME,
> > >    ...
> > > }
> > >
> > > record NameElement(Role role, String value) {}
> > >
> > > record AnnotatedName(NameElement... elements) {}
> > > ```
> > >
> > > in order to be instantiated like:
> > >
> > > ```java
> > > var name = new AnnotatedName(
> > >     new NameElement(GIVEN_NAME, "Frens"),
> > >     new NameElement(GIVEN_NAME, "Jan"),
> > >     new NameElement(FAMILY_NAME, "de Vries")
> > > );
> > > ```
> > >
> > > This allows reconstruction of the name into a string while at the same
> time expressing the components of the name. So it captures the roles of the
> elements of a name (e.g. given names, family names) *as well as* their
> order (given names aren't first everywhere). Also, it allows expressing
> multiple names. E.g. in multiple languages / scripts. Or even aliases used
> in different areas of the world.
> > >
> > > I have toyed around with some RDF constructs, but none are really
> satisfactory:
> > >
> > > ```turtle
> > > # list of strings misusing data types as tags
> > > :frens :name ( "Frens"^^:givenName "Jan"^^:givenName "de Vries"^^:familyName ) .
> > >
> > > # list of blank nodes
> > > :frens :name ( [ :givenName "Frens" ]
> > >                [ :givenName "Jan" ]
> > >                [ :familyName "de Vries" ] ) .
> > >
> > > # single blank node with unordered 'elements'
> > > :frens :name [ a           :AnnotatedPersonName ;
> > >                :fullName   "Frens Jan de Vries" ;
> > >                :givenName  "Frens" ;
> > >                :givenName  "Jan" ;
> > >                :familyName "de Vries" ] .
> > > ```
> > >
> > > ---
> > >
> > > **Existing ontologies for HD names?**
> > > Is there an existing ontology that covers such 'high fidelity'? FOAF
> and vcard have some relevant properties, but aren't able to capture this
> level of semantics.
> > >
> > > **Lists?** One major 'blocker' in migrating this approach to RDF is
> the notion of order that is used. If at all possible, I'd like to stay away
> from the List / Container swamp in RDF land ...
> > >
> > > I'd be grateful for any thoughts on the matter!
> > >
> > > Best regards,
> > > Frens Jan
> >
> > --
> > Hugh
> > 023 8061 5652
> >
>
> --
> Hugh
> 023 8061 5652
>
>
Received on Friday, 11 June 2021 19:46:56 UTC