Identifiers from MacRae, Caspar [Engineering] on 2021-07-06 (public-md-odrl-profile@w3.org from July 2021)

From: MacRae, Caspar [Engineering] <Caspar.MacRae@gs.com>
Date: Tue, 6 Jul 2021 13:16:40 +0000
To: "public-md-odrl-profile@w3.org" <public-md-odrl-profile@w3.org>
Message-ID: <LNXP265MB0203FB698A46B6DEE93B4213EA1B9@LNXP265MB0203.GBRP265.PROD.OUTLOOK.COM>
Hi,

In response to the action: "Caspar to write some documentation and recommendation around use of identifiers, what is the strategy we should use for all identified things, use of standard schemes and non-standard schemes."

Summary of recommendations:

*         Use of IRIs for all identifier representation, allowing arbitrary identifier schemes to be mutually agreed

*         Use of ISO for standard identifiers, as it is complete, widely adopted and has an official URN NID

*         Determination of identity must be temporal, to avoid complexities of identity schemes' versioning


Best regards,
Caspar


Identity

From a relational modelling perspective, you'd consider any discussion of "Identity" remiss without consideration of "Entity" - but the apparent portmanteau is coincidental, they share no etymology.

The dictionary definition of "Entity" is simply "existence" (to be), while "Identity", for our purposes, is "the quality of being identical" (the same) - simple enough to state in the abstract.

Demonstrating these concepts' independence - the notion of entity without explicit identity is already expressed in our draft specification via inclusion of PROV-O<https://www.w3.org/TR/prov-o/>, the provenance ontology.  For lineage purposes (derivations, audit trail, etc), we extend PROV-O's Entity type - where an Entity is defined as the product of, or input to, an Activity governed by an Agent.  Identity simply isn't a concern here; two simple relationships, wasRevisionOf and wasDerivedFrom, provide the lineage model from source to sink.

Moving on - consider the following examples of identity:

1.       The $10 in your pocket is identical to that in mine for most contexts - until we compare serial numbers

2.       The book on your shelf and the library copy with the identical ISBN are the same - until we place them side by side and compare dog-eared pages, coffee stains, etc

3.       I am "the same" person now as I was when five years old - yet many attributes are completely different

4.       The FTSE 500 is distinguishable from the FTSE 100 - yet if all five hundred constituents changed, it would still be the FTSE 500

In #1, for our purposes, money is fungible; the physical $10 bill in your pocket is identical to the digital $10 in my account - this type of identity is trivial, it's a unit in concert with a quantity - ensuring apples-to-apples comparison.

#2 appears similar to #1, but there's a subtle difference: #1 targets a value - necessarily a property (e.g. balance of account) or abstract concept (e.g. price/value).  #2 targets an entity and provides a context-bound notion of "identical" - where sameness does not imply uniqueness.

Examples #3 and #4 force us to look deeper - specifically what do we mean by "the same"?  Philosophers have grappled with this for eons - Mereology; the study of the whole vs the parts - best summarized by the Ship of Theseus<https://en.wikipedia.org/wiki/Ship_of_Theseus> thought experiment;
Plato, Heraclitus et al. puzzled over a ship that set sail on a long voyage, along the way suffering damage that was repaired with near-equivalent parts; by the time it returns to harbour, not one piece of wood or rope is original.  Is it the sum of its parts?  When not a single original piece remains, can we consider the whole that departed "the same" as the whole that returned?

To avoid heading deeper into this metaphysical rabbit hole, we can just assert: yes, it is Theseus's ship, and the FTSE 500 is, always, the FTSE 500.   We achieve this via the proposed temporal model, which elegantly manages examples #3 and #4, leveraging Allen's model to introduce an abstraction that essentially provides a permanent identifier within the bounds of a firm/policy-store.

So "the quality of being identical" is contextual (object class and time sensitive) - it can be modelled as a value (or composite) that is unique within its context and evaluated in terms of equality; this is enough to provide the mechanical means for different systems to mutually agree on the determination of identity.


URL, URN, URI and IRI

A key component of the internet is the concept of Uniform Resource Identifiers.  It's worth briefly recapping the acronyms URL (Uniform Resource Locator), URN (Uniform Resource Name), URI (Uniform Resource Identifier) and IRI (Internationalized Resource Identifier).

*         URL is a location (resolvable identity - enter a Wikipedia URL in your web browser, to have it resolve that address to the actual content)
*         URN is a name (location free - ISBN identifies a book uniquely, whether it's on your shelf or in the library,  or a Social Security/National Insurance number in your Corp's HR db)

The difference seems stark - but it's not.  Semantically both concern identity; something universal but only within a known scheme - the only difference is intended use and that's contextual (e.g. a Kindle type application could resolve an ISBN to the actual content, handling this URN as a URL would be handled by a web browser), so;

*         URI - recognizing that URL and URN are semantically disjoint but syntactically identical, IETF (Tim Berners-Lee) introduced the URI as the superset of URL and URN
*         IRI exists only to support internationalization - URIs supported only a limited, US-focused character set (ASCII). IRIs are defined over the Unicode character set; as Unicode can be considered a superset of ASCII, it follows that IRI is the supertype of URI

So the difference between URI and IRI is just a technical detail, evolution via superset to honour compatibility - as IRIs are broadest and recommended in RDF, we should follow suit.

While the distinction between a unique name (URN) and resolvable location (URL) isn't particularly relevant for the concept of identity, it is of vital importance when we consider pragmatic application.
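A minimal Python sketch of the points above, using only the standard library.  The function names and the sample ISBN are illustrative assumptions, not part of the draft specification; the classification rule (a "urn" scheme is a name, a scheme with an authority is a location) is a deliberate simplification.

```python
from urllib.parse import quote, urlsplit


def classify(identifier: str) -> str:
    """Roughly classify an identifier as 'URN', 'URL' or plain 'URI'.

    Simplifying assumption: the 'urn' scheme marks a name, and any other
    scheme that carries an authority (host) marks a resolvable location.
    """
    parts = urlsplit(identifier)
    if parts.scheme.lower() == "urn":
        return "URN"
    if parts.scheme and parts.netloc:
        return "URL"
    return "URI"


def iri_path_to_uri_path(path: str) -> str:
    """Percent-encode non-ASCII characters in a path as UTF-8 octets.

    This covers only the path component; host names need separate
    (IDNA) handling, which is out of scope for this sketch.
    """
    return quote(path, safe="/")


print(classify("urn:isbn:0451450523"))                             # a name
print(classify("https://en.wikipedia.org/wiki/Ship_of_Theseus"))   # a location
print(iri_path_to_uri_path("/wiki/Th\u00e9s\u00e9e"))              # IRI path to URI path
```

Both forms go through the same parser, which is the point: syntactically they are one family, and only the intended use differs.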


Consideration of Identity Scheme Management

We can broadly split identity schemes in two; those that guarantee permanent/stable identity (PURL<https://www.oclc.org/research/areas/data-science/purl.html> and Permalinks<https://en.wikipedia.org/wiki/Permalink>) and those that don't.  It's worth considering this from the perspective of the definer/issuer; for a fixed enumeration, a finite set, it makes sense to produce the dataset of identifiers along with the specification - whereas for an unbounded set of identifiers, it does not.

Compatibility (versioning) is not only critical for the evolution of finite identifier sets, but also for the content where an identifier is resolvable.


What Identities and Identifiable Things Exist in the Spec

Identifiable Concepts:
*         Asset
*         Asset Class
*         Markets (Exchange, Venue)
*         Party
*         Location (Region, Named Office, Address)
Identifiers:
*         MIC
*         Compound ID
*         Currency
*         UN M49<https://unstats.un.org/unsd/methodology/m49/> Code (as the geography property<https://w3c.github.io/market-data-odrl-profile/md-odrl-profile.html#geography> of Asset)

Although Person ID and Machine ID are clearly identities, their appearance in the specification is only as a unit of count.


Difficulty with Address as Identifier

A location may appear in contracts as;  address, city, region, country or named site.  The first three logically nest and are identifiable, while the last is an arbitrary classification - it's easy to see how all variations could be evaluated, but cleanly expressing in the specification is harder; likely requiring the constraint locations<https://w3c.github.io/market-data-odrl-profile/md-odrl-profile.html#locations> take a disparate range.

An address is necessarily a compound ID.  While there are various standard models for postal addresses (e.g. FIBO Address Ontology<https://spec.edmcouncil.org/fibo/ontology/FND/Places/Addresses/> (RDF), Schema.org<https://schema.org/PostalAddress> (RDF), Universal Postal Union<https://www.upu.int/> and ISO 19160<https://www.iso.org/standard/61710.html> (UML)), none offers an identifier encoding (i.e. a synthetic ID that can be resolved to the address, and therefore used in its place).


International Organization for Standardization (ISO)

We need to represent at least seven key identities - for each of these there's an existing ISO standard:

  *   Currency - ISO 4217<https://www.iso.org/iso-4217-currency-codes.html>
  *   CFI - ISO 10962<https://www.iso.org/standard/81140.html>
  *   ISIN - ISO 6166<https://www.iso.org/standard/78502.html>
  *   MIC - ISO 10383<https://www.iso.org/standard/61067.html>
  *   LEI - ISO 17442<https://www.iso.org/standard/78829.html>
  *   Location (Country/Region) - ISO 3166<https://www.iso.org/iso-3166-country-codes.html>
  *   Address - ISO 19160<https://www.iso.org/standard/64242.html>

In order to forge agreement and cooperation across geopolitical boundaries, standards are drawn at an international level, but assignment is typically controlled at a regional level.  The regional subdivision is apparent in many identity schemes; e.g.

1.       Currency codes (ISO 4217) are three characters long; the first two are typically the two character country code (ISO 3166-1, alpha-2)

2.       The ISIN format (ISO 6166) consists of the two character country code (ISO 3166-1, alpha-2), followed by a nine character alphanumeric sequence (the NSIN - National Securities Identifying Number), which is regionally assigned, and a single check digit

An exception is LEI; here the ID encodes the issuer but not the country code, as the specification mandates accompanying reference data (including country code) for assigned identifiers.
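To make the regional subdivision concrete, here is a minimal Python sketch that splits and checks an ISIN.  It assumes the published ISO 6166 layout of a two letter country prefix, a nine character alphanumeric NSIN and a trailing check digit computed with the Luhn algorithm over the letter-expanded digits; the function names and the sample ISIN are illustrative only.

```python
import re

# Two letters (country prefix), nine alphanumerics (NSIN), one check digit.
ISIN_PATTERN = re.compile(r"^[A-Z]{2}[A-Z0-9]{9}[0-9]$")


def split_isin(isin: str) -> tuple:
    """Return (country prefix, NSIN, check digit) without validating them."""
    return isin[:2], isin[2:11], isin[11:]


def isin_is_valid(isin: str) -> bool:
    """Check the shape and the Luhn check digit of an ISIN."""
    if not ISIN_PATTERN.match(isin):
        return False
    # Expand letters to numbers (A=10 ... Z=35); digits stay as they are.
    digits = "".join(str(int(ch, 36)) for ch in isin)
    total = 0
    # Walk from the right, doubling every second digit (Luhn).
    for position, ch in enumerate(reversed(digits)):
        value = int(ch)
        if position % 2 == 1:
            value *= 2
            if value > 9:
                value -= 9
        total += value
    return total % 10 == 0


print(split_isin("US0378331005"))
print(isin_is_valid("US0378331005"))
```

Note that the check only proves the identifier is well formed; whether the NSIN was actually assigned is a question for the regional numbering agency's reference data.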

It's worth noting that the ISO country codes are derived from the UN M49 codes, but that M49 is more detailed than ISO 3166<https://unstats.un.org/unsd/classifications/Family/Detail/12>.


Valid ISO URNs for Identities

RFC-5141<https://datatracker.ietf.org/doc/html/rfc5141> defines the URN Namespace ID for ISO, stating in the abstract:

"This URN NID is intended for use for the identification of persistent resources published by the ISO standards body (including documents, document metadata, extracted resources such as standard schemata and standard value sets, and other resources)"

The key part is the inclusion of "standard value sets" - this is ISO's terminology for the actual datasets produced by their standards.  Following this assigned scheme, we can unambiguously name the various ISO datasets.

For example, taking ISO 4217 (Currency) - Uruguay Central Bank applied for a new fund code (UYW), this was ratified in ISO 4217 part 1, in version 1 of the 8th edition (2015), by amendment 169, effective from 2018-08-29, with this specification available in both English and French language  -  the URN is:

urn:iso:std:iso:4217:-1:ed-8:v1-amd169:v1:en,fr

I couldn't find any explicit guidance in the RFC on how to reference elements within a standard value set; it seemed a small and logical step to add an f-component (suffix of '#' hash and identifier).  Searching the web uncovered a few examples of others doing this, but worryingly quite a few had misinterpreted the spec (e.g. using date instead of edition number).  So I reached out to ISO, detailing our requirements, and was lucky enough to receive a considered and detailed response from one of the RFC's authors - Holger Apel.

Continuing the previous example, but focusing on value sets rather than the governing specification documentation - we can be absolutely exact about the first occurrence of #UYW in the value set; its name is:
urn:iso:std:iso:4217:-1:ed-8:v1:amd:169:v1:tech:#UYW

Contrasting this value set identity example with the previous specification example:  language is obviously irrelevant (the value set codes are necessarily universal) so omitted, but what's notable is the additional NSS (namespace specific string) :tech - this qualification ensures the name unambiguously refers to the value set.

This is too precise - the scheme need not refer to each and every corrigendum and amendment, yet it must be unambiguous and not assume that all identifier schemes employ permanent IDs.

We can remove the versioning details entirely, legally reducing the previous example to:

urn:iso:std:iso:4217:-1:tech:#UYW

This is possible if the determination of identity is temporally bound, e.g. with the contract's effective date.
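A minimal Python sketch of temporally bound determination follows.  The table, class and function names are hypothetical and not part of the draft specification; only the UYW effective date (2018-08-29) is taken from the example above, and a real implementation would load the validity periods from the published value set.

```python
from dataclasses import dataclass
from datetime import date
from typing import Optional


@dataclass(frozen=True)
class Validity:
    """Period during which a code is part of the value set."""
    valid_from: date
    valid_to: Optional[date] = None  # None means still current


# Hypothetical extract of a value set, keyed by the unversioned URN.
VALUE_SET = {
    "urn:iso:std:iso:4217:-1:tech:#UYW": Validity(date(2018, 8, 29)),
}


def is_identified(urn: str, effective: date) -> bool:
    """True if the unversioned URN names a code valid on the given date."""
    entry = VALUE_SET.get(urn)
    if entry is None:
        return False
    if effective < entry.valid_from:
        return False
    return entry.valid_to is None or effective <= entry.valid_to


# The contract's effective date supplies the version information that
# was removed from the URN itself.
print(is_identified("urn:iso:std:iso:4217:-1:tech:#UYW", date(2021, 7, 6)))
print(is_identified("urn:iso:std:iso:4217:-1:tech:#UYW", date(2018, 8, 28)))
```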

For the sake of human readable examples, identifiers may be further shortened and syntactically sweetened with prefixes, e.g. in Turtle:

                @prefix currency: <urn:iso:std:iso:4217:-1:tech:#> .

Allowing examples to simply use;  currency:UYW, currency:USD, etc.
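For tooling that has to turn the shortened form back into the full identifier, the expansion is a simple string join.  A minimal Python sketch, assuming a hypothetical prefix table that mirrors the Turtle declaration (the function name is illustrative):

```python
# Hypothetical prefix table mirroring the Turtle @prefix declaration.
PREFIXES = {
    "currency": "urn:iso:std:iso:4217:-1:tech:#",
}


def expand(prefixed_name: str, prefixes: dict = PREFIXES) -> str:
    """Expand a prefixed name such as 'currency:UYW' to its full IRI."""
    prefix, separator, local_name = prefixed_name.partition(":")
    if not separator or prefix not in prefixes:
        raise ValueError(f"unknown or missing prefix in {prefixed_name!r}")
    return prefixes[prefix] + local_name


print(expand("currency:UYW"))
print(expand("currency:USD"))
```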


Other Relevant Standard Identity Schemes

If we accept the obvious recommendation of identifiers as IRIs (URNs/URLs), then any identifier scheme that provides either a registered URN NID,  or issues identifiers as URIs (against a domain they control) may be considered standard.

Open Perm ID<https://permid.org/> and Open FIGI<https://www.openfigi.com/about/figi> both issue identifiers as URIs, and are likely deployed within many firms, making them great candidates for standard identity schemes - additionally they make good runtime options as both are API resolvable for their own native identifiers but also act as bridges between identity schemes by also resolving relevant ISO identifiers.

We should align with other efforts where possible, e.g. FINOS have specified and curated Securities and Issuer mappings<https://www.finos.org/hubfs/SecRef_%20Securities%20&%20Issuer%20ID%20mapping_%20.pdf>, leveraging Open FIGI for securities master and Open Perm ID for legal entity master.


Supporting Multiple Schemes

Whether standard or bespoke, if there's the option of multiple schemes for a particular identifier type, then the choice needs to be mutually agreed between the producing and consuming parties.  It is an operational detail of the supply chain and should be automated.

During resolution of a URL to the displayed content, a web browser negotiates with the server.  The browser acts as the User's Agent, informing the server of language/locale preferences and stating which content types it can accept - the server then tailors its response, where possible.  In a similar vein, a protocol for exchange of digital contracts could include automated negotiation of identifier schemes.
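As a rough sketch of what such a negotiation could look like, in the spirit of HTTP content negotiation: the consumer states its scheme preferences in order, and the producer answers with the first one it supports.  The function name and the scheme labels below are assumptions for illustration; no such protocol is defined in the draft specification.

```python
from typing import Optional, Sequence


def negotiate_scheme(
    consumer_preferences: Sequence[str],
    producer_supported: Sequence[str],
) -> Optional[str]:
    """Return the consumer's most preferred scheme the producer supports.

    Returns None when there is no overlap, in which case the parties
    would have to fall back to a manually agreed scheme.
    """
    supported = set(producer_supported)
    for scheme in consumer_preferences:
        if scheme in supported:
            return scheme
    return None


# Hypothetical scheme labels for a security identifier.
consumer = ["figi", "isin", "permid"]
producer = ["isin", "permid"]
print(negotiate_scheme(consumer, producer))
```

Keeping the preference order on the consumer side matches the browser analogy, where the user agent states what it can accept and the server tailors the response.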



Received on Tuesday, 6 July 2021 13:17:24 UTC