Modeling acronym and abbreviation Labels scenario from Bradley Shoebottom on 2013-02-05 (public-esw-thes@w3.org from February 2013)

From: Bradley Shoebottom <bradley.shoebottom@Innovatia.net>
Date: Tue, 5 Feb 2013 16:42:29 +0000
To: "public-esw-thes@w3.org" <public-esw-thes@w3.org>
Message-ID: <1B8EDAD4532ABF41A819B3E5845062DB330E5C61@MBX245.domain.local>

Dear mailing list,

I am trying to build a controlled vocabulary schema to be able to model something like RFC 4949 http://tools.ietf.org/html/rfc4949

This controlled vocabulary has "separate" entries for the acronym, the abbreviation, each slang term or synonym, and the canonical term. There are also deprecatedLabel entries.

I do not want separate entries for each acronym/abbreviation, as the MADS/rdf object properties hasAcronymVariant and hasAbbreviationVariant suggest. Instead I want everything in one canonical entry (reasons outlined in the Use Case Scenario below).

For example, in RFC 4949, page 9:

prefLabel: Triple Data Encryption Algorithm
hiddenLabel: Triple DEZ [I made up this slang]

How would you model these two alternatives to the canonical label in MADS/rdf?

acronym: 3DES
abbreviation: Triple DES

Use Case Scenario
We want to build a master controlled vocabulary by text mining many glossaries such as RFC 4949. So we have to be able to process these varying labels and cross references.

One approach is to model RFC 4949 using MADS/rdf as the specification suggests, and then use some sort of inferencing or query to make the acronyms/abbreviations "appear" as part of the canonical term via object properties. This leads to more term entries but makes it easy to text mine. It also complicates the XSLT transformation to .txt for further text mining.
An alternative approach is to make one canonical entry holding all label types, for the text mining reason described below, which would simplify the XSLT transformation from OWL to .txt.

We curate the multiple glossary inputs so that each concept is represented by only one canonical entry, as judged ontologically/conceptually by an SME. This is done by manual curation to match syntactically different labels for the same term, by a SPARQL query to isolate duplicates, or by both techniques.
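
As an illustration only, the matching of syntactically different labels could be sketched in Python roughly as follows. The function names, the (source, label) tuple layout, and the normalization rule (casefold, then keep only ASCII letters and digits) are assumptions made for this sketch; they are not part of our actual pipeline and would need tuning for non-ASCII labels.

```python
import re
from collections import defaultdict

def normalize(label):
    """Casefold the label and drop everything except ASCII letters and digits."""
    return re.sub(r"[^0-9a-z]+", "", label.casefold())

def find_duplicates(entries):
    """Group (source, label) pairs whose normalized labels collide.

    Returns only the groups with more than one member, i.e. the
    candidates an SME would need to review.
    """
    groups = defaultdict(list)
    for source, label in entries:
        groups[normalize(label)].append((source, label))
    return {key: items for key, items in groups.items() if len(items) > 1}

# Hypothetical input: the same term spelled differently in two glossaries.
entries = [
    ("RFC 4949", "Triple DES"),
    ("other glossary", "triple-des"),
    ("other glossary", "3DES"),
]
duplicates = find_duplicates(entries)
```

Here "Triple DES" and "triple-des" fall into one group for review, while "3DES" stays on its own because it is a genuinely different string.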

Then we export the master term list as a .txt file with the preferred label, acronyms, symbols (QUDT ontology), abbreviations, and synonyms (altLabel). This file in turn serves as input for GATE so that we can text mine the actual corpus that describes a product and build the knowledge base for that product.
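
To make the single-entry export concrete, here is a minimal Python sketch of flattening one canonical entry into one line of text. The dictionary keys, the function name, and the tab-separated "kind=label" layout are all assumptions chosen for illustration; this is not the format GATE requires, and our real export is done with XSLT.

```python
# Label kinds to export after the preferred label, in a fixed order.
LABEL_KINDS = ("acronym", "abbreviation", "symbol", "altLabel")

def entry_to_line(entry):
    """Flatten one canonical entry into a single tab-separated line.

    The preferred label comes first, followed by one "kind=label"
    field for every variant label the entry carries.
    """
    fields = [entry["prefLabel"]]
    for kind in LABEL_KINDS:
        for label in entry.get(kind, []):
            fields.append(f"{kind}={label}")
    return "\t".join(fields)

# The RFC 4949 example from above as one canonical entry.
entry = {
    "prefLabel": "Triple Data Encryption Algorithm",
    "acronym": ["3DES"],
    "abbreviation": ["Triple DES"],
}
line = entry_to_line(entry)
```

With one entry per concept, each output line carries every label for that concept, which is the property that would keep the transformation simple.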

Right now our glossary has over 20,000 telecommunications terms (with many complex and simple labels). The design is therefore important, so that we do not face a big job correcting design errors after the data has been populated.

Of course I can just model owl:acronym and owl:abbreviation under the appropriate imported SKOS, SKOS-XL, and MADS/rdf data properties, but I would like to remain as close as possible to customary modeling.

Any thoughts?

Bradley Shoebottom
Senior Information Architect - Research and Product Development
Phone: (506) 674-5439 | Toll-Free: (800) 363-3358
Skype: bradley.shoebottom
Email: bradley.shoebottom@innovatia.net

Received on Tuesday, 5 February 2013 16:44:29 UTC