RDF/OWL Representation of WordNet

NOTE: This document is NOT a W3C draft; it is intended for discussion only.

Editor's Draft 17 October 2005

This version: 17 October 2005
Latest version: ...
Previous version: ...
Editors: Mark van Assem, Vrije Universiteit Amsterdam; Aldo Gangemi, ISTC-CNR, Rome; Guus Schreiber, Vrije Universiteit Amsterdam

Abstract

WordNet has been adopted in the Semantic Web research community for use in annotation, reasoning, and as background knowledge in ontology mapping tools. Currently there exist several conversions of WordNet to RDF(S) or OWL. The WordNet Task Force aims at providing a conversion of WordNet as a reference point for developers.

This is not an abstract.

Status of this Document

This section describes the status of this document at the time of its publication. Other documents may supersede this document. A list of current W3C publications and the latest revision of this technical report can be found in the W3C technical reports index at http://www.w3.org/TR/.

This document is an editor's draft, considered for publication as First Public Working Draft by the Semantic Web Best Practices and Deployment Working Group, part of the W3C Semantic Web Activity.

Publication as a Working Draft does not imply endorsement by the W3C Membership. This is a draft document and may be updated, replaced or obsoleted by other documents at any time. It is inappropriate to cite this document as other than work in progress.

Open issues are described in Sec. 5.

Acknowledgements

Dan Brickley and Brian McBride have contributed to the WordNet conversion described in this note through their work in the WordNet Task Force and additional comments and suggestions.

1. Introduction
2. WordNet datamodel
3. Prolog Format
4. Conversion to RDF/OWL
5. Open Issues
Appendix A: Conversion details
Appendix B: References

1. Introduction

WordNet [Fellbaum, 1998] is a heavily-used lexical resource in natural-language processing and information retrieval. More recently, it has also been adopted in the Semantic Web research community for use in annotation, reasoning, and as background knowledge in ontology mapping tools. Currently there exist several conversions of WordNet to RDF(S) or OWL. [references?] Most of these have been derived from the Prolog version of WordNet (there is also a version in a proprietary text format).

The WordNet Task Force of the SWBPD WG aims at providing a standard conversion of WordNet as a reference point for developers. The basis is formed by the previous conversions. As there are many similarities between these conversions, it seems that the opinions on what constitutes a suitable form of WordNet in RDF/OWL are converging. This conversion may be used directly in Semantic Web applications, or as a source for modified WordNet versions (e.g. turning WordNet into an ontology).

Requirements in the design of the RDF/OWL version are:

it should be a full conversion (i.e. be as complete as possible);
it should be convenient to work with;
it should reflect the original structure of WordNet as much as possible; and
it should provide OWL semantics while still being interpretable by pure RDFS tools (i.e. OWL semantics are provided but can be ignored).

There is tension between these requirements. For example, one way of representing WordNet in RDF may reflect the original structure better than another, yet be less convenient to use. This document details the trade-offs and design decisions taken.

As a general comment about the nature of this document, it is describing the process of conversion, rather than the results of the conversion. This aim is misdirected. A document which described the conversion and aided users using it would be significantly more useful to significantly more people. Some of the other comments below reflect this view.

The actual conversion was performed on the Prolog version of WordNet 2.0.

2. WordNet datamodel

The core concept in WordNet is the synset, such as {car, auto, automobile, machine, motorcar}. A synset groups word senses [introduction of new technical term] with a synonymous meaning. Another sense of the word "car" is recorded in the synset {car, railcar, railway car, railroad car}. Hence WordNet distinguishes between a word such as "car" and the different senses in which it can be used. [Ah - here you are introducing the notion of wordsense. I think that needs to come earlier in the text.] There are four disjoint kinds of synset, containing either nouns, verbs, adjectives or adverbs. Furthermore, WordNet defines seventeen relations, of which ten hold between synsets (hyponymy, entailment, similarity, member meronymy, substance meronymy, part meronymy, classification, cause, verb grouping, attribute) and five between word senses (derivational relatedness, antonymy, see also, participle, pertains to). The remaining two relations are "gloss" (between a synset and a sentence) and "frame" (between a synset and a verb construction pattern).

This description of the WordNet datamodel sheds very little light on the model for those who are not already familiar with it. Reflecting my concern to aid the user of the conversion, I suggest a much fuller explanation is called for.

3. Prolog Format

I don't see any point in having this section.

The Prolog distribution consists of eighteen files: one file that represents synsets and then one for each of the seventeen relationships. The file with synsets contains Prolog facts such as:

  s(100003009,1,"living_thing",n,1,1).
  s(100003009,2,"animate_thing",n,1,0).

Each fact denotes exactly one word sense. The word senses with the same synset ID together form a synset. The two facts above together form the synset with the ID 100003009.
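The grouping of word senses into synsets described above can be sketched in a few lines of Python. This is only an illustration and not part of the conversion program (which is written in Prolog); the names S_FACT and group_synsets are hypothetical, and the pattern assumes words are written in double quotes without embedded quote characters, as in the two facts shown.

```python
import re
from collections import defaultdict

# One s/6 fact: s(synset_id,w_num,"word",ss_type,sense_number,tag_count).
# Assumption: the word is double-quoted and contains no quote characters.
S_FACT = re.compile(r's\((\d+),(\d+),"([^"]*)",([nvasr]),(\d+),(\d+)\)\.')


def group_synsets(lines):
    """Return {synset_id: [(w_num, word, ss_type, sense_number), ...]} sorted by w_num."""
    synsets = defaultdict(list)
    for line in lines:
        match = S_FACT.fullmatch(line.strip())
        if match is None:
            continue  # ignore anything that is not an s/6 fact
        synset_id, w_num, word, ss_type, sense_number, _tag_count = match.groups()
        synsets[synset_id].append((int(w_num), word, ss_type, int(sense_number)))
    return {synset_id: sorted(senses) for synset_id, senses in synsets.items()}


facts = [
    's(100003009,1,"living_thing",n,1,1).',
    's(100003009,2,"animate_thing",n,1,0).',
]
synsets = group_synsets(facts)
print(synsets)
```

Both facts end up under the single key 100003009, which is the synset they form together.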

Relations are identified by lists of facts like the following:

  hyp(100002056,100001740).
  mp(100004824,100003226).
  ant(100017087,1,100019244,1).

The first identifies a hyponymy relation between two synsets, the second part meronymy between synsets, the third antonymy between two word senses (the second and fourth arguments are word numbers). The documentation defines characteristics for each relationship, such as (anti-)symmetry, inverseness and value restrictions on the lexical groups (e.g. nouns, verbs) that may appear in relations. Most of these informally stated requirements can be formalized in OWL and are present in the conversion. Investigation of the source files and documentation revealed several conflicts between source and documentation. For example, the order of the synset arguments of the member meronym relation seems to differ from what the documentation asserts. For each conflict we have proposed a resolution.
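As a rough illustration of how the three relation facts above differ in shape, the following Python sketch parses a fact and reports whether it links two synsets or two word senses. The helper names parse_fact and endpoints are hypothetical, and the pattern only covers facts whose arguments are all integers (so not, for instance, facts that carry a letter argument).

```python
import re

# Matches facts whose arguments are all integers, e.g. hyp(A,B). or ant(A,1,B,1).
FACT = re.compile(r"(\w+)\((\d+(?:,\d+)*)\)\.")


def parse_fact(line):
    """Split a fact into its functor name and a tuple of integer arguments."""
    match = FACT.fullmatch(line.strip())
    if match is None:
        raise ValueError(f"not an integer-argument Prolog fact: {line!r}")
    functor, args = match.groups()
    return functor, tuple(int(arg) for arg in args.split(","))


def endpoints(args):
    """Two arguments link synsets; four link word senses as (synset_id, w_num) pairs."""
    if len(args) == 2:
        return args[0], args[1]
    if len(args) == 4:
        return (args[0], args[1]), (args[2], args[3])
    raise ValueError(f"unexpected number of arguments: {args!r}")


for line in [
    "hyp(100002056,100001740).",
    "mp(100004824,100003226).",
    "ant(100017087,1,100019244,1).",
]:
    functor, args = parse_fact(line)
    print(functor, endpoints(args))
```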

Details of the conversion can be found in Appendix A. The RDF/OWL files and the conversion files can be found at:

WordNet datamodel: http://www.cs.vu.nl/~mark/wn/wn.rdfs [This references a document not in W3C space. Also the .rdfs extension is odd for an OWL ontology. I also suggest datamodel is the wrong term - schema feels closer but that has problems too. Maybe it just needs more description of what it is.]
WordNet data files: http://www.cs.vu.nl/~mark/wn/rdf/ [Is there a description of the structure of these files somewhere? If so a reference to it would be good here.]
Prolog conversion program with inline comments explaining how to install and use it: http://www.cs.vu.nl/~mark/wn/convertwn.pl

The conversion program makes use of the open-source SWI-Prolog.

4. Conversion to RDF/OWL

The proposed conversion has three main classes: Synset, Word and WordSense. Synset and WordSense get subclasses based on the distinction of lexical groups. For Synset this means subclasses NounSynset, VerbSynset, AdjectiveSynset (which in turn has the subclass AdjectiveSatelliteSynset) and AdverbSynset. For WordSense this means subclasses NounWordSense, VerbWordSense, etcetera. [WHY DOES WORD NOT HAVE THESE SUBCLASSES?] Word has a subclass Collocation, because the Prolog documentation states that hyphens or underscores (replacing spaces) have been used in words to indicate collocated words (e.g. "mix-up" and "eye contact").

	Synset
		AdjectiveSynset
			AdjectiveSatelliteSynset
		AdverbSynset
		NounSynset
		VerbSynset
	
	WordSense
		AdjectiveWordSense
		AdverbWordSense
		NounWordSense
		VerbWordSense
	Word
		Collocation

The class hierarchy of WordNet.

This conversion builds on three previous WordNet conversions, namely by:

Dan Brickley
Stefan Decker & Sergey Melnik
University of Neuchatel

The references I suggested earlier are here.

In this document we have not tried to come up with a completely new conversion. Rather, we have studied these existing conversions and filled in some of the gaps. Here are some of the typical differences with respect to the existing conversions:

Brickley represents the hyponym relationship as rdfs:subClassOf. This is an attractive interpretation, but we argue that not all hyponyms can be interpreted in that way. An attempt to provide a consistent semantic translation of hyponymy has been made [Gangemi, 2003]. In this work we do not attempt a semantic translation of the intended meaning of WordNet relations; rather, we aim at a logically valid translation of the WordNet data model into RDF/OWL.
We represent both Words and WordSenses with URIs. The conversion by the University of Neuchatel represents Words as URIs, but not word senses. For the motivation, see the discussion below on URI generation.
We have split some relations into sub-relations. For example, the Prolog relationship per denotes (a) a relation between an adjective and a noun or adjective or (b) a relation between an adverb and an adjective. We convert per into adjectivePertainsTo and adverbPertainsTo.

The conversion of Neuchatel is closest to the one in this document. The Neuchatel conversion omits the relations "derivation" and "classification". [A style thing: suggest you phrase this positively rather than negatively, i.e. "We have added ..." or something like that, so it doesn't sound like a criticism.] It does not provide sub-relations and inverses for all relationships. Both conversions differ from the other two [which are both and which are the other two? I can't figure it out.] in that they provide OWL extensions in which property characteristics such as symmetry, inverseness and value restrictions are defined.

[Did you mention this already? Maybe I missed it but you seem to be motivating a difference you have not introduced. Maybe when you say in 2. about you don't mean that you've given a URI to what elsewhere is a blank node. You mean you have separated the concepts. In that case don't talk about URIs in 2.] The motivation for representing Words separately is that words are language-specific; the word "chat" in English has a different meaning than the same wordform in French. [You refer to "chat" as a word in English and a wordform in French.] For future integration of WordNet with other multilingual resources it is essential that one can refer to two different words with the same wordform.

Generating URIs

The title talks of "generating" rather than of naming. This suggests that it is written from the viewpoint of technical documentation on the operation of the conversion, rather than documentation for the user.

Besides introducing WordSenses and Words as separate entities, we also introduce URIs for them. Generating URIs generally consists of two parts: choosing a namespace and choosing unique identifiers within that namespace for each separate entity. The choice of the namespace should be discussed with Princeton. Here we discuss the second choice.

In some previous conversions WordSenses did not have a URI. The motivation was that the source does not provide unique identifiers for them. This makes it impossible to refer to WordSenses directly and to use them e.g. for annotation. We have chosen to introduce identifiers for WordSenses by using a compound key: the base URI + a locally unique ID. There are two straightforward options for the local ID. Firstly, the combination of synset + sense number. Secondly, the first word in the synset + lexical group + sense number. We chose the second option as it is more readable. Example:

http://wordnet.princeton.edu/wn#bank-noun-1
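The compound key can be sketched as a small Python function. This is an illustration under assumptions rather than the conversion program's actual code: word_sense_uri and LEXICAL_GROUP are hypothetical names, only the group name "noun" is attested by the example above, and the base URI is the proposal that is still listed as an open issue.

```python
BASE_URI = "http://wordnet.princeton.edu/wn#"  # proposed base URI, still an open issue

# "noun" appears in the example above; the other group names are hypothetical.
LEXICAL_GROUP = {
    "n": "noun",
    "v": "verb",
    "a": "adjective",
    "s": "adjectivesatellite",
    "r": "adverb",
}


def word_sense_uri(word, ss_type, sense_number):
    """Build base URI + word + lexical group + sense number."""
    return f"{BASE_URI}{word}-{LEXICAL_GROUP[ss_type]}-{sense_number}"


print(word_sense_uri("bank", "n", 1))
```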

I'm not sure about the # here because of the problem of dereferencing and fragids not being passed to the server. I suggest replacing the '#' with '/'.

There is nothing in the URI to indicate the version of WordNet. This is an important issue for WordNet because it does grow and change. What is the plan for dealing with versions?

For the local ID of Synsets we have chosen the synset identifier provided in the source. For human readability we add two redundant elements: the first word in the synset and the lexical group symbol. Example:

http://wordnet.princeton.edu/wn#107909067-bank-n

For the URI for Words we use the lexical form, which is unique within English:

http://wordnet.princeton.edu/wn#bank

[THIS IGNORES LANGUAGE ISSUE! should we append language indicator?]

Note that because the synset ID is now incorporated in the Synset URI, an application can only retrieve the ID by parsing the URI. To circumvent this awkward parsing, we introduce a property wn:synsetId for Synset to store the ID in.
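To make the parsing concern concrete, the following Python sketch builds a Synset URI in the form shown earlier and then recovers the ID from it. The function names are hypothetical and the base URI is the proposed one; the sketch assumes the ID is always the part of the fragment before the first hyphen.

```python
BASE_URI = "http://wordnet.princeton.edu/wn#"  # proposed base URI, still an open issue


def synset_uri(synset_id, first_word, ss_type):
    """Build base URI + synset ID + first word + lexical group symbol."""
    return f"{BASE_URI}{synset_id}-{first_word}-{ss_type}"


def synset_id_from_uri(uri):
    """Recover the synset ID by parsing the URI fragment (the awkward route)."""
    fragment = uri.split("#", 1)[1]
    # Split on the first hyphen only, so hyphenated words such as "mix-up" do not interfere.
    return fragment.split("-", 1)[0]


uri = synset_uri("107909067", "bank", "n")
print(uri)
print(synset_id_from_uri(uri))
```

An application that reads wn:synsetId directly does not need the second function at all.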

This is a trivial style thing, but why talk about introducing when you can simply be descriptive and say that synsets have a wn:synsetId property whose value uniquely identifies the synset. Oh - that's interesting - is it inverseFunctional? How would that work with versions? Checking: it is not InverseFunctional. There must be a comment about the scope of its uniqueness.

5. Open Issues

You will see I made some points above that are listed here. I can see that it is maybe easier to collect open issues into one place, but I think it is also useful to the reader to make them aware of the issue when they read the relevant bit of text. I'd suggest putting the issues in line (with a marker so that a text search can easily find them all), or at least a reference to the issue at the relevant point(s) in the text.

Questions regarding class hierarchy

Why is there not a class AdjectiveSatelliteWordSense?
Why does Word not have subclasses NounWord, VerbWord, etcetera? Because we do not distinguish between the noun "bank" and the verb "bank"? Should we make this distinction? Or add the instance both in NounWord and VerbWord?

Versioning

Princeton periodically issues a new WordNet version, which will require a versioning strategy for the RDF/OWL versions. An "old" and a "new" synset must not be collapsed into one synset by RDF because they have the same URI. If this does happen, the properties of the old and new synset are mixed, which is not appropriate (it becomes impossible to distinguish the different synset versions). A solution is to have version-specific URIs and somehow establish the relationship between the old and the new version. The first action is relatively simple; the second may be very complex.

Use version-specific URIs?

http://wordnet.princeton.edu/wn20#

URIs

We have to contact Princeton about choosing a base URI we may use. The current files use a proposed base URI:

http://wordnet.princeton.edu/wn#

The current version uses "hash" instead of "slash" URIs. Is this OK?

Language Issues

Currently the RDF files are given a language tag on the document level (in RDF tag).

The URIs for Words are composed of BASE URI + WORD FORM. This does not enable us to distinguish between words in different languages (e.g. "chat" in English and French).

Frames

The current conversion does not contain the frames, because the Prolog source does not have them. We can import them from the text-format distribution.

Use of rdfs:label

It is good practice to give labels to instances, in this case of Word, WordSense and Synset. For Word this is solved by adding wn:lexicalLabel rdfs:subPropertyOf rdfs:label. For WordSense the content of the rdfs:label is chosen by copying the content of the wn:lexicalLabel of the Word. For Synset the first word (according to W_num) is chosen as the label.

RDF/OWL interoperability

Additional statements have been added so that the OWL source can also be interpreted by RDFS infrastructure.

Thank you.

One of the issues is whether or not to make the inverse properties "visible" to RDFS tools.

I was going to ask what an RDFS processor could 'see' of the schema. I've done a cursory check, but it would be good to test this. My personal opinion is that the RDFS version should be complete, i.e. all classes, properties and instances should be visible to RDFS, all subclass and subproperty relationships and all domain and range constraints. I wonder if there is scope for a tool to take an OWL ontology and generate the RDFS 'view' of it. If it's not complete, we should explicitly list the bits that are missing.

Diacritics and spaces

The source uses escape sequences for diacritics, and underscores to indicate spaces. These haven't been handled yet.

W_num and sense_number

Each WordSense in a Synset has a "W_num" (starting from 1). It seems that this is not essential ordering information (i.e. it is only used to distinguish between word senses in the Prolog source), so it has not been included in the conversion. The same applies to the sense_number in the Prolog source.

Have to check with Princeton if indeed this information is not vital and also check with user community if they are not using these numbers.

Inverses

For the following properties there are no inverses, although they should be added. What would be appropriate naming?

classifiedByTopic
classifiedByRegion
classifiedByUsage

Generating instances of symmetric properties

The Prolog source sometimes contains symmetrical pairs, e.g. the source file for antonyms should contain ant(A,B) but also ant(B,A) according to the documentation. However, the conversion program finds clauses where this is not the case. Currently the program does NOT add an antonym in the RDF for such cases.
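The check described above can be sketched as follows in Python. It is a hedged illustration, not the conversion program's logic: missing_counterparts is a hypothetical name, facts are modelled as (synset_id_A, w_num_1, synset_id_B, w_num_2) tuples, and the third sample fact uses made-up IDs purely to show an unmatched clause.

```python
def missing_counterparts(facts):
    """Return the facts ant(A,B) for which ant(B,A) is not present."""
    present = set(facts)
    return sorted(
        (a, a_num, b, b_num)
        for (a, a_num, b, b_num) in present
        if (b, b_num, a, a_num) not in present
    )


ant_facts = [
    (100017087, 1, 100019244, 1),  # the ant fact shown in Section 3
    (100019244, 1, 100017087, 1),  # its symmetric counterpart (assumed present here)
    (100000001, 1, 100000002, 1),  # hypothetical fact without a counterpart
]
print(missing_counterparts(ant_facts))
```

A report like this could be sent to Princeton to ask whether each unmatched clause is an omission or an error.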

Need to check with Princeton if these are either omissions or errors.

Other

Should wn:seeAlso be a subproperty of rdfs:seeAlso?
We assume wn:sameVerbGroupAs is between synsets, but have to check with Princeton if it should be between WordSenses.
Have to check if appropriate rdfs:comment statements are present in the schema (e.g. meronymOf, WordSense subclasses "meaning" vs. "sense").
Need to check for possible bugs in Prolog conversion program that make it generate wrong RDF output. Also use DL reasoner to check for problems.
Provide a mapping to SKOS

A natural extension of this work would be to integrate the OWL model with LMF (Lexical Markup Framework) under development by the ISO TC37/SC4/WG4.

Appendix A: Conversion details

The following lists the definition of each Prolog clause as stated in the Prolog distribution's documentation, followed by notes on the meaning of the clause, an example, the mapping to RDF/OWL and possible conflicts between documentation and source files.

Sometimes the "synset_id" arguments of the documentation are changed into "synset_id_A" and "synset_id_B" to simplify discussion in the notes.

s(synset_id,w_num,’word’,ss_type,sense_number,tag_count).

An s operator is present for every word sense in WordNet. In wn_s.pl, w_num specifies the word number for word in the synset.

ss_type = {n, v, a, s, r} (standing for noun, verb, adjective, adjective satellite and adverb, respectively)

Maps to: wn:Synset, wn:Word, wn:WordSense (and various properties to connect them to each other), the actual word is stored in the property wn:lexicalForm.
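One way to picture how ss_type selects the subclasses from Section 4 is a pair of lookup tables, sketched here in Python. The table and function names are hypothetical. Mapping "s" to AdjectiveWordSense is an assumption made only because the class hierarchy lists no satellite subclass for WordSense (see the open issues).

```python
# ss_type symbol -> Synset subclass, following the class hierarchy in Section 4.
SYNSET_CLASS = {
    "n": "wn:NounSynset",
    "v": "wn:VerbSynset",
    "a": "wn:AdjectiveSynset",
    "s": "wn:AdjectiveSatelliteSynset",
    "r": "wn:AdverbSynset",
}

# ss_type symbol -> WordSense subclass.
# Assumption: satellites fall back to AdjectiveWordSense, since no
# AdjectiveSatelliteWordSense class is defined.
WORD_SENSE_CLASS = {
    "n": "wn:NounWordSense",
    "v": "wn:VerbWordSense",
    "a": "wn:AdjectiveWordSense",
    "s": "wn:AdjectiveWordSense",
    "r": "wn:AdverbWordSense",
}


def classes_for(ss_type):
    """Return the (Synset subclass, WordSense subclass) pair for an ss_type symbol."""
    return SYNSET_CLASS[ss_type], WORD_SENSE_CLASS[ss_type]


print(classes_for("n"))
print(classes_for("s"))
```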

g(synset_id,’(gloss)’).

The g operator specifies the gloss for a synset.

Maps to: wn:gloss(synset_id, '(gloss)')

hyp(synset_id_A,synset_id_B).

The hyp operator specifies that the second synset is a hypernym of the first synset. This relation holds for nouns and verbs. The reflexive operator, hyponym, implies that the first synset is a hyponym of the second synset.

Example: hyp(100003226,100003009). [organism, living_thing]

Maps to: wn:hyponymOf(synset_id_A, synset_id_B)

ent(synset_id_A,synset_id_B).

The ent operator specifies that the second synset is an entailment of first synset. This relation only holds for verbs.

Example: ent(200001740,200004923) [breathe, inhale], ent(200004701,200004127) [sneeze, exhale]

Maps to: wn:entails(synset_id_A, synset_id_B)

sim(synset_id_A,synset_id_B).

The sim operator specifies that the second synset is similar in meaning to the first synset. This means that the second synset is a satellite of the first synset, which is the cluster head. This relation only holds for adjective synsets contained in adjective clusters.

Maps to: wn:similarTo(synset_id_A, synset_id_B) (note that the order is unimportant here)

mm(synset_id_A, synset_id_B).

The mm operator specifies that the second synset is a member meronym of the first synset. This relation only holds for nouns. The reflexive operator, member holonym, can be implied.

Example: mm(100006026,107463651). [Person, People]

Documentation seems to be wrong here. It's the other way around.

Maps to: wn:memberMeronymOf(synset_id_A, synset_id_B)

ms(synset_id_A, synset_id_B).

The ms operator specifies that the second synset is a substance meronym of the first synset. This relation only holds for nouns. The reflexive operator, substance holonym, can be implied.

Documentation seems to be wrong here. It's the other way around.

Example: ms(102073849,107118730). [oxtail, oxtail soup]

Maps to: wn:substanceMeronymOf(synset_id_A, synset_id_B)

mp(synset_id_A, synset_id_B).

The mp operator specifies that the second synset is a part meronym of the first synset. This relation only holds for nouns. The reflexive operator, part holonym, can be implied.

Documentation seems to be wrong here. It's the other way around.

Example: mp(100004824,100003226). [cell, organism]

Maps to: wn:partMeronymOf(synset_id_A, synset_id_B)

der(synset_id_A, synset_id_B).

The der operator specifies that there exists a reflexive lexical morphosemantic relation between the first and second synset terms representing derivational morphology.

Documentation seems to be wrong here. The actual pattern is der(synset_id_A,nr1,synset_id_B,nr2). The meaning of the numbers is not documented, but they seem to refer to WordSenses within the synsets. "Reflexive" probably means symmetric. It is not clear whether the Prolog source contains "doubles" as it does for other predicates (these can be excluded when creating triples, but since they produce the same triple it does not matter; one could argue whether to create the triple or not when its symmetric counterpart is missing in the source).

Example: der(100002645,3,201420446,4). [unit, unify]

Maps to: wn:derivationallyRelated(WordSense_A, WordSense_B) (note that the order is unimportant here)

cls(synset_id_A, synset_id_B,class_type).

The cls operator specifies that the first synset has been classified as a member of the class represented by the second synset.

class_type: t:topical, u:usage, r:regional

Example: cls(100004824,105681603,t). [cell, biology]

Maps to:

t: wn:classifiedByTopic(synset_id_A,synset_id_B)
u: wn:classifiedByUsage(A,B)
r: wn:classifiedByRegion(A,B)
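The three-way split of cls can be sketched as a lookup from class_type to property name. This Python fragment is illustrative only; CLS_PROPERTY and cls_statement are hypothetical names, and the result is a plain (subject, property, object) tuple rather than real RDF.

```python
# class_type symbol -> property, as listed in the mapping above.
CLS_PROPERTY = {
    "t": "wn:classifiedByTopic",
    "u": "wn:classifiedByUsage",
    "r": "wn:classifiedByRegion",
}


def cls_statement(synset_id_a, synset_id_b, class_type):
    """Turn cls(A,B,class_type) into a (subject, property, object) tuple."""
    return (synset_id_a, CLS_PROPERTY[class_type], synset_id_b)


# The example fact cls(100004824,105681603,t). [cell, biology]
print(cls_statement(100004824, 105681603, "t"))
```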

cs(synset_id_A, synset_id_B).

The cs operator specifies that the second synset is a cause of the first synset. This relation only holds for verbs.

Examples:
cs(200018968,200014429). [cause_to_sleep, sleep/catch_some_Z's]
cs(200020073,200019883). [keep_up, sit_up/stay_up]
cs(200020689,200014429). [anaesthetize/put_to_sleep/... , slumber/sleep/catch_some_Z's]

Documentation seems to be wrong here. It's the other way around (e.g. anaesthetize causes to sleep).

Maps to: wn:causes(A,B)

vgp(synset_id_A, synset_id_B).

The vgp operator specifies verb synsets that are similar in meaning and should be grouped together when displayed in response to a grouped synset search.

Documentation is unclear. The actual format in the file is vgp(sidA, w_num1, sidB, w_num2). But in wn_vgp.pl the w_num's are always '0'. This seems to mean that the relation holds for all the words in the synset, i.e. the relation holds between synsets.

It seems that the file contains all the symmetric definitions, i.e. vgp(A,0,B,0) means that the file also contains vgp(B,0,A,0). One of the two can be ignored. No problem if the conversion code does not do this, because the asserted double triple is exactly the same. But see comment under "der".

Maps to: wn:sameVerbGroupAs(A,B)

at(synset_id_A, synset_id_B).

The at operator defines the attribute relation between noun and adjective synset pairs in which the adjective is a value of the noun. For each pair, both relations are listed (i.e. each synset_id is both a source and target).

Example: at(101028287,300455926). [mercantilism, commercial]

The inverse version is also listed, so both at(A,B) and at(B,A) are in the source file.

Maps to:

if synset A is a noun (so B is an adjective): wn:attribute(A,B)
if synset A is an adjective: wn:attributeOf(A,B)
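The direction-dependent mapping of at can be sketched in Python as below. Two things here are assumptions not stated in this document: that the first digit of a synset ID encodes the syntactic category (1 noun, 2 verb, 3 adjective, 4 adverb), as described in the Prolog distribution's documentation, and that the two properties carry the wn: prefix like the other mappings. The names CATEGORY_BY_FIRST_DIGIT and at_statement are hypothetical.

```python
# Assumption: the first digit of a synset ID encodes its syntactic category.
CATEGORY_BY_FIRST_DIGIT = {"1": "noun", "2": "verb", "3": "adjective", "4": "adverb"}


def at_statement(synset_id_a, synset_id_b):
    """Map at(A,B) to attribute or attributeOf, depending on the category of A."""
    category = CATEGORY_BY_FIRST_DIGIT[str(synset_id_a)[0]]
    if category == "noun":
        return (synset_id_a, "wn:attribute", synset_id_b)
    if category == "adjective":
        return (synset_id_a, "wn:attributeOf", synset_id_b)
    raise ValueError(f"at/2 expects a noun or adjective synset, got {category}")


# at(101028287,300455926). [mercantilism, commercial] and its listed inverse
print(at_statement(101028287, 300455926))
print(at_statement(300455926, 101028287))
```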

ant(synset_id_A,w_num_1,synset_id_B,w_num_2).

The ant operator specifies antonymous words. This is a lexical relation that holds for all syntactic categories. For each antonymous pair, both relations are listed (i.e. each synset_id,w_num pair is both a source and target word).

The synset_id + w_num identifies a word sense.

Maps to: wn:antonymOf(WordSense1, WordSense2)

sa(synset_id,w_num,synset_id,w_num).

The sa operator specifies that additional information about the first word can be obtained by seeing the second word. This operator is only defined for verbs and adjectives. There is no reflexive relation (i.e. it cannot be inferred that the additional information about the second word can be obtained from the first word).

The synset_id + w_num identifies a word sense.

Maps to: wn:seeAlso(WordSense1, WordSense2)

ppl(synset_id,w_num,synset_id,w_num).

The ppl operator specifies that the adjective first word is a participle of the verb second word.

The synset_id + w_num identifies a word sense.

Maps to: wn:participleOf(WordSense1, WordSense2)

per(synset_idA,w_num,synset_idB,w_num).

The per operator specifies two different relations based on the parts of speech involved. If the first word is in an adjective synset, that word pertains to either the noun or adjective second word. If the first word is in an adverb synset, that word is derived from the adjective second word.

Maps to:

A is adjective(satellite), B is noun or adjective(satellite): wn:adjectivePertainsTo(A,B)
A is adverb, B is adjective(satellite): wn:adverbPertainsTo(A,B)
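The split of per into two properties can be sketched as a function of the ss_type symbols of the two word senses involved. This Python fragment is a minimal illustration; per_property is a hypothetical name, and it assumes the ss_type of each synset has already been looked up from the s facts.

```python
ADJECTIVE_TYPES = ("a", "s")  # adjective and adjective satellite


def per_property(ss_type_a, ss_type_b):
    """Choose the sub-relation of per from the lexical groups of A and B."""
    if ss_type_a in ADJECTIVE_TYPES and ss_type_b in ("n",) + ADJECTIVE_TYPES:
        return "wn:adjectivePertainsTo"
    if ss_type_a == "r" and ss_type_b in ADJECTIVE_TYPES:
        return "wn:adverbPertainsTo"
    raise ValueError(f"per is not defined for ss_types {ss_type_a!r}, {ss_type_b!r}")


print(per_property("a", "n"))
print(per_property("r", "s"))
```

Raising an error for any other combination makes source facts that violate the documented value restrictions visible instead of silently converting them.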

fr(synset_id,f_num,w_num).

The fr operator specifies a generic sentence frame for one or all words in a synset. The operator is defined only for verbs.

Maps to: wn:frame(VerbWordSense, Frame)

Appendix B: References

[Brickley, 1999] D. Brickley. Message to RDF Interest Group: "WordNet in RDF/XML: 50,000+ RDF class vocabulary". http://lists.w3.org/Archives/Public/www-rdf-interest/1999Dec/0002.html

[Decker & Melnik] S. Decker and S. Melnik. WordNet RDF representation. http://www.semanticweb.org/library/

[Fellbaum, 1998] C. Fellbaum. WordNet: An Electronic Lexical Database. MIT Press, 1998.

[Gangemi, 2003] A. Gangemi, N. Guarino, C. Masolo, and A. Oltramari. Sweetening WORDNET with DOLCE. AI Magazine, 24(3):13-24, 2003.

[SWI Prolog] http://www.swi-prolog.org/

[University of Neuchatel] WordNet OWL Ontology; http://taurus.unine.ch/GroupHome/knowler/wordnet.html
