- From: Thomas Baker <thomas.baker@izb.fraunhofer.de>
- Date: Wed, 27 Oct 2004 15:51:43 +0200
- To: SW Best Practices <public-swbp-wg@w3.org>
SWBPD "Vocabulary Management" Draft, 2004-10-27

Abstract

Metadata element sets, taxonomies, subject headings, thesauri, and ontologies are examples of vocabularies which are increasingly used in a "Semantic Web" environment. Managing vocabularies for use in Semantic Web applications means identifying, documenting, and publishing vocabulary terms in ways that facilitate their citation and re-use in a wide range of applications. This paper examines practices in the maintenance communities for representative vocabularies ranging from small and informal to large and complex. The paper formulates principles of good practice and summarizes discussion on issues for which good practice has yet to emerge.

1. Introduction

1.1. Vocabularies in the Semantic Web

The Semantic Web is an open, distributed, loosely-coupled environment with many languages (metadata element sets, controlled vocabularies, taxonomies, thesauri, ontologies, etc.). Organizations or even individuals can define and publish vocabulary terms in an open, bottom-up, and distributed manner.

This paper is addressed to people who want to create and maintain such a vocabulary. It articulates some basic principles for doing so in a Semantic-Web-friendly way. By this we mean vocabularies that can support processes of referencing, repurposing, recombining, or merging data from a diversity of sources; that are evolvable; that are extensible and mixable with other Semantic Web vocabularies; and that are declared in a way that is processable by networked machines in an emerging "semantic infrastructure".

[Bernard asks: Which processes are the terms supposed to support -- indexing, vocabulary merging, data integration, search...? Do we say something about those processes or are we agnostic?]
TASK: James - One page on "vocabularies in Semantic Web"

The two placeholder paragraphs above should be expanded into one short page providing a general introduction to the topic "vocabularies in the Semantic Web" -- what kinds of vocabularies are we talking about here (e.g., the typology in [PIDCOCK]) and what does it mean to use them in a "Semantic Web" environment? Rather than elaborate very much in-line, this section should point off to further reading about Semantic Web.

1.2. Method of this paper

In Section 2, this paper will formulate a few principles of good practice applicable to Semantic Web vocabularies in general. To illustrate these principles, the paper will describe practices used in several vocabularies chosen to exemplify a range from small and informal to large and complex:

-- FOAF

   TASK: DanBri and Libby - One paragraph on FOAF

   FOAF serves as an example of a "relatively small" vocabulary for "descriptive metadata" about people and their interests [FOAF]. Its maintenance processes are "somewhat informal".

-- Dublin Core

   TASK: Tom - One paragraph about Dublin Core

   Dublin Core serves as an example of a "medium-sized" vocabulary for "descriptive metadata" about information resources [DC]. Its maintenance processes are "lightweight but not weightless" and increasingly formal as DCMI evolves from a workshop-driven movement to a stable maintenance community supported by institutional stakeholders.

-- SKOS

   TASK: Alistair - One paragraph about SKOS

   SKOS serves as an example of a "medium-sized" vocabulary for describing "thesauri" and similar types of knowledge organization systems. (Not sure about maintenance issues.) The SWBPD thesaurus activity should be cited.

-- Princeton Wordnet

   TASK: Aldo - One paragraph about wordnet issues

   As a lexical system of synonym sets for the English language, Princeton Wordnet can serve as an example of a "large-scale" vocabulary. (Not sure about maintenance issues.) The SWBPD activity should be cited [SWBP-WNET].
-- A major medical or life-sciences vocabulary?

   TASK: Alan or Natasha - An example of a large-scale ontology?

   Do we perhaps need another major example? It would be good to have a "large-scale" vocabulary of the "ontology" sort, preferably with some well-defined maintenance and versioning policies...

In addition, this paper cites several prior works on good practice in closely related areas:

-- World Wide Web Architecture and Semantic Web principles

   TASK: DanBri - Bullet point on W3C good-practice documents

   TBL has written about Web architecture, and TAG has come out with Architecture of the World Wide Web, First Edition [SW-ARCHITECTURE and W3C-TAGARCHITECTURE]. A bullet point should put these various formal and informal position papers into the proper perspective for outsiders to W3C.

-- OASIS Published Subjects

   TASK: Bernard - Bullet point on OASIS Published Subjects

   The bullet point should provide some context on Topic Maps and Semantic Web and on the PSI Recommendation [OASIS-PUBSUBJ].

The terminology used to talk about vocabularies and their underlying linguistic models differs between user communities. Without wishing to imply that these differences are trivial, this paper uses a small set of words defined with deliberate fuzziness:

Term
    A named concept.
Vocabulary
    A set of terms.
URI Reference
    A globally unique identifier.
Description
    A set of statements about a term or vocabulary.
Declaration
    A machine-processable representation of a term or vocabulary.
Vocabulary Owner
    The maintainer of a term set.
Versioning
    The identification of changes to a term or vocabulary.

These words are qualified in the examples which follow and in the Glossary.

One potential source of confusion should perhaps be acknowledged and discussed up-front: the term "namespace", which is used in a number of vocabulary communities, W3C in particular, but is (in my opinion) difficult to pin down.
If we can agree to use "vocabulary" in this paper (noting the usage of "namespace" where appropriate), I would like to task someone (DanBri?) to explain the W3C use of the term "namespace".

TASK: DanBri or Libby - Describe W3C usage of the word "namespace"

2. Principles of Good Practice

Short paragraph explaining that in this section, we formulate and illustrate principles of good practice on which we generally agree.

2.1. Identify Terms with URI References.

TASK: DanBri - Define "URI Reference", elaborating in the Glossary
TASK: DanBri - Sentence or two on FOAF term URIrefs
TASK: Tom - Sentence or two on DCMI term URIrefs
TASK: Tom - A sentence on the "CORES Resolution"
TASK: Alistair - Sentence or two on SKOS term URIrefs
TASK: Aldo - Sentence or two on Wordnet term URIrefs
TASK: DanBri - What W3C says about identifying terms
TASK: Bernard - What PSI says about identifying terms

2.2. Articulate and publish maintenance policies for the Terms and their URI references.

A Vocabulary Owner should specify and publish any policies governing the maintenance of the terms and their URI references: e.g. institutional commitments to persistence and semantic stability. This short to medium-length section should simply describe a sample of such policies.

[It would be nice if we could agree on something of the substance of those policies, such as stability of URI references in the face of "semantically compatible" evolution, but this may be difficult to define.]

TASK: DanBri - Describe maintenance policies for FOAF
TASK: Tom - Describe maintenance policies for DCMI
TASK: Alistair - Describe maintenance policies for SKOS
TASK: Aldo - Describe maintenance policies for Wordnet
TASK: DanBri - What W3C says about maintenance policies
TASK: Bernard - What PSI says about maintenance policies
TASK: Alistair - TAG Versioning on "semantic stability"

2.3. Identify the historical version of a Vocabulary or its Terms.
Building on the previous section, this section should look at versioning from the standpoint of identification. At what level of granularity does versioning operate? Are URI references being assigned to individual terms, to sets of terms in the abstract, or to documents or schemas of term sets? Presumably, this section should highlight W3C practice in this area (e.g., the method of distinguishing a timeless Latest Version from a date-stamped This Version and Previous Version).

TASK: Ralph - Longer paragraph on versioning in W3C
TASK: DanBri - Short paragraph on versioning in FOAF
TASK: Tom - Short paragraph on versioning in DCMI
TASK: Alistair - Short paragraph on versioning in SKOS
TASK: Aldo - Short paragraph on versioning in Wordnet
TASK: Bernard - Short paragraph on versioning in PSI
TASK: Alistair - What TAG says about versioning
TASK: Alan - "What constitutes a change?"

2.4. Provide documentation about the Terms.

The Vocabulary Owner should write and publish a human-readable description of the Terms -- typically, at a minimum, text definitions on a Web page. This short section should merely say what sort of Web documents are made available for the example vocabularies.

TASK: DanBri - One sentence pointing to FOAF Web documents
TASK: Tom - One sentence pointing to DCMI Web documents
TASK: Alistair - One sentence pointing to SKOS Web documents
TASK: Aldo - One sentence pointing to Wordnet Web documents
TASK: DanBri - One sentence pointing to W3C Web documents
TASK: Bernard - One sentence pointing to PSI Web documents

2.5. Declare the Terms using a machine-processable schema language.

This short section should merely say what sorts of schemas the example maintenance communities publish. Policies for dereferencing and choice of schema language will be discussed in more detail in Section 3.

TASK: DanBri - Two sentences on FOAF schemas.
TASK: Tom - Two sentences on DCMI schemas.
TASK: Alistair - Two sentences on SKOS schemas.
TASK: Aldo - Two sentences on Wordnet schemas.
TASK: DanBri - Two sentences on W3C schemas.
TASK: Bernard - Two sentences on PSI schemas.

3. Questions on the Bleeding Edge

Paragraph explaining that Section 3 discusses issues on which consensus currently seems more elusive. Our goal is to describe the range of positions taken.

3.1. What should the identifier of a Vocabulary or Term (i.e., its URI Reference) resolve to when someone "clicks on it" in a Web browser?

We could reword this as the problem of resolving ("dereferencing") Term URIs to human-readable descriptions or machine-processable declarations. Several years ago, Tim Berners-Lee said that "The namespace document (with the namespace URI) is a place for the language publisher to keep definitive material about a namespace. Schema languages are ideal for this." Others have disagreed with this and the question was taken up by TAG. Point 3.1 should summarize the state of discussion.

If Terms are documented in multiple ways, should a Vocabulary Owner distinguish between "canonical" versus "derived" sources?

TASK: Ralph - Paragraph or two on W3C dereferencing policy
TASK: Bernard - Paragraph on PSI dereferencing policy
TASK: DanBri - Short paragraph on FOAF dereferencing policy
TASK: Tom - Short paragraph on DCMI dereferencing policy
TASK: Alistair - Short paragraph on SKOS dereferencing policy
TASK: Aldo - Short paragraph on Wordnet dereferencing policy

3.2. Which schema language should be used to declare the Vocabulary machine-processably?

Short answer: It depends what you want to say. This section should characterize the assertions made in schemas published by various communities.

TASK: DanBri - Short paragraph on what FOAF schemas assert.
TASK: Tom - Short paragraph on what DCMI schemas assert.
TASK: Aldo - Short paragraph on what Wordnet schemas assert.
TASK: DanBri - Short paragraph on what W3C schemas assert.
TASK: Bernard - Short paragraph on what PSI schemas assert.
TASK: Alistair - Short paragraph on what SKOS schemas assert.

In particular, there was a discussion in September on the SWBPD list on different approaches to modeling thesauri [THESAURUS-MODEL]. For example, one could use OWL or RDFS to represent an existing language of thesaurus relations and simply translate an existing thesaurus into those terms. Or one could fundamentally remodel the thesaurus using native OWL constructs -- a much more ambitious task (because the semantics of class, subclass, etc., are not identical to thesaurus terms). When is it "good enough" to express the fuzzy semantics of an existing thesaurus, which can be done rather automatically, and what does the extra effort of remodeling an ontology buy for applications? There is an overlap here with the PORT task force.

TASK: Alistair - Discuss alternative ways to model a thesaurus

3.3. What does it mean to "use" Terms from one Vocabulary in another?

This issue has at least two aspects:

-- The problem of "semantic context". Terms may be embedded in clusters of relations from which they may be seen in part to derive their meaning. It may therefore not always be sensible to use those terms out of context. Examples include the terms of thesauri or ontologies, as well as XML elements, which may be defined with respect to parent elements and may therefore not always be reusable as properties in an RDF sense without violating their semantic intent.

   TASK: Bernard - Reuse of existing terms in a local context
   TASK: Tom - DCMI on "terms usable as RDF properties"
   TASK: Everyone - Using terms outside of their original contexts

-- Application profiles. Many (most?) vocabulary maintainers end up with some notion of "profile" to designate a constrained subset of the vocabulary and/or a language which mixes multiple vocabularies for a particular purpose or application. The VM note could characterize the nature of these constructs.
   TASK: Tom - Describe the DCMI notion of "application profile"
   TASK: Everyone - Describe other notions of "application profile"

3.4. What does it mean to "own" a Vocabulary?

In this section, we acknowledge that "vocabularies" are inherently a human linguistic phenomenon. As with other forms of language, there is inevitably a tension between the meaning intended by a speaker and meaning as interpreted or imposed by others. If this paper is addressed to vocabulary maintainers (existing and potential) -- and we have in essence articulated some responsibilities for vocabulary maintainers (in Section 2 above) -- we should also question our underlying assumptions.

The RDF Concepts and Abstract Syntax draft of 2003-01-23 said that "The social conventions surrounding use of RDF assume that any RDF URI reference gains its meaning from some defining individual, organization or context... For important documents, the use of third-party vocabulary should be restricted to terms defined by trustworthy parties (e.g. recognized standards bodies or reputable organizations)...". In response to that draft, however, there was animated discussion about the "social meaning" versus the "formal meaning" of RDF assertions [SW-MEANING]. This debate should perhaps be summarized from the standpoint of a Vocabulary maintainer.

TASK: Jeremy? - Summarize discussion of "social meaning"

Even if we acknowledge the notion of "ownership" to be problematic, we should perhaps introduce the notion of "trust". Tom could briefly describe negotiations between the DCMI Usage Board and the Library of Congress, whereby LoC asserts certain MARC Relator terms (identified with URI references) to be sub-properties of dc:contributor, and DCMI endorses those assertions ("assertion etiquette"?).

TASK: Tom - DCMI endorsing assertions about MARC Relator terms
TASK: Everyone - Comment on the role of the "vocabulary owner"

3.5. When a term is needed, when should one adapt an existing term, declare a new one, or get an established vocabulary maintainer to host it?

It would be good to end the VM note with this question, because I suspect that a lot of readers will be asking precisely this question. This is where we can summarize our understanding of good practice for maintenance and persistence policy. Andy Powell's sensible advice on these issues could be summarized here [DC-IDENTIFIERS], along with a general characterization of the "vocabulary market" [VOCABULARY-MARKET].

We could introduce the notion of a Vocabulary Host, and Tom would be happy to describe discussion about this within DCMI from the standpoint of long-term maintenance responsibility and related institutional models.

Given that one option is to coin new URI references, we should at least characterize choices with regard to forming the identifier strings: "hash or slash" and the implied semantics of words, version numbers, or directory hierarchies embedded in URI strings.

TASK: DanBri or Libby - Describe the "vocabulary market"
TASK: DanBri or Libby - Formation of URI strings ("hash or slash" etc)
TASK: Tom - DCMI guidelines on coining URI references
TASK: Tom - DCMI perspective on "namespace hosting"
TASK: Everyone - When and how to declare new or reuse existing terms

Glossary

This section -- if we need it -- can provide annotations for our minimal terminology from the standpoint of other vocabulary maintenance communities. From the standpoint of Dublin Core, for example, one might note here that "term" corresponds to what DCMI calls an Element or Element Refinement (aka Property), or an Encoding Scheme, etc. Alan could point out how this use of "term" differs from "term" in the medical community (as distinct from "concept").

-- Term: a named concept.
-- Vocabulary: a set of terms.
-- URI Reference: a globally unique identifier.
-- Description: a set of statements about a term or vocabulary.
-- Declaration: a machine-processable representation of a term or vocabulary.
-- Vocabulary Owner: the maintainer of a term set.
-- Versioning: the identification of changes to a term or vocabulary.

TASK: DanBri or Libby - Define URI Reference

According to my notes, the "RFC2396bis redraft will, in the Appendix, clearly state why we say URIref not just URI" [RFC2396bis].

TASK: DanBri - Annotate Glossary with FOAF usage where appropriate
TASK: Tom - Annotate Glossary with DCMI usage where appropriate
TASK: Alistair - Annotate Glossary with SKOS usage where appropriate
TASK: Aldo - Annotate Glossary with Wordnet usage where appropriate
TASK: Ralph - Annotate Glossary with W3C usage where appropriate
TASK: Bernard - Annotate Glossary with PSI usage where appropriate

References

[I have started to fill out the references. The names next to many of the references are my best guess as to who should cover a particular resource in the context of the paper. Note several related articles or resources are sometimes grouped under one heading - please help decide which of them is most salient for the purposes of citation. Please also let me know if any of the following are no longer needed, and feel free to help fill in any missing citation information.]

[CORES-RESOLUTION] - Tom
    CORES Resolution on Metadata Element Identifiers,
    http://www.dlib.org/dlib/july03/baker/07baker.html.

[DC] - Tom
    http://dublincore.org/documents/dcmi-terms/
    http://dublincore.org/

[DC-IDENTIFIERS] - Tom
    Powell, A., Guidelines for assigning identifiers to metadata terms, [draft],
    http://www.ukoln.ac.uk/metadata/dcmi/term-identifier-guidelines/.

[DC-NAMESPACE] - Tom
    DCMI Namespace Policy,
    http://dublincore.org/documents/dcmi-namespace/

[DC-PROFILES] - Tom
    Dublin Core Application Profiles,
    http://www.cenorm.be/isss/cwa14855/.
[FOAF] - DanBri and Libby
    http://xmlns.com/foaf/0.1/
    http://www.w3.org/2001/sw/Europe/events/foaf-galway/
    http://rdfweb.org/topic/FoafGalway
    FOAF Community Process,
    http://rdfweb.org/topic/FOAFCommunityProcess.

[OASIS-PUBSUBJ] - Bernard
    Pepper, S., ed., Public Subjects: Introduction and Basic Requirements, OASIS Published Subjects Technical Committee Recommendation, 2003-06-24,
    http://www.oasis-open.org/committees/download.php/3050/pubsubj-pt1-1.02-cs.pdf.
    http://www.oasis-open.org/committees/tc_home.php?wg_abbrev=tm-pubsubj
    http://www.oasis-open.org/committees/tm-pubsubj/docs/recommendations/issues.htm
    Also: OASIS (ISO/TS 15000) ebXMLRegistry Semantic Content.

[PIDCOCK]
    Pidcock, W., Relationships between Metamodels, Ontologies, Thesauri, Taxonomies and Controlled Vocabularies,
    http://www.metamodel.com/article.php?story=20030115211223271
    Comments by Mike Uschold:
    http://www.metamodel.com/article.php?story=20030115211223271#comments

[RDF-PRIMER]
    RDF Primer,
    http://www.w3.org/TR/rdf-primer/.

[RDF-QUERY] - where does this fit?
    Libby and Dan work on RDF query,
    http://www.ilrt.bris.ac.uk/discovery/2001/06/process/.

[RFC2396bis] - DanBri
    http://www.ietf.org/internet-drafts/draft-fielding-uri-rfc2396bis-07.txt

[SKOS] - Alistair
    SKOS Core Guide, http://esw.w3.org/topic/SkosCoreGuideToc - SKOS Core Guide
    http://www.w3.org/2004/skos/core.rdf
    http://www.w3.org/2001/sw/Europe/reports/thes/1.0/guide/
    http://www.w3c.rl.ac.uk/2003/11/21-skos-mapping

[SWBP-WNET] - Aldo
    Gangemi, A., editor. Porting Wordnets to the Semantic Web,
    http://www.w3.org/2001/sw/BestPractices/WNET/Porting.
    http://www.cogsci.princeton.edu/%7Ewn/index.shtml

[SWAD-THESAURUS] - Dan, Bernard and Alistair participated
    SWAD-E Thesaurus - "standard" thesaurus change management guidelines are wanted,
    http://lists.w3.org/Archives/Public/public-esw-thes/2004Apr/

[SW-ARCHITECTURE] - DanBri or Libby?
    Berners-Lee, T. Getting into RDF and Semantic Web using N3,
    http://www.w3.org/2000/10/swap/Primer.
    Berners-Lee, T.
    Web Architecture from 50,000 feet, 1999,
    http://www.w3.org/DesignIssues/Architecture#Namespaces

[SWBP-THESAURUS] - Dan and Alistair
    Semantic Web Best Practices: Thesaurus Task Force,
    http://www.w3.org/2004/03/thes-tf/mission

[SW-MEANING] - volunteer needed to summarize!
    RDF Core discussion on issues related to social meaning (Jeremy),
    http://www.w3.org/TR/2003/WD-rdf-concepts-20030123/#section-Meaning
    had WG consensus, then got trashed:
    http://lists.w3.org/Archives/Public/www-rdf-comments/2003JanMar/0366
    then got revised:
    http://lists.w3.org/Archives/Public/www-rdf-comments/2003JanMar/0486
    http://www.w3.org/2001/sw/meetings/tech-200303/social-meaning
    Mailing list addressing questions of "namespace ownership":
    http://lists.w3.org/Archives/Public/public-sw-meaning/2004Jun/

[THESAURUS-MODEL]
    VM discussion thread on SWBPD list, e.g.:
    http://lists.w3.org/Archives/Public/public-swbp-wg/2004Sep/0035.html
    http://lists.w3.org/Archives/Public/public-swbp-wg/2004Sep/0036.html
    http://lists.w3.org/Archives/Public/public-swbp-wg/2004Sep/0042.html

[VOCABULARY-MARKET] - DanBri
    Vocabulary Market,
    http://esw.w3.org/topic/VocabularyMarket
    Image Annotation meeting in Madrid,
    http://rdfig.xmlhack.com/2004/06/07/2004-06-07.html#1086615887.400193
    RDFIG Geo vocab workspace,
    http://www.w3.org/2003/01/geo/.

[W3C-VERSIONING] - Ralph
    W3C Publication Rules,
    http://www.w3.org/2004/02/02-pubrules.html
    URIs for W3C Namespaces,
    http://www.w3.org/1999/10/nsuri

[W3C-TAGARCHITECTURE] - DanBri?
    Jacobs, I., Walsh, N., Architecture of the World Wide Web, First Edition, Technical Architecture Group (TAG),
    http://www.w3.org/TR/2004/WD-webarch-20040816/.

[W3C-TAGISSUES] - DanBri or Libby
    W3C TAG on "What should a 'namespace document' look like?",
    http://www.w3.org/2001/tag/issues.html#namespaceDocument-8.
    TAG "consensus" on namespace documents,
    http://www.w3.org/2003/09/15-tag-summary.html.
    Resource Directory Description Language (RDDL),
    http://www.tbray.org/tag/rddl4.html.
[W3C-TAG-XMLVERSIONING] - Alistair
    Orchard, D., Walsh, N., eds. Versioning XML Languages, Proposed TAG Finding 16 November 2003 [Editorial Draft],
    http://www.w3.org/2001/tag/doc/versioning

[WGS84] - DanBri??
    Walsh, J. An RDF vocabulary for WGS84 geo positioning [Informational Internet draft], RDF Interest Group,
    http://space.frot.org/draft-geo-draft.html.

--
Dr. Thomas Baker                        Thomas.Baker@izb.fraunhofer.de
Institutszentrum Schloss Birlinghoven   mobile +49-160-9664-2129
Fraunhofer-Gesellschaft                 work +49-30-8109-9027
53754 Sankt Augustin, Germany           fax +49-2241-144-2352
Personal email: thbaker79@alumni.amherst.edu
Received on Wednesday, 27 October 2004 13:46:28 UTC