existing approaches related to the identification of terms from Lieske, Christian on 2005-04-26 (public-i18n-its@w3.org from April to June 2005)

From: Lieske, Christian <christian.lieske@sap.com>
Date: Tue, 26 Apr 2005 15:27:18 +0200
To: <public-i18n-its@w3.org>
Message-ID: <0F568FE519230641B5F84502E0979DD102B7254A@dewdfe12.wdf.sap.corp>
Hello,

During the 2005-04-21 teleconference I took the action item to look into
existing approaches related to the identification of terms. Please find
my results below.

Best regards,
Christian
---

Markup related to terminology comes in at least two disguises:

1. dedicated: markup whose primary purpose is the codification of
information related to terminological or lexical data (examples: TBX,
OLIF)

2. non-dedicated: markup whose primary purpose is the codification of
other information but which has provisions for terminological/lexical
data (examples: DocBook, XHTML)

For the purpose of the current ITS discussion about existing term
identification approaches, the remainder focuses on the second type
(non-dedicated).

The general observation about non-dedicated markup for terminology is
that no standard exists. Different approaches are taken with respect to
at least four dimensions (see below).

1. Approach to Term Classes

Terms can be classified, for example, as abbreviation, initialism,
acronym, etc. Accordingly, we see at least the following approaches in
markup:

a. The class is an attribute of an element (here, the attribute value is
meant to correspond to the data category "abbreviation" from ISO 12620).

	<term class="ISO12620:2.1.8.1">W3C</term>

b. Selected classes (e.g. abbreviations) get their own representation by
means of an element.

	<abbrev>W3C</abbrev>
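
A consumer that has to cope with both approaches might normalize them to
a single class value. The following is only a minimal sketch in Python,
assuming the element and attribute names from the two examples above.
The mapping table ELEMENT_CLASSES and the function name term_class are
hypothetical, and the identifier is simply the value from example (a).

```python
import xml.etree.ElementTree as ET

# Hypothetical mapping from dedicated elements (approach b) to the class
# identifier that approach a carries as an attribute value.
ELEMENT_CLASSES = {"abbrev": "ISO12620:2.1.8.1"}


def term_class(element):
    """Return the term class of an element, or None if none is known."""
    if element.tag == "term":
        # Approach a: the class is stated explicitly on the element.
        return element.get("class")
    # Approach b: the element name itself implies the class.
    return ELEMENT_CLASSES.get(element.tag)


a = ET.fromstring('<term class="ISO12620:2.1.8.1">W3C</term>')
b = ET.fromstring("<abbrev>W3C</abbrev>")
```

With such a table, term_class(a) and term_class(b) yield the same value,
while an element without terminological meaning yields None.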

2. Approach to Term-related Information

Often, the value of terminology is increased through term-related
information such as usage information (e.g. "deprecated"), alternate
forms or cross-references. Accordingly, we see at least the following
approaches in markup:

a. The alternate form is an attribute of an element.

	<abbrev fullForm="World Wide Web Consortium">W3C</abbrev>
   
b. The alternate form is given its own representation as an element.
This element gets referenced.

	<abbrev fullForm="#ffW3C">W3C</abbrev>
	<fullForm id="ffW3C">World Wide Web Consortium</fullForm>
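
To illustrate what the two variants mean for a consumer, here is a
minimal sketch in Python that resolves the full form in either case. It
assumes that a leading "#" in the attribute value marks a reference to
an "id" attribute, as in example (b). The function name full_form and
the wrapping 'doc' element are hypothetical.

```python
import xml.etree.ElementTree as ET


def full_form(abbrev, root):
    """Return the full form of an abbrev element, or None if unresolved."""
    value = abbrev.get("fullForm")
    if value is None:
        return None
    if value.startswith("#"):
        # Approach b: the value points to a fullForm element by its id.
        target_id = value[1:]
        for candidate in root.iter("fullForm"):
            if candidate.get("id") == target_id:
                return candidate.text
        return None
    # Approach a: the value is the full form itself.
    return value


doc_a = ET.fromstring(
    '<doc><abbrev fullForm="World Wide Web Consortium">W3C</abbrev></doc>'
)
doc_b = ET.fromstring(
    '<doc><abbrev fullForm="#ffW3C">W3C</abbrev>'
    '<fullForm id="ffW3C">World Wide Web Consortium</fullForm></doc>'
)
```

Note that this sketch would misread a literal full form that happens to
start with "#". This is one reason why mixing both variants in a single
attribute is fragile.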

3. Approach to Location

Several approaches exist with respect to the location of term-related
information (e.g. terms and definitions):

a. inline

<para>This paragraph contains an inline term definition. <termdef>A
software module called an <glossterm>XML processor</glossterm> is used
to read XML documents and provide access to their content and
structure.</termdef> The definition comes from <link
xlink:href="http://www.w3.org/TR/REC-xml">the XML
Recommendation</link>.</para>

b. block

<dl>
	<dt>XML processor</dt>
	<dd>Software module used to read XML documents and
provide access to their content and structure.</dd>
</dl>
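
The difference in location matters to any tool that harvests terms and
definitions. The following minimal sketch in Python collects pairs from
both variants, assuming the element names used in the two examples above
('termdef', 'glossterm', 'dl', 'dt', 'dd'). The function name
definitions and the shortened sample texts are hypothetical.

```python
import xml.etree.ElementTree as ET


def _text(element):
    """Return all text inside an element with whitespace normalized."""
    return " ".join("".join(element.itertext()).split())


def definitions(root):
    """Collect (term, definition) pairs from inline and block markup."""
    pairs = []
    # Inline: the definition is the termdef text, the term sits inside it.
    for termdef in root.iter("termdef"):
        term = termdef.find("glossterm")
        if term is not None:
            pairs.append((_text(term), _text(termdef)))
    # Block: each dt is followed directly by the dd that defines it.
    for dl in root.iter("dl"):
        children = list(dl)
        for first, second in zip(children, children[1:]):
            if first.tag == "dt" and second.tag == "dd":
                pairs.append((_text(first), _text(second)))
    return pairs


inline = ET.fromstring(
    "<para>Intro. <termdef>A software module called an "
    "<glossterm>XML processor</glossterm> is used to read XML "
    "documents.</termdef></para>"
)
block = ET.fromstring(
    "<dl><dt>XML processor</dt>"
    "<dd>Software module used to read XML documents.</dd></dl>"
)
```

In the inline case the term is part of the defining sentence, whereas in
the block case term and definition are separate. A harvesting tool has
to decide which of the two shapes it wants to emit.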

4. Relationship to Automated Text and other Markup

Very often terms are viewed as good candidates for special types of
processing, such as generating a back-of-the-book index or special
weighting in indices built by search engines. However, different
approaches are taken.

a. explicit

If a term is to be used, for example, as an entry in a back-of-the-book
index, it has to be tagged specifically.

	<indexTerm>W3C</indexTerm>
	The <term>W3C</term> is a standards body.

b. implicit

No special markup is used. Rather, the processing repurposes the
existing markup. An indexer, for example, may be configured in such a
way that all 'term' elements are treated in a special way.
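
The contrast between the explicit and the implicit approach can be
sketched as a single indexer that is merely configured differently. This
is a minimal Python illustration, not a description of any existing
tool. The function name index_entries, its index_tags parameter and the
second term ("XML") in the sample are hypothetical.

```python
import xml.etree.ElementTree as ET


def index_entries(root, index_tags):
    """Return the sorted, de-duplicated text of all elements to index."""
    return sorted(
        {
            "".join(element.itertext()).strip()
            for element in root.iter()
            if element.tag in index_tags
        }
    )


doc = ET.fromstring(
    "<para><indexTerm>W3C</indexTerm>The <term>W3C</term> is a standards "
    "body that publishes <term>XML</term>.</para>"
)

# Explicit: only elements tagged specifically for the index are used.
explicit = index_entries(doc, {"indexTerm"})
# Implicit: the indexer is configured to reuse all 'term' elements.
implicit = index_entries(doc, {"term"})
```

In this sample the implicit configuration also picks up "XML", which
nobody tagged for the index explicitly.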

5. Motivation

Term-related markup seems to have different motivations, ranging from
special rendering (all terms should stand out from ordinary text) to
special-purpose applications (e.g. linking to a GUI menu item).
Accordingly, term-related markup gets inserted by people with differing
skill sets for specific purposes.
Received on Tuesday, 26 April 2005 13:27:29 UTC