Re: tag for notion and compound indication from Al Gilman on 2005-08-04 (www-html@w3.org from August 2005)

From: Al Gilman <Alfred.S.Gilman@IEEE.org>
Date: Thu, 4 Aug 2005 15:50:15 -0400
To: acc10-2005-67@gmx.de, www-html@w3.org
Message-Id: <p0611040abf181e0a8ea0@[10.0.1.2]>
At 8:33 PM +0200 8/4/05, acc10-2005-67@gmx.de wrote:
>Orion Adrian wrote:
>
>>  The only things that should be marked up
>>  are those things that a computer cannot
>>  do itself.
>
>Ok, but that's why the marking of notions and compound break points is an
>issue for the coder.
>
>How should a machine know on its own the notion structure of a text or the
>compounds? I know that there is a lot of research in artificial intelligence
>but I do not expect my machine to get in touch with it soon ;-). In fact
>only the compound break point analysis could be done automatically, but only
>by checking the text against highly qualified dictionaries and I do not see
>this as an appropriate solution, when you can store this information in the
>document itself.
>
>Maybe my suggestion was misunderstood. In no way do I want to mark up each
>and every notion in a text, but only those the author considers to be
>of relevance for e.g. indexing or for getting an idea of the document's
>content (of interest for more structured web search). So maybe <nfi> (notion
>for index) or <nor> (notion of relevance) instead of <n> (preferred because
>of length) would have been clearer suggestions; maybe my English wasn't good
>enough to convey my idea. In Germany we know "Schlagwort" or "Stichwort"
>but there seems to be no equivalent English word in my dictionary.
>
>If interested in an example, please scroll down.
>
>Of course no one should be "forced" to use notion if not interested in
>adding this semantic information (meaning <n> should be an optional tag),
>just the same way no one is forced to indicate all abbreviations used with
><abbr> if there is no benefit from it.
>
>Do you remember printed literature? It had such indexes in the appendix. And
>I consider them a good approach for the web as well, between the two
>extremes "site/document search" and "sitemap/table of contents".
>
>-----------------
>
>An example to demonstrate potential use of a <n> tag:
>
>Text copied and pasted from
>http://www.pitt.edu/~heinisch/ca_germ.html
>Author: Patricia Dinsmore
>
><p>
>The Germans have traditionally regarded their model as "<n>Sonderweg</n>",
>that is a middle of the road approach between free <n>market liberalism</n>
>and <n>state-centered socialism</n>. The <n>welfare system</n> is an
>integrated part of Germany's "<n>social market economy</n>." Particularly
>significant is the fact that in Germany, more than in most countries,
>welfare policies have been mechanisms of <n>economic governance</n>. That is
>welfare policies are designed to enhance <n>employment effects</n> by
>withdrawing <n>surplus labor</n> from the economy. In short, early
>retirement schemes or long university programs serve to constrain the supply
>of labor when unemployment rates are high. This has prompted critics to
>charge that Germany has the oldest students, youngest retirees and longest
>vacationing workers in the world.
></p>
>
>With a dedicated user agent or server side processing this could result in
>an alphabetical index like:
>
>- economic governance
>- economy, social market
>- effects, employment
>- employment effects
>- governance, economic
>- labor, surplus
>- liberalism, market
>- market economy, social
>- market liberalism
>- socialism, state-centered
>- social market economy
>- Sonderweg
>- state-centered socialism
>- surplus labor
>- system, welfare
>- welfare system
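
For illustration, the index above can be approximated by rotating each
marked term so that every word leads one entry. This is a minimal sketch
in Python, assuming the proposed <n> element (which is not part of any
HTML specification); the function names are hypothetical, and the sort
order may differ slightly from the hand-made list.

```python
import re


def rotations(term):
    """Return one index entry per word, with that word leading the entry."""
    words = term.split()
    entries = []
    for i in range(len(words)):
        head = " ".join(words[i:])   # words from position i onward
        tail = " ".join(words[:i])   # words that came before position i
        entries.append(f"{head}, {tail}" if tail else head)
    return entries


def build_index(markup):
    """Collect the text of every <n>...</n> span and build a sorted index."""
    terms = re.findall(r"<n>(.*?)</n>", markup, flags=re.DOTALL)
    entries = set()
    for term in terms:
        # Collapse line breaks and repeated spaces inside a marked term.
        normalized = " ".join(term.split())
        entries.update(rotations(normalized))
    # Case-insensitive ordering so "Sonderweg" sorts among lowercase terms.
    return sorted(entries, key=str.casefold)


if __name__ == "__main__":
    sample = (
        'a "<n>Sonderweg</n>" between free <n>market liberalism</n> '
        "and <n>state-centered socialism</n>"
    )
    for entry in build_index(sample):
        print("-", entry)
```

A regular expression is enough here only because the marked spans are
assumed to be flat text; nested markup would call for a real parser.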

The apparent conventional wisdom on this is that any such index
generator will scan and match the natural language text against some
sort of lexicon (in your case, a lexicon of key words). In your
example, the compound terms (multi-word terms) would appear in the
lexicon as phrases, and that is how the scanner would know to match
these N-grams and not just single-word, whitespace-delimited tokens.
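
As a rough sketch of that scan-and-match approach, here is a
longest-match-first pass over the tokens in Python. The function name
and the sample lexicon are illustrative assumptions, not an existing
tool, and the tokenizer is deliberately naive.

```python
import re


def match_lexicon(text, lexicon):
    """Return lexicon phrases found in text, preferring the longest match."""
    # Naive tokenizer: runs of word characters and hyphens, lowercased.
    tokens = re.findall(r"[\w-]+", text.lower())
    phrases = {tuple(p.lower().split()) for p in lexicon}
    max_len = max((len(p) for p in phrases), default=0)

    found = []
    i = 0
    while i < len(tokens):
        # Try the longest possible N-gram first, then shorter ones.
        for n in range(min(max_len, len(tokens) - i), 0, -1):
            candidate = tuple(tokens[i:i + n])
            if candidate in phrases:
                found.append(" ".join(candidate))
                i += n  # skip past the matched phrase
                break
        else:
            i += 1  # no phrase starts at this token
    return found


if __name__ == "__main__":
    lexicon = ["social market economy", "market economy", "surplus labor"]
    text = 'Part of the "social market economy" is withdrawing surplus labor.'
    print(match_lexicon(text, lexicon))
```

Trying the longest N-gram first is what keeps "social market economy"
from being reported as the shorter "market economy".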

The most "next to market" work in W3C that resembles what you are
wishing for is the pronunciation lexicon development in the Voice
Browser Working Group.

http://www.w3.org/Voice/

They already have a 'phoneme' markup in SSML for the case where you
are willing to burden with markup each occurrence of the term in the
text. The lexicon would be a resource that is used as a reference by
the scan+match sort of operation discussed above. That is my
principal evidence for saying "the likely answer is processing, not
data, for this association."

Beyond that, the WAI is interested in concept linking.

http://www.w3.org/WAI/PF/natural-lang-20030326.html

http://lists.w3.org/Archives/Public/wai-xtech/2005Jul/0000.html

The division of labor between data and processing in meeting those
needs is, how shall I say, speculative at this time, beyond the
expectation that there would have to be some of each.

The "next to market" solution for more general notion linking could
well be the application of SKOS terms in the metadata structures
in XHTML 2.0.

http://www.w3.org/TR/2005/WD-swbp-skos-core-guide-20050510/

http://www.w3.org/TR/xhtml2/mod-meta.html#s_metamodule

Al
Received on Thursday, 4 August 2005 20:30:18 UTC