Re: Links vs. identifiers (Re: [ACTION-94]: go and find examples of concept ontology (semantic features of terms as opposed to domain type ontologies))

Hi Dave,

with your proposal we run into two issues.

First, who decides what people will provide at these URIs? Will people
trust
http://www.sfs.uni-tuebingen.de/lsd/index.shtml
to provide guidance on obtaining a machine-readable form of the German
wordnet?

Second, and more difficult, people will have a hard time agreeing on this
list of values
"onto-concept | sem-net-node | terminology-entry | eqiv-translation"
as you mention yourself.

I would rather propose the following approach, first given as markup that
we'd define:

<span its-entity entityref="http://www.w3.org/2012/semantic-resources/"
its-selector="gn-synset_loschen_3">löschen</span>
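
To illustrate how an implementation might read this markup, here is a
minimal Python sketch. It assumes the attribute names from the example
above (its-entity, entityref, its-selector), none of which is defined in
any specification yet; the class and function names are made up for
illustration, and nested annotated spans are not handled.

```python
from html.parser import HTMLParser


class EntityAnnotationParser(HTMLParser):
    """Collect entityref, selector and text of spans marked its-entity."""

    def __init__(self):
        super().__init__()
        self.annotations = []
        self._current = None  # annotation being read, if inside a span

    def handle_starttag(self, tag, attrs):
        attr_map = dict(attrs)
        # "its-entity" carries no value, so html.parser reports it as None;
        # only its presence matters here.
        if tag == "span" and "its-entity" in attr_map:
            self._current = {
                "entityref": attr_map.get("entityref"),
                "selector": attr_map.get("its-selector"),
                "text": "",
            }

    def handle_data(self, data):
        if self._current is not None:
            self._current["text"] += data

    def handle_endtag(self, tag):
        if tag == "span" and self._current is not None:
            self.annotations.append(self._current)
            self._current = None


def extract_annotations(html):
    """Return a list of annotation dicts found in the given HTML string."""
    parser = EntityAnnotationParser()
    parser.feed(html)
    parser.close()
    return parser.annotations


sample = (
    '<p><span its-entity '
    'entityref="http://www.w3.org/2012/semantic-resources/" '
    'its-selector="gn-synset_loschen_3">löschen</span></p>'
)
for annotation in extract_annotations(sample):
    print(annotation["entityref"], annotation["selector"], annotation["text"])
```
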


At

http://www.w3.org/2012/semantic-resources/

we would have a table with two columns:

1) Name of the semantic resource and information on how to get it, including
licensing information that we need to make people aware of

2) Prefix to be used in the "selector" attribute, e.g. "gn" for GermaNet
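
As a rough sketch of how an implementation could use such a table: the
selector value is split at the first hyphen into the registered prefix and
the resource-local identifier, and the prefix is looked up in a local copy
of the table. Only the "gn" prefix and the selector value come from the
example above; the table contents, the field names and the function name
resolve_selector are assumptions for illustration.

```python
from typing import NamedTuple


class SemanticResource(NamedTuple):
    """One row of the (hypothetical) table of semantic resources."""
    name: str        # name of the resource
    how_to_get: str  # how to obtain it, including licensing notes


# Local copy of the table. Only the "gn" prefix appears in this mail;
# the descriptive text is a placeholder, not real registry content.
REGISTRY = {
    "gn": SemanticResource(
        name="GermaNet",
        how_to_get="placeholder: download and licensing information",
    ),
}


def resolve_selector(selector, registry=REGISTRY):
    """Split 'prefix-localid' and look the prefix up in the table."""
    prefix, separator, local_id = selector.partition("-")
    if not separator or not prefix or not local_id:
        raise ValueError(f"selector has no 'prefix-id' form: {selector!r}")
    if prefix not in registry:
        raise KeyError(f"unregistered resource prefix: {prefix!r}")
    return registry[prefix], local_id


resource, local_id = resolve_selector("gn-synset_loschen_3")
print(resource.name, local_id)  # prints "GermaNet synset_loschen_3"
```

An unknown prefix raises an error instead of being guessed at, which
matches the idea that only resources agreed with their hosts are listed.
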



The benefit of this proposal is that implementors (and we) don't have to
decide on values like
onto-concept | sem-net-node | terminology-entry | eqiv-translation
Our working group would maintain

http://www.w3.org/2012/semantic-resources/
and add entries for new resources - *if* there is an agreement with the
host of the resource. I know that this puts a burden on us, namely to talk
directly to the hosts. But we need to do that anyway, otherwise no
implementor will be able to make use of the information.

WRT "hosting by W3C or others", see the approach of link types in HTML5:

I) You have hard-wired link types at
http://dev.w3.org/html5/spec/links.html#linkTypes
II) You have extensions at
http://microformats.org/wiki/existing-rel-values#HTML5_link_type_extensions

The point is that only with I) do you achieve broad adoption, while II) is
driven by a community process; see
http://microformats.org/wiki/existing-rel-values#HTML5_link_type_extensions
"Before you register a new value .... Entries lacking any of the above
required data will likely be removed"

Again, without us engaging the hosts of the resources directly and trying to
achieve I) as much as possible and II) with a well-defined process, a link
to external information is pretty likely to be useless.

A question on ISOCAT: at
http://www.isocat.org/interface/index.html
I see a browsable structure of categories, but no dump to download them,
and no URI to identify ISOCAT in general. I may just be missing it, but could
you point me (or other potential implementors) to that? For GermaNet or
WordNet that information is easy to find, btw.

Thanks,

Felix

2012/6/9 Dave Lewis <dave.lewis@cs.tcd.ie>

>  Hi Felix,
> I think option 'a' makes the most sense. If language resource providers
> want their language resources to be accessed over the web, then they should
> be well motivated to provide stable URIs. There seem to be plenty who are,
> such as the wordnet site you cite and ISOCAT, and this is inherent for
> providers taking the semweb ontology route.
>
> If they aren't willing to provide stable URIs I'm not sure it should be
> the W3C's job to compensate for this. I'm not clear why this was done in
> the Unicode codepoint collation case - perhaps they were so key that the
> W3C made this a special case?
>
> I've two other questions we can follow up with next week:
> 1) if there is a stable URI for the particular resource item, do we need a
> separate attribute for the  resource and then a selector for the item if it
> is only ever a fragment ID? Would a single fragment URI not suffice?
>
> 2) I like Pedro's categorisation of different resource types. But as pointed
> out in the thread, this still isn't sufficient by itself to enable a client
> to understand how to interpret the resource - this requires some detailed
> knowledge of the resource schema in the general case. So does it make sense
> to hardwire this into an attribute name? Might it be better to have it as a
> value to an attribute like resource type? e.g.
> its-referenced-resource-type : onto-concept | sem-net-node |
> terminology-entry | eqiv-translation
>
> Given we are not sure if this is the right enumeration, at least this way
> we could specify this as non-normative values, that could be added to
> later.
>
> The ideal would be if referenced resources also offered a URL to a
> standardised resource meta-data record, such as the META-SHARE meta-data
> model, which contained sufficient knowledge for a client to interpret the
> fragment URI (or URI and selector) correctly.
>
> There will be many of the right people in Dublin to have a good discussion
> on this.
>
> cheers,
> Dave
>
> On 08/06/2012 16:39, Felix Sasaki wrote:
>
> Hi Pedro, all,
>
> 2012/6/8 Pedro L. Díez Orzas <pedro.diez@linguaserve.com>
>
>>   Dear Tadej, Felix, Yves, Dave, all,
>>
>>
>>
>> I checked with some experts, and they told me the following:
>>
>>
>>
>> *It would be great if links to wordnet can be included in the
>> annotations. The best thing to do would be to use the open linked data
>> versions of wordnet:*
>>
>>
>> *http://thedatahub.org/dataset/vu-wordnet*
>>
>>
>> *It has URIs for synsets (actually sense meanings but I convinced them
>> they need to shift to synset IDs, which they will do in the near future).
>> English synsets are good for any language since the other languages link to
>> English (still as an Inter Lingual Index). Eventually, other wordnets will
>> also be published as linked open data.*
>>
>>
>>
>> *Another thing is domain tags. WordnetDomain tags are used here (Dewey
>> system). Since it is linked to English Wordnet it is linked to any synset
>> in any language linked to English. That will be a very useful semantic tag
>> also for translation.*
>>
>>
>>
>> I think this is the right way to reinforce the connection between MLW-LT
>> and open linked data. I hope it helps.
>>
>
>
>  The above is great. I just want to make sure that we are in sync on
> one aspect: we need sustainable *identifiers* for the resources you
> mentioned. Let me try to make the difference clear with the "codepoint
> based collation" example below:
>
>  - An application that wants to use code point based collation needs the
> data tables for that
> - http://www.w3.org/2005/xpath-functions/collation/codepoint/ is not a
> way to download the data tables, but to identify that kind of collation
>
>  Take as an example related to our area the way wordnet is used in this
> XQuery processor
>
> http://cf.zorba-xquery.com.s3.amazonaws.com/doc/zorba-2.0/zorba/html/ft_thesaurus.html
>
>  [
>
> let $x := <msg>affluent man</msg>
>
> return $x contains text "wealthy"
>
> using thesaurus at "http://wordnet.princeton.edu"
>
> ]
>
>
>  The statement using thesaurus at "http://wordnet.princeton.edu" does
> not mean that the thesaurus is downloaded from the wordnet site at
> Princeton. It just means that the XQuery processor invokes its cached
> version of wordnet, which is identified by the URI
> http://wordnet.princeton.edu
>
>
> For our scenarios, I assume processing steps like this
>
> 1) Automatic annotation leading to e.g. this
>
> <span its-disambiguation
>   its-semantic-network-ref="http://www.sfs.uni-tuebingen.de/lsd/index.shtml"
>   its-selector="#synset_loschen_3">löschen</span>
>
>  2) An application that knows where to find the resource identified by
>
> http://www.sfs.uni-tuebingen.de/lsd/index.shtml
>
> can cache that resource and use it e.g. for improving MT or other
> (localization) workflows.
>
>
>  The conclusion from this is that we need to ask the providers of the
> resources for one of the following:
>
> a) a stable URI for identification; resolving that URI should give
> implementors of 2) the information they need for caching the resource in an
> implementation specific manner.
>
> b) that they allow W3C to provide the URI, like in the collation example:
> it is W3C which hosts
> http://www.w3.org/2005/xpath-functions/collation/codepoint/  , not the
> Unicode consortium that provides the codepoint list.
>
>
>
> Which of a) or b) do people prefer?
>
>  Best,
>
>  Felix
>
>
>>
>> Best,
>>
>> Pedro
>>
>>
>>  ------------------------------
>>
>> *From:* Dave Lewis [mailto:dave.lewis@cs.tcd.ie]
>> *Sent:* Thursday, 7 June 2012 23:58
>> *To:* public-multilingualweb-lt@w3.org
>>
>> *Subject:* Re: [ACTION-94]: go and find examples of concept ontology
>> (semantic features of terms as opposed to domain type ontologies)
>>
>>
>>
>> Hi Tadej,
>> I spoke to some people from ISOCAT at LREC. They operate persistent URLs
>> for their platform, so with an example perhaps we could add that to the
>> list?
>>
>> cheers,
>> Dave
>>
>> On 07/06/2012 15:19, Felix Sasaki wrote:
>>
>>
>>
>> 2012/6/7 Tadej Stajner <tadej.stajner@ijs.si>
>>
>> Hi Felix,
>> as far as I'm aware, URIs only exist for the English wordnet. Maybe
>> prefixing it with a # was not the best stylistic choice here, but yes, what I
>> meant to convey is that that value was a local identifier, valid within a
>> particular semantic network.
>>
>> In the ideal scenario, these selectors would be dereferenceable and
>> verifiable via URIs for arbitrary wordnets and terminology lexicons and
>> their entries.
>>
>>
>>
>>
>>
>> OK - the main point would be that they are dereferenceable and verifiable.
>> In practice, you will not achieve that for arbitrary wordnets, but you can
>> achieve that for a subset, if the related "players" agree. In the
>> "collation" example mentioned before, the identifier for the Unicode code
>> point based collation
>> http://www.w3.org/2005/xpath-functions/collation/codepoint/ was the
>> lowest common denominator; in addition to that everybody is free to have
>> other URIs for arbitrary collations. I would hope that we could end up with
>> such a list (hopefully longer than one) for the semantic networks too.
>>
>>
>>
>> Felix
>>
>>
>>
>>
>>
>>   Do we have any people involved in developing semantic networks or term
>> lexicons on this list? The compromise is allowing some limited classes of
>> non-URI local selectors, like synset IDs for wordnets, and term IDs for TBX
>> lexicons.
>>
>> -- Tadej
>>
>>
>>
>> On 6/7/2012 3:44 PM, Felix Sasaki wrote:
>>
>> Thanks, Tadej.
>>
>>
>>
>> The value of the its-selector attribute looks like a document internal
>> link. But it is probably an identifier of the synset in the given semantic
>> network, no?
>>
>>
>>
>> About 1) and 2): is your made-up example then the output of the text
>> annotation use case? I am asking since you say "2) markup in raw ITS", so
>> I'm not sure.
>>
>>
>>
>> Also, it seems that an implementation needs to "know" about the resources
>> that are identified via its-semantic-network-ref. This is really an
>> identifier, like
>>
>> http://www.w3.org/2005/xpath-functions/collation/codepoint/
>>
>> is an identifier for a Unicode code point collation; it doesn't give you
>> the collation data, but creating an implementation that "understands" the
>> identifier means probably caching the collation data. The same would be
>> true for the semantic network.
>>
>>
>>
>> This leads to the next question: can we engage the developers of the
>> semantic network (or other disambiguation related) resources to come up
>> with stable URIs for these? It would be great to list these URIs in our
>> specification and say "this is how you identify the English wordnet etc.",
>> for scenarios like the collation data mentioned above.
>>
>>
>>
>> Felix
>>
>> 2012/6/7 Tadej Štajner <tadej.stajner@ijs.si>
>>
>> Hi,
>>
>> I agree with Pedro on the questions. Automatic word sense disambiguation
>> is in practice still not perfect, so some semi-automatic user interfaces
>> make a lot of sense. Here is how I think this could look in a made-up
>> example, answering Felix's 1) and 2):
>>
>> 1) HTML+ITS: <span its-disambiguation its-semantic-network-ref=
>> "http://www.sfs.uni-tuebingen.de/lsd/index.shtml"
>> its-selector="#synset_loschen_3">löschen</span>
>>
>> 2) Markup in raw ITS
>>  <its:disambiguation
>>     semanticNetworkRef="http://www.sfs.uni-tuebingen.de/lsd/index.shtml"
>>     selector="#synset_loschen_3">löschen</its:disambiguation>
>>
>> -- Tadej
>>
>>
>>
>>
>> On 04. 06. 2012 13:53, Pedro L. Díez Orzas wrote:
>>
>> Dear Felix,
>>
>>
>>
>> Thank you very much. Probably Tadej can prepare the use cases you
>> mention, with the consolidated data category. About questions 3 and 4, I
>> can tell you the following:
>>
>>
>>
>> 3) Would it be produced also by an automatic text annotation tool?
>>
>>
>>
>> For the pointers to the three kinds of information referred to (concepts
>> in an ontology, meanings in a lexical DB, and terms in terminological
>> resources), I think semiautomatic annotation tools would be possible, that
>> is, annotations proposed by the tool and confirmed by the user.
>>
>>
>>
>> Fully automatic text annotation would need more sophisticated
>> “semantic calculus”, and most of this is still under research, as far as I
>> know. Maybe, in these cases, it should be combined with
>> textAnalysisAnnotation, specifying in *Annotation agent* and *Confidence
>> score* which system produced it and with what reliability.
>>
>>
>>
>> 4) Would 1-2 be consumed by an MT tool, or by other tools?
>>
>>
>>
>> These can basically be consumed by language processing tools, like MT,
>> and other linguistic technology that needs content or semantic info, for
>> instance text analytics, semantic search, etc. In localization chains,
>> this information can also be used by automatic or semiautomatic processes
>> (like selection of dictionaries for translations, or selection of
>> translators/revisers by subject area).
>>
>>
>>
>> It could also be used by humans for translation or post-editing in case
>> of ambiguity or lack of context in the content, but mostly by automatic
>> systems.
>>
>>
>>
>> I hope this helps.
>>
>> Pedro
>>
>>
>>  ------------------------------
>>
>> *From:* Felix Sasaki [mailto:fsasaki@w3.org]
>> *Sent:* Saturday, 2 June 2012 14:13
>> *To:* Tadej Stajner; pedro.diez
>> *CC:* public-multilingualweb-lt@w3.org
>> *Subject:* Re: [ACTION-94]: go and find examples of concept ontology
>> (semantic features of terms as opposed to domain type ontologies)
>>
>>
>>
>> Hi Tadej, Pedro, all,
>>
>>
>>
>> this looks like a great chain of producing and consuming metadata.
>>
>>
>>
>> Apologies if this was explained during last week's call or before, but can
>> you clarify the following a bit:
>>
>>
>>
>> 1) What would the actual HTML markup produced in the original text
>> annotation use case look like?
>>
>> 2) What would the markup in this use case look like?
>>
>> 3) Would it be produced also by an automatic text annotation tool?
>>
>> 4) Would 1-2 be consumed by an MT tool, or by other tools?
>>
>>
>>
>> Thanks again,
>>
>>
>>
>> Felix
>>
>> 2012/5/31 Tadej Stajner <tadej.stajner@ijs.si>
>>
>> Hi Pedro,
>> thanks for the excellent explanation. If I understand you correctly, a
>> sufficient example for this use case would be annotation of individual
>> words with synset URI of the appropriate wordnet? If so, then I believe
>> this route can be practical - I think linking to the synset is a more
>> practical idea than expressing semantic features of the word given the
>> available tools.
>>
>> Enrycher can do automatic all-word disambiguation into the English
>> wordnet, whereas  we don't have anything specific in place for semantic
>> features (which I suspect also holds for other text analytics providers).
>>
>> I'm also in favor of prescribing wordnets for individual languages as
>> valid selector domains as you suggest in option 1). That would make
>> validation easier since we have a known domain.
>>
>> @All: Can we come up with a second implementation for this use case,
>> preferably a consumer?
>>
>> -- Tadej
>>
>>
>>
>>
>> On 5/29/2012 2:00 PM, Pedro L. Díez Orzas wrote:
>>
>> Dear all,
>>
>>
>>
>> Sorry for the delay. I tried to contact some people I think can
>> contribute to this, but they are not available these weeks.
>>
>>
>>
>> Before providing an example so that we can all consider whether it is
>> worthwhile to keep the “semantic selector” attribute in the consolidation
>> of “Disambiguation”, I would like to make a couple of observations:
>>
>>
>>
>>    1. We will probably not have any implementation in the short term, but
>>    there are, for example, a few semantic networks available on the web
>>    (see http://www.globalwordnet.org/gwa/wordnet_table.html) that could be
>>    mapped using semantic selectors. See, for example, the famous online
>>    WordNet (http://wordnetweb.princeton.edu/perl/webwn).
>>    2. The W3C SKOS (Simple Knowledge Organization System Reference)
>>    working group may be dealing with similar things.
>>
>>
>>
>> The “semantic selector” allows finer lexical distinctions (for simple
>> words or multi-word expressions) than a “domain” or an ontology like NERD.
>> Also, the denotation is different from the “concept reference”, above all
>> for parts of speech such as verbs.
>>
>>
>>
>> Within the same domain, referring to very similar concepts, languages
>> have semantic differences. Depending on the semantic theory used, each
>> tries to capture these differences by means of different systems
>> (semantic features, semantic primitives, semantic nodes (in semantic
>> networks), other semantic representations). An example could be the German
>> verb “löschen”, which in different contexts can take different meanings
>> that one can try to capture using different selectors, with the different
>> systems.
>>
>>
>>
>> –  löschen  -> clear       (some bits)
>>             -> delete      (files)
>>             -> cancel      (programs)
>>             -> erase       (a scratchpad)
>>             -> extinguish  (a fire)
>>
>>
>>
>> Other possible translations of the verb “löschen” are:
>>
>> delete: löschen, streichen, tilgen, ausstreichen, herausstreichen
>> clear: löschen, klären, klarmachen, leeren, räumen, säubern
>> erase: löschen, auslöschen, tilgen, ausradieren, radieren, abwischen
>> extinguish: löschen, auslöschen, zerstören
>> quench: löschen, stillen, abschrecken, dämpfen
>> put out: löschen, bringen, ausmachen, ausschalten, treiben, verstimmen
>> unload: entladen, abladen, ausladen, löschen, abstoßen, abwälzen
>> discharge: entladen, erfüllen, entlassen, entlasten, löschen, ausstoßen
>> wipe out: auslöschen, löschen, ausrotten, tilgen, zunichte machen, auswischen
>> slake: stillen, löschen
>> close: schließen, verschließen, abschließen, sperren, zumachen, löschen
>> blot: löschen, abtupfen, klecksen, beklecksen, sich unmöglich machen, sich verderben
>> turn off: ausschalten, abbiegen, abstellen, abdrehen, einbiegen, löschen
>> blow out: auspusten, löschen, aufblasen, aufblähen, aufbauschen, platzen
>> zap: abknallen, düsen, umschalten, löschen, töten, kaputtmachen
>> redeem: einlösen, erlösen, zurückkaufen, tilgen, retten, löschen
>> pay off: auszahlen, bezahlen, tilgen, abzahlen, abbezahlen, löschen
>> switch out: löschen
>> unship: ausladen, entladen, abnehmen, löschen
>> souse: eintauchen, durchtränken, löschen, nass machen
>> rub off: abreiben, abgehen, abwetzen, ausradieren, abscheuern, löschen
>> strike off: löschen
>> land: landen, an Land gehen, kriegen, an Land ziehen, aufsetzen, löschen
>>
>>
>>
>>
>>
>>
>>
>> According to this, the consolidation of the disambiguation/namedEntity
>> data categories under “Terminology”
>> http://www.w3.org/International/multilingualweb/lt/wiki/Requirements#disambiguation
>> could be the following. It is meant to cover operational URI or XPath
>> pointers to the three semantic resources that are currently most
>> important: conceptual (ontology), semantic (semantic networks or lexical
>> databases) and terminological (glossaries and terminological resources),
>> where ontologies are used for both general lexicon and terminology,
>> semantic networks to represent general vocabulary (lexicon), and
>> terminological resources to represent specialized vocabulary.
>>
>>
>>
>> *disambiguation*
>>
>> Includes data to be used by MT systems in disambiguating difficult content
>>
>>
>>
>> *Data model*
>>
>>    - concept reference: points to a *concept in an ontology* that this
>>    fragment of text represents. May be a URI or an XPath pointer.
>>    - semantic selector: points to a *meaning in a semantic network* that
>>    this fragment of text represents. May be a URI or an XPath pointer.
>>    - terminology reference: points to *a term in a terminological
>>    resource* that this fragment of text represents. May be a URI or an
>>    XPath pointer.
>>    - equivalent translation: expressions of that concept in other
>>    languages, for example for training MT systems
>>
>>
>>
>>
>>
>> Also, I would keep *textAnalysisAnnotation*, since the purpose is quite
>> different.
>>
>>
>>
>> Anyway, if we decide not to include “semantic selector” now, maybe it
>> can be kept for future versions or treated in liaison with other groups.
>>
>>
>>
>> I hope it helps,
>>
>> Pedro
>>
>>
>>
>> *__________________________________*
>>
>> *Pedro L. Díez Orzas*
>>
>> *Presidente Ejecutivo/CEO*
>>
>> *Linguaserve Internacionalización de Servicios, S.A.*
>>
>> *Tel.: +34 91 761 64 60
>> Fax: +34 91 542 89 28*
>>
>> *E-mail: pedro.diez@linguaserve.com*
>>
>> *www.linguaserve.com*
>>
>>  *____________________________________*
>>
>> --
>> Felix Sasaki
>>
>> DFKI / W3C Fellow


-- 
Felix Sasaki
DFKI / W3C Fellow

Received on Saturday, 9 June 2012 04:31:46 UTC