
Re: [All] domain data category section proposal, please review

From: Felix Sasaki <fsasaki@w3.org>
Date: Wed, 4 Jul 2012 12:21:02 +0200
Message-ID: <CAL58czq=jKFUbK5h3+YwCs6sX=3KF4bruQStfPniYU4WvyAzWQ@mail.gmail.com>
To: Dave Lewis <dave.lewis@cs.tcd.ie>
Cc: public-multilingualweb-lt@w3.org
2012/7/4 Dave Lewis <dave.lewis@cs.tcd.ie>

>  comments inline -
>
> On 04/07/2012 05:45, Felix Sasaki wrote:
>
> Hi Dave,
>
> 2012/7/4 Dave Lewis <dave.lewis@cs.tcd.ie>
>
>>  Hi Felix,
>> One question on the domainMapping example you give for the domain data
>> category. This assumes the workflow has a single canonical set of IDs
>> identifying 'auto', 'medicine', 'law', but this may not always be the case,
>> e.g. where SMT engines are trained on a mix of parallel data with their own
>> separate corpora domain naming schemes.
>>
>
>  Couldn't you accommodate that by having several domainRule elements?
>
>
>
> I don't think so since there's nothing in the domainRule that indicates
> that the simple target names are in different name spaces.
>

Well, the ITS general mechanisms allow you to have different rules for the
same kind of content, exactly for the case where there are different
metadata items for the same purpose. So you can have
<its:translateRule selector="//code" translate="no"/>
<its:translateRule selector="//code" translate="yes"/>
and the second rule takes effect, since it is the last rule (as in CSS
stylesheets). So you can describe a best practice for using different,
vendor-specific sets of mappings, rather than changing the definition of
"domain".
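
To illustrate the precedence, here is a minimal Python sketch of "the last matching rule wins". It only models what is described above: the function name effective_translate and the (selector, value) rule tuples are made up for this mail, and it ignores local markup and inheritance, which real ITS processing also has to handle.

```python
import xml.etree.ElementTree as ET


def effective_translate(root, rules, default="yes"):
    """Return {element: translate value}; later rules override earlier ones.

    `rules` is a list of (selector, value) tuples in document order,
    a hypothetical stand-in for a sequence of its:translateRule elements.
    """
    values = {el: default for el in root.iter()}
    for selector, value in rules:
        # ElementTree only supports relative paths, so "//code" is
        # evaluated as ".//code" from the root element.
        for el in root.findall("." + selector):
            values[el] = value
    return values


root = ET.fromstring("<doc><p>See <code>x = 1</code></p></doc>")
rules = [("//code", "no"), ("//code", "yes")]
code = root.find(".//code")
print(effective_translate(root, rules)[code])
```

The same ordering idea would apply to several domainRule elements carrying different vendor-specific mappings.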

>
>
>   So a simple naming scheme means that the workflow provider must ensure
>> consistency of that scheme and that the document editor (often the client)
>> has knowledge of that scheme.
>>
>> So could the data category  as is accommodate multiple naming schemes
>> (e.g. from the client and from third parties) within the workflow by simply
>> using a URL instead of a simple name?  e.g.
>>
>> domainMapping="automotive auto, medical medicine, 'criminal law' http://www.taus.org/domain/law, 'property law' http://www.client.com/domain-names/law"
>>
>>   This has to be answered by Thomas and Declan, I think: they (and one
> external provider) agreed on the simple scheme. I'm fine with introducing
> URIs, but we need implementations making use of them.
>
>
> Yes, Thomas, Declan: your advice would be appreciated.
>

I would go further here: without a commitment from them to implement the
URI mechanism, we shouldn't go that route. Declan mentioned in his
contribution
http://lists.w3.org/Archives/Public/public-multilingualweb-lt/2012Jun/0152.html
that Dublin Core plays an important role here - but you will only need a
URI to refer to Dublin Core in general, not to the keywords themselves.


> To make this example a bit more concrete, with domains used in MT training, I
> can easily see cases where the mappings have to accommodate one domain scheme
> that is specific to the client, one that is used by the LSP across clients
> and one relevant to an MT service provider that could be influenced by the
> range of existing training corpora they have.
>
> Reconciling this into a single name space for each workflow instance would
> be onerous for all parties and wouldn't be that helpful for any of the
> parties if they operate with multiple clients/providers. In settings where
> you have one large client working all the time with a fixed set of
> providers, then establishing a common set of domain names would make sense.
>

I see your issue of conflicting metadata for the same piece of content,
however I think the general rule mechanism is also a solution for that, as
stated above.

Another issue arises when using domain pointers to web content. See e.g.
http://www.w3.org/International/multilingualweb/lt/drafts/its20/examples/xml/EX-domain-2.xml
where the metadata is a list of keywords. Requiring a URI will create a
conflict when people want to use this existing data. And again, from
Declan's and also Thomas'
http://lists.w3.org/Archives/Public/public-multilingualweb-lt/2012Jun/0141.html
contributions, it seems that existing MT systems often use lists of
keywords. You would make it hard for these systems to use the data
category.
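
As a rough illustration of how a consumer tool might apply a domainMapping to such a keyword list, here is a small Python sketch. It follows the syntax of the example earlier in this thread (comma-separated pairs, quotes around source values with spaces). The function names are hypothetical, and two things are my assumptions rather than something the draft prescribes: source values are compared case-insensitively, and unmapped keywords are passed through unchanged. Splitting on every comma also means a quoted value containing a comma is not supported.

```python
import shlex


def parse_domain_mapping(value):
    """Parse "source target, 'two words' target2" into a dict."""
    mapping = {}
    for pair in value.split(","):
        # shlex.split honours the quotes around source values with spaces.
        parts = shlex.split(pair)
        if len(parts) != 2:
            raise ValueError(f"expected 'source target', got {pair!r}")
        source, target = parts
        mapping[source.lower()] = target
    return mapping


def map_domains(keywords, mapping):
    """Map a comma-separated keyword list to consumer-specific values."""
    cleaned = [k.strip() for k in keywords.split(",")]
    return [mapping.get(k.lower(), k) for k in cleaned]


mapping = parse_domain_mapping(
    "automotive auto, medical medicine, "
    "'criminal law' http://www.taus.org/domain/law"
)
print(map_domains("Automotive, criminal law, cooking", mapping))
```

The right-hand side can be a plain name or a URI; the parser does not care, which is why existing keyword-based systems would not need to change.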


>
> By the way Felix, I like the way the mapping feature provides a path to
> accommodating conventions as direct meta declaration. That's a pattern that
> we might consider in some other cases for document level meta-data.
>


Thanks, and I very much agree. Arle recently told me that there was a
discussion at the ISO meeting in Madrid about whether MLW-LT will define or
refer to data categories, as provided by DCR. I would go the same route as
for domain: in these areas there is already a lot of existing metadata. ITS
2.0 can serve "as a glue" to make it easier to use the metadata in various
systems.

Best,

Felix


>
> cheers,
> Dave
>
>
>
>  Best,
>
>  Felix
>
>
>>
>> cheers,
>> Dave
>>
>>
>> On 29/06/2012 07:52, Felix Sasaki wrote:
>>
>> Hi all,
>>
>>  FYI, I wrote the domain section based on the initial proposal and this
>> thread, please have a look at
>> http://www.w3.org/International/multilingualweb/lt/drafts/its20/its20.html#domain
>>
>>  This closes ACTION-144. I also updated
>>
>>
>> http://www.w3.org/International/multilingualweb/lt/wiki/Implementation_Commitments#New_ITS_2.0_categories
>> With a link to the section.
>>
>>  Best,
>>
>>  Felix
>>
>> 2012/6/27 Felix Sasaki <fsasaki@w3.org>
>>
>>> Declan, all, thanks a lot for your feedback. I think we are close to
>>> consensus about this, and I have given myself an ACTION-144 to put this
>>> into the draft by next week.
>>>
>>> Best,
>>>
>>>  Felix
>>>
>>>
>>> 2012/6/26 Declan Groves <dgroves@computing.dcu.ie>
>>>
>>>> Felix,
>>>>
>>>> Thanks for your proposal for domain category, which I think outlines
>>>> the best approach for dealing with the complex domain category so good job!
>>>>
>>>> The data category agnostic approach makes more sense, and allows for
>>>> more flexibility, particularly for existing commercial MT service providers
>>>> who will already have their own list of pre-defined domain categories. I am
>>>> not too familiar with DCR so I don't feel qualified to comment on Arle's
>>>> suggestion.
>>>>
>>>> Using Dublin Core, however, is a good pointer to use due to its fairly
>>>> wide adoption (on this - is it worth providing a URL to the relevant Dublin
>>>> Core content?) - I know that many MT systems that do implement domain
>>>> metadata do so using high-level domains either taken directly from Dublin
>>>> Core or adapted from it (e.g. I think the LetsMT project uses Dublin Core as
>>>> a starting point for defining domain).  One thing to keep in mind is
>>>> that the proposal should be as clear and concise as possible. In terms of
>>>> providing pointers to what codes people can use, I think we are better off
>>>> limiting this as promoting interoperability is key and providing a
>>>> list of alternative implementation strategies may over-complicate things.
>>>>
>>>> It is good to emphasise the optional domainMapping attribute, and I
>>>> would perhaps add to the paragraph concerning the explanation of
>>>> domainMapping that although optional, it is recommended that details for
>>>> the attribute be provided. For our implementation, I expect to carry out
>>>> something similar to Thomas - create a mapping from the provided domain
>>>> metadata to domains that are available for our trained systems.
>>>>
>>>> typo: "In source content... " -> "In the source content..."
>>>>       "no agreed upon set of value sets" -> "no agreed upon value sets"
>>>>
>>>> Declan
>>>>
>>>>
>>>>
>>>> On 25 June 2012 15:43, Felix Sasaki <fsasaki@w3.org> wrote:
>>>>
>>>>> Hi Arle, Thomas, all,
>>>>>
>>>>>  thanks for your feedback, Thomas, I'll fix the typos you found.
>>>>>
>>>>>  2012/6/25 Arle Lommel <arle.lommel@dfki.de>
>>>>>
>>>>>>  Was this an area where the ISO data category registry might come
>>>>>> into play?
>>>>>>
>>>>>
>>>>>  No - this proposal is "data category agnostic". The idea is to
>>>>> provide a mechanism to map existing value lists (like the one Thomas
>>>>> mentioned).
>>>>>
>>>>>
>>>>>>  That is, could we declare an agreed upon selection of fairly broad
>>>>>> top-level domains to promote interoperability while still allowing for
>>>>>> specification by users?
>>>>>>
>>>>>
>>>>>
>>>>>  After our discussion in Dublin and quite a few mails about this, see
>>>>> e.g. the summary at
>>>>>
>>>>> http://lists.w3.org/Archives/Public/public-multilingualweb-lt/2012May/0165.html
>>>>> or David's proposal at
>>>>>
>>>>> http://lists.w3.org/Archives/Public/public-multilingualweb-lt/2012May/0079.html
>>>>>
>>>>>  I don't see an agreement for even top level domains.
>>>>>
>>>>>
>>>>>
>>>>>>
>>>>>>  Unfortunately there is a lot of complexity around this issue in
>>>>>> general that we will not resolve and that may indeed be fundamentally
>>>>>> unresolvable. But perhaps using the DCR as a place where domain ontologies
>>>>>> can be declared in an authoritative resource and pointed to we could at
>>>>>> least provide a way for someone to share what they mean.
>>>>>>
>>>>>
>>>>>
>>>>>  There are so many running systems using their own value lists for
>>>>> domain - I wouldn't expect that Lucy software or others would change their
>>>>> systems. The benefit they would get with the proposal in this thread is
>>>>> that connecting systems (e.g. MT + CMS) gets easier.
>>>>>
>>>>>  Of course one could point users to what codes they should use. The
>>>>> Dublin Core subject field I have put into the draft is such a pointer. In
>>>>> addition I would be happy to name DCR as another area to look into, like
>>>>> TAUS top level categories, Let's MT top level categories, etc. That is, of
>>>>> course we want people to be aware of DCR.
>>>>>
>>>>>  I also saw your question wrt DCR in the other thread, but I also
>>>>> don't recall an area where we would have a direct dependency. But as I said
>>>>> above, it would be good to inform readers of ITS 2.0 about where relying on
>>>>> DCR makes sense.
>>>>>
>>>>>  A related question: if I want to refer to DCR in an HTML "meta"
>>>>> element, how would the DCR "scheme" be identified? Here is an example from
>>>>> Dublin Core:
>>>>>
>>>>>   <meta name="DCTERMS.issued" scheme="DCTERMS.W3CDTF"
>>>>> content="2003-11-01" />
>>>>>
>>>>>
>>>>>  If there is an approach to do that with DCR, I think we should have
>>>>> an example about it in ITS 2.0. Maybe you can check with the DCR experts in
>>>>> Madrid?
>>>>>
>>>>>
>>>>>  Best,
>>>>>
>>>>>  Felix
>>>>>
>>>>>
>>>>>>
>>>>>>  Arle
>>>>>>
>>>>>> --
>>>>>> Arle Lommel
>>>>>> Berlin, Germany
>>>>>> Skype: arle_lommel
>>>>>> Phone (US): +1 707 709 8650
>>>>>>
>>>>>>  Sent from a mobile device. Please excuse any typos.
>>>>>>
>>>>>> On Jun 25, 2012, at 16:02, "Thomas Ruedesheim" <
>>>>>> thomas.ruedesheim@lucysoftware.com> wrote:
>>>>>>
>>>>>>     Hi Felix,
>>>>>>
>>>>>> I agree with your proposal. (There are just 2 typos in the examples:
>>>>>> "" in domainPointer attributes.)
>>>>>> Lucy's MT engine accepts a global SUBJECT_AREAS parameter holding a
>>>>>> list of domain names. Domains are organized in a hierarchy.
>>>>>> Here is a short excerpt (first 2 levels):
>>>>>>   General Vocabulary
>>>>>>     Common Social Voc.
>>>>>>       Art & Literature
>>>>>>       Ecology, Environment Protection
>>>>>>       Economy & Trade
>>>>>>       Law & Legal Science
>>>>>>       ...
>>>>>>     Common Technical Voc.
>>>>>>       Agriculture & Fishing
>>>>>>       Civil Engineering
>>>>>>       Data Processing
>>>>>>       ...
>>>>>> We will read the meta data and apply the mapping. Of course, the
>>>>>> mapping is specific for the used MT tool.
>>>>>>
>>>>>> Cheers,
>>>>>> Thomas
>>>>>>
>>>>>>
>>>>>>
>>>>>>  ------------------------------
>>>>>> *From:* Felix Sasaki [mailto:fsasaki@w3.org]
>>>>>> *Sent:* Montag, 25. Juni 2012 08:48
>>>>>> *To:* public-multilingualweb-lt@w3.org
>>>>>> *Subject:* [All] domain data category section proposal, please review
>>>>>>
>>>>>>   Hi all,
>>>>>>
>>>>>>  I have created a proposal for the domain data category, see
>>>>>> attachment. This would resolve ISSUE-11, with the input from ACTION-87
>>>>>> taken into account.
>>>>>>
>>>>>>  Declan, Thomas, I think this is esp. important for you - we need to
>>>>>> know whether an implementation as described would be feasible and useful
>>>>>> for you. Of course, others, feel welcome to contribute.
>>>>>>
>>>>>>  Please make comments in this thread - I will use them to provide
>>>>>> another version of the section.
>>>>>>
>>>>>>  Thanks,
>>>>>>
>>>>>>  Felix
>>>>>>
>>>>>>  --
>>>>>> Felix Sasaki
>>>>>> DFKI / W3C Fellow
>>>>>>
>>>>>>
>>>>>
>>>>>
>>>>>  --
>>>>> Felix Sasaki
>>>>> DFKI / W3C Fellow
>>>>>
>>>>>
>>>>
>>>>
>>>>  --
>>>> Dr. Declan Groves
>>>> Research Integration Officer
>>>> Centre for Next Generation Localisation (CNGL)
>>>> Dublin City University
>>>>
>>>> email: dgroves@computing.dcu.ie
>>>>  phone: +353 (0)1 700 6906
>>>>
>>>
>>>
>>>
>>>  --
>>> Felix Sasaki
>>> DFKI / W3C Fellow
>>>
>>>
>>
>>
>>  --
>> Felix Sasaki
>> DFKI / W3C Fellow
>>
>>
>>
>>
>
>
>  --
> Felix Sasaki
> DFKI / W3C Fellow
>
>
>


-- 
Felix Sasaki
DFKI / W3C Fellow
Received on Wednesday, 4 July 2012 10:21:38 UTC
