Re: [All] domain data category section proposal, please review

comments inline -

On 04/07/2012 05:45, Felix Sasaki wrote:
> Hi Dave,
>
> 2012/7/4 Dave Lewis <dave.lewis@cs.tcd.ie>
>
>     Hi Felix,
>     One question on the domainMapping example you give for the domain
>     data category. This assumes the workflow has a single canonical
>     set of IDs identifying 'auto', 'medicine', 'law', but this may not
>     always be the case, e.g. where SMT engines are trained on a mix of
>     parallel data with their own separate corpora domain naming schemes.
>
>
> Couldn't you accommodate that by having several domainRule elements?

I don't think so, since nothing in a domainRule indicates that the
simple target names belong to different name spaces.

>     So a simple naming scheme means that the workflow provider must
>     ensure consistency of that scheme and that the document editor
>     (often the client) has knowledge of that scheme.
>
>     So could the data category as is accommodate multiple naming
>     schemes (e.g. from the client and from third parties) within the
>     workflow by simply using a URL instead of a simple name?  e.g.
>
>     domainMapping="automotive auto, medical medicine, 'criminal law' http://www.taus.org/domain/law, 'property law' http://www.client.com/domain-names/law"
>
> This has to be answered by Thomas and Declan, I think: they (and one 
> external provider) agreed on the simple scheme. I'm fine with 
> introducing URIs, but we need implementations making use of them.
>

Yes, advice from Thomas and Declan would be appreciated. To make this
example a bit more concrete with domains used in MT training: I can
easily see cases where the mapping has to accommodate one domain scheme
that is specific to the client, one that the LSP uses across clients,
and one that belongs to an MT service provider and is shaped by the
range of training corpora they already have.

Reconciling these into a single name space for each workflow instance
would be onerous for all parties, and of little help to any party that
operates with multiple clients or providers. In settings where one large
client works all the time with a fixed set of providers, establishing a
common set of domain names would make sense.
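
To make the URL idea concrete, here is a minimal sketch in Python of how
a consumer might read such a mapping. It is only an illustration under
assumptions: the function names are made up, entries are simply split on
commas (so values containing commas are not handled), and treating
everything up to the last "/" of a URL as the name space is my own
convention, not something in the draft.

```python
import re
from urllib.parse import urlsplit

# A value is either 'single quoted' (may contain spaces) or a bare run of
# characters without whitespace or quotes.
_TOKEN = re.compile(r"'([^']*)'|([^\s']+)")


def parse_domain_mapping(value):
    """Split a domainMapping string into (source, target) pairs."""
    pairs = []
    for entry in value.split(","):
        tokens = [quoted or bare for quoted, bare in _TOKEN.findall(entry)]
        if len(tokens) != 2:
            raise ValueError(f"expected 'source target', got {entry!r}")
        pairs.append((tokens[0], tokens[1]))
    return pairs


def split_target(target):
    """Return (name_space, local_name); name_space is None for simple names."""
    parts = urlsplit(target)
    # Assumption: only http(s) URLs with a path count as name-spaced targets.
    if parts.scheme in ("http", "https") and parts.netloc and "/" in parts.path:
        name_space, _, local_name = target.rpartition("/")
        return name_space + "/", local_name
    return None, target


def resolve(value):
    """Map each source domain to its (name_space, local_name) target."""
    return {src: split_target(tgt) for src, tgt in parse_domain_mapping(value)}


EXAMPLE = (
    "automotive auto, medical medicine, "
    "'criminal law' http://www.taus.org/domain/law, "
    "'property law' http://www.client.com/domain-names/law"
)

for source, (name_space, local_name) in resolve(EXAMPLE).items():
    print(source, "->", name_space or "(workflow default)", local_name)
```

Both 'criminal law' and 'property law' end up with the local name "law",
but in different name spaces, which is the distinction a plain "law"
target cannot carry.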

By the way Felix, I like the way the mapping feature provides a path to
accommodating conventions as a direct meta declaration. That's a pattern
that we might consider in some other cases for document-level metadata.

cheers,
Dave

> Best,
>
> Felix
>
>
>     cheers,
>     Dave
>
>
>     On 29/06/2012 07:52, Felix Sasaki wrote:
>>     Hi all,
>>
>>     FYI, I wrote the domain section based on the initial proposal and
>>     this thread, please have a look at
>>     http://www.w3.org/International/multilingualweb/lt/drafts/its20/its20.html#domain
>>
>>     This closes ACTION-144. I also updated
>>
>>     http://www.w3.org/International/multilingualweb/lt/wiki/Implementation_Commitments#New_ITS_2.0_categories
>>     With a link to the section.
>>
>>     Best,
>>
>>     Felix
>>
>>     2012/6/27 Felix Sasaki <fsasaki@w3.org>
>>
>>         Declan, all, thanks a lot for your feedback. I think we are
>>         close to consensus about this, and I have given myself an
>>         ACTION-144 to put this into the draft by next week.
>>
>>         Best,
>>
>>         Felix
>>
>>
>>         2012/6/26 Declan Groves <dgroves@computing.dcu.ie>
>>
>>             Felix,
>>
>>             Thanks for your proposal for domain category, which I
>>             think outlines the best approach for dealing with the
>>             complex domain category, so good job!
>>
>>             The data category agnostic approach makes more sense, and
>>             allows for more flexibility, particularly for existing
>>             commercial MT service providers who will already have
>>             their own list of pre-defined domain categories. I am not
>>             too familiar with DCR so I don't feel qualified to comment
>>             on Arle's suggestion.
>>
>>             Using Dublin Core, however, is a good pointer to use due
>>             to its fairly wide adoption (on this - is it worth
>>             providing a URL to the relevant Dublin Core content?) - I
>>             know that many MT systems that do implement domain
>>             metadata do so using high-level domains either taken
>>             directly from Dublin Core or adapted from it (e.g. I
>>             think the LetsMT project uses Dublin Core as a starting
>>             point for defining domain). One thing to keep in mind is
>>             that the proposal should be as clear and concise as
>>             possible. In terms of providing pointers to what codes
>>             people can use, I think we are better off limiting this
>>             as promoting interoperability is key and providing a list
>>             of alternative implementation strategies may
>>             over-complicate things.
>>
>>             It is good to emphasise the optional domainMapping
>>             attribute, and I would perhaps add to the paragraph
>>             concerning the explanation of domainMapping that although
>>             optional, it is recommended that details for the
>>             attribute be provided. For our implementation, I expect
>>             to carry out something similar to Thomas - create a
>>             mapping from the provided domain metadata to domains that
>>             are available for our trained systems.
>>
>>             typo: "In source content... " -> "In the source content..."
>>                   "no agreed upon set of value sets" -> "no agreed
>>             upon value sets"
>>
>>             Declan
>>
>>
>>
>>             On 25 June 2012 15:43, Felix Sasaki <fsasaki@w3.org> wrote:
>>
>>                 Hi Arle, Thomas, all,
>>
>>                 thanks for your feedback, Thomas, I'll fix the typos
>>                 you found.
>>
>>                 2012/6/25 Arle Lommel <arle.lommel@dfki.de>
>>
>>                     Was this an area where the ISO data category
>>                     registry might come into play?
>>
>>
>>                 No - this proposal is "data category agnostic". The
>>                 idea is to provide a mechanism to map existing value
>>                 lists (like the one Thomas mentioned).
>>
>>                     That is, could we declare an agreed upon
>>                     selection of fairly broad top-level domains to
>>                     promote interoperability while still allowing for
>>                     specification by users?
>>
>>
>>
>>                 After our discussion in Dublin and quite a few mails
>>                 about this, see e.g. the summary at
>>                 http://lists.w3.org/Archives/Public/public-multilingualweb-lt/2012May/0165.html
>>                 or David's proposal at
>>                 http://lists.w3.org/Archives/Public/public-multilingualweb-lt/2012May/0079.html
>>
>>                 I don't see an agreement for even top level domains.
>>
>>
>>                     Unfortunately there is a lot of complexity around
>>                     this issue in general that we will not resolve
>>                     and that may indeed be fundamentally
>>                     unresolvable. But perhaps using the DCR as a
>>                     place where domain ontologies can be declared in
>>                     an authoritative resource and pointed to we could
>>                     at least provide a way for someone to share what
>>                     they mean.
>>
>>
>>
>>                 There are so many running systems using their own
>>                 value lists for domain - I wouldn't expect that Lucy
>>                 software or others would change their systems. The
>>                 benefit they would get with the proposal in this
>>                 thread is that connecting systems (e.g. MT + CMS)
>>                 gets easier.
>>
>>                 Of course one could point users to what codes they
>>                 should use. The Dublin Core subject field I have put
>>                 into the draft is such a pointer. In addition I would
>>                 be happy to name DCR as another area to look into,
>>                 like TAUS top level categories, Let's MT top level
>>                 categories, etc. That is, of course we want people to
>>                 be aware of DCR.
>>
>>                 I also saw your question wrt DCR in the other thread,
>>                 but I also don't recall an area where we would have a
>>                 direct dependency. But as I said above, it would be
>>                 good to inform readers of ITS 2.0 about where relying
>>                 on DCR makes sense.
>>
>>                 A related question: if I want to refer to DCR in an
>>                 HTML "meta" element, how would the DCR "scheme" be
>>                 identified? Here is an example from Dublin Core:
>>
>>                 <meta name="DCTERMS.issued" scheme="DCTERMS.W3CDTF"
>>                 content="2003-11-01" />
>>
>>
>>                 If there is an approach to do that with DCR, I think
>>                 we should have an example about it in ITS 2.0. Maybe
>>                 you can check with the DCR experts in Madrid?
>>
>>
>>                 Best,
>>
>>
>>                 Felix
>>
>>
>>                     Arle
>>
>>                     -- 
>>                     Arle Lommel
>>                     Berlin, Germany
>>                     Skype: arle_lommel
>>                     Phone (US): +1 707 709 8650
>>
>>                     Sent from a mobile device. Please excuse any typos.
>>
>>                     On Jun 25, 2012, at 16:02, "Thomas Ruedesheim"
>>                     <thomas.ruedesheim@lucysoftware.com> wrote:
>>
>>>                     Hi Felix,
>>>                     I agree with your proposal. (There are just 2
>>>                     typos in the examples: "" in domainPointer
>>>                     attributes.)
>>>                     Lucy's MT engine accepts a global SUBJECT_AREAS
>>>                     parameter holding a list of domain names.
>>>                     Domains are organized in a hierarchy.
>>>                     Here is a short excerpt (first 2 levels):
>>>                       General Vocabulary
>>>                         Common Social Voc.
>>>                           Art & Literature
>>>                           Ecology, Environment Protection
>>>                           Economy & Trade
>>>                           Law & Legal Science
>>>                           ...
>>>                         Common Technical Voc.
>>>                           Agriculture & Fishing
>>>                           Civil Engineering
>>>                           Data Processing
>>>                           ...
>>>                     We will read the meta data and apply the
>>>                     mapping. Of course, the mapping is specific for
>>>                     the used MT tool.
>>>                     Cheers,
>>>                     Thomas
>>>                     ------------------------------------------------------------------------
>>>                     *From:* Felix Sasaki [mailto:fsasaki@w3.org]
>>>                     *Sent:* Montag, 25. Juni 2012 08:48
>>>                     *To:* public-multilingualweb-lt@w3.org
>>>                     *Subject:* [All] domain data category section
>>>                     proposal, please review
>>>
>>>                     Hi all,
>>>
>>>                     I have created a proposal for the domain data
>>>                     category, see attachment. This would resolve
>>>                     ISSUE-11, with the input from ACTION-87 taken
>>>                     into account.
>>>
>>>                     Declan, Thomas, I think this is esp. important
>>>                     for you - we need to know whether an
>>>                     implementation as described would be feasible
>>>                     and useful for you. Of course, others, feel
>>>                     welcome to contribute.
>>>
>>>                     Please make comments in this thread - I will use
>>>                     them to provide another version of the section.
>>>
>>>                     Thanks,
>>>
>>>                     Felix
>>>
>>>                     -- 
>>>                     Felix Sasaki
>>>                     DFKI / W3C Fellow
>>>
>>
>>
>>
>>                 -- 
>>                 Felix Sasaki
>>                 DFKI / W3C Fellow
>>
>>
>>
>>
>>             -- 
>>             Dr. Declan Groves
>>             Research Integration Officer
>>             Centre for Next Generation Localisation (CNGL)
>>             Dublin City University
>>
>>             email: dgroves@computing.dcu.ie
>>             phone: +353 (0)1 700 6906
>>
>>
>>
>>
>>         -- 
>>         Felix Sasaki
>>         DFKI / W3C Fellow
>>
>>
>>
>>
>>     -- 
>>     Felix Sasaki
>>     DFKI / W3C Fellow
>>
>
>
>
>
>
> -- 
> Felix Sasaki
> DFKI / W3C Fellow
>

Received on Wednesday, 4 July 2012 09:58:14 UTC