- From: Felix Sasaki <fsasaki@w3.org>
- Date: Wed, 4 Jul 2012 12:21:02 +0200
- To: Dave Lewis <dave.lewis@cs.tcd.ie>
- Cc: public-multilingualweb-lt@w3.org
- Message-ID: <CAL58czq=jKFUbK5h3+YwCs6sX=3KF4bruQStfPniYU4WvyAzWQ@mail.gmail.com>
2012/7/4 Dave Lewis <dave.lewis@cs.tcd.ie>

> comments inline -
>
> On 04/07/2012 05:45, Felix Sasaki wrote:
>
> Hi Dave,
>
> 2012/7/4 Dave Lewis <dave.lewis@cs.tcd.ie>
>
>> Hi Felix,
>> One question on the domainMapping example you give for the domain data
>> category. This assumes the workflow has a single canonical set of IDs
>> identifying 'auto', 'medicine', 'law', but this may not always be the case,
>> e.g. where SMT engines are trained on a mix of parallel data with their own
>> separate corpora domain naming schemes.
>>
>
> Couldn't you accommodate that by having several domainRule elements?
>
>
>
> I don't think so since there's nothing in the domainRule that indicates
> that the simple target names are in different name spaces.
>

Well, the ITS general mechanisms allow you to have different rules for the
same kind of content, exactly for that purpose: so that there can be
different metadata items for the same purpose. So you can have

<its:translateRule selector="//code" translate="no"/>
<its:translateRule selector="//code" translate="yes"/>

and the second rule comes into effect, since it's the last rule (like CSS
stylesheets). So you can describe a best practice, rather than changing the
definition of "domain", for using different, vendor-specific sets of
mappings.

>
>
>> So a simple naming scheme means that the workflow provider must ensure
>> consistency of that scheme and that the document editor (often the client)
>> has knowledge of that scheme.
>>
>> So could the data category as is accommodate multiple naming schemes
>> (e.g. from the client and from third parties) within the workflow by simply
>> using a URL instead of a simple name? e.g.
>>
>> domainMapping="automotive auto, medical medicine, 'criminal law' http://www.taus.org/domain/law, 'property law' http://www.client.com/domain-names/law"
>>
> This has to be answered by Thomas and Declan, I think: they (and one
> external provider) agreed on the simple scheme. I'm fine with introducing
> URIs, but we need implementations making use of them.
>
>
> Yes, Thomas, Declan advice appreciated.
>

I would go further here: without a commitment from them to implement the URI
mechanism, we shouldn't go that route. Declan mentioned in his contribution
http://lists.w3.org/Archives/Public/public-multilingualweb-lt/2012Jun/0152.html
that Dublin Core plays an important role here - but you will only need a URI
to refer to Dublin Core in general, not to the keywords themselves.

> To make this example a bit concrete, with domains used in MT training, I
> can easily see cases where the mapping has to accommodate one domain scheme
> that is specific to the client, one that is used by the LSP across clients
> and one relevant to an MT service provider that could be influenced by the
> range of existing training corpora they have.
>
> Reconciling this into a single name space for each workflow instance would
> be onerous for all parties and wouldn't be that helpful for any of the
> parties if they operate with multiple clients/providers. In settings where
> you have one large client working all the time with a fixed set of
> providers, then establishing a common set of domain names would make sense.
>

I see your issue of conflicting metadata for the same piece of content;
however, I think the general rule mechanism is also a solution for that, as
stated above.

Another issue arises when using domain pointers to web content, see e.g.
http://www.w3.org/International/multilingualweb/lt/drafts/its20/examples/xml/EX-domain-2.xml
where the metadata is a list of keywords. Requiring a URI will create a
conflict when people want to use this existing data. And again, from Declan's
and also Thomas'
http://lists.w3.org/Archives/Public/public-multilingualweb-lt/2012Jun/0141.html
contributions, it seems that existing MT systems often use lists of keywords.
You would make it hard for these systems to use the data category.
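
To make the simple keyword scheme and the "last rule wins" behaviour discussed above a bit more concrete, here is a minimal Python sketch of how a consumer (e.g. an MT connector) might read a domainMapping value and apply it to a keyword list. It is only an illustration, not a definition of the data category: the function names parse_domain_mapping and map_domains are hypothetical, and the comma-separated pair syntax, the quoting of values with spaces, the case-insensitive matching and the pass-through of unmapped keywords are all assumptions made for this sketch.

```python
import shlex


def parse_domain_mapping(mapping):
    """Parse "source target, source target" pairs into a dict.

    Assumptions for this sketch: pairs are comma-separated, a value that
    contains spaces is wrapped in quotes, values contain no commas, and
    source values are matched case-insensitively.
    """
    result = {}
    for entry in mapping.split(","):
        parts = shlex.split(entry)  # honours '...' and "..." quoting
        if not parts:
            continue  # tolerate an empty entry such as a trailing comma
        if len(parts) != 2:
            raise ValueError("expected 'source target', got %r" % entry)
        source, target = parts
        result[source.lower()] = target
    return result


def map_domains(keywords, matching_rules):
    """Map content keywords using the last of the rules matching a node.

    `matching_rules` holds the domainMapping strings of all rules that
    select the same node, in document order; only the last one is used.
    Keywords without a mapping are passed through unchanged.
    """
    mapping = parse_domain_mapping(matching_rules[-1]) if matching_rules else {}
    cleaned = [k.strip() for k in keywords]
    return [mapping.get(k.lower(), k) for k in cleaned]


if __name__ == "__main__":
    rules = [
        "automotive car, medical health",
        "automotive auto, medical medicine, 'criminal law' law",
    ]
    print(map_domains(["Automotive", "criminal law", "sports"], rules))
```

The target side is treated as an opaque string, so the same code would accept a URI such as http://www.taus.org/domain/law as a target without any change; whether consumers would actually do anything with such a URI is exactly the implementation question raised above.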
>
> By the way Felix, I like the way the mapping feature provides a path to
> accommodating conventions as direct meta declaration. That's a pattern that
> we might consider in some other cases for document level meta-data.
>

Thanks, and I very much agree. Arle recently told me that there was a
discussion at the ISO meeting in Madrid about whether MLW-LT will define or
refer to data categories, as provided by DCR. I would go the same route as
for domain: in these areas there is already a lot of existing metadata. ITS
2.0 can serve "as a glue" to make it easier to use the metadata in various
systems.

Best,

Felix

>
> cheers,
> Dave
>
>
>
> Best,
>
> Felix
>
>
>>
>> cheers,
>> Dave
>>
>>
>> On 29/06/2012 07:52, Felix Sasaki wrote:
>>
>> Hi all,
>>
>> FYI, I wrote the domain section based on the initial proposal and this
>> thread, please have a look at
>> http://www.w3.org/International/multilingualweb/lt/drafts/its20/its20.html#domain
>>
>> This closes ACTION-144. I also updated
>>
>>
>> http://www.w3.org/International/multilingualweb/lt/wiki/Implementation_Commitments#New_ITS_2.0_categories
>> With a link to the section.
>>
>> Best,
>>
>> Felix
>>
>> 2012/6/27 Felix Sasaki <fsasaki@w3.org>
>>
>>> Declan, all, thanks a lot for your feedback. I think we are close to
>>> consensus about this, and I have given myself an ACTION-144 to put this
>>> into the draft by next week.
>>>
>>> Best,
>>>
>>> Felix
>>>
>>>
>>> 2012/6/26 Declan Groves <dgroves@computing.dcu.ie>
>>>
>>>> Felix,
>>>>
>>>> Thanks for your proposal for domain category, which I think outlines
>>>> the best approach for dealing with the complex domain category so good job!
>>>>
>>>> The data category agnostic approach makes more sense, and allows for
>>>> more flexibility, particularly for existing commercial MT service providers
>>>> who will already have their own list of pre-defined domain categories. I am
>>>> not too familiar with DCR so I don't feel qualified to comment on Arle's
>>>> suggestion.
>>>>
>>>> Using Dublin Core, however, is a good pointer to use due to its fairly
>>>> wide adoption (on this - is it worth providing a URL to the relevant Dublin
>>>> Core content?) - I know that many MT systems that do implement domain
>>>> metadata do so using high-level domains either taken directly from Dublin
>>>> Core or adapted from it (e.g. I think the LetsMT project uses Dublin Core as
>>>> a starting point for defining domain). One thing to keep in mind is
>>>> that the proposal should be as clear and concise as possible. In terms of
>>>> providing pointers to what codes people can use, I think we are better off
>>>> limiting this as promoting interoperability is key and providing a
>>>> list of alternative implementation strategies may over-complicate things.
>>>>
>>>> It is good to emphasise the optional domainMapping attribute, and I
>>>> would perhaps add to the paragraph concerning the explanation of
>>>> domainMapping that although optional, it is recommended that details for
>>>> the attribute be provided. For our implementation, I expect to carry out
>>>> something similar to Thomas - create a mapping from the provided domain
>>>> metadata to domains that are available for our trained systems.
>>>>
>>>> typo: "In source content... " -> "In the source content..."
>>>> "no agreed upon set of value sets" -> "no agreed upon value sets"
>>>>
>>>> Declan
>>>>
>>>>
>>>>
>>>> On 25 June 2012 15:43, Felix Sasaki <fsasaki@w3.org> wrote:
>>>>
>>>>> Hi Arle, Thomas, all,
>>>>>
>>>>> thanks for your feedback, Thomas, I'll fix the typos you found.
>>>>>
>>>>> 2012/6/25 Arle Lommel <arle.lommel@dfki.de>
>>>>>
>>>>>> Was this an area where the ISO data category registry might come
>>>>>> into play?
>>>>>>
>>>>>
>>>>> No - this proposal is "data category agnostic". The idea is to
>>>>> provide a mechanism to map existing value lists (like the one Thomas
>>>>> mentioned).
>>>>>
>>>>>
>>>>>> That is, could we declare an agreed upon selection of fairly broad
>>>>>> top-level domains to promote interoperability while still allowing for
>>>>>> specification by users?
>>>>>>
>>>>>
>>>>>
>>>>> After our discussion in Dublin and quite a few mails about this, see
>>>>> e.g. the summary at
>>>>>
>>>>> http://lists.w3.org/Archives/Public/public-multilingualweb-lt/2012May/0165.html
>>>>> or David's proposal at
>>>>>
>>>>> http://lists.w3.org/Archives/Public/public-multilingualweb-lt/2012May/0079.html
>>>>>
>>>>> I don't see an agreement for even top level domains.
>>>>>
>>>>>
>>>>>
>>>>>>
>>>>>> Unfortunately there is a lot of complexity around this issue in
>>>>>> general that we will not resolve and that may indeed be fundamentally
>>>>>> unresolvable. But perhaps using the DCR as a place where domain ontologies
>>>>>> can be declared in an authoritative resource and pointed to we could at
>>>>>> least provide a way for someone to share what they mean.
>>>>>>
>>>>>
>>>>>
>>>>> There are so many running systems using their own value lists for
>>>>> domain - I wouldn't expect that Lucy software or others would change their
>>>>> systems. The benefit they would get with the proposal in this thread is
>>>>> that connecting systems (e.g. MT + CMS) gets easier.
>>>>>
>>>>> Of course one could point users to what codes they should use. The
>>>>> Dublin Core subject field I have put into the draft is such a pointer. In
>>>>> addition I would be happy to name DCR as another area to look into, like
>>>>> TAUS top level categories, Let's MT top level categories, etc. That is, of
>>>>> course we want people to be aware of DCR.
>>>>>
>>>>> I also saw your question wrt DCR in the other thread, but I also
>>>>> don't recall an area where we would have a direct dependency. But as I said
>>>>> above, it would be good to inform readers of ITS 2.0 about where relying on
>>>>> DCR makes sense.
>>>>>
>>>>> A related question: if I want to refer to DCR in an HTML "meta"
>>>>> element, how would the DCR "scheme" be identified? Here is an example from
>>>>> Dublin Core:
>>>>>
>>>>> <meta name="DCTERMS.issued" scheme="DCTERMS.W3CDTF"
>>>>> content="2003-11-01" />
>>>>>
>>>>>
>>>>> If there is an approach to do that with DCR, I think we should have
>>>>> an example about it in ITS 2.0. Maybe you can check with the DCR experts in
>>>>> Madrid?
>>>>>
>>>>>
>>>>> Best,
>>>>>
>>>>> Felix
>>>>>
>>>>>
>>>>>>
>>>>>> Arle
>>>>>>
>>>>>> --
>>>>>> Arle Lommel
>>>>>> Berlin, Germany
>>>>>> Skype: arle_lommel
>>>>>> Phone (US): +1 707 709 8650
>>>>>>
>>>>>> Sent from a mobile device. Please excuse any typos.
>>>>>>
>>>>>> On Jun 25, 2012, at 16:02, "Thomas Ruedesheim" <
>>>>>> thomas.ruedesheim@lucysoftware.com> wrote:
>>>>>>
>>>>>> Hi Felix,
>>>>>>
>>>>>> I agree with your proposal. (There are just 2 typos in the examples:
>>>>>> "" in domainPointer attributes.)
>>>>>> Lucy's MT engine accepts a global SUBJECT_AREAS parameter holding a
>>>>>> list of domain names. Domains are organized in a hierarchy.
>>>>>> Here is a short excerpt (first 2 levels):
>>>>>> General Vocabulary
>>>>>> Common Social Voc.
>>>>>>     Art & Literature
>>>>>>     Ecology, Environment Protection
>>>>>>     Economy & Trade
>>>>>>     Law & Legal Science
>>>>>>     ...
>>>>>> Common Technical Voc.
>>>>>>     Agriculture & Fishing
>>>>>>     Civil Engineering
>>>>>>     Data Processing
>>>>>>     ...
>>>>>> We will read the meta data and apply the mapping. Of course, the
>>>>>> mapping is specific for the used MT tool.
>>>>>>
>>>>>> Cheers,
>>>>>> Thomas
>>>>>>
>>>>>>
>>>>>>
>>>>>> ------------------------------
>>>>>> *From:* Felix Sasaki [mailto:fsasaki@w3.org]
>>>>>> *Sent:* Monday, 25 June 2012 08:48
>>>>>> *To:* public-multilingualweb-lt@w3.org
>>>>>> *Subject:* [All] domain data category section proposal, please review
>>>>>>
>>>>>> Hi all,
>>>>>>
>>>>>> I have created a proposal for the domain data category, see
>>>>>> attachment. This would resolve ISSUE-11, with the input from ACTION-87
>>>>>> taken into account.
>>>>>>
>>>>>> Declan, Thomas, I think this is esp. important for you - we need to
>>>>>> know whether an implementation as described would be feasible and useful
>>>>>> for you. Of course, others, feel welcome to contribute.
>>>>>>
>>>>>> Please make comments in this thread - I will use them to provide
>>>>>> another version of the section.
>>>>>>
>>>>>> Thanks,
>>>>>>
>>>>>> Felix
>>>>>>
>>>>>> --
>>>>>> Felix Sasaki
>>>>>> DFKI / W3C Fellow
>>>>>>
>>>>>>
>>>>>
>>>>>
>>>>> --
>>>>> Felix Sasaki
>>>>> DFKI / W3C Fellow
>>>>>
>>>>>
>>>>
>>>>
>>>> --
>>>> Dr. Declan Groves
>>>> Research Integration Officer
>>>> Centre for Next Generation Localisation (CNGL)
>>>> Dublin City University
>>>>
>>>> email: dgroves@computing.dcu.ie
>>>> phone: +353 (0)1 700 6906
>>>>
>>>
>>>
>>>
>>> --
>>> Felix Sasaki
>>> DFKI / W3C Fellow
>>>
>>>
>>
>>
>> --
>> Felix Sasaki
>> DFKI / W3C Fellow
>>
>>
>>
>>
>
>
> --
> Felix Sasaki
> DFKI / W3C Fellow
>
>
>

--
Felix Sasaki
DFKI / W3C Fellow
Received on Wednesday, 4 July 2012 10:21:38 UTC