Re: [All] new issue: propose to drop genre, purpose and register data category proposals

From: Dr. David Filip <David.Filip@ul.ie>
Date: Wed, 9 May 2012 16:01:20 +0100
Message-ID: <CANw5LKmSR9+oMouv13FyZJUzyoTFskfKxfm4x_4QsH6iM3zsEw@mail.gmail.com>
To: Declan Groves <dgroves@computing.dcu.ie>
Cc: Felix Sasaki <fsasaki@w3.org>, Milan Karasek <MilanK@moraviaworldwide.com>, public-multilingualweb-lt@w3.org, Georg Rehm <georg.rehm@dfki.de>
Thanks Declan, I also think that at least domain should be primarily
approached as a fixed ontology (it remains to be seen which or which set of

What I suggested was that some, and *maybe all*, of the categories should
allow for user-defined extensions. Of course machine-to-machine
automation is hindered if private values are used, but at least the
consumers would know that private values can occur and would be prepared to
display them and possibly map them based on user preference.

But as Georg rightly points out, the main issue with all the related terms
(rather than categories) is that the community is using them freely and
interchangeably, although they are distinct concepts.


Dr. David Filip
University of Limerick, Ireland
telephone: +353-6120-2781
*cellphone: +353-86-0222-158*
facsimile: +353-6120-2734
mailto: david.filip@ul.ie

On Wed, May 9, 2012 at 1:17 PM, Declan Groves <dgroves@computing.dcu.ie> wrote:

> Felix,
> I would agree with David that it is something that warrants discussion at the
> meeting in Dublin.
> In terms of domain/genre, we do have a number of very closely related data
> categories:
>    - Domain
>    - Genre
>    - Purpose
>    - Register
> It is important to capture both domain and style of the text (which is
> determined by both the purpose and register data categories) for
> contextually-accurate translation. I feel that "genre" may be superfluous
> to our needs, but that we should retain purpose (reflects the end consumer
> of the content) and register (reflects the language style of the content).
> It may be an idea to rename these to 'target audience' and 'type',
> respectively, as David has suggested, if it makes the distinction clearer.
> I would suggest domain be mapped to an existing ontology, and therefore
> restricted (i.e. to NOT allow user defined values), but that for the other
> two we can leave these as user defined.
> Declan
> On 9 May 2012 12:43, Dr. David Filip <David.Filip@ul.ie> wrote:
>> Felix, I see where you are coming from and read your line of argument as
>> simple = more machine-to-machine interoperability.
>> My personal experience with large corpora such as TDA was that a single
>> plain category is not enough to facilitate the slicing and dicing needed to
>> prepare a consistent training corpus from data collected in the wild. MT
>> tuners often need more orthogonal categories.
>> In LetsMT!, they were addressing the slicing and dicing need by having 3
>> orthogonal data categories (from my recollection as WIP, not necessarily
>> accurate):
>>    - domain (ISO categories subset mapped onto TDA), worked quite nicely
>>    - target audience (general, expert, channel partner, internal)
>>    - type (social, web, UA, printed doc, marcom)
>> Milan should be able to provide more up-to-date detail, as he (Moravia)
>> is actively involved in the project.
>> And finally, domain has pretty much the same vagueness as the other three
>> you are proposing to drop, and the match with the categorization of existing
>> corpora and trained machines won't be great, so I would not expect a big gain
>> in machine-to-machine automation with domain only.
>> I suggest not dropping them, at least until the Dublin workshop. The interested
>> parties should be able to come up with a workable set of orthogonal
>> categories, possibly consolidated (but not fewer than 2 IMHO) and more
>> in line with what the industry is doing. We should also consider user-defined
>> values as an attractive option for all but domain, or maybe even
>> for domain.
>> That is my two cents :-)
>> Rgds
>> dF
>> Dr. David Filip
>> =======================
>> LRC | CNGL | LT-Web | CSIS
>> University of Limerick, Ireland
>> telephone: +353-6120-2781
>> *cellphone: +353-86-0222-158*
>> facsimile: +353-6120-2734
>> mailto: david.filip@ul.ie
>> On Wed, May 9, 2012 at 10:51 AM, Felix Sasaki <fsasaki@w3.org> wrote:
>>> Hi all,
>>> See ISSUE-11 : I propose to drop the genre, purpose and register data
>>> category proposals. Main reasons:
>>> - I don't see a way to come up with an agreeable and interoperable set
>>> of values.
>>> - There is no way to generate or check this kind of metadata
>>> automatically. This will lead down the same path as the "keywords" attribute in
>>> the HTML meta element.
>>> I propose to have one data category, probably focusing on "domain", that
>>> can be produced at least to some extent automatically (= Tadej) and that
>>> can be taken up by planned implementations (=Declan).
>>> Thoughts?
>>> Felix
>>> --
>>> Felix Sasaki
>>> DFKI / W3C Fellow
> --
> Dr. Declan Groves
> Research Integration Officer
> Centre for Next Generation Localisation (CNGL)
> Dublin City University
> email: dgroves@computing.dcu.ie
> phone: +353 (0)1 700 6906
Received on Wednesday, 9 May 2012 15:02:31 UTC