
Re: [All] new issue: propose to drop genre, purpose and register data category proposals

From: Declan Groves <dgroves@computing.dcu.ie>
Date: Wed, 9 May 2012 13:17:22 +0100
Message-ID: <CAOi_1PZf52hXjHDv7TesEbzvHzLM=bQdSZzdjz+0RPoJJGeevg@mail.gmail.com>
To: Felix Sasaki <fsasaki@w3.org>
Cc: "Dr. David Filip" <David.Filip@ul.ie>, Milan Karasek <MilanK@moraviaworldwide.com>, public-multilingualweb-lt@w3.org
Felix,

I would agree with David that this is something that warrants discussion at
the meeting in Dublin.

In terms of domain/genre, we do have a number of very closely related data
categories:

   - Domain
   - Genre
   - Purpose
   - Register

It is important to capture both domain and style of the text (which is
determined by both the purpose and register data categories) for
contextually-accurate translation. I feel that "genre" may be superfluous
to our needs, but that we should retain purpose (reflects the end consumer
of the content) and register (reflects the language style of the content).
It may be an idea to rename these to 'target audience' and 'type',
respectively, as David has suggested, if it makes the distinction clearer.

I would suggest that domain be mapped to an existing ontology and therefore
restricted (i.e. user-defined values NOT allowed), but that the other two
be left as user-defined.
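
As a rough illustration of that split (a sketch only: the domain list and
the field names below are made-up placeholders, not values from any agreed
ontology or from the draft), a consumer could check domain against a
closed list while accepting any non-empty string for the other two:

```python
# Hypothetical controlled vocabulary; a real one would come from the
# ontology the group agrees on, not from this placeholder list.
ALLOWED_DOMAINS = frozenset({"automotive", "finance", "legal", "medical"})


def validate_metadata(meta):
    """Return a list of problems; an empty list means the metadata passes."""
    problems = []

    # Domain is restricted: it must be one of the known values.
    domain = meta.get("domain")
    if domain not in ALLOWED_DOMAINS:
        problems.append(f"domain {domain!r} is not in the controlled vocabulary")

    # Purpose and register are user defined: optional, but if present
    # they must be non-empty strings.
    for field in ("purpose", "register"):
        value = meta.get(field)
        if value is not None and (not isinstance(value, str) or not value.strip()):
            problems.append(f"{field} must be a non-empty string when present")

    return problems
```

With this, {"domain": "legal", "purpose": "expert"} passes, while a
domain outside the list is reported as a problem.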


Declan





On 9 May 2012 12:43, Dr. David Filip <David.Filip@ul.ie> wrote:

> Felix, I see where you are coming from and read your line of argument as
> simple = more machine-to-machine interoperability.
>
> My personal experience with large corpora such as TDA was that a single
> plain category is not enough to facilitate the slicing and dicing needed to
> prepare a consistent training corpus from data collected in the wild. MT
> tuners often need more orthogonal categories.
>
> In LetsMT!, they were addressing the slicing and dicing need by having 3
> orthogonal data categories (from my recollection as WIP, not necessarily
> accurate):
>
>    - domain (ISO categories subset mapped onto TDA), worked quite nicely
>    - target audience (general, expert, channel partner, internal)
>    - type (social, web, UA, printed doc, marcom)
>
> Milan should be able to provide more up-to-date detail, as he (Moravia) is
> actively involved in the project.
>
> And finally, domain has pretty much the same vagueness as the other three
> you are proposing to drop, and the match with the categorization of existing
> corpora and trained machines won't be great, so I would not expect a big
> gain in machine-to-machine automation with domain only.
>
> I suggest not dropping them, at least until the Dublin workshop. The
> interested parties should be able to come up with a workable set of
> orthogonal categories, possibly consolidated (but not fewer than 2 IMHO)
> and more in line with what the industry is doing. We should also consider
> user-defined values as an attractive option for all but domain, or maybe
> even for domain.
>
> That is my two cents :-)
>
> Rgds
> dF
>
> Dr. David Filip
> =======================
> LRC | CNGL | LT-Web | CSIS
> University of Limerick, Ireland
> telephone: +353-6120-2781
> *cellphone: +353-86-0222-158*
> facsimile: +353-6120-2734
> mailto: david.filip@ul.ie
>
>
>
> On Wed, May 9, 2012 at 10:51 AM, Felix Sasaki <fsasaki@w3.org> wrote:
>
>> Hi all,
>>
>> See ISSUE-11 : I propose to drop the genre, purpose and register data
>> category proposals. Main reasons:
>> - I don't see a way to come up with an agreeable and interoperable set of
>> values.
>> - There is no way to generate or check this kind of metadata
>> automatically. This will lead down the same path as the "keywords"
>> attribute in the HTML meta element.
>>
>> I propose to have one data category, probably focusing on "domain", that
>> can be produced at least to some extent automatically (= Tadej) and that
>> can be taken up by planned implementations (= Declan).
>>
>> Thoughts?
>>
>> Felix
>>
>> --
>> Felix Sasaki
>> DFKI / W3C Fellow
>>
>
>


-- 
Dr. Declan Groves
Research Integration Officer
Centre for Next Generation Localisation (CNGL)
Dublin City University

email: dgroves@computing.dcu.ie
phone: +353 (0)1 700 6906
Received on Wednesday, 9 May 2012 12:18:00 UTC
