
Re: [All] new issue: propose to drop genre, purpose and register data category proposals

From: Dr. David Filip <David.Filip@ul.ie>
Date: Wed, 9 May 2012 12:43:36 +0100
Message-ID: <CANw5LK=kTSYtZm65An=Fk7NAUH03tWUoe2aNPN6JTjkE4ywpXA@mail.gmail.com>
To: Felix Sasaki <fsasaki@w3.org>, Milan Karasek <MilanK@moraviaworldwide.com>
Cc: public-multilingualweb-lt@w3.org
Felix, I see where you are coming from, and I read your line of argument as:
simple = more machine-to-machine interoperability.

My personal experience with large corpora such as TDA was that a single
plain category is not enough to support the slicing and dicing needed to
prepare a consistent training corpus from data collected in the wild. MT
tuners often need more than one orthogonal category.

In LetsMT!, they were addressing the slicing and dicing need by having 3
orthogonal data categories (from my recollection of work in progress, so
not necessarily accurate):
- domain (ISO categories subset mapped onto TDA), worked quite nicely
- target audience (general, expert, channel partner, internal)
- type (social, web, UA, printed doc, marcom)
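
To illustrate why a single category falls short, here is a minimal Python
sketch of slicing a corpus on three orthogonal categories. Everything in it
is an assumption made for illustration: the Segment class, the select
helper, the field names and the "IT" domain label are hypothetical and are
not taken from LetsMT! or TDA. Only the audience and type values come from
the list above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Segment:
    """One corpus segment with three independent metadata categories."""
    text: str
    domain: str        # hypothetical label, e.g. "IT"
    audience: str      # general, expert, channel partner, internal
    content_type: str  # social, web, UA, printed doc, marcom


def select(corpus, **criteria):
    """Keep the segments whose metadata matches every given criterion."""
    return [
        seg for seg in corpus
        if all(getattr(seg, name) == value for name, value in criteria.items())
    ]


# Made-up sample data, for illustration only.
corpus = [
    Segment("Press the power button.", "IT", "general", "UA"),
    Segment("Configure the RAID controller.", "IT", "expert", "UA"),
    Segment("Our best laptop yet!", "IT", "general", "marcom"),
]

# Domain alone cannot tell these three segments apart.
by_domain = select(corpus, domain="IT")

# Combining the orthogonal categories isolates a consistent slice.
narrow = select(corpus, domain="IT", audience="general", content_type="UA")
```

With domain only, all three sample segments land in the same slice; adding
audience and type narrows it to the single general-audience UA segment.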

Milan should be able to provide more up-to-date detail, as he (Moravia) is
actively involved in the project.

And finally, domain has pretty much the same vagueness as the other three
you are proposing to drop, and the match with the categorization of
existing corpora and trained machines won't be great, so I would not expect
a big gain in machine-to-machine automation with domain only.

I suggest not dropping them, at least until the Dublin workshop. The
interested parties should be able to come up with a workable set of
orthogonal categories, possibly consolidated (but not fewer than 2 IMHO)
and more in line with what the industry is doing. We should also consider
user-defined values as an attractive option for all but domain, or maybe
even for domain.

That is my two cents :-)

Rgds
dF

Dr. David Filip
=======================
LRC | CNGL | LT-Web | CSIS
University of Limerick, Ireland
telephone: +353-6120-2781
cellphone: +353-86-0222-158
facsimile: +353-6120-2734
mailto: david.filip@ul.ie



On Wed, May 9, 2012 at 10:51 AM, Felix Sasaki <fsasaki@w3.org> wrote:

> Hi all,
>
> See ISSUE-11 : I propose to drop the genre, purpose and register data
> category proposals. Main reasons:
> - I don't see a way to come up with an agreeable and interoperable set of
> values.
> - There is no way to generate or check this kind of metadata
> automatically. This will lead down the same path as the "keywords" attribute in
> the HTML meta element.
>
> I propose to have one data category, probably focusing on "domain", that
> can be produced at least to some extent automatically (= Tadej) and that
> can be taken up by planned implementations (=Declan).
>
> Thoughts?
>
> Felix
>
> --
> Felix Sasaki
> DFKI / W3C Fellow
>
Received on Wednesday, 9 May 2012 11:44:49 UTC
