RE: [All] new issue: propose to drop genre, purpose and register data category proposals

Hi all,

Looking at the current Let'sMT! project web site, only two categories are used ("Subject Domain" and "Text Type"). Their values are predefined, and the full lists are at the bottom of this mail.

As far as I can see, many of the Let'sMT! categories are replicated from TDA (TAUS Data Association). There, the data are divided into "Industry" (e.g. Computer Software, Computer Hardware, Telecommunications, Legal Services) and "Content Type" (e.g. Financial Documentation, Instructions for Use, Software Strings and Documentation).

Full lists are below.

Regards,
Milan


Let'sMT! Categories:

[Subject Domain]
Law
Finance
Business
Information technology and data processing
Electronics
Industrial manufacturing
Biotechnology and health
Environment
Energy
Transport
Communications systems
Tourism
Education
National and international organizations and affairs

[Text Type]
Software Strings and documentation (TDA)
Financial Documentation
Standards, Statutes and Regulations
Policies, Process and Procedures
Patents
Instructions for Use (TDA)
News, Announcements, Reports and Research
Sales and Marketing Material (TDA)
Support Content (TDA)
---------------------------------------
TDA Categories:

[Industry]
Automotive Manufacturing
Chemicals
Computer Hardware
Computer Software
Consumer Electronics
Energy, Water and Utilities
Financials
Healthcare
Industrial Electronics
Industrial Manufacturing
Legal Services
Leisure, Tourism, and Arts
Medical Equipment and Supplies
Pharmaceuticals and Biotechnology
Professional and Business Services
Stores and Retail Distribution
Telecommunications
Undefined Sector

[Content Type]
Financial Documentation
Instructions for Use
News Announcements, Reports and Research
Patents
Policies, Process and Procedures
Sales and Marketing Material
Software Strings and Documentation
Standards, Statutes and Regulations
Support Content
Undefined Content Type



From: Dr. David Filip [mailto:David.Filip@ul.ie]
Sent: Wednesday, May 09, 2012 1:44 PM
To: Felix Sasaki; Milan Karasek
Cc: public-multilingualweb-lt@w3.org
Subject: Re: [All] new issue: propose to drop genre, purpose and register data category proposals

Felix, I see where you are coming from, and I read your line of argument as: simpler = more machine-to-machine interoperability.

My personal experience with large corpora such as TDA is that a single plain category is not enough to support the slicing and dicing needed to prepare a consistent training corpus from data collected in the wild. MT tuners often need more orthogonal categories.

In Let'sMT!, they addressed the slicing-and-dicing need with 3 orthogonal data categories (from my recollection of work in progress, so not necessarily accurate):
- domain (a subset of ISO categories mapped onto TDA), which worked quite nicely
- target audience (general, expert, channel partner, internal)
- type (social, web, UA, printed doc, marcom)
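
To illustrate why a single category may be too coarse for this kind of slicing, here is a minimal Python sketch. The `Segment` layout, the field names, the `select` helper and the sample values are all hypothetical, loosely based on the three categories recalled above; this is not the actual Let'sMT! or TDA schema.

```python
from dataclasses import dataclass


# Hypothetical record layout: field names and values are illustrative
# assumptions, not the real Let'sMT! or TDA metadata schema.
@dataclass(frozen=True)
class Segment:
    text: str
    domain: str
    audience: str
    text_type: str


def select(corpus, **criteria):
    """Keep only segments whose metadata matches every given criterion."""
    return [
        seg for seg in corpus
        if all(getattr(seg, key) == value for key, value in criteria.items())
    ]


corpus = [
    Segment("Click OK to continue.", "Information technology", "general", "UA"),
    Segment("Set the kernel flag before rebuilding.", "Information technology", "expert", "UA"),
    Segment("Our new phone is out now!", "Information technology", "general", "marcom"),
    Segment("The contract ends in May.", "Law", "expert", "printed doc"),
]

# Slicing on domain alone still mixes expert manuals with marketing copy.
it_only = select(corpus, domain="Information technology")

# Combining the orthogonal categories narrows the slice to consistent data.
it_expert_ua = select(
    corpus,
    domain="Information technology",
    audience="expert",
    text_type="UA",
)
```

With domain alone, the slice still contains general-audience marketing text next to expert user assistance; only the combination of categories separates them.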

Milan should be able to provide more up-to-date detail, as he (Moravia) is actively involved in the project.

Finally, domain is about as vague as the other three categories you propose to drop, and its match with the categorization of existing corpora and trained machines won't be great, so I would not expect a big gain in machine-to-machine automation from domain alone.

I suggest not dropping them, at least until the Dublin workshop. The interested parties should be able to come up with a workable set of orthogonal categories, possibly consolidated (but not fewer than 2, IMHO) and more in line with what the industry is doing. We should also consider user-defined values as an attractive option for all but domain, or maybe even for domain.

That is my two cents :-)

Rgds
dF

Dr. David Filip
=======================
LRC | CNGL | LT-Web | CSIS
University of Limerick, Ireland
telephone: +353-6120-2781
cellphone: +353-86-0222-158
facsimile: +353-6120-2734
mailto: david.filip@ul.ie


On Wed, May 9, 2012 at 10:51 AM, Felix Sasaki <fsasaki@w3.org> wrote:
Hi all,

See ISSUE-11 : I propose to drop the genre, purpose and register data category proposals. Main reasons:
- I don't see a way to come up with an agreeable and interoperable set of values.
- There is no way to generate or check this kind of metadata automatically. This will lead down the same path as the "keywords" attribute in the HTML meta element.

I propose to have one data category, probably focusing on "domain", that can be produced at least to some extent automatically (= Tadej) and that can be taken up by planned implementations (= Declan).

Thoughts?

Felix

--
Felix Sasaki
DFKI / W3C Fellow

Received on Thursday, 10 May 2012 12:25:34 UTC