Re: [All] new issue: propose to drop genre, purpose and register data category proposals from Georg Rehm on 2012-05-09 (public-multilingualweb-lt@w3.org from May 2012)

From: Georg Rehm <georg.rehm@dfki.de>
Date: Wed, 9 May 2012 19:46:42 +0200
To: Tadej Stajner <tadej.stajner@ijs.si>
Cc: public-multilingualweb-lt@w3.org
Message-Id: <49DD431F-F1BC-41AA-99C3-35906A4EA8FD@dfki.de>
If "genre=advertising, purpose=advertisement" is an official example, then it's probably not a very good one. I'd say it's the other way round: genre=advertisement, purpose=advertising.

Years ago I devised a very solid rule of thumb for assessing whether X is a good genre label: if you can mention the label to someone in the sentence "write/instantiate a document of genre X", and this person is able to comprehend the sentence and, in fact, to instantiate genre X, then it's highly likely that it's a good and valid genre label.

A few (good) examples: shopping list, restaurant menu, love letter, PhD thesis, invoice, poem, conference programme, CD booklet, business card, recipe, weather report, obituary.

Using this rule of thumb, all the other terms that are often (incorrectly!) used to describe the genre, register, type of language etc. can be immediately ruled out as genre labels: informal, formal, article, high-brow etc.

I'd like to point out that very general terms such as "article" or even "web page" are not genres and do not work as genre labels because these terms are higher up in the hierarchy of text types -- you'd need to ask follow-up questions to be able to instantiate these "genres": "what kind of article?", "what kind of web page?"

Best,
Georg


On May 9, 2012, at 19:34 , Tadej Stajner wrote:

> From my perspective, genre and purpose could use some consolidation. Looking at the requirements page, it seems that genre is dangerously close to purpose, even in the examples (genre=advertising, purpose=advertisement). I'm indifferent to whether genre is called type (although 'type' is a very overloaded term when dealing with software), but I'm in favor of targetAudience instead of purpose, as it makes the distinction more obvious. 
> 
> -- Tadej
> 
> On 5/9/2012 6:09 PM, Felix Sasaki wrote:
>> 
>> 
>> 
>> 2012/5/9 Dr. David Filip <David.Filip@ul.ie>
>> Thanks Declan, I also think that at least domain should be primarily approached as a fixed ontology (it remains to be seen which or which set of ontologies)
>> 
>> What I suggested was that some and *maybe all* of the categories should allow for user defined extensions.
>> 
>> I would rather say: the metadata we are developing doesn't say anything about other metadata that may be related to the content we are adding it to. That means everybody is free to develop their own private values. Having private values within one field is problematic.
>> 
>> Felix
>> 
>>  
>> Of course machine-to-machine automation is hindered if private values are used, but at least consumers would know that private values can occur and would be prepared to display them, and possibly map them based on user preference.
>> 
>> But as Georg rightly points out, the main issue with all the related terms (rather than categories) is that the community is using them freely and interchangeably, although they are distinct concepts.
>> 
>> Rgds
>> dF
>> 
>> Dr. David Filip
>> =======================
>> LRC | CNGL | LT-Web | CSIS
>> University of Limerick, Ireland
>> telephone: +353-6120-2781
>> cellphone: +353-86-0222-158
>> facsimile: +353-6120-2734
>> mailto: david.filip@ul.ie
>> 
>> 
>> 
>> On Wed, May 9, 2012 at 1:17 PM, Declan Groves <dgroves@computing.dcu.ie> wrote:
>> Felix,
>> 
>> I would agree with David that it is something that warrants discussion at the meeting in Dublin.
>> 
>> In terms of domain/genre, we do have a number of very closely related data categories:
>> - Domain
>> - Genre
>> - Purpose
>> - Register
>> It is important to capture both domain and style of the text (which is determined by both the purpose and register data categories) for contextually-accurate translation. I feel that "genre" may be superfluous to our needs, but that we should retain purpose (reflects the end consumer of the content) and register (reflects the language style of the content). It may be an idea to rename these to 'target audience' and 'type', respectively, as David has suggested, if it makes the distinction clearer. 
>> 
>> I would suggest domain be mapped to an existing ontology, and therefore restricted (i.e. to NOT allow user defined values), but that for the other two we can leave these as user defined.
>> 
>> 
>> Declan
>> 
>> 
>> 
>> 
>> 
>> 
>> On 9 May 2012 12:43, Dr. David Filip <David.Filip@ul.ie> wrote:
>> Felix, I see where you are coming from, and I read your line of argument as: simple = more machine-to-machine interoperability.
>> 
>> My personal experience with large corpora such as TDA is that a single plain category is not enough to facilitate the slicing and dicing needed to prepare a consistent training corpus from data collected in the wild. MT tuners often need more orthogonal categories.
>> 
>> In LetsMT!, they were addressing the slicing and dicing need by having 3 orthogonal data categories (from my recollection as WIP, not necessarily accurate):
>> - domain (ISO categories subset mapped onto TDA), worked quite nicely
>> - target audience (general, expert, channel partner, internal)
>> - type (social, web, UA, printed doc, marcom)
>> 
>> Milan should be able to provide more up-to-date detail, as he (Moravia) is actively involved in the project.
>> 
>> And finally, domain has pretty much the same vagueness as the other three you are proposing to drop, and the match with the categorization of existing corpora and trained machines won't be great, so I would not expect a big gain in machine-to-machine automation with domain only.
>> 
>> I suggest not dropping them, at least until the Dublin workshop. The interested parties should be able to come up with a workable set of orthogonal categories, possibly consolidated (but not fewer than 2 IMHO) and more in line with what the industry is doing. We should also consider user-defined values as an attractive option for all but domain, or maybe even for domain.
>> 
>> That is my two cents :-)
>> 
>> Rgds
>> dF 
>> 
>> On Wed, May 9, 2012 at 10:51 AM, Felix Sasaki <fsasaki@w3.org> wrote:
>> Hi all,
>> 
>> See ISSUE-11 : I propose to drop the genre, purpose and register data category proposals. Main reasons:
>> - I don't see a way to come up with an agreeable and interoperable set of values.
>> - There is no way to generate or check this kind of metadata automatically. This will lead down the same path as the "keywords" attribute in the HTML meta element.
>> 
>> I propose to have one data category, probably focusing on "domain", that can be produced at least to some extent automatically (= Tadej) and that can be taken up by planned implementations (= Declan).
>> 
>> Thoughts?
>> 
>> Felix
>> 
>> -- 
>> Felix Sasaki
>> DFKI / W3C Fellow
>> 
>> 
>> 
>> 
>> -- 
>> Dr. Declan Groves
>> Research Integration Officer
>> Centre for Next Generation Localisation (CNGL)
>> Dublin City University
>> 
>> email: dgroves@computing.dcu.ie
>> phone: +353 (0)1 700 6906
>> 
>> 
>> 
>> 
>> 
> 

-- 
Dr. Georg Rehm
Network Manager META-NET

DFKI GmbH, Alt-Moabit 91c, 10559 Berlin, Germany
Phone: +49 30 23895-1833 – Fax: -1810
Mobile: +49 173 2735829
georg.rehm@dfki.de – georg.rehm@meta-net.eu
Deutsches Forschungszentrum für Künstliche Intelligenz GmbH
Firmensitz: Trippstadter Strasse 122, D-67663 Kaiserslautern
Geschäftsführung: Prof. Dr. Dr. h.c. mult. Wolfgang Wahlster (Vorsitzender), Dr. Walter Olthoff
Vorsitzender des Aufsichtsrats: Prof. Dr. h.c. Hans A. Aukes
Amtsgericht Kaiserslautern, HRB 2313
Received on Wednesday, 9 May 2012 21:28:08 UTC