Re: [All] new issue: propose to drop genre, purpose and register data category proposals

Fundamentally, at least for SMT, what we need is a set of content 
characteristics that allow us to relate training corpora for SMT to the 
content that needs to be translated. However, I get the feeling that 
everyone is still pretty much experimenting with what characteristics and 
classifications of those characteristics are appropriate, and what's 
more, these may still vary from application to application (in the sense 
of the application of SMT trained on corpora x in translating content y).

So perhaps the best we can manage is to allow people to point to their 
own characteristic types and then select values from that type. The 
simplest way might be via external files with the condition that every 
characteristic value in that file is a dereferenceable fragment. So we 
could imagine the example below, noting that we might need to 
accommodate multiple characteristic values for an element:

<body its-content-characteristic-type-ref="http://www.ex.com/mammals.rdf">

<div its-content-characteristic-type-pointer="http://www.ex.com/mammals#cats">
Tom coughed up a fur ball, stretched and purred like a kitten before a 
piano suddenly landed on his head.
</div>

<div 
its-content-characteristic-type-pointer="http://www.ex.com/mammals#dogs" 
its-content-characteristic-type-ref="http://www.ex.com/mammals#mices">
The two friends shook paws on a job well done then headed to bed, Butch 
to his kennel, Jerry to his mouse hole.
</div>

</body>
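
To make the multiple-values point concrete, here is a rough sketch of how a consumer might read such markup. It is only an illustration under assumptions of mine: the attribute name is spelled out in full, multiple values are packed into one attribute as a whitespace-separated list, and the sample markup and the function name characteristic_values are hypothetical. It only checks that each value carries a fragment; it does not dereference anything.

```python
import xml.etree.ElementTree as ET
from urllib.parse import urldefrag

# Assumed attribute name, modeled on the example in this mail (not a settled spec).
POINTER_ATTR = "its-content-characteristic-type-pointer"

# Hypothetical sample; the second div packs two values into one attribute,
# separated by whitespace (an assumption about how multiple values could work).
SAMPLE = """<body its-content-characteristic-type-ref="http://www.ex.com/mammals.rdf">
<div its-content-characteristic-type-pointer="http://www.ex.com/mammals#cats">Tom purred.</div>
<div its-content-characteristic-type-pointer="http://www.ex.com/mammals#dogs http://www.ex.com/mammals#mice">Butch and Jerry shook paws.</div>
</body>"""


def characteristic_values(markup):
    """Return (text, [values]) for every element that carries POINTER_ATTR."""
    root = ET.fromstring(markup)
    results = []
    for elem in root.iter():
        raw = elem.get(POINTER_ATTR)
        if raw is None:
            continue
        values = raw.split()
        for value in values:
            # Each value must point at a fragment of the external file.
            _base, fragment = urldefrag(value)
            if not fragment:
                raise ValueError("characteristic value has no fragment: " + value)
        results.append(((elem.text or "").strip(), values))
    return results


if __name__ == "__main__":
    for text, values in characteristic_values(SAMPLE):
        print(text, values)
```

An SMT pipeline could then use the collected values to pick the training corpora whose characteristics match each element.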

cheers,
Dave


On 09/05/2012 12:43, Dr. David Filip wrote:
> Felix, I see where you are coming from and read your line of argument 
> as: simple = more machine-to-machine interoperability.
>
> My personal experience with large corpora such as TDA was that a 
> single plain category is not enough to facilitate slicing and dicing 
> needed to prepare a consistent training corpus from data collected in 
> the wild. MT tuners often need more orthogonal categories
>
> In LetsMT!, they were addressing the slicing and dicing need by having 
> 3 orthogonal data categories (from my recollection as WIP, not 
> necessarily accurate)
> domain (ISO categories subset mapped onto TDA), worked quite nicely
> target audience (general, expert, channel partner, internal)
> type (social, web, UA, printed doc, marcom)
>
> Milan should be able to provide more up-to-date detail, as he 
> (Moravia) is actively involved in the project.
>
> And finally, domain has pretty much the same vagueness as the other 
> three you are proposing to drop, and the match with existing corpora and 
> trained machines' categorization won't be great, so I would not expect 
> a big gain in machine-to-machine automation with domain only.
>
> I suggest not dropping them, at least until the Dublin workshop. The 
> interested parties should be able to come up with a workable set of 
> orthogonal categories, possibly consolidated (but not fewer than 2 
> IMHO) and more in line with what the industry is doing. We should also 
> consider user-defined values as an attractive option for all but 
> domain, or maybe even for domain.
>
> That is my two cents :-)
>
> Rgds
> dF
>
> Dr. David Filip
> =======================
> LRC | CNGL | LT-Web | CSIS
> University of Limerick, Ireland
> telephone: +353-6120-2781
> cellphone: +353-86-0222-158
> facsimile: +353-6120-2734
> mailto: david.filip@ul.ie
>
>
>
> On Wed, May 9, 2012 at 10:51 AM, Felix Sasaki <fsasaki@w3.org> wrote:
>
>     Hi all,
>
>     See ISSUE-11 : I propose to drop the genre, purpose and register
>     data category proposals. Main reasons:
>     - I don't see a way to come up with an agreeable and interoperable
>     set of values.
>     - There is no way to generate or check this kind of metadata
>     automatically. This will lead down the same path as the "keywords"
>     attribute in the HTML meta element.
>
>     I propose to have one data category, probably focusing on
>     "domain", that can be produced at least to some extent
>     automatically (= Tadej) and that can be taken up by planned
>     implementations (=Declan).
>
>     Thoughts?
>
>     Felix
>
>     -- 
>     Felix Sasaki
>     DFKI / W3C Fellow
>
>

Received on Wednesday, 9 May 2012 23:37:25 UTC