- From: Tadej Štajner <tadej.stajner@ijs.si>
- Date: Tue, 30 Nov 2010 11:07:12 +0100
- To: "Lieske, Christian" <christian.lieske@sap.com>
- CC: "multilingualweb-partners@w3.org" <multilingualweb-partners@w3.org>, "public-i18n-its-ig@w3.org" <public-i18n-its-ig@w3.org>, Felix Sasaki <felix.sasaki@dfki.de>
- Message-ID: <4CF4CCD0.5070401@ijs.si>
Hi, Christian, Felix, all,
in our experience, LT tools often tend to be used in a pipeline to
achieve a desired effect. This makes the transfer of metadata across
the different steps of the pipeline all the more important, as does
keeping track of what kinds of steps were applied to a given
information object (process awareness - for instance, to see what was
done automatically and what was done manually). We came across roughly
the same three gaps that Felix talked about in his slides and
(internally) developed a format that addresses them sufficiently.
The reason behind this is that the processing steps (PoS tagging,
entity identification, co-references, ...) are often interdependent,
so it is worth having them integrated as much as possible (maximizing
reuse of metadata via RDF) while at the same time keeping them drop-in
replaceable (for the sake of pipeline maintenance).
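To make the process-awareness point a bit more concrete, here is a
very rough, untested sketch (Python/rdflib, with a made-up "ex:"
vocabulary; this is not our internal format) of how provenance for a
single annotation could be expressed:

    # Rough sketch: each annotation carries provenance triples saying
    # which pipeline step produced it and whether it was applied
    # automatically or manually. The "ex:" vocabulary is invented.
    from rdflib import Graph, Literal, Namespace, URIRef
    from rdflib.namespace import RDF

    EX = Namespace("http://example.org/nlp-pipeline#")  # hypothetical

    g = Graph()
    g.bind("ex", EX)

    ann = URIRef("http://example.org/doc1#entity-42")   # one annotated span
    g.add((ann, RDF.type, EX.EntityAnnotation))
    g.add((ann, EX.surfaceForm, Literal("Online-Banking")))
    g.add((ann, EX.producedBy, EX.NamedEntityRecognizer))  # which step made it
    g.add((ann, EX.dependsOn, EX.PosTagger))                # step ordering
    g.add((ann, EX.appliedManually, Literal(False)))        # automatic vs. manual

    print(g.serialize(format="turtle"))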
I believe this sort of effort would help a lot of LT-related tasks.
My question here would be:
- how much additional word-level (or token-level) mark-up would it
make sense to introduce into this model? I imagine that entity-level
RDF annotation makes a lot of sense (Felix's "Online-Banking"
example), but I'm not sure whether proposing lower-level language
annotations (PoS, grammatical roles, co-references) is required for
most use cases.
Our use cases were summarization and extracting structured statements
from text, and low-level language annotations made sense for us. MT
tools also employ additional preprocessing steps, although they usually
integrate them internally. So: what kind (depth) of metadata should we
be talking about here, given the localization use case?
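Purely for illustration (the sentence and the tags below are made up
by me, not taken from Felix's slides), this is the difference in
depth I have in mind:

    # Illustrative only: the same made-up sentence annotated at two depths.
    sentence = "Online-Banking wird immer beliebter."

    # Entity/term-level: one annotation spanning the interesting substring.
    entity_level = [
        {"start": 0, "end": 14, "type": "Term", "translate": False},
    ]

    # Token-level: per-token part-of-speech tags (STTS tags, for illustration).
    token_level = [
        {"token": "Online-Banking", "pos": "NN"},
        {"token": "wird",           "pos": "VAFIN"},
        {"token": "immer",          "pos": "ADV"},
        {"token": "beliebter",      "pos": "ADJD"},
        {"token": ".",              "pos": "$."},
    ]

    assert sentence[0:14] == "Online-Banking"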
-- Tadej
On 11/30/2010 09:25 AM, Lieske, Christian wrote:
> Hi there,
> The first workshop of the W3C-coordinated Thematic Network “Multilingual
> Web” (see
> http://www.multilingualweb.eu/documents/madrid-workshop/slides-video-irc-notes)
> revived some thoughts that have been nagging Felix and myself for some
> time. In particular, Felix’s and my own talks (see
> http://www.w3.org/International/multilingualweb/madrid/slides/sasaki.pdf
> and
> http://www.w3.org/International/multilingualweb/madrid/slides/lieske.pdf)
> made us wonder how the following might be related to forthcoming
> standards-based Natural Language Processing applications on the web:
>
> 1. W3C Internationalization Tag Set (ITS)
> 2. Standard “packaging” format (as one contribution for covering
> some of the 3 gaps Felix has mentioned)
>
> As you may remember, we have already been throwing out some ideas
> related to this (see
> http://www.localisation.ie/xliff/resources/presentations/2010-10-04_xliff-its-secret-marriage.pdf,
> slides 22 and 23).
> This time around, we arrived at the insight that, very often, there
> are two separate steps between the original-language content (e.g. a
> set of source XML files) and Natural Language Processing:
>
> 1. Preparation related to individual objects – this may for example
> relate to the insertion of local or global “term”-related ITS
> markup (see the small illustration below this list)
> 2. Preparation related to packages of objects – this may for
> example relate to packaging all translation-relevant objects
> into a container
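>
> As a purely illustrative aside (the surrounding elements and the
> sample term are made up), “term”-related ITS 1.0 markup can be
> attached locally or via a global rule, roughly like this (Python is
> used here only as a convenient wrapper around the XML snippets):
>
>     # Illustrative only: local vs. global ITS 1.0 "term" markup.
>     ITS_NS = "http://www.w3.org/2005/11/its"
>
>     # Local: the its:term attribute sits directly on the element.
>     local_markup = f"""
>     <p xmlns:its="{ITS_NS}">
>       <span its:term="yes">Online-Banking</span> is explained in the glossary.
>     </p>"""
>
>     # Global: an its:rules element selects which elements are terms.
>     global_markup = f"""
>     <its:rules xmlns:its="{ITS_NS}" version="1.0">
>       <its:termRule selector="//dfn" term="yes"/>
>     </its:rules>"""
>
>     print(local_markup)
>     print(global_markup)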
>
> With this in mind, we arrive at three ideas related to standards and
> tools that we might be lacking for forthcoming standards-based Natural
> Language Processing on the web:
>
> 1. Something that could be called “Mark-Up Plug-in (MUP)” – This
> may for example be a plug-in for a browser-based editor that
> allows authors, for example, to mark certain parts with
> “its:translate=’no’” (this marking may result in local or global
> ITS markup; a rough sketch follows below this list).
> 2. Something that could be called “Standard Packing Format for
> Multilingual Processing (STAMP)” – This may for example be
> something akin to ePUB (one of the formats that is used in eReaders)
> 3. Something that could be called “Resource Annotation Workbench
> (RAW)” – This may for example be a special capability for an
> application like Rainbow (see
> http://okapi.opentag.com/applications.html#rainbow) that
> allows the following:
>
> 1. Create RDF-based metadata (embedded into the original files, or
> as additional, standalone/sidecar files) for objects that have
> to be processed
> 2. Package the translatables, the supplementary files, and the
> aforementioned “sidecars” into a standardized NLP-processing format
>
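> To make ideas 1 to 3 slightly more concrete, here is a very rough,
> untested sketch (Python; the file names, the metadata vocabulary and
> the container layout are all made up, and a real STAMP would of
> course need a proper manifest):
>
>     import xml.etree.ElementTree as ET
>     import zipfile
>
>     ITS_NS = "http://www.w3.org/2005/11/its"
>     ET.register_namespace("its", ITS_NS)
>
>     # 1. MUP-like step: mark part of a source file as "do not translate".
>     tree = ET.parse("source.xml")                  # made-up input file
>     for el in tree.iter("productname"):            # made-up element name
>         el.set("{" + ITS_NS + "}translate", "no")  # local ITS markup
>     tree.write("prepared.xml", encoding="utf-8", xml_declaration=True)
>
>     # 2. RAW-like step: a standalone "sidecar" file with RDF metadata
>     #    about the object (Turtle written by hand; invented vocabulary).
>     sidecar = (
>         "@prefix ex: <http://example.org/stamp#> .\n"
>         "<prepared.xml> ex:sourceLanguage \"en\" ;\n"
>         "               ex:preparedBy ex:MarkupPlugin .\n"
>     )
>     with open("prepared.xml.meta.ttl", "w", encoding="utf-8") as f:
>         f.write(sidecar)
>
>     # 3. STAMP-like step: package translatables plus sidecars into one
>     #    container (a plain zip here, in the spirit of ePUB/OCF).
>     with zipfile.ZipFile("package.stamp.zip", "w") as z:
>         z.write("prepared.xml")
>         z.write("prepared.xml.meta.ttl")
>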
> Any thoughts on this?
> Cheers,
> Christian (and Felix)