- From: Tadej Štajner <tadej.stajner@ijs.si>
- Date: Tue, 30 Nov 2010 11:07:12 +0100
- To: "Lieske, Christian" <christian.lieske@sap.com>
- CC: "multilingualweb-partners@w3.org" <multilingualweb-partners@w3.org>, "public-i18n-its-ig@w3.org" <public-i18n-its-ig@w3.org>, Felix Sasaki <felix.sasaki@dfki.de>
- Message-ID: <4CF4CCD0.5070401@ijs.si>
Hi, Christian, Felix, all,
in our experience, LT tools often tend to be used in a pipeline to
achieve a desired effect. This makes the transfer of metadata across
the different steps of the pipeline all the more important, as does
keeping track of what kinds of steps were applied to a given
information object (process awareness - for instance, to see what was
done automatically and what was done manually). We came across roughly
the same three gaps that Felix talked about in his slides and
(internally) developed a format that addresses them sufficiently.
The reason behind this is that the processing steps (PoS tagging,
entity identification, co-references, ...) are often interdependent,
so it is worth having them integrated as much as possible (maximizing
reuse of metadata via RDF) while at the same time keeping them drop-in
replaceable (for the sake of pipeline maintenance).
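To make the process-awareness point a bit more concrete, here is a
very rough, untested sketch (Python/rdflib, with a made-up "ex:"
vocabulary; this is not our internal format) of how provenance for a
single annotation could be expressed:

    # Rough sketch: each annotation carries provenance triples saying
    # which pipeline step produced it and whether it was applied
    # automatically or manually. The "ex:" vocabulary is invented.
    from rdflib import Graph, Literal, Namespace, URIRef
    from rdflib.namespace import RDF

    EX = Namespace("http://example.org/nlp-pipeline#")  # hypothetical

    g = Graph()
    g.bind("ex", EX)

    ann = URIRef("http://example.org/doc1#entity-42")   # one annotated span
    g.add((ann, RDF.type, EX.EntityAnnotation))
    g.add((ann, EX.surfaceForm, Literal("Online-Banking")))
    g.add((ann, EX.producedBy, EX.NamedEntityRecognizer))  # which step made it
    g.add((ann, EX.dependsOn, EX.PosTagger))                # step ordering
    g.add((ann, EX.appliedManually, Literal(False)))        # automatic vs. manual

    print(g.serialize(format="turtle"))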
I believe this sort of effort would help a lot of LT-related tasks.
My question here would be:
- how much additional word-level (or token-level) mark-up would it
make sense to introduce into this model? I imagine that entity-level
RDF annotation makes a lot of sense (Felix's "Online-Banking"
example), but I'm not sure whether proposing lower-level language
annotations (PoS, grammatical roles, co-references) is required for
most use cases.
Our use cases were summarization and extracting structured statements
from text, and low-level language annotations made sense for us. MT
tools also employ additional preprocessing steps, although they usually
integrate them internally. So: what kind (depth) of metadata should we
be talking about here, given the localization use case?
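Purely for illustration (the sentence and the tags below are made up
by me, not taken from Felix's slides), this is the difference in
depth I have in mind:

    # Illustrative only: the same made-up sentence annotated at two depths.
    sentence = "Online-Banking wird immer beliebter."

    # Entity/term-level: one annotation spanning the interesting substring.
    entity_level = [
        {"start": 0, "end": 14, "type": "Term", "translate": False},
    ]

    # Token-level: per-token part-of-speech tags (STTS tags, for illustration).
    token_level = [
        {"token": "Online-Banking", "pos": "NN"},
        {"token": "wird",           "pos": "VAFIN"},
        {"token": "immer",          "pos": "ADV"},
        {"token": "beliebter",      "pos": "ADJD"},
        {"token": ".",              "pos": "$."},
    ]

    assert sentence[0:14] == "Online-Banking"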
-- Tadej
On 11/30/2010 09:25 AM, Lieske, Christian wrote:
> Hi there,
> The first workshop of the W3C-coordinated Thematic Network “Multilingual
> Web” (see
> http://www.multilingualweb.eu/documents/madrid-workshop/slides-video-irc-notes)
> revived some thoughts that have been nagging Felix and myself for some
> time. In particular, Felix’s and my own talks (see
> http://www.w3.org/International/multilingualweb/madrid/slides/sasaki.pdf
> and
> http://www.w3.org/International/multilingualweb/madrid/slides/lieske.pdf)
> made us wonder how the following might be related to forthcoming
> standards-based Natural Language Processing applications on the web:
>
> 1. W3C Internationalization Tag Set (ITS)
> 2. Standard “packaging” format (as one contribution for covering
> some of the 3 gaps Felix has mentioned)
>
> As you may remember, we have already been throwing out some ideas
> related to this (see
> http://www.localisation.ie/xliff/resources/presentations/2010-10-04_xliff-its-secret-marriage.pdf,
> slides 22 and 23).
> This time around, we arrived at the insight that, very often, there
> are two separate steps between the original-language content (e.g. a
> set of source XML files) and Natural Language Processing:
>
> 1. Preparation related to individual objects – this may for example
> relate to the insertion of local or global “term”-related ITS
> markup (see the small illustration below this list)
> 2. Preparation related to packages of objects – this may for
> example relate to packaging all translation-relevant objects
> into a container
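>
> As a purely illustrative aside (the surrounding elements and the
> sample term are made up), “term”-related ITS 1.0 markup can be
> attached locally or via a global rule, roughly like this (Python is
> used here only as a convenient wrapper around the XML snippets):
>
>     # Illustrative only: local vs. global ITS 1.0 "term" markup.
>     ITS_NS = "http://www.w3.org/2005/11/its"
>
>     # Local: the its:term attribute sits directly on the element.
>     local_markup = f"""
>     <p xmlns:its="{ITS_NS}">
>       <span its:term="yes">Online-Banking</span> is explained in the glossary.
>     </p>"""
>
>     # Global: an its:rules element selects which elements are terms.
>     global_markup = f"""
>     <its:rules xmlns:its="{ITS_NS}" version="1.0">
>       <its:termRule selector="//dfn" term="yes"/>
>     </its:rules>"""
>
>     print(local_markup)
>     print(global_markup)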
>
> With this in mind, we arrive at three ideas related to standards and
> tools that we might be lacking for forthcoming standards-based Natural
> Language Processing on the web:
>
> 1. Something that could be called “Mark-Up Plug-in (MUP)” – This
> may for example be a plug-in for a browser-based editor that
> allows authors, for example, to mark certain parts with
> “its:translate=’no’” (this marking may result in local or global
> ITS markup; a rough sketch follows below this list).
> 2. Something that could be called “Standard Packing Format for
> Multilingual Processing (STAMP)” – This may for example be
> something akin to ePUB (one of the formats that is used in eReaders)
> 3. Something that could be called “Resource Annotation Workbench
> (RAW)” – This may for example be a special capability for an
> application like Rainbow (see
> http://okapi.opentag.com/applications.html#rainbow) that
> allows the following:
>
> 1. Create RDF-based metadata (embedded into the original files, or
> as additional, standalone/sidecar files) for objects that have
> to be processed
> 2. Package the translatables, the supplementary files, and the
> aforementioned “sidecars” into a standardized NLP-processing format
>
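> To make ideas 1 to 3 slightly more concrete, here is a very rough,
> untested sketch (Python; the file names, the metadata vocabulary and
> the container layout are all made up, and a real STAMP would of
> course need a proper manifest):
>
>     import xml.etree.ElementTree as ET
>     import zipfile
>
>     ITS_NS = "http://www.w3.org/2005/11/its"
>     ET.register_namespace("its", ITS_NS)
>
>     # 1. MUP-like step: mark part of a source file as "do not translate".
>     tree = ET.parse("source.xml")                  # made-up input file
>     for el in tree.iter("productname"):            # made-up element name
>         el.set("{" + ITS_NS + "}translate", "no")  # local ITS markup
>     tree.write("prepared.xml", encoding="utf-8", xml_declaration=True)
>
>     # 2. RAW-like step: a standalone "sidecar" file with RDF metadata
>     #    about the object (Turtle written by hand; invented vocabulary).
>     sidecar = (
>         "@prefix ex: <http://example.org/stamp#> .\n"
>         "<prepared.xml> ex:sourceLanguage \"en\" ;\n"
>         "               ex:preparedBy ex:MarkupPlugin .\n"
>     )
>     with open("prepared.xml.meta.ttl", "w", encoding="utf-8") as f:
>         f.write(sidecar)
>
>     # 3. STAMP-like step: package translatables plus sidecars into one
>     #    container (a plain zip here, in the spirit of ePUB/OCF).
>     with zipfile.ZipFile("package.stamp.zip", "w") as z:
>         z.write("prepared.xml")
>         z.write("prepared.xml.meta.ttl")
>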
> Any thoughts on this?
> Cheers,
> Christian (and Felix)