More thoughts on forthcoming standards-based Natural Language Processing on the Web

Hi there,

The first workshop of the W3C-coordinated Thematic Network "Multilingual Web" (see http://www.multilingualweb.eu/documents/madrid-workshop/slides-video-irc-notes) revived some thoughts that have been nagging Felix and me for some time. In particular, Felix's talk and my own (see http://www.w3.org/International/multilingualweb/madrid/slides/sasaki.pdf and http://www.w3.org/International/multilingualweb/madrid/slides/lieske.pdf) made us wonder how the following might be related to forthcoming standards-based Natural Language Processing applications on the web:

1. W3C Internationalization Tag Set (ITS)
2. Standard "packaging" format (as one contribution towards covering some of the three gaps Felix has mentioned)

As you may remember, we have already been throwing out some ideas related to this (see http://www.localisation.ie/xliff/resources/presentations/2010-10-04_xliff-its-secret-marriage.pdf, slides 22 and 23).

This time around, we kept coming back to the insight that very often there are two separate steps between the original language content (e.g. a set of source XML files) and Natural Language Processing:

1. Preparation related to individual objects - this may for example relate to the insertion of local or global, "term"-related ITS markup
2. Preparation related to packages of objects - this may for example relate to packaging all translation-relevant objects into a container
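
To make the first step (preparation of individual objects) a bit more concrete, here is a minimal Python sketch that inserts local ITS markup into a source XML file. The ITS namespace and the local "its:translate" attribute come from ITS 1.0; the function name "mark_untranslatable", the sample document, and the choice of the "code" element are purely illustrative assumptions on our side, not part of any existing tool.

```python
import xml.etree.ElementTree as ET

# Namespace defined by the W3C Internationalization Tag Set (ITS) 1.0.
ITS_NS = "http://www.w3.org/2005/11/its"
ET.register_namespace("its", ITS_NS)


def mark_untranslatable(xml_text, tag):
    """Add local ITS markup (its:translate="no") to every <tag> element.

    Hypothetical helper for illustration only.
    """
    root = ET.fromstring(xml_text)
    for elem in root.iter(tag):
        elem.set(f"{{{ITS_NS}}}translate", "no")
    return ET.tostring(root, encoding="unicode")


# Illustrative source content: the key combination should not be translated.
sample = "<doc><p>Press <code>Ctrl+C</code> to copy.</p></doc>"
marked = mark_untranslatable(sample, "code")
print(marked)
```

The same intent could presumably also be expressed as global ITS markup, i.e. a single its:translateRule with a selector such as "//code", instead of an attribute on each element.
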
With this in mind, we arrive at three ideas related to standards and tools that we might be lacking for forthcoming standards-based Natural Language Processing on the web:

1. Something that could be called "Mark-Up Plug-in (MUP)" - This may for example be a plug-in for a browser-based editor that allows authors, for example, to mark certain parts with "its:translate='no'" (this marking may result in local or global ITS markup).
2. Something that could be called "Standard Packing Format for Multilingual Processing (STAMP)" - This may for example be something akin to ePUB (one of the formats that is used in eReaders)
3. Something that could be called "Resource Annotation Workbench (RAW)" - This may for example be a special capability for an application like Rainbow (see http://okapi.opentag.com/applications.html#rainbow) that allows the following:
   a. Create RDF-based metadata (embedded into the original files, or as additional, standalone/sidecar files) for objects that have to be processed
   b. Package the translatables, the supplementary files, and the aforementioned "sidecars" into a standardized NLP-processing format
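
To illustrate the packaging idea, here is a rough Python sketch of what a STAMP-like container might look like: a ZIP archive (in the spirit of ePUB) holding the translatables, an RDF sidecar in Turtle syntax, and a manifest. Everything about the layout is an assumption made up for this example: the "content/" and "metadata/" folders, the "manifest.json" file, the function name "build_package", and the use of dcterms:language in the sidecar. No such format has been specified.

```python
import io
import json
import zipfile


def build_package(translatables, sidecars):
    """Bundle translatable files and metadata sidecars into one ZIP container.

    Both arguments map file names to text content. The folder layout and
    the manifest are hypothetical, chosen only for this illustration.
    """
    manifest = {
        "translatables": sorted(translatables),
        "sidecars": sorted(sidecars),
    }
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w", zipfile.ZIP_DEFLATED) as zf:
        zf.writestr("manifest.json", json.dumps(manifest, indent=2))
        for name, content in translatables.items():
            zf.writestr(f"content/{name}", content)
        for name, content in sidecars.items():
            zf.writestr(f"metadata/{name}", content)
    return buf.getvalue()


# A standalone RDF sidecar (Turtle) describing one translatable object.
sidecar_ttl = (
    "@prefix dcterms: <http://purl.org/dc/terms/> .\n"
    '<content/intro.xml> dcterms:language "en" .\n'
)

package = build_package(
    {"intro.xml": "<doc><p>Hello</p></doc>"},
    {"intro.ttl": sidecar_ttl},
)
```

A real format would of course need agreed-upon names and a schema for the manifest; the point here is only that originals, supplementary files, and sidecars travel together in one container.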

Any thoughts on this?
Cheers,
Christian (and Felix)

Received on Tuesday, 30 November 2010 08:26:03 UTC