- From: Tadej Štajner <tadej.stajner@ijs.si>
- Date: Tue, 30 Nov 2010 11:07:12 +0100
- To: "Lieske, Christian" <christian.lieske@sap.com>
- CC: "multilingualweb-partners@w3.org" <multilingualweb-partners@w3.org>, "public-i18n-its-ig@w3.org" <public-i18n-its-ig@w3.org>, Felix Sasaki <felix.sasaki@dfki.de>
- Message-ID: <4CF4CCD0.5070401@ijs.si>
Hi Christian, Felix, all,

In our experience, LT tools often tend to be used in a pipeline to achieve a desired effect. This makes the transfer of metadata across the different steps of the pipeline even more important, as does keeping track of which steps were applied to a given information object (process awareness - for instance, to see what was done automatically and what manually).

We came across roughly the same three gaps that Felix talked about in his slides and (internally) developed a format which sufficiently solves this. The reason is that the processing steps (PoS tagging, entity identification, co-references, ...) are often interdependent, and it is worth having them integrated as much as possible (maximizing reuse of metadata via RDF) while at the same time keeping them drop-in replaceable (for the sake of pipeline maintenance). I believe this sort of effort would help a lot of LT-related tasks.

My question here would be: how much additional, specific word-level (or token-level) mark-up would it make sense to introduce into this model? I imagine that annotating RDF at the entity level makes a lot of sense (Felix's "Online-Banking" example), but I'm not sure whether proposing lower-level language annotations (PoS, grammatical roles, co-references) is required for most use cases. Our use cases were summarization and extracting structured statements from text, and low-level language annotations made sense for us. MT tools also employ additional preprocessing steps, although they usually integrate them internally. So: what kind (depth) of metadata should we be talking about here, given the localization use case?
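To make the kind of cross-step metadata reuse and process awareness I have in mind a bit more concrete, here is a minimal sketch (not our internal format, just an illustration assuming Python with rdflib); every namespace, property and step name in it is a made-up placeholder:

    # Minimal sketch: two pipeline steps sharing one RDF graph, so a later
    # step can reuse an earlier step's annotations and each annotation keeps
    # a record of which step produced it. Vocabulary is hypothetical.
    from rdflib import Graph, Literal, Namespace, URIRef
    from rdflib.namespace import RDF

    EX  = Namespace("http://example.org/annotation#")            # made-up vocabulary
    DOC = URIRef("http://example.org/docs/banking.xml#offset_12_26")  # a text span

    g = Graph()
    g.bind("ex", EX)

    def entity_identification(graph):
        """Step 1: mark the span 'Online-Banking' as a named entity."""
        graph.add((DOC, RDF.type, EX.Entity))
        graph.add((DOC, EX.surfaceForm, Literal("Online-Banking", lang="de")))
        graph.add((DOC, EX.producedBy, Literal("entity-identifier-0.1 (automatic)")))

    def term_marking(graph):
        """Step 2: reuse step 1's output and flag the span as do-not-translate."""
        for span in list(graph.subjects(RDF.type, EX.Entity)):
            graph.add((span, EX.translate, Literal("no")))
            graph.add((span, EX.producedBy, Literal("term-marker-0.2 (automatic)")))

    entity_identification(g)
    term_marking(g)
    print(g.serialize(format="turtle"))

The only point is that the second step can query what the first one produced, and that each annotation carries a note on which step (automatic or manual) added it; the actual vocabulary is exactly what would need to be agreed on.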
--
Tadej

On 11/30/2010 09:25 AM, Lieske, Christian wrote:
> Hi there,
>
> The first workshop of the W3C-coordinated Thematic Network “Multilingual Web” (see http://www.multilingualweb.eu/documents/madrid-workshop/slides-video-irc-notes) revived some thoughts that have been nagging Felix and me for some time. In particular, Felix's and my own talks (see http://www.w3.org/International/multilingualweb/madrid/slides/sasaki.pdf and http://www.w3.org/International/multilingualweb/madrid/slides/lieske.pdf) made us wonder how the following might be related to forthcoming standards-based Natural Language Processing applications on the web:
>
> 1. W3C Internationalization Tag Set (ITS)
> 2. A standard “packaging” format (as one contribution towards covering some of the three gaps Felix has mentioned)
>
> As you may remember, we have already been throwing out some ideas related to this (see http://www.localisation.ie/xliff/resources/presentations/2010-10-04_xliff-its-secret-marriage.pdf, slides 22 and 23).
>
> This time around, we arrived at the insight that very often we have two separate steps between the original-language content (e.g. a set of source XML files) and Natural Language Processing:
>
> 1. Preparation related to individual objects – this may for example relate to the insertion of local or global “term”-related ITS markup
> 2. Preparation related to packages of objects – this may for example relate to packaging all translation-relevant objects into a container
>
> With this in mind, we arrive at three ideas related to standards and tools that we might be lacking for forthcoming standards-based Natural Language Processing on the web:
>
> 1. Something that could be called “Mark-Up Plug-in (MUP)” – this may for example be a plug-in for a browser-based editor that allows authors to mark certain parts with “its:translate=’no’” (this marking may result in local or global ITS markup).
> 2. Something that could be called “Standard Packing Format for Multilingual Processing (STAMP)” – this may for example be something akin to ePUB (one of the formats used in eReaders).
> 3. Something that could be called “Resource Annotation Workbench (RAW)” – this may for example be a special capability for an application like Rainbow (see http://okapi.opentag.com/applications.html#rainbow) that allows the following:
>
>    1. Create RDF-based metadata (embedded into the original files, or as additional standalone/sidecar files) for objects that have to be processed
>    2. Package the translatables, the supplementary files, and the aforementioned “sidecars” into a standardized NLP-processing format
>
> Any thoughts on this?
>
> Cheers,
> Christian (and Felix)
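To make the “STAMP” packaging idea quoted above a bit more concrete as well, here is a minimal sketch of a ZIP-based container, loosely modelled on ePUB, bundling the translatables, their RDF “sidecar” annotations and a small manifest. It assumes Python's standard zipfile and json modules; the file names, manifest keys and version string are hypothetical placeholders, not a format proposal:

    # Minimal sketch of a ZIP-based "STAMP"-like container: translatables,
    # RDF sidecar metadata, and a manifest tying them together. All names
    # below are illustrative placeholders.
    import json
    import zipfile

    def build_package(path, translatables, sidecars):
        """Write translatables plus sidecar metadata into one container."""
        manifest = {
            "version": "0.1-draft",
            "translatables": sorted(translatables),
            "sidecars": sorted(sidecars),
        }
        with zipfile.ZipFile(path, "w", zipfile.ZIP_DEFLATED) as zf:
            zf.writestr("manifest.json", json.dumps(manifest, indent=2))
            for name, data in translatables.items():
                zf.writestr("content/" + name, data)
            for name, data in sidecars.items():
                zf.writestr("metadata/" + name, data)

    build_package(
        "banking-package.stamp.zip",
        translatables={"banking.xml": "<doc>Online-Banking ...</doc>"},
        sidecars={"banking.ttl": "# RDF annotations for banking.xml\n"},
    )

A consumer further down the pipeline would open the container, read the manifest, and know which sidecar annotates which translatable without any out-of-band agreement.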
Received on Wednesday, 1 December 2010 09:31:19 UTC