W3C

Metadata for the Multilingual Web - Usage Scenarios and Implementations

W3C Working Draft 7 March 2013

This version:
http://www.w3.org/TR/2012/WD-mlw-metadata-us-impl-20130307/
Latest version:
http://www.w3.org/TR/mlw-metadata-us-impl/
Editors:
Christian Lieske (SAP AG)
Contributors
See the list of contributors

Abstract

The W3C Internationalization Tag Set (ITS) 2.0, developed by the W3C MultilingualWeb-LT Working Group, enhances the foundation for integrating automated processing of human language into core Web technologies. ITS 2.0 has much in common with its predecessor ITS 1.0 but provides additional concepts designed to foster the automated creation and processing of multilingual Web content. ITS 2.0 focuses on HTML and XML-based formats in general, and can leverage processing based on the XML Localization Interchange File Format (XLIFF) as well as the Natural Language Processing Interchange Format (NIF).

The W3C MultilingualWeb-LT Working Group received funding from the European Commission (project LT-Web) through the Seventh Framework Programme (FP7) in the area of Language Technologies (Grant Agreement No. 287815). As part of their activities, members of the Working Group and the LT-Web project created various implementations that exemplify how ITS 2.0 supports the integration of automated processing of human language into core Web technologies. These implementations and the corresponding usage scenarios are sketched in this document. Each usage scenario section comprises a description, the data category usage, and more information on implementation status and issues.

Status of this Document

This section describes the status of this document at the time of its publication. Other documents may supersede this document. A list of current W3C publications and the latest revision of this technical report can be found in the W3C technical reports index at http://www.w3.org/TR/.

This document describes usage scenarios and related implementations for Internationalization Tag Set (ITS) 2.0. ITS 2.0 enhances the foundation to integrate both automated and manual processing of human language into core Web technologies.

The work described in this document receives funding from the European Commission (project MultilingualWeb-LT (LT-Web)) through the Seventh Framework Programme (FP7) in the area of Language Technologies (Grant Agreement No. 287815).

This document is a First Public Working Draft published by the MultilingualWeb-LT Working Group, part of the W3C Internationalization Activity. The Working Group expects to advance this Working Draft to Working Group Note (see W3C document maturity levels).

By publishing this Working Draft, the Working Group does not express any consensus about the implementation approach, the use cases described, or the proposed metadata items. The main purpose of this publication is to gather feedback from a wider audience.

Feedback about the content of this document is encouraged. Send your comments to public-multilingualweb-lt-comments@w3.org. Use "Comment on Multilingual Web metadata usage scenarios and implementations WD" in the subject line of your email. The archives for this list are publicly available.

Publication as a Working Draft does not imply endorsement by the W3C Membership. This is a draft document and may be updated, replaced or obsoleted by other documents at any time. It is inappropriate to cite this document as other than work in progress.

This document was produced by a group operating under the 5 February 2004 W3C Patent Policy. The group does not expect this document to become a W3C Recommendation. W3C maintains a public list of any patent disclosures made in connection with the deliverables of the group; that page also includes instructions for disclosing a patent. An individual who has actual knowledge of a patent which the individual believes contains Essential Claim(s) must disclose the information in accordance with section 6 of the W3C Patent Policy.

1 Introduction

The W3C Internationalization Tag Set (ITS) 2.0, developed by the W3C MultilingualWeb-LT Working Group, enhances the foundation for integrating automated processing of human language into core Web technologies. ITS 2.0 has much in common with its predecessor ITS 1.0 but provides additional concepts designed to foster the automated creation and processing of multilingual Web content. ITS 2.0 focuses on HTML and XML-based formats in general, and can leverage processing based on the XML Localization Interchange File Format (XLIFF) as well as the Natural Language Processing Interchange Format (NIF).

The W3C MultilingualWeb-LT Working Group received funding from the European Commission (project MultilingualWeb-LT (LT-Web)) through the Seventh Framework Programme (FP7) in the area of Language Technologies (Grant Agreement No. 287815). As part of their activities, project members and members of the Working Group compiled a list of usage scenarios that exemplify how ITS 2.0 integrates automated processing of human language into core Web technologies. These usage scenarios, and the implementations realized by the Working Group, are sketched in this document. Each usage scenario comprises information such as a description (including benefits), the ITS 2.0 data categories used, and the implementation status or issues.

2 Usage scenarios

2.1 Simple Machine Translation

2.1.1 Description

Benefits:

2.1.2 Data category usage

2.1.3 More Information and Implementation Status/Issues

Tool: Okapi Framework (ENLASO).

Implementation status/issues:

2.2 Translation Package Creation

2.2.1 Description

Benefits:

2.2.2 Data category usage

2.2.3 More Information and Implementation Status/Issues

Tool: Okapi Framework (ENLASO).

Implementation status/issues:

2.3 Quality Check

2.3.1 Description

Benefits:

2.3.2 Data category usage

2.3.3 More Information and Implementation Status/Issues

Tool: Okapi Framework (ENLASO).

Implementation status/issues:

2.4 Processing HTML5 documents with an XML tool chain

2.4.1 Description

Benefits:

2.4.2 Data category usage

2.4.3 More Information and Implementation Status/Issues

2.5 Validating HTML5 with ITS 2.0 metadata

2.5.1 Description

Benefits:

2.5.2 Data category usage

2.5.3 More Information and Implementation Status/Issues

2.6 Interchange between Content Management System and Translation Management System

2.6.1 Description

Benefits:

2.6.2 Data category usage

Additional data category (not part of ITS 2.0):

2.6.3 More Information and Implementation Status/Issues

Tools (developed by Linguaserve):

Implementation status:

Implementation issues:

2.7 Content Internationalization and Advanced Machine Translation

2.7.1 Description

Benefits:

  1. provides key information to drive the reliable extraction of translation-relevant content from HTML5;
  2. helps to control workflow dimensions such as selection of domain-specific vocabulary to improve the Machine Translation results;
  3. provides information for post-editing.
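The first benefit, reliable extraction of translation-relevant content from HTML5, can be illustrated with a minimal Python sketch. It is not the implementation used in this usage scenario. It assumes that only the HTML5 `translate` attribute is evaluated, that the attribute is inherited by child elements, and that the input is well-formed (every non-void element is closed). The class name `TranslatableTextExtractor` and the function name `extract_translatable` are hypothetical.

```python
from html.parser import HTMLParser

# Elements without an end tag; they never enclose text.
VOID_ELEMENTS = {"area", "base", "br", "col", "embed", "hr", "img",
                 "input", "link", "meta", "source", "track", "wbr"}
# Elements whose character data is never translatable text.
NON_TEXT_ELEMENTS = {"script", "style"}


class TranslatableTextExtractor(HTMLParser):
    """Collects text nodes that are translatable (assumption: well-formed input)."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self._translate_stack = [True]  # the document default is "translatable"
        self.segments = []

    def handle_starttag(self, tag, attrs):
        if tag in VOID_ELEMENTS:
            return
        value = dict(attrs).get("translate")
        if tag in NON_TEXT_ELEMENTS or value == "no":
            flag = False
        elif value == "yes":
            flag = True
        else:
            flag = self._translate_stack[-1]  # inherit from the parent element
        self._translate_stack.append(flag)

    def handle_endtag(self, tag):
        if tag in VOID_ELEMENTS:
            return
        if len(self._translate_stack) > 1:
            self._translate_stack.pop()

    def handle_data(self, data):
        text = data.strip()
        if text and self._translate_stack[-1]:
            self.segments.append(text)


def extract_translatable(html_text):
    """Return the translatable text nodes of an HTML fragment, in document order."""
    parser = TranslatableTextExtractor()
    parser.feed(html_text)
    parser.close()
    return parser.segments
```

For the fragment `<p>Click <code translate="no">Save</code> to continue.</p>`, the sketch keeps the two text nodes around the `code` element and skips its content. A production extractor would also have to honour ITS 2.0 global rules and inline element segmentation, which are outside this sketch.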

2.7.2 Data category usage

2.7.3 More Information and Implementation Status/Issues

Tools:

Implementation issues:

2.8 Using ITS 2.0 with GNU gettext utilities/PO files

2.8.1 Description

2.8.2 Data category usage

2.8.3 More Information and Implementation Status/Issues

Implementation status/issues:

2.9 Harnessing ITS 2.0 Metadata to Improve the Human Review Process

2.9.1 Description

2.9.2 Data category usage

Benefits:

2.9.3 More Information and Implementation Status/Issues

2.10 XLIFF-based Machine Translation

2.10.1 Description

2.10.2 Data category usage

Benefits:

2.10.3 More Information and Implementation Status/Issues

2.11 XLIFF-based CMS-to-TMS Roundtripping for HTML&XML

2.11.1 Description

  1. One of the SOLAS components is an additional OKAPI-based Extractor/Merger service that maps ITS 2.0 data categories onto XLIFF 1.2.
  2. SOLAS is also integrated with CMS-L10N and can receive and return XLIFF jobs created by CMS-L10N.
  1. Can parse the source, including most of the ITS 2.0 metadata, and produce XLIFF 1.2 according to a currently agreed mapping. After the roundtrip, which is handled via SOLAS, it updates the RDF triple store accordingly.
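The idea of carrying ITS 2.0 metadata into XLIFF 1.2 can be sketched as follows. This is only an illustration and not the mapping agreed by the implementers: it assumes that the Translate data category is expressed through the XLIFF `translate` attribute on `trans-unit` and that a localization note is carried in a `note` element. The function name `make_trans_unit` is hypothetical.

```python
import xml.etree.ElementTree as ET

# Namespace of XLIFF 1.2 documents.
XLIFF_NS = "urn:oasis:names:tc:xliff:document:1.2"


def make_trans_unit(unit_id, source_text, translate=True, loc_note=None):
    """Build one XLIFF 1.2 trans-unit from a source segment and its ITS metadata.

    Assumed mapping (for illustration only):
      ITS translate="no"     -> translate="no" on the trans-unit
      ITS localization note  -> <note> child of the trans-unit
    """
    unit = ET.Element(f"{{{XLIFF_NS}}}trans-unit", {"id": unit_id})
    if not translate:
        unit.set("translate", "no")
    source = ET.SubElement(unit, f"{{{XLIFF_NS}}}source")
    source.text = source_text
    if loc_note:
        note = ET.SubElement(unit, f"{{{XLIFF_NS}}}note")
        note.text = loc_note
    return unit


if __name__ == "__main__":
    unit = make_trans_unit("u1", "Save", translate=False, loc_note="Button label")
    print(ET.tostring(unit, encoding="unicode"))
```

A merger step would read the translated `target` elements back and write them into the original HTML or XML; that direction is not covered here.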

Benefits:

2.11.2 Data category usage

2.11.3 More Information and Implementation Status/Issues

Implementer: TCD/UL, making use of MT components by Moravia and DCU, and of JSI Enrycher as the text analysis service.

This tool is based on an ITS-XLIFF mapping:

Although all ITS data categories listed above, as encoded by OKAPI or TCD's CMS-LION, are covered, the demos in mid-March mainly show consumption of the following: translate, term, text analysis, domain, localization note, provenance, and MT confidence. The demos involve:

  1. Moravia's implementation of M4Loc and Moses with ITS 2.0 support
  2. DCU MaTrEx with ITS 2.0 support
  3. Fallback handling of the ITS 2.0 information within SOLAS MT Service Mapper with services that are not ITS 2.0 aware, such as Microsoft Bing
  1. Running software: http://mlwlt.moravia.com (testing site)
  2. Running software (web-service): http://mlwlt.moravia.com/mlwlt-service-xliff-mt/mlwlt-service.asmx
  3. Source code: https://github.com/mkarasek/mlwlt-m4loc-xliff-mt
  4. General documentation: https://github.com/mkarasek/mlwlt-m4loc-xliff-mt/wiki

Please note that the links to the running software are currently accessible only to the SOLAS system. They should become public next week.

2.12 ITS 2.0 for localization of content in a Web Content Management System

2.12.1 Description

Benefits:

2.12.2 Data category usage

2.12.3 More Information and Implementation Status/Issues

Tool: Drupal Module for editing and viewing of ITS 2.0 markup (Cocomore AG)

Tool: Drupal Module to connect to TMGMT Translator Linguaserve (Cocomore AG)

Tool: Drupal Module to interact with TMGMT Workflow (Cocomore AG)

Tool: ITS 2.0 jQuery Plugin (Cocomore AG)

2.13 Integrating ITS 2.0, Content Management Interoperability Services, and W3C Provenance

2.13.1 Description

Localization interoperability can be enhanced by combining ITS 2.0 with other standards. In particular, the following standards provide additional opportunities:

  1. OASIS Content Management Interoperability Services (CMIS) to externally associate multiple ITS 2.0 rules files with large sets of documents, and to retrieve those documents regardless of the Content Management System in use
  2. W3C Provenance (PROV) to track which human agents or software agents processed the content; tracking can span multiple agents/components, while allowing individual tracking records to be easily consolidated via linked data approaches
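How tracking records from several components might be consolidated can be sketched with plain subject-predicate-object triples. The sketch below is an assumption-laden illustration, not the implementation described in this section: it uses the PROV-O properties `prov:used`, `prov:wasGeneratedBy` and `prov:wasAssociatedWith`, represents triples as Python tuples instead of using an RDF library, and all `ex:` identifiers and function names are hypothetical.

```python
PROV = "http://www.w3.org/ns/prov#"
USED = PROV + "used"
WAS_GENERATED_BY = PROV + "wasGeneratedBy"
WAS_ASSOCIATED_WITH = PROV + "wasAssociatedWith"


def record_activity(activity, agent, used, generated):
    """Return the triples one component would emit for a single processing step."""
    return {
        (activity, USED, used),
        (generated, WAS_GENERATED_BY, activity),
        (activity, WAS_ASSOCIATED_WITH, agent),
    }


def agents_involved(entity, triples):
    """List the agents of every activity that used or generated the entity."""
    activities = set()
    for subject, predicate, obj in triples:
        if predicate == WAS_GENERATED_BY and subject == entity:
            activities.add(obj)
        elif predicate == USED and obj == entity:
            activities.add(subject)
    return sorted(obj for subject, predicate, obj in triples
                  if predicate == WAS_ASSOCIATED_WITH and subject in activities)


# Two components record their steps independently ...
mt_record = record_activity("ex:mt", "ex:mt-engine",
                            used="ex:doc-en", generated="ex:doc-fr-raw")
review_record = record_activity("ex:review", "ex:reviewer",
                                used="ex:doc-fr-raw", generated="ex:doc-fr-final")

# ... and consolidation is a set union, because shared identifiers link the records.
consolidated = mt_record | review_record
```

Because both records refer to the same identifier for the raw translation, the union answers questions that neither record answers alone, for example which agents produced or consumed `ex:doc-fr-raw`.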

Benefits:

2.13.2 Data category usage

Where available, and not already specified by explicit ITS provenance annotation, annotatorsRef was used to derive PROV-O agent details for specific activities, e.g. text analysis and terminology.
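The fallback described above can be sketched in Python. In ITS 2.0, the value of `annotatorsRef` is a space-separated list of references, each consisting of a data category identifier and an IRI separated by a vertical bar. The function names and the `explicit_agent` parameter below are hypothetical, and the example IRIs are placeholders.

```python
def parse_annotators_ref(value):
    """Split an annotatorsRef value into a {data category identifier: IRI} mapping."""
    annotators = {}
    for reference in value.split():
        category, _, iri = reference.partition("|")
        if category and iri:  # skip malformed references
            annotators[category] = iri
    return annotators


def agent_for(category, annotators, explicit_agent=None):
    """Prefer an explicit provenance annotation; otherwise fall back to annotatorsRef."""
    if explicit_agent:
        return explicit_agent
    return annotators.get(category)


if __name__ == "__main__":
    annotators = parse_annotators_ref(
        "text-analysis|http://example.org/ta terminology|http://example.org/term")
    print(agent_for("text-analysis", annotators))
    print(agent_for("terminology", annotators, explicit_agent="http://example.org/person/1"))
```

The IRI returned by `agent_for` could then serve as the agent associated with the corresponding activity in the provenance record.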

2.13.3 More Information and Implementation Status/Issues

Details:

2.14 Text Analysis - Named Entity Recognition and Enrichment

2.14.1 Description

Benefits:

2.14.2 Data category usage

2.14.3 More Information and Implementation Status/Issues

Implementation issues and need for discussion:

2.15 Automated Terminology Annotation

2.15.1 Description

Benefits: The Web Service API can be integrated in automated language processing workflows, for instance, machine translation, localization, terminology management and many other tasks that may benefit from terminology annotation.

2.15.2 Data category usage

2.15.3 More Information and Implementation Status/Issues

The implementation has reached Milestone 2 (Initial HTML5 term tagging with simple visualization). The implementation of Milestone 3 (Enhanced HTML5 term tagging with full visualization) is ongoing.

2.16 Universal Preview of ITS 2.0 Metadata in XML, XLIFF, and HTML Files

2.16.1 Description

XML-based source content such as XLIFF files is usually provided to translators or reviewers as reduced and partially transformed text, without information about the local or global context and without support for rendering or visualizing the content itself or the metadata embedded in it. Overall, this has negative effects on the quality of the final output and on the productivity of the people involved.

The usage scenario renders content and metadata so that they can be read easily and interactively as reference material in a browser. The rendering includes special visual cues and interaction possibilities (such as colour-coding and pop-ups that display metadata). It is based on auxiliary HTML5+ITS 2.0 files (including JavaScript) that are generated from ITS-annotated source content in any of the supported formats (XML, XLIFF, HTML).
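A rough idea of how such a preview could be generated is sketched below in Python. It is not the prototype's code: it assumes that each segment arrives with a translate flag and an optional localization note, that colour-coding is done through CSS classes, and that the browser's native `title` tooltip stands in for a pop-up. The class names `its-translate-yes` and `its-translate-no` and the function names are hypothetical.

```python
from html import escape


def render_segment(text, translate=True, loc_note=None):
    """Render one segment as a span whose class drives colour-coding."""
    css_class = "its-translate-yes" if translate else "its-translate-no"
    # The localization note becomes a tooltip; quotes are escaped for attribute use.
    title = f' title="{escape(loc_note, quote=True)}"' if loc_note else ""
    return f'<span class="{css_class}"{title}>{escape(text)}</span>'


def render_preview(segments):
    """Join rendered segments into a paragraph of the auxiliary preview file."""
    body = " ".join(render_segment(**segment) for segment in segments)
    return f"<p>{body}</p>"


if __name__ == "__main__":
    print(render_preview([
        {"text": "Press"},
        {"text": "Save & exit", "translate": False, "loc_note": 'Button "label"'},
    ]))
```

A stylesheet in the preview file would then assign, for example, a grey background to `its-translate-no` so that protected content stands out at a glance.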

2.16.2 Data category usage

2.16.3 More Information and Implementation Status/Issues

Implementer: Logrus

Implementation status: The prototype will display the Translate, Localization Note, and Terminology data categories at the MultilingualWeb Workshop in March 2013.

2.17 ITS 2.0 in word processing software

2.17.1 Description

2.17.2 Data category usage

2.17.3 More Information and Implementation Status/Issues

ILO uses OKAPI capabilities for XLIFF handling and will be available in April 2013. The use of ILO will be presented at the MultilingualWeb Workshop in March 2013. The results of the ILO development will be made publicly available under the open license LGPL v3 (the same license as LibreOffice).

2.18 Training for Statistical Machine Translation

2.18.1 Description

Benefits:

2.18.2 Data category usage

2.18.3 More Information and Implementation Status/Issues

3 Authors and Implementation Contributors

Renat Bikmatov (Logrus), David Filip (University of Limerick), Leroy Finn (Trinity College Dublin), Karl Fritsche (Cocomore AG), Serge Gladkoff (Logrus), Declan Groves (Centre for Next Generation Localisation (CNGL), Dublin City University), Milan Karasek (Moravia), Jirka Kosek (University of Economics, Prague), Kevin Lew (Spartan Software), Dave Lewis (Trinity College Dublin), Fredrik Liden (ENLASO Corporation), Shaun McCane ((public) Invited expert), Sean Mooney (University of Limerick), Pablo Nieto Caride (Linguaserve), Pēteris Ņikiforovs (Tilde), David O'Carrol (University of Limerick), Philip O'Duffy (University of Limerick), Mauricio del Olmo (Linguaserve), Mārcis Pinnis (Tilde), Phil Ritchie (VistaTEC), Nieves Sande (German Research Center for Artificial Intelligence (DFKI) GmbH), Felix Sasaki (W3C Fellow), Yves Savourel (ENLASO Corporation), Sebastian Sklarß (]init[ Europe), Ankit Srivastava (Centre for Next Generation Localisation (CNGL), Dublin City University), Tadej Štajner (Jozef Stefan Institute), Chase Tingley (Spartan Software), Asanka Wasala (University of Limerick), Clemens Weins (Cocomore AG).