Metadata for the Multilingual Web - Usage Scenarios and Implementations

Abstract

The W3C Internationalization Tag Set (ITS) 2.0, developed by the W3C MultilingualWeb-LT Working Group, enhances the foundation for integrating automated processing of human language into core Web technologies. ITS 2.0 has much in common with its predecessor, ITS 1.0, but provides additional concepts designed to foster the automated creation and processing of multilingual Web content. ITS 2.0 focuses on HTML and on XML-based formats in general, and can leverage processing based on the XML Localization Interchange File Format (XLIFF) as well as the Natural Language Processing Interchange Format (NIF).

The W3C MultilingualWeb-LT Working Group received funding from the European Commission (project LT-Web) through the Seventh Framework Programme (FP7) in the area of Language Technologies (Grant Agreement No. 287815). As part of their activities, members of the Working Group and the LT-Web project created various implementations that exemplify how ITS 2.0 supports the integration of automated processing of human language into core Web technologies. These implementations and the corresponding usage scenarios are sketched in this document. Each section of the document comprises the following:

Description - An explanation of the scenario
Data category usage - An explanation of which ITS 2.0 data categories are involved in the automated processing (for details on the data categories, consult W3C Internationalization Tag Set 2.0)
Benefits - Reasons why the ITS 2.0 data categories enable or enhance the automated processing
Information on Implementation Status/Issues - Links to tools and implementers (detailed information, running software, source code etc.)

Status of this Document

This section describes the status of this document at the time of its publication. Other documents may supersede this document. A list of current W3C publications and the latest revision of this technical report can be found in the W3C technical reports index at http://www.w3.org/TR/.

This document describes usage scenarios and related implementations for Internationalization Tag Set (ITS) 2.0. ITS 2.0 enhances the foundation to integrate both automated and manual processing of human language into core Web technologies.

The work described in this document receives funding from the European Commission (project MultilingualWeb-LT (LT-Web)) through the Seventh Framework Programme (FP7) in the area of Language Technologies (Grant Agreement No. 287815).

This document is a First Public Working Draft published by the MultilingualWeb-LT Working Group, part of the W3C Internationalization Activity. The Working Group expects to advance this Working Draft to Working Group Note (see W3C document maturity levels).

By publishing this Working Draft, the Working Group does not express any consensus about the implementation approach, the use cases described, or the proposed metadata items. The main purpose of this publication is to gather feedback from a wider audience.

Feedback about the content of this document is encouraged. Send your comments to public-multilingualweb-lt-comments@w3.org. Use "Comment on Multilingual Web metadata usage scenarios and implementations WD" in the subject line of your email. The archives for this list are publicly available.

Publication as a Working Draft does not imply endorsement by the W3C Membership. This is a draft document and may be updated, replaced or obsoleted by other documents at any time. It is inappropriate to cite this document as other than work in progress.

This document was produced by a group operating under the 5 February 2004 W3C Patent Policy. The group does not expect this document to become a W3C Recommendation. W3C maintains a public list of any patent disclosures made in connection with the deliverables of the group; that page also includes instructions for disclosing a patent. An individual who has actual knowledge of a patent which the individual believes contains Essential Claim(s) must disclose the information in accordance with section 6 of the W3C Patent Policy.

1 Introduction

The W3C MultilingualWeb-LT Working Group received funding from the European Commission (project MultilingualWeb-LT (LT-Web)) through the Seventh Framework Programme (FP7) in the area of Language Technologies (Grant Agreement No. 287815). As part of their activities, project members and members of the Working Group compiled a list of usage scenarios that exemplify how ITS 2.0 integrates automated processing of human language into core Web technologies. These usage scenarios, and the implementations realized by the Working Group, are sketched in this document. Each usage scenario comprises a description, the ITS 2.0 data categories involved, and information on implementation status and issues.

2 Usage scenarios

2.1 Simple Machine Translation

2.1.1 Description

2.1.2 Data category usage

2.1.3 More Information and Implementation Status/Issues

2.2 Translation Package Creation

2.2.1 Description

2.2.2 Data category usage

2.2.3 More Information and Implementation Status/Issues

2.3 Quality Check

2.3.1 Description

2.3.2 Data category usage

2.3.3 More Information and Implementation Status/Issues

2.4 Processing HTML5 documents with an XML tool chain

2.4.1 Description

2.4.2 Data category usage

2.4.3 More Information and Implementation Status/Issues

2.5 Validating HTML5 with ITS 2.0 metadata

2.5.1 Description

2.5.2 Data category usage

2.5.3 More Information and Implementation Status/Issues

2.6 Interchange between Content Management System and Translation Management System

2.6.1 Description

2.6.2 Data category usage

2.6.3 More Information and Implementation Status/Issues

2.7 Content Internationalization and Advanced Machine Translation

2.7.1 Description

2.7.2 Data category usage

2.7.3 More Information and Implementation Status/Issues

2.8 Using ITS 2.0 with GNU gettext utilities/PO files

2.8.1 Description

2.8.2 Data category usage

2.8.3 More Information and Implementation Status/Issues

2.9 Harnessing ITS 2.0 Metadata to Improve the Human Review Process

2.9.1 Description

2.9.2 Data category usage

2.9.3 More Information and Implementation Status/Issues

2.10 XLIFF-based Machine Translation

2.10.1 Description

2.10.2 Data category usage

2.10.3 More Information and Implementation Status/Issues

2.11 XLIFF-based CMS-to-TMS Roundtripping for HTML&XML

2.11.1 Description

2.11.2 Data category usage

2.11.3 More Information and Implementation Status/Issues

Implementer: TCD/UL, making use of MT components by Moravia and DCU, and of JSI Enrycher as the text analysis service.

Although all ITS data categories listed above, as encoded by OKAPI or TCD's CMS-LION, are covered, the demos in mid-March mainly show consumption of the following: translate, term, text analysis, domain, localization note, provenance, and MT confidence. The demos involve:

Please note that links to the running software are currently accessible only to the SOLAS system. They should become public next week.

2.12 ITS 2.0 for localization of content in a Web Content Management System

2.12.1 Description

2.12.2 Data category usage

2.12.3 More Information and Implementation Status/Issues

2.13 Integrating ITS 2.0, Content Management Interoperability Services, and W3C Provenance

2.13.1 Description

Localization interoperability can be enhanced by using other standards in addition to ITS 2.0. In particular, the following standards provide additional opportunities:

2.13.2 Data category usage

Where available, and not already specified by explicit ITS provenance annotation, annotatorsRef was used to derive PROV-O agent details for specific activities, e.g. text analysis and terminology.
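The derivation described above can be illustrated with a minimal sketch. It assumes the ITS 2.0 annotatorsRef syntax (a space-separated list of `data-category|IRI` pairs) and emits Turtle-style PROV-O statements as plain strings. The function names, the `ex:` prefix, the activity naming scheme, and the example IRIs are hypothetical and are not taken from the implementation described in this section.

```python
from __future__ import annotations


def parse_annotators_ref(value: str) -> dict[str, str]:
    """Split an ITS 2.0 annotatorsRef value into {data category: tool IRI}."""
    tools = {}
    for entry in value.split():
        # Each entry has the form "data-category-identifier|IRI".
        category, _, iri = entry.partition("|")
        if category and iri:
            tools[category] = iri
    return tools


def to_prov_statements(activity_id: str, annotators_ref: str) -> list[str]:
    """Emit Turtle-style PROV-O statements, one agent per annotating tool.

    The "ex:" prefix and the "<activity_id>-<category>" naming are assumptions
    made for this illustration only.
    """
    statements = []
    for category, iri in sorted(parse_annotators_ref(annotators_ref).items()):
        activity = f"ex:{activity_id}-{category}"
        statements.append(f"<{iri}> a prov:SoftwareAgent .")
        statements.append(
            f"{activity} a prov:Activity ; prov:wasAssociatedWith <{iri}> ."
        )
    return statements


# Hypothetical annotatorsRef value naming two tools.
example = (
    "text-analysis|http://example.org/ta-tool "
    "terminology|http://example.org/term-tool"
)
for line in to_prov_statements("doc1", example):
    print(line)
```

In this sketch each data category yields one activity and one software agent; an explicit ITS provenance annotation, where present, would take precedence and is not handled here.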

2.13.3 More Information and Implementation Status/Issues

2.14 Text Analysis - Named Entity Recognition and Enrichment

2.14.1 Description

2.14.2 Data category usage

2.14.3 More Information and Implementation Status/Issues

2.15 Automated Terminology Annotation

2.15.1 Description

Benefits: The Web Service API can be integrated in automated language processing workflows, for instance, machine translation, localization, terminology management and many other tasks that may benefit from terminology annotation.

2.15.2 Data category usage

2.15.3 More Information and Implementation Status/Issues

The implementation has reached Milestone 2 (Initial HTML5 term tagging with simple visualization). The implementation of Milestone 3 (Enhanced HTML5 term tagging with full visualization) is ongoing.
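As an illustration of what HTML5 term tagging can look like, the following minimal sketch wraps known terms in `span` elements carrying the ITS 2.0 HTML attributes `its-term` and `its-term-info-ref`. It is not the Web Service described here: the function name, the dictionary-based term list, and the example IRI are assumptions, and matching is a simple case-sensitive whole-word lookup.

```python
from __future__ import annotations

import html
import re


def tag_terms(text: str, terms: dict[str, str]) -> str:
    """Wrap known terms in <span its-term="yes"> elements.

    `terms` maps a term to a (hypothetical) IRI describing it, which is
    emitted as the its-term-info-ref attribute.
    """
    if not terms:
        return html.escape(text)
    # Longest terms first so that multi-word terms win over their substrings.
    ordered = sorted(terms, key=len, reverse=True)
    pattern = re.compile(r"\b(" + "|".join(re.escape(t) for t in ordered) + r")\b")
    parts = []
    last = 0
    for match in pattern.finditer(text):
        # Escape the untouched text between the previous match and this one.
        parts.append(html.escape(text[last:match.start()]))
        term = match.group(1)
        ref = html.escape(terms[term], quote=True)
        parts.append(
            f'<span its-term="yes" its-term-info-ref="{ref}">'
            f"{html.escape(term)}</span>"
        )
        last = match.end()
    parts.append(html.escape(text[last:]))
    return "".join(parts)


print(tag_terms(
    "The translation memory is large.",
    {"translation memory": "http://example.org/terms#tm"},
))
```

A production service would typically add linguistic processing (for example lemmatization) and could also emit `its-term-confidence`; both are left out of this sketch.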

2.16 Universal Preview of ITS 2.0 Metadata in XML, XLIFF, and HTML Files

2.16.1 Description

XML-based source content such as XLIFF files is usually provided to translators or reviewers as reduced and partially transformed text, without any information about local or global context and without support for rendering or visualizing the content itself or the metadata embedded in it. This has negative effects on the quality of the final output and on the productivity of human workers.

The usage scenario allows content and metadata to be rendered in a browser, so that they can be read easily and interactively as reference material. The rendering includes special visual cues and interaction possibilities (such as colour-coding and pop-ups that display metadata). It is based on auxiliary files in HTML5+ITS 2.0 (including JavaScript) that are generated from ITS-annotated source content in any of the supported formats (XML, XLIFF, HTML).
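The generation of such an auxiliary HTML5 file can be sketched as follows. This is a simplified illustration, not the prototype itself: it assumes local ITS markup in XML (`its:translate`, `its:locNote`, `its:term`), maps every source element to a `span`, and uses hypothetical CSS class names that a stylesheet could use for colour-coding. The localization note is copied into `title` so that a browser shows it as a tooltip.

```python
import html
import xml.etree.ElementTree as ET

ITS_NS = "http://www.w3.org/2005/11/its"


def render_preview(element: ET.Element) -> str:
    """Render an ITS-annotated XML element as an HTML5 <span> with visual cues."""
    attrs = []
    classes = []  # hypothetical class names, intended for colour-coding via CSS
    translate = element.get(f"{{{ITS_NS}}}translate")
    if translate in ("yes", "no"):
        attrs.append(f'translate="{translate}"')
        classes.append(f"its-translate-{translate}")
    note = element.get(f"{{{ITS_NS}}}locNote")
    if note is not None:
        escaped = html.escape(note, quote=True)
        attrs.append(f'its-loc-note="{escaped}"')
        attrs.append(f'title="{escaped}"')  # shown as a tooltip in browsers
        classes.append("its-has-note")
    if element.get(f"{{{ITS_NS}}}term") == "yes":
        attrs.append('its-term="yes"')
        classes.append("its-is-term")
    if classes:
        attrs.insert(0, 'class="' + " ".join(classes) + '"')
    # Recurse into children, preserving the text that follows each child.
    body = html.escape(element.text or "")
    for child in element:
        body += render_preview(child) + html.escape(child.tail or "")
    open_tag = "<span" + "".join(" " + a for a in attrs) + ">"
    return f"{open_tag}{body}</span>"


source = (
    f'<p xmlns:its="{ITS_NS}">Press <code its:translate="no">Ctrl</code> to '
    '<b its:locNote="Button label" its:term="yes">save</b>.</p>'
)
print(render_preview(ET.fromstring(source)))
```

Global ITS rules, inheritance of data category values, and the JavaScript needed for interactive pop-ups are deliberately out of scope for this sketch.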

2.16.2 Data category usage

2.16.3 More Information and Implementation Status/Issues

Implementation status: The prototype will display the Translate, Localization Note, and Terminology data categories at the MultilingualWeb Workshop in March 2013.

2.17 ITS 2.0 in word processing software

2.17.1 Description

2.17.2 Data category usage

2.17.3 More Information and Implementation Status/Issues

ILO uses OKAPI capabilities for XLIFF handling and will be available in April 2013. The use of ILO will be presented at the MultilingualWeb Workshop in March 2013. The results of ILO development will be released to the public under the open LGPL v3 license (the same license as LibreOffice).

2.18 Training for Statistical Machine Translation

2.18.1 Description

2.18.2 Data category usage

2.18.3 More Information and Implementation Status/Issues

3 Authors and Implementation Contributors

Renat Bikmatov (Logrus), David Filip (University of Limerick), Leroy Finn (Trinity College Dublin), Karl Fritsche (Cocomore AG), Serge Gladkoff (Logrus), Declan Groves (Centre for Next Generation Localisation (CNGL), Dublin City University), Milan Karasek (Moravia), Jirka Kosek (University of Economics, Prague), Kevin Lew (Spartan Software), Dave Lewis (Trinity College Dublin), Fredrik Liden (ENLASO Corporation), Shaun McCane ((public) Invited expert), Sean Mooney (University of Limerick), Pablo Nieto Caride (Linguaserve), Pēteris Ņikiforovs (Tilde), David O'Carrol (University of Limerick), Philip O'Duffy (University of Limerick), Mauricio del Olmo (Linguaserve), Mārcis Pinnis (Tilde), Phil Ritchie (VistaTEC), Nieves Sande (German Research Center for Artificial Intelligence (DFKI) GmbH), Felix Sasaki (W3C Fellow), Yves Savourel (ENLASO Corporation), Sebastian Sklarß (]init[ Europe), Ankit Srivastava (Centre for Next Generation Localisation (CNGL), Dublin City University), Tadej Štajner (Jozef Stefan Institute), Chase Tingley (Spartan Software), Asanka Wasala (University of Limerick), Clemens Weins (Cocomore AG).