Abstract
The W3C Internationalization Tag Set 2.0 (ITS 2.0), developed by the W3C MultilingualWeb-LT Working Group, enhances the foundation for integrating automated processing of human language into core Web technologies. ITS 2.0 has much in
common with its predecessor ITS 1.0 but provides additional concepts that are designed to foster the automated creation and processing of multilingual Web content.
ITS 2.0 focuses on HTML and XML-based formats in general, and can leverage processing based on the XML Localization Interchange
File Format (XLIFF), as well as the Natural Language Processing Interchange Format (NIF).
The W3C MultilingualWeb-LT Working Group received funding from the European Commission (project LT-Web) through the Seventh Framework Programme (FP7) in the area of Language Technologies (Grant Agreement No. 287815). As part
of their activities, members of the Working Group and the LT-Web project created various implementations that exemplify how
ITS 2.0 supports the integration of automated processing of human language into core Web technologies. These implementations and the corresponding
usage scenarios are sketched in this document. Each section of the document comprises the following:
- Description - An explanation of the scenario
- Data category usage - An explanation of which ITS 2.0 data categories are involved in the automated processing (for
details on the data categories, see W3C Internationalization Tag Set 2.0)
- Benefits - Reasons why the ITS 2.0 data categories enable or enhance the automated processing
- Information on Implementation Status/Issues - Links to tools and implementers (detailed information, running software, source
code etc.)
Status of this Document
This section describes the status of this document at the time of its
publication. Other documents may supersede this document. A list of current W3C
publications and the latest revision of this technical report can be found in
the W3C technical reports index at
http://www.w3.org/TR/.
This document describes usage scenarios and related implementations for Internationalization Tag Set (ITS) 2.0. ITS 2.0 enhances the foundation to integrate both automated and manual processing of human language into core Web technologies.
The work described in this document receives funding from the European Commission (project MultilingualWeb-LT (LT-Web)) through the Seventh Framework Programme (FP7) in the area of Language Technologies (Grant Agreement No. 287815).
This document is a First Public Working Draft published by the MultilingualWeb-LT
Working Group, part of the W3C Internationalization
Activity. The Working Group expects to advance this Working Draft to Working
Group Note (see W3C
document maturity levels).
By publishing this working
draft the working group does not express any consensus about the implementation
approach, the use cases described or the proposed metadata items. The main purpose
of this publication is to gather feedback from a wider audience.
Feedback about the content of this document is encouraged.
Send your comments to public-multilingualweb-lt-comments@w3.org. Use "Comment on Multilingual Web metadata usage scenarios and implementations WD" in the subject line of your email. The
archives for this list are publicly available.
Publication as a Working Draft does not imply endorsement by the W3C Membership. This
is a draft document and may be updated, replaced or obsoleted by other documents at
any time. It is inappropriate to cite this document as other than work in
progress.
This document was produced by a group operating under the 5 February 2004 W3C Patent Policy. The group does not expect this document to become a W3C Recommendation. W3C maintains a public list of any patent disclosures made in connection with the deliverables of the group; that page also includes instructions for disclosing a patent. An individual who has actual knowledge of a patent which the individual believes contains Essential Claim(s) must disclose the information in accordance with section 6 of the W3C Patent Policy.
1 Introduction
The W3C Internationalization Tag Set 2.0 (ITS 2.0), developed by the W3C MultilingualWeb-LT Working Group, enhances the foundation for integrating automated processing of human language into core Web technologies. ITS 2.0 has much in
common with its predecessor ITS 1.0 but provides additional concepts that are designed to foster the automated creation and processing of multilingual Web content.
ITS 2.0 focuses on HTML and XML-based formats in general, and can leverage processing based on the XML Localization Interchange
File Format (XLIFF), as well as the Natural Language Processing Interchange Format (NIF).
The W3C MultilingualWeb-LT Working Group received funding from the European Commission (project MultilingualWeb-LT (LT-Web)) through the Seventh Framework Programme (FP7) in the area of Language Technologies (Grant Agreement No. 287815). As part
of their activities, project members and members of the Working Group compiled a list of usage scenarios that exemplify how
ITS 2.0 integrates automated processing of human language into core Web technologies. These usage scenarios - and implementations
realized by the Working Group - are sketched in this document. The usage scenarios comprise information such as the following:
- Description - An explanation of the scenario
- Data category usage - An explanation of how the individual ITS 2.0 data categories are involved in the automated processing
(for details on the data categories, see W3C Internationalization Tag Set 2.0)
- Benefits - Reasons why the ITS 2.0 data categories enable or enhance the automated processing
- Information on Implementation Status/Issues - Links to tools and implementers (detailed information, running software, source
code etc.)
2 Usage scenarios
2.1 Simple Machine Translation
2.1.1 Description
- Translate XML and HTML5 content via a Machine Translation (MT) system such as Microsoft Translator.
- The parts of the content that should be translated are first extracted based on ITS 2.0 markup. The extracted parts are sent
to the MT system. After translation, the translated content is merged back with the parts that are not translation-relevant
(recreating the original XML or HTML5 format).
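The extract, translate and merge cycle described above can be illustrated with a minimal Python sketch. It is not the Okapi implementation: it only honors local its:translate attributes (global rules, inline elements and attribute values are ignored), and the function name translate_in_place and the mt callable are hypothetical stand-ins for a real MT client.

```python
import xml.etree.ElementTree as ET

ITS_NS = "http://www.w3.org/2005/11/its"
TRANSLATE = f"{{{ITS_NS}}}translate"


def translate_in_place(elem, mt, inherited=True):
    """Replace translatable text in elem (recursively) with mt(text).

    mt is any callable str -> str standing in for an MT system (assumption).
    The translate flag is inherited by child elements unless overridden locally.
    """
    flag = elem.get(TRANSLATE)
    translatable = inherited if flag is None else flag == "yes"
    if translatable and elem.text and elem.text.strip():
        elem.text = mt(elem.text)
    for child in elem:
        translate_in_place(child, mt, translatable)
        # Tail text sits after the child but belongs to the current element.
        if translatable and child.tail and child.tail.strip():
            child.tail = mt(child.tail)
```

Because protected parts are never handed to the mt callable, the original document structure is preserved without a separate merge step; a production pipeline would extract text units and merge them back explicitly.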
Benefits:
- The ITS 2.0 markup provides key information to drive the reliable extraction of translation-relevant content from both XML
and HTML5.
- Processing details such as the need to preserve white space can be passed on.
2.1.2 Data category usage
- Translate - Parts that are not translation-relevant are marked (and protected).
- Locale Filter - Only the parts that pass the locale filter are extracted. The other parts are treated as 'do not translate'
content.
- Element Within Text - Elements are either extracted as in-line codes or as sub-flows.
- Preserve Space - Extracted parts/text units can be annotated with the information that whitespace is relevant and thus needs
to be preserved.
- Domain - Domain values are placed into a property that can be used to select an MT system and/or to provide domain-related
metadata to an MT system.
2.1.3 More Information and Implementation Status/Issues
Tool: Okapi Framework (ENLASO).
Implementation status/issues:
- Only the first occurrence of the Domain value triggers the selection of the engine.
- Preserve Space is currently not respected by the engine.
2.2 Translation Package Creation
2.2.1 Description
- Create a Translation Package in OASIS XML Localization Interchange File Format (XLIFF) from XML or HTML5 content.
- Based on its ITS 2.0 metadata, the content goes through a processing pipeline (e.g. extraction of translation-relevant parts).
At the end of the pipeline, an XLIFF package is stored.
Benefits:
- The ITS 2.0 markup provides key information to drive the reliable extraction of translation-relevant content from both XML
and HTML5.
- Processing details such as the need to preserve white space can be passed on.
- Efficient version comparison and leveraging of existing translations is possible.
- Information such as the domain of the content, external references, or localization notes is made available in the XLIFF package.
Thus, any XLIFF-enabled tool can make use of this information to provide translation assistance.
- Terms in the source content are marked, and thus can be matched against a terminology database.
- Constraints about storage size and allowed characters help to meet physical requirements.
2.2.2 Data category usage
- Translate - Parts that are not translation-relevant are marked (and protected).
- Locale Filter - Only the parts that pass the locale filter are extracted. The other parts are treated as 'do not translate'
content.
- Element Within Text - Elements are either extracted as in-line codes or as sub-flows.
- Preserve Space - The information is mapped to xml:space
- Id Value - The value is connected to the name of the extracted text unit.
- Domain - Values are placed into a corresponding okp:itsDomain attribute.
- Storage Size - The information is placed in native ITS 2.0 markup.
- External Resource - The URI is placed in a corresponding okp:itsExternalResourceRef attribute.
- Terminology - The information about terminology is placed in a special XLIFF note element.
- Localization Note - The text is placed in an XLIFF note.
- Allowed Characters - The pattern is placed in its:allowedCharacters.
2.2.3 More Information and Implementation Status/Issues
Tool: Okapi Framework (ENLASO).
Implementation status/issues:
- ITS to XLIFF and XLIFF to ITS mapping needs to be finalized
2.3 Quality Check
2.3.1 Description
- Load XML, HTML5 and XLIFF content for which ITS 2.0 metadata exists into a tool that performs different kinds of quality
checks (CheckMate, a tool for checking quality).
- The XML and HTML5 content is processed based on its ITS 2.0 properties. The constraints defined with ITS 2.0 are verified
by CheckMate.
- The XLIFF content is processed based on its ITS 2.0 properties. The constraints defined with ITS 2.0 are verified by CheckMate.
Benefits:
- The ITS 2.0 markup provides key information to drive the reliable extraction of translation-relevant content from both XML
and HTML5.
- The ITS 2.0 markup provides key information to drive quality-related checks.
- The ITS 2.0 markup allows all different file formats to be handled in the same way by the quality checking tool.
2.3.2 Data category usage
- Translate - Parts that are not translation-relevant are marked (and protected).
- Locale Filter - Only the parts that pass the locale filter are extracted. The other parts are treated as 'do not translate'
content.
- Element Within Text - Elements are either extracted as in-line codes or as sub-flows.
- Preserve Space - The information is mapped to the preserveSpace field in the extracted text unit.
- Id Value - The ids are used to identify all entries with an issue.
- Storage Size - The content is verified against the storage size constraints.
- Allowed Characters - The content is verified against the pattern matching allowed characters.
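The Storage Size and Allowed Characters checks can be sketched as follows. This is an illustrative sketch, not CheckMate's code. It assumes that the storage size is a byte count in a given encoding (UTF-8 by default) and that the allowed characters pattern is a regular expression matching a single permitted character; the function names are hypothetical.

```python
import re


def fits_storage_size(text, max_bytes, encoding="utf-8"):
    """Return True if text, encoded with encoding, needs at most max_bytes bytes."""
    return len(text.encode(encoding)) <= max_bytes


def disallowed_characters(text, pattern):
    """Return the characters of text that do not match the single-character pattern."""
    allowed = re.compile(pattern)
    return [ch for ch in text if not allowed.fullmatch(ch)]
```

For instance, disallowed_characters("abc*", "[^*+]") reports the asterisk, and fits_storage_size("Größe", 5) fails because the two non-ASCII letters each take two bytes in UTF-8.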
2.3.3 More Information and Implementation Status/Issues
Tool: Okapi Framework (ENLASO).
Implementation status/issues:
- Okapi's quality checker step does not map its warning levels properly to the ITS severity values.
2.4 Processing HTML5 documents with an XML tool chain
2.4.1 Description
- Turn HTML5 with "its-" attributes into XHTML with "its:" prefixes.
Benefits:
- Allows processing of HTML5 documents with XML tools.
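The core of such a conversion is the renaming of attributes. The sketch below assumes the usual naming correspondence between the two serializations (the HTML5 form is "its-" plus the hyphenated lowercase name, the XML form is "its:" plus the camelCase name). It does not cover native HTML attributes such as translate, lang, dir or id, which need separate handling, and the function name is hypothetical.

```python
def html_to_xml_its_name(name):
    """Map an HTML5 ITS attribute name such as 'its-loc-note' to 'its:locNote'.

    Names that do not start with 'its-' are returned unchanged.
    """
    if not name.startswith("its-"):
        return name
    head, *rest = name[len("its-"):].split("-")
    return "its:" + head + "".join(part.capitalize() for part in rest)
```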
2.4.2 Data category usage
- All data categories are covered.
2.4.3 More Information and Implementation Status/Issues
2.5 Validating HTML5 with ITS 2.0 metadata
2.5.1 Description
- W3C uses validator.nu as an experimental validator for HTML5. For HTML5 with ITS 2.0 metadata, validator.nu generates errors,
since "its-" attributes are not valid HTML5.
- The software allows validation of HTML5+ITS 2.0 with validator.nu (soon to be deployed as HTML5+ITS 2.0 validator at W3C
validation service)
Benefits:
- Allows the validation of HTML5 documents which include ITS 2.0 markup.
- Detects errors in ITS 2.0 markup for HTML5.
2.5.2 Data category usage
- All data categories are covered
2.5.3 More Information and Implementation Status/Issues
2.6 Interchange between Content Management System and Translation Management System
2.6.1 Description
- Content is roundtripped between a Content Management System (CMS) and Translation Management System (TMS).
- The content originates in a CMS, and gets exposed/serialized as XHTML + ITS 2.0. This is sent to a TMS, and processed in
a workflow. Upon completion, the TMS exposes/serializes localized/translated XHTML + ITS 2.0 to the CMS.
- See ITS 2.0 for localization of content in a Web Content Management System for the description of the CMS side
Benefits:
- Facilitated coupling/interoperability between CMS and TMS.
- Cost and quality benefits for Language Service Buyer (CMS side) and Language Service Provider (TMS side).
- Language Service Buyer has more control of the localization workflow via ITS 2.0 metadata
- Automatic (e.g. via data category "Translate")
- Semiautomatic (e.g. via data category "Domain")
- Manual (e.g. via data category "Localization Note")
2.6.2 Data category usage
- Translate (global and local usage) - Parts that are not translation-relevant are marked (and protected).
- Localization Note (global and local usage) - Provide additional information for process managers, translators and reviewers
to facilitate processing.
- Domain (global usage) -
- Provide additional information for process managers, translators and reviewers to facilitate processing.
- Control workflow dimensions such as selection of dictionaries and translation memories on the TMS side.
- Language Information (local usage) - Control workflow dimensions such as selecting suitable translators and reviewers. Also
adds context information that helps to decide whether a piece of content should be translated.
- Allowed Characters (local usage) - The content is verified against the pattern matching allowed characters to ensure that
on the TMS side, no inappropriate characters become part of the content (e.g. due to work of a translator).
- Storage Size (local usage) - The content is verified against the storage size constraints to ensure that on the TMS side,
no capacity limitations related to the content are violated (e.g. due to a lengthy translation).
- Provenance (local usage) - Allows tracking of human agents or software agents that processed the content on the TMS side.
In the case of updates, provenance/tracking information will enable the TMS side to assign or propose the same human agents
(translators, or reviewers) that participated in the initial processing.
Additional data category (not part of ITS 2.0):
- Readiness (global usage) - Provides information to translation process managers (examples: When was the content ready
to be processed? What is the deadline? What is the priority? Which service/process variant is relevant?)
2.6.3 More Information and Implementation Status/Issues
Tools (developed by Linguaserve):
Implementation status:
- Successfully tested roundtripping Drupal XHTML files utilizing supported ITS 2.0 data categories in workflow
- Used in productive translation
Implementation issues:
- Compliant implementation of ITS 2.0 global rules not finished yet
2.7 Content Internationalization and Advanced Machine Translation
2.7.1 Description
- Enable an HTML5 content reviser (language editor, translation post-editor) to add ITS 2.0 metadata to the contents of web
documents.
- Use the ITS 2.0 metadata to control the behavior of different Machine Translation (MT) systems and a Multilingual Publication
System.
- Covers post-editing of translations generated by MT.
Benefits:
- provides key information to drive the reliable extraction of translation-relevant content from HTML5;
- helps to control workflow dimensions such as selection of domain-specific vocabulary to improve the Machine Translation results;
- provides information for post-editing.
2.7.2 Data category usage
- Translate - Parts that are not translation-relevant are marked (and protected).
- Localization Note - Provides additional information for language or translation editors to facilitate translation.
- Language Information - Controls workflow dimensions such as setting the source language and the target language (via the
lang attribute of the output). It also prevents translation of content whose lang attribute differs from the
source language.
- Domain - Domain values are mapped to the domains used by the individual MT systems, and used to select the appropriate vocabulary.
- Provenance - Allows tracking of human agents (language or translation editors) or software agents (MT systems) that processed
the content.
- Localization Quality Issue - Can be provided for the translated content by the reviser. Can be utilized for example by MT
developers to improve the MT System.
- Locale Filter - Reveals that content is only relevant for certain locales (useful in localization).
- MT Confidence - Assesses the confidence in the quality of the translation generated by the MT system.
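The mapping of ITS Domain values to engine-specific domains can be pictured as a simple lookup. The sketch below is illustrative only: the table contents, the domain labels and the fallback value are hypothetical and are not taken from any of the MT systems named in this section.

```python
# Hypothetical mapping from ITS domain values to one engine's domain labels.
ENGINE_DOMAINS = {"automotive": "AUTO", "cars": "AUTO", "medical": "MED"}


def map_domains(its_domains, table, default="GENERAL"):
    """Return engine domain labels for ITS domain values, in order, without duplicates.

    Unknown values are skipped; if nothing maps, the default label is returned.
    """
    mapped = []
    for value in its_domains:
        target = table.get(value.strip().lower())
        if target and target not in mapped:
            mapped.append(target)
    return mapped or [default]
```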
2.7.3 More Information and Implementation Status/Issues
Tools:
- Real Time Multilingual Publication System ATLAS PW1 (Linguaserve).
- Statistical MT System MaTrEx (DCU).
- Rule-based MT System (LucySoftware).
Implementation issues:
- Implementation of ITS 2.0 translate data category for attributes currently restricted to global rules
2.8 Using ITS 2.0 with GNU gettext utilities/PO files
2.8.1 Description
- The GNU gettext utilities assist in internationalizing and translating in the context of UNIX-like Operating Systems. The
file format of the utilities is the GNU gettext portable object (PO) file format.
- The implementation - ITS Tool - enables roundtripping between PO files and XML formats like Mallard.
- ITS Tool includes default rules for various formats, and uses them for PO file generation.
- ITS Tool is aware of various ITS 2.0 data categories in the PO file generation step.
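To illustrate how ITS 2.0 information can surface in a PO file, the following sketch formats a single PO entry in which a localization note becomes an extracted comment ("#." line). It is not ITS Tool's code; the function names and the choice of comment type are assumptions made for illustration.

```python
def po_escape(text):
    """Escape backslashes, double quotes and newlines for a PO string literal."""
    return text.replace("\\", "\\\\").replace('"', '\\"').replace("\n", "\\n")


def po_entry(msgid, loc_note=None, source_ref=None):
    """Build one untranslated PO entry with optional comment lines."""
    lines = []
    if loc_note:
        # Extracted comments carry notes for the translator.
        lines += [f"#. {line}" for line in loc_note.splitlines()]
    if source_ref:
        # Reference comments point back to the source location.
        lines.append(f"#: {source_ref}")
    lines.append(f'msgid "{po_escape(msgid)}"')
    lines.append('msgstr ""')
    return "\n".join(lines)
```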
2.8.2 Data category usage
- Preserve Space
- Locale Filter
- External Resource
- Translate
- Elements Within Text
- Localization Note
- Language Information
2.8.3 More Information and Implementation Status/Issues
Implementation status/issues:
- Need to convert built-in rules to new categories, and to deprecate extensions (not a conformance blocker).
- No support for its:param (blocked by lacking support for setting XPath variables in libxml2 Python bindings; patch pending review).
- No support for HTML. libxml2's HTML parser does not correctly handle HTML5. Need to evaluate other libraries.
2.9 Harnessing ITS 2.0 Metadata to Improve the Human Review Process
2.9.1 Description
- The implementation - the "Reviewer's Workbench" (a desktop application) - reads HTML, XML and XLIFF files annotated with
ITS 2.0 metadata.
- At each segment of the original content, the ITS metadata is made accessible to reviewers. Reviewers can adapt the access
via user-definable filter/formatting "rules". The metadata allows human reviewers to make efficient decisions.
- During the review of translations, reviewers can add Localization Quality Issue annotations (which are serialized as ITS
2.0 metadata when the file is saved). Provenance annotations are added in the background.
- The combination of captured Localization Quality Issue and Provenance data then becomes valuable data which can be used for
traditional business intelligence, or semantic web applications.
2.9.2 Data category usage
- Provenance
- Localization Quality Issue
Benefits:
- Increases review effectiveness as reviewers can be informed by metadata.
- Harvests data during review.
- Facilitates audit and quality correction.
2.9.3 More Information and Implementation Status/Issues
- Application development currently at alpha stage.
- Awaiting finalization of XLIFF mappings and underlying Okapi filter support.
- Application is closed source.
2.10 XLIFF-based Machine Translation
2.10.1 Description
- Invoke Machine Translation (MT) from a localization workflow using ITS 2.0 integrated with the XML Localization Interchange
File Format (XLIFF)
2.10.2 Data category usage
- Domain - The domain value can be used by the MT system to improve processing accuracy
- Translate - Parts that are not translation-relevant are marked (and protected).
- MT Confidence - Assesses the confidence in the quality of the translation generated by the MT system.
- Terminology - Forces the MT system to translate specific words or phrases according to terminological information
- Provenance - Allows tracking of human agents (content editors) or software agents (MT systems) that processed the content.
Benefits:
- The use of XLIFF allows an MT system to be integrated seamlessly into automated localization workflows involving commercial
Translation Management Systems and Computer Assisted Translation (CAT) tools.
- The use of XLIFF and ITS 2.0 facilitates the integration of/switch between multiple MT systems to provide alternative translation
within a single project workflow.
- The use of the ITS 2.0 "translate" attribute ensures that content is not altered by the MT system - especially if that content
is included in a translation project as context for human agents such as translation post-editors.
- The ITS 2.0 "domain" metadata in XLIFF ensures that the most relevant MT engine can be selected by the MT system.
- Combining XLIFF and ITS 2.0 "terminology" metadata forces the MT system to translate specific words or phrases according
to terminological information.
- Integrating ITS 2.0 MT confidence scores into XLIFF target language translation enables them to be presented to translation
post-editors.
- Recording provenance information enables localization managers to compare the performance of different MT engines or systems,
or different translation post-editors.
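One way MT confidence scores could be put to use for post-editors is to queue the least trusted segments first. The sketch below assumes that the scores have already been read from the XLIFF file and that they lie between 0 and 1; the Segment type, the function name and the threshold of 0.7 are hypothetical.

```python
from typing import NamedTuple


class Segment(NamedTuple):
    seg_id: str
    target: str
    mt_confidence: float  # assumed to lie in the interval [0, 1]


def post_editing_queue(segments, threshold=0.7):
    """Return segments whose confidence is below threshold, lowest confidence first."""
    low = [seg for seg in segments if seg.mt_confidence < threshold]
    return sorted(low, key=lambda seg: seg.mt_confidence)
```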
2.10.3 More Information and Implementation Status/Issues
2.11 XLIFF-based CMS-to-TMS Roundtripping for HTML&XML
2.11.1 Description
- SOLAS is a service-based architecture for orchestrating localization workflows among XLIFF-aware components.
- One of the SOLAS components is an additional Okapi-based Extractor/Merger service that maps ITS 2.0 categories onto XLIFF 1.2.
- SOLAS is also integrated with CMS-L10N and can receive/return XLIFF jobs created by CMS-L10N.
- CMS-L10N (aka LION) is basically a middleware component based on an RDF triple store over an arbitrary CMS (tested with Alfresco,
Drupal and Wikimedia).
- CMS-L10N can parse the source including most of the ITS 2.0 metadata and produce XLIFF 1.2 according to a currently agreed mapping.
After the roundtrip, which is handled via SOLAS, it updates the RDF triple store accordingly.
Benefits:
- The use of ITS 2.0 and XLIFF helps to modularize and connect specialized (single-purpose) components.
- SOLAS can handle and combine input from components that are aware of different ITS 2.0 categories or entirely unaware of ITS. SOLAS
orchestration ensures basic ITS compliance even with ITS-unaware components. For example, if a service provider is unaware of the
translate flag, SOLAS can filter the translation request for that provider, so that the flag is actually interpreted.
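The filtering idea can be sketched independently of SOLAS: only translatable fragments are sent to the ITS-unaware provider, and the protected fragments are reinserted afterwards. The function name, the fragment representation and the batch-style provider callable are assumptions for illustration, not the SOLAS interfaces.

```python
def translate_respecting_flag(fragments, provider):
    """Translate only fragments flagged as translatable.

    fragments: list of (text, translatable) pairs in document order.
    provider: callable taking a list of strings and returning their
              translations in the same order (stand-in for an ITS-unaware service).
    """
    to_send = [text for text, translatable in fragments if translatable]
    translated = iter(provider(to_send))
    # Reassemble in the original order, leaving protected fragments untouched.
    return [next(translated) if translatable else text
            for text, translatable in fragments]
```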
2.11.2 Data category usage
- Translate
- Localization Note
- Terminology
- Directionality
- Language Information
- Elements Within Text
- Domain
- Text Analysis
- Locale Filter
- Provenance
- External Resource
- Target Pointer
- Id Value
- Preserve Space
- Localization Quality Issue
- Localization Quality Rating
- MT Confidence
- Allowed Characters
- Storage Size
2.11.3 More Information and Implementation Status/Issues
Implementer: TCD/UL, making use of MT components by Moravia and DCU, and JSI Enrycher as Text Analysis service.
This tool is based on an ITS-XLIFF mapping:
- The mapping is currently under discussion.
- The goal is to freeze the mapping and to produce a best practice note within lifespan of the LT-Web project.
- The focus is currently on XLIFF 1.2, favoring solutions that can be structurally preserved in XLIFF 2.0, which is the main
target in the long run.
Although all ITS categories listed above, as encoded by OKAPI or TCD's CMS-LION, are covered, the demos in mid March show
consumption of mainly the following: translate, term, text analysis, domain, localization note, provenance, and MT confidence.
The demos involve:
- An XLIFF-based source quality assurance tool (LKR by UL)
- A Project Manager/Localization Engineer friendly XLIFF Viewer/Editor (LocConnect by UL)
- Integrated Machine Translation Solutions
- Moravia's implementation of M4Loc and Moses with ITS 2.0 support
- DCU MaTrEx with ITS 2.0 support
- Fallback handling of the ITS 2.0 information within SOLAS MT Service Mapper with services that are not ITS 2.0 aware, such
as Microsoft Bing
- Details (M4Loc processing of ITS2.0 enhanced XLIFF files):
- Running software: http://mlwlt.moravia.com (testing site)
- Running software (web-service): http://mlwlt.moravia.com/mlwlt-service-xliff-mt/mlwlt-service.asmx
- Source code: https://github.com/mkarasek/mlwlt-m4loc-xliff-mt
- General documentation: https://github.com/mkarasek/mlwlt-m4loc-xliff-mt/wiki
Please note that the links to the running software are currently accessible only to the SOLAS system. They should
become public next week.
2.12 ITS 2.0 for localization of content in a Web Content Management System
2.12.1 Description
- Drupal is a Web Content Management System (WCMS).
- The Drupal modules, developed by Cocomore,
- add the ability to apply ITS 2.0 local metadata through Drupal's WYSIWYG editor.
- add the ability to apply global ITS 2.0 metadata at content mode level.
- A jQuery plugin was implemented to optimize the GUI of the Translation Management tool (the plugin is also published as a standalone download).
Benefits:
- Support for ITS 2.0 in Drupal facilitates the localization/translation of Drupal-based content.
- The Drupal modules facilitate the roundtripping process from WCMS with systems of Localization Service Provider (including
automatic content re-integration).
- The Drupal modules enable tracking of provenance information (e.g. to identify translation post-editors).
2.12.2 Data category usage
- Translate - Mark content which should not be translated and highlight this marked content.
- Localization Note - Add a note for the translator to improve their understanding of the content and enable a better translation.
- Domain - Set the domain of a text to improve the machine and human translation process.
- Provenance - Check which translator/reviser worked on content.
- Allowed Characters/Storage Size - Make the translator aware of restrictions for specific content, such as disallowed characters
or a maximum length of a translation. These constraints are automatically set by Drupal.
- Text Analysis - Annotate text with terminology metadata to improve the machine and human translation process.
2.12.3 More Information and Implementation Status/Issues
Tool: Drupal Module for editing and viewing of ITS 2.0 markup (Cocomore AG)
Tool: Drupal Module to connect to TMGMT Translator Linguaserve (Cocomore AG)
Tool: Drupal Module to interact with TMGMT Workflow (Cocomore AG)
Tool: ITS 2.0 jQuery Plugin (Cocomore AG)
2.13 Integrating ITS 2.0, Content Management Interoperability Services, and W3C Provenance
2.13.1 Description
Localization interoperability can be enhanced by combining ITS 2.0 with other standards. In particular, the following standards
provide additional opportunities:
- OASIS Content Management Interoperability Services (CMIS) to externally associate multiple ITS 2.0 rules files with large sets
of documents, and to retrieve those documents regardless of the Content Management System in use
- W3C Provenance (PROV) to track which human agents or software agents processed the content; tracking can span multiple agents/components,
while allowing individual tracking records to be easily consolidated via linked data approaches
Benefits:
- Enables ITS 2.0 annotations to be associated with multiple documents via the CMS without editing individual files. This reduces
source content internationalization and document management costs. Furthermore, it reduces annotation errors.
- Allows fine-grained tracking and analysis of Language Technology (LT) components, human agents (language workers) and service
providers - even across multiple organizations, projects, and heterogeneous process landscapes. This reduces the overhead costs
in tracking, monitoring, analyzing and optimizing the localization workflows - especially of the critical elements within
them (e.g. MT engines, human terminologists and translators)
- Enables tracking of human linguistic judgments and their influence on the output of LT components. Tracking data can be
curated for retraining/retuning those LT components (e.g. Statistical Machine Translation or text analysis components)
- Tracking information can be mapped to the W3C PROV Ontology (PROV-O) which expresses the PROV Data Model using the OWL2 Web
Ontology Language (OWL2), and stored in Resource Description Framework (RDF) triple stores.
2.13.2 Data category usage
- Provenance - Tracks MT-based translation and translation revision through a post-editing interface. Tracking is implemented
as standoff provenance records in XLIFF files. The post-editing records detail which of the MT outputs was used if multiple
MT outputs are offered to the post-editor. The agent's ITS annotations (from translation and translation revision) are mapped
to PROV-O triples in the accompanying RDF provenance logs.
- Text analysis - Calls text analysis service (e.g. Enrycher) on source HTML file for Named Entity Recognition annotations.
These annotations are also mapped into XLIFF files. This annotation results in logging of activities performed on an 'analysed
text' entity in the PROV-O triple store.
- Terminology - Allows text annotated by Named Entity Recognition, as well as other phrases, to be identified as terms and
used to populate a multilingual glossary. If the text analysis annotation returns a DBpedia reference, a query for the label
used in the equivalent target language page can be attempted to populate the term target in the glossary. The terminology
annotation and the glossary are mapped to XLIFF as well as resulting in a 'term' entity being tracked in the PROV-O provenance
logs.
- MT Confidence - This is used to annotate - in XLIFF - the assumed quality of output of MT engines. MT Confidence is also
tracked for the translation entities generated by MT in the PROV-O logs.
- Domain - Mapped from HTML source document to XLIFF, and used to annotate PROV-O entities representing source units, i.e.
the source content of translation units.
- Translate - Mapped from HTML source document to XLIFF, and used to annotate PROV-O entities representing source units, i.e.
the source content of translation units.
Where available, and not already specified by explicit ITS provenance annotation, annotatorsRef was used to derive PROV-O
agent details for specific activities, e.g. text analysis and terminology.
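The kind of PROV-O statements that such a mapping might produce can be sketched with plain N-Triples strings. The PROV-O class and property names used below (prov:Entity, prov:Activity, prov:Agent, prov:wasGeneratedBy, prov:wasAssociatedWith, prov:wasAttributedTo) are part of the W3C PROV Ontology; the function name and the idea of passing ready-made URIs are assumptions, and the actual mapping used by the implementation may differ.

```python
PROV = "http://www.w3.org/ns/prov#"
RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"


def provenance_triples(entity, activity, agent):
    """Return N-Triples lines linking a translation entity, an activity and an agent.

    All three arguments are absolute URIs supplied by the caller.
    """
    def triple(subject, predicate, obj):
        return f"<{subject}> <{predicate}> <{obj}> ."

    return [
        triple(entity, RDF_TYPE, PROV + "Entity"),
        triple(activity, RDF_TYPE, PROV + "Activity"),
        triple(agent, RDF_TYPE, PROV + "Agent"),
        triple(entity, PROV + "wasGeneratedBy", activity),
        triple(activity, PROV + "wasAssociatedWith", agent),
        triple(entity, PROV + "wasAttributedTo", agent),
    ]
```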
2.13.3 More Information and Implementation Status/Issues
Details:
2.14 Text Analysis - Named Entity Recognition and Enrichment
2.14.1 Description
- Named entities (e.g. names of persons, places, or products) in HTML content are recognized using the Natural Language
Processing (NLP) tool Enrycher.
- The entities are enriched in the following ways:
- the identity is computed/disambiguated (so that for example London - England, and London - Ontario can be distinguished)
- a category (e.g. geographic name/place) is assigned
- Both the entity recognition and the enrichment generate markup which, among other things, allows tracking of the software agent/NLP
tool that was used
- Enriched, disambiguated content facilitates processing for source and target languages (among other reasons, because it provides
context to translators)
Benefits:
- The ITS 2.0 markup provides the key information about entities, so they can be correctly processed. Example: one may employ
specific translations, transliterations, officially mandated translations, or even keep the original.
- Content management systems may use disambiguated, enriched content for providing entity-centric browsing and retrieval functionality.
2.14.2 Data category usage
- Text Analysis - Mark fragments of content which mention named entities; enrich the content by additional information such
as a URI denoting the entity's identity.
- Text Analysis - Mark fragments of content with individual word meanings; enrich the content by additional information such
as a URI denoting the word's meaning.
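To make the Text Analysis usage concrete, the sketch below reads the ITS 2.0 HTML attributes its-ta-ident-ref and its-ta-class-ref from annotated content. It uses only the Python standard library. The class name and the sample markup are assumptions made for illustration, and this is not how Enrycher itself produces or consumes the annotations.

```python
from html.parser import HTMLParser

# Elements without an end tag; they must not affect the open-element stack.
VOID_ELEMENTS = {"area", "base", "br", "col", "embed", "hr", "img", "input",
                 "link", "meta", "source", "track", "wbr"}


class TextAnalysisCollector(HTMLParser):
    """Collect the text of elements that carry Text Analysis attributes."""

    def __init__(self):
        super().__init__()
        self.entities = []  # finished annotations
        self._open = []     # one entry per currently open element

    def handle_starttag(self, tag, attrs):
        if tag in VOID_ELEMENTS:
            return
        attrs = dict(attrs)
        if "its-ta-ident-ref" in attrs or "its-ta-class-ref" in attrs:
            self._open.append({
                "text": "",
                "ident_ref": attrs.get("its-ta-ident-ref"),
                "class_ref": attrs.get("its-ta-class-ref"),
            })
        else:
            self._open.append(None)

    def handle_endtag(self, tag):
        if tag in VOID_ELEMENTS or not self._open:
            return
        entry = self._open.pop()
        if entry is not None:
            self.entities.append(entry)

    def handle_data(self, data):
        # Text belongs to every annotated element that is currently open.
        for entry in self._open:
            if entry is not None:
                entry["text"] += data


# Hypothetical sample content with one annotated named entity.
sample = (
    '<p>Welcome to <span its-ta-ident-ref="http://dbpedia.org/resource/London" '
    'its-ta-class-ref="http://schema.org/Place">London</span>!</p>'
)
collector = TextAnalysisCollector()
collector.feed(sample)
collector.close()
```

A consumer such as a content management system could use the collected identity URIs as keys for entity-centric browsing. The sketch assumes well-nested markup.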
2.14.3 More Information and Implementation Status/Issues
Implementation issues and need for discussion:
- Implementation of NLP tools for providing the Domain data category annotations.
2.15 Automated Terminology Annotation
2.15.1 Description
- Term candidates in HTML5, XLIFF and plaintext are annotated by humans or software agents (automatic term candidate annotation).
- Automatic term candidate annotation can comprise:
- Term candidate recognition based on existing terminology resources (e.g., term banks, such as EuroTermBank or IATE)
- Term candidate identification based on unguided terminology extraction systems (e.g., ACCURAT Toolkit or TTC TermSuite)
- Content analysis and terminology mark-up are performed by a Web Service API with the following functionality:
- Support for ITS 2.0 metadata (Terminology, Language Information, Domain, Elements Within Text and Locale Filter data categories);
- Annotation of the content by the two above-mentioned methods. The API breaks the content down by language and domain
and uses terminology annotation services provided by the TaaS platform to identify terms and link them with the TaaS
platform.
- Visualization capabilities are provided for the annotated terminology allowing human users access to the annotation results.
Benefits:
The Web Service API can be integrated in automated language processing workflows, for instance, machine translation, localization,
terminology management and many other tasks that may benefit from terminology annotation.
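The term bank-based recognition described in 2.15.1 can be illustrated with a deliberately small sketch. It assumes plain-text input and an in-memory term list standing in for a term bank lookup. The function name is hypothetical, and the sketch does not reflect how the TaaS services work internally. Terms are marked with the ITS 2.0 HTML attribute its-term, and matches never overlap.

```python
import html
import re


def annotate_terms(text, terms):
    """Wrap known terms in <span its-term="yes"> without overlapping matches."""
    if not terms:
        return html.escape(text)
    # Longest terms first, so that at a given position
    # "translation memory" wins over "translation".
    ordered = sorted(terms, key=len, reverse=True)
    pattern = re.compile(
        r"\b(?:" + "|".join(re.escape(t) for t in ordered) + r")\b",
        re.IGNORECASE,
    )
    parts, last = [], 0
    for match in pattern.finditer(text):  # finditer never yields overlaps
        parts.append(html.escape(text[last:match.start()]))
        parts.append(
            '<span its-term="yes">' + html.escape(match.group(0)) + "</span>"
        )
        last = match.end()
    parts.append(html.escape(text[last:]))
    return "".join(parts)


annotated = annotate_terms(
    "Machine translation needs a translation memory.",
    ["translation", "machine translation", "translation memory"],
)
```

Word-boundary matching is a simplification that suits space-delimited languages only. A production service would also need tokenization and lemmatization per language.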
2.15.2 Data category usage
- Domain - The domain information is used to split and analyze the content per domain separately. This allows filtering terms
in the term bank-based terminology annotation as well as identifying domain-specific content using unguided term extraction
systems. The user is asked to provide a default domain for the term bank-based terminology annotation. This user-supplied
domain will be overridden with ITS 2.0 domain metadata if present in the content.
- Elements Within Text - The information is used to decide which elements are extracted as in-line codes and sub-flows.
- Language Information - The language information is used to split and analyze the content per language. The user is asked
to provide a source (default) language. This default language is overridden by ITS 2.0 Language Information
metadata if present in the content.
- Locale Filter - When used, only text whose locale matches the user-defined source language is analyzed. The
remaining content is ignored.
- Terminology - For existing terminology metadata, the mark-up is preserved (terminology mark-up overlaps are not allowed).
For new terminology metadata, terms are marked according to the Terminology data category’s rules.
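The override behaviour described for Language Information can be sketched as follows: text is grouped by language, a user-supplied default applies at the top, and any lang attribute in the HTML content takes precedence for its element and descendants. The class name and the sample are hypothetical, and this is a sketch under those assumptions, not the Web Service's actual code.

```python
from collections import defaultdict
from html.parser import HTMLParser

# Elements without an end tag; they must not affect the language stack.
VOID_ELEMENTS = {"area", "base", "br", "col", "embed", "hr", "img", "input",
                 "link", "meta", "source", "track", "wbr"}


class LanguageSplitter(HTMLParser):
    """Group text by language; lang attributes override the user default."""

    def __init__(self, default_lang):
        super().__init__()
        self._langs = [default_lang]  # stack of inherited languages
        self.by_lang = defaultdict(list)

    def handle_starttag(self, tag, attrs):
        if tag in VOID_ELEMENTS:
            return
        lang = dict(attrs).get("lang")
        # Simplification: a missing or empty lang inherits from the parent.
        self._langs.append(lang or self._langs[-1])

    def handle_endtag(self, tag):
        if tag not in VOID_ELEMENTS and len(self._langs) > 1:
            self._langs.pop()

    def handle_data(self, data):
        if data.strip():
            self.by_lang[self._langs[-1]].append(data.strip())


splitter = LanguageSplitter(default_lang="en")
splitter.feed('<p>Hello <span lang="de">Guten Tag</span> world<br>again</p>')
splitter.close()
```

The same pattern could be applied to the domain dimension, with the difference that ITS 2.0 domain information is typically selected through global rules rather than a local attribute.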
2.15.3 More Information and Implementation Status/Issues
The implementation has reached Milestone 2 (Initial HTML5 term tagging with simple visualization). Implementation of
Milestone 3 (Enhanced HTML5 term tagging with full visualization) is ongoing.
- Detailed slides: will be made available at the end of May, 2013
- Running code: http://taws.tilde.com
- Source code: will be made available at the end of May, 2013
- General documentation: will be made available at the end of May, 2013
2.16 Universal Preview of ITS 2.0 Metadata in XML, XLIFF, and HTML Files
2.16.1 Description
XML-based source content such as XLIFF files is usually provided to translators or reviewers as reduced and partially transformed
text without any information about local or global context or support for rendering/visualization of content itself or metadata
embedded in the content. This lowers both the quality of the final output and the productivity of human workers.
This usage scenario renders content and metadata so that they can be read easily and interactively as reference material in
a browser. The rendering includes special visual cues and interaction possibilities (such as colour-coding and pop-ups for
metadata to be displayed). It is based on auxiliary files in HTML5+ITS 2.0 (including JavaScript) that are generated from
ITS-annotated source content of any supported formats (XML, XLIFF, HTML).
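The general idea of such a rendering can be sketched in a few lines: local ITS 2.0 attributes in an XML source are turned into HTML spans whose class supports colour-coding and whose title attribute yields a simple browser tooltip. The CSS class names, the function name, and the sample input are hypothetical, global ITS rules are ignored, and this is not the Logrus implementation.

```python
import html
import xml.etree.ElementTree as ET

ITS_NS = "http://www.w3.org/2005/11/its"


def render_preview(elem):
    """Render an element as HTML spans that expose local ITS 2.0 metadata."""
    classes, notes = [], []
    if elem.get(f"{{{ITS_NS}}}translate") == "no":
        classes.append("its-no-translate")  # hypothetical CSS class
        notes.append("Do not translate")
    if elem.get(f"{{{ITS_NS}}}term") == "yes":
        classes.append("its-term")          # hypothetical CSS class
        notes.append("Term")
    loc_note = elem.get(f"{{{ITS_NS}}}locNote")
    if loc_note:
        classes.append("its-loc-note")      # hypothetical CSS class
        notes.append("Note: " + loc_note)

    # Escape text, then recurse into children and keep their tail text.
    inner = html.escape(elem.text or "")
    for child in elem:
        inner += render_preview(child) + html.escape(child.tail or "")

    attrs = ""
    if classes:
        attrs = ' class="{}" title="{}"'.format(
            " ".join(classes), html.escape("; ".join(notes)))
    return "<span{}>{}</span>".format(attrs, inner)


source = (
    '<msg xmlns:its="http://www.w3.org/2005/11/its">Press '
    '<key its:translate="no" its:locNote="Keyboard key name">Enter</key>'
    ' to continue.</msg>'
)
preview = render_preview(ET.fromstring(source))
```

A stylesheet could then assign a colour to each class, and JavaScript could replace the tooltip with a richer pop-up.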
2.16.2 Data category usage
- All ITS 2.0 data categories
2.16.3 More Information and Implementation Status/Issues
Implementer: Logrus
Implementation status: A prototype will display the Translate, Localization Note, and Terminology data categories at the MultilingualWeb
Workshop in March 2013.
2.17 ITS 2.0 in word processing software
2.17.1 Description
- The tool, ITS for Libre Office Writer Extension (ILO), allows a subset of ITS 2.0 to be used in open source word processing
software (LibreOffice).
- Capabilities include:
- Tagging phrases and terms as “not to translate” (translate)
- Tagging words as “term” (terminology)
- Tagging words for a specific locale only (locale filter)
- Providing additional information for the translator (localization note)
- The LibreOffice extension and its software packages allow users to:
- Load ITS 2.0 annotated XML files (ODT, XLIFF)
- Visualize ITS 2.0 metadata in the WYSIWYG editor of LibreOffice
- Edit text related to ITS 2.0 metadata
- Save and export the text, including ITS 2.0 markup, to the original file format (ODT, XLIFF)
2.17.2 Data category usage
- Terminology - One or several words can be marked up as “term”
- Translate - Mark content as “to translate” or “not to translate”
- Localization Note - Pass a message (information, alert) to human agents (such as translators)
- Locale Filter - Limit content to specific locales
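For the Locale Filter entry above, ITS 2.0 defines a comma-separated list of extended language ranges plus a type of "include" or "exclude", with ranges compared to a locale by the extended filtering algorithm of BCP 47 (RFC 4647). A minimal sketch of that check follows. The function names are hypothetical, input validation is omitted, and this is not code from ILO.

```python
def extended_filter_match(language_range, tag):
    """Simplified RFC 4647 extended filtering of one range against one tag."""
    r = language_range.lower().split("-")
    t = tag.lower().split("-")
    # The first subtags must match, unless the range starts with a wildcard.
    if r[0] != "*" and r[0] != t[0]:
        return False
    ri, ti = 1, 1
    while ri < len(r):
        if r[ri] == "*":           # wildcard: skip it in the range
            ri += 1
        elif ti >= len(t):         # range has subtags left, tag has none
            return False
        elif r[ri] == t[ti]:       # subtags match: advance both
            ri += 1
            ti += 1
        elif len(t[ti]) == 1:      # singleton in the tag blocks skipping
            return False
        else:                      # skip a non-matching tag subtag
            ti += 1
    return True


def applies_to_locale(filter_list, locale, filter_type="include"):
    """Decide whether content with this Locale Filter applies to a locale."""
    ranges = [r.strip() for r in filter_list.split(",") if r.strip()]
    matched = any(extended_filter_match(r, locale) for r in ranges)
    return matched if filter_type == "include" else not matched
```

With this sketch, a list of "en-CA, fr-CA" with type "include" keeps content for fr-CA and drops it for en-US. An empty list with type "include" applies to no locale, and with type "exclude" it applies to every locale.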
2.17.3 More Information and Implementation Status/Issues
ILO uses Okapi capabilities for XLIFF handling and will be available in April 2013. The use of ILO will be presented at the MultilingualWeb Workshop in March 2013. The results of ILO development will be released to the public under the open
LGPL v3 license (same as LibreOffice).
2.18 Training for Statistical Machine Translation
2.18.1 Description
- ITS 2.0 bilingual data is collected in a Content Management System, and passed to a Statistical Machine Translation (SMT)
system for training the system's language models.
- If domain information is supplied for the content, domain-aware modules in the SMT system are trained on the corresponding
content.
Benefits:
- The ITS 2.0 markup provides key information to drive the reliable extraction of domain-specific content.
- MT systems trained on domain-specific data allow for potentially more accurate translation.
2.18.2 Data category usage
- Translate - Parts that retain their original form are passed through the MT system as-is.
- Language Information - Used to select the appropriate MT language models.
- Domain - Domain values direct the selection or training of the appropriate MT language models.
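How domain values might steer training can be sketched as a simple partitioning step: bilingual segments are grouped by their domain values, so that each group can feed a domain-specific model. The data layout, the function name, and the "general" fallback are assumptions made for illustration. The source does not describe the actual interface between the Content Management System and the SMT system.

```python
from collections import defaultdict


def group_by_domain(segments, default_domain="general"):
    """Partition (source, target, domains) triples into per-domain corpora."""
    corpora = defaultdict(list)
    for source, target, domains in segments:
        # A segment may carry several domain values; use the default if none.
        for domain in (domains or [default_domain]):
            corpora[domain.strip().lower()].append((source, target))
    return dict(corpora)


# Hypothetical bilingual segments with their ITS 2.0 domain values.
segments = [
    ("Hello", "Hallo", ["tourism"]),
    ("Invoice", "Rechnung", ["Finance", "legal"]),
    ("Yes", "Ja", []),
]
corpora = group_by_domain(segments)
```

Normalizing the values to lower case is a design choice of this sketch. It avoids treating "Finance" and "finance" as separate training sets.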
2.18.3 More Information and Implementation Status/Issues
3 Authors and Implementation Contributors
Renat Bikmatov (Logrus),
David Filip (University of Limerick),
Leroy Finn (Trinity College Dublin),
Karl Fritsche (Cocomore AG),
Serge Gladkoff (Logrus),
Declan Groves (Centre for Next Generation Localisation (CNGL), Dublin City University),
Milan Karasek (Moravia),
Jirka Kosek (University of Economics, Prague),
Kevin Lew (Spartan Software),
Dave Lewis (Trinity College Dublin),
Fredrik Liden (ENLASO Corporation),
Shaun McCane ((public) Invited expert),
Sean Mooney (University of Limerick),
Pablo Nieto Caride (Linguaserve),
Pēteris Ņikiforovs (Tilde),
David O'Carrol (University of Limerick),
Philip O'Duffy (University of Limerick),
Mauricio del Olmo (Linguaserve),
Mārcis Pinnis (Tilde),
Phil Ritchie (VistaTEC),
Nieves Sande (German Research Center for Artificial Intelligence (DFKI) GmbH),
Felix Sasaki (W3C Fellow),
Yves Savourel (ENLASO Corporation),
Sebastian Sklarß (]init[ Europe),
Ankit Srivastava (Centre for Next Generation Localisation (CNGL), Dublin City University),
Tadej Štajner (Jozef Stefan Institute),
Chase Tingley (Spartan Software),
Asanka Wasala (University of Limerick),
Clemens Weins (Cocomore AG).