- From: Arle Lommel <arle.lommel@dfki.de>
- Date: Wed, 8 Aug 2012 13:33:33 +0200
- To: "Gabor L. Ugray" <gabor.ugray@kilgray.com>, István Lengyel <istvan.lengyel@kilgray.com>
- Cc: Multilingual Web LT Public List <public-multilingualweb-lt@w3.org>
- Message-Id: <4812C4E8-D930-43B6-B730-1036865A181E@dfki.de>
Hello, Gábor and István!

I hope this mail finds you well. As part of the W3C work on ITS 2.0 standardized markup for localization quality data, I have been asked to follow up with some developers of quality assurance tools to get feedback. While we are still working on details, there is one piece I was asked to approach you about as the drivers of XLIFF:doc’s quality model. (If you want a lot of details about the work going into this, see https://www.w3.org/International/multilingualweb/lt/track/issues/34)

Basically, to support broad-brush interoperability between tools, we decided to create a list of "top-level" quality issue categories. They are not intended to replace internal categories or to describe systems in full detail. Rather, the categories are intended to give a consuming tool some idea of what sort of quality issue has been detected by another tool, even if the precise details do not match the consuming tool’s own ontology of issues.

What we are asking you to do is read the following extract from a rough draft of the description of the locQualityType (localization quality type) attribute value that is used for this categorization and let us know (a) whether the values proposed would meet your needs (both for Kilgray and for XLIFF:doc) and (b) whether the proposed mapping of XLIFF:doc quality categories (in the appendix draft further down) makes sense to you. The hardest one for us to map was the XLIFF:doc internationalization category, since it is very broad in its meaning, considerably broader than the corresponding category in this list. (The text also refers to locQualityCode, which is used to pass on internal codes, which may be considerably more detailed than the locQualityType values to which they are mapped.)

So, if you can spare a little time to look this over and provide your thoughts, we would appreciate it.

(By the way, I will be in Hungary in a few weeks.
Unfortunately I won't be anywhere near Gyula, since I'm going to Nógrád County for a week and so am headed the wrong way, but if you happen to be in Budapest on Monday, August 20, I could meet with you to discuss this in person.)

Best,

Arle

DRAFT CONTENT on locQualityType

The locQualityType subcategory is intended to provide a basic level of interoperability between differing localization quality assurance systems. It provides a list of 26 high-level quality issue types common in automatic and human localization quality assessment. Localization quality assessment tools can map their internal categories to these categories in order to exchange information about the kinds of issues they identify and take intelligent and appropriate action even if another tool does not know the specific issues identified by the generating tool.

NOTE: Tools implementing the loc-quality data category that use the locQualityCode subcategory SHOULD also use locQualityType to provide this level of interoperability.

The values listed in the following table are allowed for locQualityType. Note that tools implementing locQualityType are not required to check for or flag issues for all or any of the types. However, if they implement locQualityType, the values they produce for the attribute MUST match one of the values provided in this table and MUST be semantically accurate. If a tool can map its internal values to these categories it MUST do so and MUST NOT use the value other, which is reserved strictly for values that cannot be mapped to these values.

Allowable values for locQualityType (columns: Value, Description, Examples, Notes)

terminology
  Description: An incorrect term or a term from the wrong domain was used, or terms are used inconsistently.
  Examples: The localization had Pen Drive when corporate terminology specified that USB Stick was to be used. The localization inconsistently used Start and Begin.
  Notes: Should not be confused with the ITS terminology data category.
mistranslation
  Description: The content of the target mistranslates the content of the source.
  Examples: The English source reads “An ape succeeded in grasping a banana lying outside its cage with the help of a stick” but the Italian translation reads “l’ape riuscì a prendere la banana posta fuori dalla sua gabbia aiutandosi con un bastone” (“A bee succeeded…”).
  Notes: Issues related to translation of specific terms related to the domain or task-specific language should be categorized as terminology issues.

omission
  Description: Necessary text has been omitted from the localization or source.
  Examples: One or more segments found in the source that should have been translated are missing in the target.
  Notes: This category should not be used for missing whitespace or formatting codes, but instead should be reserved for linguistic content.

untranslated
  Description: Content that should have been translated was left untranslated.
  Examples: The source segment reads “The Professor said to Smith that he would hear from his lawyer” but the Hungarian localization reads “A professzor azt mondta Smithnek, hogy he would hear from his lawyer.”
  Notes: omission takes precedence over untranslated. Omissions are distinct in that they address cases where text is not present, while untranslated addresses cases where text has been carried over from the source without being translated.

addition
  Description: The translated text contains inappropriate additions.
  Examples: The translated text contains a note from the translator to himself to look up a term; the note should have been deleted but was not.

duplication
  Description: Content has been duplicated improperly.
  Examples: A section of the target text was inadvertently copied twice in a copy and paste operation.

inconsistency
  Description: The text is inconsistent with itself (NB: not for use with terminology inconsistency).
  Examples: The text states that an event happened in 1912 in one location but in another states that it happened in 1812.
grammar
  Description: The text contains a grammatical error (including errors of syntax and morphology).
  Examples: The text reads “The guidelines says that users should use a static grounding strap.”

legal
  Description: The text is legally problematic (e.g., it is specific to the wrong legal system).
  Examples: The localized text is intended for use in Thailand but includes U.S. regulatory notices. A text translated into German contains comparative advertising claims that are not allowed by German law.

register
  Description: The text is written in the wrong linguistic register or uses slang or other language variants inappropriate to the text.
  Examples: A financial text translated into U.S. English refers to dollars as “bucks”.

locale-specific-content
  Description: The localization contains content that does not apply to the locale for which it was prepared.
  Examples: A text translated for the Japanese market contains call center numbers in Texas and refers to special offers available only in the U.S.
  Notes: Legally inappropriate material should be classified as legal.

locale-violation
  Description: Text violates norms for the intended locale.
  Examples: A text localized into German has dates in YYYY-MM-DD format instead of in DD.MM.YYYY. A translated text uses American-style foot and inch measurements instead of centimeters.

style
  Description: The text contains stylistic errors.
  Examples: Company style dictates that all individuals be referred to as Mr. or Ms. with a family name, but the text refers to “Jack Smith”.

characters
  Description: The text contains characters that are garbled or incorrect or that are not used in the language in which the content appears.
  Examples: The text should have a but instead has a ¥ sign. A text translated into German omits the umlauts over ü, ö, and ä. A Japanese localization contains characters like మ and ఊ (from Telugu).

misspelling
  Description: The text contains a misspelling.
  Examples: A German text misspells the word Zustellung as Zustellüng.

typographical
  Description: The text has typographical errors such as omitted/incorrect punctuation, incorrect capitalization, etc.
  Examples: An English localization has the following sentence: The man whom, we saw, was in the Military and carried it’s insignias

formatting
  Description: The text is formatted incorrectly.
  Examples: Warnings in the target text are supposed to be set in italic face, but instead appear in bold face. Margins of the text are narrower than specified.

inconsistent-entities
  Description: The source and target text contain different named entities (dates, times, place names, individual names, etc.).
  Examples: The name Thaddeus Cahill appears in an English source but is rendered as Tamaš Cahill in the Czech version. The date February 9, 2007 appears in the source but the translated text has “2. September 2007.”

numbers
  Description: Numbers are inconsistent between source and target.
  Examples: The source text states that an object is 120 cm long, but the target text says it is 129 cm long.
  Notes: Some tools may correct for differences in units of measurement to reduce false positives.

markup
  Description: There is an issue related to markup or a mismatch in markup between source and target.
  Examples: The source segment has five markup tags but the target has only two. An opening tag in the localization is missing a closing tag.

pattern-problem
  Description: The text fails to match a pattern that defines allowable content (or matches one that defines non-allowable content).
  Examples: The quality checking tool disallows the regular expression pattern ['"”’][\.,] but the translated text contains: A leading “expert”, a political hack, claimed otherwise.

whitespace
  Description: There is a mismatch in whitespace between source and target content.
  Examples: A source segment starts with six space characters but the corresponding target segment has two non-breaking spaces at the start.

internationalization
  Description: There is an error related to the internationalization of content.
  Examples: A line of programming code has embedded language-specific strings. A user interface element leaves no room for text expansion. A form allows only for U.S.-style postal addresses and expects five-digit U.S. ZIP codes.
  Notes: There are many kinds of internationalization errors of various types.
  This category is therefore very heterogeneous in what it can refer to.

length
  Description: There is a significant difference in source and target length.
  Examples: The translation of a segment is five times as long as the source.
  Notes: What constitutes a “significant” difference in length is determined by the model referred to in the locQualityProfile.

uncategorized
  Description: The issue has not been categorized.
  Examples: A new version of a tool returns information on an issue that has not been previously checked and that is not yet classified.
  Notes: This category has two uses: (1) a tool can use it to pass through quality data from another tool in cases where the issues from the other tool are not classified (for example, a localization quality assurance tool interfaces with a third-party grammar checker); (2) a tool’s issues are not yet assigned to categories, and, until an updated assignment is made, they may be listed as uncategorized. In the latter case it is recommended that issues be assigned to appropriate categories as soon as possible since uncategorized does not foster interoperability.

other
  Description: Any issue that cannot be assigned to any of the values listed above.
  Notes: This category allows for the inclusion of any issues not included in the previously listed values. This value MUST NOT be used for any tool- or model-specific issues that can be mapped to the values listed above. In addition, this value is not synonymous with uncategorized, in that uncategorized issues may later be assigned to another, more precise value, while other issues cannot. If a metric has a “miscellaneous” or “other” category, it should be mapped to this value even if the specific instance of the issue might be mapped to another category.

Annex X: Mapping of Tool-Specific Quality Codes to locQualityType Values (Non-Normative)

This Annex is informative. The following table provides mappings of native quality assurance issue codes for a number of common localization quality tools to locQualityType values.
Tool developers are free to map their own issue codes to the locQualityType values and are encouraged to make their mappings publicly available. Tools that produce ITS 2.0 loc-quality markup should ensure that the output of their tools matches any publicly available mappings they may produce.

Note: These mappings are provided as examples only. In the event of a discrepancy between the mapping published by a developer and this annex, the statements from the developer take precedence over this annex.

Columns: locQualityType value, followed by the tool/metric-specific values for CheckMate, xliff:doc, QA Distiller, SAE J2450, LISA QA Model (UI), and LISA QA Model (doc)* - language only**

terminology
  TERMINOLOGY terminology Consistency Tag-aware ID-aware Untranslatables wrong term Terminology Glossary adherence Abbreviations Context
mistranslation
  Mistranslation Accuracy Semantics Accuracy
omission
  MISSING_TARGETTU MISSING_TARGETSET EMPTY_TARGETSEG EMPTY_SOURCESEG omission Empty translations omission Omissions
untranslated
  TARGET_SAME_AS_SOURCE Forgotten translations Skipped translations Partial translations Incomplete translation
addition
  EXTRA_TARGETSEG Additions
duplication
  Not addressed in any of these metrics. It may be possible to treat this as a case of addition.
inconsistency
  inconsistency Source Target Consistency
grammar
  syntactic error word structure or agreement error Grammar
legal
  Not addressed in any of these metrics. However, legal compliance checking is a big deal for regulated industries and forms a core part of their metrics.
register
  Register/tone Language variants/slang
locale-specific-content
  Local suitability
locale-violation
  Country Country standards
style
  Style General style Company standards
characters
  ALLOWED_CHARACTERS Corrupt characters, source Corrupt characters, target Double/Single Size Character formatting
misspelling
  misspelling Spelling
typographical
  punctuation Consecutive punctuation marks End of segment punctuation Non-matching pairs (brackets) Leading bracket outside of TU Different count (brackets) Initial capitalization Entire capitalization Non-matching pairs (quotation marks) Incorrect type (quotation marks) Different count (quotation marks) punctuation error Punctuation marks
formatting
  TOC Index Layout Typography Graphics Call Outs and Captions Alignment Sizing Truncation/overlap (Numerous)
inconsistent-entities
  date time
numbers
  number Number values Incorrect type (measurements) Check conversions (measurements)
markup
  MISSING_CODE EXTRA_CODE SUSPECT_CODE tags
pattern-problem
  UNEXPECTED_PATTERN SUSPECT_PATTERN pattern
whitespace
  MISSING_LEADINGWS MISSINGORDIFF_LEADINGWS EXTRA_LEADINGWS EXTRAORDIFF_LEADINGWS MISSING_TRAILINGWS MISSINGORDIFF_TRAILINGWS EXTRA_TRAILINGWS EXTRAORDIFF_TRAILINGWS Consecutive spaces Inconsistent leading and trailing spaces Required/forbidden spaces Different count (tabs) Required/forbidden spaces
internationalization
  internationalization (The examples for this code are broader than the type category here.) Number formatting
length
  TARGET_LENGTH
uncategorized
  LANGUAGETOOL_ERROR
other
  other miscellaneous error Hyper text functionality, jumps, popups Localizable text Dialogue functionality Menu functionality Hotkeys/accelerators Jumps/links

(* There are significant discrepancies between the categories in the LISA QA Model software and its documentation. The relationship between the two is unclear, so both are listed here.)
(** The LISA QA Model documentation addresses numerous issues related to software formatting that are outside the scope of the ITS 2.0 loc-quality model. For the sake of conciseness and clarity, these are not listed in this document.)
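
As a rough illustration of the mapping rule in the draft (a tool maps its internal codes to locQualityType values wherever it can, and uses other only when no mapping exists), a minimal Python sketch might look like the following. The function name to_loc_quality_type, the dictionary layout, and the sample code SOME_UNMAPPED_CODE are assumptions made for illustration and are not part of the ITS 2.0 draft; the CheckMate codes are only an excerpt of the annex rows above.

```python
# Sketch only: the names below are illustrative assumptions, not part of ITS 2.0.

# The 26 locQualityType values listed in the draft table.
LOC_QUALITY_TYPES = frozenset({
    "terminology", "mistranslation", "omission", "untranslated", "addition",
    "duplication", "inconsistency", "grammar", "legal", "register",
    "locale-specific-content", "locale-violation", "style", "characters",
    "misspelling", "typographical", "formatting", "inconsistent-entities",
    "numbers", "markup", "pattern-problem", "whitespace",
    "internationalization", "length", "uncategorized", "other",
})

# Excerpt of the CheckMate codes from the annex (not a complete mapping).
CHECKMATE_TO_TYPE = {
    "TERMINOLOGY": "terminology",
    "TARGET_SAME_AS_SOURCE": "untranslated",
    "EXTRA_TARGETSEG": "addition",
    "ALLOWED_CHARACTERS": "characters",
    "MISSING_CODE": "markup",
    "EXTRA_CODE": "markup",
    "SUSPECT_CODE": "markup",
    "UNEXPECTED_PATTERN": "pattern-problem",
    "SUSPECT_PATTERN": "pattern-problem",
    "MISSING_LEADINGWS": "whitespace",
    "TARGET_LENGTH": "length",
    "LANGUAGETOOL_ERROR": "uncategorized",
}


def to_loc_quality_type(code, mapping):
    """Map a tool-internal issue code to a locQualityType value.

    Codes without a mapping fall back to "other", which the draft reserves
    for issues that cannot be mapped to any of the listed values.
    """
    value = mapping.get(code, "other")
    if value not in LOC_QUALITY_TYPES:
        # A mapping table must only ever produce values from the draft list.
        raise ValueError(f"{value!r} is not an allowed locQualityType value")
    return value


if __name__ == "__main__":
    for code in ("MISSING_CODE", "LANGUAGETOOL_ERROR", "SOME_UNMAPPED_CODE"):
        print(code, "->", to_loc_quality_type(code, CHECKMATE_TO_TYPE))
```

Keeping the allowed values in one set and checking every mapped value against it is one way to catch a mapping table that drifts away from the published list.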
Received on Wednesday, 8 August 2012 11:34:14 UTC