Re: [ISSUE 34] Potential problem with high-level quality issues


From:   Arle Lommel <>
To:     Multilingual Web LT Public List 
Date:   02/08/2012 09:09
Subject:        Re: [ISSUE 34] Potential problem with high-level quality issues

Hi all,

I think Felix's mail gets at part of the confusion here. So let me try to 
state the problem as I see it:

The tools that generate and receive the quality data (at least what we are 
talking about here in terms of issue types) don't actually do anything 
themselves.  Their function is to preserve the data and to present it to 
the user in an intelligible format. <pr>Exactly!</pr>  Part of the reason 
is that, at least in the case of something like CheckMate, the tool 
doesn't know whether the issue is an error or not; all it is doing is 
flagging something that may be an issue. For example, if the source starts 
with six spaces and the target has only three non-breaking spaces instead, 
it will flag that there is a difference in white-space, but it doesn't 
have a way to know whether that was an actual error or something done 
deliberately by a knowledgeable translator. All it can do is raise a hand 
and politely say "guys, you might want to take a look at this because it 
looks off."
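
A minimal sketch of that kind of check, in Python. The function names, the returned dictionary and the decision to count non-breaking spaces as whitespace are all assumptions made for illustration; this is not how CheckMate actually implements it. The point is only that the check reports a difference and leaves the judgment open.

```python
# Hypothetical whitespace check: it reports a difference, it does not judge it.
WHITESPACE = " \t\u00a0"  # assumption: non-breaking spaces count as whitespace


def leading_whitespace(text: str) -> str:
    """Return the run of whitespace characters at the start of text."""
    stripped = text.lstrip(WHITESPACE)
    return text[: len(text) - len(stripped)]


def flag_whitespace_difference(source: str, target: str):
    """Return an informational note if leading whitespace differs, else None."""
    src_ws = leading_whitespace(source)
    tgt_ws = leading_whitespace(target)
    if src_ws == tgt_ws:
        return None
    return {
        "type": "whitespace",
        "comment": (
            f"leading whitespace differs: source has {len(src_ws)} "
            f"character(s), target has {len(tgt_ws)}"
        ),
        "is_error": None,  # unknown: only a human reviewer can decide
    }


# Six spaces in the source, three non-breaking spaces in the target.
note = flag_whitespace_difference("      Hello", "\u00a0\u00a0\u00a0Hallo")
print(note)
```

Whatever consumes the note still has to show it to a person; the `is_error` field stays empty until someone looks at it.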

So the problem with looking for interoperable workflow actions based on 
the values of these attributes is that they really are just informational 
in nature. In most cases the tools simply present an issue to the user and 
allow him/her to take appropriate actions. I think those instances where 
the tools take action on their own are pretty limited and trivial 
(something like fixing the order of trailing punctuation).

I think Phil's original example of the HTML preview that highlights 
segments with problems is actually typical of what one would normally 
expect from quality categories.

Would something like highlighting purported errors for the user and color 
coding them be sufficient? Since ultimately these categories are about 
presenting the information to the user for further action, that actually 
is the correct interpretation of the categories and anything more would be 
problematic in most cases.  <pr>See my previous comment about metrics. We 
have automation which takes actions upon rolled up data, not each specific 
error instance.</pr>

To think of this another way, consider the issues flagged by Language 
Tool. What is the action that a tool should take based on them? The answer 
really is "it depends". In the case of a tool like CheckMate, simply 
passing them on and presenting them to the user is the appropriate action. 
In any case, they will require human intervention and they are, at most, a 
flag for attention that the human reviewer may or may not decide to act on.

In other cases a tool may present quality data to the user who could 
decide to do any number of things based on it:

* Fix the error himself (e.g., the translator clearly mistyped a number and 
  it can be fixed).
* Send it back to the translator for fixing.
* Send it to another translator for fixing.
* Do nothing (the "error" isn't an error or is so minor it doesn't justify 
  the cost of fixing).
* Reject the entire translation because it is so ridden with errors that 
  it needs to be started over from scratch.

What is done is not determined by the tool (although the tool may make 
suggestions), but rather by the user, so the goal in all of this is to 
pass the information on to the user in a way that is intelligible. The 
category mapping is needed to help the tool decide how to present it to 
the user and what to require of the user. For example, in one scenario a 
tool may ignore whitespace errors entirely (e.g., they don't matter for 
the format in question) but insist on having each and every terminology 
issue checked and either fixed or explicitly acknowledged as being OK.
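
The scenario just described could be sketched like this in Python. The policy table, the action names and the note format are invented for illustration; nothing here is defined by ITS 2.0 or by any tool mentioned in this thread.

```python
# Hypothetical per-project policy: how a consuming tool treats each top-level type.
POLICY = {
    "whitespace": "ignore",                    # does not matter for this format
    "terminology": "require-acknowledgement",  # must be fixed or explicitly OK'd
}
DEFAULT_ACTION = "display"  # assumption: anything else is simply shown to the user


def plan_presentation(notes):
    """Pair each quality note with the action the tool takes for it."""
    plan = []
    for note in notes:
        action = POLICY.get(note["type"], DEFAULT_ACTION)
        if action != "ignore":
            plan.append((note["id"], action))
    return plan


notes = [
    {"id": "n1", "type": "whitespace"},
    {"id": "n2", "type": "terminology"},
    {"id": "n3", "type": "grammar"},
]
print(plan_presentation(notes))
```

The tool decides only how each note is presented and whether the user must respond; what is actually done about the issue remains the user's call.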

One possible workflow scenario corresponds roughly to Phil’s example: the 
tool generates an HTML preview with embedded ITS quality metadata and then 
uses a styling mechanism to control how that is visualized for the user. 
For example, you might have something like this:

<!DOCTYPE html>
<html lang="en">
   <head>
      <title>Telharmonium 1897</title>
      <style type="text/css">
         [qualitytype=untranslated] {
            border:5px solid green;
            background-color: red;
         }
      </style>
   </head>
   <body>
      <h1 id="h0001" qualitytype="untranslated">Telharmonium (1897)</h1>
      <p id="p0001">
         <span class="segment" id="s0001">Thaddeus Cahill (1867–1934) conceived of
             an instrument that could transmit its sound from a power plant for
             hundreds of miles to listeners over telegraph wiring.</span>
         <span class="segment" id="s0002">Beginning in 1889 the sound quality of
             regular telephone concerts was very poor on account of the 
             generated by carbon-granule microphones. As a result Cahill decided to
             set a new standard in perfection of sound quality with his 
             a standard that would not only satisfy listeners but that would overcome
             all the flaws of traditional instruments.</span>
      </p>
   </body>
</html>

Which renders nicely as:

(Note that I used the non "its-" prefixed attribute name per Felix's email.)

This example also suggests to me that, per one of Yves' recent emails 
(which I don't find at the moment), we need to split the attributes used 
for the general type and the tool-specific type. If we glom them together 
in one attribute with an internal syntax, we lose the ability to do this 
sort of CSS-based highlighting and then need things like this:

<style type="text/css">
      border:5px solid green;
      background-color: red;

This kind of syntax would break the ability to automatically apply styling 
based only on the top-level categories (which might be desirable if you 
are using a browser to render ITS 2.0 quality-tagged data without 
knowledge of the specific tool that is the source). Much better then to 
split the attributes, which would make selection based on the top-level 
categories much easier.
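
To illustrate the difference in Python rather than CSS: with split attributes, selecting on the top-level category is a plain equality test, while a combined value forces every consumer to know the internal syntax first. The attribute name `qualitytooltype`, the combined `quality` attribute, the `type;tool:code` syntax and the tool codes below are all made up for this sketch and are not something the group has agreed on.

```python
# Split attributes: the top-level type stands alone in its own attribute.
split_items = [
    {"id": "h0001", "qualitytype": "untranslated", "qualitytooltype": "toolA:code-17"},
    {"id": "s0002", "qualitytype": "whitespace", "qualitytooltype": "toolA:code-42"},
]

# Combined attribute: both values packed into one string (assumed syntax).
combined_items = [
    {"id": "h0001", "quality": "untranslated;toolA:code-17"},
    {"id": "s0002", "quality": "whitespace;toolA:code-42"},
]


def select_split(items, top_level):
    """Exact match on the dedicated attribute, like [qualitytype=...] in CSS."""
    return [item["id"] for item in items if item["qualitytype"] == top_level]


def select_combined(items, top_level):
    """The consumer must parse the internal syntax before it can match."""
    return [
        item["id"]
        for item in items
        if item["quality"].split(";", 1)[0] == top_level
    ]


print(select_split(split_items, "untranslated"))
print(select_combined(combined_items, "untranslated"))
```

Both calls select the same element, but only the first works without any knowledge of how the tool-specific part is encoded.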

Felix, based on my explanation, would it be enough for an implementation 
to say that a browser or tool simply displays the different issues to the 
user in a visually distinct manner? If so, the bar for implementation 
isn't so high and it would meet typical user needs quite well.

<pr>Great mail Arle. All the things I wanted to say but done more 



On Aug 2, 2012, at 7:23, Felix Sasaki <> wrote:

2012/8/1 Yves Savourel <>
Ok, sorry I missed the distinction in Arle’s note and read your email too 

So this is a requirement that we put upon ourselves.


> The test cases must be more robust than simply seeing
> that a tool identifies an issue and passes it on:
> we also need to see that they do this consistently with
> each other, which is hard since the set of issues
> from the various tools only partially overlap.

I’m not sure I get "we also need to see that they do this consistently 
with each other". Each tool has its own set of issues. The only exchange 
part between tools is when a tool A generates a list of qa notes and those 
are then read into a tool B which does something with them.

My point is just: what useful thing can a tool do when all it knows is 
that something is e.g. a grammar error? See the workflow I tried to 
explain at


The interoperability I can see is that, for example, when tool A and B 
filter the same list of qa notes on the 'omission' type we get the same 
result.
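
A rough sketch of what such a check could look like in Python. The note format and the two filter functions standing in for tool A and tool B are hypothetical; the only point is that both select on the shared top-level type and ignore everything tool-specific.

```python
# Hypothetical list of qa notes produced by some tool.
QA_NOTES = [
    {"id": 1, "type": "omission", "tool_code": "A-12"},
    {"id": 2, "type": "grammar", "tool_code": "A-40"},
    {"id": 3, "type": "omission", "tool_code": "A-13"},
]


def tool_a_filter(notes, issue_type):
    """Stand-in for tool A: list comprehension over the shared 'type' field."""
    return [n["id"] for n in notes if n["type"] == issue_type]


def tool_b_filter(notes, issue_type):
    """Stand-in for tool B: a different implementation of the same filter."""
    selected = []
    for n in notes:
        if n.get("type") == issue_type:
            selected.append(n["id"])
    return selected


# An interoperability test would compare the two results for each type.
same = tool_a_filter(QA_NOTES, "omission") == tool_b_filter(QA_NOTES, "omission")
print(same)
```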

If you mean that we must make sure that tool A maps the issues that we see 
as omissions to the 'omission' top-level type, that seems to be out of 
our purview. Or am I missing something?

I am probably asking for mapping in the sense of

For other data categories, we have a small set of allowed values like 
"yes" or "no". So even if we don't test that tools do the same stuff with 
these values, the value set is so small that the interpretation becomes 
very clear. I just don't understand what useful and testable thing (one or 
two) tools can do with high-level information like "this is a grammar 
error". Maybe you or others can draft an example, filling 1-4 at

in? That would help me a lot.




From: Felix Sasaki []
Sent: Wednesday, August 01, 2012 7:07 PM
To: Yves Savourel
Cc: Arle Lommel; Multilingual Web LT Public List
Subject: Re: [ISSUE 34] Potential problem with high-level quality issues

2012/8/1 Yves Savourel <>
I’m not sure I completely understand the requirement. For each value we 
need two applications that use it?

Did we have such a requirement for 1.0?

No, we didn't, since - see below - the number of values was very small and 
easy to understand.

With the need (more on that later) to convince people working on the 
content production side of the usefulness of our metadata, I think we have 
a higher bar than for locNoteType.



For example we have a locNoteType with ‘alert’ or ‘description’. Do we 
have two applications that generate those two values?

Just wondering.

From: Felix Sasaki []
Sent: Wednesday, August 01, 2012 5:22 PM
To: Arle Lommel
Cc: Multilingual Web LT Public List

Subject: Re: [ISSUE 34] Potential problem with high-level quality issues

Hi Arle, all,

let me just add that for other data categories, we have only a small set of 
predefined values - e.g. for "Translate" only "yes" or "no", or for 
localization note type "alert" or "description". Also, these values are 
distinct - you have either "yes" or "no", so there is no danger of doing 
the wrong thing when an application produces or consumes the values. 
Finally, the categorization of an error seems to be difficult, with so 
many categories being proposed.

This situation led me to think that we should set a high bar for 
the normative values - otherwise there won't be any interoperability of 
what implementations produce or consume, as Arle described. I don't see a 
clear way out, and I'm looking very much forward to feedback from 
implementors - Yves, Phil etc.



2012/8/1 Arle Lommel <>
Hello all,

I was discussing the high-level quality issues with Felix this morning and 
we have an issue. If they are to be normative, then we will need to find 
at least two interoperable implementations for each value, not just for 
the mechanism as a whole, and to test those implementations against test 
cases. While that would not be hard for some like terminology, it would be 
difficult for others like legal, because, while they are used in metrics, 
they are not particularly embedded in tools that would produce or consume 
ITS 2.0 markup.

One solution is to put the issue names in an informative annex and very 
strongly recommend that they be used. That approach is, I realize, 
unlikely to satisfy Yves, for good reason: if we cannot know what values 
are allowed in that slot, then we cannot reliably expect interoperability. 
At the same time, if we only go with those values for which we can find 
two or more interoperable implementations, that list of 26 issues will 
probably become something like six or eight, thereby leaving future tools 
that might address the other issues out in the cold.

I have to confess that I do not see a solution to this issue right now 
since we really need the values to be normative but if we cannot test them 
in fairly short order they cannot be normative. The test cases must be 
more robust than simply seeing that a tool identifies an issue and passes 
it on: we also need to see that they do this consistently with each other, 
which is hard since the set of issues from the various tools only 
partially overlap.

If anyone has any brilliant ideas on how to solve the issue, please feel 
free to chime in. We're still working on this and hope to find a way to 
move forward with normative values.



Felix Sasaki
DFKI / W3C Fellow


Received on Thursday, 2 August 2012 09:47:32 UTC