RE: [ISSUE-34] - Quality data category. Some questions about the OKAPI errors

Hi Arle,

> I've been reviewing your error categories and now 
> have a few questions about how you distinguish 
> between the following pairs of categories:
> • MISSING_LEADINGWS
> • MISSINGORDIFF_LEADINGWS
> ...
> In each case the second of the pair seems to 
> include the first in its meaning. So why not 
> have something like the following?

Yes, both are about missing or extra whitespace. The difference is in how the issue is detected. The ones without 'ORDIFF' are triggered when a whitespace character in the source has no corresponding character at all in the target, for example "abc  " vs "xyz ".

The issues with 'ORDIFF' are triggered when a whitespace character in the source corresponds to a different character in the target, for example "  abc" vs " xyz" or " \txyz". The source whitespace may be missing or, if the corresponding character in the target is also whitespace, it may just be different. We don't spend the time checking which it is.

We could do that check, or we could generate an 'ORDIFF' issue for the first case as well. These items may change in future versions.
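
To make the distinction concrete, here is a minimal sketch of one way to read the description above. It assumes a position-by-position comparison of the leading characters; the function name and the short-target example are made up for illustration, and this is not the actual Okapi code.

```python
from typing import Optional


def check_leading_ws(source: str, target: str) -> Optional[str]:
    """Return an issue code for leading whitespace, or None if nothing is flagged.

    Hypothetical helper: it assumes each leading whitespace character of the
    source is compared with the target character at the same position.
    """
    for i, ch in enumerate(source):
        if not ch.isspace():
            break  # end of the source's leading whitespace run
        if i >= len(target):
            # The target has no character at all at this position.
            return "MISSING_LEADINGWS"
        if target[i] != ch:
            # A different character sits there: either a non-whitespace
            # character (whitespace missing) or another kind of whitespace
            # (whitespace different). We do not check which it is.
            return "MISSINGORDIFF_LEADINGWS"
    return None


print(check_leading_ws("  abc", " xyz"))    # second space faces 'x'
print(check_leading_ws("  abc", " \txyz"))  # second space faces a tab
print(check_leading_ws("  abc", " "))       # target ends before the second space
```

Under that assumption, the first two calls report MISSINGORDIFF_LEADINGWS and the last one reports MISSING_LEADINGWS.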


> Also, for the following, are these automatically determined?
> • TERMINOLOGY
> Terminology looks like it would normally be spotted by a human,
> but perhaps you have a different detection method?

As you saw in the documentation, those are automatically detected. The detection is currently still crude, but we have access to more sophisticated code that could be quite efficient. We have to find the time to port it to Okapi some day (it would take a while).


> • LANGUAGETOOL_ERROR
> For the language-tool error, can you provide an example?

That one is a catch-all issue we use when applying LanguageTool's rules to the content.
LanguageTool is a grammar/style/issue checker used (among other places) in OpenOffice. Currently we just pass on the error message, the positions if any are available, and a suggestion if one is provided.

The errors generated by LT are rule-based and configurable, so there is no way to get a finite list of them.
You can find an example of a few grammar issue categories here: http://www.languagetool.org/download/errors.xml
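
As a rough picture of the pass-through described above, here is a small sketch. The `Issue` record, the `from_rule_match` function and the dictionary keys are all hypothetical names chosen for illustration; they are neither LanguageTool's API nor Okapi's.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Issue:
    """Hypothetical catch-all issue record."""
    issue_type: str
    message: str
    start: Optional[int] = None       # position, if the checker gave one
    end: Optional[int] = None
    suggestion: Optional[str] = None  # first suggestion, if any


def from_rule_match(match: dict) -> Issue:
    """Wrap one rule match (assumed here to be a plain dict) as a generic issue."""
    replacements = match.get("replacements") or []
    return Issue(
        issue_type="LANGUAGETOOL_ERROR",
        message=match["message"],  # the message is passed on unchanged
        start=match.get("start"),
        end=match.get("end"),
        suggestion=replacements[0] if replacements else None,
    )


issue = from_rule_match({"message": "Possible agreement error",
                         "start": 4, "end": 9, "replacements": ["an"]})
print(issue)
```

Whatever rule fired, the issue type stays the same; only the message, positions and suggestion vary.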

I hope this helps,
-yves

Received on Sunday, 8 July 2012 05:11:44 UTC