Re: [ISSUE-34]: Revised description and example

There are two options, depending on how sophisticated the tool is and whether there are mappings from Language Tool's categories to our specific categories. If there are specific mappings, this would pretty clearly be a "typographical" issue. If there are not, I would expect most tools to pass it through under the "language-issue" heading.

Ideally, of course, we would have mappings from Language Tool's categories to ours, but the current proposal does in essence what Okapi does: it passes the Language Tool error through as a "language-issue" and just spits the native code out the other side. Not ideal, but given the configurability of Language Tool, it seemed a reasonable approach. But if we can do more, we should.
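
To make the two options concrete, here is a minimal Python sketch of the lookup. The mapping table, the constant names and the function name are hypothetical (neither Language Tool nor Okapi defines them); the only values taken from this thread are the rule ID, "typographical" and the "language-issue" fallback.

```python
# Hypothetical mapping from Language Tool rule IDs to our issue types.
# Only this one entry is discussed in the thread; a real table would be larger.
RULE_TO_TYPE = {
    "UPPERCASE_SENTENCE_START": "typographical",
}

# Pass-through heading used when no specific mapping exists.
FALLBACK_TYPE = "language-issue"


def classify(rule_id, mappings=RULE_TO_TYPE):
    """Return (issue type, native code) for a Language Tool rule ID."""
    issue_type = mappings.get(rule_id, FALLBACK_TYPE)
    # The native rule ID is always preserved, prefixed with the tool name.
    return issue_type, "languageTool:" + rule_id
```

With an empty table this degrades to the pure pass-through behaviour, so a tool could start there and add mappings later.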

In terms of how to represent the data, that's interesting: if the errors are just passed through, there is an abundance of information that, technically, all corresponds to translation-quality-code. One way to do it would be to have something like this:

<span
 its-translation-quality-type="typographical"
 its-translation-quality-code="languageTool:
  error_fromy-0;
  fromx-0;
  toy-0;
  tox-5;
  ruleid-UPPERCASE_SENTENCE_START;
  msg-This sentence does not start with an uppercase letter;
  replacement-This;
  context-this is a test;
  contextoffset-0;
  errorlength-4"
>this is a test.</span>

But there is something inherently unsatisfactory about that model. If we look a little closer, however, most of that is some sort of stand-off markup designed to point inside the text chunk. So if we're clever and apply a bit of fairly straightforward preprocessing and a few added span tags, we can get this:

<span
 its-translation-quality-type="typographical"
 its-translation-quality-code="languageTool:UPPERCASE_SENTENCE_START"
 its-translation-quality-note="replacement: This"
>this</span> is a test.

Which is pretty elegant and matches the proposed model cleanly and nicely.
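
For what it's worth, the preprocessing could look roughly like the Python sketch below. It is only an illustration under assumptions: the function name is made up, the attribute names are the ones from this proposal, and I am assuming that contextoffset is a zero-based character offset into context and that errorlength is a character count, which is what the sample values suggest.

```python
import xml.etree.ElementTree as ET
from html import escape

# The <error> element from Felix's message, with attributes separated by spaces.
SAMPLE = (
    '<error fromy="0" fromx="0" toy="0" tox="5" '
    'ruleId="UPPERCASE_SENTENCE_START" '
    'msg="This sentence does not start with an uppercase letter" '
    'replacements="This" context="this is a test." '
    'contextoffset="0" errorlength="4"/>'
)


def error_to_span(error_xml, issue_type="typographical"):
    """Turn one Language Tool <error> element into inline span markup."""
    err = ET.fromstring(error_xml)
    text = err.get("context")
    # Assumption: offset and length are counted in characters of `context`.
    start = int(err.get("contextoffset"))
    end = start + int(err.get("errorlength"))

    attrs = [
        ("its-translation-quality-type", issue_type),
        ("its-translation-quality-code", "languageTool:" + err.get("ruleId")),
    ]
    replacement = err.get("replacements")
    if replacement:
        attrs.append(("its-translation-quality-note", "replacement: " + replacement))

    # Escape attribute values and text so the result stays well-formed.
    attr_text = "".join(
        ' {}="{}"'.format(name, escape(value, quote=True)) for name, value in attrs
    )
    return "{}<span{}>{}</span>{}".format(
        escape(text[:start]), attr_text, escape(text[start:end]), escape(text[end:])
    )


if __name__ == "__main__":
    print(error_to_span(SAMPLE))
```

The stand-off pointers (fromx, tox, contextoffset, errorlength) disappear because the span itself now marks the location; only the rule ID and the suggested replacement need to survive as attribute values.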

One other issue to consider: most of the Language Tool stuff isn't really translation-quality: it is language-quality (I know, the boundaries are heavily blurred, to say the least) and thus, if anything, deserves more scope and recognition than we can give it.

But does my second solution seem reasonable?

Best,

Arle


--
Dr. Arle Lommel 
Senior Consultant, Language Technology Lab 
Deutsches Forschungszentrum für Künstliche Intelligenz GmbH / German Research Center for Artificial Intelligence 
Alt-Moabit 91c, D-10559 Berlin, Germany 
http://www.dfki.de 
☎: +49.30.23895.1834 (Germany) / +1.707.709.8650 (USA) / +49.30.23895.1810 (fax) 
Skype: arle_lommel 
Time zone: Central European Time (UTC+1 / UTC+2 in summer)

Firmensitz: Trippstadter Strasse 122, D-67663 Kaiserslautern 
Geschäftsführung: Prof. Dr. Dr. h.c. mult. Wolfgang Wahlster 
(Vorsitzender), Dr. Walter Olthoff 
Vorsitzender des Aufsichtsrats: Prof. Dr. h.c. Hans A. Aukes 
Amtsgericht Kaiserslautern, HRB 2313


On Aug 3, 2012, at 17:15 , Felix Sasaki <fsasaki@w3.org> wrote:

> Hi Arle,
> 
> I was looking into what languagetool is producing, see the example below, with the input sentence "this is a test", taken from
> http://www.languagetool.org/usage/
> 
> 
> <error fromy="0" fromx="0" toy="0" tox="5" ruleId="UPPERCASE_SENTENCE_START" msg="This sentence does not start with an uppercase letter" replacements="This" context="this is a test." contextoffset="0" errorlength="4"/>
> 
> 
> How would this information be represented, using your proposal?
> 
> Best,
> 
> Felix

Received on Friday, 3 August 2012 15:53:47 UTC