Re: [ISSUE-34]: Revised description and example from Felix Sasaki on 2012-08-03 (public-multilingualweb-lt@w3.org from August 2012)

From: Felix Sasaki <fsasaki@w3.org>
Date: Fri, 3 Aug 2012 18:10:16 +0200
To: Arle Lommel <arle.lommel@dfki.de>
Cc: Multilingual Web LT Public List <public-multilingualweb-lt@w3.org>
Message-ID: <CAL58czp5mbSzkaUCH410WEDHQEb2N6rYZxCOr=BERPJ0k-rr5g@mail.gmail.com>
2012/8/3 Arle Lommel <arle.lommel@dfki.de>

> There are two options, depending on how sophisticated the tool is and if
> there are mappings from Language Tool to specific categories or not. If
> there are specific mappings, this would pretty clearly be a "typographical"
> issue. If there are not, I would expect most tools to pass it through under
> the "language-issue" heading.
>

Interesting - LanguageTool produces lots of error types, and each
language version has its own names for them. See e.g.

http://languagetool.svn.sourceforge.net/viewvc/languagetool/trunk/JLanguageTool/src/rules/de/grammar.xml?content-type=text%2Fplain
here you have a category "Grammatik"
and for English
http://languagetool.svn.sourceforge.net/viewvc/languagetool/trunk/JLanguageTool/src/rules/de/grammar.xml?content-type=text%2Fplain
it is "grammar". But nothing relates these two categories to each other.
So do we assume that somebody (who?) will maintain mappings? What about
versioning of the grammar.xml files? The categories can change again from
one version to the next.
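
To make the maintenance question concrete, here is a minimal sketch of what
such a mapping could look like. Only the category names "Grammatik" and
"grammar" come from the rule files above, and "language-issue" is the
pass-through heading from the proposal; the table layout, the language keys,
and the function name map_category are assumptions for illustration.

```python
# Hypothetical mapping from (language, LanguageTool category) to a
# top-level quality type. Somebody would have to maintain this table
# and update it whenever a grammar.xml file renames a category.
CATEGORY_MAP = {
    ("de", "Grammatik"): "grammar",
    ("en", "grammar"): "grammar",
}

# Heading used when no specific mapping is known.
FALLBACK_TYPE = "language-issue"


def map_category(lang, category):
    """Return the mapped quality type, or the fallback if unmapped."""
    return CATEGORY_MAP.get((lang, category), FALLBACK_TYPE)


if __name__ == "__main__":
    print(map_category("de", "Grammatik"))
    print(map_category("fr", "Grammaire"))
```

Any category that is renamed in a later version of grammar.xml would silently
drop to the fallback, which is exactly the versioning problem.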


>
> Ideally, of course, we have mappings from Language Tool's categories to
> ours, but the current proposal does what Okapi does, in essence: it passes
> the Language Tool error through as a "language-issue" and just spits the
> native code out the other side. Not ideal, but given the configurability of
> Language Tool, it seemed a reasonable approach. But if we can do more, we
> should.
>

So it seems that "language-issue" is a basket for everything, losing a lot
of information that would actually fit into the grammar and other top-level
categories, correct?
Is the "matching" of top-level categories to existing data better for other
tools? Do we assume that the tool producers will work on their output so
that the matching gets better?


>
> In terms of how to represent the data, that's interesting, because if the
> errors are just passed through, there is an abundance of information that
> technically constitutes the bit that corresponds to
> translation-quality-code. One way to do it would be to have something
> like this:
>
> <span
> its-translation-quality-type="typographic"
> its-translation-quality-code="languageTool:
> error_fromy-0;
> fromx-0;
> toy-0;
> tox-5;
> ruleid-UPPERCASE_SENTENCE_START;
> msg-This sentence does not start with an uppercase letter;
> replacement-This;
> context-this is a test;
> contextoffset-0;
> errorlength-4"
> >this is a test.</span>
>
> But there is something inherently unsatisfactory about that model. If we
> look a little closer, however, most of that is some sort of stand-off
> markup designed to point inside the text chunk. So if we're clever and
> apply a bit of fairly straightforward preprocessing and a few added span
> tags, we can get this:
>
> <span
> its-translation-quality-type="typographic"
> its-translation-quality-code="languageTool:UPPERCASE_SENTENCE_START"
> its-translation-quality-note="replacement: This"
> >this</span> is a test.
>
> Which is pretty elegant and matches the proposed model cleanly and nicely.
>
> One other issue to consider: most of the Language Tool stuff isn't really
> *translation-quality*: it is *language-quality* (I know, the boundaries
> are heavily blurred, to say the least) and thus, if anything, deserves more
> scope and recognition than we can give it.
>
> But does my second solution seem reasonable?
>

Yes, but ... again, without a mapping (here from LanguageTool output to
"typographic"), it is probably not realistic to expect that we will have
such data.
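
As a rough sketch of the preprocessing step, assuming the error element is
well-formed XML and that contextoffset and errorlength index into the
context attribute: the rule table RULE_TYPE_MAP and the function name
error_to_span are hypothetical, and any rule missing from the table falls
back to "language-issue".

```python
import xml.etree.ElementTree as ET
from html import escape

# Hypothetical rule-to-type table; this is the mapping that somebody
# would have to supply and maintain.
RULE_TYPE_MAP = {"UPPERCASE_SENTENCE_START": "typographic"}

SAMPLE_ERROR = (
    '<error fromy="0" fromx="0" toy="0" tox="5" '
    'ruleId="UPPERCASE_SENTENCE_START" '
    'msg="This sentence does not start with an uppercase letter" '
    'replacements="This" context="this is a test." '
    'contextoffset="0" errorlength="4"/>'
)


def error_to_span(error_xml, rule_type_map=RULE_TYPE_MAP):
    """Wrap the flagged part of the context in a span with quality attributes."""
    err = ET.fromstring(error_xml)
    context = err.get("context")
    start = int(err.get("contextoffset"))
    end = start + int(err.get("errorlength"))
    rule_id = err.get("ruleId")
    # Without a mapping entry, the error is passed through generically.
    qtype = rule_type_map.get(rule_id, "language-issue")
    attrs = (
        f'its-translation-quality-type="{escape(qtype)}" '
        f'its-translation-quality-code="languageTool:{escape(rule_id)}" '
        f'its-translation-quality-note="replacement: '
        f'{escape(err.get("replacements"))}"'
    )
    return (
        escape(context[:start], quote=False)
        + f"<span {attrs}>"
        + escape(context[start:end], quote=False)
        + "</span>"
        + escape(context[end:], quote=False)
    )


if __name__ == "__main__":
    print(error_to_span(SAMPLE_ERROR))
```

With an empty table the same call still works, but every error comes out as
"language-issue", which is the information loss discussed above.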

It might be useful to try to engage tool producers in the quality area,
which we have already started to do, to see how far we can go with
respect to mappings.

Best,

Felix


> Best,
>
> Arle
>
> ------------------------------
>  --
>
> Dr. Arle Lommel
> Senior Consultant, Language Technology Lab
> Deutsches Forschungszentrum für Künstliche Intelligenz GmbH / German
> Research Center for Artificial Intelligence
> Alt-Moabit 91c, D-10559 Berlin, Germany
> http://www.dfki.de
> ☎: +49.30.23895.1834 (Germany) / +1.707.709.8650 (USA) / +49.30.23895.1810(fax)
> Skype: arle_lommel
> Time zone: Central European Time (UTC+1 / UTC+2 in summer)
> ------------------------------
>
> Firmensitz: Trippstadter Strasse 122, D-67663 Kaiserslautern
> Geschäftsführung: Prof. Dr. Dr. h.c. mult. Wolfgang Wahlster
> (Vorsitzender), Dr. Walter Olthoff
> Vorsitzender des Aufsichtsrats: Prof. Dr. h.c. Hans A. Aukes
> Amtsgericht Kaiserslautern, HRB 2313
>
> On Aug 3, 2012, at 17:15 , Felix Sasaki <fsasaki@w3.org> wrote:
>
> Hi Arle,
>
> I was looking into what languagetool is producing, see the example below,
> with the input sentence "this is a test", taken from
> http://www.languagetool.org/usage/
>
>
> <error fromy="0" fromx="0" toy="0" tox="5"
> ruleId="UPPERCASE_SENTENCE_START"
> msg="This sentence does not start with an uppercase letter"
> replacements="This" context="this is a test."
> contextoffset="0" errorlength="4"/>
>
> How would this information be represented, using your proposal?
>
> Best,
>
> Felix
>
>
>


-- 
Felix Sasaki
DFKI / W3C Fellow
Received on Friday, 3 August 2012 16:10:44 UTC