- From: Arle Lommel <arle.lommel@dfki.de>
- Date: Fri, 3 Aug 2012 17:53:16 +0200
- To: Felix Sasaki <fsasaki@w3.org>
- Cc: Multilingual Web LT Public List <public-multilingualweb-lt@w3.org>
- Message-Id: <E5E1A651-2DEF-4602-AC0B-2FE6AB3100EC@dfki.de>
There are two options, depending on how sophisticated the tool is and whether it has mappings from Language Tool's categories to our categories. If there are specific mappings, this would pretty clearly be a "typographical" issue. If there are not, I would expect most tools to pass it through under the "language-issue" heading. Ideally, of course, we have mappings from Language Tool's categories to ours, but the current proposal does what Okapi does, in essence: it passes the Language Tool error through as a "language-issue" and just spits the native code out the other side. Not ideal, but given the configurability of Language Tool, it seemed a reasonable approach. But if we can do more, we should.

How to represent the data is an interesting question, because if the errors are just passed through, there is an abundance of information that technically belongs in the bit that corresponds to translation-quality-code. One way to do it would be to have something like this:

<span its-translation-quality-type="typographical" its-translation-quality-code="languageTool: error_fromy-0; fromx-0; toy-0; tox-5; ruleid-UPPERCASE_SENTENCE_START; msg-This sentence does not start with an uppercase letter; replacement-This; context-this is a test; contextoffset-0; errorlength-4">this is a test.</span>

But there is something inherently unsatisfactory about that model. If we look a little closer, however, most of that information is stand-off markup designed to point inside the text chunk. So if we're clever and apply a bit of fairly straightforward preprocessing and a few added span tags, we can get this:

<span its-translation-quality-type="typographical" its-translation-quality-code="languageTool:UPPERCASE_SENTENCE_START" its-translation-quality-note="replacement: This">this</span> is a test.

That is pretty elegant and matches the proposed model cleanly.
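To make the preprocessing step concrete, here is a rough Python sketch of it. It is only an illustration under stated assumptions, not how Okapi or any existing tool does it: the function name wrap_error and its issue_type parameter are made up for this example, and it assumes the text being annotated is the same string Language Tool reports as context, so that contextoffset and errorlength index straight into it. Without a category mapping it falls back to "language-issue", as described above.

```python
import xml.etree.ElementTree as ET
from html import escape

# Language Tool output for "this is a test." (attributes as in Felix's example).
SAMPLE_ERROR = (
    '<error fromy="0" fromx="0" toy="0" tox="5" '
    'ruleId="UPPERCASE_SENTENCE_START" '
    'msg="This sentence does not start with an uppercase letter" '
    'replacements="This" context="this is a test." '
    'contextoffset="0" errorlength="4"/>'
)


def wrap_error(text: str, error_xml: str, issue_type: str = "language-issue") -> str:
    """Wrap the flagged part of `text` in a span carrying the rule id.

    Hypothetical helper: `issue_type` stands in for whatever category
    mapping a tool may or may not have.
    """
    err = ET.fromstring(error_xml)

    # Assumption: `text` equals the reported context string, so the
    # stand-off offsets can be used directly as string indices.
    start = int(err.get("contextoffset"))
    end = start + int(err.get("errorlength"))

    attrs = [
        ("its-translation-quality-type", issue_type),
        ("its-translation-quality-code", "languageTool:" + err.get("ruleId")),
    ]
    replacements = err.get("replacements")
    if replacements:
        attrs.append(("its-translation-quality-note", "replacement: " + replacements))

    # Escape attribute values and text so the result stays well-formed.
    attr_str = " ".join(f'{name}="{escape(value, quote=True)}"' for name, value in attrs)
    return (
        escape(text[:start], quote=False)
        + f"<span {attr_str}>"
        + escape(text[start:end], quote=False)
        + "</span>"
        + escape(text[end:], quote=False)
    )


if __name__ == "__main__":
    print(wrap_error("this is a test.", SAMPLE_ERROR))
```

The positional attributes (fromx, tox and so on) are dropped on purpose: once the span sits around the offending text, they carry no extra information.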
One other issue to consider: most of the Language Tool stuff isn't really translation-quality: it is language-quality (I know, the boundaries are heavily blurred, to say the least) and thus, if anything, deserves more scope and recognition than we can give it.

But does my second solution seem reasonable?

Best,

Arle

--
Dr. Arle Lommel
Senior Consultant, Language Technology Lab
Deutsches Forschungszentrum für Künstliche Intelligenz GmbH / German Research Center for Artificial Intelligence
Alt-Moabit 91c, D-10559 Berlin, Germany
http://www.dfki.de
☎: +49.30.23895.1834 (Germany) / +1.707.709.8650 (USA) / +49.30.23895.1810 (fax)
Skype: arle_lommel
Time zone: Central European Time (UTC+1 / UTC+2 in summer)

Firmensitz: Trippstadter Strasse 122, D-67663 Kaiserslautern
Geschäftsführung: Prof. Dr. Dr. h.c. mult. Wolfgang Wahlster (Vorsitzender), Dr. Walter Olthoff
Vorsitzender des Aufsichtsrats: Prof. Dr. h.c. Hans A. Aukes
Amtsgericht Kaiserslautern, HRB 2313

On Aug 3, 2012, at 17:15, Felix Sasaki <fsasaki@w3.org> wrote:

> Hi Arle,
>
> I was looking into what languagetool is producing, see the example below, with the input sentence "this is a test", taken from
> http://www.languagetool.org/usage/
>
> <error fromy="0" fromx="0" toy="0" tox="5" ruleId="UPPERCASE_SENTENCE_START" msg="This sentence does not start with an uppercase letter" replacements="This" context="this is a test." contextoffset="0" errorlength="4"/>
>
> How would this information be represented, using your proposal?
>
> Best,
>
> Felix
Received on Friday, 3 August 2012 15:53:47 UTC