Re: [ISSUE-34]: Revised description and example

From: Felix Sasaki <fsasaki@w3.org>
Date: Fri, 3 Aug 2012 19:03:26 +0200
Message-ID: <CAL58czoZ9xJSGFONTtct8SUVt9a-N=Ur-8N79HfpGDKY+fM0Ug@mail.gmail.com>
To: Arle Lommel <arle.lommel@dfki.de>
Cc: Multilingual Web LT Public List <public-multilingualweb-lt@w3.org>
Hi Arle,

2012/8/3 Arle Lommel <arle.lommel@dfki.de>

>> There are two options, depending on how sophisticated the tool is and if
>> there are mappings from Language Tool to specific categories or not. If
>> there are specific mappings, this would pretty clearly be a "typographical"
>> issue. If there are not, I would expect most tools to pass it through under
>> the "language-issue" heading.
> Interesting - LanguageTool produces lots of error types, and each
> language version has its own names for them. See e.g.
> http://languagetool.svn.sourceforge.net/viewvc/languagetool/trunk/JLanguageTool/src/rules/de/grammar.xml?content-type=text%2Fplain
> Here you have a category "Grammatik",
> and for English
> http://languagetool.svn.sourceforge.net/viewvc/languagetool/trunk/JLanguageTool/src/rules/de/grammar.xml?content-type=text%2Fplain
> it is "grammar". But there is no relation between these categories.
> If there were a mapping from each of the language sections to our top-level
> categories, then things would be, if not easy, at least not so bad.

There are two issues here: one is the number of languages (27, I think) and
the other is the situation (again I think; we need to check with Daniel
Naber) that there is no central authority checking the categories. So the
"Grammatik"
category we are discussing for German here might disappear (unlikely) or be
re-named in a subsequent version of

> There are only 11 categories in English:
>    - Possible Typos = typography
>    - Grammar = grammar
>    - Collocations = ??? (probably grammar)
>    - Miscellaneous = typography (mostly), but this category has little
>    internal coherence (not surprising given the name)
>    - Punctuation Errors = typography
>    - Commonly Confused Words = spelling
>    - Nonstandard Phrases = seems a split between grammar and spelling
>    - Slang = register
>    - Redundant Phrases = style
>    - Bad style = style
>    - Capitalization = typography
> Unfortunately, as you can see, in at least two cases the top-level mapping
> is insufficient because the language-tool way of dividing into categories
> does not match what I've done.
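
The English list above can be written down as a lookup table. This is only a minimal sketch: the category names and type values are copied from the list in this thread, not from a published specification, and the function name `quality_type` is made up for illustration. Categories whose mapping is marked as unclear above (Collocations, Miscellaneous, Nonstandard Phrases) are deliberately left out, so they land in the catch-all type.

```python
# Mapping from LanguageTool's English category names to the top-level
# quality types proposed in this thread. Names are copied from the list
# above; they are not taken from a published specification.
EN_CATEGORY_TO_TYPE = {
    "Possible Typos": "typography",
    "Grammar": "grammar",
    "Punctuation Errors": "typography",
    "Commonly Confused Words": "spelling",
    "Slang": "register",
    "Redundant Phrases": "style",
    "Bad style": "style",
    "Capitalization": "typography",
}

# Catch-all for categories with no clear mapping (name as used in the thread).
FALLBACK_TYPE = "language-issue"


def quality_type(category, mapping=EN_CATEGORY_TO_TYPE, fallback=FALLBACK_TYPE):
    """Return the top-level type for a category name, or the fallback."""
    return mapping.get(category, fallback)


if __name__ == "__main__":
    for name in ("Grammar", "Slang", "Collocations"):
        print(name, "->", quality_type(name))
```

Leaving the ambiguous categories out of the table, rather than guessing, keeps the uncertainty visible: anything in the catch-all type is a candidate for a later, more careful mapping.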


> Unfortunately that will always be the case for n+1 formats we consider.
> For German, the categories are:
>    - Mögliche Tippfehler = Possible Typos = typography
>    - Leicht zu verwechselnde Wörter = Commonly Confused Words = spelling
>    - Falschschreibung prominenter/geographischer Eigennamen = (not in
>    English) = spelling
>    - Zusammen-/Getrenntschreibung = (not in English) = typography
>    - Semantische Unstimmigkeiten = (not in English) = ???? (not sure here)
>    - Redundanz = Redundance = style
>    - Stil, Umgangssprache = Bad style & Slang = style
>    - Briefe und E-Mails = (not in English) = mostly grammar, it seems
>    - Groß-/Kleinschreibung = capitalization = typography
>    - Grammatik = grammar = grammar
>    - Redewendungen = ??? = mostly spelling, I think, but maybe grammar
>    (the two obviously overlap in German)
>    - Zeichensetzung = Punctuation errors = typography
>    - Typographie = Possible Typos = typography
>    - Sonstiges = Miscellaneous = (see note above)
> So the issue won't be super easy, but it also looks tractable.
> So do we assume that somebody (who?) will maintain mappings? How about
> versioning of the grammar.xml files? In a different version, again the
> categories can change.
> This is actually why I included a "language-issue" category: Unfortunately
> it would become the dumping ground for all sorts of things that aren't
> maintained. If they were able to add a field to support our category names
> and do the mapping as part of their project, that would be great, but it is
> probably not realistic to ask them to do this.
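
On the versioning question, one way to notice when a new grammar.xml adds or renames a category is to list the category names that have no entry in the mapping. The following is a sketch under assumptions: it assumes that categories appear in grammar.xml as `<category name="...">` elements, the sample XML is invented for the example, and the mapping holds only a few of the German entries from the list above.

```python
import xml.etree.ElementTree as ET

# A few of the German mappings from the list above (not the full set).
DE_CATEGORY_TO_TYPE = {
    "Mögliche Tippfehler": "typography",
    "Leicht zu verwechselnde Wörter": "spelling",
    "Redundanz": "style",
    "Grammatik": "grammar",
    "Zeichensetzung": "typography",
}


def unmapped_categories(grammar_xml, mapping):
    """Return the sorted category names in the XML that the mapping lacks.

    Assumes categories are <category name="..."> elements.
    """
    root = ET.fromstring(grammar_xml)
    names = (cat.get("name") for cat in root.iter("category"))
    return sorted({n for n in names if n is not None and n not in mapping})


# Invented sample, standing in for a newer version of the rules file.
SAMPLE = """<rules lang="de">
  <category name="Grammatik"><rule id="R1"/></category>
  <category name="Neue Kategorie"><rule id="R2"/></category>
</rules>"""

if __name__ == "__main__":
    print(unmapped_categories(SAMPLE, DE_CATEGORY_TO_TYPE))
```

Running such a check against each new version of the rules would at least show which categories have silently fallen into the catch-all type.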

So how about renaming "language-issue" to "uncategorized"? It seems that
this is closer to the situation we have here: many LanguageTool categories
would fit into one of the top-level categories. The issue is just that the
mapping is not defined.

> Is the "matching" of top-level categories to existing data better for
> other tools? Do we assume the tool producers will work on their output so
> that the matching gets better? So it seems that "language-issue" is a
> basket for everything, losing a lot of information that would actually
> fit into the grammar and other top-level categories, correct?
> It would be a lot easier for most translation tool developers to do it, in
> part because my list was designed based on what they were doing. Most of
> the translation-specific error sets are pretty simple: a few dozen types at
> most, with the mappings pretty straightforward.

Understood. For me (and also to shorten this discussion) it would really
be helpful to see the examples here. And this should also go into the spec
IMO - not as a normative part, but as an appendix explaining the current
data available.

We had discussed such an appendix before - do you think you could create it
before the September f2f?

> I've actually started a mapping from four different tools as an example.
> I'd hoped to have it ready today, but don't quite have it yet.

Great, so forget about my September f2f question.

> <snip>
> Yes, but ... again, without a mapping, here from language tool output to
> "typographic", it is probably not realistic that we will have such data.
> Worst case, if we process the stand-off markup, is we end up with this:
> <span
> its-translation-quality-type="language-issue"

the type I would call "uncategorized".

> its-translation-quality-code="languageTool:UPPERCASE_SENTENCE_START"
> its-translation-quality-note="replacement: This"
> >this</span> is a test.
To understand what needs to be replaced, you need character offset
information. LanguageTool provides that - is there a means to make use
of it?
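
To illustrate how offset information could be used, here is a minimal sketch that turns an offset and length into the inline markup quoted above. The function name `wrap_issue` and its parameters are hypothetical; how the offset and length are obtained from the checker's output is left open here, and the attribute names are simply those from the example in this thread.

```python
from html import escape


def wrap_issue(text, offset, length, code, note, qtype="language-issue"):
    """Wrap text[offset:offset+length] in a span carrying the quality attributes.

    offset and length are character positions in `text`; they are assumed
    to come from the checker's output.
    """
    if offset < 0 or length <= 0 or offset + length > len(text):
        raise ValueError("offset/length outside the text")
    before = text[:offset]
    flagged = text[offset:offset + length]
    after = text[offset + length:]
    attrs = (
        f'its-translation-quality-type="{escape(qtype)}" '
        f'its-translation-quality-code="{escape(code)}" '
        f'its-translation-quality-note="{escape(note)}"'
    )
    # Escape the text content too, so the result stays well-formed markup.
    return (
        f"{escape(before, quote=False)}"
        f"<span {attrs}>{escape(flagged, quote=False)}</span>"
        f"{escape(after, quote=False)}"
    )


if __name__ == "__main__":
    print(wrap_issue(
        "this is a test.", 0, 4,
        code="languageTool:UPPERCASE_SENTENCE_START",
        note="replacement: This",
    ))
```

With offset 0 and length 4 this marks exactly the word "this", so a consumer knows which characters the suggested replacement applies to.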

> That's really not so bad. Not as specific as having
> its-translation-quality-type="typographic", but still pretty useful.
> It might be useful to try to engage tool producers in the quality area -
> which we have started already - to see how far we can go with respect to
> mappings.
> I agree, but we run the risk of getting something too big if we try to
> engage all of language quality rather than just the commonly done
> translation-specific stuff.

Sure - I wasn't talking about broadening the scope, rather having more tool
data available.

> So let's talk Monday about what we want to do.



> -Arle

Felix Sasaki
DFKI / W3C Fellow
Received on Friday, 3 August 2012 17:03:52 UTC
