Re: [ISSUE-34]: Revised description and example from Arle Lommel on 2012-08-03 (public-multilingualweb-lt@w3.org from August 2012)

From: Arle Lommel <arle.lommel@dfki.de>
Date: Fri, 3 Aug 2012 18:49:18 +0200
To: Felix Sasaki <fsasaki@w3.org>
Cc: Multilingual Web LT Public List <public-multilingualweb-lt@w3.org>
Message-Id: <A9A14FC9-6F42-44DB-913C-A02975B6251E@dfki.de>
> 2012/8/3 Arle Lommel <arle.lommel@dfki.de>
> There are two options, depending on how sophisticated the tool is and if there are mappings from Language Tool to specific categories or not. If there are specific mappings, this would pretty clearly be a "typographical" issue. If there are not, I would expect most tools to pass it through under the "language-issue" heading.
> 
> Interesting - LanguageTool produces lots of types of errors, and each language version has its own names for the error types. See e.g.
> 
> http://languagetool.svn.sourceforge.net/viewvc/languagetool/trunk/JLanguageTool/src/rules/de/grammar.xml?content-type=text%2Fplain
> here you have a category "Grammatik"
> and for English
> http://languagetool.svn.sourceforge.net/viewvc/languagetool/trunk/JLanguageTool/src/rules/de/grammar.xml?content-type=text%2Fplain
> it is "grammar". But there is no relation between these categories.

If there were a mapping from each of the language sections to our top-level categories, then things would get, if not easy, at least not so bad. There are only 11 categories in English:

Possible Typos = typography
Grammar = grammar
Collocations = ??? (probably grammar)
Miscellaneous = typography (mostly), but this category has little internal coherence (not surprising given the name)
Punctuation Errors = typography
Commonly Confused Words = spelling
Nonstandard Phrases = seems a split between grammar and spelling
Slang = register
Redundant Phrases = style
Bad style = style
Capitalization = typography

Unfortunately, as you can see, in at least two cases the top-level mapping is insufficient because the LanguageTool way of dividing into categories does not match what I've done. Unfortunately that will always be the case for n+1 formats we consider.

For German, the categories are:

Mögliche Tippfehler = Possible Typos = typography 
Leicht zu verwechselnde Wörter = Commonly Confused Words = spelling
Falschschreibung prominenter/geographischer Eigennamen = (not in English) = spelling
Zusammen-/Getrenntschreibung = (not in English) = typography
Semantische Unstimmigkeiten = (not in English) = ???? (not sure here)
Redundanz = Redundant Phrases = style
Stil, Umgangssprache = Bad style & Slang = style 
Briefe und E-Mails = (not in English) = mostly grammar, it seems
Groß-/Kleinschreibung = Capitalization = typography
Grammatik = Grammar = grammar
Redewendungen = ??? = mostly spelling, I think, but maybe grammar (the two obviously overlap in German)
Zeichensetzung = Punctuation Errors = typography
Typographie = Possible Typos = typography
Sonstiges = Miscellaneous = (see note above)

So the issue won't be super easy, but it also looks tractable.
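
A minimal Python sketch of such a mapping, assuming the category names from the two lists above are used as lookup keys. The table layout, the `map_category` helper, and the decision to leave out the rows marked "???" or described as split between categories are assumptions made for illustration; nothing here is defined by LanguageTool or by the ITS draft.

```python
# Hypothetical per-language lookup tables built from the lists above.
# Only the rows with a single clear target are included; ambiguous rows
# (Collocations, Miscellaneous, Nonstandard Phrases, Redewendungen, ...)
# are deliberately omitted so that they fall through to the fallback.
FALLBACK = "language-issue"

CATEGORY_MAP = {
    "en": {
        "Possible Typos": "typography",
        "Grammar": "grammar",
        "Punctuation Errors": "typography",
        "Commonly Confused Words": "spelling",
        "Slang": "register",
        "Redundant Phrases": "style",
        "Bad style": "style",
        "Capitalization": "typography",
    },
    "de": {
        "Mögliche Tippfehler": "typography",
        "Leicht zu verwechselnde Wörter": "spelling",
        "Falschschreibung prominenter/geographischer Eigennamen": "spelling",
        "Zusammen-/Getrenntschreibung": "typography",
        "Redundanz": "style",
        "Stil, Umgangssprache": "style",
        "Groß-/Kleinschreibung": "typography",
        "Grammatik": "grammar",
        "Zeichensetzung": "typography",
        "Typographie": "typography",
    },
}


def map_category(lang: str, category: str) -> str:
    """Return the top-level type for a tool category, or the fallback."""
    return CATEGORY_MAP.get(lang, {}).get(category, FALLBACK)
```

With these tables, `map_category("de", "Grammatik")` gives `"grammar"`, while an unmapped category such as `map_category("en", "Collocations")`, or any category from a language without a table, comes back as `"language-issue"`.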

> So do we assume that somebody (who?) will maintain mappings? How about versioning of the grammar.xml files? In a different version, again the categories can change.

This is actually why I included a "language-issue" category. Unfortunately, it would become the dumping ground for all sorts of things that aren't maintained. If the LanguageTool developers were able to add a field to support our category names and do the mapping as part of their project, that would be great, but it is probably not realistic to ask them to do this.


> Is the "matching" of top level categories to existing data better for other tools? Do we assume the tool producers will work on their output so that the matching gets better? So it seems that "language-issue" is a basket for everything, actually losing a lot of information which would fit into the grammar and other top level categories, correct?

It would be a lot easier for most translation tool developers to do it, in part because my list was designed based on what they were doing. Most of the translation-specific error sets are pretty simple: a few dozen types at most, with pretty straightforward mappings. I've actually started a mapping from four different tools as an example. I'd hoped to have it ready today, but don't quite have it yet.

<snip>

> Yes, but ... again, without a mapping, here from language tool output to "typographic", it is probably not realistic that we will have such data.

Worst case, if we process the stand-off markup, we end up with this:

<span
	its-translation-quality-type="language-issue"
	its-translation-quality-code="languageTool:UPPERCASE_SENTENCE_START"
	its-translation-quality-note="replacement: This"
>this</span> is a test.

That's really not so bad. Not as specific as having its-translation-quality-type="typographic", but still pretty useful.
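
As a rough sketch of how a tool might emit that fallback markup, the Python below builds the same span from a rule ID and a note. The attribute names are copied from the example above; the `quality_span` function, its parameters, and the default `languageTool` prefix are hypothetical and only illustrate the idea.

```python
from html import escape


def quality_span(text, rule_id, note,
                 issue_type="language-issue", tool="languageTool"):
    """Wrap `text` in a span carrying the quality attributes shown above.

    `issue_type` defaults to the catch-all "language-issue"; a tool with a
    real category mapping could pass a more specific value instead.
    """
    attrs = {
        "its-translation-quality-type": issue_type,
        "its-translation-quality-code": f"{tool}:{rule_id}",
        "its-translation-quality-note": note,
    }
    # Escape attribute values and element content so the output stays well-formed.
    rendered = " ".join(
        f'{name}="{escape(value, quote=True)}"' for name, value in attrs.items()
    )
    return f"<span {rendered}>{escape(text)}</span>"
```

Calling `quality_span("this", "UPPERCASE_SENTENCE_START", "replacement: This")` yields the span from the example on a single line; passing `issue_type="typographic"` would give the more specific variant where a mapping exists.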


> It might be useful to try to engage tool producers in the quality area - what we have started already - to see the limits we can go to with respect to mappings.

I agree, but we run the risk of getting something too big if we try to engage all of language quality rather than just the commonly done translation-specific stuff.

So let's talk Monday about what we want to do.

-Arle
Received on Friday, 3 August 2012 16:49:49 UTC