- From: Felix Sasaki <fsasaki@w3.org>
- Date: Mon, 10 Jun 2013 20:05:01 +0200
- To: joerg@bioloom.de
- CC: Jirka Kosek <jirka@kosek.cz>, Arle Lommel <arle.lommel@dfki.de>, public-i18n-its-ig@w3.org, kim_harris@textform.com, Hans Uszkoreit <uszkoreit@dfki.de>, Aljoscha Burchardt <aljoscha.burchardt@dfki.de>
- Message-ID: <51B6154D.5030906@w3.org>
Hi Arle, all,

Arle, thanks a lot for discussing one issue I mentioned at
http://lists.w3.org/Archives/Public/public-i18n-its-ig/2013Jun/0000.html
What are your thoughts about the other issues 1) and 3)? With regard to
1) and also the markup issue 2), it would be great to have the spec
draft for MQM available - is that possible?

Now, about the overlapping markup: as Jirka mentioned, each solution has
drawbacks. Below are two proposals that try to live with these.

1) Using NIF

You are correct that the NIF solution has the drawback of fixed
character offsets. In an XSLT stylesheet that converts markup to NIF
(here only for "text analysis", but I can easily create that for LQI)
http://www.w3.org/People/fsasaki/its20-general-processor/tools/its-ta-2-nif.xsl
I have resolved this by having a white space stripping step
http://www.w3.org/People/fsasaki/its20-general-processor/tools/stripping.xsl
that is also included in the back-conversion from NIF
http://www.w3.org/People/fsasaki/its20-general-processor/tools/nif-2-its-ta.xsl
so all your editing environment needs to check is whether the textual
content of the annotated string has changed. I assume that during a
quality check step the non white space content must not be changed
anyway, so such a check is easy to do.

2) Using a copy of the same content for overlapping annotations

Another solution is to copy the strings that have issues so that users
can annotate overlapping issues in parallel. This is what we did in the
class in Potsdam: having an XLIFF file with several "alt-trans"
elements. Each of these contains the same translation, but different lqi
markup - see the attachment. So your example would look like this:

1st annotation
<p>Fifteen <span its-loc-quality-issue-type="..."
...><em>relays is</em></span> involved in the operation.</p>

2nd annotation
<p>Fifteen <em>relays <span itsx-mqm-issue-type="agreement" ...>is</span></em> involved in the operation.</p>

Of course this has the drawback that you need to copy the original text.
But again (like NIF) it has the advantage that you can make use of the
document tree structure (see below), and (unlike NIF) you can stay in
the markup world.

3) Milestone proposal

Your original solution and also the processing instruction proposal fall
under the category of "milestones", see
http://conferences.idealliance.org/extreme/html/2004/Witt01/EML2004Witt01.html#t2-1-1
"milestone elements: empty elements that mark the boundaries between
elements in a non-nesting structure"

The problem is that with this solution you lose the document tree
structure and have the following problems:

a) As Jirka said, no document structure restriction can be stated: the
related start and end "id" attributes can appear anywhere in the
document. How do you want to validate the annotations?

b) Processing the milestones: in approaches 1) and 2) it is easy to do
an analysis like this: "give me all HTML 'a' elements that have an
issue". You can write an XPath expression like
//h:a[@its-loc-quality-issue-type]
and you are done. With your approach that is not feasible: you cannot
easily check "is there a milestone in 'a'?", since there is no guarantee
that both the start and the end milestone are inside 'a'. Such checks
are doable in principle, but the performance will suffer a lot.

c) Given b), you will need a special purpose annotation tool to work
with the regions. A general XML editor can handle solutions 1) (in
combination with XSLT) and 2). For solution 3) you will need special
purpose tools - do you expect widespread adoption and good
interoperability between these?

The "processing of milestones" argument, including performance, would be
the killer argument for me.
I have seen a few tools implementing milestones and similar solutions,
similar in the sense of breaking up the document structure. None of
these could deal with large data sets, and none of them found much
adoption (this was in the area of linguistic data sets, btw). Since both
1) and 2) rely on standard XML processing, you won't run into such
issues.

Best,

Felix

On 10.06.13 13:38, Jörg Schütz wrote:
> +1 for PIs. We have always used PIs in language checking applications
> (spelling mistakes, grammar and style errors), and they worked very well.
>
> Cheers -- Jörg
>
> On June 10, 2013 at 12:03 (CEST), Jirka Kosek wrote:
>> On 10.6.2013 11:26, Arle Lommel wrote:
>>
>>> We need a way to mark up overlapping spans. For example, if you have
>>> the following HTML5 segment:
>>
>> Haha, overlapping markup. There are several common ways how to handle
>> overlapping markup. But if you want to stick to XML syntax and
>> datamodel, no solution will be completely perfect and elegant.
>>
>>> The mapping from MQM to ITS 2.0 is clear here, but we need a way to
>>> mark up the overlapping spans. So far we have internally used
>>> something like this:
>>>
>>> <p>Fifteen <mqm-startIssue type="markup, misplaced" id="1"
>>> /><em>relays <mqm-startIssue type="agreement" id ="2"
>>> />is</em><mqm-endIssue id="1" /> involved<mqm-endIssue id="2" /> in
>>> the operation.</p>
>>>
>>> We want a good path to interoperability with ITS. So we need a way
>>> to put the following information in the document on overlapping
>>> spans using local markup:
>>>
>>> its-loc-quality-issue-type="grammar" itsx-mqm-issue-type="agreement"
>>> its-loc-quality-comment="should be "relays are"" (etc…)
>>>
>>> Any suggestions for how to handle this use case?
>>> We want to make it as easy as possible to use MQM and ITS together,
>>> where MQM provides mechanisms for greater granularity while still
>>> retaining compatibility with ITS and ITS provides a way to share MQM
>>> data at a common granularity with other systems.
>>
>> Your example above means that elements like <mqm-startIssue/> and
>> <mqm-endIssue/> have to be allowed anywhere in document which requires
>> change in schema. For this reason it's very common to use processing
>> instructions instead -- they are opaque to schema validation and no
>> schema change is necessary, for example:
>>
>> <p>Fifteen <?mqm-startIssue type="markup, misplaced" id="1" ?><em>relays
>> <?mqm-startIssue type="agreement" id ="2" ?>is</em><?mqm-endIssue id="1"
>> ?> involved<?mqm-endIssue id="2" ?> in the operation.</p>
>>
>> Jirka
>>
>
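
As a rough illustration of points 1) and 3 b) in the reply above, the
following Python sketch shows the two checks that stay simple when the
annotations live in ordinary attributes: the XPath-style query for 'a'
elements carrying an issue, and a comparison of two versions of a string
with white space ignored. Everything named here is an assumption made for
illustration: the sample document, the function names, the binding of the
"h" prefix to the XHTML namespace, and the choice to drop all white space
before comparing (stripping.xsl may normalize differently).

```python
import xml.etree.ElementTree as ET

# Assumption: the "h" prefix in //h:a[...] is bound to the XHTML namespace.
XHTML_NS = "http://www.w3.org/1999/xhtml"

# Hypothetical sample document, not taken from the attachment.
SAMPLE = """<html xmlns="http://www.w3.org/1999/xhtml"><body>
<p>See <a href="#r1" its-loc-quality-issue-type="grammar">relays is</a> and
<a href="#r2">the operation</a>.</p>
</body></html>"""


def links_with_issues(xml_text):
    """Return all XHTML 'a' elements that carry an issue-type attribute."""
    root = ET.fromstring(xml_text)
    # ElementTree supports a subset of XPath; ".//" is its form of "//".
    return root.findall(".//h:a[@its-loc-quality-issue-type]", {"h": XHTML_NS})


def same_non_whitespace_text(before, after):
    """True if two XML fragments have the same text once white space is dropped."""
    def stripped(xml_text):
        text = "".join(ET.fromstring(xml_text).itertext())
        return "".join(text.split())  # remove every run of white space
    return stripped(before) == stripped(after)


if __name__ == "__main__":
    for link in links_with_issues(SAMPLE):
        print(link.get("href"), link.get("its-loc-quality-issue-type"))
```

With milestone markup the first function has no direct equivalent, because
the start and end markers of one issue may sit in different parents and
have to be paired by "id" while walking the whole document.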
Attachments
- text/xml attachment: multiple-annotations.xml
Received on Monday, 10 June 2013 18:05:38 UTC