Re: idValue requirement updated from David Lewis on 2012-05-02 (public-multilingualweb-lt@w3.org from May 2012)

From: David Lewis <dave.lewis@cs.tcd.ie>
Date: Thu, 03 May 2012 00:20:32 +0100
To: Yves Savourel <ysavourel@enlaso.com>
CC: public-multilingualweb-lt@w3.org
Message-ID: <4FA1C140.2000009@cs.tcd.ie>
Hi Yves,
You are quite right, it is difficult to come up with a scheme that 
both guarantees persistence in fragment IDs and minimises impact on the 
source document structure.

As always, the willingness to absorb the pain involved in implementing 
systems that provide such guarantees, e.g. in this case updating ids 
when source documents are revised, must be balanced by the business 
importance of successfully addressing the use case.

In this regard I observe:
1) that many companies are shifting emphasis in customer care from 
solely professionally developed manuals to wikis that evolve over time, 
user Q-A forums and mixes of these following the Stack Exchange model. 
Localising such fluid content would favour more comprehensive, fine 
grained segment tracking.

2) that many companies have already invested in solutions to support 
integrated fine grained version management for software localisation, 
e.g. see presentations by Microsoft 
(http://www.localisation.ie/resources/conferences/2011/presentations/LRCXVI_How%20Cloud%20Based%20Technology%20Improves%20Localisation.pdf) 
with similar systems in place at companies such as Oracle and Symantec. 
If the observation (1) leads web content to exhibit the more fluid 
translatable content characteristics of software, client companies may 
well be willing to accept the overhead of markup and maintenance to 
ensure smooth fine grain revisions (especially if this is eased by some 
ITS support and resulting plug-ins for CMS).  Similarly, the need to 
encompass software localisation management techniques alongside web 
localisation is going to face us soon with the growth of web scripting. 
Even if this isn't in scope for ITS 2.0, thinking about the likely 
implications will help future-proof ITS 2.0 somewhat.

It would be great to get some input from some of the clients in the 
workgroup about how important they see such fine-grained, roundtrip 
segment tracking requirements - Des, Jan, Ryan? Are others seeing such 
needs emerging from clients, Pedro, Moritz, Phil?

In the meantime, notwithstanding the different id mechanisms that we 
could use, should we flesh out the idValue with a statement about a 
requirement for a global document data category that indicates the rules 
active in that document for managing IDs?
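
As a very rough sketch of what such a per-document ruleset might drive, 
the cascade quoted further down could look something like the following 
in Python. Everything here is an assumption for illustration only: the 
function name assign_ids, the rule labels, the allow_new_ids flag and the 
id formats are hypothetical. Rule 2 is interpreted as "nearest identified 
ancestor plus element name and position", the hash is a simplified 
stand-in for the NIF context-hash recipe, and rules 4 (new elements) and 
6 (offset-based pointers) are left out.

```python
import hashlib
import xml.etree.ElementTree as ET

SAMPLE = ('<doc><sec id="s1"><p>Hello</p><p id="p2">World</p></sec>'
          '<note>Loose text</note></doc>')


def assign_ids(root, allow_new_ids=False):
    """Return (rule, id_value, tag) tuples in document order.

    Hypothetical cascade: the first rule that applies to an element wins.
    """
    results = []

    def visit(elem, ancestor_id, index):
        existing = elem.get("id")
        if existing:
            # Rule 1: reuse an id already present in the source.
            rule, value = "existing-id", existing
        elif ancestor_id:
            # Rule 2: combine the nearest known id with name and position.
            rule = "ancestor-id+name"
            value = f"{ancestor_id}/{elem.tag}[{index}]"
        elif allow_new_ids:
            # Rule 3: write a new id into the existing element.
            rule, value = "new-id", f"gen-{len(results) + 1}"
            elem.set("id", value)
        else:
            # Rule 5 (simplified): hash of the element's text content.
            text = "".join(elem.itertext())
            digest = hashlib.sha1(text.encode("utf-8")).hexdigest()[:12]
            rule, value = "context-hash", f"hash-{digest}"
        results.append((rule, value, elem.tag))
        # A hash is content-dependent, so do not build child ids on it.
        base = None if rule == "context-hash" else value
        for i, child in enumerate(elem, start=1):
            visit(child, base, i)

    visit(root, None, 1)
    return results


if __name__ == "__main__":
    for rule, value, tag in assign_ids(ET.fromstring(SAMPLE)):
        print(tag, rule, value)
```

The list of rule labels actually used is the sort of thing the global 
data category could record, so that a later round trip applies the same 
rules. Note that only the ids from rules 1 and 3 would survive arbitrary 
source revisions, which is Yves' point.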

I've also raised an issue [ISSUES-8] to help us track this, especially 
as there are many ramifications that may pop up in other data category 
use cases.

many thanks,
Dave


On 02/05/2012 11:51, Yves Savourel wrote:
>
> Hi Dave,
>
> I may be wrong, but it looks to me that an ID value that can be 
> maintained over the course of several modifications of the source 
> content it identifies cannot be auto-generated with certainty. The six 
> cascading methods you described are getting close to achieving that, 
> but, as far as I can tell, can’t guarantee it.
>
> Such an ID probably has to exist in the source document and be 
> maintained by it.
>
> In addition you are talking about a segment ID rather than an ID on 
> some existing unit of the source document. This means the possible 
> addition of markup at the middle of plain text paragraphs. All this 
> would work if we have some extra layer of identification on top of the 
> original content, something partially similar to xml:tm. But I’m 
> doubtful it’s realistic for most formats.
>
> Maybe the best the idValue data category could do is convey 
> information about existing ways in the original document to get a 
> unique and maintainable ID for a given node. In other words the #1 and 
> #2 of your cascading rules. It’s already something important.
>
> Maybe exploring ways to build additional identification information is 
> worthwhile though. Being able to create some “companion IDs” using the 
> content around a given node would certainly be useful for some 
> localization tasks like perfect matching. I could imagine the use of 
> some rule that would allow generating “context values” that could be 
> stored in TM repositories and re-used across different tools.
>
> Cheers,
>
> -yves
>
> *From:* David Lewis [mailto:dave.lewis@cs.tcd.ie]
> *Sent:* Tuesday, May 01, 2012 2:54 PM
> *To:* public-multilingualweb-lt@w3.org
> *Subject:* Re: idValue requirement updated
>
> Hi Yves,
> I think you are right about me thinking more along the lines of XLIFF 
> id rather than XLIFF resname, but perhaps not exactly in the way you 
> characterise it.
>
> I am thinking in terms of an id that can be used to track the progress 
> of a specific segment in the content documents (let's park the use of 
> multi-segment translation units in XLIFF for the moment) against the 
> corresponding XLIFF id.  However, I'm specifically concerned with the 
> round trip use cases where the document may pass from a CMS to an 
> XLIFF cycle and back again several times. The use cases I see for this 
> are driven by the need for more continuous translation, pipelined at 
> the granularity of the segment rather than the document, rather than 
> one-off hand-overs of documents between processes. Possible use cases 
> might be:
>
> 1) a document is having its source revised and is being translated at 
> the same time. Readiness of different elements is signalled in the 
> document using the readiness/processTrigger data category, which is 
> monitored by an LSP which provides updates of segments to be translated 
> based on these flags and distributes translations using XLIFF. 
> Consistent mapping between all segments and XLIFF translation unit 
> ids is required to ensure that new, modified and deleted trans-units 
> are correctly updated and kept in sequence.
>
> 2) Translations from one LSP may be undergoing monolingual review 
> through direct access to the target on the CMS, while selected 
> bi-lingual translation review is being conducted in parallel by 
> another LSP. Feedback from both reviews may need to be routed back to 
> the translating LSP, so document element-to-XLIFF mappings would 
> need to be reliably maintained for the two sets of XLIFF ids operated 
> by two different LSPs.
>
> In these sorts of use cases, where there is ongoing round-tripping 
> between the CMS and TMS/XLIFF, then the need for consistent mapping 
> between the source document on the CMS and the versions LSPs have, may 
> soften the assumption that clients won't be willing to add additional 
> elements to the document on the CMS. One can imagine that any 
> augmented versions of the content documents would live on a 'staging' 
> CMS while it is subject to preparation, translation and review, but 
> prior to publication.
>
> So, this implies a need for an id that is indeed relevant just to the 
> localization process, but that nevertheless needs to support a 
> persistent mapping between CMS element and trans-unit ID, potentially 
> over several CMS-TMS roundtrips. The difference to resname as I 
> understand it, is that resname is optional and in a sense best effort 
> - if you can't map a trans-unit back to a particular element in the 
> source, you can still try and translate the string, you just lose 
> some contextual info. So it doesn't have the requirement to 
> comprehensively *maintain a mapping between all* trans-units and 
> source content elements in the way I think the above use cases require.
>
> Hope that explains the requirement I had in mind a bit more clearly.
>
> Finally, I'm not sure in any of these cases we are talking about an 
> explicit id data category, are we?
>
> Would the implementation in fact be rules for generating and 
> maintaining the mapping between source elements and XLIFF ids? Very 
> speculatively, these could be expressed as some cascading rules for 
> using: 1st) existing ids if present; 2nd) combo rules of ID and 
> element names as in your updated text; 3rd) if allowed new id in 
> existing elements; 4th) if allowed new elements with specific ids; 
> 5th) some sort of external hashing pointer (e.g. 
> http://nlp2rdf.org/nif-1-0#toc-nif-recipe-context-hash-based-uris) ; 
> 6th) some sort of character count-based pointer (e.g. 
> http://nlp2rdf.org/nif-1-0#toc-nif-recipe-offset-based-uris).  It 
> would be a ruleset applicable to the document that we would need to 
> record.
>
> cheers,
> Dave
>
>
>
>
> On 01/05/2012 15:20, Yves Savourel wrote:
>
> I guess what we need to clarify is what are the requirements of the ID value we are discussing.
>   
> To me it should be:
> - unique at least within the document
> - the value should be the same in new versions of the document
>   
> That's because the type of tasks I would use it for are tasks across versions of the same document.
>   
> But Dave, you are maybe thinking of something different: how to get an ID valid for a given document during its localization cycle. In other words a value that doesn't need to survive after the document is done.
>   
> In other words you are thinking XLIFF 'id' and I'm thinking XLIFF 'resname'.
>   
> Cheers,
> -ys
>
Received on Wednesday, 2 May 2012 23:20:58 UTC