- From: Felix Sasaki via GitHub <sysbot+gh@w3.org>
- Date: Mon, 30 May 2016 15:34:46 +0000
- To: public-annotation@w3.org
> I also recognize that whitespace normalization would destroy "layout" such as that represented by pre. I think this is expected. If one wants document fidelity, use text positions and extract the layout, not just the plain text.
>
> The tools I am referring to use text to process content as part of a text analytics pipeline. Text analytics tools as of today only understand plain text. So at some point in the text analytics pipeline you have to get rid of the markup, and you have to decide what to do with the whitespace. That decision will always be format specific; see the DocBook programlisting example.
>
> So my point is: you can describe the steps needed only to some extent. At some point implementations have to look into the specifics of the formats they process. That is why step two at https://github.com/w3c/web-annotation/issues/227#issuecomment-222330988 is hard to formulate as a MUST requirement.
>
> I am coming from an XLIFF extracting and merging point of view, which is the same as text analytics, but with the outcome put back into the original content (= roundtripping). Specs like XLIFF wisely do not specify the details of such processes, but say "be careful about them" - and then there are, to some extent (though not enough), format-specific guidelines on how to do this. I am not asking for such guidelines here, just trying to explain how big this Pandora's box is.

-- 
GitHub Notification of comment by fsasaki
Please view or discuss this issue at https://github.com/w3c/web-annotation/issues/227#issuecomment-222516176 using your GitHub account
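
[Editorial sketch] To make the format-specific whitespace decision above concrete, here is a minimal Python sketch of stripping DocBook-like markup for a plain-text analytics pipeline: whitespace inside programlisting and similar layout-significant elements is kept verbatim, while all other text is collapsed. The PRESERVE set and the collapse-to-one-space policy are illustrative assumptions, not anything DocBook or the Web Annotation spec mandates.

    import re
    import xml.etree.ElementTree as ET

    # Elements whose whitespace/layout must survive extraction (assumed
    # list for illustration; a real tool would decide per format).
    PRESERVE = {"programlisting", "literallayout", "screen"}

    def extract_text(elem, in_preserve=False):
        """Return plain text of elem, normalizing whitespace except
        inside layout-significant elements."""
        in_preserve = in_preserve or elem.tag in PRESERVE
        def clean(s):
            # The lossy, format-specific decision the comment is about.
            return s if in_preserve else re.sub(r"\s+", " ", s)
        parts = [clean(elem.text or "")]
        for child in elem:
            parts.append(extract_text(child, in_preserve))
            # A child's tail belongs to the parent's whitespace context.
            parts.append(clean(child.tail or ""))
        return "".join(parts)

    doc = ET.fromstring(
        '<para>Run the   tool:\n'
        '<programlisting>for f in *.xml; do\n  check "$f"\ndone</programlisting>'
        ' and   check the output.</para>'
    )
    # Prose whitespace is collapsed; the listing keeps its line breaks.
    print(extract_text(doc))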
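
[Editorial sketch] And a toy illustration of the extract-and-merge roundtrip mentioned at the end of the comment. Real XLIFF represents inline markup with dedicated inline elements (e.g. ph/pc in XLIFF 2.0) rather than the numbered {0} placeholders assumed here; this sketch only shows why the merge step has to know exactly what extraction did.

    import re

    def extract(segment):
        """Replace inline tags with numbered placeholders so a text
        process sees only plain text; return (text, saved tags)."""
        tags = []
        def repl(m):
            tags.append(m.group(0))
            return "{%d}" % (len(tags) - 1)
        return re.sub(r"<[^>]+>", repl, segment), tags

    def merge(text, tags):
        """Restore the original inline tags in place of the placeholders."""
        return re.sub(r"\{(\d+)\}", lambda m: tags[int(m.group(1))], text)

    source = 'Click <b>Save</b> to store the <i>draft</i>.'
    text, tags = extract(source)   # 'Click {0}Save{1} to store the {2}draft{3}.'
    processed = text.upper()       # stand-in for translation / text analytics
    print(merge(processed, tags))  # 'CLICK <b>SAVE</b> TO STORE THE <i>DRAFT</i>.'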
Received on Monday, 30 May 2016 15:34:48 UTC