Re: [web-annotation] Reference to text encoding in spec perhaps not appropriate

> I also recognize that whitespace normalization would destroy
> "layout" such as represented by pre. I think this is expected. If one
> wants document fidelity, use text positions and extract the layout,
> not just the plain text.

The tools I am referring to process content as plain text, as part of
a text analytics pipeline. Text analytics tools today only understand
plain text. So at some point in the pipeline you have to get rid of
the markup - and have to decide what to do with the white space. That
decision will always be format specific; see the DocBook
programlisting example. So my point is: you can describe the steps
needed only to some extent. At some point implementations have to
look into the specifics of the formats they process. That is why step
two at
https://github.com/w3c/web-annotation/issues/227#issuecomment-222330988
is hard to formulate as a MUST requirement.
I am coming from an XLIFF extracting and merging point of view - which
is the same as text analytics plus putting the outcome back into the
original content (= roundtripping). Specs like XLIFF wisely do not
specify the details of such processes, but say "be careful about
them" - and then there are, to some extent but not enough,
format-specific guidelines on how to do this. I am not asking for such
guidelines here, just trying to explain how big this Pandora's box is.
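
To make the format-specific whitespace decision concrete, here is a
minimal Python sketch. It is only an illustration under assumptions:
the PRESERVE_WHITESPACE table, the extract_text function and the
sample document are hypothetical names I made up, not anything taken
from the Web Annotation, XLIFF or DocBook specs. The point it shows is
that the same markup yields different plain text depending on which
format rules the extractor knows about.

```python
import re
import xml.etree.ElementTree as ET

# Hypothetical per-format table: elements whose whitespace is significant.
# A real extractor would need a far longer, format-specific list.
PRESERVE_WHITESPACE = {
    "html": {"pre"},
    "docbook": {"programlisting"},
}

# Hypothetical sample: a DocBook-like paragraph containing a code listing.
DOC = "<para>Run   this:\n<programlisting>a  =  1\n  b</programlisting>\n  done</para>"


def extract_text(markup: str, fmt: str) -> str:
    """Strip markup; collapse whitespace runs except inside elements
    that the given format declares as whitespace-preserving."""
    preserve = PRESERVE_WHITESPACE.get(fmt, set())
    parts = []

    def emit(text, keep):
        # Keep text verbatim inside preserved elements, otherwise
        # collapse every run of whitespace to a single space.
        if text:
            parts.append(text if keep else re.sub(r"\s+", " ", text))

    def walk(elem, keep):
        keep = keep or elem.tag in preserve
        emit(elem.text, keep)
        for child in elem:
            walk(child, keep)
            # The tail follows the child's end tag, so it belongs to
            # the parent's whitespace context, not the child's.
            emit(child.tail, keep)

    walk(ET.fromstring(markup), False)
    return "".join(parts)


docbook_text = extract_text(DOC, "docbook")  # listing layout survives
html_text = extract_text(DOC, "html")        # listing layout is collapsed
```

With the DocBook rules the line break and indentation inside
programlisting survive; with the HTML rules, which know nothing about
programlisting, they are collapsed. Any text offsets computed on one
result will not match the other, which is the interoperability
problem for a MUST requirement.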

-- 
GitHub Notification of comment by fsasaki
Please view or discuss this issue at 
https://github.com/w3c/web-annotation/issues/227#issuecomment-222516176
 using your GitHub account

Received on Monday, 30 May 2016 15:34:48 UTC