Re: Fwd: Unicode offset calculations

Hi Michael,

On 01.05.2015 02:55, Michael Scharf wrote:
> Hi Sebastian,
>
> > While UTF-8 has a variable length of one to four bytes per code point,
> > UTF-16 and 32 have the advantage of a fixed length.
>
> UTF-16 is **not** a fixed-length encoding. Like UTF-8, it can use up to 
> 4 bytes per code point.
> Only UTF-32 encodes with a fixed length.

Ah, yes, thanks, I got mixed up.

> Here I show it in Python 2 (note: u'xxx' is a unicode string, which uses UTF-16 code units on a narrow build):
>
>   >>> len('𐐂')
>   4
>   >>> len(u'𐐂')
>   2
>   >>> len('ä')
>   2
>   >>> len(u'ä')
>   1
>
> The same is true in javascript (node):
>
>
>   > u4='𐐂'
>   '𐐂'
>   > u4.length
>   2
>   > u4.charCodeAt(1)
>   56322
>   > u4.charCodeAt(0)
>   55297
>
>   > u2='ä'
>   'ä'
>   > u2.length
>   1
>   > u2.charCodeAt(0)
>   228
>   > u2.charCodeAt(1)
>   NaN
>

Do you think it is feasible to require implementations of the Web 
Annotation Data Model to count in code points?
JavaScript also seems to have methods for "character-based" (i.e. code 
point) string counting:

punycode.ucs2.decode(string).length;
or
Array.from(string).length;
or
[...string].length;

I am not a JS expert, I am just copying from 
https://mathiasbynens.be/notes/javascript-unicode.
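
For example, for a non-BMP character the counts differ (a quick node 
sketch, assuming an ES6-capable engine where Array.from and the spread 
operator iterate by code point; I have not tested other engines):

  > s = '𐐂'
  '𐐂'
  > s.length              // UTF-16 code units
  2
  > Array.from(s).length  // code points
  1
  > [...s].length         // code points
  1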

All the best,
Sebastian

>
> Michael
>
> On 2015-04-30 17:55, Sebastian Hellmann wrote:
>> Hi all,
>>
>> I am a bit puzzled why 
>> http://www.w3.org/TR/charmod/#sec-stringIndexing renames Unicode 
>> code points (a clearly defined concept) to "Character String".
>> From my understanding, the example in 
>> http://www.w3.org/TR/charmod/#C052 is not a good one:
>> "(Example: the use of UTF-16 in [DOM Level 1])."
>>
>> UTF-16 is the encoding of the string and is independent of code 
>> points, units and graphemes, i.e. you can encode the same code point 
>> in UTF-8, UTF-16 and UTF-32, which changes the number of code units 
>> and bytes needed.
>>
>> While UTF-8 has a variable length of one to four bytes per code 
>> point, UTF-16 and 32 have the advantage of a fixed length. This means 
>> that you can use byte offsets easily to jump to certain positions in 
>> the text. However, this is mostly
>> used internally, e.g. C/C++ has a wide-character datatype (wchar_t, 16 
>> bits on some platforms), as it makes memory allocation easier. Maybe 
>> some DOM parsers rely on UTF-16 internally too, but they still count 
>> code points.
>>
>> On the (serialized) web, UTF-8 is predominant, which is really not 
>> the question here, as the choice between graphemes, code points and 
>> code units is orthogonal to the encoding.
>>
>> Regarding annotation, using code points or Character Strings is 
>> definitely the best practice. Any deviation will lead to side effects 
>> such as "ä" having the length 2:
>>
>> Using code points:
>> Java, codePointCount(): "ä".codePointCount(0, "ä".length()) == 1
>>   (length() itself counts UTF-16 code units)
>> PHP, utf8_decode(): strlen(utf8_decode("ä")) === 1
>> Python 2, len() combined with decode(): len("ä".decode("UTF-8")) == 1
>>   (in Python 3, len("ä") == 1 directly)
>>
>> Using code units (here, UTF-8 bytes):
>> Unix wc:  echo -n "ä" | wc -c  gives 2
>> PHP:      strlen("ä") === 2
>> Python 2: len("ä") == 2
>>
>> For the NLP2RDF project we converted these 30 million annotations to 
>> RDF: http://wiki-link.nlp2rdf.org/
>> It was quite difficult to work with the byte offsets, given that the 
>> original formats were HTML, txt, PDF and docx.
>>
>> Anyhow, I wouldn't know a single use case for using code units for 
>> annotation. I am unsure about graphemes. Personally, I think byte 
>> offsets for text are unnecessary, simply because code points are 
>> better, i.e. stable with regard to encoding and charset.
>>
>> There is a problem with Unicode Normalization Forms (NF). Generally, 
>> Normal Form C (NFC) is fine. However, if people wish to annotate 
>> diacritics independently, NFD is needed:
>> NFC: è (a single code point, U+00E8)
>> NFD: e + U+0300 COMBINING GRAVE ACCENT (two code points)
>> In NFD you can annotate the code point for the diacritic separately. 
>> However, NFD is not in wide use and the annotation of diacritics is 
>> probably out of scope.
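>>
>> For illustration, the two forms can be checked in node (a sketch, 
>> assuming an ES6 engine with String.prototype.normalize):
>>
>>   > 'è'.normalize('NFC').length
>>   1
>>   > 'è'.normalize('NFD').length
>>   2
>>   > Array.from('è'.normalize('NFD')).map(c => c.charCodeAt(0).toString(16))
>>   [ '65', '300' ]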
>>
>> There is some info in
>> - the "definition of string" section in the NIF spec: 
>> http://persistence.uni-leipzig.org/nlp2rdf/specification/core.html
>> (yes, we are considering moving to a W3C Community Group for further 
>> improvement)
>> - Unicode Norm Forms: http://unicode.org/reports/tr15/#Norm_Forms
>> - http://tinyurl.com/sh-thesis ,  page 76
>>
>> On my wishlist, I would hope that the new Annotation standard 
>> includes a normative list (SHOULD, not MUST) of string counting 
>> functions for all major programming languages and other standards 
>> like SPARQL, to tackle interoperability.
>> When transferring data, it is important that the other implementation 
>> counts offsets the same way. Listing the functions would help a lot.
>>
>> All the best,
>> Sebastian
>>
>> On 30.04.2015 13:01, Nick Stenning wrote:
>>> Thanks for this reference, Martin, and thanks for passing this to TAG,
>>> Frederick.
>>>
>>> The character model lays out the problems more clearly than I have. 
>>> It's clear that the recommendation is to use character strings (i.e. 
>>> codepoint sequences) unless:
>>>
>>> a) there are performance considerations that would necessitate the use of
>>> "code unit strings" (I presume interop with existing DOM APIs would 
>>> also
>>> be a strong motivator)
>>> b) "user interaction is a primary concern" -- in which case grapheme
>>> clusters may be considered
>>>
>>> Unfortunately for us, both considerations apply in the annotation use
>>> case.
>>>
>>> I'd suggest we schedule a discussion of this issue in an upcoming call.
>>>
>>> N
>>>
>>> On Thu, Apr 30, 2015, at 02:58, Martin J. Dürst wrote:
>>>> Hello Frederick,
>>>>
>>>> This is an old, well-known issue. As a starter, please have a look at
>>>> what the Character Model has to say about this:
>>>>
>>>> http://www.w3.org/TR/charmod/#sec-stringIndexing
>>>>
>>>> Please feel free to come back again here or contact the I18N WG.
>>>>
>>>> Regards,   Martin.
>>>>
>>>> On 2015/04/29 21:45, Frederick Hirsch wrote:
>>>>> TAG members - has the issue of dealing with symbols vs 
>>>>> characters/codepoints come up in TAG discussion?
>>>>>
>>>>> Any comment/suggestion welcome  (I've cross-posted intentionally, 
>>>>> please remove recipients if not appropriate.)
>>>>>
>>>>> Thanks
>>>>>
>>>>> regards, Frederick
>>>>>
>>>>> Frederick Hirsch
>>>>> Co-Chair, W3C Web Annotation WG
>>>>>
>>>>> www.fjhirsch.com  <http://www.fjhirsch.com/>
>>>>> @fjhirsch
>>>>>
>>>>>> Begin forwarded message:
>>>>>>
>>>>>> From: "Nick Stenning"<nick@whiteink.com>
>>>>>> Subject: Unicode offset calculations
>>>>>> Date: April 29, 2015 at 4:38:34 AM EDT
>>>>>> To:public-annotation@w3.org
>>>>>> Resent-From:public-annotation@w3.org
>>>>>>
>>>>>> One of the most useful discussions at our working group F2F last 
>>>>>> week was the result of a question from Takeshi Kanai about how we 
>>>>>> calculate character offsets such as those used by Text Position 
>>>>>> Selector<http://www.w3.org/TR/annotation-model/#text-position-selector> 
>>>>>> in the draft model. Specifically, if I have a selector such as
>>>>>>
>>>>>> {
>>>>>>     "@id": "urn:uuid:...",
>>>>>>     "@type": "oa:TextPositionSelector",
>>>>>>     "start": 478,
>>>>>>     "end": 512
>>>>>> }
>>>>>> to what do the numbers 478 and 512 refer? These numbers will 
>>>>>> likely be interpreted by other components specified by this WG 
>>>>>> (such as the RangeFinder API), not to mention external systems, 
>>>>>> and we need to make sure we are consistent in our definitions 
>>>>>> across these specifications.
>>>>>>
>>>>>> I've reviewed what the model spec currently says and I'm not sure 
>>>>>> it's particularly precise on this point. Even if I'm misreading 
>>>>>> it and it is clear, I'm not sure it makes a recommendation that 
>>>>>> is practical. In order to review this, I'm going to first lay out 
>>>>>> the possible points of ambiguity, and then review what the spec 
>>>>>> seems to say on these issues.
>>>>>>
>>>>>> 1. A symbol is not (necessarily) a codepoint
>>>>>>
>>>>>> The atom of selection in the browser is the symbol, or grapheme. 
>>>>>> For example, "ą́" is composed of three codepoints, but is rendered 
>>>>>> as a single selectable symbol. It can only be unselected or 
>>>>>> selected: there is no way to only select some of the codepoints 
>>>>>> that comprise the symbol.
>>>>>>
>>>>>> Because user selections start and end at symbols, it would be 
>>>>>> reasonable for TextPositionSelector offsets to be defined as 
>>>>>> symbol counts. Unfortunately, most extant DOM APIs don't deal in 
>>>>>> symbols:
>>>>>>
>>>>>>> var p = document.createElement('p')
>>>>>>> p.innerText = 'ą́'
>>>>>>> p.innerText.length
>>>>>> 3
>>>>>>> p.firstChild.splitText(1)
>>>>>>> p.firstChild.nodeValue
>>>>>> '\u0061'
>>>>>>> p.firstChild.nextSibling.nodeValue
>>>>>> '\u0328\u0301'
>>>>>> Calculating how a sequence of codepoints maps to rendered symbols 
>>>>>> is in principle 
>>>>>> complicated<http://www.unicode.org/reports/tr29/#Grapheme_Cluster_Boundaries> 
>>>>>> and in practice not completely standardised across rendering 
>>>>>> engines. It's also (as demonstrated with splitText above) 
>>>>>> possible for the DOM to end up in a state in which the mapping 
>>>>>> between textual content and rendered symbols has become decoupled.
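>>>>>>
>>>>>> (Even iterating by code point, e.g. with ES6 Array.from, still yields 
>>>>>> three separate items for this single symbol — a quick node sketch:)
>>>>>>
>>>>>>> Array.from('ą́').map(c => c.charCodeAt(0).toString(16))
>>>>>> [ '61', '328', '301' ]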
>>>>>>
>>>>>> 2. Combining characters
>>>>>>
>>>>>> Some sequences of codepoints can render identically to other 
>>>>>> sequences of codepoints. For example:
>>>>>>
>>>>>> ñ (U+00F1 LATIN SMALL LETTER N WITH TILDE)
>>>>>> renders identically to
>>>>>>
>>>>>> ñ (U+006E LATIN SMALL LETTER N + U+0303 COMBINING TILDE)
>>>>>> This is the "combining characters" problem. Some codepoints are 
>>>>>> used to modify the appearance of preceding codepoints. Selections 
>>>>>> made on a document containing one of these would behave 
>>>>>> identically to selections made on a document containing the 
>>>>>> other, but:
>>>>>>
>>>>>>> 'ñ'.length
>>>>>> 1
>>>>>>> 'ñ'.length
>>>>>> 2
>>>>>> This is not an insoluble problem, as the Unicode specification 
>>>>>> itself defines a process by which sequences of codepoints can be 
>>>>>> canonicalised into fully decomposed (aka "NFD") or fully composed 
>>>>>> (aka "NFC") form. But it's not that simple, because if we specify 
>>>>>> a canonicalisation requirement for annotation selector offsets, 
>>>>>> then there may be undesirable performance implications (consider 
>>>>>> making an annotation at the end of a 100KB web page of unknown 
>>>>>> canonicalisation status).
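>>>>>>
>>>>>> (For reference, ES6 exposes String.prototype.normalize, so the 
>>>>>> equivalence is at least checkable — a sketch, not a recommendation:)
>>>>>>
>>>>>>> '\u00F1' === '\u006E\u0303'
>>>>>> false
>>>>>>> '\u00F1' === '\u006E\u0303'.normalize('NFC')
>>>>>> true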
>>>>>>
>>>>>> 3. Astral codepoints and JavaScript
>>>>>>
>>>>>> JavaScript's internal encoding of Unicode strings is based on 
>>>>>> UCS-2<https://en.wikipedia.org/wiki/UTF-16#History>, which means 
>>>>>> that it represents codepoints from the so-called "astral planes" 
>>>>>> (i.e. codepoints above 0xFFFF) as a surrogate pair of two 16-bit 
>>>>>> code units. This leads 
>>>>>> to the principal problem that Takeshi identified, which is that 
>>>>>> different environments will calculate offsets differently. For 
>>>>>> example, in Python 3:
>>>>>>
>>>>>>>>> len('😀') # U+1F600 GRINNING FACE
>>>>>> 1
>>>>>> Whereas in JavaScript:
>>>>>>
>>>>>>> '😀'.length
>>>>>> 2
>>>>>> There are ways of addressing this problem in JavaScript, but to 
>>>>>> my knowledge none of them are particularly elegant, and none of 
>>>>>> them will allow us to calculate offsets at the bottom of a long 
>>>>>> document without scanning the entire preceding text for astral 
>>>>>> codepoints.
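>>>>>>
>>>>>> (For example, ES6 for...of iterates a string by code point, so one 
>>>>>> workable if inelegant approach is a linear scan — a sketch with a 
>>>>>> made-up helper name:)
>>>>>>
>>>>>>   // Count code points rather than UTF-16 code units.
>>>>>>   function codePointLength(s) {
>>>>>>     var n = 0;
>>>>>>     for (var ch of s) n++;  // one iteration per code point
>>>>>>     return n;
>>>>>>   }
>>>>>>   codePointLength('😀')  // 1, whereas '😀'.length is 2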
>>>>>>
>>>>>> So what does our spec currently say?
>>>>>>
>>>>>> The text must be normalized before counting characters. HTML/XML 
>>>>>> tags should be removed, character entities should be replaced 
>>>>>> with the character that they encode, unnecessary whitespace 
>>>>>> should be normalized, and so forth. The normalization routine may 
>>>>>> be performed automatically by a browser, and other clients should 
>>>>>> implement the DOM String Comparisons [DOM-Level-3-Core] method.
>>>>>>
>>>>>> It's not immediately clear what this means in terms of Unicode 
>>>>>> normalisation. Following the chain of specifications leads to 
>>>>>> §1.3.1 of the DOM Level 3 Core 
>>>>>> specification<http://www.w3.org/TR/DOM-Level-3-Core/core.html#DOMString>. 
>>>>>> The only thing this says about Unicode normalisation is:
>>>>>>
>>>>>> The character normalization, i.e. transforming into their fully 
>>>>>> normalized form as defined in [XML 1.1], is assumed to happen 
>>>>>> at serialization time.
>>>>>>
>>>>>> This doesn't appear to be relevant, as the meaning of 
>>>>>> "serialization" in this context appears to refer to the 
>>>>>> mechanisms described in the DOM Load and 
>>>>>> Save<http://www.w3.org/TR/2004/REC-DOM-Level-3-LS-20040407/> 
>>>>>> spec, and does not refer to the process of parsing an HTML 
>>>>>> document and presenting its content through the DOM APIs. (I'd be 
>>>>>> very happy if someone more familiar with the DOM Level 3 Spec 
>>>>>> could confirm this interpretation.)
>>>>>>
>>>>>> For completeness, "fully normalized form" in the XML 1.1 
>>>>>> sense<http://www.w3.org/TR/2004/REC-xml11-20040204/#dt-fullnorm> 
>>>>>> would appear to imply full "NFC" normalisation of the document. 
>>>>>> It is apparent even from the simple examples above that browsers 
>>>>>> do not apply NFC normalisation to documents they receive from the 
>>>>>> server.
>>>>>>
>>>>>> What should we do about all this?
>>>>>>
>>>>>> I've listed above three sources of confusion in talking about 
>>>>>> offsets. In each case there is tension between what would make 
>>>>>> the most sense to a user and a pragmatic engineering 
>>>>>> recommendation that takes into account contingent factors.
>>>>>>
>>>>>> Symbols: users can only select symbols in browsers, but as far as 
>>>>>> I'm aware all current internal DOM APIs ignore this fact. 
>>>>>> Further, given that determining the symbol sequence for a given 
>>>>>> codepoint sequence is non-trivial, we probably should not attempt 
>>>>>> to define offsets in terms of symbols.
>>>>>>
>>>>>> Combining characters: user selections make no distinction between 
>>>>>> combinatoric variants such as "ñ" and "n + ˜", so it would seem 
>>>>>> logical to define offsets in terms of the "NFC" canonicalised 
>>>>>> form. In practice, such a recommendation would likely be ignored 
>>>>>> by implementers (for reasons of complexity or performance 
>>>>>> impact), and so for the same reasons as in 1) I'd be inclined to 
>>>>>> suggest we define offsets in terms of the delivered document 
>>>>>> codepoint sequence rather than any canonical form.
>>>>>>
>>>>>> Astral codepoints + surrogate pairs: this is the tricky one. As 
>>>>>> demonstrated by this 
>>>>>> page<http://bl.ocks.org/nickstenning/bf09f4538878b97ebe6f>, this 
>>>>>> poses serious problems for interoperability, as JavaScript counts 
>>>>>> a single unicode astral codepoint as having length 2, due to the 
>>>>>> internal representation of the codepoint as a surrogate pair. As 
>>>>>> far as I'm concerned we're stuck between a rock and a hard place:
>>>>>>
>>>>>> a. calculating offsets in terms of codepoints (i.e. accounting 
>>>>>> for surrogate pairs in JavaScript) makes interoperability more 
>>>>>> likely, but could impose a substantial cost on client-side 
>>>>>> algorithms, both in terms of implementation complexity and 
>>>>>> performance impact.
>>>>>>
>>>>>> b. calculating offsets using native calculations on JavaScript 
>>>>>> strings is preferable from an implementation complexity 
>>>>>> standpoint, but as far as I'm aware no other mainstream 
>>>>>> programming environment has the same idiosyncrasy, thus almost 
>>>>>> guaranteeing problems of interoperability when offsets are used 
>>>>>> in both the DOM environment and outside.
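>>>>>>
>>>>>> (If we chose (a), converting between the two offset schemes is at 
>>>>>> least mechanical, though linear in the length of the preceding text 
>>>>>> — a rough sketch with a made-up function name:)
>>>>>>
>>>>>>   // Map a code point offset to the corresponding UTF-16 code unit offset.
>>>>>>   function codePointToCodeUnitOffset(s, cpOffset) {
>>>>>>     var unit = 0;
>>>>>>     for (var i = 0; i < cpOffset; i++) {
>>>>>>       // astral code points occupy two code units (a surrogate pair)
>>>>>>       unit += s.codePointAt(unit) > 0xFFFF ? 2 : 1;
>>>>>>     }
>>>>>>     return unit;
>>>>>>   }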
>>>>>>
>>>>>> In summary, Takeshi raised an important question at the F2F. What 
>>>>>> do we do about JavaScript's rather unfortunate implementation of 
>>>>>> unicode strings? I'd be interested to hear from anyone with 
>>>>>> thoughts on this subject. I imagine there are people in the I18N 
>>>>>> activity at W3C who would be able to weigh in here too.
>>>>>>
>>>>>> -N
>>>>>>
>>>>>
>>>
>>
>>
>


-- 
Sebastian Hellmann
AKSW/NLP2RDF research group
Institute for Applied Informatics (InfAI) and DBpedia Association
Events:
* *Feb 9th, 2015* 3rd DBpedia Community Meeting in Dublin 
<http://wiki.dbpedia.org/meetings/Dublin2015>
* *May 29th, 2015* Submission deadline SEMANTiCS 2015
* *Sept 15th-17th, 2015* SEMANTiCS 2015 (formerly i-SEMANTICS), Vienna 
<http://semantics.cc/>
Come to Germany as a PhD student: http://bis.informatik.uni-leipzig.de/csf
Projects: http://dbpedia.org, http://nlp2rdf.org, 
http://linguistics.okfn.org, https://www.w3.org/community/ld4lt 
<http://www.w3.org/community/ld4lt>
Homepage: http://aksw.org/SebastianHellmann
Research Group: http://aksw.org
Thesis:
http://tinyurl.com/sh-thesis-summary
http://tinyurl.com/sh-thesis

Received on Friday, 1 May 2015 01:21:50 UTC