- From: Sebastian Hellmann <hellmann@informatik.uni-leipzig.de>
- Date: Fri, 01 May 2015 03:21:13 +0200
- To: Michael Scharf <w3c@scharf.gr>, Public TAG List <www-tag@w3.org>
- CC: W3C Public Annotation List <public-annotation@w3.org>, nlp2rdf <nlp2rdf@lists.informatik.uni-leipzig.de>
- Message-ID: <5542D509.4010508@informatik.uni-leipzig.de>
Hi Michael,

On 01.05.2015 02:55, Michael Scharf wrote:
> Hi Sebastian,
>
> > While UTF-8 has a variable length of one to four bytes per code point,
> > UTF-16 and 32 have the advantage of a fixed length.
>
> UTF-16 is **not** a fixed length encoding. Like UTF-8 it can use up to
> 4 bytes.
> Only UTF-32 encodes with a fixed length.

Ah, yes, thanks, I got mixed up.

> Here I show it in python (note the u'xxx' is a UTF-16):
>
> >>> len('𐐂')
> 4
> >>> len(u'𐐂')
> 2
> >>> len('ä')
> 2
> >>> len(u'ä')
> 1
>
> The same is true in javascript (node):
>
> > u4='𐐂'
> '𐐂'
> > u4.length
> 2
> > u4.charCodeAt(1)
> 56322
> > u4.charCodeAt(0)
> 55297
>
> > u2='ä'
> 'ä'
> > u2.length
> 1
> > u2.charCodeAt(0)
> 228
> > u2.charCodeAt(1)
> NaN
>

Do you think it is feasible to require implementations of the Web
Annotation Data Model to count in code points?

JavaScript also seems to have methods for "character-based" string
counting:

punycode.ucs2.decode(string).length;
or
Array.from(string).length;
or
[...string].length;

I am not a JS expert, I am just copying from
https://mathiasbynens.be/notes/javascript-unicode.

All the best,
Sebastian

>
> Michael
>
> On 2015-04-30 17:55, Sebastian Hellmann wrote:
>> Hi all,
>>
>> I am a bit puzzled why
>> http://www.w3.org/TR/charmod/#sec-stringIndexing is renaming Unicode
>> Code Points (a clearly defined thing) to Character String.
>> From my understanding the example in
>> http://www.w3.org/TR/charmod/#C052 is not good:
>> "(Example: the use of UTF-16 in [DOM Level 1])."
>>
>> UTF-16 is the encoding of the string and is independent of code
>> points, units and graphemes, i.e. you can encode the same code point
>> in UTF-8, UTF-16 and UTF-32 which will definitely change the number
>> of code units and bytes needed.
>>
>> While UTF-8 has a variable length of one to four bytes per code
>> point, UTF-16 and 32 have the advantage of a fixed length. This means
>> that you can use byte offsets easily to jump to certain positions in
>> the text.
>> However, this is mostly
>> used internally, i.e. C/C++ has a datatype widechar using 16 bits as
>> it is easier to allocate memory for variables. Maybe some DOM parsers
>> rely on UTF-16 internally too, but still count Code Points
>>
>> On the (serialized) web, UTF-8 is predominant, which is really not
>> the question here as the choice between graphemes, code points and
>> units is orthogonal to encoding.
>>
>> Regarding annotation, using code points or Character Strings is
>> definitely the best practice. Any deviation will lead to side effects
>> such as "ä" having the length 2:
>>
>> Using code points:
>> Java, length(): "ä".length() == 1
>> PHP,utf8_decode(): strlen(utf8_decode("ä"))===1
>> Python, len() in combination with decode(): len("ä".decode("UTF-8")) ==1
>>
>> Using code units:
>> Unix wc: echo -n "ä" | wc is 2
>> PHP: strlen("ä")===2
>> Python: len("ä")===2
>>
>> For the NLP2RDF project we converted these 30 million annotations to
>> RDF: http://wiki-link.nlp2rdf.org/
>> It was quite difficult to work with the byte offset given that the
>> original formats were HTML, txt, PDFs and docx.
>>
>> Anyhow, I wouldn't know a single use case for using Code Units for
>> annotation. I am unsure about Graphemes. Personally I think, byte
>> offset for text is unnecessary, simply because code points are
>> better, i.e. stable regarding encoding and
>> charset.
>>
>> There is a problem with Unicode Normal Form (NF). Generally, Normal
>> Form C is fine. However, if people wish to annotate diacritics
>> independently, NFD is needed.
>> NFC: è
>> NFD: `e
>> in NFD you can annotate the code point for the diacritic separately.
>> However, NFD is not in wide use and the annotation of diacritics is
>> probably out of scope.
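
The NFC/NFD difference quoted above can be checked with a few lines of Python. This is only a minimal sketch, assuming Python 3 (where a str is a sequence of code points) and the standard unicodedata module; the helper name code_point_lengths is hypothetical and used for illustration only.

```python
import unicodedata


def code_point_lengths(text: str) -> dict:
    """Return the code point count of `text` under NFC and under NFD."""
    return {
        form: len(unicodedata.normalize(form, text))
        for form in ("NFC", "NFD")
    }


composed = "\u00e8"  # è as a single precomposed code point
# NFD splits it into the base letter followed by the combining accent.
decomposed = unicodedata.normalize("NFD", composed)

print(code_point_lengths(composed))                # {'NFC': 1, 'NFD': 2}
print([f"U+{ord(ch):04X}" for ch in decomposed])   # ['U+0065', 'U+0300']
```

An offset computed on the NFD text therefore only matches an offset on the NFC text up to the first character that decomposes, which is why both sides would have to agree on the form before exchanging offsets.
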
>>
>> There is some info in
>> - the "definition of string" section in the NIF spec:
>> http://persistence.uni-leipzig.org/nlp2rdf/specification/core.html
>> (yes, we consider moving to a W3C community group for further
>> improvement)
>> - Unicode Norm Forms: http://unicode.org/reports/tr15/#Norm_Forms
>> - http://tinyurl.com/sh-thesis , page 76
>>
>> On my wishlist, I would hope that the new Annotation standard would
>> include a normative list (SHOULD not MUST) of string counting
>> functions for all major programming languages and other standards
>> like SPARQL to tackle interoperability.
>> When transferring data, it is important that the other implementation
>> counts offsets the same way. Listing the functions would help a lot.
>>
>> All the best,
>> Sebastian
>>
>> On 30.04.2015 13:01, Nick Stenning wrote:
>>> Thanks for this reference, Martin, and thanks for passing this to TAG,
>>> Frederick.
>>>
>>> The character model lays out the problems more clearly than I have.
>>> It's clear that the recommendation is to use character strings (i.e.
>>> codepoint sequences) unless:
>>>
>>> a) there are performance considerations that would predicate the use of
>>> "code unit strings" (I presume interop with existing DOM APIs would
>>> also be a strong motivator)
>>> b) "user interaction is a primary concern" -- in which case grapheme
>>> clusters may be considered
>>>
>>> Unfortunately for us, both considerations apply in the annotation use
>>> case.
>>>
>>> I'd suggest we schedule a discussion of this issue in an upcoming call.
>>>
>>> N
>>>
>>> On Thu, Apr 30, 2015, at 02:58, Martin J. Dürst wrote:
>>>> Hello Frederick,
>>>>
>>>> This is an old, well-known issue. As a starter, please have a look at
>>>> what the Character Model has to say about this:
>>>>
>>>> http://www.w3.org/TR/charmod/#sec-stringIndexing
>>>>
>>>> Please feel free to come back again here or contact the I18N WG.
>>>>
>>>> Regards, Martin.
>>>>
>>>> On 2015/04/29 21:45, Frederick Hirsch wrote:
>>>>> TAG members - has the issue of dealing with symbols vs
>>>>> characters/codepoints come up in TAG discussion?
>>>>>
>>>>> Any comment/suggestion welcome (I've cross-posted intentionally,
>>>>> please remove recipients if not appropriate.)
>>>>>
>>>>> Thanks
>>>>>
>>>>> regards, Frederick
>>>>>
>>>>> Frederick Hirsch
>>>>> Co-Chair, W3C Web Annotation WG
>>>>>
>>>>> www.fjhirsch.com <http://www.fjhirsch.com/>
>>>>> @fjhirsch
>>>>>
>>>>>> Begin forwarded message:
>>>>>>
>>>>>> From: "Nick Stenning" <nick@whiteink.com>
>>>>>> Subject: Unicode offset calculations
>>>>>> Date: April 29, 2015 at 4:38:34 AM EDT
>>>>>> To: public-annotation@w3.org
>>>>>> Resent-From: public-annotation@w3.org
>>>>>>
>>>>>> One of the most useful discussions at our working group F2F last
>>>>>> week was the result of a question from Takeshi Kanai about how we
>>>>>> calculate character offsets such as those used by Text Position
>>>>>> Selector <http://www.w3.org/TR/annotation-model/#text-position-selector>
>>>>>> in the draft model. Specifically, if I have a selector such as
>>>>>>
>>>>>> {
>>>>>>   "@id": "urn:uuid:...",
>>>>>>   "@type": "oa:TextPositionSelector",
>>>>>>   "start": 478,
>>>>>>   "end": 512
>>>>>> }
>>>>>>
>>>>>> to what do the numbers 478 and 512 refer? These numbers will
>>>>>> likely be interpreted by other components specified by this WG
>>>>>> (such as the RangeFinder API), not to mention external systems,
>>>>>> and we need to make sure we are consistent in our definitions
>>>>>> across these specifications.
>>>>>>
>>>>>> I've reviewed what the model spec currently says and I'm not sure
>>>>>> it's particularly precise on this point. Even if I'm misreading
>>>>>> it and it is clear, I'm not sure it makes a recommendation that
>>>>>> is practical. In order to review this, I'm going to first lay out
>>>>>> the possible points of ambiguity, and then review what the spec
>>>>>> seems to say on these issues.
>>>>>>
>>>>>> 1. A symbol is not (necessarily) a codepoint
>>>>>>
>>>>>> The atom of selection in the browser is the symbol, or grapheme.
>>>>>> For example, "ą́" is composed of three codepoints, but is rendered
>>>>>> as a single selectable symbol. It can only be unselected or
>>>>>> selected: there is no way to only select some of the codepoints
>>>>>> that comprise the symbol.
>>>>>>
>>>>>> Because user selections start and end at symbols, it would be
>>>>>> reasonable for TextPositionSelector offsets to be defined as
>>>>>> symbol counts. Unfortunately, most extant DOM APIs don't deal in
>>>>>> symbols:
>>>>>>
>>>>>> > var p = document.createElement('p')
>>>>>> > p.innerText = 'ą́'
>>>>>> > p.innerText.length
>>>>>> 3
>>>>>> > p.firstChild.splitText(1)
>>>>>> > p.firstChild.nodeValue
>>>>>> '\u0061'
>>>>>> > p.firstChild.nextSibling.nodeValue
>>>>>> '\u0328\u0301'
>>>>>>
>>>>>> Calculating how a sequence of codepoints maps to rendered symbols
>>>>>> is in principle complicated
>>>>>> <http://www.unicode.org/reports/tr29/#Grapheme_Cluster_Boundaries>
>>>>>> and in practice not completely standardised across rendering
>>>>>> engines. It's also (as demonstrated with splitText above)
>>>>>> possible for the DOM to end up in a state in which the mapping
>>>>>> between textual content and rendered symbols has become decoupled.
>>>>>>
>>>>>> 2. Combining characters
>>>>>>
>>>>>> Some sequences of codepoints can render identically to other
>>>>>> sequences of codepoints. For example:
>>>>>>
>>>>>> ñ (U+00F1 LATIN SMALL LETTER N WITH TILDE)
>>>>>>
>>>>>> renders identically to
>>>>>>
>>>>>> ñ (U+006E LATIN SMALL LETTER N + U+0303 COMBINING TILDE)
>>>>>>
>>>>>> This is the "combining characters" problem. Some codepoints are
>>>>>> used to modify the appearance of preceding codepoints.
>>>>>> Selections made on a document containing one of these would behave
>>>>>> identically to selections made on a document containing the
>>>>>> other, but:
>>>>>>
>>>>>> > 'ñ'.length
>>>>>> 1
>>>>>> > 'ñ'.length
>>>>>> 2
>>>>>>
>>>>>> This is not an insoluble problem, as the Unicode specification
>>>>>> itself defines a process by which sequences of codepoints can be
>>>>>> canonicalised into fully decomposed (aka "NFD") or fully composed
>>>>>> (aka "NFC") form. But it's not that simple, because if we specify
>>>>>> a canonicalisation requirement for annotation selector offsets,
>>>>>> then there may be undesirable performance implications (consider
>>>>>> making an annotation at the end of a 100KB web page of unknown
>>>>>> canonicalisation status).
>>>>>>
>>>>>> 3. Astral codepoints and JavaScript
>>>>>>
>>>>>> JavaScript's internal encoding of Unicode strings is based on
>>>>>> UCS-2 <https://en.wikipedia.org/wiki/UTF-16#History>, which means
>>>>>> that it represents codepoints from the so-called "astral planes"
>>>>>> (i.e. codepoints above 0xFFFF) as surrogate pairs. This leads
>>>>>> to the principal problem that Takeshi identified, which is that
>>>>>> different environments will calculate offsets differently. For
>>>>>> example, in Python 3:
>>>>>>
>>>>>> >>> len('😀')  # U+1F600 GRINNING FACE
>>>>>> 1
>>>>>>
>>>>>> Whereas in JavaScript:
>>>>>>
>>>>>> > '😀'.length
>>>>>> 2
>>>>>>
>>>>>> There are ways of addressing this problem in JavaScript, but to
>>>>>> my knowledge none of them are particularly elegant, and none of
>>>>>> them will allow us to calculate offsets at the bottom of a long
>>>>>> document without scanning the entire preceding text for astral
>>>>>> codepoints.
>>>>>>
>>>>>> So what does our spec currently say?
>>>>>>
>>>>>> The text must be normalized before counting characters.
>>>>>> HTML/XML tags should be removed, character entities should be
>>>>>> replaced with the character that they encode, unnecessary
>>>>>> whitespace should be normalized, and so forth. The normalization
>>>>>> routine may be performed automatically by a browser, and other
>>>>>> clients should implement the DOM String Comparisons
>>>>>> [DOM-Level-3-Core] method.
>>>>>>
>>>>>> It's not immediately clear what this means in terms of Unicode
>>>>>> normalisation. Following the chain of specifications leads to
>>>>>> §1.3.1 of the DOM Level 3 Core specification
>>>>>> <http://www.w3.org/TR/DOM-Level-3-Core/core.html#DOMString>.
>>>>>> The only thing this says about Unicode normalisation is:
>>>>>>
>>>>>> The character normalization, i.e. transforming into their fully
>>>>>> normalized form as defined in [XML 1.1], is assumed to happen
>>>>>> at serialization time.
>>>>>>
>>>>>> This doesn't appear to be relevant, as the meaning of
>>>>>> "serialization" in this context appears to refer to the
>>>>>> mechanisms described in the DOM Load and Save
>>>>>> <http://www.w3.org/TR/2004/REC-DOM-Level-3-LS-20040407/>
>>>>>> spec, and does not refer to the process of parsing an HTML
>>>>>> document and presenting its content through the DOM APIs. (I'd be
>>>>>> very happy if someone more familiar with the DOM Level 3 Spec
>>>>>> could confirm this interpretation.)
>>>>>>
>>>>>> For completeness, "fully normalized form" in the XML 1.1 sense
>>>>>> <http://www.w3.org/TR/2004/REC-xml11-20040204/#dt-fullnorm>
>>>>>> would appear to imply full "NFC" normalisation of the document.
>>>>>> It is apparent even from the simple examples above that browsers
>>>>>> do not apply NFC normalisation to documents they receive from the
>>>>>> server.
>>>>>>
>>>>>> What should we do about all this?
>>>>>>
>>>>>> I've listed above three sources of confusion in talking about
>>>>>> offsets.
>>>>>> In each case there is tension between what would make
>>>>>> the most sense to a user and a pragmatic engineering
>>>>>> recommendation that takes into account contingent factors.
>>>>>>
>>>>>> Symbols: users can only select symbols in browsers, but as far as
>>>>>> I'm aware all current internal DOM APIs ignore this fact.
>>>>>> Further, given that determining the symbol sequence for a given
>>>>>> codepoint sequence is non-trivial, we probably should not attempt
>>>>>> to define offsets in terms of symbols.
>>>>>>
>>>>>> Combining characters: user selections make no distinction between
>>>>>> combinatoric variants such as "ñ" and "n + ˜", so it would seem
>>>>>> logical to define offsets in terms of the "NFC" canonicalised
>>>>>> form. In practice, such a recommendation would likely be ignored
>>>>>> by implementers (for reasons of complexity or performance
>>>>>> impact), and so for the same reasons as in 1) I'd be inclined to
>>>>>> suggest we define offsets in terms of the delivered document
>>>>>> codepoint sequence rather than any canonical form.
>>>>>>
>>>>>> Astral codepoints + surrogate pairs: this is the tricky one. As
>>>>>> demonstrated by this page
>>>>>> <http://bl.ocks.org/nickstenning/bf09f4538878b97ebe6f>, this
>>>>>> poses serious problems for interoperability, as JavaScript counts
>>>>>> a single unicode astral codepoint as having length 2, due to the
>>>>>> internal representation of the codepoint as a surrogate pair. As
>>>>>> far as I'm concerned we're stuck between a rock and a hard place:
>>>>>>
>>>>>> a. calculating offsets in terms of codepoints (i.e. accounting
>>>>>> for surrogate pairs in JavaScript) makes interoperability more
>>>>>> likely, but could impose a substantial cost on client-side
>>>>>> algorithms, both in terms of implementation complexity and
>>>>>> performance impact.
>>>>>>
>>>>>> b. calculating offsets using native calculations on JavaScript
>>>>>> strings is preferable from an implementation complexity
>>>>>> standpoint, but as far as I'm aware no other mainstream
>>>>>> programming environment has the same idiosyncrasy, thus almost
>>>>>> guaranteeing problems of interoperability when offsets are used
>>>>>> in both the DOM environment and outside.
>>>>>>
>>>>>> In summary, Takeshi raised an important question at the F2F. What
>>>>>> do we do about JavaScript's rather unfortunate implementation of
>>>>>> unicode strings? I'd be interested to hear from anyone with
>>>>>> thoughts on this subject. I imagine there are people in the I18N
>>>>>> activity at W3C who would be able to weigh in here too.
>>>>>>
>>>>>> -N
>

--
Sebastian Hellmann
AKSW/NLP2RDF research group
Institute for Applied Informatics (InfAI) and DBpedia Association
Events:
* *Feb 9th, 2015* 3rd DBpedia Community Meeting in Dublin
  <http://wiki.dbpedia.org/meetings/Dublin2015>
* *May 29th, 2015* Submission deadline SEMANTiCS 2015
* *Sept 15th-17th, 2015* SEMANTiCS 2015 (formerly i-SEMANTICS),
  Vienna <http://semantics.cc/>
Come to Germany as a PhD: http://bis.informatik.uni-leipzig.de/csf
Projects: http://dbpedia.org, http://nlp2rdf.org,
http://linguistics.okfn.org, https://www.w3.org/community/ld4lt
<http://www.w3.org/community/ld4lt>
Homepage: http://aksw.org/SebastianHellmann
Research Group: http://aksw.org
Thesis:
http://tinyurl.com/sh-thesis-summary
http://tinyurl.com/sh-thesis
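
P.S. To make the code point question in this message concrete: the following is a minimal sketch of translating offsets between code points and UTF-16 code units (the unit that JavaScript's length counts). It assumes Python 3, where a str is indexed by code point. The function names are hypothetical and not taken from any specification or library, and the linear scan reflects the cost concern raised in the thread rather than an optimised approach.

```python
def utf16_length(text: str) -> int:
    """Number of UTF-16 code units in `text` (what JavaScript's .length reports)."""
    # "surrogatepass" lets strings with lone surrogates be measured too.
    return len(text.encode("utf-16-le", "surrogatepass")) // 2


def code_point_to_utf16_offset(text: str, offset: int) -> int:
    """Translate a code point offset into the matching UTF-16 code unit offset."""
    return utf16_length(text[:offset])


def utf16_to_code_point_offset(text: str, offset: int) -> int:
    """Translate a UTF-16 code unit offset into a code point offset.

    An offset that falls inside a surrogate pair is rounded up to the
    position just after that code point.
    """
    units = 0
    for index, char in enumerate(text):
        if units >= offset:
            return index
        # Code points above U+FFFF take two UTF-16 code units.
        units += 2 if ord(char) > 0xFFFF else 1
    return len(text)


text = "a\U00010402b"  # 'a', U+10402 (outside the BMP), 'b'
print(len(text))                             # 3 code points
print(utf16_length(text))                    # 4 UTF-16 code units
print(code_point_to_utf16_offset(text, 2))   # 3
print(utf16_to_code_point_offset(text, 3))   # 2
```

Both directions need to scan the text before the offset, so the translation is linear in the position of the annotation.
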
Received on Friday, 1 May 2015 01:21:50 UTC