Re: CR Feedback: String counting and offsets from Sebastian Hellmann on 2015-10-20 (www-international@w3.org from October to December 2015)

From: Sebastian Hellmann <hellmann@informatik.uni-leipzig.de>
Date: Tue, 20 Oct 2015 13:37:51 +0200
To: "Martin J. Dürst" <duerst@it.aoyama.ac.jp>, www-international@w3.org
Message-ID: <5626278F.80209@informatik.uni-leipzig.de>

Hello Martin,

thanks for the pointer. Good work, this was exactly what I was looking for.
Maybe a reference to the character model can be added to the current CR 
to keep the pointer.

All the best,
Sebastian

On 20.10.2015 13:20, Martin J. Dürst wrote:
> Hello Sebastian,
>
> There is already quite a bit about character counting/string length at 
> http://www.w3.org/TR/charmod/#sec-stringIndexing. But it just gives 
> some guidelines.
>
> The Encoding CR (not a Recommendation yet) deals with encoding 
> conversions, not with what you do once you have a single internal 
> encoding.
>
> Regards,   Martin.
>
> On 2015/10/20 19:02, Sebastian Hellmann wrote:
>> Hi all,
>> I am new, so sorry if I re-raise a topic.
>>
>> I was wondering whether the Encoding Recommendation would be the right
>> place to tackle a string counting issue. Lots of programming languages
>> and specifications have quite different implementations regarding string
>> counting. I am sure you are aware of this. A particular example is this
>> spec in Section 2.1.2: https://tools.ietf.org/html/rfc5147#section-2.1.2
>> which specifies to count two code points as one, or PHP with
>> strlen(utf8_decode("ä")) != strlen("ä")
>>
>> Could we include some definitions in the standard on how strings are
>> counted and define a way to have offsets over these strings?
>>
>> Suggestion 1 (easy change): In the terminology section:
>> A string is a sequence of code points.
>> The /length/ of a string equals the number of contained code points.
>>
>> Suggestion 2:
>> Define offsets similar to this image:
>> http://persistence.uni-leipzig.org/nlp2rdf/specification/image/iso+24612-2012.png
>>
>> i.e. start with 0 and then count the gaps.
>>
>> I would have high hopes that some implementers would pick it up
>> eventually. Such a definition would help immensely in the area of text
>> annotation and might also be an issue for the Web Annotation Group.
>>
>> All the best,
>> Sebastian
>>
>>
>>
>
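
To make the two suggestions above concrete, here is a minimal sketch in
Python 3, where a str is already a sequence of code points. The function
names (length, utf8_bytes, utf16_units, span) are hypothetical and chosen
only for illustration; nothing here is taken from the Encoding CR or the
character model. It contrasts code point counting with byte and UTF-16
code unit counting, and shows gap-based offsets starting at 0.

```python
def length(s: str) -> int:
    """Suggestion 1: length = number of code points in the string."""
    # Python 3 str is a sequence of code points, so len() counts them.
    return len(s)


def utf8_bytes(s: str) -> int:
    """Number of bytes in the UTF-8 encoding (what a byte-based strlen sees)."""
    return len(s.encode("utf-8"))


def utf16_units(s: str) -> int:
    """Number of UTF-16 code units (each unit is 2 bytes, no BOM)."""
    return len(s.encode("utf-16-le")) // 2


def span(s: str, begin: int, end: int) -> str:
    """Suggestion 2: offsets name the gaps between code points.

    Gap 0 is before the first code point, gap length(s) is after the last.
    """
    if not 0 <= begin <= end <= len(s):
        raise ValueError("offsets must satisfy 0 <= begin <= end <= length")
    return s[begin:end]


a_umlaut = "\u00e4"   # precomposed a with diaeresis
emoji = "\U0001F600"  # a code point outside the Basic Multilingual Plane

print(length(a_umlaut), utf8_bytes(a_umlaut), utf16_units(a_umlaut))
print(length(emoji), utf8_bytes(emoji), utf16_units(emoji))
print(span("D\u00fcrst", 1, 3))
```

With these definitions the three counts agree only for ASCII text, which
is why an annotation offset is ambiguous unless the counting unit is
stated alongside it.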


-- 
Sebastian Hellmann
AKSW/KILT research group
Institute for Applied Informatics (InfAI) at University Leipzig
DBpedia Association
Events:
* *Oct 31st, 2015* Deadline for Quality Management of Semantic Web 
Assets (Data, Services and Systems) 
<http://www.semantic-web-journal.net/blog/call-papers-special-issue-quality-management-semantic-web-assets-data-services-and-systems>
Come to Germany as a PhD: http://bis.informatik.uni-leipzig.de/csf
Projects: http://dbpedia.org, http://nlp2rdf.org,
http://linguistics.okfn.org, https://www.w3.org/community/ld4lt
Homepage: http://aksw.org/SebastianHellmann
Research Group: http://aksw.org
Thesis:
http://tinyurl.com/sh-thesis-summary
http://tinyurl.com/sh-thesis

Received on Tuesday, 20 October 2015 11:38:26 UTC