Hi,

  The attached code is a fast, small, simple, robust UTF-8 decoder in C;
on my antiquated system, mere validation of in-memory data proceeds at
about 1 GB/s. Capturing the code points as well runs at about 25% of that
rate. The algorithm is online, allowing byte-level streaming decoding. It
protects against all forms of illegal UTF-8. The algorithm proceeds thus:

  - Each byte is associated with a character class and a mask
  - The character class is used to advance a finite automaton
  - The mask is used to strip off leading bits from the byte
  - The remaining bits are combined into a Unicode code point
  - A code point is complete if the DFA enters the final state

The table used for byte mapping and the DFA is 288 to 416 bytes in size;
the smaller version requires a branch instruction in the decoder, the
larger version does not. The automaton was built using

  http://search.cpan.org/dist/Unicode-SetAutomaton/
  http://search.cpan.org/dist/Set-IntSpan-Partition/

The latter was used for the character class mapping. I've tested this on
some properly encoded files and it works as expected. I could not find
an easily usable test set with malformed documents. I also could not
find interesting UTF-8 decoders to compare this one with; most of them
are rather scary.

regards,
-- 
Björn Höhrmann · mailto:bjoern@hoehrmann.de · http://bjoern.hoehrmann.de
Am Badedeich 7 · Telefon: +49(0)160/4415681 · http://www.bjoernsworld.de
25899 Dagebüll · PGP Pub. KeyID: 0xA4357E78 · http://www.websitedev.de/
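
The five steps listed in the message (class lookup, automaton step, mask,
accumulation, final state) can be illustrated with a small sketch. This
is not the attached code: the attachment is table-driven, while the
sketch below spells the class mapping and the transitions out as plain
branches so the logic is readable. Every name in it (sketch_class,
sketch_next, sketch_decode, sketch_count, the S_ and C_ constants) is
hypothetical, and the choice of twelve classes and nine states is an
assumption made for this illustration.

```c
#include <stddef.h>
#include <stdint.h>

/* Illustrative sketch only; hypothetical names, not the attached tables. */
enum { S_ACCEPT, S_REJECT, S_T1, S_T2, S_T3, S_E0, S_ED, S_F0, S_F4 };
enum { C_ASCII, C_80, C_90, C_A0, C_BAD, C_L2,
       C_E0, C_L3, C_ED, C_F0, C_L4, C_F4 };

/* Per class: the bits of the byte that carry code point data. */
static const uint8_t sketch_mask[] = {
    0x7F, 0x3F, 0x3F, 0x3F, 0x00, 0x1F, 0x0F, 0x0F, 0x0F, 0x07, 0x07, 0x07
};

/* Step 1: map a byte to its character class. */
static int sketch_class(uint8_t b)
{
    if (b < 0x80) return C_ASCII;
    if (b < 0x90) return C_80;      /* continuation 80..8F */
    if (b < 0xA0) return C_90;      /* continuation 90..9F */
    if (b < 0xC0) return C_A0;      /* continuation A0..BF */
    if (b < 0xC2) return C_BAD;     /* C0, C1: always overlong */
    if (b < 0xE0) return C_L2;      /* two-byte lead */
    if (b == 0xE0) return C_E0;     /* three-byte lead, overlong risk */
    if (b == 0xED) return C_ED;     /* three-byte lead, surrogate risk */
    if (b < 0xF0) return C_L3;      /* other three-byte leads */
    if (b == 0xF0) return C_F0;     /* four-byte lead, overlong risk */
    if (b < 0xF4) return C_L4;      /* F1..F3 */
    if (b == 0xF4) return C_F4;     /* four-byte lead, > U+10FFFF risk */
    return C_BAD;                   /* F5..FF */
}

/* Step 2: advance the automaton on the class. S_REJECT is sticky. */
static int sketch_next(int state, int cls)
{
    int cont = (cls == C_80 || cls == C_90 || cls == C_A0);
    switch (state) {
    case S_ACCEPT:
        switch (cls) {
        case C_ASCII: return S_ACCEPT;
        case C_L2:    return S_T1;
        case C_E0:    return S_E0;
        case C_L3:    return S_T2;
        case C_ED:    return S_ED;
        case C_F0:    return S_F0;
        case C_L4:    return S_T3;
        case C_F4:    return S_F4;
        default:      return S_REJECT;
        }
    case S_T1: return cont ? S_ACCEPT : S_REJECT;
    case S_T2: return cont ? S_T1 : S_REJECT;
    case S_T3: return cont ? S_T2 : S_REJECT;
    case S_E0: return cls == C_A0 ? S_T1 : S_REJECT;
    case S_ED: return (cls == C_80 || cls == C_90) ? S_T1 : S_REJECT;
    case S_F0: return (cls == C_90 || cls == C_A0) ? S_T2 : S_REJECT;
    case S_F4: return cls == C_80 ? S_T2 : S_REJECT;
    default:   return S_REJECT;
    }
}

/* Steps 3 to 5: mask the byte, fold it into the code point, and report
 * the new state. *codep is complete whenever S_ACCEPT is returned. */
static int sketch_decode(int *state, uint32_t *codep, uint8_t byte)
{
    int cls = sketch_class(byte);
    uint32_t bits = (uint32_t)(byte & sketch_mask[cls]);
    *codep = (*state == S_ACCEPT) ? bits : ((*codep << 6) | bits);
    *state = sketch_next(*state, cls);
    return *state;
}

/* Counts code points; returns -1 on malformed or truncated input.
 * If last is non-NULL it receives the most recent complete code point. */
static long sketch_count(const uint8_t *s, size_t n, uint32_t *last)
{
    int state = S_ACCEPT;
    uint32_t cp = 0;
    long count = 0;
    for (size_t i = 0; i < n; i++) {
        if (sketch_decode(&state, &cp, s[i]) == S_REJECT) return -1;
        if (state == S_ACCEPT) { count++; if (last) *last = cp; }
    }
    return state == S_ACCEPT ? count : -1;
}
```

Because sketch_decode keeps all of its context in the two caller-owned
variables, input can be fed one byte at a time from any source, which is
what makes the approach usable for streaming. A table-driven version
would replace sketch_class and sketch_next with array lookups.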