Four years of decoding UTF-8...

So,

  http://bjoern.hoehrmann.de/utf-8/decoder/dfa/ is four years old now. I
learned earlier this week that the Adobe Flash Player now uses it, and I
downloaded the binary and grepped for the data table, and indeed, so it
does. I guess that makes it by far my most successful code, by number of
installations.

I learned a lot writing it. One lesson was that C compilers are not
very good at optimising code for speed. For instance, the `decode()`
function should use `uint8_t` as the type of the `byte` parameter, but
despite the code being inlined, and despite the parameter only ever
being fed `uint8_t` values, I had to use a register-sized type to get
appropriate performance.
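
For reference, and from memory rather than copied from the page (the
page is the authority), the function has roughly this shape; this is
the pre-multiplied variant that comes up again further below:

  #include <stdint.h>

  #define UTF8_ACCEPT 0
  #define UTF8_REJECT 12

  extern const uint8_t utf8d[]; /* the data table published there */

  /* Feed one byte at a time; returns UTF8_ACCEPT once a complete code
     point is in *codep, UTF8_REJECT on malformed input, and some
     intermediate state otherwise. Note that `byte` is a register-sized
     uint32_t rather than the uint8_t one would expect. */
  uint32_t decode(uint32_t* state, uint32_t* codep, uint32_t byte) {
    uint32_t type = utf8d[byte];

    *codep = (*state != UTF8_ACCEPT) ?
      (byte & 0x3fu) | (*codep << 6) :
      (0xff >> type) & (byte);

    *state = utf8d[256 + *state + type];
    return *state;
  }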

Finding a formulation that performed well in MSVC and GCC wasn't easy,
and as noted in the document, older GCCs produced faster code than
newer versions did under the same options, often with huge differences:
over 20% between the two GCCs, for instance, while elsewhere people
celebrate 2% improvements.

It wasn't actually clear to me that I had much of a chance to
outperform the other widely used implementations. How do lookup tables
compare to carefully optimised control flow statements? Some time
before I started work on the decoder, I was actually more interested
in UTF-8 encoding, with a twist (store an individual code point in a
32-bit integer). I could not come up with code that I liked, so I
asked in comp.lang.c how to improve what I had.

Paul Hsieh came to the rescue with a mixture of table lookups and bit
arithmetic, which encouraged me to read up quite a bit on the topic,
and before long I found myself studying Agner Fog's instruction tables
and other resources. My Perl module Unicode::SetAutomaton "still" uses
some variant of what emerged in that Usenet discussion.
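
If I read the twist correctly, that is, packing the UTF-8 byte
sequence of a single code point into one 32-bit integer, then a naive
and branchy version of the problem (not Paul Hsieh's solution, and not
what the module ended up with) would be something like this:

  #include <stdint.h>

  /* UTF-8 encoding of one code point, packed into a uint32_t, so that
     U+00F6 becomes 0xC3B6 and U+20AC becomes 0xE282AC; surrogates and
     values above U+10FFFF are ignored here for brevity */
  uint32_t utf8_pack(uint32_t cp) {
    if (cp < 0x80)
      return cp;
    if (cp < 0x800)
      return 0xC080u | ((cp >> 6) << 8) | (cp & 0x3F);
    if (cp < 0x10000)
      return 0xE08080u
           |  ((cp >> 12)         << 16)
           | (((cp >>  6) & 0x3F) <<  8)
           |    (cp & 0x3F);
    return 0xF0808080u
         |  ((cp >> 18)         << 24)
         | (((cp >> 12) & 0x3F) << 16)
         | (((cp >>  6) & 0x3F) <<  8)
         |    (cp & 0x3F);
  }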

So one thing I learned there was that it is rather unclear which kind
of code will actually perform better, since processors vary, for
instance, in the penalty incurred for a branch misprediction. Another
thing I note in the decoder documentation is that the performance
figures do not model real-world applications well, since those usually
add work on top of partial results from the decoder; so one issue is
the interaction between my code and the calling code, like how they
might trash caches...

The Unicode::SetAutomaton code is much closer to what I was working on
at the time. The module basically allows you to rewrite a Unicode-based
regular expression into a UTF-8-based regular expression. Say you have
the regular expression `/ö/`; the module then helps you generate the
regular expression `/\xC3\xB6/`, except that it supports whole
character classes like "all Unicode code points of the general
category 'Letter'".

I was working on various formal language and automata tools at the
time, after James Clark's "An algorithm for RELAX NG validation"
introduced me to Janusz Brzozowski's "Derivatives of Regular
Expressions" from long before I was born. That did a whole lot to
demystify this area of computing for me. Working more recently on
de-horrifying email plaintext, it still pains me that there are no
tools for problems I regularly encounter, problems that I know are
trivial due to this. For instance, consider

  > ----- Original Message -----
  > From: ...
  > ...
  > Subject: Very long subject that
  > is wrapped like this

Noting that the leading `> ` is not due to quoting, what I would want
to do is match the subject string from the parent mail in the child
mail, but at basically any point in the subject string I would also
want to accept white space or ">" as an alternative. That is
essentially the `interleave` operator in RELAX NG, so something like

  "Very long subject ..." `interleave` /[ \n>]*/

is what I want to match. And `interleave` is easy to implement, just
like `intersect`, but regular expression systems usually support
neither, and, as an even greater pain point, they tend to work only on
strings. As my data model is "tokens", that makes for something like

  > This is quoted text that has be
  > en wrapped in the wrong position

namely a word that appears to have been quoted ("has"), followed by two
words that do not appear to be quoted ("be" and "en"), followed by more
words that appear to be quoted ("wrapped ..."); checking whether "be" +
"en" is actually "been" in the mail that is being quoted is rather
difficult.
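
To make the `interleave` point concrete, here is a minimal sketch of my
own (not code from any of the tools mentioned) using Brzozowski
derivatives over bytes; the interleave case is no harder than the
others:

  #include <stdlib.h>
  #include <stddef.h>

  enum kind { EMPTY, EPSILON, CHARSET, SEQ, ALT, STAR, INTERLEAVE };

  typedef struct re {
    enum kind kind;
    unsigned char set[256];    /* CHARSET: byte membership table     */
    struct re *a, *b;          /* operands of the compound operators */
  } re;

  static re *mk(enum kind k, re *a, re *b) {
    re *r = calloc(1, sizeof *r); /* sketch: never freed */
    r->kind = k; r->a = a; r->b = b;
    return r;
  }

  /* does the expression match the empty string? */
  static int nullable(const re *r) {
    switch (r->kind) {
    case EPSILON: case STAR:   return 1;
    case SEQ: case INTERLEAVE: return nullable(r->a) && nullable(r->b);
    case ALT:                  return nullable(r->a) || nullable(r->b);
    default:                   return 0;  /* EMPTY, CHARSET */
    }
  }

  /* derivative of the expression with respect to one byte */
  static re *deriv(re *r, unsigned char c) {
    switch (r->kind) {
    case CHARSET:
      return mk(r->set[c] ? EPSILON : EMPTY, NULL, NULL);
    case SEQ:
      if (nullable(r->a))
        return mk(ALT, mk(SEQ, deriv(r->a, c), r->b), deriv(r->b, c));
      return mk(SEQ, deriv(r->a, c), r->b);
    case ALT:
      return mk(ALT, deriv(r->a, c), deriv(r->b, c));
    case STAR:
      return mk(SEQ, deriv(r->a, c), r);
    case INTERLEAVE:
      /* the whole trick: d(a interleave b) =
         d(a) interleave b  |  a interleave d(b) */
      return mk(ALT, mk(INTERLEAVE, deriv(r->a, c), r->b),
                     mk(INTERLEAVE, r->a, deriv(r->b, c)));
    default:                   /* EMPTY, EPSILON */
      return mk(EMPTY, NULL, NULL);
    }
  }

  /* match by repeatedly taking derivatives, then test nullability */
  static int match(re *r, const unsigned char *s, size_t n) {
    for (size_t i = 0; i < n; ++i)
      r = deriv(r, s[i]);
    return nullable(r);
  }

A real implementation would simplify and memoise the derivatives
(which is how they turn into a DFA), but the point stands: `interleave`
fits in as naturally as concatenation does; it is the tooling that is
missing.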

Anyways... Something I tried in order to improve the performance of
the example transcoder to UTF-16 was special handling for long
US-ASCII sequences, with plain C and the available x86 intrinsics.
Back then I was not able to get any improvement out of that. The
control flow logic to avoid misaligned reads took too many cycles, on
my processor the penalty for misaligned reads was, as best as I could
tell, too high, and I did not test it on purely US-ASCII text, since a
total lack of non-ASCII bytes in long sequences does not seem very
realistic for most uses. But in 2011 Bryan O'Sullivan tried to use my
code for Haskell's Unicode text package, and, apparently because CPUs
have changed since, he was able to get a lot out of this. I have not
independently confirmed that yet, but I count this as another thing
learned.
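
For illustration, the kind of US-ASCII fast path meant here, as a
rough sketch of my own rather than the code discussed above: skip over
an ASCII prefix by first reaching alignment byte-wise and then testing
one register-sized word per step for high bits.

  #include <stdint.h>
  #include <stddef.h>
  #include <string.h>

  /* length of the leading run of US-ASCII bytes in s[0..n) */
  size_t ascii_prefix(const uint8_t *s, size_t n) {
    size_t i = 0;

    /* byte-wise until the read position is word-aligned */
    while (i < n && ((uintptr_t)(s + i) & (sizeof(uint32_t) - 1))) {
      if (s[i] & 0x80) return i;
      ++i;
    }

    /* word-wise while no byte in the word has the high bit set */
    while (i + sizeof(uint32_t) <= n) {
      uint32_t w;
      memcpy(&w, s + i, sizeof w);   /* aligned by construction */
      if (w & UINT32_C(0x80808080)) break;
      i += sizeof w;
    }

    /* byte-wise for the rest, including the word that broke the loop */
    while (i < n && !(s[i] & 0x80))
      ++i;

    return i;
  }

The ASCII prefix can then be widened to UTF-16 directly, with the
decoder handling whatever follows; whether the alignment handling and
the extra loop pay off is exactly the question discussed above.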

In fact, I do not care all that much about throughput. The more widely
used variation of my initial decoder is the last variation mentioned on
the page above, due to Rich Felker pointing out that the state table
can be pre-multiplied, which makes the data table smaller and
eliminates a whole x86 instruction for every byte (or for non-ASCII
bytes only, depending on how callers and compilers optimise). That
cannot possibly be slower, compilers willing, than having the larger
table with the added `shift` instruction, but I never got around to
updating the benchmark table, so it is possible, and perhaps likely,
that my code outperforms the others by a wider margin than it already
did when I measured it. Sadly, nobody has compiled better benchmarks
on the matter, as far as I am aware.
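
Roughly, and leaving the actual table contents to the page above, the
difference is only in how the transition table is indexed; if I recall
the earlier variant correctly:

  /* earlier variant: rows padded to 16 entries so the multiplication
     by the row width can be a shift */
  *state = utf8d[256 + *state * 16 + type];

  /* pre-multiplied variant: states are 0, 12, 24, ..., so the row
     offset is already part of the state value and the shift goes away */
  *state = utf8d[256 + *state + type];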

I, of course, did not write the decoder for speed. I just needed a work-
around for the lack of UTF-8 support "in C", and the existing options...
Well, they were not really options. There are some lessons about API de-
sign in there. I would encourage anyone interested to write a simple "C"
program that just reads from a UTF-8 encoded text file and prints out,
say, the Unicode scalar value of any character found, while also noting
when the file is malformed, with the constraint that you only have, say,
4KB of memory available. Reading files in C like that is already a huge
pain, as far as I am concerned, so the combination of the two problems,
with UTF-8 decoders other than mine, is likely to push you over the edge
if you think and feel about code like I do...
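
For comparison, a sketch of the exercise using the `decode()` function
and data table from the page above:

  /* assumes decode(), UTF8_ACCEPT, UTF8_REJECT and the utf8d table
     from the page linked at the top */
  #include <stdio.h>
  #include <inttypes.h>

  int main(int argc, char *argv[]) {
    FILE *f = argc > 1 ? fopen(argv[1], "rb") : stdin;
    uint8_t buf[4096];
    uint32_t state = UTF8_ACCEPT, codep = 0;
    size_t n;

    if (!f) { perror("fopen"); return 1; }

    while ((n = fread(buf, 1, sizeof buf, f)) > 0) {
      for (size_t i = 0; i < n; ++i) {
        switch (decode(&state, &codep, buf[i])) {
        case UTF8_ACCEPT: printf("U+%04" PRIX32 "\n", codep); break;
        case UTF8_REJECT: fprintf(stderr, "malformed input\n"); return 1;
        }
      }
    }

    if (state != UTF8_ACCEPT)
      fprintf(stderr, "truncated sequence at end of input\n");

    return 0;
  }

The point being that the decoder state survives buffer boundaries, so
the 4KB constraint costs nothing extra.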

One property of the decoder is that it validates at all times, more so
than others because it "stops" earlier on malformed input, and as it is
quite fast, there isn't much incentive to use a non-validating one
within "hot" loops, which could lead to security vulnerabilities when
used incorrectly. From a discussion between Greg Kroah-Hartman and
Linus Torvalds in the context of my decoder, I learned that this is
not a universally shared "pro", with Linus Torvalds noting:

  Please don't use stuff that "validates" utf8. It's crap.
  It leads to various security issues, and it results in
  problems whenever you have non-well-formed-utf8 data...

There are many things that I like about the code, and some that I
dislike. The `decode(...)` function is pure and generates a finite and
very small output set that can be tested exhaustively, automatically,
and, if you are into that sort of thing, "manually", or
"mathematically" if you prefer. It is mainly data driven, so it is
easy to port to other languages. It contains only a single branch,
which can easily be removed, say with conditional move (`cmov`)
instructions. More importantly, it allows calling code to stay in
control: you don't have to assemble a "chunk" of data to pass to the
decoder, then pass control to the decoder, and regain it at a later
point; it is more of a natural part of your own code. As there is only
very little code, it can easily be "inlined", and with that, as you
use it for different purposes, parts of the code can be eliminated.
You may, for instance, want an `is_utf8` function that just validates
a string without caring about the code points, where the compiler can
remove the code for the code point computation. And so on...
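
The `is_utf8` case, as a sketch, again assuming the `decode()`
function and definitions from the page above:

  #include <stdint.h>
  #include <stddef.h>

  /* with decode() inlined here, the compiler can drop the code point
     computation entirely, since `codep` is never read */
  int is_utf8(const uint8_t *s, size_t n) {
    uint32_t state = UTF8_ACCEPT, codep = 0;

    for (size_t i = 0; i < n; ++i)
      if (decode(&state, &codep, s[i]) == UTF8_REJECT)
        return 0;

    return state == UTF8_ACCEPT;
  }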

Some other day I might get to write down how trying to improve the code
almost cost me my only computer, due to overheating while compiling and
instrumenting randomly generated code based on an ABNF grammar...

regards,
-- 
Björn Höhrmann · mailto:bjoern@hoehrmann.de · http://bjoern.hoehrmann.de
Am Badedeich 7 · Telefon: +49(0)160/4415681 · http://www.bjoernsworld.de
25899 Dagebüll · PGP Pub. KeyID: 0xA4357E78 · http://www.websitedev.de/ 

Received on Saturday, 20 April 2013 03:10:44 UTC