- From: Bjoern Hoehrmann <derhoermi@gmx.net>
- Date: Tue, 10 Jan 2012 01:39:55 +0100
- To: Henry Story <henry.story@bblfish.net>
- Cc: "www-tag@w3.org List" <www-tag@w3.org>
* Henry Story wrote:
>Perhaps bringing out some of these aspects more clearly would help
>clarify why simplicity is important, in a way that appeals to education
>don't do so well.

Well, the protocol and format designers and implementers I typically run across generally do not seem to understand language theory well, and you can't really blame them, as there is pretty much no benefit in doing so. That's largely because there are no tools to work with formal languages.

Simple example: you have a specification for a URI scheme and you want to know whether the scheme's syntax is a subset of the generic syntax. Now, the latest fad is to use Turing machines as formal specifications, like you have "algorithms" instead of grammars in the "HTML5" specification, so you would probably start by reverse-engineering the "algorithm" into a grammar. Then you would have to write software that can read that grammar and the generic URI syntax grammar into a common data model, and then you would have to implement something that can tell whether the two are different, which would give you a solution if you've done everything right.

Even if you were lucky and the specific scheme was provided as an ABNF grammar that's compatible with the ABNF grammar for the generic URI syntax, there is no tool that I am aware of that would, for instance, turn all regular productions in an ABNF grammar into a regular expression, and there is no tool that I am aware of that would take two expressions and tell you whether one is a subset of the other. Nothing easily discoverable and usable that I did not write myself, anyway. Even though this is a very simple problem: you turn the grammars into NFAs, the NFAs into DFAs, toggle the acceptance of all states in one, compute a product automaton, minimize it, and look at the result. A 4th semester computer science student should be familiar with all the individual steps. But no mainstream tool for this exists as far as I am aware.
The same goes for parser generators and whatever else. Even though this is a very regularly recurring topic in the CSS Working Group, there is no simple online tool that would tell you whether some proposed syntax is possible under the generic syntax the CSS Working Group promises to never change. Even for much simpler languages like URIs there is no such service, except <http://www.websitedev.de/temp/rfc3986-check.html.gz>. It does not seem likely to me that protocol and format designers will do much to go in the direction proposed in the talk you referenced unless the tools get much, much better.

One point raised in the talk is that it is common for input validation to happen in many uncoordinated places in code, and at times code is written assuming that surely, by the time some input reaches it, it has been validated (which may be true when it was written, but just means a seemingly unrelated change that invalidates the assumption will break the whole thing). One case of that I've seen a number of times is character string input: you validate that an input string is properly formed UTF-8 when reading it the first time, and later on you don't validate it again because that would probably be inefficient. The popular ICU library, for instance, has tools that decode UTF-8 slowly while performing error checks, and tools that decode quickly but do not check for errors.

I arrived at this from a bit of a different angle: I wanted to make standalone C programs that decode UTF-8 properly, but the amount of code the usual UTF-8 decoder employed made using any of them unappealing, never mind that they tended to be so complex you couldn't fully verify that they work correctly. Worse, my general use case was reading from files, and decoders typically made it either difficult or impossible to deal with incomplete input, and that's just asking for trouble when programming in C.
So I wrote my own: http://bjoern.hoehrmann.de/utf-8/decoder/dfa/

In writing it I applied what I knew about formal languages and automata. It's rather unusual in that the caller controls the input loop; normally the caller would collect a reasonably large chunk of input and then call the decoder, one reason being that function calls are expensive and the typical decoder is too large to "inline". That means you have to study how the decoder, for instance, deals with cases where your input buffer ends in the middle of a character, and how to call it to resume. That is not even possible with some decoders, and it's a bit hard to test, since you would have to design code and tests so you actually hit these cases. My code does not have that problem: you always feed it exactly one byte.

You also can quite easily mathematically or automatically prove the correctness of my implementation. The code is very simple and short, and it is possible to simulate all possible states and inputs and be sure that you have done so, at least if you use proper type information (the code uses uint32_t where it should use uint8_t due to compiler "bugs").

More importantly, my decoder is very fast. On the machine I developed it on, it outperformed all other decoders that fully validate their input (that I could find and benchmarked, anyway), and with a minor optimization it performed just as well as the ICU code that does not validate its input at all, meaning there is no incentive to use a less robust but much faster decoder in "inner layer" code. And performance has improved since I first published this: a whole instruction that's applied to every byte in the original is unnecessary, as explained later in the document.
And there are other benefits. ICU, for instance, has different code paths, for performance reasons, for fixed-length input strings and for null-terminated strings, and you typically have different code for decoding and for verification (verification would be much slower if you also decoded but then discarded the result without using it); for my decoder, the compiler can detect and optimize these cases on its own.

The underlying idea is a simple classic: it's a regular language, so use a DFA to recognize it. The lookup tables you would get for the 256 character alphabet were too large, so I used one level of indirection to minimize the space requirements (bytes are translated to character classes, so for instance all ASCII bytes are treated identically), and then the remaining problem was that you have to strip certain bits from starter bytes (the first byte in a UTF-8 sequence for a multi-byte character), and I found a way to use the character classes for this purpose too.

So, in this sense, if you understand formal languages well and you can apply this knowledge to practical problems, you will find solutions that are much simpler, more secure, more easily re-usable, and easier to verify. But, as before, even the first step in constructing my decoder would require some tool to discover the DFA tables for the simple verification of input. http://search.cpan.org/dist/Unicode-SetAutomaton/ is probably what I used, after writing it weeks prior. No tools.

The talk suggests parser generation from grammars. As it is, we would have to re-educate protocol and format designers, in certain circles anyway, to even value and provide grammars to begin with. They prefer the prose Turing machines that people who want a grammar have to reverse-engineer.

-- 
Björn Höhrmann · mailto:bjoern@hoehrmann.de · http://bjoern.hoehrmann.de 
Am Badedeich 7 · Telefon: +49(0)160/4415681 · http://www.bjoernsworld.de 
25899 Dagebüll · PGP Pub. KeyID: 0xA4357E78 · http://www.websitedev.de/
Received on Tuesday, 10 January 2012 00:40:22 UTC