[RDF 1.1 N-Triples] Security considerations when (stream-)parsing remote documents from Dominik George on 2024-08-03 (public-rdf-comments@w3.org from August 2024)

From: Dominik George <nik@naturalnet.de>
Date: Sat, 3 Aug 2024 12:25:26 +0200
To: public-rdf-comments@w3.org
Message-ID: <aw3ffpwpj7tahmngmgyqoxoyerxtooq4ctzyp3wfiqxdi6sise@ka55rngvpsdp>

Hi,

I am implementing an N-Triples parser that is capable of stream-parsing
from remote HTTP servers by a provided IRI.

While clearly this is a security risk in general, and taking in
untrusted IRIs from unknown sources should be handled very carefully,
one particular issue arose while implementing.

In order to work efficiently, the stream parser should read line by line
(as N-Triples is a line-based format), and it makes no sense to start
parsing a chunk before a line ending has been reached.

The possible attack vector now is:

* An attacker provides a dereferenceable IRI to a remote HTTP server
  serving an N-Triples document
* The remote HTTP server is crafted so that it sends valid Unicode
  text, carefully avoiding ever sending a \r or \n (or ., if we went on
  to take that as a stop condition)

N-Triples has several productions that theoretically allow for
arbitrarily long values, most notably blank node labels and literals
(IRIs and comments are unbounded as well).

Clearly, some limit needs to be set on how much data to load before
giving up. Thus, I wonder whether there is a sane default I could
choose, based on the experience of others?
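
To make the idea of such a limit concrete, here is a minimal Python
sketch of a line reader that gives up once too many bytes arrive
without a line ending. The names iter_bounded_lines, LineTooLongError,
max_line_bytes and chunk_size are made up for illustration, and the
1 MiB default is only a placeholder, not a recommendation.

```python
import re
from typing import BinaryIO, Iterator

# N-Triples defines EOL as one or more CR/LF characters, so splitting on
# either byte and dropping empty lines is sufficient for this sketch.
_EOL = re.compile(rb"[\r\n]")


class LineTooLongError(ValueError):
    """No line ending was seen within the allowed number of bytes."""


def iter_bounded_lines(
    stream: BinaryIO,
    max_line_bytes: int = 1024 * 1024,  # placeholder default (assumption)
    chunk_size: int = 8192,
) -> Iterator[bytes]:
    """Yield non-empty lines, refusing any line above max_line_bytes."""
    pending = b""
    while True:
        chunk = stream.read(chunk_size)
        if not chunk:
            break
        # The last element is the (possibly empty) unterminated remainder.
        *complete, pending = _EOL.split(pending + chunk)
        for line in complete:
            if len(line) > max_line_bytes:
                raise LineTooLongError("line exceeds max_line_bytes")
            if line:
                yield line
        # Give up as soon as the unterminated remainder grows too large,
        # so at most max_line_bytes + chunk_size bytes are ever buffered.
        if len(pending) > max_line_bytes:
            raise LineTooLongError("no line ending within max_line_bytes")
    if pending:
        # A final line without a trailing EOL is still handed on.
        yield pending
```

With this shape, the memory held per connection is bounded no matter
what the server sends, and the actual value of the limit becomes a
plain configuration question.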

My current idea is that my N-Triples source implementation, when
handling a remote document over HTTP, should require the server to send
a Content-Length header, and only begin loading and parsing if the
declared length is acceptable. The option to parse truly streamed data,
where the Content-Length is not known beforehand, should be placed
behind a flag that needs to be set explicitly.
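
A rough Python sketch of that gate, using only the standard library,
might look as follows. The names check_declared_length, open_remote,
RemoteSourceRejected, max_bytes and allow_unbounded are hypothetical,
and the sketch assumes the IRI has already been mapped to a plain
ASCII URI.

```python
import urllib.request
from typing import Mapping, Optional
from urllib.parse import urlsplit


class RemoteSourceRejected(Exception):
    """The remote document is refused before any body bytes are parsed."""


def check_declared_length(
    headers: Mapping[str, str],
    max_bytes: int,
    allow_unbounded: bool = False,
) -> Optional[int]:
    """Return the declared length, or None if streaming was allowed."""
    raw = headers.get("Content-Length")
    if raw is None:
        if allow_unbounded:
            return None  # caller explicitly opted in to streamed data
        raise RemoteSourceRejected("no Content-Length and streaming not enabled")
    value = raw.strip()
    if not (value.isascii() and value.isdigit()):
        raise RemoteSourceRejected(f"malformed Content-Length: {raw!r}")
    declared = int(value)
    if declared > max_bytes:
        raise RemoteSourceRejected(f"declared length {declared} exceeds {max_bytes}")
    return declared


def open_remote(iri: str, max_bytes: int, allow_unbounded: bool = False,
                timeout: float = 30.0):
    """Open the document only if its declared length is acceptable."""
    # urlopen also understands other schemes, so restrict to HTTP(S) first.
    if urlsplit(iri).scheme.lower() not in ("http", "https"):
        raise RemoteSourceRejected("only http(s) IRIs are accepted")
    request = urllib.request.Request(iri, headers={"Accept": "application/n-triples"})
    response = urllib.request.urlopen(request, timeout=timeout)
    try:
        # response.headers matches header names case-insensitively.
        check_declared_length(response.headers, max_bytes, allow_unbounded)
    except RemoteSourceRejected:
        response.close()
        raise
    return response
```

Note that the header is only what the server claims, so this check
limits the total size of the document but says nothing about how long
a single line within it may be.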


Any recommendations are welcome.

Thanks,
Nik

P.S.: The same concerns exist for Turtle and other formats. However, as
these formats are not line-based, the parser could work with a
fixed-size buffer instead, and an attacker would have to put more effort
into crafting an endless stream of valid tokens so as not to make the
parser suspicious.

Received on Monday, 5 August 2024 08:25:37 UTC