- From: Dominik George <nik@naturalnet.de>
- Date: Sat, 3 Aug 2024 12:25:26 +0200
- To: public-rdf-comments@w3.org
- Message-ID: <aw3ffpwpj7tahmngmgyqoxoyerxtooq4ctzyp3wfiqxdi6sise@ka55rngvpsdp>
Hi,

I am implementing an N-Triples parser that is capable of stream-parsing from remote HTTP servers given a provided IRI. While this is clearly a security risk in general, and taking in untrusted IRIs from unknown sources should be handled very carefully, one particular issue arose while implementing.

In order to work efficiently, the stream parser should read line by line (as N-Triples is a line-based format), and beginning to parse a chunk makes no sense before a line ending is reached. The possible attack vector now is:

* An attacker provides a dereferenceable IRI pointing to a remote HTTP server serving an N-Triples document
* The remote HTTP server is crafted so that it sends valid Unicode text, carefully avoiding ever sending a \r or \n (or a ., if we went on to take that as a stop condition)

N-Triples has two productions that theoretically allow for arbitrarily long values: blank node labels and literals. Clearly, some limit needs to be set for how much data to load before giving up. Thus, I wonder whether there is some sane default I could choose, based on the experience of others?

My current idea is that my N-Triples source implementation, when handling a remote document over HTTP, should require the server to send a Content-Length header, and only begin loading and parsing if that length is acceptable. The option to parse actually streamed data, where the Content-Length is not known beforehand, should be placed behind a flag that needs to be set explicitly.

Any recommendations are welcome.

Thanks,
Nik

P.S.: The same concerns exist for Turtle and other formats. However, as these formats are not line-based, the parser could work with a fixed-size buffer instead, and an attacker would have to put more effort into crafting an endless stream of valid tokens to avoid making the parser suspicious.
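P.P.S.: For concreteness, here is a minimal sketch of the bounded, line-buffered reading described above (Python purely for illustration; the limit values and names are placeholders, not taken from any existing implementation):

    import urllib.request

    # Illustrative limits only; sensible defaults are exactly what I am asking about.
    MAX_CONTENT_LENGTH = 64 * 1024 * 1024  # refuse documents larger than this (placeholder)
    MAX_LINE_LENGTH = 1 * 1024 * 1024      # give up if a single line grows beyond this (placeholder)
    CHUNK_SIZE = 8192

    def iter_remote_ntriples_lines(iri, allow_unknown_length=False):
        """Yield complete lines from a remote N-Triples document, enforcing size limits."""
        with urllib.request.urlopen(iri) as response:
            length = response.headers.get("Content-Length")
            if length is None:
                if not allow_unknown_length:
                    raise ValueError("server sent no Content-Length; streaming not enabled")
            elif int(length) > MAX_CONTENT_LENGTH:
                raise ValueError("document exceeds the configured size limit")

            buffer = b""
            while True:
                chunk = response.read(CHUNK_SIZE)
                if not chunk:
                    break
                buffer += chunk
                # Hand over every complete line; parsing a partial line makes no sense.
                while True:
                    newline = buffer.find(b"\n")
                    if newline == -1:
                        break
                    yield buffer[:newline].rstrip(b"\r")
                    buffer = buffer[newline + 1:]
                # Abort if the server keeps sending data without ever ending the line.
                if len(buffer) > MAX_LINE_LENGTH:
                    raise ValueError("line exceeds the configured length limit")
            if buffer:
                yield buffer  # final line without a trailing newline

The idea is that the Content-Length check catches oversized documents up front, while the per-line cap catches a server that keeps sending valid text without ever ending the line.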
Received on Monday, 5 August 2024 08:25:37 UTC