Two challenges related to KR of scientific papers and books from Stanislav Srednyak, Ph.D. on 2022-11-21 (public-aikr@w3.org from November 2022)

From: Stanislav Srednyak, Ph.D. <stanislav.srednyak@duke.edu>
Date: Mon, 21 Nov 2022 00:59:24 +0000
To: W3C AIKR CG <public-aikr@w3.org>
Message-ID: <BN7PR05MB414825D27315F2DC47CEA663E50A9@BN7PR05MB4148.namprd05.prod.outlook.com>

Dear colleagues,

Thanks for the discussion we had last time. I studied the KR group's material further, and I think the following two questions will resonate with what many people here have been thinking about.

1) Building a representation for LaTeX formulas.

There is an amazing data set at arXiv. More generally, there are many PDFs of scientific articles available for modeling. This data set is challenging for several reasons.

1. It is hard to parse formulas.
2. There are omitted calculations.

Problem 1 is the one I would like to draw attention to. It may be amenable to definite analysis because the set of math objects that humans use is very restricted. In fact, most papers are just about discrete math, number theory, and functions of real variables. People very seldom use higher functionals. Thus, although the ZFC axiomatics admits arbitrarily complicated objects, what is actually found in papers sits on the first several levels of Gödel's definable universe.
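
To make the parsing problem concrete, here is a minimal sketch in Python of turning a LaTeX formula into a small tree. It is only an illustration under strong assumptions: the node labels ("sym", "group", "cmd", "sup", "sub"), the ARITY table, and the function names are my own hypothetical choices, and real arXiv sources use macros and environments that this does not handle.

```python
import re

# One token is a command (\frac), a structural character, or any other single character.
TOKEN = re.compile(r"\\[A-Za-z]+|[{}^_]|[^\s\\{}^_]")

# Number of braced arguments for a few commands (illustrative, far from complete).
ARITY = {"\\frac": 2, "\\sqrt": 1}


def tokenize(src):
    """Split a LaTeX formula into a flat list of tokens, dropping whitespace."""
    return TOKEN.findall(src)


def parse(tokens):
    """Build a nested tuple tree from the token list."""
    pos = 0

    def parse_atom():
        nonlocal pos
        if pos >= len(tokens):
            raise ValueError("unexpected end of formula")
        tok = tokens[pos]
        pos += 1
        if tok == "{":
            items = parse_seq("}")
            if pos >= len(tokens):
                raise ValueError("unbalanced braces")
            pos += 1  # skip the closing brace
            return ("group", items)
        if tok in ARITY:
            args = [parse_atom() for _ in range(ARITY[tok])]
            return ("cmd", tok, args)
        return ("sym", tok)

    def parse_seq(stop):
        nonlocal pos
        items = []
        while pos < len(tokens) and tokens[pos] != stop:
            node = parse_atom()
            # Attach any trailing superscripts or subscripts to the node.
            while pos < len(tokens) and tokens[pos] in ("^", "_"):
                op = tokens[pos]
                pos += 1
                node = ("sup" if op == "^" else "sub", node, parse_atom())
            items.append(node)
        return items

    return parse_seq(None)


tree = parse(tokenize(r"\frac{a}{b} + x^2"))
```

Because the set of objects that actually appears in papers is small, a table like ARITY could in principle be extended to cover most of them, but that is a conjecture rather than something this sketch demonstrates.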

There is an effort to revolutionize the .pdf format here: desci.com

2) AST of code.

There is a lot of code in Git repositories, and there are standard tools for some languages, e.g. LLVM for C++, the ast package for Go, etc. Unfortunately, the representations that result from these tools are ill-suited for describing code. One would like a graphical tool to represent and manipulate the code.
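
As a rough illustration of what a graph view of an AST could look like, here is a minimal sketch that uses Python's standard ast module in place of the LLVM and Go tools mentioned above. The function name ast_edges and the (nodes, edges) output format are hypothetical choices of mine; a real tool would also need node attributes such as identifiers and source positions.

```python
import ast


def ast_edges(source):
    """Return (nodes, edges) for the AST of a piece of Python source.

    nodes[i] is the type name of node i; edges holds (parent, child) index pairs.
    """
    tree = ast.parse(source)
    nodes, edges = [], []

    def visit(node):
        idx = len(nodes)
        nodes.append(type(node).__name__)
        for child in ast.iter_child_nodes(node):
            child_idx = visit(child)
            edges.append((idx, child_idx))
        return idx

    visit(tree)
    return nodes, edges


nodes, edges = ast_edges("def add(a, b):\n    return a + b\n")
```

An edge list like this can be handed to any graph library or drawing tool, which is one possible starting point for the graphical representation asked for here.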

If such tools were available, we could match the parse trees of the accompanying English with the trees from the AST.

Tools like the Transformer, with its multi-head attention, would then be much more useful.

Stan Srednyak

Received on Monday, 21 November 2022 01:00:25 UTC