Announcing the ar5iv-04.2024 dataset

Hi everyone,

I am happy to announce that the latest ar5iv collection of HTML+MathML
documents is now freely available for reuse as a dataset. The release
contains 2.1 million HTML documents and over 1 billion MathML expressions,
generated by LaTeXML v0.8.8.

More details and the download are available at:
https://sigmathling.kwarc.info/resources/ar5iv-dataset-2024/
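
For anyone who wants a quick sanity check on a downloaded document, here is
a minimal sketch in Python that counts the MathML expressions in one HTML
string. It is not part of the release: the names MathCounter and count_math
and the sample markup are hypothetical, and it assumes each expression is
wrapped in a single <math> element.

```python
from html.parser import HTMLParser


class MathCounter(HTMLParser):
    """Counts <math> start tags seen while parsing an HTML document."""

    def __init__(self):
        super().__init__()
        self.count = 0

    def handle_starttag(self, tag, attrs):
        # html.parser lowercases tag names before calling this method.
        if tag == "math":
            self.count += 1


def count_math(html_text):
    """Return the number of <math> elements in html_text (hypothetical helper)."""
    parser = MathCounter()
    parser.feed(html_text)
    parser.close()
    return parser.count


# Hypothetical sample markup, not taken from the dataset.
sample = (
    "<p>Let <math><mi>x</mi></math> and "
    "<math><msup><mi>y</mi><mn>2</mn></msup></math>.</p>"
)
print(count_math(sample))
```

To run it over a file from the dataset, read the file as text and pass the
contents to count_math; the archive layout is described on the page above.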

As a reminder, the "ar5iv Lab" is an HTML preview site for arXiv.org. Since
late 2023, ar5iv has been in the process of being phased out, as arXiv's
official HTML coverage gradually reaches parity. Until then, it continues to
be available at:

https://ar5iv.labs.arxiv.org/

Best regards,
Deyan

Received on Tuesday, 30 April 2024 15:32:09 UTC