Announcing the ChemBench Project

LLMs are on everyone's mind these days. But how good are they at chemistry? We're building a benchmark to find out.

Introduction

Large language models (LLMs) have been making waves. Some see in them sparks of artificial general intelligence, while others see only stochastic parrots.

Chemists, too, have started to use LLMs: to predict properties of molecules and materials,1, 2, 3 to guide the optimization of materials,4 or even to autonomously use tools such as cloud labs.5, 6

Even though this led some to state that "the future of chemistry is language",7 we still have very little systematic understanding of how well models know and understand the chemical sciences.

Understanding the chemical capabilities of frontier models is not only important for improving them (and the systems in which they are used), but also has important safety implications, as chemical frontier models are a dual-use technology.8

Building a suite of chemical evals

Google has been pushing one of the most popular benchmark suites for LLMs: BigBench.9 This suite contains more than 200 tasks, but fewer than a handful have any relation to chemistry.

In the ChemBench project, we build on the success of BigBench, but add some crucial pieces to create a meaningful gauge of the performance of frontier models in chemistry.

Chemistry tasks

The most fundamental piece for that is a diverse collection of chemical tasks. We have sourced more than 6000 questions from various origins. Some were automatically generated (e.g., on the symmetry of compounds), some semi-automatically (based on curated datasets), and many others were manually curated from exam papers and exercise sheets or are completely novel questions. Before questions were added to our pool, they all underwent a peer-review process.
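To give a flavor of what such a task might look like in machine-readable form, here is a minimal sketch of a single entry. The field names are illustrative assumptions, not ChemBench's actual schema:

```python
# A minimal, illustrative sketch of one benchmark entry.
# All field names are hypothetical -- they do not reflect ChemBench's real schema.
task = {
    "uuid": "point-group-0042",
    "source": "auto-generated",  # or "exam", "curated-dataset", "novel", ...
    "question": "What is the point group of the water molecule?",
    "answer_type": "multiple_choice",
    "choices": ["C2v", "C3v", "D2h", "Td"],
    "correct": "C2v",
    "peer_reviewed": True,  # every question passes review before entering the pool
}
```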

Parsing and evaluation routine

The most popular way of testing LLMs is with multiple-choice questions. While certain concepts of the chemical sciences can be probed using such questions, they are not enough.

A very important part of chemistry is solving or balancing equations and performing calculations. Therefore, our benchmark also contains tasks for which the model is given no choices and is simply expected to return the right answer, for example, a number.

While this allows us to probe the models with more interesting questions, it also required us to develop a pipeline for prompting models with such questions as well as for extracting and analyzing the model responses.
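A minimal sketch of what such a pipeline could look like for numeric answers is shown below. The regex-based extraction and relative-tolerance scoring are common choices for this kind of evaluation, not necessarily what ChemBench itself uses:

```python
import re

def extract_number(response: str) -> float | None:
    """Pull the last number out of a free-text model response."""
    # Matches integers, decimals, and scientific notation (e.g., "-1.5e-3").
    matches = re.findall(r"-?\d+\.?\d*(?:[eE][+-]?\d+)?", response)
    return float(matches[-1]) if matches else None

def is_correct(response: str, target: float, rel_tol: float = 0.01) -> bool:
    """Score an open-ended numeric answer within a relative tolerance."""
    value = extract_number(response)
    return value is not None and abs(value - target) <= rel_tol * abs(target)

# A model typically wraps the number in prose; extraction still works:
print(is_correct("The pH of the buffer is approximately 4.76.", 4.75))  # True
```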

Contextualizing evals with human baselines

While this benchmark allows us to compare models, the scores we get are difficult to interpret. What does an accuracy of 46% for "organic chemistry" mean?

Clearly, those scores would be more interpretable and useful if we could compare them to how well chemists perform. To understand how chemists with different backgrounds and experience levels perform, we (the LAMAlab, together with Aswanth Krishna) have been building an app to learn just that.
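As a toy illustration of why such a baseline helps, consider placing a model's topic score within a distribution of (entirely hypothetical) human scores:

```python
# Entirely hypothetical human scores (fraction correct) on one topic.
human_scores = [0.31, 0.38, 0.42, 0.45, 0.51, 0.58, 0.63, 0.70]
model_score = 0.46  # the bare number that is hard to interpret on its own

# "Beats half of our human volunteers" is far more interpretable than "46%".
fraction_beaten = sum(s < model_score for s in human_scores) / len(human_scores)
print(f"The model outperforms {fraction_beaten:.0%} of the human participants.")
```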

What's next?

We are currently finalizing the benchmark and will release it soon. In parallel, we are working on a paper that will describe the benchmark in more detail.

This benchmark will not be the end of the story. There are many more things we want, and need, to do to understand the capabilities of LLMs in chemistry. We are already working on a few of those, but we are also looking for collaborators. Reach out to us if you are interested in contributing.

Announcement video

If you prefer watching a video, here is a short announcement video:

References

  1. Jablonka, K. M.; Schwaller, P.; Ortega-Guerrero, A.; Smit, B. Leveraging Large Language Models for Predictive Chemistry. 2023.

  2. Jablonka, K. M.; Ai, Q.; Al-Feghali, A.; Badhwar, S.; Bocarsly, J. D.; Bran, A. M.; Bringuier, S.; Brinson, L. C.; Choudhary, K.; Circi, D.; Cox, S.; de Jong, W. A.; Evans, M. L.; Gastellu, N.; Genzling, J.; Gil, M. V.; Gupta, A. K.; Hong, Z.; Imran, A.; Kruschwitz, S.; Labarre, A.; Lála, J.; Liu, T.; Ma, S.; Majumdar, S.; Merz, G. W.; Moitessier, N.; Moubarak, E.; Mouriño, B.; Pelkie, B.; Pieler, M.; Ramos, M. C.; Ranković, B.; Rodriques, S. G.; Sanders, J. N.; Schwaller, P.; Schwarting, M.; Shi, J.; Smit, B.; Smith, B. E.; Van Herck, J.; Völker, C.; Ward, L.; Warren, S.; Weiser, B.; Zhang, S.; Zhang, X.; Zia, G. A.; Scourtas, A.; Schmidt, K. J.; Foster, I.; White, A. D.; Blaiszik, B. 14 Examples of How LLMs Can Transform Materials Science and Chemistry: A Reflection on a Large Language Model Hackathon. Digital Discovery 2023, 2 (5), 1233–1250.

  3. Xie, Z.; Evangelopoulos, X.; Omar, Ö. H.; Troisi, A.; Cooper, A. I.; Chen, L. Fine-Tuning GPT-3 for Machine Learning Electronic and Functional Properties of Organic Molecules. Chemical Science 2024, 15 (2), 500–510.

  4. Ramos, M. C.; Michtavy, S. S.; Porosoff, M. D.; White, A. D. Bayesian Optimization of Catalysts With In-Context Learning. arXiv 2023.

  5. Boiko, D. A.; MacKnight, R.; Kline, B.; Gomes, G. Autonomous Chemical Research with Large Language Models. Nature 2023, 624 (7992), 570–578.

  6. Bran, A. M.; Cox, S.; Schilter, O.; Baldassari, C.; White, A. D.; Schwaller, P. ChemCrow: Augmenting Large-Language Models with Chemistry Tools. arXiv 2023.

  7. White, A. D. The Future of Chemistry Is Language. Nature Reviews Chemistry 2023, 7 (7), 457–458.

  8. Urbina, F.; Lentzos, F.; Invernizzi, C.; Ekins, S. Dual Use of Artificial-Intelligence-Powered Drug Discovery. Nature Machine Intelligence 2022, 4 (3), 189–191.

  9. Srivastava, A.; Rastogi, A.; Rao, A.; Shoeb, A. A. M.; Abid, A.; Fisch, A.; Brown, A. R.; Santoro, A.; Gupta, A.; Garriga-Alonso, A.; Kluska, A.; Lewkowycz, A.; Agarwal, A.; Power, A.; Ray, A.; Warstadt, A.; Kocurek, A. W.; Safaya, A.; Tazarv, A.; Xiang, A.; Parrish, A.; Nie, A.; Hussain, A.; Askell, A.; Dsouza, A.; Slone, A.; Rahane, A.; Iyer, A. S.; Andreassen, A.; Madotto, A.; Santilli, A.; Stuhlmüller, A.; Dai, A.; La, A.; Lampinen, A.; Zou, A.; Jiang, A.; Chen, A.; Vuong, A.; Gupta, A.; Gottardi, A.; Norelli, A.; Venkatesh, A.; Gholamidavoodi, A.; Tabassum, A.; Menezes, A.; Kirubarajan, A.; Mullokandov, A.; Sabharwal, A.; Herrick, A.; Efrat, A.; Erdem, A.; Karakaş, A.; Roberts, B. R.; Loe, B. S.; Zoph, B.; Bojanowski, B.; Özyurt, B.; Hedayatnia, B.; Neyshabur, B.; Inden, B.; Stein, B.; Ekmekci, B.; Lin, B. Y.; Howald, B.; Orinion, B.; Diao, C.; Dour, C.; Stinson, C.; Argueta, C.; Ramírez, C. F.; Singh, C.; Rathkopf, C.; Meng, C.; Baral, C.; Wu, C.; Callison-Burch, C.; Waites, C.; Voigt, C.; Manning, C. D.; Potts, C.; Ramirez, C.; Rivera, C. E.; Siro, C.; Raffel, C.; Ashcraft, C.; Garbacea, C.; Sileo, D.; Garrette, D.; Hendrycks, D.; Kilman, D.; Roth, D.; Freeman, D.; Khashabi, D.; Levy, D.; González, D. M.; Perszyk, D.; Hernandez, D.; Chen, D.; Ippolito, D.; Gilboa, D.; Dohan, D.; Drakard, D.; Jurgens, D.; Datta, D.; Ganguli, D.; Emelin, D.; Kleyko, D.; Yuret, D.; Chen, D.; Tam, D.; Hupkes, D.; Misra, D.; Buzan, D.; Mollo, D. C.; Yang, D.; Lee, D.-H.; Schrader, D.; Shutova, E.; Cubuk, E. D.; Segal, E.; Hagerman, E.; Barnes, E.; Donoway, E.; Pavlick, E.; Rodola, E.; Lam, E.; Chu, E.; Tang, E.; Erdem, E.; Chang, E.; Chi, E. A.; Dyer, E.; Jerzak, E.; Kim, E.; Manyasi, E. E.; Zheltonozhskii, E.; Xia, F.; Siar, F.; Martínez-Plumed, F.; Happé, F.; Chollet, F.; Rong, F.; Mishra, G.; Winata, G. I.; de Melo, G.; Kruszewski, G.; Parascandolo, G.; Mariani, G.; Wang, G.; Jaimovitch-López, G.; Betz, G.; Gur-Ari, G.; Galijasevic, H.; Kim, H.; Rashkin, H.; Hajishirzi, H.; Mehta, H.; Bogar, H.; Shevlin, H.; Schütze, H.; Yakura, H.; Zhang, H.; Wong, H. M.; Ng, I.; Noble, I.; Jumelet, J.; Geissinger, J.; Kernion, J.; Hilton, J.; Lee, J.; Fisac, J. F.; Simon, J. B.; Koppel, J.; Zheng, J.; Zou, J.; Kocoń, J.; Thompson, J.; Wingfield, J.; Kaplan, J.; Radom, J.; Sohl-Dickstein, J.; Phang, J.; Wei, J.; Yosinski, J.; Novikova, J.; Bosscher, J.; Marsh, J.; Kim, J.; Taal, J.; Engel, J.; Alabi, J.; Xu, J.; Song, J.; Tang, J.; Waweru, J.; Burden, J.; Miller, J.; Balis, J. U.; Batchelder, J.; Berant, J.; Frohberg, J.; Rozen, J.; Hernandez-Orallo, J.; Boudeman, J.; Guerr, J.; Jones, J.; Tenenbaum, J. B.; Rule, J. S.; Chua, J.; Kanclerz, K.; Livescu, K.; Krauth, K.; Gopalakrishnan, K.; Ignatyeva, K.; Markert, K.; Dhole, K. D.; Gimpel, K.; Omondi, K.; Mathewson, K.; Chiafullo, K.; Shkaruta, K.; Shridhar, K.; McDonell, K.; Richardson, K.; Reynolds, L.; Gao, L.; Zhang, L.; Dugan, L.; Qin, L.; Contreras-Ochando, L.; Morency, L.-P.; Moschella, L.; Lam, L.; Noble, L.; Schmidt, L.; He, L.; Colón, L. O.; Metz, L.; Şenel, L. K.; Bosma, M.; Sap, M.; ter Hoeve, M.; Farooqi, M.; Faruqui, M.; Mazeika, M.; Baturan, M.; Marelli, M.; Maru, M.; Quintana, M. J. R.; Tolkiehn, M.; Giulianelli, M.; Lewis, M.; Potthast, M.; Leavitt, M. L.; Hagen, M.; Schubert, M.; Baitemirova, M. O.; Arnaud, M.; McElrath, M.; Yee, M. 
A.; Cohen, M.; Gu, M.; Ivanitskiy, M.; Starritt, M.; Strube, M.; Swędrowski, M.; Bevilacqua, M.; Yasunaga, M.; Kale, M.; Cain, M.; Xu, M.; Suzgun, M.; Walker, M.; Tiwari, M.; Bansal, M.; Aminnaseri, M.; Geva, M.; Gheini, M.; T, M. V.; Peng, N.; Chi, N. A.; Lee, N.; Krakover, N. G.-A.; Cameron, N.; Roberts, N.; Doiron, N.; Martinez, N.; Nangia, N.; Deckers, N.; Muennighoff, N.; Keskar, N. S.; Iyer, N. S.; Constant, N.; Fiedel, N.; Wen, N.; Zhang, O.; Agha, O.; Elbaghdadi, O.; Levy, O.; Evans, O.; Casares, P. A. M.; Doshi, P.; Fung, P.; Liang, P. P.; Vicol, P.; Alipoormolabashi, P.; Liao, P.; Liang, P.; Chang, P.; Eckersley, P.; Htut, P. M.; Hwang, P.; Miłkowski, P.; Patil, P.; Pezeshkpour, P.; Oli, P.; Mei, Q.; Lyu, Q.; Chen, Q.; Banjade, R.; Rudolph, R. E.; Gabriel, R.; Habacker, R.; Risco, R.; Millière, R.; Garg, R.; Barnes, R.; Saurous, R. A.; Arakawa, R.; Raymaekers, R.; Frank, R.; Sikand, R.; Novak, R.; Sitelew, R.; LeBras, R.; Liu, R.; Jacobs, R.; Zhang, R.; Salakhutdinov, R.; Chi, R.; Lee, R.; Stovall, R.; Teehan, R.; Yang, R.; Singh, S.; Mohammad, S. M.; Anand, S.; Dillavou, S.; Shleifer, S.; Wiseman, S.; Gruetter, S.; Bowman, S. R.; Schoenholz, S. S.; Han, S.; Kwatra, S.; Rous, S. A.; Ghazarian, S.; Ghosh, S.; Casey, S.; Bischoff, S.; Gehrmann, S.; Schuster, S.; Sadeghi, S.; Hamdan, S.; Zhou, S.; Srivastava, S.; Shi, S.; Singh, S.; Asaadi, S.; Gu, S. S.; Pachchigar, S.; Toshniwal, S.; Upadhyay, S.; Shyamolima; Debnath; Shakeri, S.; Thormeyer, S.; Melzi, S.; Reddy, S.; Makini, S. P.; Lee, S.-H.; Torene, S.; Hatwar, S.; Dehaene, S.; Divic, S.; Ermon, S.; Biderman, S.; Lin, S.; Prasad, S.; Piantadosi, S. T.; Shieber, S. M.; Misherghi, S.; Kiritchenko, S.; Mishra, S.; Linzen, T.; Schuster, T.; Li, T.; Yu, T.; Ali, T.; Hashimoto, T.; Wu, T.-L.; Desbordes, T.; Rothschild, T.; Phan, T.; Wang, T.; Nkinyili, T.; Schick, T.; Kornev, T.; Tunduny, T.; Gerstenberg, T.; Chang, T.; Neeraj, T.; Khot, T.; Shultz, T.; Shaham, U.; Misra, V.; Demberg, V.; Nyamai, V.; Raunak, V.; Ramasesh, V.; Prabhu, V. U.; Padmakumar, V.; Srikumar, V.; Fedus, W.; Saunders, W.; Zhang, W.; Vossen, W.; Ren, X.; Tong, X.; Zhao, X.; Wu, X.; Shen, X.; Yaghoobzadeh, Y.; Lakretz, Y.; Song, Y.; Bahri, Y.; Choi, Y.; Yang, Y.; Hao, Y.; Chen, Y.; Belinkov, Y.; Hou, Y.; Hou, Y.; Bai, Y.; Seid, Z.; Zhao, Z.; Wang, Z.; Wang, Z. J.; Wang, Z.; Wu, Z. Beyond the Imitation Game: Quantifying and Extrapolating the Capabilities of Language Models. arXiv 2022.