About ChemBench

Learn more about the app and the project


The goal of the project

Our goal is to better understand how good models are at chemistry. This understanding is also needed to build better models. Clearly, "being good" is a subjective term, and we want to make it more objective. Toward this goal, we have been curating questions from diverse areas of chemistry and at a variety of difficulty levels.

The questions are designed so that they can easily be used in automated model evaluations, i.e., we do not ask for free-form answers, but rather for a single number or a choice from a list of options. This allows us to automatically evaluate the answers and to provide feedback to the model developers.
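As a rough illustration of why these two answer formats are easy to check automatically, here is a minimal Python sketch. It is not the project's actual grading code: the function names, the relative tolerance for numeric answers, and the rule that a multiple-choice answer must match the set of correct options exactly are all assumptions made for this example.

```python
import math


def grade_numeric(answer: float, target: float, rel_tol: float = 0.01) -> bool:
    """Return True if a numeric answer is within a relative tolerance of the target.

    The 1% default tolerance is an assumption for this sketch, not a project setting.
    """
    return math.isclose(answer, target, rel_tol=rel_tol)


def grade_choice(selected: set[str], correct: set[str]) -> bool:
    """Return True only if the selected options match the correct options exactly."""
    return selected == correct


if __name__ == "__main__":
    print(grade_numeric(1.005, 1.0))            # within 1% of the target
    print(grade_choice({"A", "C"}, {"A", "C"}))  # exact match of selected options
    print(grade_choice({"A"}, {"A", "C"}))       # incomplete selection
```

A free-form text answer would need a human or another model to judge it, which is what the restricted formats avoid.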

However, knowing how well models perform does not yet tell us much about how good they really are. Would you be able to say if an accuracy of 90% is good or bad? To be able to answer this question, we need to know how well humans perform on the same questions. And this is where the app comes in. The sole purpose of the app is to collect answers from human experts. To allow us to make meaningful comparisons, we ask you to provide us with some information about yourself, such as your background and your experience with the questions. We will not share this information with anyone, and we will not use it for any other purpose than to better understand the answers you provide. For example, we might compare the models' answers to questions about organic chemistry with the answers from people with a background in organic chemistry.

"Rules" for answering questions

To ensure that we can compare the answers from different people, we ask you to follow a few simple rules when answering questions:

  • Do not use any external resources. This includes books, websites, and other people. It is okay to use a calculator, though.
  • Make a serious attempt to answer the question. If you do not know the answer, make an educated guess.
  • Provide us with feedback if you think that the question is unclear or if you think that none of the answers is correct. You can do this by clicking on the "bug" icon next to the question title.

How to use the app

After the first login, you will be asked to provide some information about yourself. This information will be used to better understand the answers you provide. However, we will not share your display name or email address with anyone, and we will not use them for the analysis. Hence, you can use a pseudonym if you want to remain anonymous. You can also use a throw-away email address if you do not want to use your regular email address. One example of such a service is minuteinbox.com.

After that, you can start answering questions by navigating to "Questions" in the menu. By clicking on "view", you can see the question and the possible answers.

About the leaderboard

The leaderboard shows the performance of models and users on the questions.

Of course, there are many different ways to measure performance. To compare models and users, we currently focus on the fraction of questions that are answered completely correctly. That is, the scores are computed as

$$\text{score} = \frac{\text{number of questions answered correctly}}{\text{total number of answered questions}}.$$
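To make the definition concrete, the following Python sketch computes this fraction from a list of per-question results. The function name and the encoding of results (True for completely correct, False for incorrect, None for not answered) are assumptions made for illustration and are not taken from the app's code.

```python
from typing import Optional, Sequence


def leaderboard_score(results: Sequence[Optional[bool]]) -> float:
    """Fraction of answered questions that were answered completely correctly.

    Each entry is True (completely correct), False (incorrect), or
    None (not answered). Unanswered questions are excluded from the denominator.
    """
    answered = [r for r in results if r is not None]
    if not answered:
        raise ValueError("score is undefined when no questions were answered")
    return sum(answered) / len(answered)


if __name__ == "__main__":
    # Two correct answers out of three answered questions; one question skipped.
    print(leaderboard_score([True, False, None, True]))
```

Because only answered questions enter the denominator, skipping a question does not lower the score in this sketch.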

Additionally, we have a leaderboard that compares users by the number of questions they have answered.