ChatGPT Put to the Test Against Students—With Concerning Results


Graduate students at Harvard outperformed a ChatGPT model by more than two letter grades in a study conducted by researchers at the university.

The researchers expected that OpenAI's chatbot would "perform similarly to doctoral students on lower cognitive levels," hypothesizing that ChatGPT would be able to memorize material sufficiently while struggling with critical-thinking problems.

However, ChatGPT was significantly outperformed by the students because the model struggled with "remember" and "apply" tasks, although the researchers were able to improve ChatGPT's performance through prompting.

“We found a striking deficit in ChatGPT’s ability to interpret scientific graphs and raw data in both short-answer and multiple-choice questions, even when using a version specifically designed for image interpretation,” the researchers wrote.

The researchers conducted the study with students from Harvard’s Principles of Molecular Biology course, a 200-level class that spans the full semester.

Over the course of the study, the students were expected to maintain a minimum grade of 80 percent, which is a passing grade for doctoral students.

The AI’s responses, meanwhile, were produced using GPT-4o, which was released by OpenAI in May 2024.

To ensure that the students hadn't used artificial intelligence themselves, the researchers drew on out-of-class assignments from 2022, before generative AI was widely available and adopted.

The Findings

Doctoral students outperformed ChatGPT at every level.

The chatbot did well on "remember" questions, but significantly worse than the students. The researchers noted that these questions are not meant to be challenging but are intended to encourage students to summarize techniques.

Students outperformed ChatGPT 98 percent to 82 percent.

Meanwhile, students outperformed ChatGPT significantly on long-answer design questions. The students also outperformed ChatGPT on fill-in-the-blank questions.

ChatGPT was particularly poor at "understand," "apply" and "analyze" questions, earning a 66 percent average compared to the doctoral students' 87 percent.

ChatGPT would have "failed." According to the researchers, the poor results were "largely driven by the algorithm's markedly poor performance on the 'apply' level, which refers to identifying, rationalizing and describing experimental controls that students had previously learned through their coursework."

‘Is this really surprising?’

Commentators on Reddit’s r/science forum were not shocked.

“Anyone who has spent time using [large language models] should know that they are still a long way from being as good as an experienced human,” a critic wrote.

“Even for really focused tasks like coding you need to be very attentive in watching out for hallucinations or bad practices in the code.”

Another contributor asked, “Is this really surprising? The only people claiming that LLMs operate at the ‘PhD level’ are LLM marketers. They constantly fail to solve introductory physics and chemistry questions, so no doubt research level biology is beyond them.”

However, several commenters pointed out that the ChatGPT model used in the experiment was outdated.

“I think it’s critical to point out that when this study was done, LLMs like ChatGPT were nowhere near where they are now,” an individual posted.

"As someone who uses LLMs daily and runs a significant research group, we have found that the difference between now and even just one year ago is an order of magnitude. It can solve many science and engineering problems without significant prompt[ing]."

Newsweek has reached out to the researchers and OpenAI for comment via email.
