
GPT-4 didn't ace the bar exam after all, MIT research suggests — it didn't even break the 70th percentile

🌈 Abstract

The article discusses a new study that suggests OpenAI's GPT-4 model did not actually score in the top 10% on the bar exam, as previously claimed. The study found that GPT-4 scored in the 69th percentile of all test-takers and in the 48th percentile of those taking the test for the first time. The model also performed poorly on the essay-writing section of the exam, landing in the 15th percentile of first-time test-takers. The article highlights the importance of carefully evaluating AI systems before using them in legal settings.

🙋 Q&A

[01] Findings of the New Study

1. What were the key findings of the new study on GPT-4's performance on the bar exam?

  • The study found that GPT-4 did not actually score in the top 10% on the bar exam, as previously claimed by OpenAI.
  • Instead, GPT-4 scored in the 69th percentile of all test-takers and in the 48th percentile of those taking the test for the first time.
  • The model performed particularly poorly on the essay-writing section of the exam, landing in the 15th percentile of first-time test-takers.

2. How did the study's findings differ from the original claims made by OpenAI?

  • OpenAI had claimed that GPT-4 scored in the top 10% on the bar exam, but the new study found that this was only true when compared to repeat test-takers, a much lower-scoring group.
  • When compared to all test-takers or first-time test-takers, GPT-4's performance was much lower, falling outside the top 10% and even below average in some areas, as the sketch below illustrates.
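
The gap between the two claims comes down to which score distribution the same score is ranked against. The short Python sketch below illustrates the effect with entirely hypothetical numbers (the score of 297 and the group means and standard deviations are illustrative assumptions, not data from either study): one fixed score sits near the top of a lower-scoring pool of repeat test-takers but only around the middle of a first-time test-taker pool.

```python
# Minimal sketch: the same score maps to different percentiles depending on the
# comparison group. All numbers below are hypothetical illustrations, not the
# actual bar exam data used by OpenAI or the new study.
from statistics import NormalDist

score = 297  # hypothetical scaled score

# Hypothetical (mean, standard deviation) score distributions; repeat
# test-takers tend to score lower than first-time test-takers.
groups = {
    "repeat test-takers": NormalDist(mu=260, sigma=30),
    "all test-takers": NormalDist(mu=280, sigma=30),
    "first-time test-takers": NormalDist(mu=295, sigma=28),
}

for name, dist in groups.items():
    percentile = 100 * dist.cdf(score)  # share of the group scoring below
    print(f"vs. {name}: {percentile:.0f}th percentile")
```

With these made-up distributions, the same score of 297 prints as roughly the 89th percentile against repeat test-takers, the 71st against all test-takers, and the 53rd against first-time test-takers, mirroring the pattern the study describes without reproducing its actual figures.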

3. What were the methodological issues identified in the original study's grading of the MPT and MEE sections?

  • The original study did not use the essay-grading guidelines set by the National Conference of Bar Examiners, which administers the bar exam.
  • Instead, the researchers simply compared answers to "good answers" from the state of Maryland, which the new study author, Martínez, said was a significant issue.

[02] Implications and Recommendations

1. What were the implications of the study's findings regarding the use of AI in legal settings?

  • The study's findings suggest that current AI systems, including GPT-4, should be carefully evaluated before being used in legal settings, to avoid "unintentionally harmful or catastrophic" consequences.
  • The poor performance of GPT-4 on the essay-writing section, which is seen as the closest proxy to the tasks performed by a practicing lawyer, indicates that large language models may still struggle with tasks that more closely resemble real-world legal work.

2. What recommendation did the study author, Martínez, make regarding the use of AI in legal settings?

  • Martínez recommended that AI systems be carefully evaluated before being used in legal settings, to ensure they are not used in an "unintentionally harmful or catastrophic manner."