There’s a New King of Chatbots, and It’s Not ChatGPT
If you ask the general public which AI model is the best, chances are most people will answer ChatGPT. While there are many players on the scene in 2024, OpenAI’s chatbot is the one that has truly broken through and introduced powerful generative AI to the masses. And, until now, ChatGPT’s large language model (LLM), GPT, has consistently been a leader among its peers, from the advent of GPT-3.5 to GPT-4 and, currently, GPT-4 Turbo.
But that appears to be changing: this week, Claude 3 Opus, Anthropic’s flagship LLM, beat GPT-4 for the first time in Chatbot Arena, prompting app developer Nick Dobos to declare, “The King is Dead.” As of this writing, Claude still has an edge over GPT: Claude 3 Opus has an Arena Elo rating of 1253, while GPT-4-1106-preview has a rating of 1251, followed by GPT-4-0125-preview with a rating of 1248.
Strictly speaking, those scores are close enough that Chatbot Arena puts all three of these LLMs in a tie for first place, with Claude 3 Opus holding a slight edge.
Anthropic’s other LLMs are also performing well. Claude 3 Sonnet ranks fifth on the list, just below Google’s Gemini Pro (both technically rank fourth), while Claude 3 Haiku, Anthropic’s lower-end LLM built for efficient processing, ranks just below GPT-4’s 0613 version.
How Chatbot Arena evaluates LLMs
To rank the different LLMs currently available, Chatbot Arena asks users to enter a prompt and judge how two different, unnamed models respond. Users can keep chatting to probe the differences between the two until they decide which model performs better. Because users don’t know which models they’re comparing (it could be Claude versus ChatGPT, Gemini versus Meta’s Llama, etc.), the test eliminates any bias associated with brand preference.
However, unlike other types of benchmarking, there are no fixed criteria by which users evaluate the anonymous models. Users simply decide for themselves which LLM works best, based on whatever metrics they happen to care about. As AI researcher Simon Willison told Ars Technica, much of what makes one LLM seem better than another in users’ eyes comes down to “vibes” more than anything else. If you like the way Claude answers better than ChatGPT, that’s all that really matters.
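For a sense of how those head-to-head votes become the Elo scores on the leaderboard, here is a minimal sketch of a standard Elo update in Python. The K-factor, the zero-sum update, and the example ratings are illustrative assumptions; Chatbot Arena’s actual rating computation may differ in its parameters and statistical details.

```python
# Minimal sketch of an Elo-style rating update after one pairwise vote.
# The K-factor and starting ratings are illustrative assumptions;
# Chatbot Arena's real computation may use different parameters and
# extra statistical machinery on top of raw Elo.

def expected_score(rating_a: float, rating_b: float) -> float:
    """Probability that model A beats model B under the Elo model."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400.0))

def elo_update(rating_a: float, rating_b: float, a_won: bool,
               k: float = 32.0) -> tuple[float, float]:
    """Return updated (rating_a, rating_b) after one head-to-head vote."""
    e_a = expected_score(rating_a, rating_b)
    score_a = 1.0 if a_won else 0.0
    new_a = rating_a + k * (score_a - e_a)
    new_b = rating_b - k * (score_a - e_a)  # zero-sum update
    return new_a, new_b

# Example with ratings like the ones above: two models only a couple of
# points apart are near coin flips, so one win is worth roughly K/2.
print(elo_update(1253.0, 1251.0, a_won=True))  # -> (~1268.9, ~1235.1)
```

Aggregated over hundreds of thousands of such votes, updates like this settle each model into a stable rating, which is why a two-point gap at the top of the leaderboard amounts to a statistical tie.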
Above all, it’s a testament to how capable these LLMs have become. If you had run the same test a few years ago, you would likely have needed more standardized data to determine which LLM was stronger, whether that meant speed, accuracy, or consistency. Now, Claude, ChatGPT, and Gemini have all become so good that they’re almost interchangeable, at least for general-purpose generative AI use.
While it’s impressive that Claude beat OpenAI’s LLM for the first time, it’s perhaps even more impressive that GPT-4 held on this long. The LLM itself is a year old, excluding iterative updates such as GPT-4 Turbo, while Claude 3 launched just this month. Who knows what will happen when OpenAI releases GPT-5, which, according to at least one anonymous CEO, is “…really good, even substantially better.” For now, though, we have several generative AI models on the scene, each roughly as effective as the next.
Chatbot Arena has collected over 400,000 votes to rank these LLMs. You can try the test yourself and add your own votes to the leaderboard.