OpenAI’s blog post reports that GPT-5 surpasses older models on several coding benchmarks, including SWE-Bench Verified (74.9 percent), SWE-Lancer (55 percent), and Aider Polyglot (88 percent). These tests assess bug fixing, freelance-style coding tasks, and code editing across multiple programming languages. During a press briefing, OpenAI’s Yann Dubois asked GPT-5 to create an interactive web app for learning French; the model promptly delivered one with daily progress tracking, activities such as flashcards and quizzes, and an engaging theme. Michelle Pokrass, a post-training lead at OpenAI, says GPT-5 shines as a coding collaborator and in agentic tasks, executing long chains of tool calls, following complex instructions, and explaining its actions thoroughly.
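The agentic pattern Pokrass describes, long chains of tool calls interleaved with explanation, corresponds to the tool-calling interface in OpenAI’s existing API. The snippet below is a minimal sketch of that pattern in Python, assuming GPT-5 is exposed under the API identifier “gpt-5” and using a hypothetical run_tests tool; it illustrates the calling convention rather than OpenAI’s own setup.

```python
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

# Hypothetical tool the model could call while acting as a coding collaborator.
tools = [
    {
        "type": "function",
        "function": {
            "name": "run_tests",  # illustrative name, not an OpenAI-provided tool
            "description": "Run the project's test suite and return the output.",
            "parameters": {"type": "object", "properties": {}},
        },
    }
]

response = client.chat.completions.create(
    model="gpt-5",  # assumes GPT-5 is available under this identifier
    messages=[
        {
            "role": "user",
            "content": "Fix the failing date-parsing test and explain each change you make.",
        }
    ],
    tools=tools,
)

message = response.choices[0].message
if message.tool_calls:
    # In a full agentic loop, each tool call would be executed and its result
    # appended to the conversation before asking the model to continue.
    for call in message.tool_calls:
        print(call.function.name, call.function.arguments)
else:
    print(message.content)
```

In practice the developer runs each requested tool, feeds the output back as a tool message, and repeats until the model returns a final answer; that loop is what “executing long chains of tool calls” refers to.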
OpenAI claims that GPT-5 excels at answering health-related questions, outperforming previous versions on HealthBench, HealthBench Hard, and HealthBench Consensus, according to the system card, a technical document describing the model’s capabilities. The card highlights a 25.5 percent score on HealthBench Hard, a physician-validated benchmark, though o3 scored 31.6 percent on the same test. Pokrass points to reduced hallucination rates, hallucination being the common AI failure of generating false information, and Alex Beutel, a safety research lead, notes decreased deception rates. OpenAI says it has worked to curb GPT-5-thinking’s tendency to deceive, though improvements are ongoing, and the model is trained to fail gracefully when given unsolvable tasks. Without web browsing, GPT-5’s hallucination rate is 26 percent lower than GPT-4o’s and 65 percent lower than o3’s.
For dual-use prompts, those with both legitimate and potentially harmful applications, GPT-5 offers “safe completions”: answers designed to be as helpful as possible while staying within safety limits. OpenAI conducted more than 5,000 hours of red teaming and external testing to probe the model’s robustness. The company reports nearly 700 million weekly active ChatGPT users, 5 million business users, and 4 million developers using its API. Nick Turley, head of ChatGPT, is optimistic about the model’s reception, noting its appeal to a broad audience.