OpenAI open-sources HealthBench, a benchmark of 5,000 realistic health conversations built with physicians from 60 countries.
On May 12th, OpenAI released HealthBench, an open-source benchmark for evaluating large language models in healthcare. Unlike earlier test sets, it consists of 5,000 realistic health conversations created in collaboration with 262 physicians spanning 26 medical specialties and practice experience in 60 countries/regions, which substantially raises the benchmark's difficulty, authenticity, and coverage. It also evaluates multi-turn dialogues rather than simple question-and-answer or multiple-choice formats. According to the published results, model performance in the healthcare domain has improved markedly: from 16% for GPT-3.5 Turbo to 32% for GPT-4o, and further to 60% for o3. The gains in small models are especially notable: GPT-4.1 nano not only surpasses GPT-4o in performance but is also 25 times cheaper.
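To make the percentage scores concrete, here is a minimal sketch of rubric-based grading in the spirit of HealthBench: each conversation comes with physician-written criteria carrying positive or negative point values, a grader marks which criteria the model's reply satisfies, the per-example score is the clipped fraction of achievable points earned, and the headline number is the average across examples. The field names and toy rubric below are illustrative assumptions, not the released data format.

```python
from typing import Dict, List


def score_example(criteria: List[Dict], met: List[bool]) -> float:
    """Score one conversation as the fraction of achievable rubric points earned.

    `criteria`: rubric items with a `points` value (positive for desirable
    behavior, negative for harmful or unhelpful behavior). `met`: which
    criteria the grader judged the model response to satisfy.
    Structure is hypothetical, for illustration only.
    """
    max_points = sum(c["points"] for c in criteria if c["points"] > 0)
    earned = sum(c["points"] for c, m in zip(criteria, met) if m)
    # Clip to [0, 1] so heavily penalized responses bottom out at zero.
    return min(max(earned / max_points, 0.0), 1.0)


def overall_score(per_example_scores: List[float]) -> float:
    """Benchmark score: mean of per-example scores, reported as a percentage."""
    return 100.0 * sum(per_example_scores) / len(per_example_scores)


# Toy usage: one conversation graded against three rubric items.
rubric = [
    {"criterion": "advises seeking emergency care", "points": 7},
    {"criterion": "asks about symptom duration", "points": 3},
    {"criterion": "gives a specific prescription dose unprompted", "points": -5},
]
print(score_example(rubric, met=[True, False, False]))  # 0.7
print(overall_score([0.7, 0.5, 0.6]))                   # 60.0
```

Under this scheme, a score of 60% means the model earned, on average, 60% of the rubric points physicians deemed achievable per conversation, not that it answered 60% of questions correctly.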