SimpleQA, OpenAI's test to measure hallucinations: “questions had to induce hallucinations from either GPT-4o or GPT-3.5”
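
The paper grades each response as correct, incorrect, or not attempted. A minimal sketch of how the headline numbers could then be computed — the grades here are hypothetical, and the harmonic-mean F-score is one of the summary statistics the paper proposes for trading off answering against abstaining:

```python
# Hypothetical grades; in SimpleQA these are assigned by a grader model.
grades = ["correct", "incorrect", "not_attempted", "correct", "incorrect"]

correct = grades.count("correct")
attempted = correct + grades.count("incorrect")

overall = correct / len(grades)        # correct over all questions
given_attempted = correct / attempted  # correct, counting only attempts
# Harmonic mean: rewards answering a lot AND being right when you do.
f_score = 2 * overall * given_attempted / (overall + given_attempted)
print(f"overall={overall:.2f}, given_attempted={given_attempted:.2f}, f={f_score:.2f}")
```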

Link. “We hired AI trainers to browse the web and create short, fact-seeking questions and corresponding answers”

I wonder if they used Amazon Mechanical Turk

“… there is a lot of room to improve the calibration of large language models in terms of stated confidence”
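
One way to quantify that: bin answers by the model's stated confidence, then compare the average stated confidence in each bin to the empirical accuracy in that bin (expected calibration error). A minimal sketch with hypothetical records — the paper's analysis is along these lines, but the specific bin width below is my assumption:

```python
from collections import defaultdict

# Hypothetical data: (stated confidence in %, was the answer graded correct?)
records = [
    (95, True), (90, True), (85, False), (80, True),
    (70, False), (60, True), (55, False), (40, False),
]

def expected_calibration_error(records, bin_width=25):
    """Bin by stated confidence; compare mean confidence to accuracy per bin."""
    bins = defaultdict(list)
    for conf, correct in records:
        key = min(conf // bin_width, 100 // bin_width - 1)  # clamp 100% into top bin
        bins[key].append((conf, correct))
    ece, n = 0.0, len(records)
    for items in bins.values():
        avg_conf = sum(c for c, _ in items) / len(items) / 100
        accuracy = sum(ok for _, ok in items) / len(items)
        ece += len(items) / n * abs(avg_conf - accuracy)
    return ece

print(f"ECE: {expected_calibration_error(records):.3f}")  # 0 = perfectly calibrated
```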