Discussion about this post

Jeffrey Soreff

"I would like to see a ‘yes and’ approach to what evaluations are run here, with cooperation between OpenAI and Anthropic (and ideally Google and others), where all labs run all the tests that any lab runs. This would give us a relatively robust set of tests, and also give us comparisons."

Generally agreed, with one caveat: the more organizations handle the tests, the higher the probability that the tests leak into the training data for the next model, invalidating them.
