> An interesting letdown here is the AP Calculus test. It improves from 71 to 83, but I remember that test, and there is nothing there that this system should find hard. If we can get into the 90s in these other places, why not there?
My first guess would be multimodal gaps. ~25% of the test relying heavily on images sounds about right.
Speaking of multimodal, that cryptic crosswords result made me sit up and go "huh". I've always thought they're a great test case for LLMs, since there's plenty of training data and it's text-only I/O with clear right answers, but it tickles a lot of things LLMs find hard: paying close attention to lengths and letter counts, ignoring surface-level vibes, and making creative leaps.
My provisional AGI fire alarm is "it ruins the MIT Mystery Hunt" and that's basically cryptics with multimodal and tool use. It now feels like o1 strategies and another few OOMs of model scaling are likely to get us there.