Of course, GeoS makes errors for different reasons than high-schoolers. A human being might correctly interpret the question, then apply the wrong formula, or muck up the calculation. GeoS, being a computer, will virtually always get the correct answer so long as it truly understands the question. It might misread a word, or the grammar of a question might be too alien for it to parse. Regardless, what we're really measuring here is the computer's ability to understand human communication in a form that's deliberately (pardon the pun) obtuse.
To do this, the researchers had to smash together a whole array of different software technologies. GeoS uses optical character recognition (OCR) algorithms to read the text, and custom language processing to try to understand what it reads. Geometry questions are structured to be difficult to parse, hiding important information in inferences and implications.
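To make that pipeline concrete, here is a minimal Python sketch (not GeoS's actual code) of how OCR output might be fed into simple pattern rules that turn prose into machine-readable geometric facts. The library choices (pytesseract, Pillow) and the hand-written patterns are illustrative assumptions, far cruder than the custom language processing the researchers built.

```python
# Illustrative sketch only: a toy OCR + parsing pipeline for geometry
# word problems. Assumes pytesseract and Pillow are installed; the
# pattern rules stand in for much richer language processing.
import re
from PIL import Image
import pytesseract


def read_question(image_path: str) -> str:
    """Run OCR on a scanned SAT question and return the raw text."""
    return pytesseract.image_to_string(Image.open(image_path))


def extract_relations(text: str) -> list[tuple]:
    """Pull simple geometric facts out of the prose with pattern rules."""
    relations = []
    # "AB is perpendicular to CD" -> ("perpendicular", "AB", "CD")
    for a, b in re.findall(r"\b([A-Z]{2}) is perpendicular to ([A-Z]{2})\b", text):
        relations.append(("perpendicular", a, b))
    # "AB = 5" -> ("length", "AB", 5.0)
    for seg, val in re.findall(r"\b([A-Z]{2}) ?= ?(\d+(?:\.\d+)?)", text):
        relations.append(("length", seg, float(val)))
    return relations


if __name__ == "__main__":
    sample = "In the figure, AB is perpendicular to CD and AB = 5."
    print(extract_relations(sample))
    # [('perpendicular', 'AB', 'CD'), ('length', 'AB', 5.0)]
```

The hard part, of course, is everything this sketch glosses over: real questions bury the same facts in indirect phrasing, so a handful of regular expressions would fall apart almost immediately.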
The other side of the coin is that though geometry questions are dense and hard to tease apart, they're also extremely uniform in structure and subject matter. The AI's programmers can plan for the strict design principles that go into writing the questions. They couldn't take this same programming and apply it directly to, say, calculus problems, because those use somewhat different language and mathematical symbols to describe a problem. But a good GeometryBot would be relatively easy to adapt to those few distinguishing rules, and each successive new area of competence would make the next one easier to acquire.
One intriguing implication of this research is that someday, we might have algorithms quality-checking SAT questions. We could have different AI programs intended to achieve different levels of success on typical questions, perhaps even for different reasons. Run proposed new questions through them, and their relative performance could not only weed out bad questions but also point to the source of the problem. If BadAtReadingAI and BadAtLogicAI did as expected on a question while BadAtDiagramsAI did terribly, maybe the drawing simply needs to be a little clearer.
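As a thought experiment, that quality-check workflow might look something like the sketch below. Everything here is invented for illustration: the solver names, their baseline accuracies, and the tolerance threshold are hypothetical stand-ins, not anything from the study.

```python
# Hypothetical sketch of the quality-check idea above. Solver names,
# expected pass rates, and the tolerance are invented for illustration.
from typing import Callable, Dict


def flag_question(
    question: dict,
    solvers: Dict[str, Callable[[dict], float]],
    expected: Dict[str, float],
    tolerance: float = 0.15,
) -> Dict[str, str]:
    """Run one proposed question through several deliberately limited
    solvers and report which ones deviate from their usual accuracy."""
    report = {}
    for name, solve in solvers.items():
        score = solve(question)  # fraction of trials solved, 0.0 to 1.0
        if abs(score - expected[name]) > tolerance:
            report[name] = (
                f"scored {score:.2f}, expected ~{expected[name]:.2f}; "
                "inspect the question for this failure mode"
            )
    return report


# Usage with stand-in solvers: if only the diagram-weak solver collapses,
# the drawing (not the wording) is the likely culprit.
expected = {"BadAtReadingAI": 0.40, "BadAtLogicAI": 0.45, "BadAtDiagramsAI": 0.50}
solvers = {
    "BadAtReadingAI": lambda q: 0.42,
    "BadAtLogicAI": lambda q: 0.47,
    "BadAtDiagramsAI": lambda q: 0.05,
}
print(flag_question({"id": "draft-17"}, solvers, expected))
```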
This isn't a sign of the coming AI-pocalypse, or at least not a particularly immediate sign; as dense as geometry questions might be, they're homogeneous and nowhere near as complex as something like conversational speech. But this study shows how the individual tools available to AI researchers can be assembled to create rather full-featured artificial intelligences. Things will really take off when those same researchers start snapping those amalgamations together into something far more versatile and general, something not entirely unlike a real biological mind.