The Studies We Have Not Run

Stanford just reviewed more than a thousand papers on AI in K-12 and found only twenty with rigorous causal evidence. None of them were conducted in an American classroom. The gap is not academic. It is the same gap teachers face every day.

April 28, 2026 · 6 min read · Koan Team

This month, Stanford's SCALE Initiative released a report that should quietly rearrange a national conversation. "The Evidence Base on AI in K-12: A 2026 Review" combed through more than 800 academic studies, a pool expanded to over 1,100 papers in its working repository, looking for one specific thing: rigorous causal evidence about how AI tools actually affect students and educators.¹

They found twenty.

Not twenty thousand. Twenty. Out of more than a thousand papers in the field, only twenty meet the standard of convincingly showing that an AI tool caused an outcome, rather than simply correlating with one. Most of the twenty look at math. None of them, the report notes, study student AI use inside an American K-12 classroom.¹

Read that once more. After two years of headlines, hundreds of pilots, billions of dollars of investment, and 134 state bills introduced this year alone,² we have no high-quality causal evidence of what happens when American schoolchildren use AI in their actual schools.

The Stanford team is not saying nothing works. They are saying we do not yet know. The studies that exist tend to be short. They focus on performance during a task. They tell us very little about what happens after the tool is put away. Did the student remember the concept? Could they apply it independently? Did their relationship with effort and confusion change at all? On these questions, the literature is nearly silent.

Why the Evidence Is Thin

The reason there are so few causal studies is not that researchers are lazy. It is that the things that actually matter in a classroom are extraordinarily hard to measure. A randomized controlled trial can isolate a single variable. But learning is not a variable. It is an unfolding process that takes weeks or months, that involves false starts and revisions and long quiet moments of breakthrough, and that depends on relationships, mood, attention, prior knowledge, and a hundred other things a research design cannot easily hold constant.

So the field defaults to what it can measure. Performance during a task. Score on a test. Time to completion. These are real signals, but they are downstream of the phenomenon we actually care about. They tell us about the surface of learning, not its substance.

That gap, between what we can easily study and what actually matters, is the same gap teachers face every day in their own classrooms.

The Teacher's Version of the Problem

Consider what a typical teacher knows about a typical student. They have the score on the last quiz. They have the essay that was turned in. They have a vague impression built from a few interactions in a hallway or during a discussion. They do not have a record of how the student arrived at any of it.

Did the student labor over the essay or generate it in three minutes? Did they pause halfway through, reconsider, and begin again, or did they push through without thinking? Did the right answer emerge from understanding or from copying? On these questions, even the most attentive teacher is mostly guessing.

The Stanford researchers face the same problem at scale. They want to know whether AI is helping students learn. But learning is invisible. What is visible is performance. So the field studies performance, and the studies pile up, and the actual question gets no closer to an answer.

A Different Kind of Evidence

Here is what the future could look like, if we are willing to redesign the surface of learning itself.

Imagine a classroom where every essay carries with it a record of how it was built. Not as surveillance, but as a normal artifact of the work. The teacher can see the four drafts, the moments of revision, the long pauses, the questions the student asked before they started writing. The researcher can see the same thing. The student, weeks later, can see how their own thinking evolved.

In such a classroom, the gap between what we can study and what we actually care about begins to close. Performance on a final task is no longer the only signal. The process becomes the evidence. And the questions Stanford's reviewers worried we cannot yet answer, about whether AI builds genuine capacity or only inflates short-term metrics, become tractable.

This is the project we are working on at Koan. The WorkHub captures the texture of student work as it happens. Every revision, every pause, every shift in reasoning. When Aidan, our AI tutor, helps a student think through a problem, the teacher can see exactly how the conversation unfolded. The record is not a dashboard summary. It is the thing itself.

We do not pretend this solves the rigorous causal study problem on its own. It does not. But it changes what is available to study, and what is available to teach with. A classroom where process is visible is a classroom where evidence is possible.

The Patience the Field Will Need

The most honest thing in the Stanford report is its restraint. It does not declare that AI is good or bad for students. It says, with care, that we mostly do not know, and that the question deserves better evidence than we currently have. That is the kind of intellectual honesty the AI-in-education conversation has been missing.

It will take years to build the causal studies the field needs. But the precondition for those studies is the same precondition for better teaching: making the act of learning visible. When we can see the process, we can study it. When we can study it, we can teach it better.

The 134 state bills, the literacy curricula, the dashboards and detection tools, the policies and pilots, all proceed as if we know what we are doing. The Stanford review quietly suggests we do not, yet.

If we have made AI nearly universal in schools and still cannot rigorously say what it does to students, what is the missing data, and who is responsible for collecting it?

References

Sources cited in order of appearance.
