When we published our story on Oregon State University’s plankton imaging research last November, the headline was the science: AI-accelerated infrastructure aboard research vessels, processing terabytes of ocean data in near real-time before the ship ever reached port. But something else happened quietly in the weeks that followed. Word spread across campus about what a single Dell PowerEdge XE7745 with eight Solidigm D5-P5336 E3.S SSDs and NVIDIA RTX PRO 6000 GPUs could actually accomplish. Other departments started asking questions. Then they started making calls. Christopher Sullivan, Director of Research and Academic Computing at OSU’s College of Earth, Ocean, and Atmospheric Sciences, now wants a rack of these servers to meet the growing AI demand across the university, and the story driving that ambition goes well beyond plankton.
Oregon State has established itself as one of the most forward-thinking universities in the nation in its adoption of AI for both research and academic use. The infrastructure decisions being made on campus today, along with the formation of partnerships with companies such as Metrum AI, Dell, NVIDIA, and Solidigm, are not just academic experiments. They lay the groundwork for a new way for universities to deliver education, assess learning, and protect their students. This is the story of how that model was developed.
The Problem That Generative AI Made Worse
For decades, written assignments were central to academic evaluation. Submit a paper, show understanding, and get a grade. Generative AI has fundamentally altered that system. Now, a student can craft a polished, well-structured essay with little real engagement with the material, and even seasoned faculty can’t reliably tell if it is authentic work. The evidence of genuine understanding that universities relied on for generations has weakened.
The obvious alternative is oral evaluation. Ask students to explain their reasoning out loud, walk through their analysis, and defend their conclusions. That is hard to fake. The problem is scale. A professor teaching 200 students cannot sit across from each one and conduct a substantive oral exam. In the modern university, that constraint has effectively shelved oral assessment as a primary evaluation tool. Metrum AI was built to change that equation.
What Metrum AI Built
Metrum AI, co-founded by CEO Steen Graham and CTO Chetan Gadgil, was built around a simple conviction: AI should do real operational work, not just demonstrate potential. The company deploys multimodal AI agents that reason across video, audio, documents, and structured data for customers in industries from insurance to manufacturing. Metrum has developed a close partnership with Dell Technologies, validating its platforms against Dell’s enterprise server infrastructure across a range of GPU configurations. The academic evaluation system at Oregon State is not a pivot for Metrum; it is the same underlying capability applied to a new problem domain, with the same on-premises, human-in-the-loop design philosophy that runs through everything the company builds.
Applied to academic assessment, the platform processes recorded student video presentations using multimodal AI and returns rubric-aligned draft evaluations for faculty review. The pitch is specific: give instructors an AI partner that handles the repetitive, time-consuming extraction work so they can focus on the judgment calls that truly require human adjudication.
At a functional level, the platform carries out three operations. It extracts multimodal artifacts from submitted videos, generating timestamped audio transcripts with OpenAI Whisper and capturing slide content via visual analysis powered by Qwen3-VL-30B. It then applies instructor-designed rubrics to the extracted content, using Qwen3-30B-A3B reasoning models that run on vLLM. Finally, it presents draft evaluations with evidence pointers, linking each score to a specific transcript timestamp or slide identifier for faculty review and approval before anything reaches a student.
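To make that flow concrete, here is a minimal sketch of the three-stage pipeline in Python. The model names come from the description above; the file names, prompt wording, and the localhost vLLM endpoint are illustrative assumptions, not Metrum's actual implementation.

```python
# A minimal sketch of the three-stage pipeline, assuming the models are
# served locally by vLLM. Model names come from the article; file names,
# prompts, and the endpoint are illustrative assumptions.
import whisper                      # openai-whisper, for timestamped transcripts
from openai import OpenAI           # vLLM exposes an OpenAI-compatible API

# Stage 1: extract a timestamped transcript from the submitted video.
asr = whisper.load_model("large-v3")
result = asr.transcribe("submission.mp4")
segments = [(s["start"], s["end"], s["text"]) for s in result["segments"]]

# Stage 2 would pass slide frames to the vision model (Qwen3-VL-30B) through
# the same client; stage 3 below shows the rubric-scoring call pattern.
client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

def score_rubric_item(item: str, evidence: str) -> str:
    """Ask the reasoning model for a draft score with evidence pointers."""
    response = client.chat.completions.create(
        model="Qwen/Qwen3-30B-A3B",
        messages=[{
            "role": "user",
            "content": (
                f"Rubric item: {item}\n"
                f"Evidence (timestamped transcript excerpts):\n{evidence}\n"
                "Return a draft score and cite the timestamps that support it."
            ),
        }],
    )
    return response.choices[0].message.content

evidence = "\n".join(f"[{start:.0f}s-{end:.0f}s] {text}"
                     for start, end, text in segments)
draft = score_rubric_item("Explains the core methodology", evidence)
print(draft)  # A draft only: faculty review and approval happen downstream.
```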
That final review step is crucial. No score, comment, or piece of feedback is visible to students until an instructor has reviewed it, made any necessary modifications, and explicitly approved it. The system is built around faculty authority. The platform also operates entirely on-premises, a decision that shapes how the system functions, who trusts it, and what hardware it requires.
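In code, that approval gate might look something like the sketch below. The states, field names, and methods are invented for illustration and are not the platform's actual schema.

```python
# A sketch of the faculty-approval gate, assuming a simple status field.
# States, fields, and methods are invented for illustration only.
from dataclasses import dataclass
from enum import Enum

class Status(Enum):
    DRAFT = "draft"          # AI output exists; invisible to the student
    APPROVED = "approved"    # instructor has reviewed, edited, and signed off

@dataclass
class Evaluation:
    student_id: str
    draft_feedback: str               # AI-generated, never shown directly
    final_feedback: str | None = None
    status: Status = Status.DRAFT

    def approve(self, edited_feedback: str) -> None:
        """Instructor edits and sign-off are the only path to visibility."""
        self.final_feedback = edited_feedback
        self.status = Status.APPROVED

    def visible_to_student(self) -> str | None:
        """Students see nothing until the evaluation is explicitly approved."""
        return self.final_feedback if self.status is Status.APPROVED else None

ev = Evaluation("s-123", "Draft: strong analysis; timing evidence at 02:15.")
assert ev.visible_to_student() is None   # still hidden
ev.approve("Strong analysis. See the cash-flow discussion at 02:15.")
print(ev.visible_to_student())           # visible only after approval
```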
From a Professor’s Side Project to a Provost Mandate
Jonathan Kalodimos is an Associate Professor of Finance and the Harley and Brigitte Smith Fellow in the College of Business at Oregon State University. His background is not what you might expect from someone at the center of an AI infrastructure story. Before joining OSU, he was a financial economist at the U.S. Securities and Exchange Commission, where he served as lead economist on Dodd-Frank Act Section 954, which established rules around executive compensation clawbacks. His research on corporate governance and financial regulation has been cited in The Wall Street Journal, The New York Times, Bloomberg, and the Harvard Business Review. He also, it turns out, has a physicist’s instinct to measure things precisely.
About a year ago, Kalodimos coded a simple tool for his MBA class: an AI agent to evaluate the oral component of case study presentations. The students were impressed with the quality of feedback. He presented the project during AI Week at Oregon State. Dell took notice, connected him to Metrum AI, and a classroom experiment became something much larger.
“Once you have the tool, you can refine your teaching style and your teaching methods to leverage the strength of the tool to provide a better educational experience.”
— Jonathan Kalodimos, Associate Professor of Finance and Harley & Brigitte Smith Fellow, Oregon State University
What Kalodimos is building toward is what he calls evidence-based extraction, underpinned by what he describes as rubric engineering. This encompasses determining which features are extractable from a student presentation, aggregating those features into learning outcomes, and providing faculty with a structured view of where each student demonstrated understanding and where they fell short. “The way I explain this to skeptical students,” he said, “is if I had a very detailed checklist, and I went through your presentation checking off things you did, that’s what the system is doing. Obviously way more sophisticated than that, but it’s allowing me to see all the opportunities for the student to demonstrate that they know this material.”
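As a toy illustration of that rubric-engineering idea, a rubric can be expressed as weighted, extractable checklist items that roll up into a learning-outcome score. The outcome, items, and weights below are invented for the example, not taken from Kalodimos's course:

```python
# A toy illustration of "rubric engineering": extractable checklist items,
# each weighted, rolled up into a learning-outcome score. The outcome,
# items, and weights are invented, not from an actual OSU rubric.
rubric = {
    "learning_outcome": "Apply DCF valuation to the case company",
    "features": [
        {"item": "States the discount-rate assumption", "weight": 0.3},
        {"item": "Walks through the free-cash-flow projection", "weight": 0.4},
        {"item": "Connects the valuation to the recommendation", "weight": 0.3},
    ],
}

def aggregate(scores: dict[str, float]) -> float:
    """Roll per-feature scores (0.0 to 1.0) up into the outcome score."""
    return sum(f["weight"] * scores[f["item"]] for f in rubric["features"])

score = aggregate({
    "States the discount-rate assumption": 1.0,
    "Walks through the free-cash-flow projection": 0.5,
    "Connects the valuation to the recommendation": 1.0,
})
print(f"{rubric['learning_outcome']}: {score:.2f}")  # prints 0.80
```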
He offered two examples that illustrate what the system changes in practice. In the first, a student condensed a ten-minute presentation into five minutes, speaking in a monotone and at a pace that suggested English was his second language. His delivery obscured his comprehension entirely. “Even though I was listening carefully,” Kalodimos said, “I just couldn’t, or wouldn’t, break it down into that level of granularity to overcome the delivery element so I could focus on the actual evidence.” When he later walked through the AI-generated evidence breakdown in fifteen-second increments, it became clear the student understood the material. The delivery had been graded, not the knowledge.
The second case involved a student who built a presentation slowly, with what seemed like disjointed slides, and only pulled the argument together on the final slide. Watching live, Kalodimos had already formed a low opinion of the presentation. The system evaluated the work as a complete arc and scored it well. “I didn’t even think that would be a benefit of this type of evaluation,” he said. “It’s getting away from the time element of evaluation.”
Christopher Sullivan, Director of Research and Academic Computing, is the infrastructure lead on the deployment. His involvement sharpened when a critical compliance gap emerged in the original Metrum and Dell design.
FERPA, Data Sovereignty, and Why the Cloud Is Not the Answer
When Sullivan stepped in to build OSU’s on-premises implementation of the Metrum platform, the first thing he identified was a problem nobody had fully solved: compliance with the Family Educational Rights and Privacy Act (FERPA).
FERPA is the federal law governing student education records. It establishes strict requirements around who can access student data, under what conditions, and how it must be protected. For a system like Metrum’s, one that ingests student video submissions, generates transcripts, produces evaluations, and stores the complete history of every grading decision, FERPA compliance is not a checkbox; it’s an architectural constraint.
“We needed to be able to bring something on-premises that would meet all of my FERPA conditions,” Sullivan said, “but also have a large amount of storage space.” Cloud processing was not compatible with that requirement. Routing student video files, audio transcripts, and evaluation records through external AI APIs would mean transmitting personally identifiable student information to third-party systems outside the university’s direct control. The contractual and technical complexity of maintaining FERPA compliance in that environment, across every vendor in the chain, made it a non-starter.
There is also a practical student experience dimension. Students submitting recorded presentations are offering something personal: their voice, their face, their reasoning under pressure, sometimes in their non-native language. When they understand that their video is stored on an OSU server, processed by a model running on OSU hardware, and governed by OSU’s own data policies, the dynamic changes. Kalodimos saw this play out directly during the pilot. “The idea that this was a local model with local storage and OSU has the student’s back,” he said, “was palpable. We really need to use the institutional trust that OSU has built, protecting our students, leveraging these on-prem solutions.”
Cloud AI platforms are easy and quick to deploy, but they require institutions to place trust in a contract rather than in their own architecture. For students who are already wary of how their data is managed, that distinction can significantly influence their willingness to adopt it. On-premises deployment isn’t just about compliance; it establishes a foundation of trust.
The Pilot, the Provost, and What Comes Next at OSU
The pilot is underway. Approximately 500 students across multiple sections are submitting final projects at the close of finals week. Graded evaluations must be returned within four days of the last submission. The AI-generated reports have to be ready before faculty begin grading. “There’s a human component running in parallel,” Kalodimos noted, “but they need the report first. If it takes two days to process all of these, then the human element is even more compressed.” The pressure is real, the deadline is fixed, and the infrastructure is doing its job.
The pilot has surfaced something else worth noting. When a professor was recently promoted to an administrative role mid-term, an instructor had to step in on short notice and finish out the course. Having a consistent AI evaluation framework already in place, with defined rubrics and an established review workflow, gave that instructor a thread of continuity that otherwise would not have existed. “Having a consistent AI evaluation companion,” Kalodimos said, “is going to definitely improve the student experience” in exactly those situations when continuity of human instruction cannot be guaranteed.
The story eventually reached OSU’s Provost. In what was supposed to be a ten-minute meeting, Kalodimos presented the full stack: the Dell system, Solidigm storage performance, developer capabilities, and infrastructure benchmarks. The meeting ran for forty minutes. The Provost followed up with an email to the CIO, the CTO, and Sullivan. OSU is now planning a university-wide deployment that will make the resource available to faculty starting in the spring term, managed by a newly defined Research Computing office that sits under the Provost and the research office.
Sullivan is thinking about that deployment in terms of rack-scale infrastructure. The same XE7745 platform that anchored the plankton imaging work and is now powering the Metrum AI evaluation pipeline is the foundation he wants to scale. The goal is a rack of these servers, available to float between academic compute and research compute workloads as demand shifts. Ideally, the servers would be dedicated to the Metrum evaluation pipeline during midterm and final submission surges, and redeployed to research workloads during quieter periods of the academic calendar. “We can take machines from that set and shove them into the academic compute side for a period of time, and then bring them back and leverage them for the research compute,” Sullivan said. “We want to be able to redeploy them on the fly.”
The organic adoption is already underway. Faculty from the College of Health and the College of Engineering have independently approached Kalodimos after hearing about the project through informal channels. The platform has not been formally announced beyond the pilot. It found its audience anyway.
The Capacity Problem Behind the Grading Problem
There is a version of this story that is only about speeding up grading. That is an incomplete vision.
The larger version is about class capacity. Sullivan described a 100-level geology course, a class that OSU treats as part of its core educational mission and that every student is meant to take. It currently runs two sections of 300 students each, for a total of 600 per quarter. The instructors are at their limit. Adding sections is not feasible given current teaching loads. “I can’t have the teachers do more work,” Sullivan said. “I need to create pathways for us to either create more sections by reducing the load, or put more students in the sections we’ve got.”
Kalodimos framed the College of Business version of the problem in similar terms. Professors with sections capped at 45 students for fire-code reasons would have the option to explore large-lecture formats with breakout-room support once individualized evaluation can scale. “It’s not just about packing bodies,” he said. “It’s about maintaining quality while exploring different delivery modes.” The AI evaluation layer is what makes individualized assessment at lecture-hall scale operationally possible.
“The AI is helping us increase the numbers without changing the impact or the message or what’s being learned.”
— Christopher Sullivan, Director of Research and Academic Computing, Oregon State University
Storage Was the Missing Piece
When Sullivan assessed what it would take to bring the Metrum system on-premises at OSU with full FERPA compliance, the GPU side of the equation was already established. The Metrum and Dell reference architecture had demonstrated that the XE7745 with NVIDIA RTX PRO 6000 GPUs could handle the inference workload at scale. What remained unsolved was storage.
The XE7745 is a 4U air-cooled platform optimized for GPU density. That design is its strength, but it comes with a real constraint: drive bay count is limited. “I needed to put a lot of space into a single piece of equipment without compromising speed,” Sullivan said, “because I didn’t want to lose all the value of the GPUs and everything that XE7745 was worth. And there really weren’t a lot of large-capacity SSD solutions out there to do that in the box.”
The storage layer in a system like this carries more than the headline AI workload. Video files arrive from the student portal and need somewhere to buffer immediately. Extracted audio tracks and timestamped transcripts are stored as discrete artifacts for faculty review. Slide images and OCR output occupy their own tier. The Supabase database tracking submission metadata, draft evaluations, faculty edits, and approval records runs continuously. Model weights for Whisper, Qwen3-VL, and the reasoning model need to load quickly enough to avoid inference bottlenecks. And the full audit trail for every AI-generated draft, every faculty override, and every approval action must be retained as a queryable record for accreditation reviews, academic integrity investigations, and administrative reporting.
Every one of those workloads lives on storage. The GPU gets the credit for the AI output. Storage keeps the GPU fed and working continuously.
Sullivan’s team selected the Solidigm D5-P5336 in the E3.S form factor. The XE7745 holds eight of these drives. At 30.72 TB per drive, that is over 245 TB of flash storage in a single 4U chassis. The D5-P5336 uses QLC NAND with enterprise firmware tuned for sustained write performance and data integrity, which matters here because the system is not handling occasional bursts. During peak submission windows around finals, it simultaneously ingests videos, writes transcripts, logs evaluation output, and updates the database.
As we documented in our ocean research story covering this same hardware configuration, the Solidigm drives in RAID 10 delivered sustained read and write performance without falling behind the processing pipeline. Storage was not the bottleneck. The architecture exposed the actual workload constraints, so the team could tune them where they mattered. That validated conclusion carries directly into the academic evaluation deployment.
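For reference, the capacity arithmetic behind those figures, assuming a conventional RAID 10 mirror-and-stripe across all eight drives:

$$
8 \times 30.72\,\mathrm{TB} = 245.76\,\mathrm{TB}\ \text{raw};\qquad
\frac{245.76\,\mathrm{TB}}{2} \approx 122.9\,\mathrm{TB}\ \text{usable under RAID 10}
$$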
OSU as a Blueprint for AI-Ready Higher Education
Oregon State’s approach to AI infrastructure is deliberate and worth examining as a model for other educational institutions. Rather than deploying AI tools opportunistically through cloud APIs, the university has made a series of architectural decisions that treat AI as a durable institutional capability rather than a vendor service. Hardware is standardized around platforms that span research and academic compute workloads. Storage is on-premises, high-density, and compliant by design. Faculty retain final authority over every evaluation the system produces.
Kalodimos is explicit about wanting the platform to travel beyond OSU. “Not every university is going to be as well-resourced as us,” he said. “I want to make sure this technology is available to all universities.” That foundation is what makes the broader argument for educational equity credible. An instructor at a smaller institution with a heavier teaching load and fewer resources arguably needs this tool the most.
“I need storage, and I need that storage to be fast. AI is a dead technology without storage. It’s a data-driven system. We had the algorithms back in the 1960s and 70s. We didn’t have any data to do it because we didn’t have any storage to actually hold that data.”
— Christopher Sullivan, Director of Research and Academic Computing, Oregon State University
Sullivan frames the hardware planning question the same way he frames every infrastructure decision at OSU. Models will change. The types of input students submit will evolve. The evaluation techniques faculty want to run will grow more sophisticated. “I’m going to get a bigger fork, a bigger knife, or a bigger spoon,” he said, “but it’s still a fork, knife, and spoon on the hardware side. I’m going to be changing the models and the inputs dramatically in the years to come, and it’s more important to me right now that the hardware keeps ahead of whatever those are going to be.”
Every processed transcript, every extracted slide, every draft evaluation, every approved grade, and every audit record must live somewhere. In this system, that somewhere is 245 TB of Solidigm QLC flash on-premises within OSU’s infrastructure, doing the quiet work that makes the visible AI possible. The rack Sullivan is planning will not be the last one. The university is watching what this pilot produces, and other institutions will be watching what OSU does. That is what it means to lead in AI.