How one national lab is getting its supercomputers ready for the AI age
OAK RIDGE, Tenn. — At Oak Ridge National Laboratory, the government-funded science research facility nestled between Tennessee’s Great Smoky Mountains and Cumberland Plateau that is perhaps best known for its role in the Manhattan Project, two supercomputers are currently rattling away, speedily making calculations meant to help tackle some of the biggest problems facing humanity.
You wouldn’t be able to tell from looking at them. A supercomputer called Summit mostly comprises hundreds of black cabinets filled with cords, flashing lights and powerful graphics processing units, or GPUs. The whir of tens of thousands of spinning disks in the computer’s file systems, along with the air cooling for ancillary equipment, makes the machine sound somewhat like a wind turbine, and, at least to the naked eye, the contraption doesn’t look much different from any other corporate data center. Its next-door neighbor, Frontier, is set up in a similar manner across the hall, though it’s a little quieter and its cabinets have a different design.
Yet inside those arrays of cabinets are powerful specialty chips and components capable, collectively, of training some of the largest AI models known. Frontier is currently the world’s fastest supercomputer, and Summit the seventh-fastest, according to rankings published earlier this month. Now, as the Biden administration boosts its focus on artificial intelligence and touts a new executive order for the technology, there’s growing interest in using these supercomputers to their full AI potential.
“The more computation you use, the better you do,” said Neil Thompson, a professor and the director of the FutureTech project at MIT’s Initiative on the Digital Economy. “There’s this incredibly predictive relationship between the amount of computing you use in an AI system and how well you can do.”
At the department level, the new executive order charges the Department of Energy with creating an office to coordinate AI development across the agency and its 17 national laboratories, including Oak Ridge. Critically, the order also calls on the DOE to use its computing and AI resources for foundation models that could support climate risk preparedness, national security and grid resilience, among other applications, which means increased focus on systems like Frontier and Summit.
“The executive order provided us clear direction to, first of all, leverage our capabilities to make sure that we are making advances in AI, but we’re doing it in a trustworthy and secure way,” said Ceren Susut, the DOE’s associate director of science for Advanced Scientific Computing Research. “That includes our expertise accumulated in the DOE national labs, and the workforce, of course, but also the compute capabilities that we have.”
The government’s AI specs
Supercomputers like Summit and Frontier can be measured by their performance. Often, that’s expressed in exaflops, defined as the ability to perform a billion billion (no, that isn’t a typo) floating point operations per second. Frontier clocks in at 1.194 exaflops, while Summit is a little less impressive at 148.60 petaflops. But the machines can also be measured by their number of GPUs: Summit has slightly more than 28,000, while Frontier has nearly 10,000 more.
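To put those benchmark figures in common units, here’s a quick back-of-the-envelope conversion. It’s only a sketch based on the numbers cited above, not an official DOE comparison.

```python
# Rough comparison of the two machines' benchmark figures cited above.
# 1 exaflop = 1,000 petaflops = 10**18 floating point operations per second.
FRONTIER_EXAFLOPS = 1.194
SUMMIT_PETAFLOPS = 148.60

frontier_petaflops = FRONTIER_EXAFLOPS * 1_000

print(f"Frontier: {frontier_petaflops:,.0f} petaflops")
print(f"Summit:   {SUMMIT_PETAFLOPS:,.2f} petaflops")
print(f"Frontier is roughly {frontier_petaflops / SUMMIT_PETAFLOPS:.0f}x faster on this benchmark")
```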
These chips are particularly helpful, experts explain, for the kinds of matrix algebra calculations needed to train AI models. Notably, the DOE is nearing completion of its Exascale Computing Project, an initiative across the national labs to rewrite software so it can take advantage of GPUs, and by extension AI. “Many of these applications are integrating AI techniques as one way in which they take advantage of GPUs,” Susut told FedScoop in an email.
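The connection between GPUs and AI comes down to dense linear algebra. The snippet below is a minimal illustration of the matrix multiplications that dominate neural-network training; it uses NumPy for clarity, though on a machine like Frontier the same operation would run through a GPU-backed library, and none of this is code from the lab.

```python
import numpy as np

# One layer of a neural network is, at its core, a matrix multiplication.
# Training repeats operations like this billions of times, which is why
# GPU throughput matters so much for large models.
batch, d_in, d_out = 64, 1024, 1024

x = np.random.rand(batch, d_in).astype(np.float32)   # a batch of inputs
w = np.random.rand(d_in, d_out).astype(np.float32)   # one layer's weights

y = x @ w          # forward pass through the layer
print(y.shape)     # (64, 1024)
```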
In the same vein, one of the biggest requirements for building advanced AI systems, including AI tools developed with the government’s help, has become “compute,” or computational resources. That’s why the technical needs of the most powerful supercomputers and the most demanding AI models often line up, and where systems like Frontier and Summit come in.
“I’ve read so many papers recently about how AI and [machine learning] need high bandwidth, low latency, high-performance networks around high memory nodes that have really fast processors on them,” said Bronson Messer, the director of science at the Oak Ridge Leadership Computing Facility, which houses the two supercomputers. “I’m like, wow, that’s exactly what I’ve always wanted for 20 years.”
MIT’s Thompson noted that in the field of computer vision, about 70 percent of the improvements in these systems can be attributed to increased computing power.
There are already efforts to train AI models, including large language models, at the lab. So far, researchers at Oak Ridge have used the lab’s computing resources to develop a machine learning algorithm designed to create simulations meant to boost greener flight technology; an algorithm to study potential links between different medical problems based on scans of millions of scientific publications; and datasets reflecting how molecules might be affected by light, information that could eventually be used in medical imaging and solar cell applications.
There’s also a collaboration with the National Cancer Institute focused on building a better way of tracking cancer across the country, based on a large dataset, sometimes called a corpus, of medical documents.
“We end up with something on the order of 20 to 30 billion tokens, or words, within the corpus,” said John Gounley, a computational scientist at Oak Ridge working on that project. “That’s something where you can start legitimately training a large language model on a dataset that’s that large. So that’s where the supercomputer really comes in.”
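For a sense of what counting a corpus in tokens looks like, here is a rough sketch. The directory name and the simple whitespace split are placeholders, not the project’s actual pipeline, which would use a proper subword tokenizer.

```python
from pathlib import Path

# Hypothetical directory of plain-text medical documents.
corpus_dir = Path("corpus")

total_tokens = 0
for doc in corpus_dir.glob("*.txt"):
    text = doc.read_text(encoding="utf-8", errors="ignore")
    total_tokens += len(text.split())  # crude whitespace tokenization

print(f"Approximate corpus size: {total_tokens:,} tokens")
# The cancer-registry corpus described above reaches tens of billions of
# tokens, which is the scale that pushes training onto a supercomputer.
```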
More AI initiatives at the facility will soon come online. The DOE’s support for the Summit supercomputer has been extended, in part, to propel the National Artificial Intelligence Research Resource, which aims to improve government support for AI research infrastructure. Starting next year, several projects focused on building foundation models are set to begin on Frontier, including an initiative that plans to use a foundation model focused on energy storage and a large language model built with data from a Veterans Affairs data warehouse.
How DOE pivots for the AI era
As part of the executive order, the Department of Energy is charged with building tools to mitigate AI risks, training new AI researchers, investigating biological and environmental hazards that could be caused by AI, and developing AI safety and security guidelines. But the agency doesn’t need to be pushed to invest in the technology.
This past summer, the DOE disclosed around 180 public AI use cases as part of a required inventory. The department is also working on preliminary generative AI programs, including new IT guidance and a dedicated “Discovery Zone,” a sandbox for trying out the technology. Earlier this year, the Senate held a hearing focused specifically on the DOE’s work with the technology, and the agency’s Office of Science has requested more resources to support its AI work, too.
But as the agency looks to deploy supercomputers for AI, there are challenges to consider. For one, the increased attention toward the technology marks a significant pivot for the supercomputing field, according to Paola Buitrago, the director of artificial intelligence and data at the Pittsburgh Supercomputing Center. Traditionally, research on supercomputers has focused on topics like genomics and computational astrophysics, work with different requirements than artificial intelligence, she explained. Those differences aren’t just technical; they extend to talent and the workforce as well.
“Most of the power of the impressive supercomputers could not always be leveraged completely or efficiently to service the AI computing needs,” Buitrago said in an email. “There is a mindset in the supercomputing field that doesn’t completely align with what is needed to advance AI.”
And the government only has so many resources. While there are several supercomputers distributed across some of the national labs, Oak Ridge itself can only support so much research at a time. Lawrence Berkeley National Laboratory’s supercomputer might handle several hundred projects in a year, but Messer said Frontier and Summit host fewer projects than machines at other labs because those projects tend to run significantly longer.
There’s also more demand for supercomputing facilities than supply. Only a fraction of the projects proposed to Oak Ridge are accepted. And while training foundation models is incredibly computationally demanding, and only the largest supercomputers can support developing them, building these systems is just one of several priorities the agency must weigh.
“DOE is actively considering these ideas and must also balance the use of our supercomputers across a range of high-priority mission applications,” said Susut, the DOE supercomputer expert. “Our supercomputers are open to the research community through merit-based competitive allocation programs, and we have a wide diversity of users.”
Even as the Department of Energy plans potential successors to Frontier, MIT’s Thompson noted that there are still other hurdles ahead.
For one, there’s a tradeoff between the flexibility of these computers and their efficiency, especially as the agency seeks even greater performance. Supercomputers, of course, are extremely expensive systems, and their costs aren’t dropping as fast as they used to. They also take time to build. At Oak Ridge, plans for a new computer, one with AI as a key area of focus, are already in the works. But the machine isn’t expected to come online until 2027.
“The reality is that the U.S. private sector has led research in AI starting in the past few years, as they have the data, the computing capacity and the talent,” Buitrago said. “Whether or not that continues to be the case depends on how much the government prioritizes AI and its needs. To [an extent], some may say the government is slowly catching up.”