Brainstorm AI Singapore 2024: Cutting Data Center Energy Consumption By 50%

Presenter: Tim ROSENFIELD, Co-founder and Co-CEO, Sustainable Metal Cloud
Transcript
00:00I've got a short amount of time, about five minutes here,
00:03to describe a little bit about what we do
00:06and how we fit into the AI and energy ecosystem
00:12before we have a little chat on stage.
00:14So I am Tim, co-founder, co-CEO of Sustainable Metal Cloud.
00:18I was trying to think of a way to describe
00:22what we do in a very short amount of time.
00:25And I guess I'd characterize it as we
00:28seek to solve AI's energy problem.
00:32AI has a substantial energy problem,
00:37which we'll unpack over the course of the next few slides.
00:40We started life in Australia about seven years ago;
00:44as you can tell from my accent, I'm Australian.
00:47The vision my co-founders and I had,
00:51as outsiders to the data center and AI industry,
00:57or HPC as it was known then, was this:
01:00how could we apply engineering to turn energy into knowledge
01:08more efficiently?
01:09Now, what I mean by that is, for most folks,
01:13the data center is a place that data is stored.
01:16It's now becoming a place that data is processed.
01:20These AI factories are actually large engineering challenges.
01:23And our process was, could we start afresh?
01:28Could we rethink the best way
01:31to accomplish this?
01:32So we started in Australia, where
01:33we built a 20-megawatt engineering center.
01:37The problem that we're trying to solve for,
01:39how to turn energy into knowledge as cost-effectively
01:44and as efficiently as possible, starts with heat.
01:48Heat and cooling.
01:49And that is one of the drivers of AI's energy challenge.
01:53Roll it back one step to the graph I've got on screen here,
01:57and you can see a very clear trend.
02:00And what this is showing is that over the last decade or two,
02:04we've had real substantial improvements in chip density
02:08and chip efficiency.
02:10But as we've now come to two-nanometer and three-nanometer
02:13process nodes, the efficiency gains
02:16are getting smaller with every step.
02:19And this has coincided with the rise of AI,
02:22which unfortunately uses the most energy-intensive
02:27and hottest chips out there.
02:29So behind me on the graph here, you can see in green,
02:32we have the power limits of NVIDIA's GPU series.
02:36So this is ramping up at a very quick rate.
02:40And this is driving a massive challenge in the data center.
02:44So more heat in the data center means
02:46you need more cooling to remove that heat.
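For a rough sense of scale, here is a sketch using NVIDIA's published per-GPU power limits; the 8-GPU server layout and the per-server arithmetic are illustrative assumptions, not figures from the talk.

```python
# Rough heat-load arithmetic for dense GPU servers.
# TDP values are NVIDIA's published per-GPU power limits (SXM parts);
# the 8-GPU-per-server layout is an assumption mirroring common HGX boxes.
GPU_TDP_WATTS = {
    "V100 (2017)": 300,
    "A100 (2020)": 400,
    "H100 (2022)": 700,
}

GPUS_PER_SERVER = 8  # assumption: typical HGX/DGX-style baseboard

for gpu, tdp in GPU_TDP_WATTS.items():
    server_kw = GPUS_PER_SERVER * tdp / 1000
    print(f"{gpu}: {tdp} W per GPU -> ~{server_kw:.1f} kW of GPU heat per server")
```

Every one of those watts ends up as heat the facility has to remove, so the cooling load scales directly with the chips.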
02:49And that is why our solution is quite different.
02:54So we use liquid.
02:55We've developed a liquid platform:
02:57you take the servers and you put them in.
03:00It's called single-phase immersion, a full server
03:03submerged in oil. We use the properties of oil,
03:06whose heat capacity is roughly 1,000 times greater
03:10than that of air, to remove the heat from the servers.
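As a back-of-envelope check on that 1,000x figure, here is a minimal sketch comparing the coolant flow each fluid needs to carry the same heat; the fluid properties are rough textbook values, and the 10 kW load and 10-degree temperature rise are assumptions, not numbers from the talk.

```python
# Coolant flow needed to remove a heat load: Q = rho * flow * cp * dT,
# so flow = Q / (rho * cp * dT). Property values are rough textbook figures.
Q_WATTS = 10_000   # assumed server heat load: 10 kW
DELTA_T_K = 10.0   # assumed coolant temperature rise

COOLANTS = {
    # name: (density kg/m^3, specific heat J/(kg*K))
    "air":         (1.2,   1005.0),
    "mineral oil": (850.0, 1900.0),
}

flows = {}
for name, (rho, cp) in COOLANTS.items():
    flows[name] = Q_WATTS / (rho * cp * DELTA_T_K)  # m^3/s
    print(f"{name}: {flows[name]:.4f} m^3/s to move {Q_WATTS / 1000:.0f} kW")

print(f"air needs ~{flows['air'] / flows['mineral oil']:.0f}x the volumetric flow of oil")
```

On these rough numbers air needs over a thousand times the volumetric flow of oil to move the same heat, which is the gap the 1,000x claim is pointing at.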
03:13Our system was designed to be modular,
03:17to go into any data center, and to host commonly found
03:21workloads in the AI ecosystem, like generative AI training
03:25using H100, Omniverse using L40S,
03:29and other types of platforms.
03:32In a nutshell, we wanted to solve
03:35for taking any data center, like this diagram here,
03:40and coming up with a platform that
03:43would fit into that data center, whether that
03:46was built for liquid or not.
03:48So whether it was designed 15 years ago,
03:50or whether it was designed this year.
03:52Either way, we wanted something that
03:54was agnostic to the data center that it goes into.
03:58And then on top of that, we wanted
04:00to improve the efficiency of the servers themselves.
04:03So by integrating the data center
04:06with our new type of AI factory, with an NVIDIA GPU-based server,
04:13we could achieve breakthrough efficiency,
04:15which we'll get into in more detail in the next session.
04:18Long story short, by vertically integrating this solution,
04:22we cut around 45% to 50% of the energy
04:26to run NVIDIA GPU-based workloads today
04:29compared with standard air-cooled configurations.
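As a sanity check on how savings of that size can add up, here is illustrative arithmetic; the PUE values and the fan-power fraction below are assumptions chosen for the sketch, not figures quoted in the talk.

```python
# Whole-facility energy = IT energy * PUE (power usage effectiveness).
# Immersion helps twice: server fans disappear (a slice of IT power),
# and cooling overhead shrinks (lower PUE). All values are assumptions.
IT_POWER_AIR = 100.0   # normalized IT power for the air-cooled baseline
FAN_FRACTION = 0.12    # assumed share of server power drawn by fans
PUE_AIR = 1.7          # assumed PUE for air cooling in a tropical climate
PUE_IMMERSION = 1.05   # assumed PUE for single-phase immersion

total_air = IT_POWER_AIR * PUE_AIR
total_immersion = IT_POWER_AIR * (1 - FAN_FRACTION) * PUE_IMMERSION

savings = 1 - total_immersion / total_air
print(f"air-cooled total:   {total_air:.0f}")
print(f"immersion total:    {total_immersion:.1f}")
print(f"energy saved:       {savings:.0%}")  # ~46% with these inputs
```

With those assumed inputs the savings land around 46%, in the same ballpark as the 45% to 50% figure above; the exact number depends on the baseline facility.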
04:33We've designed this to be containerized
04:35so that we can deploy it in an existing data center,
04:38or we can build greenfield data centers.
04:41What we're doing here in Singapore
04:43is we've deployed seven of these systems.
04:47We're deploying into India, we've deployed into Australia,
04:50and we're deploying into Thailand.
04:51And we're now looking to other regions around the world.
04:54And the final slide before I leave you
04:56is how does this get used?
04:58So we're actually trying to take this
05:01to as many places as possible.
05:03And one of the ways we do that is
05:05installing the infrastructure in the data center,
05:08making it an AI factory, but then making it easy to use
05:12by allowing others to consume this service
05:14as a managed service,
05:16whether it's a cooling-as-a-service solution
05:17or a GPU-as-a-service solution
05:20with partners like NVIDIA.
05:22And customers around the world are able to access this today.
