Design Guide

The part of the AI-rack power and thermal problem that belongs to the network team, not facilities

When an AI cluster's power and cooling come up, the conversation defaults to facilities: rack density, liquid cooling for the GPUs, the megawatts the building needs. None of that is yours. But there is a distinct slice of the thermal problem that lands on the network team, and if you do not own it deliberately, it becomes the thing that quietly degrades your fabric.

This guide stays firmly in that lane.


Key Takeaways

  • At 800G and beyond, optical modules draw 14 to 20W per port and can consume more total power than the switching ASIC in a fully populated high-radix switch.
  • A hot, throttling optic degrades quietly and inflates tail latency across synchronized collectives, making thermal margin a reliability concern for AI fabrics.
  • The network team owns four distinct areas of the power and thermal problem: optics power budgets, switch placement and airflow, transceiver heat management, and LPO/CPO roadmap decisions.
  • Linear pluggable optics (LPO) and co-packaged optics (CPO) are the industry's responses to optics power draw, and both should factor into multi-year fabric design decisions.

Bandwidth Is No Longer the Only Constraint

At 800G and beyond, the optics that deliver bandwidth consume enough power and shed enough heat that they shape the design as much as the bandwidth does. An 800G module commonly draws 14 to 20W per port, so the network team is no longer just provisioning capacity; it is managing a heat and power budget that the optics dominate.

The Four Things the Network Team Owns

The network team's slice of the power and thermal problem breaks into four areas it controls directly, distinct from HVAC and facility power engineering.

Optics Power Budgets

Account for optics power explicitly at design time, on the order of 14 to 20W per 800G port. Reach drives power, so match each transceiver to the actual link rather than over-specifying "to be safe."
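The budgeting step can be sketched in a few lines. This is a minimal illustration, not a sizing tool: the reach classes and the wattage assigned to each are hypothetical values picked inside the 14 to 20W range, and the 64-port switch is an assumed example. Substitute datasheet figures for the modules you are actually evaluating.

```python
# Hypothetical per-module draw by reach class, in watts. These values are
# illustrative assumptions inside the 14-20 W range, not datasheet figures.
ASSUMED_WATTS_BY_REACH = {
    "short": 14.0,
    "medium": 16.0,
    "long": 20.0,
}


def optics_power_watts(links_by_reach):
    """Sum optics power for one switch from a {reach_class: port_count} map."""
    return sum(
        ASSUMED_WATTS_BY_REACH[reach] * count
        for reach, count in links_by_reach.items()
    )


# The same assumed 64-port switch with two transceiver choices.
matched = optics_power_watts({"short": 48, "medium": 12, "long": 4})
over_specified = optics_power_watts({"long": 64})

print(f"matched to links: {matched:.0f} W")
print(f"all long-reach:   {over_specified:.0f} W")
print(f"cost of 'to be safe': {over_specified - matched:.0f} W per switch")
```

Under these assumed numbers, the gap between the two choices is the per-switch cost of over-specifying reach, and it multiplies by every switch in the fabric.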

Switch Placement and Airflow

Match airflow direction to the aisle, avoid mounting switches where they ingest neighbors' exhaust, and treat density as a thermal decision, not just a port-count one.

Transceiver Heat

A transceiver near its thermal limit can throttle or fault. In a lossless AI fabric, an intermittently degrading link is worse than a clean failure because it silently raises tail latency.
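One way to catch this before it becomes a fault is to track each module's margin to its high-temperature alarm threshold. The sketch below assumes you can already export per-module temperature and alarm threshold from the switch's digital diagnostics; how you collect them is platform-specific and not shown. The port names, readings, and the 10 degree C minimum margin are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ModuleReading:
    port: str            # hypothetical port label
    temp_c: float        # current module temperature
    high_alarm_c: float  # module's high-temperature alarm threshold

    @property
    def margin_c(self) -> float:
        """Degrees remaining before the high-temperature alarm."""
        return self.high_alarm_c - self.temp_c


def modules_low_on_margin(readings, min_margin_c=10.0):
    """Return modules below the margin floor, tightest margin first."""
    flagged = [r for r in readings if r.margin_c < min_margin_c]
    return sorted(flagged, key=lambda r: r.margin_c)


# Made-up sample data for illustration only.
readings = [
    ModuleReading("eth1/1", 52.0, 75.0),
    ModuleReading("eth1/2", 68.5, 75.0),
    ModuleReading("eth1/3", 71.0, 75.0),
]

hot = modules_low_on_margin(readings)
for r in hot:
    print(f"{r.port}: {r.margin_c:.1f} C of margin")
```

Sorting by margin rather than by absolute temperature keeps the report meaningful when modules in the same switch carry different alarm thresholds.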

LPO and CPO

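Linear pluggable optics (LPO) and co-packaged optics (CPO) are the industry's responses to optics power draw. LPO drops the DSP from the module to cut power while staying pluggable; CPO integrates the optics into the switch package for larger savings at the cost of serviceability. Power efficiency is now a first-class optics selection criterion.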

Questions to Ask Before Committing to an 800G Design

Bring these to any 800G design review. All of them sit squarely in the networking lane.

Quantify per-port and fully populated power

Establish the per-port optics draw (14 to 20W for an 800G module) and the total switch power when fully populated versus half-populated, so density planning starts from the real number.
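The comparison can be laid out as a small table. In this sketch the 500 W base draw (ASIC, fans, and everything else that is not optics) and the 64-port count are placeholder assumptions; only the 14 to 20W per-port range comes from this guide.

```python
def switch_power_watts(base_watts, ports_populated, watts_per_port):
    """Total switch draw: non-optics base plus populated optics."""
    return base_watts + ports_populated * watts_per_port


def population_report(base_watts, total_ports, low_w=14.0, high_w=20.0):
    """Return {label: (low_estimate_w, high_estimate_w)} for half and full."""
    report = {}
    for label, ports in (("half", total_ports // 2), ("full", total_ports)):
        report[label] = (
            switch_power_watts(base_watts, ports, low_w),
            switch_power_watts(base_watts, ports, high_w),
        )
    return report


# Placeholder inputs: replace with the platform's published figures.
report = population_report(base_watts=500.0, total_ports=64)
for label, (low, high) in report.items():
    print(f"{label:>4} populated: {low:.0f} to {high:.0f} W")
```

Carrying both ends of the per-port range through the calculation keeps the uncertainty visible instead of hiding it behind a single midpoint.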

Confirm the rack can cool the planned density

Given the optics power, determine the density the rack can actually cool and reconcile it with the port count you are planning.
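The reconciliation reduces to simple arithmetic once facilities gives you a cooling figure. The sketch below assumes a per-switch cooling allowance in watts is available; that allowance, the base draw, and the port count are all hypothetical inputs.

```python
def max_populated_ports(cooling_budget_watts, base_watts, watts_per_port, total_ports):
    """Largest number of populated ports that stays within the cooling budget."""
    headroom = cooling_budget_watts - base_watts
    if headroom <= 0:
        return 0  # the chassis alone already exceeds the allowance
    return min(total_ports, int(headroom // watts_per_port))


# Hypothetical case: 1500 W allowance, 500 W base draw, worst-case 20 W optics.
coolable = max_populated_ports(
    cooling_budget_watts=1500.0,
    base_watts=500.0,
    watts_per_port=20.0,
    total_ports=64,
)
print(f"coolable ports: {coolable} of 64 planned")
```

If the coolable count comes out below the planned port count, that gap is the conversation to have with facilities before the design is committed.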

Verify airflow direction and placement

Check that switch airflow direction is correct for the aisle and that the switch is not ingesting exhaust from high-power neighbors.
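The direction check is simple enough to automate against an inventory export. The labels below are hypothetical; vendors name their airflow options differently, so map your platform's terms onto them. The rule encoded is only that the switch's intake side must face the cold aisle.

```python
# Hypothetical labels for this sketch, not any vendor's terminology.
PORT_SIDE_INTAKE = "port_side_intake"    # air enters at the port side
PORT_SIDE_EXHAUST = "port_side_exhaust"  # air leaves at the port side
COLD_AISLE = "cold"
HOT_AISLE = "hot"


def airflow_matches_aisle(airflow, ports_face):
    """True when the intake side of the switch faces the cold aisle."""
    if airflow == PORT_SIDE_INTAKE:
        return ports_face == COLD_AISLE
    if airflow == PORT_SIDE_EXHAUST:
        return ports_face == HOT_AISLE
    raise ValueError(f"unknown airflow label: {airflow!r}")


# Made-up inventory rows: (switch name, airflow option, aisle the ports face).
inventory = [
    ("leaf-01", PORT_SIDE_INTAKE, COLD_AISLE),
    ("leaf-02", PORT_SIDE_INTAKE, HOT_AISLE),
]
mismatched = [name for name, flow, face in inventory
              if not airflow_matches_aisle(flow, face)]
print("mismatched airflow:", mismatched)
```

This catches direction mismatches only; whether a correctly oriented switch still ingests a neighbor's exhaust has to be checked in the rack itself.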

Locate the platform on the LPO and CPO roadmap

Ask where each platform sits on the linear pluggable optics (LPO) and co-packaged optics (CPO) trajectory, and whether that affects deploying now versus waiting.

What You'll Walk Away With

This guide provides three practical tools for managing the network team's slice of the AI fabric power and thermal challenge.

Optics Power Budget Method

A way to size optics power by reach and link, so total switch power is budgeted rather than discovered.

Airflow and Placement Checklist

The rack-level checks that keep intake air cool and module temperatures inside margin.

800G Design Review Question Set

The networking-lane questions that surface power and thermal risk before the design is committed.

Who This Guide Is For

This guide is for network professionals who must manage the power and thermal aspects of high-speed AI fabric design that fall within their own domain.

Enterprise network architects designing AI or GPU cluster fabrics will find practical methods for budgeting optics power and managing thermal constraints. Network teams that own optics, switching, and airflow decisions distinct from facilities can use this guide to establish clear ownership boundaries and technical approaches.

Anyone committing to an 800G design who needs to model power and thermal limits first will benefit from the structured approach to evaluating these constraints before finalizing architecture decisions.

Frequently Asked Questions

Isn't power and cooling a facilities problem, not a network one?

Part of it is. Rack power and GPU cooling belong to facilities. But optics power budgets, switch airflow and placement, transceiver heat, and the linear pluggable optics (LPO) and co-packaged optics (CPO) trajectory are network-team decisions. If the network team does not own them deliberately, no one does, and the fabric pays for it.

How much power do 800G optics actually use?

Commonly around 14 to 20W per module. The bigger surprise is the aggregate: across a fully populated high-radix switch, optics can consume more total power than the switching ASIC, which is why they have to be budgeted explicitly.

What are LPO and CPO, and should I wait for them?

Linear pluggable optics (LPO) drops the DSP from the module to cut power while staying pluggable. Co-packaged optics (CPO) integrates the optics into the switch package for larger savings at the cost of serviceability. You do not need to wait, but ask where any platform sits on that roadmap before committing to a multi-year design.

Why is a hot, throttling optic such a big deal in an AI fabric?

Because it degrades quietly. A marginal link inflates tail latency across synchronized collectives, and the cluster runs at the speed of its slowest path, so one struggling transceiver can drag down job completion time across many GPUs. That makes optics thermal margin a reliability concern, not just an efficiency one.
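A deliberately simplified model makes the effect concrete. It assumes a synchronized collective step finishes only when its slowest path finishes, and it uses made-up per-path times; real collectives are more complicated, but the direction of the effect is the same.

```python
def collective_step_time(per_path_times_ms):
    """A synchronized step completes when the slowest path completes."""
    return max(per_path_times_ms)


# Hypothetical per-path transfer times for one step across eight paths.
healthy = [10.0] * 8
one_degraded = [10.0] * 7 + [25.0]  # a single marginal link

print(f"healthy step:      {collective_step_time(healthy):.1f} ms")
print(f"one degraded link: {collective_step_time(one_degraded):.1f} ms")
```

In this model seven of the eight paths are unchanged, yet every participant waits on the eighth, which is why averages hide the problem and tail latency exposes it.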

Need help designing your AI fabric?

IVI's network architects work with enterprise teams to design AI cluster fabrics that balance performance, power efficiency, and operational reality. We help you navigate the power and thermal constraints that matter to the network team.
