GPUs used to have a simple business story: buy more, wire them well, train bigger models. That story is not false. It is just starting to resemble a children’s book.

The adult version has buildings, regions, power constraints, optical links, oversubscribed networks, packet loss, pipeline bubbles, model chunks, microbatches, and a quiet question with a very expensive answer: when the GPUs no longer fit comfortably inside one data center building, how should the training job be split?

A new paper from researchers at Harvard and Meta, ScaleAcross Explorer: Exploring Communication Optimization for Scale-Across AI Model Training, gives that question a name and a system-level answer.[1] The name is scale-across training: AI model training expanded across multiple data center buildings, and potentially across regions, rather than merely scaled within one homogeneous zone. The answer is not “use data parallelism,” not “use pipeline parallelism,” and not “buy more bandwidth, heroically.” It is more annoying and more useful: the best configuration depends on which communication pattern crosses the constrained link, how often it crosses, whether computation can hide it, and whether the network behaves like the one your whiteboard assumes.

This is the kind of paper whose abstract sounds like infrastructure plumbing and whose business implication is closer to capital allocation. At frontier scale, iteration time is not a software metric. It is the price of experimentation.

Scale-across begins when the building becomes part of the training system

The paper distinguishes scale-out from scale-across. Scale-out expands capacity across zones through a relatively homogeneous network. Scale-across crosses buildings, and eventually regions, where the network is no longer a polite abstraction. Cross-building links have higher latency, longer distances, and worse bandwidth oversubscription than intra-zone links. The paper’s Table 1 places cross-zone networks around 1:2–1:4 oversubscription and roughly 20–30 microseconds latency, while cross-building settings move to 1:3 and beyond, tens to hundreds of kilometers, and at least 50 microseconds.

That difference looks small only if one has never paid for idle H100s.

Meta’s described production setting already spans multiple buildings and supports 100K+ GPU training. In that production configuration, the authors use five forms of parallelism: tensor parallelism, context parallelism, expert parallelism, pipeline parallelism, and data parallelism. The high-bandwidth, latency-sensitive forms—TP, CP, and EP—stay inside an AI zone. The outer cross-building layer uses data parallelism, while pipeline parallelism sits on the cross-zone layer.

That placement is not a religious commitment. It is a consequence of traffic shape. Data parallelism, using FSDP in the paper’s production setting, communicates parameters and gradients through operations such as AllGather and ReduceScatter. Much of that communication can overlap with computation. Pipeline parallelism, by contrast, passes activations forward and activation gradients backward between adjacent pipeline stages. If adjacent stages sit in different buildings, that point-to-point traffic can land directly on the critical path.

The paper’s first contribution is therefore definitional but not merely semantic. It reframes distributed AI training as a geography-aware systems problem. The building is no longer just where the GPUs live. It becomes part of the optimization surface.

The false shortcut is choosing one outer parallelism rule

The likely misreading is easy to predict. Prior geo-distributed training work often makes pipeline parallelism across regions look natural: split the model, put different stages in different locations, reduce heavy synchronization. One might then assume that PP-out—pipeline parallelism as the outermost cross-building layer—is the answer to scale-across training.

The paper says: sometimes. Which is the technical equivalent of ruining a slogan, but improving a system.

The authors compare DP-out and PP-out under dense and mixture-of-experts model settings. The key mechanism is communication frequency.

With PP-out, cross-building communication can happen for every model chunk and every microbatch: activations in the forward pass, activation gradients in the backward pass. As the number of microbatches rises, the cross-building communication frequency rises with it. With DP-out, the outer FSDP communication happens around parameter gathering and gradient reduction; only part of this communication sits fully exposed on the critical path, and the total cross-building volume is less sensitive to the number of microbatches.
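The frequency argument can be made concrete with a toy count. The sketch below is not the paper’s model. It assumes each microbatch crosses the building boundary a fixed number of times per forward pass (`crossings_per_pass`, a hypothetical parameter that stands in for stage placement and chunking), and it simplifies DP-out to a fixed number of collectives per sharded unit per iteration, independent of microbatch count.

```python
def pp_out_messages(num_microbatches: int, crossings_per_pass: int) -> int:
    """Cross-building point-to-point messages per iteration under PP-out.

    Each microbatch sends activations forward and activation gradients
    backward over every boundary crossing, hence the factor of 2.
    """
    return 2 * num_microbatches * crossings_per_pass


def dp_out_collectives(num_fsdp_units: int, collectives_per_unit: int = 3) -> int:
    """Cross-building collectives per iteration under DP-out (toy assumption).

    Assumes each sharded unit issues a forward AllGather, a backward
    AllGather, and a ReduceScatter once per iteration, regardless of how
    many microbatches the pipeline runs.
    """
    return num_fsdp_units * collectives_per_unit


if __name__ == "__main__":
    # Hypothetical sizes, chosen only to show the scaling behavior.
    for m in (8, 32, 128):
        print(m, pp_out_messages(m, crossings_per_pass=3), dp_out_collectives(48))
```

Under these assumptions the PP-out count grows linearly with the number of microbatches while the DP-out count stays flat, which is the shape of the argument above. Message sizes and overlap are deliberately left out.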

So the question is not “Which parallelism is better?” The question is:

| Mechanism question | Why it matters operationally |
| --- | --- |
| What communication crosses the constrained link? | Cross-building bandwidth and latency do not punish all collectives equally. |
| How often does it cross per iteration? | Microbatch count and model chunking can quietly multiply network exposure. |
| Can computation hide the delay? | Longer MoE computation can mask communication that dense layers expose. |
| Is the network merely slow, or also lossy and imbalanced? | Retransmission, load balancing, and congestion control change the effective cost of distance. |

This is the center of the paper: placement is conditional because communication is not one object.

Dense models punish PP-out when microbatches multiply

For dense models, the paper’s evidence favors DP-out when the number of microbatches is large and cross-building oversubscription is high.

In the 17B dense model testbed experiments, the authors compare PP-out and DP-out across oversubscription ratios. At low oversubscription, the gap is narrow, and the crossover appears around 1:4. Beyond that, DP-out becomes clearly faster; at 1:16 oversubscription, DP-out gives a 28.47% speedup over PP-out.

This is main evidence, not a decorative benchmark. It isolates a practical deployment choice: when the cross-building network becomes constrained, PP-out’s repeated activation and gradient traffic starts to hurt.

The microbatch experiment explains why. Holding microbatch size fixed and reducing global batch size reduces the number of microbatches. Under the paper’s 17B dense configuration, PP-out is slower than DP-out at batch size 176, closes the gap at batch size 88, and overtakes DP-out as batch size falls further. In plain terms: PP-out is less painful when it has fewer chances to talk across buildings.

That is a useful correction to the usual “pipeline equals geography” intuition. Pipeline parallelism can align well with geographic partitioning, but only when its communication frequency remains tolerable. If the production workload has many pipeline stages, one-layer chunks, and many microbatches, PP-out may produce hundreds of cross-building activation and gradient communications per iteration. A clever partition can become a very expensive metronome.

The large-scale simulations reinforce the dense-model story. In runs using production workload traces with 100K GPUs across multiple buildings, PP-out is comparable to or worse than DP-out, and up to 6.60% slower at the default oversubscription ratio. The authors attribute this to the need for many microbatches to avoid pipeline idling and to the model chunk and schedule choices that increase cross-building communication frequency.

The business translation is simple enough to be dangerous: a configuration that looks network-efficient at small scale can become network-hungry when realistic microbatch and chunking choices are included. Spreadsheet architecture has again failed to survive contact with an actual system. We mourn briefly and move on.

MoE reverses the rule because expert parameters change the traffic shape

Then the paper changes the answer.

For mixture-of-experts models, PP-out can become better. The reason is not fashion. It is arithmetic.

In MoE models, increasing the number of experts increases the volume of data communicated by data parallelism because more expert parameters and gradients participate. Pipeline communication, however, is based on activations and activation gradients, so its volume does not grow in the same way as the number of experts increases. The result is a reversal: the cross-building burden shifts from PP traffic to DP traffic.
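A rough volume comparison shows why the reversal is arithmetic. The sketch below is a toy estimate, not the paper’s cost model. All sizes are hypothetical, `passes=3` assumes two AllGathers and one ReduceScatter per parameter per iteration, and sharding factors, expert-parallel layout, and routing are ignored.

```python
def dp_bytes_per_iteration(dense_params: int, params_per_expert: int,
                           num_experts: int, bytes_per_param: int = 2,
                           passes: int = 3) -> int:
    """Toy data-parallel traffic: proportional to total parameter count."""
    total_params = dense_params + num_experts * params_per_expert
    return passes * total_params * bytes_per_param


def pp_bytes_per_iteration(num_microbatches: int, tokens_per_microbatch: int,
                           hidden_size: int, crossings_per_pass: int = 1,
                           bytes_per_value: int = 2) -> int:
    """Toy pipeline traffic: activations forward, activation gradients back.

    Note that the expert count does not appear anywhere in this formula.
    """
    activation = tokens_per_microbatch * hidden_size * bytes_per_value
    return 2 * num_microbatches * crossings_per_pass * activation


if __name__ == "__main__":
    pp = pp_bytes_per_iteration(num_microbatches=32,
                                tokens_per_microbatch=8192, hidden_size=4096)
    for experts in (16, 64, 128):
        dp = dp_bytes_per_iteration(1_000_000_000, 100_000_000, experts)
        print(experts, dp, pp)
```

In this simplification the data-parallel volume grows linearly with the number of experts while the pipeline volume is unchanged, so adding experts shifts the cross-building burden toward DP traffic.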

On the production testbed with a 40B MoE model and 128 H100 GPUs, PP-out scales better as both the number of experts and the oversubscription ratio increase. At a 1:16 oversubscription ratio with 128 experts, PP-out achieves a 49.4% speedup over DP-out.

That is not a small footnote. It is the paper’s most important anti-rule: the right outer parallelism depends on model architecture.

The simulations add nuance rather than simply cheering for PP-out. With TP4-PP16-DP128, high oversubscription substantially increases iteration time for MoE models using DP-out, while PP-out is less affected. Latency still matters, but MoE computation can overlap with cross-building point-to-point traffic more effectively than dense computation. With a different configuration, TP4-PP2-DP1024, fewer pipeline stages mean many FSDP collectives per iteration, and larger DP groups make long latency more damaging.

So even the MoE lesson is conditional. More experts push toward PP-out, but DP group size, number of pipeline stages, model chunking, and latency still shape the final answer. The useful conclusion is not “MoE wants PP-out.” The useful conclusion is: MoE changes which traffic becomes expensive, so the placement search must understand expert count and communication overlap.

Scheduling decides whether an intra-building optimization becomes a cross-building tax

Placement chooses which communication pattern crosses buildings. Scheduling decides how often and how awkwardly that pattern appears.

The paper evaluates two scheduling directions: data-parallel communication patterns and pipeline schedules.

For data parallelism, the key comparison is FSDP versus hierarchical HSDP. HSDP forms local groups, aggregates within those groups, and then synchronizes group leaders across buildings. This reduces cross-building communication frequency, but it uses more memory and can lower the maximum microbatch size. In the 17B dense testbed setting, HSDP performs slightly worse than FSDP at moderate oversubscription because the smaller microbatch size hurts compute utilization. At high oversubscription, however, the reduced cross-building traffic wins: HSDP achieves a 6.83% iteration speedup at 1:16 oversubscription.

This is an ablation-like result for communication pattern tradeoffs. It does not prove that HSDP should replace FSDP everywhere. It shows when the memory-for-communication trade becomes attractive: high oversubscription, small microbatches, and constrained long-distance links.

For pipeline scheduling, the paper compares DoraPP and Interleaved ZBV. DoraPP is efficient inside a building because it reduces bubbles and improves computation scheduling. But it uses a wrap-around pattern that can increase cross-boundary communication. Interleaved ZBV has more computation overhead, yet its V-shaped dependency pattern reduces cross-building communication. At lower oversubscription, DoraPP wins. At high oversubscription, beyond roughly 1:8 in the paper’s discussion, Interleaved ZBV becomes faster because avoiding constrained-link traffic matters more than minimizing local scheduling overhead.
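The difference between a wrap-around and a V-shaped dependency pattern can be illustrated by counting building crossings along the chain of model chunks. This sketch does not implement DoraPP or Interleaved ZBV. It only models the order in which consecutive chunks visit pipeline ranks, assumes ranks are assigned to buildings in contiguous blocks, and ignores compute overhead and bubbles entirely. All function names are hypothetical.

```python
def wraparound_order(num_ranks: int, chunks_per_rank: int) -> list[int]:
    """Rank hosting each consecutive chunk: round-robin, wrapping to rank 0."""
    return [c % num_ranks for c in range(num_ranks * chunks_per_rank)]


def v_shape_order(num_ranks: int) -> list[int]:
    """Two chunks per rank: ranks 0..P-1 on the way down, P-1..0 coming back."""
    return list(range(num_ranks)) + list(range(num_ranks - 1, -1, -1))


def cross_building_hops(order: list[int], ranks_per_building: int) -> int:
    """Count adjacent chunk pairs whose ranks sit in different buildings."""
    buildings = [rank // ranks_per_building for rank in order]
    return sum(1 for a, b in zip(buildings, buildings[1:]) if a != b)


if __name__ == "__main__":
    # Hypothetical layout: 8 pipeline ranks split evenly over 2 buildings.
    print(cross_building_hops(wraparound_order(8, 2), ranks_per_building=4))
    print(cross_building_hops(v_shape_order(8), ranks_per_building=4))
```

With two chunks per rank in this layout, the wrap-around order crosses the boundary three times per pass and the V-shaped order twice, because the V turns around inside the second building instead of jumping back to the first.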

This is the paper’s scheduling lesson in one sentence: the schedule that is elegant locally may be expensive geographically.

That point matters for teams adapting existing distributed training systems. It is tempting to carry over the best intra-cluster schedule and assume the network team will absorb the difference. The paper’s evidence suggests the opposite. Once the long link is involved, schedule design becomes network design by another name.

The network layer turns distance into retransmission, load balancing, and control-loop delay

The paper’s network-layer section is not there to say “latency is bad.” Everyone already knows that, including several interns and most routers.

Its value is showing that distance changes the behavior of network mechanisms that were designed for shorter, cleaner environments.

First, collective communication time increases with distance, roughly linearly in the authors’ NCCL measurements, with larger messages better able to approach line rate. Point-to-point SendRecv outperforms AllReduce because AllReduce has more stack overhead and requires more careful buffer/channel tuning. The measurement mainly supports ScaleAcross Explorer’s communication-time modeling and motivates why the optimizer cannot treat all communication primitives as interchangeable.
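Both observations, time growing roughly linearly with distance and larger messages getting closer to line rate, fall out of a simple latency-plus-bandwidth model. The sketch below is an illustration under stated assumptions, not the paper’s measured model: it treats a transfer as one one-way message, uses roughly 5 microseconds per kilometer for propagation in fiber, and ignores software overhead, multi-round collectives, and congestion.

```python
FIBER_DELAY_US_PER_KM = 5.0  # light in fiber travels at roughly 2e8 m/s


def transfer_time_us(message_bytes: int, distance_km: float,
                     bandwidth_gbps: float) -> float:
    """One-way transfer time: propagation delay plus serialization time."""
    propagation = distance_km * FIBER_DELAY_US_PER_KM
    # 1 Gbps equals 1e3 bits per microsecond.
    serialization = message_bytes * 8 / (bandwidth_gbps * 1e3)
    return propagation + serialization


def line_rate_fraction(message_bytes: int, distance_km: float,
                       bandwidth_gbps: float) -> float:
    """Share of the transfer spent actually pushing bits onto the wire."""
    serialization = message_bytes * 8 / (bandwidth_gbps * 1e3)
    return serialization / transfer_time_us(message_bytes, distance_km,
                                            bandwidth_gbps)


if __name__ == "__main__":
    # Hypothetical 400 Gbps link over 10 km.
    for size in (1_000_000, 10_000_000, 100_000_000):
        print(size, transfer_time_us(size, 10, 400),
              round(line_rate_fraction(size, 10, 400), 3))
```

A 1 MB message over 10 km spends most of its time waiting on propagation, while a 100 MB message is dominated by serialization and so approaches line rate.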

Second, packet loss interacts with latency. The authors simulate loss recovery using Go-Back-N, varying packet loss from 0.002% to 0.2% and latency from 10 to 1000 microseconds. PP-out is more sensitive to packet loss because each lost packet can create pipeline bubbles around point-to-point activation and gradient communication. At 1000 microseconds latency with 0.02% loss, iteration time rises to 1.30× the 50-microsecond baseline for DP-out and to 1.93× for PP-out. Without packet loss, the corresponding latency-only increases in the referenced topology are much smaller, 1.07× and 1.10×.
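The interaction between loss and latency is visible even in the textbook steady-state formula for Go-Back-N, where every loss forces retransmission of everything already in flight. The sketch below uses that standard result, utilization = (1 - p) / (1 + 2ap), with a equal to propagation delay divided by packet transmission time. It is not the paper’s packet-level simulator, it does not model pipeline bubbles, and the packet size and link speed are hypothetical.

```python
def go_back_n_utilization(loss_prob: float, one_way_latency_us: float,
                          packet_bytes: int = 4096,
                          bandwidth_gbps: float = 400.0) -> float:
    """Textbook Go-Back-N link utilization, (1 - p) / (1 + 2 * a * p).

    Assumes the send window is large enough to keep the link full, so a
    single loss costs roughly one round trip of retransmitted packets.
    """
    tx_time_us = packet_bytes * 8 / (bandwidth_gbps * 1e3)
    a = one_way_latency_us / tx_time_us
    return (1 - loss_prob) / (1 + 2 * a * loss_prob)


if __name__ == "__main__":
    for latency_us in (10, 50, 100, 1000):
        for loss in (0.00002, 0.0002, 0.002):  # 0.002%, 0.02%, 0.2%
            print(latency_us, loss,
                  round(go_back_n_utilization(loss, latency_us), 3))
```

The same loss rate costs far more goodput at 1000 microseconds than at 50, because the amount of in-flight data discarded per loss scales with the round-trip time.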

This is a robustness/sensitivity test with a clear purpose: it shows that long-distance training risk is not just average latency. It is latency multiplied by loss recovery and tail behavior.

Third, the load-balancing result complicates another familiar assumption. Packet spraying can outperform ECMP in cross-building settings because low-entropy training traffic can create persistent hash collisions under ECMP. But at cross-region latency, packet spraying hits a practical in-flight-packet limit at the NIC. In the paper’s simulation, at 1000 microseconds latency, packet spraying becomes 0.10% to 11.44% slower than ECMP when using no more than four queue pairs.
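The in-flight limit can be read as a bandwidth-delay-product problem: a long link needs many packets outstanding to stay full, and a cap on outstanding packets becomes a cap on throughput. The sketch below is a back-of-the-envelope illustration only. The per-queue-pair limit of 1024 packets, the 4096-byte packet size, and the 400 Gbps link are hypothetical values, and treating the limit as per queue pair is an assumption rather than something the paper’s summary here states.

```python
import math


def packets_to_fill_link(rtt_us: float, bandwidth_gbps: float,
                         packet_bytes: int = 4096) -> int:
    """Packets that must be in flight to cover one bandwidth-delay product."""
    bdp_bytes = bandwidth_gbps * 1e3 * rtt_us / 8  # 1 Gbps = 1e3 bits/us
    return math.ceil(bdp_bytes / packet_bytes)


def throughput_cap_gbps(rtt_us: float, inflight_limit_per_qp: int,
                        num_qps: int, bandwidth_gbps: float,
                        packet_bytes: int = 4096) -> float:
    """Upper bound on throughput when outstanding packets are capped."""
    window_bits = inflight_limit_per_qp * num_qps * packet_bytes * 8
    return min(bandwidth_gbps, window_bits / rtt_us / 1e3)


if __name__ == "__main__":
    for rtt_us in (100, 200, 2000):  # 2000 us RTT ~ 1000 us one-way latency
        print(rtt_us, packets_to_fill_link(rtt_us, 400),
              round(throughput_cap_gbps(rtt_us, 1024, 4, 400), 1))
```

Under these made-up numbers the cap is invisible at short round-trip times and binding at cross-region ones, which is the qualitative effect described above.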

Fourth, congestion control becomes less obviously helpful over ultra-long links. ECN-based control depends on feedback, and feedback arrives later over high-latency paths. In the simulated single-tenant setting, disabling congestion control gives modest speedups of 0.89% and 1.52% at 100 and 1000 microseconds under high oversubscription. The authors are careful here: multi-tenant production congestion may behave differently. Good. That is exactly where caution belongs—not sprinkled everywhere like parsley, but placed where it changes interpretation.

The network section should be read as exploratory system characterization. It does not hand buyers a universal shopping list. It tells infrastructure teams which assumptions become brittle when the training job stretches across geography.

ScaleAcross Explorer is a search engine for interacting constraints

After characterizing placement, scheduling, and network behavior, the paper introduces ScaleAcross Explorer. It is not a new collective algorithm alone, nor a magical network protocol. It is a configuration optimizer.

Given model architecture, batch size, accelerator specifications, and network topology, ScaleAcross Explorer searches over parallelism configuration and placement, estimates computation and communication kernel times, reconstructs end-to-end iteration time under pipeline schedules, and recommends network-layer choices. Its output includes parallelism order, degrees of TP/CP/EP/PP/DP, microbatch size, model chunk sizes, communication pattern, pipeline schedule, and network technology guidance.

The search problem is enormous. For a model with 100 layers and 20 pipeline stages, the number of possible layer partitions alone can reach trillions. The authors use heuristics: TP is placed innermost, CP second-innermost, CP is bounded so each context shard has at least 2,048 tokens, chunk imbalance is capped to avoid stragglers, and PP-out search is pruned when more stages stop improving iteration time. Then they use Monte Carlo sampling: explore random partitions, evaluate many chunk configurations, learn features correlated with iteration time, and exploit promising candidates through perturbation.
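The explore-then-exploit loop can be sketched in a few lines. This is a minimal illustration of the general idea, not the authors’ implementation: `toy_cost` is a hypothetical stand-in for an iteration-time estimate, the imbalance cap is modeled as a soft penalty rather than a hard constraint, and the feature-learning step is omitted entirely.

```python
import random


def random_partition(num_layers: int, num_stages: int,
                     rng: random.Random) -> list[int]:
    """Sample contiguous stage sizes (each >= 1) that sum to num_layers."""
    cuts = sorted(rng.sample(range(1, num_layers), num_stages - 1))
    bounds = [0] + cuts + [num_layers]
    return [b - a for a, b in zip(bounds, bounds[1:])]


def toy_cost(sizes: list[int], max_ratio: float = 2.0) -> float:
    """Hypothetical cost: slowest stage, plus a penalty for imbalance."""
    imbalance = max(sizes) / min(sizes)
    return max(sizes) + 10.0 * max(0.0, imbalance - max_ratio)


def perturb(sizes: list[int], rng: random.Random) -> list[int]:
    """Move one layer between adjacent stages, keeping stages non-empty."""
    sizes = list(sizes)
    i = rng.randrange(len(sizes) - 1)
    src, dst = (i, i + 1) if rng.random() < 0.5 else (i + 1, i)
    if sizes[src] > 1:
        sizes[src] -= 1
        sizes[dst] += 1
    return sizes


def search(num_layers: int, num_stages: int, samples: int = 200,
           refine_steps: int = 2000, seed: int = 0) -> list[int]:
    rng = random.Random(seed)
    # Explore: score random partitions and keep the best one.
    best = min((random_partition(num_layers, num_stages, rng)
                for _ in range(samples)), key=toy_cost)
    # Exploit: accept perturbations that do not make the cost worse.
    for _ in range(refine_steps):
        trial = perturb(best, rng)
        if toy_cost(trial) <= toy_cost(best):
            best = trial
    return best


if __name__ == "__main__":
    print(search(num_layers=100, num_stages=20))
```

A real optimizer would replace `toy_cost` with a reconstruction of end-to-end iteration time under a specific pipeline schedule, which is where most of the difficulty lives.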

This design is sensible because the objective is not smooth. One more pipeline stage, a different chunk size, or a changed outer parallelism can alter warm-up bubbles, communication frequency, overlap, memory feasibility, and network exposure. A nice linear cost model would be comforting. It would also be wrong often enough to become expensive.

The evaluation supports the optimizer, but the size of the win depends on the regime

The ScaleAcross Explorer evaluation has three major pieces.

| Test | Likely purpose | What it supports | What it does not prove |
| --- | --- | --- | --- |
| 40B MoE testbed, 64 GPUs per cluster, 10 km distance | Main evidence against production and Sailor baselines | Full-stack search improves iteration time by choosing placement, microbatching, and chunk size together | Universal gain across all MoE architectures and all cluster designs |
| 17B dense research-cluster testbed | Main evidence in dense setting | The optimizer can also identify when PP-out is better under small-batch production settings | That dense models always prefer PP-out; earlier placement tests show the opposite under many microbatches |
| 100K-GPU production-trace simulation | Scale validation and comparison with production configuration | Even when production DP-out is already optimal, further schedule/parallelism tuning gives smaller but consistent gains | That simulation captures every future heterogeneous multi-region deployment |
| Topology variants and latency/loss simulations | Robustness and sensitivity tests | Qualitative trends persist across topology changes and worsen under lossy long-latency links | That the exact percentages transfer to another operator’s hardware and traffic mix |
| ECMP, packet spraying, congestion control tests | Network-layer exploratory extension | Long links can reverse familiar network-protocol preferences | A final prescription for multi-tenant cross-region production networks |

On the 40B MoE testbed, ScaleAcross Explorer achieves 8.93% to 54.39% speedup over the production configuration and 23.07% to 37.59% over Sailor. The paper attributes the improvement to three choices: reducing microbatches per pipeline by optimizing parallelism degrees, switching between DP-out and PP-out when appropriate, and selecting larger model chunks to reduce cross-cluster point-to-point communication overhead.

On the 17B dense model, the optimizer achieves 0.12% to 64.62% speedup over the production configuration and 16.38% to 26.21% over Sailor. This may look surprising because the earlier dense-model placement discussion favored DP-out under many microbatches. The explanation is that the production setting compared here uses a small batch size, where PP-out can be better once the network is oversubscribed. This is exactly why the paper needs a mechanism-first reading. The same model family can lead to different placement choices when batch size and microbatch count change.

At 100K-GPU simulation scale, where the production run already uses DP-out and that placement remains optimal, the improvement is smaller: 1.04% to 8.10% across oversubscription and latency settings. That is not a failure. It is an honest result. When a production system is already close to the right outer placement, remaining gains come from refining intra-building PP and reducing pipeline warm-up idle time. The paper also notes that speedups shrink at long latency because the optimizer’s configuration uses fewer PP stages and larger DP groups, which can become more latency-sensitive.

Sailor performs poorly at 100K-GPU simulation scale, partly because it lacks context parallelism support and defaults to PP-out, which is suboptimal for the production workload studied. This is a comparison with prior work, but it should not be read as “Sailor is bad.” It is more precise to say: an optimizer built around one family of assumptions can become brittle when model architecture, long context, and network topology shift.

The business value is faster design-space exploration, not cheaper cables

What does this mean outside Meta-scale research infrastructure?

Directly, the paper shows that scale-across AI training cannot be optimized one layer at a time. Model architecture determines communication volume. Batch size determines microbatch count. Pipeline schedule determines how often point-to-point traffic crosses the boundary. Model chunk size determines communication frequency and compute/communication overlap. Network protocol choices determine whether long links behave as expected. ScaleAcross Explorer improves iteration time because it searches across these layers together.

Cognaptus inference: for organizations building frontier-scale, sovereign-scale, or hyperscaler-scale AI infrastructure, the practical value is reduced experimentation cost. Every full training configuration that does not need to be tested physically saves time, GPU opportunity cost, and engineering attention. If iteration time falls by even a few percent at 100K-GPU scale, the economic impact can be material. At that scale, one percent is not a rounding error; it is a small weather system made of capital expenditure.

What remains uncertain: the exact optimizer gains will not transfer mechanically to every environment. The paper’s strongest evidence comes from Meta’s production experience, a configurable H100 testbed, and an in-house packet-level simulator validated for production use. That is valuable evidence, but it is also specialized evidence. A smaller enterprise fine-tuning a 7B model on a few rented nodes should not read this paper as an action plan. It is not your bottleneck, unless your office coffee machine has 100K GPUs behind it. In which case, congratulations, and perhaps call Facilities.

A more useful business segmentation is:

| Organization type | Practical relevance |
| --- | --- |
| Hyperscalers and frontier labs | Directly relevant: training jobs are large enough for geography-aware optimization to affect cost and schedule. |
| Sovereign AI cloud builders | Highly relevant: power, land, cooling, and regional availability constraints may force multi-building or multi-region training. |
| GPU cloud providers | Relevant as product differentiation: topology-aware training support can become a service capability. |
| Large enterprises fine-tuning models | Mostly indirect: useful for understanding cloud pricing, availability, and future training-service architecture. |
| Ordinary AI application teams | Educational but not operational: retrieval quality, workflow design, and deployment latency matter more. |

The main business lesson is therefore not “everyone needs ScaleAcross Explorer.” It is: at frontier scale, AI infrastructure planning becomes a joint optimization problem across model design, software stack, and physical geography.

Boundaries that matter before anyone turns this into a slogan

The paper is careful about several boundaries, and those boundaries are important.

First, much of the characterization focuses on two-building or controlled multi-building settings, with simulations extending toward larger and longer-distance scenarios. The authors explicitly note that future scale-across training may involve many buildings with different accelerator generations, different GPU counts, different intra-building topologies, and heterogeneous distances. That future is messier than the already messy present. Charming.

Second, ScaleAcross Explorer deliberately focuses on lossless optimizations: parallelism placement, scheduling, chunking, and network choices that do not change the mathematical training objective relative to a single-cluster baseline. This is a strength for production adoption because it avoids altering model convergence and debugging semantics. But it also means the paper does not fully evaluate algorithmic alternatives such as asynchronous training, Local SGD variants, or partially stale updates. The authors mention these as future options once they become production ready.

Third, some network-layer results are exploratory and simulation-based. The congestion-control result, for example, comes from a single-tenant setup. Multi-tenant traffic could change buffer occupancy, congestion events, and control-loop behavior. The paper does not pretend otherwise.

Fourth, the optimizer’s value depends on accurate measurement and modeling. The paper’s configurable optical testbed is not a decorative lab toy; it is a foundation for calibrating the model. Without comparable measurement discipline, a similar optimizer risks becoming an elegant random-number generator wearing a systems paper costume.

The strategic takeaway: the next frontier model may be limited by geography before math

The paper’s best contribution is not the largest speedup number, though 64.62% over the production configuration and 37.59% over Sailor are not exactly pocket change. The best contribution is the framework it forces on the reader.

Scale-across training is not merely distributed training with longer cables. It is a regime where physical infrastructure changes which software choices are optimal. Dense models and MoE models expose different traffic. Microbatches turn pipeline traffic into repeated long-link exposure. Scheduling transforms local efficiency into cross-building cost. Packet loss makes latency more than latency. Packet spraying can beat ECMP until distance makes NIC bookkeeping bite back. Congestion control can help until feedback arrives too late to be useful.

There is no one best rule because there is no one communication problem.

For business leaders, the message is not that they should memorize DP-out versus PP-out. Please don’t; strategy meetings are already long enough. The message is that frontier AI training capacity is now constrained by the interaction among power, land, networking, model architecture, and orchestration software. The organizations that treat these as separate procurement and engineering silos will overpay for idle time. The organizations that model them together will train faster, experiment cheaper, and probably make fewer heroic-but-wrong infrastructure decisions.

In the old story, scaling AI was about gathering compute. In the new story, compute has an address, and that address matters.

Cognaptus: Automate the Present, Incubate the Future.


1. Minghao Li, Alicia Golden, Samuel Hsia, Michael Kuchnik, Adi Gangidi, Xu Zhang, Ashmitha Jeevaraj Shetty, Zachary DeVito, Weiwei Chu, Dong He, Haoci Zhang, Yuchen Hao, Ruoming Pang, James Hongyi Zeng, Ying Zhang, Minlan Yu, and Carole-Jean Wu, “ScaleAcross Explorer: Exploring Communication Optimization for Scale-Across AI Model Training,” arXiv:2605.24326v1, May 2026, https://arxiv.org/abs/2605.24326