Gradient Sync Overlapping (GSO)
Situation
The GSO project aimed to reduce gradient-synchronization latency across workers during distributed training. The approach pipelined computation and communication: gradient sync jobs were triggered as soon as accumulator-based gradients became available, rather than waiting for the full backward pass to complete. This required a precise dependency graph so that sync jobs for different sections of the model could be scheduled concurrently.
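The core idea can be illustrated with a minimal Python sketch. Everything here is hypothetical stand-in code, not the project's actual implementation: the simulated backward pass yields per-layer gradients in reverse order, and each one immediately launches a sync job on a thread pool instead of waiting for the whole pass to finish.

```python
from concurrent.futures import ThreadPoolExecutor

def backward_pass(num_layers, on_grad_ready):
    # Gradients become available in reverse layer order during backprop.
    for layer in reversed(range(num_layers)):
        grad = [float(layer)] * 4          # stand-in for a real gradient tensor
        on_grad_ready(layer, grad)

def sync_gradients(grad):
    # Stand-in for a real collective (e.g. an all-reduce across workers);
    # here it just returns the gradient unchanged.
    return list(grad)

def train_step(num_layers=3):
    futures = {}
    with ThreadPoolExecutor(max_workers=2) as pool:
        def on_ready(layer, grad):
            # Launch the sync job the moment this layer's gradient is ready,
            # overlapping it with the rest of the backward pass.
            futures[layer] = pool.submit(sync_gradients, grad)

        backward_pass(num_layers, on_ready)
        # Only now block on outstanding sync jobs; most have already run
        # concurrently with the backward computation.
        return {layer: f.result() for layer, f in futures.items()}
```

The baseline this improves on would call `sync_gradients` once, serially, after `backward_pass` returns; the overlap hides sync latency behind computation.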
Task
Generate a reliable dependency graph for the gradient sync process and implement asynchronous, concurrent scheduling of gradient sync jobs, working closely with the runtime team to integrate everything cleanly into the AI framework.
Actions
- Leveraged the PEF parser to map dependencies between gradient accumulators, determining the optimal sync timing for each model section.
- Collaborated with the runtime team to co-design and refine APIs that met the concurrency requirements of the framework.
- Integrated the new APIs into the AI framework, enabling seamless interaction between model layers and the runtime scheduler.
- Led end-to-end accuracy and performance testing, validating both model fidelity and the expected speed gains.
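The dependency-driven scheduling described above can be sketched as a small topological scheduler. This is an illustrative simplification, not the PEF parser's real output or the runtime's actual API: sync jobs whose prerequisites have completed run concurrently on a pool, and the graph below is invented for the example.

```python
from concurrent.futures import ThreadPoolExecutor

def schedule_syncs(deps, run_job, max_workers=4):
    """Run sync jobs respecting a dependency graph, concurrently where possible.

    deps: {job: set of prerequisite jobs}. Returns the completion order.
    """
    done, order = set(), []
    pending = dict(deps)
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        while pending:
            # A job is ready once all of its prerequisites have finished.
            ready = [job for job, d in pending.items() if d <= done]
            if not ready:
                raise ValueError("cycle in dependency graph")
            futures = [(job, pool.submit(run_job, job)) for job in ready]
            for job in ready:
                del pending[job]
            for job, fut in futures:
                fut.result()
                done.add(job)
                order.append(job)
    return order
```

With a toy graph such as `{"embed": set(), "block0": {"embed"}, "block1": {"embed"}, "head": {"block0", "block1"}}`, the two middle blocks sync concurrently once the embedding sync completes, mirroring how independent model sections were scheduled in parallel.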
Result
The optimized gradient sync approach delivered a 7% performance improvement on large models, including BERT and GPT-3. The technique's novelty and impact led to a US patent filing, reflecting both its technical merit and business value.