Gradient Sync Overlapping (GSO)
Situation
The GSO project aimed to reduce gradient-synchronization latency across workers during distributed training. The approach pipelined computation and communication: gradient sync jobs were triggered as soon as accumulator-based gradients became available, rather than waiting for the full backward pass to complete. This required a precise dependency graph so that sync jobs for different sections of the model could be scheduled concurrently.
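The core idea can be illustrated with a minimal Python sketch. Everything here is hypothetical stand-in code, not the project's actual implementation: the simulated backward pass yields per-layer gradients in reverse order, and each one immediately launches a sync job on a thread pool instead of waiting for the whole pass to finish.

```python
from concurrent.futures import ThreadPoolExecutor

def backward_pass(num_layers, on_grad_ready):
    # Gradients become available in reverse layer order during backprop.
    for layer in reversed(range(num_layers)):
        grad = [float(layer)] * 4          # stand-in for a real gradient tensor
        on_grad_ready(layer, grad)

def sync_gradients(grad):
    # Stand-in for a real collective (e.g. an all-reduce across workers);
    # here it just returns the gradient unchanged.
    return list(grad)

def train_step(num_layers=3):
    futures = {}
    with ThreadPoolExecutor(max_workers=2) as pool:
        def on_ready(layer, grad):
            # Launch the sync job the moment this layer's gradient is ready,
            # overlapping it with the rest of the backward pass.
            futures[layer] = pool.submit(sync_gradients, grad)

        backward_pass(num_layers, on_ready)
        # Only now block on outstanding sync jobs; most have already run
        # concurrently with the backward computation.
        return {layer: f.result() for layer, f in futures.items()}
```

The baseline this improves on would call `sync_gradients` once, serially, after `backward_pass` returns; the overlap hides sync latency behind computation.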
Task
Generate a reliable dependency graph for the gradient sync process and implement asynchronous, concurrent scheduling of gradient sync jobs, working closely with the runtime team to integrate everything cleanly into the AI framework.
Actions
- Leveraged the PEF parser to map dependencies between gradient accumulators, determining the optimal sync timing for each model section.
- Collaborated with the runtime team to co-design and refine APIs that met the concurrency requirements of the framework.
- Integrated the new APIs into the AI framework, enabling seamless interaction between model layers and the runtime scheduler.
- Led end-to-end accuracy and performance testing, validating both model fidelity and the expected speed gains.
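The dependency-driven scheduling described above can be sketched as a small topological scheduler. This is an illustrative simplification, not the PEF parser's real output or the runtime's actual API: sync jobs whose prerequisites have completed run concurrently on a pool, and the graph below is invented for the example.

```python
from concurrent.futures import ThreadPoolExecutor

def schedule_syncs(deps, run_job, max_workers=4):
    """Run sync jobs respecting a dependency graph, concurrently where possible.

    deps: {job: set of prerequisite jobs}. Returns the completion order.
    """
    done, order = set(), []
    pending = dict(deps)
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        while pending:
            # A job is ready once all of its prerequisites have finished.
            ready = [job for job, d in pending.items() if d <= done]
            if not ready:
                raise ValueError("cycle in dependency graph")
            futures = [(job, pool.submit(run_job, job)) for job in ready]
            for job in ready:
                del pending[job]
            for job, fut in futures:
                fut.result()
                done.add(job)
                order.append(job)
    return order
```

With a toy graph such as `{"embed": set(), "block0": {"embed"}, "block1": {"embed"}, "head": {"block0", "block1"}}`, the two middle blocks sync concurrently once the embedding sync completes, mirroring how independent model sections were scheduled in parallel.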
Result
The optimized gradient sync approach delivered a 7% performance improvement on large models, including BERT and GPT-3. The technique's novelty and impact led to a US patent filing, reflecting both its technical merit and business value.