Ke Wen

San Francisco Bay Area, United States

Summary

🤩 Rockstar · 🎓 Top School
Ke Wen is a software engineer based in Menlo Park with seven years of experience building high-performance distributed ML systems. At Meta he implements PyTorch parallel APIs and maintains the communication layer (c10d) while helping ship distributed training and inference platforms like torchtitan and torchchat; previously he was a senior engineer on NVIDIA’s NCCL team. He is an active contributor to pytorch/pytorch and NVIDIA/nccl, with work that spans low-level C++ GPU communication improvements (socket/IB transports, tree algorithms) to framework-level features such as error-handling modes and FP8 support—bridging systems performance with ML framework needs. A Columbia PhD candidate and ATPESC alum, he combines research rigor with production-grade engineering to improve the reliability and speed of large-scale model training and inference.
7 years of coding experience
9 years of employment as a software developer
Columbia University in the City of New York
Languages: English, Chinese

Github Skills (16)

ccl (10)
sockets (10)
cuda (10)
pytorch (10)
socket (10)
c-language (10)
distributed-systems (10)
hpc (10)
parallel-computing (10)
performance-optimization (10)
multi-threading (10)
distributed-system (10)
networking (10)
c-programming-language (10)
python (9)

Programming languages (6)

TypeScript, C++, C, Jupyter Notebook, Python, Cuda

Github contributions (5)

NVIDIA/nccl

Nov 2018 - Mar 2022

Optimized primitives for collective multi-GPU communication
Role in this project: Back-end & Performance Engineer
Contributions: 1 review, 15 commits, 7 PRs in 3 years 4 months
Contributions summary: Ke contributed significantly to optimizing the NCCL library, focusing on the performance of the socket and InfiniBand (IB) transport layers. He improved socket transport by splitting transfers and optimizing helper-thread usage, fixed several bugs related to IB devices and collective communication to ensure correct behavior, and improved NCCL's tree algorithm for better overall efficiency (see the sketch after this entry).
cuda, mpi, infiniband, gpu, cluster-computing
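For context, the socket-transport and tree-algorithm work described above is surfaced to users as NCCL tuning knobs. A minimal, hedged sketch of exercising them through PyTorch's NCCL backend (assuming a torchrun launch with one GPU per process; these environment variables are real NCCL settings, but good values are system-specific):

# Hedged sketch: NCCL tuning knobs related to the socket-transport and
# tree-algorithm work above. Optimal values depend on the machine and network.
import os
import torch
import torch.distributed as dist

os.environ.setdefault("NCCL_SOCKET_NTHREADS", "4")   # helper threads per socket connection
os.environ.setdefault("NCCL_NSOCKS_PERTHREAD", "4")  # sockets driven by each helper thread
os.environ.setdefault("NCCL_ALGO", "Tree")           # prefer NCCL's tree algorithm

def main():
    # Assumes launch via `torchrun --nproc-per-node=<gpus> script.py`,
    # which sets RANK, WORLD_SIZE, and LOCAL_RANK for each worker.
    dist.init_process_group(backend="nccl")
    torch.cuda.set_device(int(os.environ["LOCAL_RANK"]))
    x = torch.ones(1 << 20, device="cuda")
    dist.all_reduce(x)  # collective executed by NCCL over the tuned transports
    dist.destroy_process_group()

if __name__ == "__main__":
    main()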
pytorch/pytorch

Feb 2022 - Oct 2022

Tensors and Dynamic neural networks in Python with strong GPU acceleration
Role in this project: Back-end Developer & ML Engineer
Contributions: 882 reviews, 61 commits, 238 PRs in 7 months
Contributions summary: Ke's contributions primarily involve the PyTorch distributed library, specifically the NCCL backend. His work centers on improving the error handling, performance, and stability of distributed operations, including enhancements to collective communication primitives and new features such as an `ErrorHandlingMode` for communication errors. The commits also show a focus on optimizing existing functions and adding support for FP8 types, work that required modifying core C++ code that interacts with the NCCL libraries (see the sketch after this entry).
python, gpu-acceleration, deep-learning, gpu, numpy
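The error-handling work is user-visible through torch.distributed configuration. A minimal, hedged sketch (the exact environment-variable names have shifted across PyTorch releases, e.g. NCCL_ASYNC_ERROR_HANDLING vs. TORCH_NCCL_ASYNC_ERROR_HANDLING, so treat these as illustrative rather than a definitive mapping to the ErrorHandlingMode feature):

# Hedged sketch: surfacing NCCL communication errors in torch.distributed
# instead of hanging. Env-var names vary by PyTorch release.
import os
from datetime import timedelta
import torch
import torch.distributed as dist

# Abort the NCCL communicator (rather than hang) when an async error or
# timeout is detected; older releases used NCCL_ASYNC_ERROR_HANDLING.
os.environ.setdefault("TORCH_NCCL_ASYNC_ERROR_HANDLING", "1")

def main():
    dist.init_process_group(backend="nccl", timeout=timedelta(minutes=5))
    torch.cuda.set_device(int(os.environ["LOCAL_RANK"]))
    grad = torch.randn(4096, device="cuda")
    try:
        dist.all_reduce(grad)  # a failed peer raises here (or via the watchdog)
    except RuntimeError as e:
        # A failed collective surfaces as an exception; callers can log
        # and tear down cleanly instead of deadlocking the job.
        print(f"collective failed: {e}")
        raise
    dist.destroy_process_group()

if __name__ == "__main__":
    main()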