Jongsoo Park is a Member of Technical Staff at OpenAI and a software/processor-architecture co-design specialist with 10 years of experience optimizing ML, HPC, and graph-analytics workloads. At Meta he led SW/HW/model co-design for the Meta Training and Inference Accelerator and was a core contributor to Llama 3's pretraining scalability and performance decisions. An active open-source performance engineer, he has made high-impact contributions including SIMD-optimized GEMM work in Facebook's FBGEMM, bfloat16 and flash-attention fixes in PyTorch, and low-level AVX-512 and Winograd tuning in libxsmm. His earlier research produced an LLVM back end for a low-power microprocessor and code that powered a top HPCG benchmark result and earned a Supercomputing best-paper award for low-communication FFTs. Based in Palo Alto with a Stanford PhD in Electrical Engineering, he blends deep systems research with production-focused engineering at scale.
Contributions: 8 reviews, 316 commits, 344 PRs in 4 years 2 months
Contributions summary: Jongsoo focused on implementing and optimizing core functionality in the FBGEMM library, specifically matrix-matrix multiplication (GEMM) operations. His work added new methods such as `equals` and `metaEquals` to the `PackBMatrix` class, significantly refactored the transpose code and optimized it with SIMD instructions, addressed rounding-consistency issues, and adapted the code to support group convolutions, reflecting a sustained focus on performance improvements and feature enhancements.
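The transpose and rounding items are easiest to see in miniature. Below is a minimal Python sketch, not FBGEMM's actual C++ API and with illustrative function names, of two of the ideas mentioned: a tiled transpose (the scalar analogue of a SIMD-tiled transpose) and requantization with round-half-to-even, the kind of deterministic rounding rule a rounding-consistency fix typically standardizes on.

```python
import numpy as np

def blocked_transpose(a: np.ndarray, block: int = 8) -> np.ndarray:
    """Transpose `a` one block x block tile at a time.

    Scalar stand-in for a SIMD-tiled transpose: each tile is small
    enough to shuffle in registers instead of striding through memory
    column by column.
    """
    m, n = a.shape
    out = np.empty((n, m), dtype=a.dtype)
    for i in range(0, m, block):
        for j in range(0, n, block):
            tile = a[i:i + block, j:j + block]
            out[j:j + block, i:i + block] = tile.T
    return out

def requantize(acc: np.ndarray, scale: float) -> np.ndarray:
    """Scale int32 GEMM accumulators back to int8.

    np.rint rounds halves to the nearest even integer, so the result
    does not depend on which code path produced the accumulator --
    illustrative of the consistency goal, not FBGEMM's exact
    requantization pipeline.
    """
    return np.clip(np.rint(acc * scale), -128, 127).astype(np.int8)

if __name__ == "__main__":
    a = np.arange(32 * 24, dtype=np.int32).reshape(32, 24)
    assert np.array_equal(blocked_transpose(a), a.T)
    print(requantize(np.array([2, 3, -5], dtype=np.int32), 0.5))
```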
Tensors and Dynamic neural networks in Python with strong GPU acceleration
Role in this project: ML Engineer
Contributions: 11 reviews, 217 commits, 299 PRs in 4 years 8 months
Contributions summary: Jongsoo primarily focused on improving and optimizing the performance of machine learning models and related infrastructure in the PyTorch ecosystem. His contributions included fixing issues in the Inductor compiler, a component used to optimize model performance, and addressing problems in the transformer benchmark, particularly with scaled dot-product attention. He also added bfloat16 support to `erfinv` and made changes to the flash-attention implementation.
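Two of those items are visible from user-level PyTorch. The sketch below assumes a recent PyTorch build; it exercises `erfinv` on bfloat16 inputs and scaled dot-product attention on bfloat16 tensors, the user-facing surfaces behind the fixes described above. Whether the flash-attention kernel is actually used depends on having a supported GPU, so the device is chosen at runtime.

```python
import torch
import torch.nn.functional as F

# bfloat16 erfinv (support assumed present in recent PyTorch releases).
x = torch.linspace(-0.9, 0.9, steps=5).to(torch.bfloat16)
print(torch.erfinv(x))

# Scaled dot-product attention on bfloat16 inputs.
# On a supported CUDA GPU, PyTorch can dispatch this to the
# flash-attention kernel; on CPU it falls back to a math backend.
device = "cuda" if torch.cuda.is_available() else "cpu"
q, k, v = (torch.randn(2, 4, 128, 64, dtype=torch.bfloat16, device=device)
           for _ in range(3))
out = F.scaled_dot_product_attention(q, k, v, is_causal=True)
print(out.shape)  # torch.Size([2, 4, 128, 64])
```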
Tags: python, gpu-acceleration, deep-learning, gpu, numpy