My research interests lie at the intersection of Machine Learning Systems, Optimization Theory, High-Performance Computing, and AI for Science, where I apply my mathematical and engineering background to build cutting-edge solutions. I work in the Cornell Relax ML Lab, advised by Prof. Chris De Sa, on efficient machine learning algorithms and systems. Grounded in mathematical principles, our work aims to accelerate large-scale, high-performance machine learning systems that are efficient, parallel, and distributed in real-world settings. In parallel, I collaborate with talented Ph.D. students on AI-driven molecule generation.
My academic journey has led me to a Master of Engineering in Computer Science at Cornell University. Before that, I earned dual B.S. degrees in Computer Science and Mathematics at the University of Massachusetts Amherst, complemented by a minor in Japanese; within Mathematics, I specialized in Applied Math and Scientific Computing. Prior to graduating, I conducted research at the UMass BioNLP Lab under Prof. Hong Yu, focusing on biomedical and clinical NLP applications for Electronic Health Records.
Download my recent CV.
Check out garywei.dev for my non-academic personal website!
M.Eng. in Computer Science, 2023
Cornell University
B.S. in Computer Science, 2022
University of Massachusetts Amherst
B.S. in Mathematics, 2022
University of Massachusetts Amherst
High School Diploma, 2018
Beijing No.4 High School
The online Gradient Balancing (GraB) algorithm, which determines example orderings by solving the herding problem over per-sample gradients, has been proven theoretically optimal and consistently outperforms Random Reshuffling. Yet a user-friendly, efficient implementation has been absent from the community.
This work presents GraB-sampler, an efficient and easy-to-use Python library designed for seamless integration of the GraB algorithm, and additionally proposes three novel variants. The best-performing GraB-sampler configuration reproduces the baseline training loss and test accuracy at the cost of only 8.7% training-time overhead and 0.85% peak GPU memory overhead.
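To illustrate the core idea behind GraB, here is a minimal sketch of the greedy sign-balancing (herding) step it builds on: each mean-centered per-sample gradient is assigned a +1 or -1 sign so the running signed sum stays small, and the signs induce the next epoch's ordering. The function name `balance_order` and the exact tie-breaking are my own illustrative choices, not the GraB-sampler API.

```python
import numpy as np

def balance_order(grads):
    """Sketch of herding-style gradient balancing (hypothetical helper,
    not the GraB-sampler API).

    Each example's mean-centered gradient greedily receives the sign
    that keeps the running signed sum small; +1 examples go to the
    front of the next ordering and -1 examples to the back (reversed).
    """
    grads = np.asarray(grads, dtype=float)
    centered = grads - grads.mean(axis=0)  # balance around the mean gradient
    running = np.zeros(grads.shape[1])
    front, back = [], []
    for i, g in enumerate(centered):
        # pick the sign that leaves the smaller partial-sum norm
        if np.linalg.norm(running + g) <= np.linalg.norm(running - g):
            running += g
            front.append(i)
        else:
            running -= g
            back.append(i)
    return front + back[::-1]
```

The returned list is a permutation of the example indices; in the online setting, GraB applies this balancing with gradients observed during training rather than in a separate pass.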
De novo genome assembly is one of the most important tasks in computational biology. ELBA is the state-of-the-art distributed-memory parallel algorithm for the overlap detection and layout simplification steps of de novo genome assembly, but it suffers a performance bottleneck in its pairwise alignment step.
In this work, we introduce three GPU schedulers for ELBA to accommodate multiple MPI processes and multiple GPUs. The schedulers enable multiple MPI processes to perform computation on GPUs in a round-robin fashion. Both strong- and weak-scaling experiments show that all three schedulers significantly improve performance over the baseline, with a trade-off between parallelism and scheduler overhead. In the best-performing configuration, the one-to-one scheduler achieves a ~7-8x speed-up with 25 MPI processes compared with the vanilla ELBA GPU scheduler.
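The round-robin assignment described above can be sketched as follows. This is a simplified illustration, assuming each MPI rank simply takes GPU `rank % num_gpus`; the function name `assign_gpus` is hypothetical and the real schedulers in ELBA manage device contexts and queuing beyond this mapping.

```python
def assign_gpus(num_ranks, num_gpus):
    """Round-robin mapping of MPI ranks to GPUs (illustrative sketch).

    Rank r computes on GPU r % num_gpus, so ranks are spread across
    the available devices as evenly as possible; with 25 ranks and
    4 GPUs, each device serves 6 or 7 ranks.
    """
    return {rank: rank % num_gpus for rank in range(num_ranks)}
```

In an actual MPI program, each process would look up its own entry (e.g. via its communicator rank) and bind to that device before launching kernels.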
TA of CS 4700: Foundations of Artificial Intelligence
Duties include:
Responsibilities include:
Deliverables:
Responsibilities include:
Research & Projects: