My research interests lie at the intersection of machine learning systems, theory, and AI-driven scientific discovery, where I apply my mathematical and engineering background to build practical solutions. I work at the Cornell Relax ML Lab with Prof. Chris De Sa on efficient machine learning algorithms and systems. Our work, grounded in mathematical principles, aims to accelerate large-scale, high-performance machine learning with systems that are efficient, parallel, and distributed in real-world settings. In parallel, I collaborate with Ph.D. students on AI-driven molecule generation.
I am pursuing a Master of Engineering in Computer Science at Cornell University. Before that, I earned dual B.S. degrees in Computer Science and Mathematics at the University of Massachusetts Amherst, with a minor in Japanese; within Mathematics, I concentrated on Applied Math and Scientific Computing. As an undergraduate, I did research at the UMass BioNLP Lab under Prof. Hong Yu, focusing on biomedical and clinical NLP applications within Electronic Health Records.
Download my recent CV.
Check out garywei.dev for my non-academic personal website!
M.Eng. in Computer Science, 2023
Cornell University
B.S. in Computer Science, 2022
University of Massachusetts Amherst
B.S. in Mathematics, 2022
University of Massachusetts Amherst
High School Diploma, 2018
Beijing No.4 High School
TA for CS 4700: Foundations of Artificial Intelligence
Research & Projects:
De novo genome assembly is one of the most important tasks in computational biology. ELBA is a state-of-the-art distributed-memory parallel algorithm for the overlap detection and layout simplification steps of de novo genome assembly, but it has a performance bottleneck in pairwise alignment.
In this work, we introduce three GPU schedulers for ELBA to accommodate multiple MPI processes and multiple GPUs. The schedulers allow multiple MPI processes to perform computation on the GPUs in a round-robin fashion. Both strong and weak scaling experiments show that all three schedulers significantly improve on the baseline, with a trade-off between parallelism and GPU scheduler overhead. The best-performing implementation, the one-to-one scheduler, achieves a ~7-8x speed-up with 25 MPI processes compared with the baseline vanilla ELBA GPU scheduler.
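For intuition, here is a minimal MPI + CUDA sketch of round-robin GPU assignment across MPI processes. It is an illustrative assumption about the general scheme, not ELBA's actual scheduler code; the rank-modulo device mapping and the program structure are my own simplifications.

```cpp
// Minimal sketch: round-robin binding of MPI ranks to GPUs.
// Hypothetical illustration, not ELBA's scheduler implementation.
#include <mpi.h>
#include <cuda_runtime.h>
#include <cstdio>

int main(int argc, char** argv) {
    MPI_Init(&argc, &argv);
    int rank = 0;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    int num_gpus = 0;
    cudaGetDeviceCount(&num_gpus);
    if (num_gpus == 0) {
        fprintf(stderr, "rank %d: no GPUs visible\n", rank);
        MPI_Abort(MPI_COMM_WORLD, 1);
    }

    // Round-robin: rank i computes on GPU (i mod num_gpus), so ranks
    // share devices evenly when there are more processes than GPUs.
    int device = rank % num_gpus;
    cudaSetDevice(device);
    printf("MPI rank %d bound to GPU %d of %d\n", rank, device, num_gpus);

    // ... pairwise-alignment kernels would launch on this device ...

    MPI_Finalize();
    return 0;
}
```

The appeal of this mapping is that it keeps load roughly balanced without any coordination between ranks, though, as noted above, sharing a device among several processes trades parallelism against scheduling overhead on the GPU.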