My research interests lie at the intersection of Machine Learning Systems, Optimization Theory, High-Performance Computing, and AI for Science, where I apply my mathematical and engineering background to build cutting-edge solutions. I work in the Cornell Relax ML Lab, advised by Prof. Chris De Sa, on efficient machine learning algorithms and systems. Grounded in mathematical principles, our work aims to accelerate large-scale, high-performance machine learning systems that are efficient, parallel, and distributed in real-world settings. In parallel, I collaborate with talented Ph.D. students on AI-driven molecule generation.
My academic journey has led me to a Master of Engineering in Computer Science at Cornell University. Before that, I earned dual B.S. degrees in Computer Science and Mathematics at the University of Massachusetts Amherst, complemented by a minor in Japanese; within Mathematics, I specialized in Applied Math and Scientific Computing. Prior to graduating, I conducted research at the UMass BioNLP Lab under Prof. Hong Yu, focusing on biomedical and clinical NLP applications for Electronic Health Records.
Download my recent CV.
Check out garywei.dev for my non-academic personal website!
M.Eng. in Computer Science, 2023
Cornell University
B.S. in Computer Science, 2022
University of Massachusetts Amherst
B.S. in Mathematics, 2022
University of Massachusetts Amherst
High School Diploma, 2018
Beijing No.4 High School
The online Gradient Balancing (GraB) algorithm, which determines example orderings by solving the herding problem over per-sample gradients, has been proven theoretically optimal and consistently outperforms Random Reshuffling. Yet a user-friendly, efficient implementation has been absent from the community.
This work presents GraB-sampler, an efficient and easy-to-use Python library designed for seamless integration of the GraB algorithm, and additionally proposes three novel variants. The best-performing GraB-sampler configuration reproduces the baseline training loss and test accuracy at the cost of only 8.7% training-time overhead and 0.85% peak GPU memory overhead.
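To illustrate the core idea behind GraB, here is a minimal sketch of the greedy sign-balancing (herding) step it builds on: each mean-centered per-sample gradient is assigned a +1 or -1 sign so the running signed sum stays small, and the signs induce the next epoch's ordering. The function name `balance_order` and the exact tie-breaking are my own illustrative choices, not the GraB-sampler API.

```python
import numpy as np

def balance_order(grads):
    """Sketch of herding-style gradient balancing (hypothetical helper,
    not the GraB-sampler API).

    Each example's mean-centered gradient greedily receives the sign
    that keeps the running signed sum small; +1 examples go to the
    front of the next ordering and -1 examples to the back (reversed).
    """
    grads = np.asarray(grads, dtype=float)
    centered = grads - grads.mean(axis=0)  # balance around the mean gradient
    running = np.zeros(grads.shape[1])
    front, back = [], []
    for i, g in enumerate(centered):
        # pick the sign that leaves the smaller partial-sum norm
        if np.linalg.norm(running + g) <= np.linalg.norm(running - g):
            running += g
            front.append(i)
        else:
            running -= g
            back.append(i)
    return front + back[::-1]
```

The returned list is a permutation of the example indices; in the online setting, GraB applies this balancing with gradients observed during training rather than in a separate pass.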
De novo genome assembly is one of the most important tasks in computational biology. ELBA is the state-of-the-art distributed-memory parallel algorithm for the overlap detection and layout simplification steps of de novo genome assembly, but it suffers a performance bottleneck in its pairwise alignment step.
In this work, we introduce three GPU schedulers for ELBA to accommodate multiple MPI processes and multiple GPUs. The schedulers enable multiple MPI processes to perform computation on GPUs in a round-robin fashion. Both strong- and weak-scaling experiments show that all three schedulers significantly improve performance over the baseline, with a trade-off between parallelism and scheduler overhead. In the best-performing configuration, the one-to-one scheduler achieves a ~7-8x speed-up with 25 MPI processes compared with the vanilla ELBA GPU scheduler.
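The round-robin assignment described above can be sketched as follows. This is a simplified illustration, assuming each MPI rank simply takes GPU `rank % num_gpus`; the function name `assign_gpus` is hypothetical and the real schedulers in ELBA manage device contexts and queuing beyond this mapping.

```python
def assign_gpus(num_ranks, num_gpus):
    """Round-robin mapping of MPI ranks to GPUs (illustrative sketch).

    Rank r computes on GPU r % num_gpus, so ranks are spread across
    the available devices as evenly as possible; with 25 ranks and
    4 GPUs, each device serves 6 or 7 ranks.
    """
    return {rank: rank % num_gpus for rank in range(num_ranks)}
```

In an actual MPI program, each process would look up its own entry (e.g. via its communicator rank) and bind to that device before launching kernels.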
TA of CS 4700: Foundations of Artificial Intelligence
Duties include:
Responsibilities include:
Deliverables:
Responsibilities include:
Research & Projects: