
Publication Date


Document Type


Degree Type



Department

Information Technology


Mentor

Shaoen Wu

Mentor Department

Information Technology


Co-Mentor

Noah Ziems

Co-Mentor Department

Information Technology


Reimplementing solutions to previously solved problems is not only inefficient but also risks introducing inadequate, error-prone code. Traditional approaches address this issue with autoregressive text-generation models trained on code, and they achieve impressive performance. However, these methods are not without flaws: the code they generate can be buggy, lack documentation, and introduce vulnerabilities that may go unnoticed by developers. An alternative to code generation, neural code search, is a field of machine learning in which a model takes a natural language query as input and returns relevant code samples from a database. Because this database exists ahead of time, its code samples can be documented, tested, licensed, and checked for vulnerabilities before developers use them in production. In this work, in an effort to improve the performance of code search, we investigate the impact of tokenization strategies, pre-training objectives, and deep learning architectures on overall performance.
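The retrieval setting described above can be sketched as a ranking problem: score every snippet in a pre-existing database against a natural language query and return the best match. The toy lexical scorer below is only an illustration of that pipeline; a real neural code search system would replace the bag-of-words vectors with learned embeddings from a trained encoder (the corpus and function names here are hypothetical).

```python
import math
import re
from collections import Counter

def tokenize(text):
    # Lowercase and split into alphabetic tokens (a crude stand-in
    # for the tokenization strategies studied in this work)
    return re.findall(r"[a-z]+", text.lower())

def cosine(a, b):
    # Cosine similarity between two token-count vectors
    num = sum(a[t] * b[t] for t in set(a) & set(b))
    den = (math.sqrt(sum(v * v for v in a.values()))
           * math.sqrt(sum(v * v for v in b.values())))
    return num / den if den else 0.0

def search(query, corpus):
    # Rank database snippets by similarity to the query and
    # return the identifier of the top hit
    q = Counter(tokenize(query))
    ranked = sorted(corpus.items(),
                    key=lambda kv: cosine(q, Counter(tokenize(kv[1]))),
                    reverse=True)
    return ranked[0][0]

# Hypothetical pre-existing database of vetted code samples
corpus = {
    "reverse_string": "def reverse_string(s): return s[::-1]",
    "read_file": "def read_file(path): return open(path).read()",
}

print(search("reverse a string", corpus))  # → reverse_string
```

In a neural system the same interface holds, but `tokenize` plus `Counter` is replaced by a model mapping queries and code into a shared vector space, so semantically related pairs match even without word overlap.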
