David Yang

About

Hi, I'm David! I am a data scientist and machine learning engineer with a diverse background in statistics and biology. My expertise lies in processing complex and noisy datasets, deriving insights through statistical methods, and building predictive models with machine learning. Recently, I've delved into natural language processing and state-of-the-art deep learning architectures, and I have found great passion and fulfillment in these technologies. As an advocate for innovation, I hope to work alongside like-minded individuals on products that will revolutionize our future....

My story

My innate curiosity about the mechanisms behind the world around me led me to pursue a Bachelor's degree in Cellular, Molecular and Microbial Biology. During my undergraduate studies, I stumbled upon the field of Bioinformatics, which lies at the intersection of biological science, applied statistics and computer technology. As an undergraduate Bioinformatics researcher, I developed a data pipeline, Nybbler, to perform targeted subsampling of COVID-19 genetic sequences. I also worked on other research projects, including MinION, which enables real-time genetic sequencing by utilizing GPU parallelization with CUDA, under the supervision of Dr. Paul Gordon and Dr. Quan Long. Beyond research, I received various awards for my strong academic track record and campus involvement. During my studies, I hosted a medical conference, MedX, and served as a faculty representative on the First Year Council. I also co-founded a national student-run, not-for-profit organization, COUHR, which operates in 4 cities across Canada, and led its technological development department in creating a mobile health tracking application for patients with Alzheimer's disease.

After recognizing that the forefront of medical and biological innovation is driven by big data and artificial intelligence, I was motivated to pursue a Master's degree in Mathematics and Statistics to gain a deeper understanding of the mechanisms behind such advancements. This transition from a non-traditional background to a technical field involving rigorous mathematics, statistics, and computer science marked a difficult period in my academic journey. Through the guidance of my peers and my supervisor, Dr. Qingrun Zhang, I was able to maintain my academic excellence while engaging in various meaningful research data science projects. Using a range of machine learning and statistical testing techniques, I identified key genes and biomarkers correlated with diseases such as glaucoma and cancer. I also devised and developed a novel optimization method that accelerated evolutionary analysis by more than 2900% through parallelization.

Currently, I am working as a Data Mining Analyst at the University of Calgary, where I explore the application of state-of-the-art methods, such as transformers, to filter and summarize relevant biomedical scientific literature. My work focuses on enhancing the efficiency of data extraction and interpretation, driving advancements in biomedical research and bioinformatics. I am now seeking opportunities in data science and machine learning within a corporate setting, driven by a keen interest in applying these techniques in a business context. Having not yet worked in such an environment, I am eager to see the immediate, tangible results that these applications can deliver.

Outside of work, owing to my affinity for novelty, I hold many scattered interests in subjects such as artificial intelligence, finance, biotechnology, music and reading. The book that has recently left the strongest impression on me is The Almanack of Naval Ravikant. I also make YouTube tutorials on Web Automation Development, which have accumulated more than 100,000 views. Finally, I am very passionate about playing basketball and volleyball, and always enjoy a stimulating conversation!

Portfolio

See more projects on GitHub.

Viriation

Development

Engineered an automated system to fetch, process, and filter biomedical literature with BERT, extracting and displaying key insights relevant to the topic of interest (viruses) for users to annotate on a web interface and shared database.

Code

Gamify Life

Development

Designed and developed a web application that gamifies productivity and goal tracking, featuring user authentication, dynamic daily challenges, long-term quests and a rewards system with XP and in-game gold (in progress).

Code

Unravelling the truth behind COVID-19

Research Data Science

Identified critical sampling biases in SARS-CoV-2 sequence collection and developed a novel sampling strategy and data pipeline using Python and Perl which reduced bias by nearly 100% for over 2 million sequences.

Code Paper Presentation Visuals

Exploratory Analysis of Rhesus Macaque Data

Research Data Science

Pinpointed approximately 50 critical genomic factors linked to glaucoma in Rhesus Macaque monkeys from over 30,000 features through advanced dimensionality reduction and statistical testing with R. My data visualizations also effectively communicated these findings to external institutions, resulting in over $XXX,XXX in research grants.

Code

Algorithmic Parallelization for Evolutionary Analysis

Research

Created an asynchronous parallelization method for the Markov chain Monte Carlo (MCMC) algorithm used in Bayesian evolutionary inference, which accelerated computation by more than 2900% (~84 days).

Code Paper Slideshow Video

Blog

Optimizing Dollar Cost Averaging: Advanced Strategies for Market Downturns

October 15, 2024

In this post, we conduct a thorough comparison of various advanced dollar-cost averaging (DCA) investment strategies by performing historical backtesting using Nasdaq data. We examine methods that involve daily investments, increasing the amount on days when the market declines, with variations in the additional investment amount based on exponential, quadratic, or linear patterns relative to the percentage drop.
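As a rough illustration of the idea, here is a minimal Python sketch of a daily DCA backtest that invests more on down days. The function names (`extra_multiplier`, `backtest`), the scaling constant `k`, the exact linear, quadratic, and exponential formulas, and the toy price series are all assumptions made for this example. They are not the rules or the Nasdaq data used in the post.

```python
import math


def extra_multiplier(drop_pct: float, scheme: str, k: float = 1.0) -> float:
    """Multiple of the base amount to invest, given the day's percentage drop.

    drop_pct is a positive number for a decline (2.0 means the price fell 2%).
    The three formulas below are illustrative assumptions.
    """
    if drop_pct <= 0:
        return 1.0  # flat or up day: invest only the base amount
    if scheme == "linear":
        return 1.0 + k * drop_pct
    if scheme == "quadratic":
        return 1.0 + k * drop_pct ** 2
    if scheme == "exponential":
        return math.exp(k * drop_pct)
    raise ValueError(f"unknown scheme: {scheme}")


def backtest(prices, base=10.0, scheme="linear", k=1.0):
    """Invest every day; scale the amount up when the price fell versus the prior day."""
    shares = 0.0
    invested = 0.0
    prev = None
    for price in prices:
        # Percentage decline relative to the previous close (0 on the first day).
        drop_pct = 0.0 if prev is None else max(0.0, (prev - price) / prev * 100)
        amount = base * extra_multiplier(drop_pct, scheme, k)
        shares += amount / price
        invested += amount
        prev = price
    return {"invested": invested, "shares": shares, "value": shares * prices[-1]}


if __name__ == "__main__":
    toy_prices = [100.0, 98.0, 99.0]  # hypothetical closes, not real market data
    for scheme in ("linear", "quadratic", "exponential"):
        print(scheme, backtest(toy_prices, scheme=scheme))
```

Comparing `value` against `invested` across schemes on the same price series gives a simple basis for judging how aggressively each pattern buys into a decline.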

Read on Twitter