About


Hi, I'm David! I am a data scientist and machine learning engineer with a diverse background in statistics and biology. My expertise lies in processing complex and noisy datasets, deriving insights through statistical methods, and predictive analysis with techniques such as machine learning. Recently, I've delved into natural language processing and state-of-the-art deep learning architectures, finding great passion and fulfillment in these advanced technologies. As an advocate for innovation, I hope to work alongside like-minded individuals on products that will revolutionize our future.


My story

My innate curiosity about the mechanisms underlying the world around me led me to pursue a Bachelor's degree in Cellular, Molecular and Microbial Biology. During my undergraduate studies, I stumbled upon the field of bioinformatics, which lies at the intersection of biological science, applied statistics and computer technology. As an undergraduate bioinformatics researcher, I developed a data pipeline, Nybbler, to perform targeted subsampling of COVID-19 genetic sequences. I also worked on other research projects, including MinION, which enables real-time genetic sequencing by utilizing GPU parallelization with CUDA, under the supervision of Dr. Paul Gordon and Dr. Quan Long. Beyond research, I received various awards recognizing my strong academic track record and campus involvement. During my studies, I hosted a medical conference, MedX; served as a faculty representative on the First Year Council; and co-founded COUHR, a national student-run, not-for-profit organization that operates in 4 cities across Canada, where I led the technological development department in creating a mobile health tracking application for patients with Alzheimer's disease.

After recognizing that the forefront of medical and biological innovation is driven by big data and artificial intelligence, I was motivated to pursue a Master's degree in Mathematics and Statistics to gain a deeper understanding of the mechanisms behind such advancements. This transition from a non-traditional background to a technical field involving rigorous mathematics, statistics, and computer science marked a difficult period in my academic journey. With the guidance of my peers and my supervisor, Dr. Qingrun Zhang, I was able to maintain my academic excellence while engaging in various meaningful data science research projects. Using a range of machine learning and statistical testing techniques, I identified key genes and biomarkers correlated with diseases such as glaucoma and cancer. I also devised and developed a novel optimization method which reduced run-times for evolutionary analysis by more than 2900% through parallelization.

Currently, I am working as a Data Mining Analyst at the University of Calgary, where I explore the application of state-of-the-art methods, such as transformers, to filter and summarize relevant biomedical scientific literature. My work focuses on enhancing the efficiency of data extraction and interpretation, driving advancements in biomedical research and bioinformatics. I am now seeking opportunities in data science and machine learning within a corporate setting, driven by a keen interest in applying these techniques in a business context. Having not yet worked in such an environment, I am eager to see the immediate, tangible results that these applications can deliver.

Outside of work, owing to my affinity for novelty, I hold many scattered interests in subjects such as artificial intelligence, finance, biotechnology, music and reading. Recently, the book that left the strongest impression on me was The Almanack of Naval Ravikant. I also make YouTube tutorials on web automation development, which have accumulated more than 100,000 views. Finally, I am very passionate about playing basketball and volleyball, and always enjoy a stimulating conversation!



Portfolio

Unravelling the truth behind COVID-19

Research Programming

1. Identified sampling bias in SARS-CoV-2 sequence collection by analyzing and visualizing COVID-19 data via Python & R Shiny.

2. Devised a novel representative sampling strategy based on scientific deductions about COVID-19 and implemented a data pipeline in Python and Perl which reduced sampling bias during SARS-CoV-2 sequence selection (n > 2 million) by around 100%.

Algorithmic parallelization for Evolutionary Analysis

Research

Created an asynchronous parallelization method for the Markov chain Monte Carlo (MCMC) algorithm used in Bayesian evolutionary inference, which reduced computational run-times by more than 2900% (~84 days).

Exploratory Analysis of Rhesus Macaque Data

Research Data Science

1. Pinpointed around 50 important genomic factors, out of more than 30,000, related to glaucoma in rhesus macaques via R by employing dimensionality reduction, data wrangling, and statistical testing techniques on noisy biological datasets (RNA sequencing & metabolomics) with high dimensionality and multicollinearity.

2. Generated scientific figures using data visualization libraries in R which communicated key research findings from exploratory data analysis to external institutions, leading to the receipt of grants worth more than $100,000.

Regression analysis on clinical data for heart disease

Data Science

Engineered logistic and lasso regression predictive models in R for a clinical dataset, covering data preprocessing, data visualization, feature selection and model evaluation, and reached ~90% accuracy when evaluated with cross-validation.
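As a rough illustration of the workflow described above, here is a minimal Python sketch of lasso-penalized logistic regression scored with k-fold cross-validation. It is not the original R code: the function names, the proximal-gradient fitting routine, the hyperparameters and the toy dataset are all assumptions made for this example, and the clinical data is not reproduced.

```python
import math
import random


def sigmoid(z):
    # Numerically stable logistic function.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def soft_threshold(w, t):
    # Proximal operator of the L1 penalty: shrink w toward zero by t.
    if w > t:
        return w - t
    if w < -t:
        return w + t
    return 0.0


def fit_l1_logistic(X, y, lam=0.01, lr=0.1, epochs=300):
    """Fit lasso-penalized logistic regression by proximal gradient descent."""
    n, p = len(X), len(X[0])
    w, b = [0.0] * p, 0.0
    for _ in range(epochs):
        grad_w, grad_b = [0.0] * p, 0.0
        for xi, yi in zip(X, y):
            err = sigmoid(sum(wj * xj for wj, xj in zip(w, xi)) + b) - yi
            grad_b += err
            for j in range(p):
                grad_w[j] += err * xi[j]
        b -= lr * grad_b / n  # the intercept is not penalized
        # Gradient step on the log-loss, then soft-threshold each weight.
        w = [soft_threshold(wj - lr * gj / n, lr * lam) for wj, gj in zip(w, grad_w)]
    return w, b


def predict(X, w, b):
    return [1 if sigmoid(sum(wj * xj for wj, xj in zip(w, xi)) + b) >= 0.5 else 0 for xi in X]


def cross_val_accuracy(X, y, k=5, seed=0, **fit_kwargs):
    """Mean held-out accuracy over k shuffled folds."""
    idx = list(range(len(X)))
    random.Random(seed).shuffle(idx)
    folds = [idx[i::k] for i in range(k)]
    scores = []
    for fold in folds:
        held_out = set(fold)
        train = [i for i in idx if i not in held_out]
        w, b = fit_l1_logistic([X[i] for i in train], [y[i] for i in train], **fit_kwargs)
        preds = predict([X[i] for i in fold], w, b)
        scores.append(sum(p == y[i] for p, i in zip(preds, fold)) / len(fold))
    return sum(scores) / k


def make_toy_data(n=200, seed=1):
    # Hypothetical data: only the first of 5 features carries signal.
    rng = random.Random(seed)
    X = [[rng.gauss(0, 1) for _ in range(5)] for _ in range(n)]
    y = [1 if xi[0] + 0.3 * rng.gauss(0, 1) > 0 else 0 for xi in X]
    return X, y
```

Calling `cross_val_accuracy(*make_toy_data(), lam=0.05)` returns the mean held-out accuracy on the toy data. The L1 penalty is what performs feature selection here, since it pushes the weights of uninformative features toward exactly zero.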

Battleship

Programming

Battleship game with single-player (against the computer) and two-player modes, built in Java with JUnit test cases.

Bayesian Inference of Zero-Inflated Dataset

Data Science

Programmed custom Bayesian statistical models in R using OpenBUGS for zero-inflated datasets, approximating the posterior with a Gibbs sampling algorithm implemented from scratch.
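To give a sense of what a from-scratch Gibbs sampler for zero-inflated data can look like, here is a minimal Python sketch. It is an illustration rather than the original R/OpenBUGS models: the zero-inflated Poisson likelihood, the Beta and Gamma priors, and all function names are assumptions, since the project description does not specify them.

```python
import math
import random


def sample_poisson(rng, lam):
    # Knuth's multiplication method; adequate for small rates.
    limit, k, prod = math.exp(-lam), 0, rng.random()
    while prod > limit:
        k += 1
        prod *= rng.random()
    return k


def simulate_zip(rng, n, pi, lam):
    # With probability pi emit a structural zero, otherwise a Poisson(lam) count.
    return [0 if rng.random() < pi else sample_poisson(rng, lam) for _ in range(n)]


def gibbs_zip(y, n_iter=2000, burn_in=500, seed=0, a=1.0, b=1.0, c=1.0, d=1.0):
    """Gibbs sampler for a zero-inflated Poisson model.

    Assumed priors: pi ~ Beta(a, b), lam ~ Gamma(shape=c, rate=d).
    Returns posterior-mean estimates (pi, lam).
    """
    rng = random.Random(seed)
    n = len(y)
    n_zero = sum(1 for v in y if v == 0)
    total = sum(y)  # structural zeros contribute nothing to this sum
    pi, lam = 0.5, max(total / n, 0.1)
    pi_sum = lam_sum = 0.0
    for it in range(n_iter):
        # 1. Latent indicators: each observed zero is structural with this probability.
        p_struct = pi / (pi + (1.0 - pi) * math.exp(-lam))
        n_struct = sum(1 for _ in range(n_zero) if rng.random() < p_struct)
        # 2. pi | z ~ Beta(a + #structural, b + #Poisson-generated).
        pi = rng.betavariate(a + n_struct, b + n - n_struct)
        # 3. lam | z, y ~ Gamma(shape c + sum of counts, rate d + #Poisson-generated).
        #    random.gammavariate takes a scale parameter, hence the reciprocal.
        lam = rng.gammavariate(c + total, 1.0 / (d + n - n_struct))
        if it >= burn_in:
            pi_sum += pi
            lam_sum += lam
    kept = n_iter - burn_in
    return pi_sum / kept, lam_sum / kept
```

For example, `gibbs_zip(simulate_zip(random.Random(42), 2000, 0.3, 4.0))` should return estimates close to the simulated values of 0.3 and 4.0. The sampler alternates between imputing which zeros are structural and drawing each parameter from its conjugate conditional distribution.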


See more projects on GitHub

Blog


No posts at the moment.