MoneyTree
March 2025
Python, TensorFlow, Scikit-learn, Streamlit, Pandas, BeautifulSoup, NumPy, Requests, VADER, Gemini API, YFinance API, Git/GitHub
- Conceptualized and built an end-to-end fintech product addressing beginner-investor pain points, delivering an AI-powered personal finance assistant with personalized investment recommendations and 2-year asset price predictions.
- Conducted user research and market analysis to identify user needs (risk tolerance, financial goals, investing knowledge), and designed a matching algorithm that connects each user to their top 3 investment opportunities from 1,600+ curated assets.
- Led product development from ideation to launch, defining product requirements, user experience flows, and technical specifications while collaborating with the development team to deliver an award-winning solution.
Cinephile
May 2025
Python, SQL (MySQL), NoSQL (MongoDB), NLP (Gemini API), TMDb API, Data Wrangling, Prompt Engineering, Schema Design, Git/GitHub
- Developed a natural language query engine that translates user requests into SQL and MongoDB queries, enabling structured search across movie and TV datasets.
- Engineered relational and NoSQL schemas optimized for media data (e.g., titles, genres, ratings, cast, streaming platforms), supporting efficient retrieval of complex queries.
- Collected and wrangled large-scale datasets from the TMDb API, transforming unstructured JSON responses into structured SQL tables and NoSQL documents.
- Applied prompt engineering with Gemini LLM to improve accuracy of automatically generated queries, overcoming challenges with joins, aggregations, and nested lookups.
- Strengthened expertise in database design, data pipelines, and NLP-driven query automation.
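As an illustration of the kind of join-plus-aggregation query such an engine has to generate, the following is a minimal sketch that uses Python's built-in sqlite3 module in place of MySQL. The table and column names (titles, genres, title_genres) and the sample rows are hypothetical, not the project's actual schema or data.

```python
import sqlite3

# Hypothetical schema: SQLite stands in for MySQL, and all names are illustrative.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE titles (id INTEGER PRIMARY KEY, name TEXT NOT NULL, rating REAL);
CREATE TABLE genres (id INTEGER PRIMARY KEY, name TEXT NOT NULL UNIQUE);
CREATE TABLE title_genres (
    title_id INTEGER REFERENCES titles(id),
    genre_id INTEGER REFERENCES genres(id),
    PRIMARY KEY (title_id, genre_id)
);
""")

# Made-up sample rows.
conn.executemany("INSERT INTO titles VALUES (?, ?, ?)",
                 [(1, "Title A", 8.0), (2, "Title B", 6.0), (3, "Title C", 9.0)])
conn.executemany("INSERT INTO genres VALUES (?, ?)", [(1, "Drama"), (2, "Comedy")])
conn.executemany("INSERT INTO title_genres VALUES (?, ?)", [(1, 1), (2, 1), (3, 2)])

# A request such as "average rating per genre" needs a join through the
# link table plus a GROUP BY aggregation.
AVG_RATING_BY_GENRE = """
SELECT g.name, AVG(t.rating) AS avg_rating
FROM genres AS g
JOIN title_genres AS tg ON tg.genre_id = g.id
JOIN titles AS t ON t.id = tg.title_id
GROUP BY g.name
ORDER BY g.name
"""
rows = conn.execute(AVG_RATING_BY_GENRE).fetchall()
print(rows)
```

The many-to-many link table is one common way to model titles and genres relationally; a document store would more likely embed the genre list inside each title document.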
Housing Price Predictor
May 2025
Python, Pandas, Scikit-learn, Matplotlib, NumPy, Linear Regression, Data Visualization, Feature Engineering
- Built an interactive ML application that filters housing data by user preferences, predicts prices using linear regression, and reports R-squared as the goodness-of-fit metric.
- Implemented a data preprocessing pipeline with one-hot encoding for categorical variables and multi-criteria filtering based on square footage, bedrooms, bathrooms, year built, and neighborhood.
- Developed the predictive model in scikit-learn using a train-test split, evaluating performance on the held-out data with R-squared.
- Created comprehensive data visualizations displaying actual vs. predicted housing prices with scatter plots and rolling average trend lines for model performance analysis.
- Designed an end-to-end data science workflow from user input validation to automated visualization generation for real estate price analytics.
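The workflow described above (filter, one-hot encode, split, fit, report R-squared) might look roughly like the sketch below. The data is synthetic, and the column names, filter thresholds, and price formula are assumptions made for illustration; none of them come from the project itself.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in data; the real dataset and its columns are not shown here.
rng = np.random.default_rng(0)
n = 200
houses = pd.DataFrame({
    "sqft": rng.integers(800, 3500, size=n),
    "bedrooms": rng.integers(1, 6, size=n),
    "neighborhood": rng.choice(["North", "South", "East"], size=n),
})
offset = houses["neighborhood"].map({"North": 50_000, "South": 20_000, "East": 0})
houses["price"] = (150 * houses["sqft"] + 10_000 * houses["bedrooms"]
                   + offset + rng.normal(0, 5_000, size=n))

# Multi-criteria filter on user preferences (thresholds are made up).
subset = houses[(houses["sqft"] >= 1000) & (houses["bedrooms"] >= 2)]

# One-hot encode the categorical column, then split and fit.
X = pd.get_dummies(subset.drop(columns="price"), columns=["neighborhood"],
                   drop_first=True, dtype=float)
y = subset["price"]
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0)

model = LinearRegression().fit(X_train, y_train)
r2 = r2_score(y_test, model.predict(X_test))
print(f"Held-out R-squared: {r2:.3f}")
```

R-squared is computed on the held-out split, so it reflects fit on data the model did not see during training.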
Determinants of Adult Income: A Longitudinal Analysis
December 2024
Python, STATA, Econometrics, Multiple Linear Regression, Longitudinal Data Analysis, Statistical Modeling, Data Visualization, Hypothesis Testing
- Analyzed longitudinal data from 8,984 respondents over 24 years using multiple linear regression to identify childhood predictors of adult income, reaching an adjusted R-squared of 0.218.
- Applied backward selection to choose model variables systematically, examining relationships between family factors and adult earnings.
- Conducted statistical analysis revealing significant income disparities by gender ($9,975 gap) and race ($5,120 gap), each significant at p < 0.001.
- Engineered quadratic features for parental education variables to capture diminishing returns effects and improve model explanatory power.
- Translated complex statistical findings into actionable insights, demonstrating ability to communicate data-driven results for policy and business applications.
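To show why a squared term captures diminishing returns, here is a small sketch on noise-free synthetic data. The coefficients and the education range are made up for illustration and are not the study's estimates.

```python
import numpy as np

# Made-up coefficients for illustration only; these are not the study's estimates.
b0, b1, b2 = 10.0, 4.0, -0.1
years = np.arange(0, 21, dtype=float)      # hypothetical parental education, in years
income = b0 + b1 * years + b2 * years**2   # noise-free synthetic outcome

# Fit income = c0 + c1*x + c2*x^2; polyfit returns the highest-degree coefficient first.
c2, c1, c0 = np.polyfit(years, income, deg=2)

def marginal_effect(x):
    """Change in fitted income per extra year of education at level x."""
    return c1 + 2 * c2 * x

# A negative squared term means each additional year adds less than the one before;
# the fitted curve peaks where the marginal effect reaches zero.
turning_point = -c1 / (2 * c2)
print(marginal_effect(5.0), marginal_effect(15.0), turning_point)
```

With a negative coefficient on the squared term, the marginal effect shrinks as education rises, which is the diminishing-returns pattern the quadratic feature is meant to capture.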
Exoplanet Candidate Classification
May 2025
Python, Scikit-learn, Pandas, Matplotlib, PCA, Cross-Validation, Classification Models, Feature Engineering, Model Optimization
- Designed an end-to-end ML workflow including data preprocessing, dimensionality reduction analysis, automated hyperparameter tuning, and model performance visualization for astronomical data classification.
- Built a multi-algorithm classification system comparing Logistic Regression, KNN, Decision Tree, and SVM models to predict exoplanet candidates using NASA Kepler mission data with orbital and stellar parameters.
- Implemented an automated model selection pipeline using RandomizedSearchCV and GridSearchCV for hyperparameter optimization, comparing PCA vs. non-PCA feature sets to maximize classification accuracy.
- Performed comprehensive model evaluation using confusion matrices, classification reports, and cross-validation techniques to assess precision, recall, and F1-scores across different exoplanet classification categories.
- Conducted feature correlation analysis to identify the most strongly correlated predictors, finding orbital inclination to be the variable most highly correlated with exoplanet detection probability.
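The PCA vs. non-PCA comparison described above could be sketched as follows. This is a simplified illustration on synthetic data from make_classification, not the Kepler dataset, and it tunes only Logistic Regression. The helper build_pipeline, the parameter grid, the choice of F1 as the scoring metric, and the number of PCA components are all assumptions.

```python
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the Kepler features and labels.
X, y = make_classification(n_samples=400, n_features=12, n_informative=6,
                           random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0, stratify=y)

def build_pipeline(use_pca):
    """Hypothetical helper: scale, optionally reduce with PCA, then classify."""
    steps = [("scale", StandardScaler())]
    if use_pca:
        steps.append(("pca", PCA(n_components=6)))
    steps.append(("clf", LogisticRegression(max_iter=1000)))
    return Pipeline(steps)

results = {}
for label, use_pca in [("with_pca", True), ("without_pca", False)]:
    # Grid search over regularization strength with 5-fold cross-validation.
    search = GridSearchCV(build_pipeline(use_pca), {"clf__C": [0.1, 1.0, 10.0]},
                          cv=5, scoring="f1")
    search.fit(X_train, y_train)
    results[label] = {
        "best_params": search.best_params_,
        "cv_f1": search.best_score_,
        "test_f1": search.score(X_test, y_test),
    }

for label, summary in results.items():
    print(label, summary)
```

Placing the scaler and PCA inside the pipeline keeps both steps fitted only on each training fold, which avoids leaking held-out data into the cross-validation scores.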
