Sahil Sangani - Data Science Portfolio

About Me

I'm a Statistics major with minors in Computer Science and Data Science at the University of Illinois Urbana-Champaign, passionate about leveraging data to solve real-world problems. With a strong foundation in statistical modeling, machine learning, and data visualization, I'm actively seeking data science internship opportunities where I can apply my analytical skills and continue learning.

My academic journey has equipped me with hands-on experience in building predictive models, conducting exploratory data analysis, and deriving actionable insights from complex datasets. I'm particularly interested in applying statistical methods and machine learning techniques to domains like sports analytics, manufacturing optimization, and agricultural technology.

As a Project Lead at the Illinois Data Science Club and a member of the NeuroTech Club, I'm constantly expanding my knowledge and collaborating with peers on innovative projects. I'm eager to contribute to a forward-thinking team while gaining practical industry experience.

University of Illinois Urbana-Champaign

B.S. in Statistics | Minors: Computer Science, Data Science | Expected: May 2027

Relevant Coursework: Statistical Modeling, Machine Learning, Modeling and Learning in Data Science, Stochastic Processes, Database Systems, Intro to Data Mining, Abstract Linear Algebra, Data Ethics, Statistical Programming Methods

Experience

Data & Web Solutions Developer (Project Based)

AGWISE | Remote

Sep 2025 – Present

Used multiple LLMs and automated queries to identify and extract third-party research, lab, field, and testimonial documents for agricultural biological companies
Trained and deployed an NLP-based sentiment analysis pipeline to label outcomes and drive citation-based company rankings

Data Challenge Participant

Sandia National Laboratories | Champaign, IL

Nov 2025

Analyzed real pre-production manufacturing data for Sandia using statistical tests and regression to identify drivers of scrap rate
Built and evaluated a logistic model (F1 = 0.93) to quantify effects of powder type, layout, artifacts, and location
Delivered recommendations showing optimal build conditions to reduce scrap and support Sandia's internal analysis

Assistant Tutor

University of Illinois Urbana-Champaign | Champaign, IL

Jan 2024 – May 2025

Delivered Java programming lessons, simplifying complex concepts for students with diverse technical backgrounds
Collaborated with professors, tutors, and students to enhance communication across technical and non-technical audiences
Managed online platforms, facilitated peer programming sessions, and supervised a four-week machine project

Leadership & Involvement

Project Lead

Illinois Data Science Club (iDSC) | UIUC

Spring 2026 – Present

Leading and mentoring 6-7 project teams in end-to-end data science initiatives, from idea generation to final presentation
Providing technical guidance on machine learning methodologies, data analysis techniques, and best practices
Overseeing project timelines, facilitating team collaboration, and ensuring deliverable quality across multiple concurrent projects
Leveraging experience from winning 1st place in the Data Dive competition to mentor teams on competition strategy and presentation skills

Member & Data Dive Competition Winner

Illinois Data Science Club (iDSC) | UIUC

Fall 2025

Won 1st place among 50 teams in the 8-week Data Dive competition, predicting MLB player WAR using PCA and Gradient Boosting
Presented project findings to panel of academic and industry judges, demonstrating strong communication and technical skills
Collaborated with club members on data science projects and participated in technical workshops
Engaged in peer learning sessions focused on machine learning, statistical modeling, and data visualization

General Member

NeuroTech Club | UIUC

Fall 2025 – Present

Exploring the intersection of neuroscience and technology through hands-on projects and workshops
Learning about brain-computer interfaces, neural data analysis, and applications of machine learning in neuroscience
Collaborating with interdisciplinary team members from computer science, neuroscience, and engineering backgrounds

Featured Projects

MLB Player Value Prediction

🏆 1st Place - UIUC iDSC Data Dive | 50 Teams

Led an end-to-end data science project predicting MLB player WAR (Wins Above Replacement) with PCA and Gradient Boosting, winning 1st place in a competitive 8-week challenge.

Key Achievements:

Won 1st place at UIUC Illinois Data Science Club's Data Dive competition, competing against 50 teams over 8 weeks
Led complete project lifecycle from problem formulation and data collection to model evaluation and presentation
Applied Principal Component Analysis (PCA) for dimensionality reduction, identifying key performance indicators from complex baseball statistics
Implemented Gradient Boosting model optimized for both predictive accuracy and interpretability, enabling actionable insights for team management
Presented results to panel of academic and industry judges, effectively communicating technical methodology and business value
Emphasized model interpretability to provide clear explanations of which player attributes drive value predictions
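
A minimal sketch of the kind of PCA plus Gradient Boosting pipeline described above. It is not the competition code: the data is synthetic, and the number of statistics, the number of components, and the target formula are illustrative assumptions.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for player statistics (NOT real MLB data).
rng = np.random.default_rng(0)
n_players, n_stats = 300, 20
X = rng.normal(size=(n_players, n_stats))

# Hypothetical target playing the role of WAR: a few columns plus noise.
y = 2.0 * X[:, 0] - 1.0 * X[:, 1] + 0.5 * X[:, 2] + rng.normal(scale=0.5, size=n_players)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)

# Standardize, reduce to 10 principal components, then fit the boosted trees.
model = make_pipeline(
    StandardScaler(),
    PCA(n_components=10),
    GradientBoostingRegressor(random_state=0),
)
model.fit(X_train, y_train)

predictions = model.predict(X_test)
r2 = model.score(X_test, y_test)  # R^2 on the held-out players
explained = model.named_steps["pca"].explained_variance_ratio_

print(f"Held-out R^2: {r2:.3f}")
print(f"Variance kept by 10 components: {explained.sum():.1%}")
```

Keeping the scaler and PCA inside the pipeline means they are fit on the training split only, so nothing from the held-out players leaks into the transformation.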

Python PCA Gradient Boosting Scikit-learn Sports Analytics

Fourth-Down Decision Modeling

🏈 AUC: 0.68 | NFL Analytics

Built predictive models for fourth-down conversion success using NFL play data, comparing neural networks and logistic regression to evaluate model complexity trade-offs.

Key Achievements:

Developed comparable models achieving AUC ~0.68 using both neural networks and logistic regression
Analyzed real-world NFL play-by-play data to identify factors influencing fourth-down conversion success
Conducted comprehensive model comparison, finding limited performance gains from increased complexity on noisy real-world data
Demonstrated practical understanding of model selection trade-offs: interpretability vs. complexity
Applied insights relevant to NFL coaching decisions and game strategy optimization
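
For context on the AUC metric reported above, here is a small sketch of how it can be computed from scratch: AUC is the probability that a randomly chosen positive example receives a higher score than a randomly chosen negative one. The outcomes and scores below are made up for illustration and are not from the NFL data.

```python
def roc_auc(labels, scores):
    """AUC as P(random positive outscores random negative); ties count half."""
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    if not positives or not negatives:
        raise ValueError("AUC needs at least one positive and one negative example")

    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positives) * len(negatives))


# Hypothetical outcomes (1 = converted) and model scores, for illustration only.
converted = [1, 0, 1, 1, 0, 0, 1, 0]
predicted = [0.9, 0.4, 0.5, 0.3, 0.6, 0.2, 0.8, 0.5]

auc = roc_auc(converted, predicted)
print(auc)  # 0.71875
```

An AUC of 0.5 corresponds to random ranking and 1.0 to perfect ranking, which is why a value near 0.68 from both model families points to noise in the data rather than a shortfall in model capacity.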

Python Neural Networks Logistic Regression PyTorch NFL Data

Student Depression Risk Prediction

🎯 AUC: 0.936 | Team Collaboration

A machine learning project focused on mental health risk assessment for students, combining statistical analysis with predictive modeling to identify at-risk individuals.

Key Achievements:

Achieved an AUC of 0.936, indicating strong discriminative ability between risk categories
Performed comprehensive feature engineering on student survey data, transforming raw responses into meaningful predictors including sleep patterns, financial stress indicators, and academic pressure metrics
Conducted in-depth coefficient analysis to identify the most significant risk factors, providing actionable insights for intervention strategies
Collaborated effectively in a team of four, coordinating data preprocessing, model training, and validation tasks
Designed model with real-world application potential for school counseling services and student support programs

Python Scikit-learn Logistic Regression Pandas Feature Engineering

Real-Time MLB Pitch Classification

⚾ 98.6% Accuracy | Sports Analytics

An advanced sports analytics project that leverages Statcast data to classify baseball pitches in real time, demonstrating the intersection of machine learning and sports technology.

Key Achievements:

Developed a pitch classification model achieving 98.6% accuracy using the K-Nearest Neighbors algorithm on Kevin Gausman's Statcast data
Engineered sophisticated features including release speed, spin rate (rpm), horizontal and vertical movement, and release point coordinates to capture pitch characteristics
Created comprehensive data visualizations using Seaborn to demonstrate pitch separability and model decision boundaries, enhancing interpretability for baseball analysts
Built modular, scalable framework that can be adapted for any MLB pitcher by simply substituting the dataset, making it practical for broadcast integration
Optimized hyperparameters through cross-validation to ensure model generalization across different game situations
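
The core idea behind K-Nearest Neighbors can be sketched in a few lines of standard-library Python. This is an illustration, not the project code: the pitches, the two features, and the pitch-type labels are hand-made assumptions, not Statcast data.

```python
import math
from collections import Counter


def knn_predict(train_points, train_labels, query, k=3):
    """Label `query` by majority vote among its k nearest training points."""
    distances = sorted(
        (math.dist(point, query), label)
        for point, label in zip(train_points, train_labels)
    )
    votes = Counter(label for _, label in distances[:k])
    return votes.most_common(1)[0][0]


# Hypothetical pitches: (release speed in mph, spin rate in hundreds of rpm).
# With real features on different scales, standardize before measuring distance.
pitches = [
    (95.0, 23.0), (94.5, 22.5), (96.0, 23.5),
    (85.0, 15.0), (84.0, 14.5), (86.0, 15.5),
]
labels = ["fastball", "fastball", "fastball", "splitter", "splitter", "splitter"]

prediction = knn_predict(pitches, labels, (94.0, 22.0), k=3)
print(prediction)  # fastball
```

Because the classifier only stores labeled examples and compares distances, swapping in another pitcher's data requires no change to the code itself.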

Python K-Nearest Neighbors Seaborn Statcast API Sports Analytics

Tumor Size Prediction and Lifestyle Impact Modeling

📊 R² = 0.68 | Statistical Modeling in R

A data-driven analysis exploring how lifestyle and demographic factors influence tumor size using regression and ANOVA modeling in R.

Key Achievements:

Built multiple linear regression and two-way ANOVA models to analyze relationships between lifestyle factors and tumor size
Achieved R² = 0.68, explaining 68% of tumor size variability through behavioral and demographic predictors
Designed a dynamic, templatized R Markdown pipeline that automatically updates analyses and visualizations when new datasets or variables are used
Created interactive visualizations with ggplot2 to highlight main and interaction effects among predictors
Simulated lifestyle-change scenarios to assess predicted health improvements

R R Markdown Linear Regression ANOVA ggplot2

Exploratory Data & Regression Projects

📊 R² = 0.83 | Multi-Domain Analysis

A collection of exploratory data analysis and regression modeling projects across different domains, showcasing versatility in data science techniques and problem-solving approaches.

Key Achievements:

Analyzed college tuition dataset to understand pricing patterns, exploring relationships between institution characteristics, location, and tuition costs
Conducted used car price analysis, investigating how factors like mileage, age, brand, and features influence market value
Created publication-quality visualizations using Matplotlib and Seaborn, effectively communicating insights
Built interpretable linear regression models with strong predictive performance (R² = 0.83)
Practiced comprehensive data preprocessing including handling missing values, outlier detection, and feature scaling
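
As a reference for the R² figure above, this sketch fits a one-variable least-squares line and computes R² by hand. The mileage and price values are hypothetical and chosen only to keep the arithmetic easy to follow; they are not from the project datasets.

```python
def fit_line(x, y):
    """Ordinary least squares for y = intercept + slope * x."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


def r_squared(y, y_hat):
    """Share of the variance in y explained by the fitted values y_hat."""
    mean_y = sum(y) / len(y)
    ss_res = sum((yi - hi) ** 2 for yi, hi in zip(y, y_hat))
    ss_tot = sum((yi - mean_y) ** 2 for yi in y)
    return 1.0 - ss_res / ss_tot


# Hypothetical used-car data: mileage (thousands of miles), price (thousands of dollars).
mileage = [10, 20, 30, 40, 50]
price = [20, 18, 15, 13, 10]

slope, intercept = fit_line(mileage, price)
fitted = [intercept + slope * m for m in mileage]
r2 = r_squared(price, fitted)

print(f"price = {intercept:.2f} + ({slope:.2f}) * mileage, R^2 = {r2:.3f}")
```

Here the slope reads directly as the price change per additional thousand miles, which is the kind of interpretability a linear model offers.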

Python Pandas Matplotlib Linear Regression EDA

Technical Skills

💻 Programming Languages

Python

Advanced

My primary language for data science and programming. I've developed strong proficiency through extensive coursework, personal projects, and structured learning programs.

Learning Path: Coursework → Personal Projects → CodePath DSA Advanced Certificate

Experience: I use Python for everything from exploratory data analysis to building production-ready machine learning models. Comfortable with data manipulation, statistical analysis, machine learning pipelines, and algorithm implementation.

Key Libraries: Pandas (data manipulation), NumPy (numerical computing), Scikit-learn (machine learning), PyTorch (deep learning), Statsmodels (statistical modeling), Seaborn & Matplotlib (visualization)

Data Analysis Machine Learning DSA (Advanced) Statistical Computing Model Development CodePath Certified

R

Intermediate

I learned R through statistics coursework and have developed strong skills in statistical analysis and data visualization. While I prefer Python for most data science tasks, I particularly appreciate R's visualization capabilities.

Learning Path: Statistics Coursework → R Markdown for Assignments → Advanced Visualization

Experience: Proficient in using ggplot2 for creating publication-quality visualizations. Extensive experience with R Markdown for creating reproducible research documents, which also helped me become proficient in LaTeX. Later expanded to Quarto for working with .ipynb files, enabling seamless integration between R and Python workflows.

Key Libraries: tidyverse (data manipulation), ggplot2 (visualization), statistical modeling packages

ggplot2 Visualizations R Markdown Statistical Analysis LaTeX Quarto tidyverse

SQL

Intermediate

I taught myself SQL through online platforms and practical problem-solving. I'm comfortable with database querying, data manipulation, and writing efficient queries for data extraction and analysis.

Learning Path: Codecademy SQL Course → HackerRank Practice Problems → Real Project Applications

Experience: Proficient in writing complex SELECT queries, JOINs, subqueries, and aggregate functions. Experience with database design concepts and query optimization. Continuously practicing through HackerRank challenges to strengthen problem-solving skills.

Complex Queries JOINs & Subqueries Data Extraction Codecademy HackerRank Self-Taught

C++

Advanced

I learned C++ through rigorous coursework focused on data structures and algorithms, and have built multiple projects that apply systems programming concepts in practice.

Learning Path: DSA Coursework → Multiple C++ Projects

Experience: Completed a comprehensive Data Structures and Algorithms course taught in C++, gaining hands-on experience with memory management, pointers, templates, and STL containers. Built several projects that showcase algorithmic thinking and efficient code implementation. Comfortable with low-level programming concepts and performance optimization.

Data Structures Algorithms Memory Management STL Systems Programming

Java

Intermediate

Java was my first programming language and remains a strong foundation of my technical skills. I have extensive experience from both academic work and practical application.

Learning Path: First Programming Language → Coursework → Android Development → 3 Semesters of Tutoring

Experience: Developed an Android app using Android Studio, demonstrating mobile development capabilities. As an Assistant Tutor for three semesters at UIUC, I taught Java programming to students with diverse backgrounds, which deepened my understanding of core concepts like OOP, data structures, and software design patterns. This teaching experience honed my ability to explain complex technical concepts clearly.

Object-Oriented Programming Android Studio Mobile Development Teaching Experience Data Structures

🛠️ Tools & Technologies

SAS

Learning

Currently preparing to learn SAS as part of upcoming coursework next semester. Eager to expand my statistical programming toolkit with industry-standard software.

Learning Path: Upcoming Coursework (Spring 2026)

Anticipated Focus: Statistical analysis, data management, and reporting using SAS programming. Looking forward to applying SAS in real-world data analytics scenarios and adding enterprise-level statistical software to my skill set.

Upcoming Coursework Statistical Software Data Analytics

Tableau

Beginner

Tableau is a business intelligence and data visualization tool that I learned through peer mentorship.

Learning Path: Learned from a Friend → Self-Practice

Experience: Familiar with creating interactive dashboards and visualizations. Currently building practical experience and planning future projects to showcase Tableau capabilities in presenting data insights to non-technical audiences.

Dashboards Data Visualization Business Intelligence

Development Tools

Advanced

Proficient with essential development tools learned through coursework and hands-on project work.

Learning Path: Coursework → Project Applications → Daily Use

Tools: Git/GitHub (version control), Jupyter Notebook (interactive development), VS Code (primary IDE), RStudio (R development), Docker (containerization), Excel (data analysis)

Experience: Comfortable with version control workflows, collaborative coding, and using modern IDEs for efficient development. Regular use of these tools across multiple projects has made them integral to my development process.

Git & GitHub Jupyter Notebook VS Code Docker Excel
