Research Projects

Projects spanning retrieval-augmented generation, medical imaging, second-order optimization, NLP, and graph neural networks — all with detailed technical reports.

Systems & Data Engineering

E-Commerce Behavior Analytics Platform

Featured

A high-performance analytics platform that turns 385M+ raw e-commerce events into sub-second business intelligence. Three-tier cloud-native stack: React 18 + Material-UI frontend on Netlify, FastAPI backend on Google Cloud Run, PostgreSQL 14 with star schema on Cloud SQL.

300–600× faster · <1s queries · 385M events · 52GB dataset

Monthly partitioning (7 partitions) reduced scan size by 85% · 5 materialized views for pre-computed aggregations · B-tree indexing on product_id, user_session, event_type. Storage overhead: ~35% (~$3/month on Google Cloud SQL).
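The monthly partitioning can be illustrated with a small generator for PostgreSQL declarative-partitioning DDL. This is a sketch only: the table name `fact_events`, the start month, and the idea that the parent table is declared with `PARTITION BY RANGE` on a timestamp column are assumptions, not details taken from the project.

```python
from datetime import date


def month_starts(first: date, count: int) -> list[date]:
    """Return count + 1 consecutive month-start dates beginning at `first`."""
    starts = []
    year, month = first.year, first.month
    for _ in range(count + 1):
        starts.append(date(year, month, 1))
        month += 1
        if month > 12:
            year, month = year + 1, 1
    return starts


def partition_ddl(parent: str, first: date, count: int) -> list[str]:
    """Build one CREATE TABLE ... PARTITION OF statement per month."""
    bounds = month_starts(first, count)
    statements = []
    for lo, hi in zip(bounds, bounds[1:]):
        child = f"{parent}_{lo:%Y_%m}"
        # Range partitions include the lower bound and exclude the upper bound.
        statements.append(
            f"CREATE TABLE {child} PARTITION OF {parent} "
            f"FOR VALUES FROM ('{lo.isoformat()}') TO ('{hi.isoformat()}');"
        )
    return statements


# Hypothetical parent table and start month; seven partitions as in the text.
for stmt in partition_ddl("fact_events", date(2019, 10, 1), 7):
    print(stmt)
```

With range partitions like these, a query filtered to one month only needs to scan that month's child table, which is the mechanism behind the reduced scan size.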

PostgreSQL 14 · FastAPI · React 18 · Star Schema · Material-UI · Recharts · Google Cloud SQL · Cloud Run · Docker

Live Demo · GitHub · Report

Generative & Vision

Medical Image Enhancement (Pix2Pix)

Implemented and extended the Pix2Pix conditional GAN for automated chest X-ray enhancement. Built a synthetic degradation pipeline (Gaussian noise σ=15, 3×3 blur, JPEG quality 50) to produce paired training data from NIH ChestX-ray14 (4,999 frontal radiographs). Extended the generator with SAGAN-style self-attention modules at the U-Net bottleneck.
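A degradation step with the stated parameters might look like the sketch below. The order of operations (blur, then noise, then JPEG), the choice of a box blur for the 3×3 kernel, and the function names `degrade` and `psnr` are assumptions made for illustration; the project's actual pipeline may differ.

```python
import io

import numpy as np
from PIL import Image, ImageFilter


def degrade(clean: np.ndarray, sigma: float = 15.0, jpeg_quality: int = 50,
            seed: int = 0) -> np.ndarray:
    """Degrade a 2-D uint8 grayscale image: 3x3 blur, Gaussian noise, JPEG round trip."""
    rng = np.random.default_rng(seed)
    # BoxBlur(1) averages each pixel over its 3x3 neighbourhood.
    blurred = Image.fromarray(clean).filter(ImageFilter.BoxBlur(1))
    # Additive Gaussian noise on the 0-255 intensity scale.
    noisy = np.asarray(blurred, dtype=np.float32) + rng.normal(0.0, sigma, clean.shape)
    noisy = np.clip(noisy, 0, 255).astype(np.uint8)
    # JPEG artifacts via an in-memory encode/decode.
    buffer = io.BytesIO()
    Image.fromarray(noisy).save(buffer, format="JPEG", quality=jpeg_quality)
    buffer.seek(0)
    return np.asarray(Image.open(buffer).convert("L"))


def psnr(reference: np.ndarray, test: np.ndarray, peak: float = 255.0) -> float:
    """Peak signal-to-noise ratio in dB between two images of equal shape."""
    mse = np.mean((reference.astype(np.float64) - test.astype(np.float64)) ** 2)
    return float("inf") if mse == 0 else float(10.0 * np.log10(peak ** 2 / mse))


# Synthetic horizontal gradient standing in for a radiograph.
clean = np.tile(np.linspace(0, 255, 256).astype(np.uint8), (256, 1))
degraded = degrade(clean)
print(degraded.shape, degraded.dtype, round(psnr(clean, degraded), 2))
```

Each (degraded, clean) pair then serves as one (input, target) example for the conditional GAN.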

PSNR 39.97 dB · SSIM 0.9755 · 200 epochs · Tesla T4 (16GB)

Key finding: self-attention added 2.5M parameters and 50% training overhead but did not improve metrics. X-ray enhancement is a local operation that is already well served by U-Net skip connections.

PyTorch · Pix2Pix / cGAN · U-Net · PatchGAN · Self-Attention · NIH ChestX-ray14 · PSNR / SSIM

Detailed Report

Optimization & Theory

nlTGCR: Second-Order Optimizer

Designed a scalable second-order optimization algorithm that uses the Fisher Information Matrix (FIM) as a symmetric positive semi-definite approximation to the Hessian. Applied a Nyström approximation (rank-k subsampling) for cheap FIM inversion and Kronecker-factored preconditioning (K-FAC) for linear layers. Used JAX-style JIT compilation so that matrix operations run at compiled, C-level speeds.
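To show why a rank-k Nyström approximation makes FIM inversion cheap, here is a NumPy sketch that solves a damped system with only a k×k factorization, using the Woodbury identity. It is not the nlTGCR implementation: the function name, the uniform column sampling, the damping term, and the use of the empirical FIM built from per-sample gradients are all assumptions.

```python
import numpy as np


def nystrom_fim_solve(grads: np.ndarray, g: np.ndarray, k: int,
                      damping: float = 0.1, seed: int = 0) -> np.ndarray:
    """Approximately solve (F + damping * I) x = g with a rank-k Nystrom FIM.

    grads has shape (n, d): one gradient per sample, so the empirical FIM is
    F = grads.T @ grads / n. F itself is never formed.
    """
    n, d = grads.shape
    rng = np.random.default_rng(seed)
    idx = rng.choice(d, size=k, replace=False)  # sampled coordinates
    C = grads.T @ grads[:, idx] / n             # (d, k): sampled columns of F
    W = C[idx, :]                               # (k, k): F restricted to idx
    # Nystrom: F ~= C W^{-1} C^T. By the Woodbury identity,
    # (damping*I + C W^{-1} C^T)^{-1} = (I - C (damping*W + C^T C)^{-1} C^T) / damping,
    # so only a k x k linear system has to be solved.
    small = damping * W + C.T @ C
    return (g - C @ np.linalg.solve(small, C.T @ g)) / damping


rng = np.random.default_rng(1)
per_sample_grads = rng.normal(size=(64, 20))  # toy data: 64 samples, 20 parameters
gradient = rng.normal(size=20)
step = nystrom_fim_solve(per_sample_grads, gradient, k=8)
print(step.shape)
```

The cost is dominated by forming the d×k matrix and one k×k solve, rather than a d×d inversion. The sketch assumes the sampled block W is invertible, which holds when the empirical FIM has full rank.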

17× faster per epoch · +3.2% accuracy vs Adam (MLP) · 0.42s/epoch (5-layer MLP)

CIFAR-10 results: nlTGCR outperformed Adam/RMSProp on MLPs (54.52% vs 51.3%) with 17× faster epoch time. On CNNs, accuracy was comparable, since convolutional structure breaks the dense-Hessian assumption. Submitted to ICMLC '25.

PyTorch · Fisher Information Matrix · Nyström Approx. · K-FAC · JIT Compilation · CIFAR-10

Detailed Report

NLP & Summarization

PEGASUS Scientific Paper Summarizer

Abstractive summarization pipeline for arXiv papers using google/pegasus-pubmed. Built a preprocessing pipeline (URL removal, LaTeX stripping, special-character handling) that preserves domain-specific vocabulary. Fine-tuned on 1,000 papers on an A100 (40GB) with 16-bit mixed precision; summaries are generated with beam search (width 4, length penalty 0.8).
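The three preprocessing steps can be sketched with regular expressions from the standard library. The patterns and the function name `clean_text` are illustrative assumptions rather than the project's actual rules. In particular, this version drops inline math and citation commands entirely, and keeps hyphens and digits so that terms such as "IL-6" survive.

```python
import re

URL = re.compile(r"https?://\S+|www\.\S+")
MATH = re.compile(r"\$\$.*?\$\$|\$[^$]*\$", re.DOTALL)
# Commands whose argument is a key, not prose: drop command and argument.
REFERENCE_CMD = re.compile(r"\\(?:cite|ref|eqref|label)[a-zA-Z]*\*?(?:\[[^\]]*\])?\{[^{}]*\}")
# Any other command: drop the command name, keep its argument text.
COMMAND = re.compile(r"\\[a-zA-Z]+\*?")
# Everything outside this ASCII whitelist becomes a space.
DISALLOWED = re.compile(r"[^A-Za-z0-9\s.,;:!?()%/'\"+=-]")


def clean_text(text: str) -> str:
    """Strip URLs, LaTeX markup, and stray symbols from a paper's text."""
    text = URL.sub(" ", text)
    text = MATH.sub(" ", text)
    text = REFERENCE_CMD.sub(" ", text)
    text = COMMAND.sub(" ", text)
    text = text.replace("{", "").replace("}", "")
    text = DISALLOWED.sub(" ", text)
    text = re.sub(r"\s+", " ", text).strip()
    # Remove the space left in front of punctuation by earlier deletions.
    return re.sub(r"\s+([.,;:!?])", r"\1", text)


sample = (r"We use BERT \cite{devlin2019} with $\alpha=0.1$; "
          r"see https://arxiv.org/abs/1810.04805 for \textbf{IL-6} details.")
print(clean_text(sample))
```

A whitelist this strict also removes non-ASCII letters, so a real pipeline that needs Greek symbols or accented names would have to widen it.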

ROUGE-1: 0.377 · ROUGE-2: 0.126 · ROUGE-L: 0.219 · 1,000 train / 100 val / 100 test

PEGASUS · PyTorch Lightning · Hugging Face Transformers · A100 / CUDA · AdamW (lr=2e-5) · 16-bit Mixed Precision

Report · Model on Hugging Face

RAG-BioQA

Retrieval-augmented generation framework for long-form biomedical question answering on the PubMedQA dataset. Dense retrieval via BioBERT embeddings + FAISS indexing. Re-ranking pipeline comparing BM25, ColBERT, and MonoT5. Generator fine-tuned with LoRA for parameter-efficient T5 adaptation.
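Of the three re-rankers compared, BM25 is simple enough to sketch in plain Python. The version below uses the common Okapi formulation with a non-negative IDF; the parameter defaults (`k1=1.5`, `b=0.75`), the function names, and the toy passages are assumptions for illustration and are not drawn from the project or from PubMedQA.

```python
import math
from collections import Counter


def bm25_scores(query: list[str], docs: list[list[str]],
                k1: float = 1.5, b: float = 0.75) -> list[float]:
    """Score each tokenized document against the tokenized query with BM25."""
    n_docs = len(docs)
    avg_len = sum(len(d) for d in docs) / n_docs
    # Document frequency: number of documents containing each term.
    doc_freq = Counter(term for d in docs for term in set(d))
    scores = []
    for d in docs:
        tf = Counter(d)
        score = 0.0
        for term in query:
            if term not in tf:
                continue
            idf = math.log((n_docs - doc_freq[term] + 0.5) / (doc_freq[term] + 0.5) + 1.0)
            length_norm = k1 * (1.0 - b + b * len(d) / avg_len)
            score += idf * tf[term] * (k1 + 1.0) / (tf[term] + length_norm)
        scores.append(score)
    return scores


def rerank(query: list[str], docs: list[list[str]]) -> list[int]:
    """Return document indices ordered from most to least relevant."""
    scores = bm25_scores(query, docs)
    return sorted(range(len(docs)), key=lambda i: scores[i], reverse=True)


# Toy passages standing in for candidates returned by the dense retriever.
passages = [
    "aspirin reduces fever".split(),
    "statins lower cholesterol".split(),
    "aspirin and statins reduce cardiovascular risk".split(),
]
print(rerank("statins cholesterol".split(), passages))
```

In a retrieve-then-rerank setup, the dense retriever supplies the candidate passages and a scorer like this one reorders them before they are passed to the generator.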

BioBERT · FAISS · T5 + LoRA · ColBERT · MonoT5 · BM25 · PubMedQA

Graph ML

GNN Document Classification (CORA)

Document relationship modeling using Graph Neural Networks on the CORA dataset. Combined citation networks, co-authorship signals, and semantic similarity for graph construction. Implemented and compared GCN, GAT, and GraphSAGE architectures for document classification and clustering.
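For readers unfamiliar with the simplest of the three architectures, the sketch below shows one GCN propagation step in NumPy, following the standard symmetric-normalization rule. It is an illustration only, not the project's PyTorch Geometric code, and the function name and toy graph are assumptions.

```python
import numpy as np


def gcn_layer(adj: np.ndarray, feats: np.ndarray, weight: np.ndarray) -> np.ndarray:
    """One GCN layer: ReLU(D^-1/2 (A + I) D^-1/2 @ X @ W)."""
    a_hat = adj + np.eye(adj.shape[0])  # add self-loops
    deg = a_hat.sum(axis=1)             # node degrees including self-loops
    d_inv_sqrt = 1.0 / np.sqrt(deg)
    # Scale entry (i, j) by 1 / sqrt(deg_i * deg_j).
    norm_adj = a_hat * d_inv_sqrt[:, None] * d_inv_sqrt[None, :]
    return np.maximum(norm_adj @ feats @ weight, 0.0)


# Toy citation graph: three documents in a chain, 0 - 1 - 2.
adjacency = np.array([[0.0, 1.0, 0.0],
                      [1.0, 0.0, 1.0],
                      [0.0, 1.0, 0.0]])
features = np.eye(3)  # one-hot features so the output shows the mixing weights
hidden = gcn_layer(adjacency, features, np.eye(3))
print(hidden.round(3))
```

Each output row is a degree-weighted average of a document's own features and those of the documents it is linked to. GAT replaces these fixed weights with learned attention, and GraphSAGE aggregates over sampled neighbours.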

PyTorch Geometric · GCN · GAT · GraphSAGE · CORA Dataset · Citation Networks

Code

Applied AI

TasteMatch: AI Dietitian Chatbot

LLM-powered personal dietitian for users managing chronic conditions such as diabetes. Analyzes user preferences, kitchen inventory, and dietary restrictions to generate personalized meal recommendations. Checks nutritional facts against established diabetes care guidelines, including glycemic index verification and portion size calculations.

Ollama · FastAPI · LLMs · RAG · Diabetes Care · Conversational AI

Research Agenda

My research explores how large language models and deep learning can be applied to real-world problems in healthcare and science. I'm particularly interested in:

Retrieval-Augmented Generation — improving factual accuracy and domain specificity in LLMs through dense retrieval and re-ranking pipelines
Medical Image Analysis — using GANs and attention mechanisms to enhance diagnostic quality of medical imagery
Digital Health — extracting meaningful signals from passive technology usage data to monitor functional decline in aging populations
Graph Neural Networks — modeling complex relational data for classification and clustering tasks

Current: Digital Health Monitoring

Active

At the Emory FIT Lab, I'm extracting and analyzing Amazon Alexa voice interaction logs to identify technology engagement patterns that correlate with functional decline in older adults. I'm building automated data extraction pipelines with Python/Selenium and developing ML models to detect meaningful behavioral changes over time.

Python · Selenium · Digital Health · Time Series Analysis · Emory FIT Lab