Long Tien Bui

Predicting Wellbeing & Anxiety from Survey Data

FIT5197 · Statistical Data Modelling · Monash University | Jul 2024 – Nov 2024


Overview

An end-to-end statistical modelling project built entirely in R, covering both a regression task (predicting an individual's happiness score) and a 5-class classification task (predicting alwaysAnxious) from ~70 life-and-wellbeing survey questions. Both tasks were submitted to a live Kaggle competition, with models iteratively refined against the public leaderboard.

What was the challenge?

The assignment imposed two non-trivial constraints: core statistical methods had to be implemented from first principles without external libraries, and the classification target was heavily imbalanced across five ordinal classes. The goal was to move beyond "throw a model at it" and reason carefully about predictor significance, regularisation, resampling, and hyperparameter behaviour — then defend every choice on the leaderboard.

What I did

  • Multiple Linear Regression from scratch: Fit a full linear model on ~70 survey predictors, then wrote R code to automatically extract significant predictors at α = 0.01 and rank the Top 5 strongest by t-statistic — no broom or helper libraries.
  • Feature engineering & diagnostics: Derived interpretable features (e.g. BMI from the height and weight fields), inspected residuals, and pruned predictors based on significance and collinearity.
  • Regularised regression: Implemented and compared Lasso regression to handle the high-dimensional predictor space, using cross-validation to select the penalty term.
  • Tree-based ensembles: Trained Random Forest and XGBoost regressors, tuned max_depth, eta, nrounds, and L1/L2 regularisation, and tracked each configuration against the Kaggle public score.
  • Imbalanced multi-class classification: For the alwaysAnxious target (5 ordinal classes), applied upSample to rebalance the training distribution and compared it against SMOTE-family approaches.
  • XGBoost multi-class tuning: Built a multi:softmax XGBoost classifier, systematically sweeping max_depth (3–10) and L1/L2 penalties, and documenting that aggressive regularisation hurt performance in this regime — an experimental finding worth more than the final score itself.
  • Reproducibility: Seeded every stochastic step, wrote a clean marker-facing entry point for local datasets, and maintained a referenced bibliography of every technique and source consulted.
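The from-scratch regression step above can be sketched in base R: coefficients, standard errors, and t-statistics computed directly from the normal equations, followed by the α = 0.01 filter and Top-5 ranking. Function and variable names here are illustrative, not the assignment's actual identifiers.

```r
# Fit a linear model by hand: beta = (X'X)^-1 X'y, with standard
# errors and two-sided t-test p-values derived from the residuals.
fit_mlr <- function(X, y) {
  X       <- cbind(Intercept = 1, as.matrix(X))
  XtX_inv <- solve(t(X) %*% X)
  beta    <- XtX_inv %*% t(X) %*% y
  resid   <- y - X %*% beta
  df      <- nrow(X) - ncol(X)
  sigma2  <- sum(resid^2) / df              # residual variance estimate
  se      <- sqrt(diag(XtX_inv) * sigma2)   # std. errors of the coefficients
  t_stat  <- as.vector(beta) / se
  p_val   <- 2 * pt(-abs(t_stat), df)       # two-sided p-values
  data.frame(term = colnames(X), estimate = as.vector(beta),
             se = se, t = t_stat, p = p_val, row.names = NULL)
}

# Keep predictors significant at alpha, ranked by absolute t-statistic.
top_predictors <- function(fit, alpha = 0.01, k = 5) {
  sig <- fit[fit$p < alpha & fit$term != "Intercept", ]
  head(sig[order(-abs(sig$t)), ], k)
}
```

On a data frame of survey predictors, `top_predictors(fit_mlr(X, happiness))` yields the Top 5 significant predictors without any helper packages; the output agrees with `summary(lm(...))` term by term.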
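The rebalancing step for the 5-class target can be illustrated in base R: each minority class is resampled with replacement up to the size of the largest class, which is the behaviour I assume from caret's `upSample`. `df` and `target` are placeholder names.

```r
# Up-sample every minority class (with replacement) to match the
# largest class, so the training distribution is uniform over classes.
up_sample <- function(df, target) {
  counts <- table(df[[target]])
  n_max  <- max(counts)
  idx <- unlist(lapply(names(counts), function(cls) {
    rows  <- which(df[[target]] == cls)
    extra <- rows[sample.int(length(rows), n_max - length(rows),
                             replace = TRUE)]
    c(rows, extra)
  }))
  df[idx, , drop = FALSE]
}
```

After rebalancing, every class contributes equally to the loss; the held-out evaluation set is left untouched so the reported score still reflects the true class distribution.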

Tools & skills

R · Multiple Linear Regression · Lasso · Random Forest · XGBoost · Cross-Validation · Hyperparameter Tuning · Imbalanced Data (upSample / SMOTE) · Feature Engineering · Model Diagnostics · Kaggle Competition Workflow

Outcome

The project reinforced that statistical modelling is as much about disciplined experimentation as it is about algorithms — knowing when a more complex model (or heavier regularisation) actively hurts, and being able to justify the final pick with evidence from both cross-validation and the leaderboard. It also deepened my fluency in R as a first-class modelling language alongside Python.