# HDB Price Prediction Model

An HDB resale price prediction model trained with XGBoost.
Singapore's public housing market is large, opaque, and data-rich. Over 227,000 HDB resale transactions have been registered since 2017 — but buyers and sellers still have to piece together comparable prices manually from listings and government portals. This project turns that data into an instant price estimate with a ±5% confidence range, served through a web interface anyone can use without an account.
## What It Does
You type a block number and street name. The form auto-fills the town, lease commencement year, and typical floor area from historical transaction data for that exact block. Fill in the remaining details — storey range, flat type, flat model, floor area, transaction month — and submit.
Within a second you get a predicted resale price alongside a breakdown of the key factors: distance to the CBD, nearest MRT, schools, remaining lease, building age, and estate maturity. Location access lets the browser pre-fill the town based on your coordinates.
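A submission might look like the sketch below. The field names and the `/predict` endpoint path are illustrative assumptions, not the API's documented contract:

```python
# Hypothetical request payload for the prediction endpoint.
# Endpoint path and field names are assumed for illustration only.
payload = {
    "block": "428",
    "street_name": "ANG MO KIO AVE 3",
    "town": "ANG MO KIO",          # auto-filled from the block's history
    "flat_type": "4 ROOM",
    "flat_model": "New Generation",
    "storey_range": "07 TO 09",
    "floor_area_sqm": 92.0,
    "lease_commence_date": 1978,   # auto-filled
    "month": "2024-06",            # transaction month
}
# import requests
# resp = requests.post("https://<your-render-app>/predict", json=payload)
```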
## Tech Stack
The backend is FastAPI in a Docker container deployed on Render, with XGBoost 2.1.0 as the model runtime. The frontend is a single HTML file with vanilla JS — no framework, no build step. Rate limiting is handled by slowapi (10 predictions/min per IP).
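In production slowapi enforces the limit; the underlying per-IP logic is roughly this sliding-window sketch (the class and method names are illustrative, not slowapi's API):

```python
import time
from collections import defaultdict

class SlidingWindowLimiter:
    """Allow at most `limit` calls per `window` seconds, tracked per IP."""

    def __init__(self, limit=10, window=60):
        self.limit = limit
        self.window = window
        self.hits = defaultdict(list)  # ip -> timestamps of recent requests

    def allow(self, ip: str) -> bool:
        now = time.monotonic()
        # Drop timestamps that have aged out of the window.
        recent = [t for t in self.hits[ip] if now - t < self.window]
        self.hits[ip] = recent
        if len(recent) >= self.limit:
            return False  # over the 10 predictions/min budget -> HTTP 429
        recent.append(now)
        return True
```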
## How It Predicts
When you submit, the API geocodes your address against a pre-built cache of 227,000+ HDB addresses — no live API call needed. It then computes distances to the CBD, regional MRT hubs, all 171 stations, schools, hawker centres, malls, and parks using the Haversine formula. These features, along with lease decay curves and market volume indicators, are fed into the model to produce a prediction.
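The Haversine distance the feature pipeline relies on fits in a few lines. The CBD anchor coordinates below are an assumption (roughly Raffles Place), not the project's exact constant:

```python
from math import radians, sin, cos, asin, sqrt

EARTH_RADIUS_KM = 6371.0

def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance between two (lat, lon) points in kilometres."""
    lat1, lon1, lat2, lon2 = map(radians, (lat1, lon1, lat2, lon2))
    dlat, dlon = lat2 - lat1, lon2 - lon1
    a = sin(dlat / 2) ** 2 + cos(lat1) * cos(lat2) * sin(dlon / 2) ** 2
    return 2 * EARTH_RADIUS_KM * asin(sqrt(a))

# Assumed CBD anchor (approx. Raffles Place), used as one distance feature.
CBD = (1.2840, 103.8515)
```

Each flat's cached coordinates yield a `haversine_km(flat_lat, flat_lon, *CBD)` feature, and a `min()` over candidate amenity coordinates gives the nearest-MRT, nearest-school, and similar features.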
One non-obvious decision: using 4 regional MRT hub distances instead of all 171 stations actually improved accuracy. With a dense transit network, distance to the nearest station collapses to near-zero for almost every flat — it loses all discriminating power. The four hubs (Ang Mo Kio, Woodlands, Jurong East, Tampines) act as North/South/East/West centrality anchors instead.
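As a sketch, the four hub distances become four separate features rather than one collapsed nearest-station distance. The hub coordinates below are approximate assumptions, not the project's exact values:

```python
# Approximate coordinates of the four regional MRT hubs (assumed values).
HUBS = {
    "ang_mo_kio": (1.3700, 103.8496),
    "woodlands": (1.4370, 103.7865),
    "jurong_east": (1.3330, 103.7422),
    "tampines": (1.3546, 103.9450),
}

def hub_features(lat, lon, distance_fn):
    """One distance feature per hub -- N/S/E/W centrality anchors."""
    return {f"dist_{name}_km": distance_fn(lat, lon, *coords)
            for name, coords in HUBS.items()}
```

Because every flat sits at a different point relative to all four anchors, the feature vector keeps discriminating power even where nearest-station distance is uniformly near zero.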
## Deployment Constraints
Running on Render's free tier means 512MB RAM. The full 5-fold ensemble consistently crashed with out-of-memory (OOM) errors on startup, so only the best-performing fold (fold 3, R² = 0.9799) is loaded in production. Switching from text JSON to XGBoost's binary .ubj format cut the per-model memory spike from ~313MB to ~269MB — the difference between booting and crashing.
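The format switch is a one-line change at save time, since xgboost infers the serialization format from the file extension. A sketch (the function names and the `fold3.ubj` path are illustrative):

```python
import xgboost as xgb

def export_for_serving(booster: xgb.Booster, path: str = "fold3.ubj") -> None:
    # xgboost picks the format from the extension: ".json" is text JSON,
    # ".ubj" is Universal Binary JSON -- same model, smaller load-time spike.
    booster.save_model(path)

def load_for_serving(path: str = "fold3.ubj") -> xgb.Booster:
    booster = xgb.Booster()
    booster.load_model(path)
    return booster
```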
## Results
| Metric | Value |
|---|---|
| Deployed model R² | 0.9799 |
| Test RMSE (full ensemble) | ~$33–35k |
| Baseline RMSE | $55,116 |
| RMSE reduction | 36–40% |
The ~$33–35k RMSE is roughly 5–7% of the median resale price, achieved through feature engineering alone — no additional data sources beyond publicly available amenity data and the geocoded address cache.