An end-to-end machine learning pipeline on 1.99 million FEMA flood insurance claims — from raw data to an XGBoost model that explains 66.9% of claim variation and supports real-time claims triage.
The National Flood Insurance Program (NFIP), administered by FEMA, is the primary source of flood insurance in the U.S. With over 5 million active policies and $70B+ in historical payouts, accurately predicting individual claim severity is a high-value operational problem — it drives claims triage, reinsurance reserving, and portfolio risk management.
Target variable: amountPaidOnBuildingClaim — actual NFIP payout for structural damage, ranging from $0.01 to $250,000.
How severe the flood was — captured by waterDepth, flood_zone_risk_score, and is_hurricane_season. A shallow flood causes minimal damage; six feet of surge is catastrophic.
How exposed the structure was — elevatedBuildingIndicator, height_above_BFE, postFIRMConstructionIndicator, and building_age_at_loss.
Policy terms — totalBuildingInsuranceCoverage, deductible_amount, and buildingPropertyValue all determine the ultimate settlement amount.
A shallow flood on a non-elevated Pre-FIRM building in Zone VE is catastrophic. The same depth on a Post-FIRM elevated building in Zone X causes almost no damage. These conditions interact, and linear models cannot capture those interactions.
Historical flood insurance claims from FEMA covering multiple decades, 50+ states, and every flood zone category. Starting from 73 raw columns, we retained a clean working dataset after removing leakage-prone columns, high-cardinality identifiers, and invalid records.
Raw columns are rarely in optimal form for ML. We engineered 22 features across five domain categories:
| Category | Features | Rationale |
|---|---|---|
| Building Age | building_age_at_loss | Older buildings predate flood codes; structurally more vulnerable |
| Elevation & Freeboard | height_above_BFE, height_above_ground, freeboard_positive, freeboard_deficit | Floor height above BFE is the primary physical determinant of whether water enters a building |
| Hazard Timing | is_hurricane_season, water_depth_log | June–Nov is Atlantic hurricane season; log depth approximates the non-linear depth-damage curve |
| Flood Zone | flood_zone_risk_score, is_high_risk_zone | FEMA zones have a clear ordinal risk hierarchy encoded explicitly |
| Building Type & Geography | 8 is_* flags, is_coastal_state, risk_zone_elevation_interaction, deductible_amount, cat_year_coastal, state_target_enc, year_target_enc | Structure type, coastal exposure, catastrophe year × coastal interaction |
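A few of the engineered features above can be sketched in pandas. This is a minimal illustration, not the project's actual pipeline: the input column names (`year_of_loss`, `construction_year`, `lowestFloorElevation`, `baseFloodElevation`, `month_of_loss`) are assumed stand-ins for the NFIP schema, and the sample values are invented.

```python
import numpy as np
import pandas as pd

def engineer_features(df: pd.DataFrame) -> pd.DataFrame:
    """Sketch of a few engineered features (input column names are assumptions)."""
    out = df.copy()
    # Age of the structure at the time of the flood
    out["building_age_at_loss"] = out["year_of_loss"] - out["construction_year"]
    # Freeboard: lowest-floor height relative to Base Flood Elevation (BFE)
    out["height_above_BFE"] = out["lowestFloorElevation"] - out["baseFloodElevation"]
    out["freeboard_positive"] = (out["height_above_BFE"] > 0).astype(int)
    # Atlantic hurricane season runs June through November
    out["is_hurricane_season"] = out["month_of_loss"].between(6, 11).astype(int)
    # log1p approximates the concave depth-damage curve
    out["water_depth_log"] = np.log1p(out["waterDepth"].clip(lower=0))
    return out

# Two invented example claims
claims = pd.DataFrame({
    "year_of_loss": [2005, 2017],
    "construction_year": [1968, 2001],
    "lowestFloorElevation": [8.0, 12.5],
    "baseFloodElevation": [9.0, 10.0],
    "month_of_loss": [8, 1],
    "waterDepth": [6.0, 0.5],
})
feats = engineer_features(claims)
print(feats[["building_age_at_loss", "freeboard_positive", "is_hurricane_season"]])
```
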
Each EDA question was chosen to directly inform modelling decisions — not just describe the data, but test whether a feature is worth including and in what form.
Distribution of amountPaidOnBuildingClaim — raw, log-transformed, and by frequency tier.
The raw distribution is severely right-skewed — most claims are small ($500–$15K), but extreme claims up to $250K exist. Log-transformation yields a roughly bell-shaped distribution, which matters for Linear Regression's error assumptions. Small-to-moderate claims dominate the frequency distribution, meaning a model that only predicts large losses accurately would fail on the majority of real-world claims.
Temporal patterns: claim volume and severity by month, season, and loss year.
Claims are dramatically concentrated in August–October — peak Atlantic hurricane season. The 2005 (Katrina), 2012 (Sandy), and 2017 (Harvey, Irma, Maria) hurricane years are visible as distinct severity spikes. is_hurricane_season shows one of the strongest linear correlations with the target (0.239), directly justifying its inclusion as a binary feature.
Depth-damage relationship — raw vs. log-transformed, and the non-linear curve.
The relationship between water depth and claim amount is non-linear and concave — a classic depth-damage curve from flood engineering. The first few inches of water cause a disproportionate jump in losses; marginal damage diminishes at depth extremes. This non-linearity directly motivates the water_depth_log engineered feature, which improves linear correlation from 0.156 (raw) to 0.248 (log).
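The correlation lift from the log transform can be illustrated on synthetic data. The snippet below generates an invented concave depth-damage relationship (not the NFIP data) and shows that correlating against `log1p(depth)` recovers more of the linear signal than raw depth:

```python
import numpy as np

rng = np.random.default_rng(0)
depth = rng.exponential(scale=3.0, size=50_000)  # water depth in feet (synthetic)
# Concave depth-damage curve plus noise, illustrative only
damage = 20_000 * np.log1p(depth) + rng.normal(0, 10_000, size=depth.size)

r_raw = np.corrcoef(depth, damage)[0, 1]
r_log = np.corrcoef(np.log1p(depth), damage)[0, 1]
print(f"corr(raw depth, damage) = {r_raw:.3f}")
print(f"corr(log depth, damage) = {r_log:.3f}")
```

On the real claims data this same comparison is what moves the correlation from 0.156 to 0.248.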
Claim severity and volume by flood zone category — testing whether FEMA's risk hierarchy is reflected in actual payouts.
High-risk zones (AE, VE) do produce higher average severity, but Zone X (low-risk) generates the most total claim volume due to sheer policy count — the "volume paradox." Zone designation captures the broad severity signal, but physical hazard features (water depth, elevation) are essential complements. A claim in Zone X during Hurricane Harvey can exceed one in Zone AE during a minor rain event.
Elevated vs. non-elevated buildings, Pre-FIRM vs. Post-FIRM, and the freeboard threshold effect.
Building-level interventions reduce losses, but the picture is more nuanced than expected. Crossing above BFE cuts median claims by 63% (from $29K to $11K) — a threshold effect, not a linear one. Post-FIRM buildings show higher average claims than Pre-FIRM ($41K vs $30K) due to higher property values (TIV effect), not greater vulnerability. 81% of claims (1.6M) come from non-elevated buildings, meaning the higher-severity group is also the most common type in the portfolio.
State-level total paid, average severity, and a frequency vs. severity scatter for the top 20 states.
Losses are dramatically concentrated: Louisiana ($15.9B), Florida ($13.1B), and Texas ($12.5B) account for 63.3% of all NFIP building claim payments. LA leads on volume (~360K claims at $44K avg), FL leads on per-claim severity ($44,784). Mississippi is a hidden risk — moderate volume (~50K claims) but 3rd-highest average severity at $43,343, driven by Katrina concentration. NY/NJ appear in top-15 severity due entirely to Hurricane Sandy (2012).
Correlation matrix across 20 key features and their individual correlations with the target.
Individual feature-to-target correlations are modest (max ~0.25), which is normal for insurance loss prediction. water_depth_log (0.248) and is_hurricane_season (0.239) are the strongest linear signals. Critically, freeboard_positive shows only −0.078 linear correlation, yet SHAP analysis reveals elevation as a top XGBoost feature — because its protective effect is conditional on flood zone and depth. This is the core insight: linear correlation cannot detect conditional interactions, which is precisely why XGBoost outperforms Linear Regression.
Profile comparison: top 10% vs. bottom 30% claims across 13 features.
A high-severity NFIP claim has a consistent, physically sensible profile: 8.55 inches of water depth (vs 2.21 for low-severity — 3.9×), 39.2% in high-risk zones (vs 11.2% — 3.5×), 94.5% in coastal states (vs 71.9%), and $317K average TIV (vs $113K — 2.8×). Buildings average 39.9 years old vs 27.4 years. The profiles are coherent and grounded in flood science — the model is learning real patterns, not statistical noise.
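The profile comparison is a simple quantile split. A sketch of the mechanics on synthetic data (column names mirror the project; values are random, not the reported figures):

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(1)
n = 10_000
df = pd.DataFrame({
    "amountPaidOnBuildingClaim": rng.lognormal(9.5, 1.2, n),
    "waterDepth": rng.exponential(3.0, n),
    "is_high_risk_zone": rng.integers(0, 2, n),
})

# Top 10% vs bottom 30% of claims by paid amount
hi_cut = df["amountPaidOnBuildingClaim"].quantile(0.90)
lo_cut = df["amountPaidOnBuildingClaim"].quantile(0.30)
top10 = df[df["amountPaidOnBuildingClaim"] >= hi_cut]
bot30 = df[df["amountPaidOnBuildingClaim"] <= lo_cut]

# Mean feature profile for each severity tier
profile = pd.DataFrame({
    "top_10pct": top10[["waterDepth", "is_high_risk_zone"]].mean(),
    "bottom_30pct": bot30[["waterDepth", "is_high_risk_zone"]].mean(),
})
print(profile)
```
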
We trained three models in increasing order of complexity, using each as a benchmark for the next. All three were evaluated on a held-out test set of ~380K claims.
| Model | R² | RMSE | MAE | Variation Explained | Note |
|---|---|---|---|---|---|
| Linear Regression | 0.307 | $39,888 | $25,384 | 30.7% | Baseline — fails on non-linear interactions |
| Random Forest | 0.551 | $32,116 | $19,777 | 55.1% | +80% improvement over LR |
| XGBoost v1 ✓ | 0.669 | $27,577 | $16,474 | 66.9% | Best — sequential boosting on residuals |
| XGBoost v2 (Optuna) | 0.666 | $27,694 | $16,564 | 66.6% | Automated search — slightly underperforms manual tuning |
After training, we ran the model on both the training set and the held-out test set to check if it was memorizing data rather than learning real patterns. A small gap between train and test performance means the model generalizes well.
No overfitting detected: the train-test R² gap of 0.057 is well within the healthy range.
The model is learning the real underlying patterns of flood damage — not memorizing training claims. It performs almost equally well on claims it has never seen before, which confirms it can be deployed reliably on new incoming claims.
💡 What would bad overfitting look like? If Train R² were 0.95 and Test R² were 0.40, the model would be memorizing training data and failing on new claims. A gap under 0.10 is the industry standard for acceptable generalization.
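The check itself is a one-liner. A sketch, using a train R² of 0.726 implied by the reported test R² of 0.669 plus the 0.057 gap, and the 0.10 threshold mentioned above:

```python
def generalization_gap(train_r2: float, test_r2: float, threshold: float = 0.10):
    """Return the train-test R² gap and whether it clears the overfitting threshold."""
    gap = train_r2 - test_r2
    return gap, gap <= threshold

# Test R² of 0.669 with a gap of 0.057 implies train R² ≈ 0.726
gap, ok = generalization_gap(0.726, 0.669)
print(f"gap={gap:.3f}, generalizes={ok}")
```
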
Two complementary interpretation methods: XGBoost's built-in gain importance (global, which features are used most) and SHAP values (per-prediction, which features pushed each claim up or down).
| Rank | Feature | Importance | Interpretation |
|---|---|---|---|
| 1 | cat_year_coastal | 0.2676 | Catastrophe year × coastal state — Katrina (2005), Harvey (2017), Sandy (2012) dominate the signal |
| 2 | elevatedBuildingIndicator | 0.0632 | Single most impactful structural predictor across all flood zones |
| 3 | waterDepth | 0.0478 | How deep the water was — non-linear depth-damage relationship captured by tree splits |
| 4 | buildingPropertyValue | 0.0339 | Higher value building = larger absolute dollar loss (scale effect) |
| 5 | longitude | 0.0275 | Geographic exposure — Gulf Coast vs. inland U.S. |
SHAP decomposes every individual prediction into the contribution of each feature. Unlike global importance, SHAP shows both direction (does this push the prediction up or down?) and magnitude for every claim in a 5,000-claim sample.
🔑 Key insight from SHAP: freeboard_positive shows only −0.078 linear correlation with the target, but SHAP reveals it as a top contributor — because its effect is conditional. When water depth is high AND the building is not elevated AND the zone is AE/VE, freeboard contributions are large and negative (reduces predicted loss when above BFE). Linear correlation averages this out to near-zero; SHAP exposes it.
Actual vs. predicted scatter and residuals plot on a 5,000-claim holdout sample. A well-calibrated model should scatter around the 45° diagonal with residuals centered near zero.
The model under-predicts the largest catastrophic claims ($150K+) — a known limitation of tree-based models on heavy-tailed distributions. This informs the deployment recommendation: use the model's top 15% predicted segment (not just top 10%) as the triage priority queue to include a buffer for underestimated tail events.
The XGBoost model predicts expected claim payout with R² = 0.669 and MAE = $16,474. Each recommendation below is a direct, immediate application of this predictive capability.
The problem today: After a hurricane, thousands of claims land simultaneously and are processed in arrival order. A $180,000 loss at position 800 in the queue gets the same priority as a $2,000 loss at position 1.
What the model does: Score every incoming claim instantly using information already collected at submission — water depth, flood zone, building value, location, elevation status. Rank by predicted severity. Dispatch senior adjusters to the top 15% first.
Impact: Faster settlement for the largest losses. Reduced litigation. Better resource allocation at the moment it matters most.
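The triage logic above reduces to scoring, sorting, and slicing. A sketch with a hypothetical batch of scored claims (the 15% cutoff follows the buffer recommendation for underestimated tail events):

```python
import numpy as np
import pandas as pd

def build_triage_queue(claims: pd.DataFrame, predicted: np.ndarray,
                       top_frac: float = 0.15) -> pd.DataFrame:
    """Rank incoming claims by predicted severity; flag the top slice for senior adjusters."""
    ranked = claims.assign(predicted_severity=predicted).sort_values(
        "predicted_severity", ascending=False
    )
    cutoff = max(1, int(np.ceil(top_frac * len(ranked))))
    ranked["priority"] = ["senior"] * cutoff + ["standard"] * (len(ranked) - cutoff)
    return ranked

# Hypothetical batch: 100 claims with model-predicted payouts
batch = pd.DataFrame({"claim_id": range(100)})
scores = np.random.default_rng(3).lognormal(9.5, 1.0, 100)
queue = build_triage_queue(batch, scores)
print(queue.head(3))
```
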
The problem today: NFIP knows which properties are insured but not which ones are most likely to generate large claims next season.
What the model does: Run the model on every in-force policy using current property characteristics (set waterDepth to the median historical value for that flood zone). Filter for: postFIRMConstructionIndicator = 0 + elevatedBuildingIndicator = 0 + is_coastal_state = 1 — the highest-risk segment confirmed by both EDA Q5 and the model's feature importance.
Impact: Proactive mitigation outreach to the right properties. Elevation certificate campaigns targeted where they reduce the most expected loss.
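The portfolio scan above can be sketched as a counterfactual scoring pass. The feature list is a small illustrative subset, and `StubModel` stands in for the trained XGBoost regressor so the example is self-contained:

```python
import numpy as np
import pandas as pd

FEATURES = ["waterDepth", "postFIRMConstructionIndicator",
            "elevatedBuildingIndicator", "is_coastal_state"]  # illustrative subset

class StubModel:
    """Stand-in for the trained XGBoost regressor (illustration only)."""
    def predict(self, X: pd.DataFrame) -> np.ndarray:
        return 5_000 * np.log1p(X["waterDepth"].to_numpy()) + 2_000

def flag_high_risk(policies: pd.DataFrame, model,
                   zone_median_depth: dict) -> pd.DataFrame:
    scored = policies.copy()
    # Counterfactual: assume the zone-median historical water depth for each policy
    scored["waterDepth"] = scored["floodZone"].map(zone_median_depth)
    scored["expected_loss"] = model.predict(scored[FEATURES])
    # Highest-risk segment: Pre-FIRM, non-elevated, coastal
    mask = (
        (scored["postFIRMConstructionIndicator"] == 0)
        & (scored["elevatedBuildingIndicator"] == 0)
        & (scored["is_coastal_state"] == 1)
    )
    return scored[mask].sort_values("expected_loss", ascending=False)

# Three invented in-force policies
policies = pd.DataFrame({
    "floodZone": ["VE", "X", "AE"],
    "postFIRMConstructionIndicator": [0, 0, 1],
    "elevatedBuildingIndicator": [0, 0, 0],
    "is_coastal_state": [1, 0, 1],
})
high_risk = flag_high_risk(policies, StubModel(), {"VE": 6.0, "AE": 4.0, "X": 1.0})
print(high_risk)
```
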
The problem today: Elevation-based premium discounts are set using flat rate tables — the same discount regardless of the property's specific flood zone, building value, or location.
What the model does: Run two predictions on the same property — once with elevatedBuildingIndicator = 0, once with elevatedBuildingIndicator = 1. The difference is a property-specific dollar estimate of how much elevation reduces expected loss. elevatedBuildingIndicator is our #2 feature by gain importance (0.0632).
Impact: Property-specific elevation incentives. More accurate premium pricing. Data-driven justification for mitigation grants.
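The two-prediction counterfactual above is straightforward to implement. A sketch with a self-contained `StubModel` in place of the trained regressor (the stub's 60% loss reduction when elevated is invented for illustration):

```python
import numpy as np
import pandas as pd

def elevation_benefit(policy_row: pd.Series, model, feature_cols) -> float:
    """Dollar reduction in expected loss from elevating this specific property.

    Runs the model twice on the same property, toggling elevatedBuildingIndicator.
    """
    base = policy_row.copy(); base["elevatedBuildingIndicator"] = 0
    elev = policy_row.copy(); elev["elevatedBuildingIndicator"] = 1
    X = pd.DataFrame([base[feature_cols], elev[feature_cols]])
    pred_not_elev, pred_elev = model.predict(X)
    return float(pred_not_elev - pred_elev)

class StubModel:
    """Stand-in for the trained regressor; elevation cuts predicted loss (invented)."""
    def predict(self, X: pd.DataFrame) -> np.ndarray:
        base = 4_000 * np.log1p(X["waterDepth"].to_numpy())
        return base * np.where(X["elevatedBuildingIndicator"].to_numpy() == 1, 0.4, 1.0)

row = pd.Series({"waterDepth": 5.0, "elevatedBuildingIndicator": 0})
benefit = elevation_benefit(row, StubModel(), ["waterDepth", "elevatedBuildingIndicator"])
print(f"property-specific elevation benefit: ${benefit:,.0f}")
```

The returned difference is the property-specific dollar figure that would replace a flat rate-table discount.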
Storm track data, National Weather Service peak stage readings, and event-level inundation maps would dramatically improve predictions for catastrophe-year claims.
Census tract or ZIP-level flood history, elevation data from USGS DEMs, and distance-to-water features would replace the coarse state-level geographic signal.
Foundation type, building materials, first-floor elevation certificates, and roof construction details are known to adjusters but not in the NFIP redacted dataset.
A production version should retrain annually on recent catastrophe years to prevent concept drift as climate patterns shift coastal risk profiles over time.
Starting from raw FEMA data with 73 columns and 1.99 million records, this project delivered a production-ready flood claim severity model through a rigorous end-to-end pipeline.
The key finding of this project is not the R² score — it is the demonstration that conditional feature interactions govern flood losses. Elevation only matters in certain zones at certain water depths. Hurricane season only matters in coastal states. Linear models average across all these conditions and fail. XGBoost discovers them automatically through tree splits — which is why it explains 66.9% of variation vs. only 30.7% for Linear Regression on the same data.