ISOM 835  ·  Predictive Analytics & Machine Learning  ·  Suffolk University MSBA  ·  Spring 2026

Predicting
Flood Claim Severity
for the NFIP

An end-to-end machine learning pipeline on 1.99 million FEMA flood insurance claims — from raw data to an XGBoost model that explains 66.9% of claim variation and supports real-time claims triage.

0.669
XGBoost R²
$16,474
Mean Absolute Error
1.99M
Claims Analyzed
$70B+
Historical Claim Payments
Business Context

Why Flood Claim Prediction Matters

The National Flood Insurance Program (NFIP), administered by FEMA, is the primary source of flood insurance in the U.S. With over 5 million active policies and $70B+ in historical payouts, accurately predicting individual claim severity is a high-value operational problem — it drives claims triage, reinsurance reserving, and portfolio risk management.

Target variable: amountPaidOnBuildingClaim — actual NFIP payout for structural damage, ranging from $0.01 to $250,000.

🌊

Hazard Intensity

How severe the flood was — captured by waterDepth, flood_zone_risk_score, and is_hurricane_season. A shallow flood causes minimal damage; six feet of surge is catastrophic.

🏠

Building Vulnerability

How exposed the structure was — elevatedBuildingIndicator, height_above_BFE, postFIRMConstructionIndicator, and building_age_at_loss.

📋

Financial Structure

Policy terms — totalBuildingInsuranceCoverage, deductible_amount, and buildingPropertyValue all determine the ultimate settlement amount.

🧠

Why Linear Models Fail

A shallow flood on a non-elevated Pre-FIRM building in Zone VE is catastrophic. The same depth on a Post-FIRM elevated building in Zone X causes almost no damage. These conditions interact, and linear models cannot capture such interactions.

Project Structure

End-to-End ML Pipeline

Part 1
The Situation
Load 1.99M raw records · Data quality audit · Target variable cleaning
Part 2
The Discovery
22 engineered features · 8 EDA questions answered
Part 3
The Model
Linear → Random Forest → XGBoost with Optuna tuning
Part 4
The Recommendation
SHAP interpretation · Diagnostics · 3 business recommendations
Data

FEMA NFIP Redacted Claims V2

Historical flood insurance claims from FEMA covering multiple decades, 50+ states, and every flood zone category. Starting from 73 raw columns, we retained a clean working dataset after removing leakage, high-cardinality identifiers, and invalid records.

1.99M
Raw Records
73
Original Columns
22
Engineered Features
80/20
Train / Test Split

Feature Engineering — 22 New Predictive Features

Raw columns are rarely in optimal form for ML. We engineered 22 features across five domain categories:

Category | Features | Rationale
Building Age | building_age_at_loss | Older buildings predate flood codes; structurally more vulnerable
Elevation & Freeboard | height_above_BFE, height_above_ground, freeboard_positive, freeboard_deficit | Floor height above BFE is the primary physical determinant of whether water enters a building
Hazard Timing | is_hurricane_season, water_depth_log | June–Nov is Atlantic hurricane season; log depth approximates the non-linear depth-damage curve
Flood Zone | flood_zone_risk_score, is_high_risk_zone | FEMA zones have a clear ordinal risk hierarchy encoded explicitly
Building Type & Geography | 8 is_* flags, is_coastal_state, risk_zone_elevation_interaction, deductible_amount, cat_year_coastal, state_target_enc, year_target_enc | Structure type, coastal exposure, catastrophe year × coastal interaction
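As an illustration, three of these features can be derived in a few lines of pandas. This is a sketch, not the project's actual pipeline: waterDepth appears in the NFIP schema described above, while dateOfLoss and originalConstructionDate_year are stand-in column names for this example.

```python
import numpy as np
import pandas as pd

def engineer_features(df: pd.DataFrame) -> pd.DataFrame:
    out = df.copy()
    loss_date = pd.to_datetime(out["dateOfLoss"])
    # Building age at time of loss: older buildings predate flood codes
    out["building_age_at_loss"] = loss_date.dt.year - out["originalConstructionDate_year"]
    # Atlantic hurricane season flag: June (6) through November (11)
    out["is_hurricane_season"] = loss_date.dt.month.between(6, 11).astype(int)
    # Log depth approximates the concave depth-damage curve
    out["water_depth_log"] = np.log1p(out["waterDepth"].clip(lower=0))
    return out

claims = pd.DataFrame({
    "dateOfLoss": ["2005-08-29", "2012-01-15"],
    "originalConstructionDate_year": [1960, 2000],
    "waterDepth": [72.0, 3.0],
})
fe = engineer_features(claims)
```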
Exploratory Data Analysis

8 Questions That Shaped the Model

Each EDA question was chosen to directly inform modeling decisions — not just describe the data, but test whether a feature is worth including and in what form.

EDA 01

What Does the Target Variable Actually Look Like?

Distribution of amountPaidOnBuildingClaim — raw, log-transformed, and by frequency tier.

EDA 1 — Target Distribution

Key Finding

The raw distribution is severely right-skewed — most claims are small ($500–$15K) but extreme claims up to $250K exist. Log-transformation converts this into a roughly bell-shaped distribution, which matters for Linear Regression's error assumptions. The small-to-moderate claims dominate frequency, meaning a model that only predicts large losses accurately would fail on the majority of real-world claims.
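A quick way to see why the log transform matters is to measure skewness before and after. The sample below is synthetic (a lognormal standing in for amountPaidOnBuildingClaim), so the numbers are illustrative only:

```python
import numpy as np

rng = np.random.default_rng(0)

def skewness(x: np.ndarray) -> float:
    # Standardized third moment (Fisher skewness)
    z = (x - x.mean()) / x.std()
    return float((z ** 3).mean())

# Heavy-tailed synthetic "claim amounts"
claims = rng.lognormal(mean=9.0, sigma=1.2, size=100_000)

raw_skew = skewness(claims)            # severely right-skewed
log_skew = skewness(np.log1p(claims))  # roughly bell-shaped after log
```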

EDA 02

When Do Floods Cause the Most Severe Damage?

Temporal patterns: claim volume and severity by month, season, and loss year.

EDA 2 — Temporal Patterns

Key Finding

Claims are dramatically concentrated in August–October — peak Atlantic hurricane season. The 2005 (Katrina), 2012 (Sandy), and 2017 (Harvey, Irma, Maria) hurricane years are visible as distinct severity spikes. is_hurricane_season shows one of the strongest linear correlations with the target (0.239), directly justifying its inclusion as a binary feature.

EDA 03

Does Water Depth Actually Predict Claim Size?

Depth-damage relationship — raw vs. log-transformed, and the non-linear curve.

EDA 3 — Water Depth

Key Finding

The relationship between water depth and claim amount is non-linear and concave — a classic depth-damage curve from flood engineering. The first few inches of water cause a disproportionate jump in losses; marginal damage diminishes at depth extremes. This non-linearity directly motivates the water_depth_log engineered feature, which improves linear correlation from 0.156 (raw) to 0.248 (log).
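The correlation gain from logging depth can be reproduced on synthetic data that follows a concave depth-damage curve. These coefficients are illustrative, not the NFIP figures above:

```python
import numpy as np

rng = np.random.default_rng(42)
n = 50_000

# Synthetic concave depth-damage relationship: the first inches of water
# cause a disproportionate jump in loss, then marginal damage flattens.
depth = rng.uniform(0, 120, size=n)                       # inches of water
damage = 40_000 * np.log1p(depth) + rng.normal(0, 20_000, size=n)

corr_raw = np.corrcoef(depth, damage)[0, 1]
corr_log = np.corrcoef(np.log1p(depth), damage)[0, 1]     # stronger linear signal
```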

EDA 04

Do FEMA's Flood Zones Predict Losses?

Claim severity and volume by flood zone category — testing whether FEMA's risk hierarchy is reflected in actual payouts.

EDA 4 — Flood Zones

Key Finding

High-risk zones (AE, VE) do produce higher average severity, but Zone X (low-risk) generates the most total claim volume due to sheer policy count — the "volume paradox." Zone designation captures the broad severity signal, but physical hazard features (water depth, elevation) are essential complements. A claim in Zone X during Hurricane Harvey can exceed one in Zone AE during a minor rain event.

EDA 05

Do Elevation & Construction Era Actually Reduce Losses?

Elevated vs. non-elevated buildings, Pre-FIRM vs. Post-FIRM, and the freeboard threshold effect.

EDA 5 — Building Vulnerability

Key Finding

Building-level interventions reduce losses, but the effect is more nuanced than expected. Crossing above BFE cuts median claims by 63% (from $29K to $11K) — a threshold effect, not linear. Post-FIRM buildings show higher average claims than Pre-FIRM ($41K vs $30K) due to higher property values (TIV effect), not greater vulnerability. 81% of claims (1.6M) are on non-elevated buildings, meaning the higher-severity group is also the most common type in the portfolio.

EDA 06

Where Are Flood Losses Geographically Concentrated?

State-level total paid, average severity, and a frequency vs. severity scatter for the top 20 states.

EDA 6 — Geography

Key Finding

Losses are dramatically concentrated: Louisiana ($15.9B), Florida ($13.1B), and Texas ($12.5B) account for 63.3% of all NFIP building claim payments. LA leads on volume (~360K claims at $44K avg), FL leads on per-claim severity ($44,784). Mississippi is a hidden risk — moderate volume (~50K claims) but 3rd-highest average severity at $43,343, driven by Katrina concentration. NY/NJ appear in top-15 severity due entirely to Hurricane Sandy (2012).

EDA 07

How Are All Features Related to Each Other?

Correlation matrix across 20 key features and their individual correlations with the target.

EDA 7 — Correlation Matrix

Key Finding

Individual feature-to-target correlations are modest (max ~0.25), which is normal for insurance loss prediction. water_depth_log (0.248) and is_hurricane_season (0.239) are the strongest linear signals. Critically, freeboard_positive shows only −0.078 linear correlation, yet SHAP analysis reveals elevation as a top XGBoost feature — because its protective effect is conditional on flood zone and depth. This is the core insight: linear correlation cannot detect conditional interactions, which is precisely why XGBoost outperforms Linear Regression.
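A toy example makes the "conditional interaction" point concrete. When the target depends only on the product of two features, each feature's linear correlation with the target is near zero even though together they determine it exactly:

```python
import numpy as np

rng = np.random.default_rng(7)
n = 50_000

# "elevated" flag coded symmetrically, plus a depth-like driver
elevated = rng.choice([-1.0, 1.0], size=n)
depth = rng.normal(0.0, 1.0, size=n)

# Loss depends ONLY on the interaction: elevation flips the effect of depth
loss = elevated * depth

corr_elevated = np.corrcoef(elevated, loss)[0, 1]            # ~0: invisible to correlation
corr_depth = np.corrcoef(depth, loss)[0, 1]                  # ~0 as well
corr_interaction = np.corrcoef(elevated * depth, loss)[0, 1] # 1.0: the real signal
```

Tree splits recover exactly this kind of structure (split on `elevated`, then on `depth`), which a single global correlation averages away to zero.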

EDA 08

What Does a High-Severity vs. Low-Severity Claim Look Like?

Profile comparison: top 10% vs. bottom 30% claims across 13 features.

EDA 8 — Claim Profiles

Key Finding

A high-severity NFIP claim has a consistent, physically sensible profile: 8.55 inches of water depth (vs 2.21 for low-severity — 3.9×), 39.2% in high-risk zones (vs 11.2% — 3.5×), 94.5% in coastal states (vs 71.9%), and $317K average TIV (vs $113K — 2.8×). Buildings average 39.9 years old vs 27.4 years. The profiles are coherent and grounded in flood science — the model is learning real patterns, not statistical noise.

Model Development

Three Progressively Powerful Models

We trained three models in increasing order of complexity, using each as a benchmark for the next. All were evaluated on a held-out test set of ~397K claims.

Linear Regression
Baseline — assumes linearity
R² 0.307
RMSE $39,888
MAE $25,384
Variation Explained: 30.7%
Random Forest
100 trees · max_depth=10
R² 0.551
RMSE $32,116
MAE $19,777
Variation Explained: 55.1% (+80% vs LR)
✓ Final Model
XGBoost v1
450 trees · gradient boosting · manually tuned
R² 0.669
RMSE $27,577
MAE $16,474
Variation Explained: 66.9%

Model | R² | RMSE | MAE | Variation Explained | Note
Linear Regression | 0.307 | $39,888 | $25,384 | 30.7% | Baseline — fails on non-linear interactions
Random Forest | 0.551 | $32,116 | $19,777 | 55.1% | +80% improvement over LR
XGBoost v1 ✓ | 0.669 | $27,577 | $16,474 | 66.9% | Best — sequential boosting on residuals
XGBoost v2 (Optuna) | 0.666 | $27,694 | $16,564 | 66.6% | Automated search — slightly underperforms manual tuning

Overfitting Check

After training, we ran the model on both the training set and the held-out test set to check if it was memorizing data rather than learning real patterns. A small gap between train and test performance means the model generalizes well.

Train R²
0.7254
On 1.57M training claims
Test R²
0.6688
On 397K unseen claims
Gap
0.057
Threshold: < 0.10 is healthy

No overfitting detected. A gap of 0.057 is well within the healthy range.

The model is learning the real underlying patterns of flood damage — not memorizing training claims. It performs almost equally well on claims it has never seen before, which confirms it can be deployed reliably on new incoming claims.

💡 What would bad overfitting look like? If Train R² were 0.95 and Test R² were 0.40, the model would be memorizing training data and failing on new claims. A gap under 0.10 is the industry standard for acceptable generalization.
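The check itself is one subtraction. A minimal version using the reported numbers:

```python
def overfit_gap(train_r2: float, test_r2: float, threshold: float = 0.10):
    """Return the train-test R² gap and whether it is under the 0.10 threshold."""
    gap = train_r2 - test_r2
    return gap, gap < threshold

# Numbers reported above: Train R² 0.7254, Test R² 0.6688
gap, healthy = overfit_gap(0.7254, 0.6688)
```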

Model Comparison Chart
Figure: Model performance comparison — R², RMSE, and MAE across all four models
Model Interpretation

What Drives Flood Claim Severity?

Two complementary interpretation methods: XGBoost's built-in gain importance (global, which features are used most) and SHAP values (per-prediction, which features pushed each claim up or down).

Feature Importance
Figure: XGBoost feature importance by gain — top predictors of flood claim severity
Rank | Feature | Importance | Interpretation
1 | cat_year_coastal | 0.2676 | Catastrophe year × coastal state — Katrina (2005), Harvey (2017), Sandy (2012) dominate the signal
2 | elevatedBuildingIndicator | 0.0632 | Single most impactful structural predictor across all flood zones
3 | waterDepth | 0.0478 | How deep the water was — non-linear depth-damage relationship captured by tree splits
4 | buildingPropertyValue | 0.0339 | Higher building value = larger absolute dollar loss (scale effect)
5 | longitude | 0.0275 | Geographic exposure — Gulf Coast vs. inland U.S.

SHAP Beeswarm — Per-Prediction Explanations

SHAP decomposes every individual prediction into the contribution of each feature. Unlike global importance, SHAP shows both direction (does this push the prediction up or down?) and magnitude for every claim in a 5,000-claim sample.

🔑 Key insight from SHAP: freeboard_positive shows only −0.078 linear correlation with the target, but SHAP reveals it as a top contributor — because its effect is conditional. When water depth is high AND the building is not elevated AND the zone is AE/VE, freeboard contributions are large and negative (reduces predicted loss when above BFE). Linear correlation averages this out to near-zero; SHAP exposes it.

SHAP Beeswarm
Figure: SHAP beeswarm plot on 5,000-claim sample — Red = high feature value, Blue = low feature value, X-axis = impact on predicted claim ($)

Model Diagnostics

Actual vs. predicted scatter and residuals plot on a 5,000-claim holdout sample. A well-calibrated model should scatter around the 45° diagonal with residuals centered near zero.

Model Diagnostics
Figure: Actual vs. Predicted (left) and Residuals Plot (right) — XGBoost v1 on 5,000-claim sample

The model under-predicts the largest catastrophic claims ($150K+) — a known limitation of tree-based models on heavy-tailed distributions. This informs the deployment recommendation: use the model's top 15% predicted segment (not just top 10%) as the triage priority queue to include a buffer for underestimated tail events.

Part 4 — The Recommendation

Three Direct Applications for FEMA / NFIP

The XGBoost model predicts expected claim payout with R² = 0.669 and MAE = $16,474. Each recommendation below is a direct, immediate application of this predictive capability.

Recommendation 01

🚨 Triage Claims the Moment They Arrive

The problem today: After a hurricane, thousands of claims land simultaneously and are processed in arrival order. A $180,000 loss at position 800 in the queue gets the same priority as a $2,000 loss at position 1.

What the model does: Score every incoming claim instantly using information already collected at submission — water depth, flood zone, building value, location, elevation status. Rank by predicted severity. Dispatch senior adjusters to the top 15% first.

Impact: Faster settlement for the largest losses. Reduced litigation. Better resource allocation at the moment it matters most.

Caveat: The model under-predicts very large catastrophic claims ($150K+). Use the top 15% as the triage priority queue — not just the top 10% — to buffer for tail underestimation.
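Operationally, the triage rule reduces to a quantile cutoff over model scores. A sketch with hypothetical predicted severities:

```python
import numpy as np

def triage_queue(predicted_severity: np.ndarray, top_frac: float = 0.15) -> np.ndarray:
    """Flag the top `top_frac` of claims by predicted payout for senior adjusters.

    The 15% cutoff (rather than 10%) buffers the model's known
    under-prediction of $150K+ tail claims.
    """
    cutoff = np.quantile(predicted_severity, 1.0 - top_frac)
    return predicted_severity >= cutoff

rng = np.random.default_rng(0)
preds = rng.lognormal(mean=9.5, sigma=1.0, size=10_000)  # hypothetical model scores
priority = triage_queue(preds)                           # boolean priority flags
```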
Recommendation 02

🔍 Find the Highest-Risk Properties Before They Flood

The problem today: NFIP knows which properties are insured but not which ones are most likely to generate large claims next season.

What the model does: Run the model on every in-force policy using current property characteristics (set waterDepth to the median historical value for that flood zone). Filter for: postFIRMConstructionIndicator = 0 + elevatedBuildingIndicator = 0 + is_coastal_state = 1 — the highest-risk segment confirmed by both EDA Q5 and the model's feature importance.

Impact: Proactive mitigation outreach to the right properties. Elevation certificate campaigns targeted where they reduce the most expected loss.

Caveat: This score assumes the property floods — it is not a flood probability estimate. Pair with historical flood frequency data for a complete risk picture.
Recommendation 03

📐 Put a Dollar Figure on Building Elevation

The problem today: Elevation-based premium discounts are set using flat rate tables — the same discount regardless of the property's specific flood zone, building value, or location.

What the model does: Run two predictions on the same property — once with elevatedBuildingIndicator = 0, once with elevatedBuildingIndicator = 1. The difference is a property-specific dollar estimate of how much elevation reduces expected loss. elevatedBuildingIndicator is our #2 feature by gain importance (0.0632).

Impact: Property-specific elevation incentives. More accurate premium pricing. Data-driven justification for mitigation grants.

Caveat: This is a model estimate, not a guaranteed saving. The SHAP analysis shows the elevation effect is modest per-claim — treat the dollar figure as a guide, not a precise actuarial calculation.

What Would Make This Model Better

🌐 Real Flood Event Features

Storm track data, National Weather Service peak stage readings, and event-level inundation maps would dramatically improve predictions for catastrophe-year claims.

📍 Neighborhood-Level Geography

Census tract or ZIP-level flood history, elevation data from USGS DEMs, and distance-to-water features would replace the coarse state-level geographic signal.

🏗️ Structural Details

Foundation type, building materials, first-floor elevation certificates, and roof construction details are known to adjusters but not in the NFIP redacted dataset.

⏱️ Temporal Deployment

A production version should retrain annually on recent catastrophe years to prevent concept drift as climate patterns shift coastal risk profiles over time.

Final Summary

From 1.99M Claims to Actionable Predictions

Starting from raw FEMA data with 73 columns and 1.99 million records, this project delivered a production-ready flood claim severity model through a rigorous end-to-end pipeline.

R² 0.669
Variation Explained by XGBoost
$16,474
Mean Absolute Error
+118%
R² Improvement vs. Linear Regression
22
Domain-Engineered Features

Core Insight

The key finding of this project is not the R² score — it is the demonstration that conditional feature interactions govern flood losses. Elevation only matters in certain zones at certain water depths. Hurricane season only matters in coastal states. Linear models average across all these conditions and fail. XGBoost discovers them automatically through tree splits — which is why it explains 66.9% of variation vs. only 30.7% for Linear Regression on the same data.