An end-to-end machine learning pipeline on 1.99 million FEMA flood insurance claims — from raw data to an XGBoost model that explains 66.9% of claim variation and supports real-time claims triage.
The National Flood Insurance Program (NFIP), administered by FEMA, is the primary source of flood insurance in the U.S. With over 5 million active policies and $70B+ in historical payouts, accurately predicting individual claim severity is a high-value operational problem — it drives claims triage, reinsurance reserving, and portfolio risk management.
Target variable: amountPaidOnBuildingClaim — actual NFIP payout for structural damage, ranging from $0.01 to $250,000.
How severe the flood was — captured by waterDepth, flood_zone_risk_score, and is_hurricane_season. A shallow flood causes minimal damage; six feet of surge is catastrophic.
How exposed the structure was — elevatedBuildingIndicator, height_above_BFE, postFIRMConstructionIndicator, and building_age_at_loss.
Policy terms — totalBuildingInsuranceCoverage, deductible_amount, and buildingPropertyValue all determine the ultimate settlement amount.
A shallow flood on a non-elevated Pre-FIRM building in Zone VE is catastrophic. The same depth on a Post-FIRM elevated building in Zone X causes almost no damage. These conditions interact, and linear models cannot capture those interactions.
Historical flood insurance claims from FEMA covering multiple decades, 50+ states, and every flood zone category. Starting from 73 raw columns, we retained a clean working dataset after removing leakage-prone columns, high-cardinality identifiers, and invalid records.
Raw columns are rarely in optimal form for ML. We engineered 22 features across five domain categories:
| Category | Features | Rationale |
|---|---|---|
| Building Age | building_age_at_loss | Older buildings predate flood codes; structurally more vulnerable |
| Elevation & Freeboard | height_above_BFE, height_above_ground, freeboard_positive, freeboard_deficit | Floor height above BFE is the primary physical determinant of whether water enters a building |
| Hazard Timing | is_hurricane_season, water_depth_log | June–Nov is Atlantic hurricane season; log depth approximates the non-linear depth-damage curve |
| Flood Zone | flood_zone_risk_score, is_high_risk_zone | FEMA zones have a clear ordinal risk hierarchy encoded explicitly |
| Building Type & Geography | 8 is_* flags, is_coastal_state, risk_zone_elevation_interaction, deductible_amount, cat_year_coastal, state_target_enc, year_target_enc | Structure type, coastal exposure, catastrophe year × coastal interaction |
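A few of the engineered features above can be sketched in pandas. This is a minimal illustration, not the project's actual pipeline: the input column names (`year_of_loss`, `construction_year`, `lowestFloorElevation`, `baseFloodElevation`, `month_of_loss`) are assumed stand-ins for the NFIP schema, and the sample values are invented.

```python
import numpy as np
import pandas as pd

def engineer_features(df: pd.DataFrame) -> pd.DataFrame:
    """Sketch of a few engineered features (input column names are assumptions)."""
    out = df.copy()
    # Age of the structure at the time of the flood
    out["building_age_at_loss"] = out["year_of_loss"] - out["construction_year"]
    # Freeboard: lowest-floor height relative to Base Flood Elevation (BFE)
    out["height_above_BFE"] = out["lowestFloorElevation"] - out["baseFloodElevation"]
    out["freeboard_positive"] = (out["height_above_BFE"] > 0).astype(int)
    # Atlantic hurricane season runs June through November
    out["is_hurricane_season"] = out["month_of_loss"].between(6, 11).astype(int)
    # log1p approximates the concave depth-damage curve
    out["water_depth_log"] = np.log1p(out["waterDepth"].clip(lower=0))
    return out

# Two invented example claims
claims = pd.DataFrame({
    "year_of_loss": [2005, 2017],
    "construction_year": [1968, 2001],
    "lowestFloorElevation": [8.0, 12.5],
    "baseFloodElevation": [9.0, 10.0],
    "month_of_loss": [8, 1],
    "waterDepth": [6.0, 0.5],
})
feats = engineer_features(claims)
print(feats[["building_age_at_loss", "freeboard_positive", "is_hurricane_season"]])
```
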
Each EDA question was chosen to directly inform modelling decisions — not just describe the data, but test whether a feature is worth including and in what form.
Distribution of amountPaidOnBuildingClaim — raw, log-transformed, and by frequency tier.
The raw distribution is severely right-skewed — most claims are small ($500–$15K), but extreme claims up to $250K exist. Log-transformation yields a roughly bell-shaped distribution, which matters for Linear Regression's error assumptions. Small-to-moderate claims dominate the frequency distribution, meaning a model that only predicts large losses accurately would fail on the majority of real-world claims.
Temporal patterns: claim volume and severity by month, season, and loss year.
Claims are dramatically concentrated in August–October — peak Atlantic hurricane season. The 2005 (Katrina), 2012 (Sandy), and 2017 (Harvey, Irma, Maria) hurricane years are visible as distinct severity spikes. is_hurricane_season shows one of the strongest linear correlations with the target (0.239), directly justifying its inclusion as a binary feature.
Depth-damage relationship — raw vs. log-transformed, and the non-linear curve.
The relationship between water depth and claim amount is non-linear and concave — a classic depth-damage curve from flood engineering. The first few inches of water cause a disproportionate jump in losses; marginal damage diminishes at depth extremes. This non-linearity directly motivates the water_depth_log engineered feature, which improves linear correlation from 0.156 (raw) to 0.248 (log).
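The correlation lift from the log transform can be illustrated on synthetic data. The snippet below generates an invented concave depth-damage relationship (not the NFIP data) and shows that correlating against `log1p(depth)` recovers more of the linear signal than raw depth:

```python
import numpy as np

rng = np.random.default_rng(0)
depth = rng.exponential(scale=3.0, size=50_000)  # water depth in feet (synthetic)
# Concave depth-damage curve plus noise, illustrative only
damage = 20_000 * np.log1p(depth) + rng.normal(0, 10_000, size=depth.size)

r_raw = np.corrcoef(depth, damage)[0, 1]
r_log = np.corrcoef(np.log1p(depth), damage)[0, 1]
print(f"corr(raw depth, damage) = {r_raw:.3f}")
print(f"corr(log depth, damage) = {r_log:.3f}")
```

On the real claims data this same comparison is what moves the correlation from 0.156 to 0.248.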
Claim severity and volume by flood zone category — testing whether FEMA's risk hierarchy is reflected in actual payouts.
High-risk zones (AE, VE) do produce higher average severity, but Zone X (low-risk) generates the most total claim volume due to sheer policy count — the "volume paradox." Zone designation captures the broad severity signal, but physical hazard features (water depth, elevation) are essential complements. A claim in Zone X during Hurricane Harvey can exceed one in Zone AE during a minor rain event.
Elevated vs. non-elevated buildings, Pre-FIRM vs. Post-FIRM, and the freeboard threshold effect.
Building-level interventions reduce losses, but the picture is more nuanced than expected. Crossing above BFE cuts median claims by 63% (from $29K to $11K) — a threshold effect, not a linear one. Post-FIRM buildings show higher average claims than Pre-FIRM ($41K vs $30K) due to higher property values (TIV effect), not greater vulnerability. 81% of claims (1.6M) come from non-elevated buildings, meaning the higher-severity group is also the most common type in the portfolio.
State-level total paid, average severity, and a frequency vs. severity scatter for the top 20 states.
Losses are dramatically concentrated: Louisiana ($15.9B), Florida ($13.1B), and Texas ($12.5B) account for 63.3% of all NFIP building claim payments. LA leads on volume (~360K claims at $44K avg), FL leads on per-claim severity ($44,784). Mississippi is a hidden risk — moderate volume (~50K claims) but 3rd-highest average severity at $43,343, driven by Katrina concentration. NY/NJ appear in top-15 severity due entirely to Hurricane Sandy (2012).
Correlation matrix across 20 key features and their individual correlations with the target.
Individual feature-to-target correlations are modest (max ~0.25), which is normal for insurance loss prediction. water_depth_log (0.248) and is_hurricane_season (0.239) are the strongest linear signals. Critically, freeboard_positive shows only −0.078 linear correlation, yet SHAP analysis reveals elevation as a top XGBoost feature — because its protective effect is conditional on flood zone and depth. This is the core insight: linear correlation cannot detect conditional interactions, which is precisely why XGBoost outperforms Linear Regression.
Profile comparison: top 10% vs. bottom 30% claims across 13 features.
A high-severity NFIP claim has a consistent, physically sensible profile: 8.55 inches of water depth (vs 2.21 for low-severity — 3.9×), 39.2% in high-risk zones (vs 11.2% — 3.5×), 94.5% in coastal states (vs 71.9%), and $317K average TIV (vs $113K — 2.8×). Buildings average 39.9 years old vs 27.4 years. The profiles are coherent and grounded in flood science — the model is learning real patterns, not statistical noise.
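The profile comparison is a simple quantile split. A sketch of the mechanics on synthetic data (column names mirror the project; values are random, not the reported figures):

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(1)
n = 10_000
df = pd.DataFrame({
    "amountPaidOnBuildingClaim": rng.lognormal(9.5, 1.2, n),
    "waterDepth": rng.exponential(3.0, n),
    "is_high_risk_zone": rng.integers(0, 2, n),
})

# Top 10% vs bottom 30% of claims by paid amount
hi_cut = df["amountPaidOnBuildingClaim"].quantile(0.90)
lo_cut = df["amountPaidOnBuildingClaim"].quantile(0.30)
top10 = df[df["amountPaidOnBuildingClaim"] >= hi_cut]
bot30 = df[df["amountPaidOnBuildingClaim"] <= lo_cut]

# Mean feature profile for each severity tier
profile = pd.DataFrame({
    "top_10pct": top10[["waterDepth", "is_high_risk_zone"]].mean(),
    "bottom_30pct": bot30[["waterDepth", "is_high_risk_zone"]].mean(),
})
print(profile)
```
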
We trained three models in increasing order of complexity, using each as a benchmark for the next. All three were evaluated on a held-out test set of ~380K claims.
| Model | R² | RMSE | MAE | Variation Explained | Note |
|---|---|---|---|---|---|
| Linear Regression | 0.307 | $39,888 | $25,384 | 30.7% | Baseline — fails on non-linear interactions |
| Random Forest | 0.551 | $32,116 | $19,777 | 55.1% | +80% improvement over LR |
| XGBoost v1 ✓ | 0.669 | $27,577 | $16,474 | 66.9% | Best — sequential boosting on residuals |
| XGBoost v2 (Optuna) | 0.666 | $27,694 | $16,564 | 66.6% | Automated search — slightly underperforms manual tuning |
After training, we ran the model on both the training set and the held-out test set to check if it was memorizing data rather than learning real patterns. A small gap between train and test performance means the model generalizes well.
No overfitting detected: the train-test R² gap of 0.057 is well within the healthy range.
The model is learning the real underlying patterns of flood damage — not memorizing training claims. It performs almost equally well on claims it has never seen before, which confirms it can be deployed reliably on new incoming claims.
💡 What would bad overfitting look like? If Train R² were 0.95 and Test R² were 0.40, the model would be memorizing training data and failing on new claims. A gap under 0.10 is the industry standard for acceptable generalization.
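The check itself is a one-liner. A sketch, using a train R² of 0.726 implied by the reported test R² of 0.669 plus the 0.057 gap, and the 0.10 threshold mentioned above:

```python
def generalization_gap(train_r2: float, test_r2: float, threshold: float = 0.10):
    """Return the train-test R² gap and whether it clears the overfitting threshold."""
    gap = train_r2 - test_r2
    return gap, gap <= threshold

# Test R² of 0.669 with a gap of 0.057 implies train R² ≈ 0.726
gap, ok = generalization_gap(0.726, 0.669)
print(f"gap={gap:.3f}, generalizes={ok}")
```
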
Two complementary interpretation methods: XGBoost's built-in gain importance (global, which features are used most) and SHAP values (per-prediction, which features pushed each claim up or down).
| Rank | Feature | Importance | Interpretation |
|---|---|---|---|
| 1 | cat_year_coastal | 0.2676 | Catastrophe year × coastal state — Katrina (2005), Harvey (2017), Sandy (2012) dominate the signal |
| 2 | elevatedBuildingIndicator | 0.0632 | Single most impactful structural predictor across all flood zones |
| 3 | waterDepth | 0.0478 | How deep the water was — non-linear depth-damage relationship captured by tree splits |
| 4 | buildingPropertyValue | 0.0339 | Higher value building = larger absolute dollar loss (scale effect) |
| 5 | longitude | 0.0275 | Geographic exposure — Gulf Coast vs. inland U.S. |
SHAP decomposes every individual prediction into the contribution of each feature. Unlike global importance, SHAP shows both direction (does this push the prediction up or down?) and magnitude for every claim in a 5,000-claim sample.
🔑 Key insight from SHAP: freeboard_positive shows only −0.078 linear correlation with the target, but SHAP reveals it as a top contributor — because its effect is conditional. When water depth is high AND the building is not elevated AND the zone is AE/VE, freeboard contributions are large and negative (reduces predicted loss when above BFE). Linear correlation averages this out to near-zero; SHAP exposes it.
Actual vs. predicted scatter and residuals plot on a 5,000-claim holdout sample. A well-calibrated model should scatter around the 45° diagonal with residuals centered near zero.
The model under-predicts the largest catastrophic claims ($150K+) — a known limitation of tree-based models on heavy-tailed distributions. This informs the deployment recommendation: use the model's top 15% predicted segment (not just top 10%) as the triage priority queue to include a buffer for underestimated tail events.
The XGBoost model predicts expected claim payout with R² = 0.669 and MAE = $16,474. Each recommendation below is a direct, immediate application of this predictive capability.
The problem today: After a hurricane, thousands of claims land simultaneously and are processed in arrival order. A $180,000 loss at position 800 in the queue gets the same priority as a $2,000 loss at position 1.
What the model does: Score every incoming claim instantly using information already collected at submission — water depth, flood zone, building value, location, elevation status. Rank by predicted severity. Dispatch senior adjusters to the top 15% first.
Impact: Faster settlement for the largest losses. Reduced litigation. Better resource allocation at the moment it matters most.
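The triage logic above reduces to scoring, sorting, and slicing. A sketch with a hypothetical batch of scored claims (the 15% cutoff follows the buffer recommendation for underestimated tail events):

```python
import numpy as np
import pandas as pd

def build_triage_queue(claims: pd.DataFrame, predicted: np.ndarray,
                       top_frac: float = 0.15) -> pd.DataFrame:
    """Rank incoming claims by predicted severity; flag the top slice for senior adjusters."""
    ranked = claims.assign(predicted_severity=predicted).sort_values(
        "predicted_severity", ascending=False
    )
    cutoff = max(1, int(np.ceil(top_frac * len(ranked))))
    ranked["priority"] = ["senior"] * cutoff + ["standard"] * (len(ranked) - cutoff)
    return ranked

# Hypothetical batch: 100 claims with model-predicted payouts
batch = pd.DataFrame({"claim_id": range(100)})
scores = np.random.default_rng(3).lognormal(9.5, 1.0, 100)
queue = build_triage_queue(batch, scores)
print(queue.head(3))
```
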
The problem today: NFIP knows which properties are insured but not which ones are most likely to generate large claims next season.
What the model does: Run the model on every in-force policy using current property characteristics (set waterDepth to the median historical value for that flood zone). Filter for: postFIRMConstructionIndicator = 0 + elevatedBuildingIndicator = 0 + is_coastal_state = 1 — the highest-risk segment confirmed by both EDA Q5 and the model's feature importance.
Impact: Proactive mitigation outreach to the right properties. Elevation certificate campaigns targeted where they reduce the most expected loss.
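The portfolio scan above can be sketched as a counterfactual scoring pass. The feature list is a small illustrative subset, and `StubModel` stands in for the trained XGBoost regressor so the example is self-contained:

```python
import numpy as np
import pandas as pd

FEATURES = ["waterDepth", "postFIRMConstructionIndicator",
            "elevatedBuildingIndicator", "is_coastal_state"]  # illustrative subset

class StubModel:
    """Stand-in for the trained XGBoost regressor (illustration only)."""
    def predict(self, X: pd.DataFrame) -> np.ndarray:
        return 5_000 * np.log1p(X["waterDepth"].to_numpy()) + 2_000

def flag_high_risk(policies: pd.DataFrame, model,
                   zone_median_depth: dict) -> pd.DataFrame:
    scored = policies.copy()
    # Counterfactual: assume the zone-median historical water depth for each policy
    scored["waterDepth"] = scored["floodZone"].map(zone_median_depth)
    scored["expected_loss"] = model.predict(scored[FEATURES])
    # Highest-risk segment: Pre-FIRM, non-elevated, coastal
    mask = (
        (scored["postFIRMConstructionIndicator"] == 0)
        & (scored["elevatedBuildingIndicator"] == 0)
        & (scored["is_coastal_state"] == 1)
    )
    return scored[mask].sort_values("expected_loss", ascending=False)

# Three invented in-force policies
policies = pd.DataFrame({
    "floodZone": ["VE", "X", "AE"],
    "postFIRMConstructionIndicator": [0, 0, 1],
    "elevatedBuildingIndicator": [0, 0, 0],
    "is_coastal_state": [1, 0, 1],
})
high_risk = flag_high_risk(policies, StubModel(), {"VE": 6.0, "AE": 4.0, "X": 1.0})
print(high_risk)
```
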
The problem today: Elevation-based premium discounts are set using flat rate tables — the same discount regardless of the property's specific flood zone, building value, or location.
What the model does: Run two predictions on the same property — once with elevatedBuildingIndicator = 0, once with elevatedBuildingIndicator = 1. The difference is a property-specific dollar estimate of how much elevation reduces expected loss. elevatedBuildingIndicator is our #2 feature by gain importance (0.0632).
Impact: Property-specific elevation incentives. More accurate premium pricing. Data-driven justification for mitigation grants.
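The two-prediction counterfactual above is straightforward to implement. A sketch with a self-contained `StubModel` in place of the trained regressor (the stub's 60% loss reduction when elevated is invented for illustration):

```python
import numpy as np
import pandas as pd

def elevation_benefit(policy_row: pd.Series, model, feature_cols) -> float:
    """Dollar reduction in expected loss from elevating this specific property.

    Runs the model twice on the same property, toggling elevatedBuildingIndicator.
    """
    base = policy_row.copy(); base["elevatedBuildingIndicator"] = 0
    elev = policy_row.copy(); elev["elevatedBuildingIndicator"] = 1
    X = pd.DataFrame([base[feature_cols], elev[feature_cols]])
    pred_not_elev, pred_elev = model.predict(X)
    return float(pred_not_elev - pred_elev)

class StubModel:
    """Stand-in for the trained regressor; elevation cuts predicted loss (invented)."""
    def predict(self, X: pd.DataFrame) -> np.ndarray:
        base = 4_000 * np.log1p(X["waterDepth"].to_numpy())
        return base * np.where(X["elevatedBuildingIndicator"].to_numpy() == 1, 0.4, 1.0)

row = pd.Series({"waterDepth": 5.0, "elevatedBuildingIndicator": 0})
benefit = elevation_benefit(row, StubModel(), ["waterDepth", "elevatedBuildingIndicator"])
print(f"property-specific elevation benefit: ${benefit:,.0f}")
```

The returned difference is the property-specific dollar figure that would replace a flat rate-table discount.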
Storm track data, National Weather Service peak stage readings, and event-level inundation maps would dramatically improve predictions for catastrophe-year claims.
Census tract or ZIP-level flood history, elevation data from USGS DEMs, and distance-to-water features would replace the coarse state-level geographic signal.
Foundation type, building materials, first-floor elevation certificates, and roof construction details are known to adjusters but not in the NFIP redacted dataset.
A production version should retrain annually on recent catastrophe years to prevent concept drift as climate patterns shift coastal risk profiles over time.
Starting from raw FEMA data with 73 columns and 1.99 million records, this project delivered a production-ready flood claim severity model through a rigorous end-to-end pipeline.
The key finding of this project is not the R² score — it is the demonstration that conditional feature interactions govern flood losses. Elevation only matters in certain zones at certain water depths. Hurricane season only matters in coastal states. Linear models average across all these conditions and fail. XGBoost discovers them automatically through tree splits — which is why it explains 66.9% of variation vs. only 30.7% for Linear Regression on the same data.