How do you make sure your AI doesn’t suddenly make up hurricanes everywhere?
It’s tricky. First the model doesn’t learn the rare events at all, then it gets enamoured with them!
Here are 8 techniques and tools that help:
1. Balanced Sampling
We don’t have to balance rare and common events 1:1.
The common wisdom is that most models will learn fine at an imbalance of 1:7.
pip install imbalanced-learn
If you’re worried about over-predicting, look at the last point!
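A minimal sketch of that resampling, assuming imbalanced-learn is installed; the data is synthetic and the 1/7 ratio is just the rule of thumb from above:

from collections import Counter

from imblearn.under_sampling import RandomUnderSampler
from sklearn.datasets import make_classification

# Synthetic "hurricane" data: about 1% positive class.
X, y = make_classification(n_samples=20_000, weights=[0.99, 0.01], random_state=0)
print("before:", Counter(y))

# sampling_strategy=1/7 keeps roughly seven common samples per rare event.
sampler = RandomUnderSampler(sampling_strategy=1 / 7, random_state=0)
X_res, y_res = sampler.fit_resample(X, y)
print("after: ", Counter(y_res))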
2. Use Evaluation Metrics
Use evaluation metrics that work well with imbalanced data, like the F1 score:
sklearn.metrics.f1_score()
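A minimal sketch with made-up labels, showing why accuracy hides what the F1 score reveals:

from sklearn.metrics import accuracy_score, f1_score

y_true = [0, 0, 0, 0, 0, 0, 0, 0, 1, 1]
y_pred = [0, 0, 0, 0, 0, 0, 0, 0, 0, 1]  # misses one of the two rare events

print("accuracy:", accuracy_score(y_true, y_pred))  # 0.9 looks great
print("f1:", f1_score(y_true, y_pred))              # ~0.67 tells the real story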
3. Adjust Decision Threshold
Predicting a probability and choosing a threshold afterwards lets domain experts pick the decision boundary that best fits the problem!
But be aware that the Receiver Operating Characteristic (ROC) curve can look overly optimistic under heavy class imbalance; use the precision-recall curve instead.
sklearn.metrics.precision_recall_curve()
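A minimal sketch of picking a threshold from the precision-recall curve; the model and data are placeholders, and maximising F1 is just one common selection rule:

import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import precision_recall_curve
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=10_000, weights=[0.95, 0.05], random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=0)

proba = LogisticRegression(max_iter=1000).fit(X_tr, y_tr).predict_proba(X_te)[:, 1]

precision, recall, thresholds = precision_recall_curve(y_te, proba)
f1 = 2 * precision * recall / (precision + recall + 1e-12)
best = np.argmax(f1[:-1])  # the last precision/recall point has no matching threshold
print(f"threshold={thresholds[best]:.3f}, precision={precision[best]:.2f}, recall={recall[best]:.2f}")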
4. Cost-sensitive Learning
Assign different costs to different prediction errors.
This can be done by defining a misclassification cost matrix, or by using algorithms with built-in support for class and sample weights, like XGBoost and its siblings!
xgb_classifier.fit(X, y, sample_weight=classes_weights)
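A minimal sketch of the same idea with XGBoost; the majority/minority ratio is a common starting point for the weight, not a tuned misclassification cost:

import numpy as np
from sklearn.datasets import make_classification
from xgboost import XGBClassifier

X, y = make_classification(n_samples=10_000, weights=[0.97, 0.03], random_state=0)
ratio = (y == 0).sum() / (y == 1).sum()  # roughly how many common events per rare one

# Option A: let XGBoost up-weight errors on the rare class internally.
XGBClassifier(scale_pos_weight=ratio, eval_metric="logloss").fit(X, y)

# Option B: explicit per-sample weights, as in the fit() call above.
classes_weights = np.where(y == 1, ratio, 1.0)
XGBClassifier(eval_metric="logloss").fit(X, y, sample_weight=classes_weights)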
5. Feature Engineering
In some cases, good ol’ feature engineering does wonders: hand-crafted features give the algorithm a much clearer view of the relations in the dataset.
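A minimal sketch on a made-up weather frame; the column names and derived features are purely illustrative:

import pandas as pd

df = pd.DataFrame({
    "wind_speed_kmh": [45, 160, 80, 210],
    "pressure_hpa": [1008, 950, 995, 915],
    "sea_temp_c": [24.0, 29.5, 26.0, 30.1],
})

# Interactions and deltas often separate rare events better than raw readings.
df["wind_per_pressure_drop"] = df["wind_speed_kmh"] / (1013 - df["pressure_hpa"])
df["warm_sea"] = (df["sea_temp_c"] > 26.5).astype(int)
print(df)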
6. Collect More Data
… for hurricanes, maybe do some simulations.
We have enough of them already.
7. Continuous Model Monitoring
Implement monitoring strategies that look at model drift and flag predictions that deviate significantly.
I write about model drift in the book I give away to my newsletter readers as a gift! You’ll find it on my profile when you follow.
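A minimal sketch of one simple drift check, assuming SciPy is available; the score windows and the p-value cut-off are placeholders:

import numpy as np
from scipy.stats import ks_2samp

rng = np.random.default_rng(0)
reference_scores = rng.beta(2, 20, size=5_000)  # predicted probabilities at deployment time
live_scores = rng.beta(2, 10, size=5_000)       # recent predictions, shifted upward

stat, p_value = ks_2samp(reference_scores, live_scores)
if p_value < 0.01:
    print(f"possible drift: KS={stat:.3f}, p={p_value:.4f} -> flag for review")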
8. Model Calibration
Train the model on balanced data, like in #1.
Then we’ll add a calibration layer on top, so the predicted probabilities match how often the event actually occurs in the data.
sklearn.calibration.CalibratedClassifierCV()
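A minimal sketch of that two-step recipe, assuming a scikit-learn version that still supports cv='prefit' (newer releases wrap the fitted model in FrozenEstimator instead):

from imblearn.under_sampling import RandomUnderSampler
from sklearn.calibration import CalibratedClassifierCV
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=20_000, weights=[0.99, 0.01], random_state=0)
X_tr, X_cal, y_tr, y_cal = train_test_split(X, y, stratify=y, random_state=0)

# Step 1: train on balanced data, as in #1.
X_bal, y_bal = RandomUnderSampler(random_state=0).fit_resample(X_tr, y_tr)
base = RandomForestClassifier(random_state=0).fit(X_bal, y_bal)

# Step 2: calibrate on a held-out split that keeps the real class frequencies,
# so predicted probabilities line up with how often the event actually occurs.
calibrated = CalibratedClassifierCV(base, method="sigmoid", cv="prefit")
calibrated.fit(X_cal, y_cal)
print(calibrated.predict_proba(X_cal[:5])[:, 1])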
So many ways to make machine learning work in the real world!