My satellite image segmentation model wasn’t learning, and we were wasting a $60,000 contract.
Our classes were skewed 1 to 30!
That skew was the real culprit, but we fixed it with an easy trick.
In the early days of my career transition from the oil and gas industry to machine learning, I found myself facing a challenging project involving satellite image segmentation. It turned into a thrilling detective story, and one worth sharing.
We were tasked with building a satellite image segmentation model, and as you may know, Earth science data comes with an inherent problem: some of the classes in the imagery are far rarer than others, which presented a real challenge.
In the Earth sciences, it's common for the classes you want to identify to be unevenly distributed. In our case, we had classes like "forest," "water bodies," and "urban areas." These are naturally imbalanced because, well, urban areas are far less common than, say, forests. A class ratio of 1 to 50 is not unusual, so the imbalance has to be addressed before any machine learning model becomes useful!
It got worse!
We didn’t have time to segment more data before the first milestone meeting. The model had to work with what we had!
We needed to tackle this class imbalance head-on, and one of the methods we explored was Random Undersampling. This technique involves randomly removing some of the majority class samples to balance the class distribution. In Earth sciences, this would mean reducing the number of samples from common classes like "forest" or "water bodies."
In our case, Random Undersampling proved to be really effective, but it came with a crucial insight: contrary to common belief, we didn't need to aim for a perfect 1 to 1 class ratio.
We aimed for a ratio on the order of 1 to 7.
Capping the largest classes relative to the smallest, while still preserving some of the majority-class data, was the right choice.
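Concretely, the trick looks something like the minimal sketch below. It assumes each training sample (or tile) carries a single class label; the function name and the example counts are illustrative, not our project code.

```python
import numpy as np

def random_undersample(labels, target_ratio=7.0, rng=None):
    """Randomly drop majority-class samples until every class is capped
    at roughly `target_ratio` times the rarest class (1 to 7 here),
    rather than forcing a perfect 1 to 1 balance.

    `labels` is a 1-D array with one class id per sample (or per tile).
    Returns the sorted indices of the samples to keep.
    """
    if rng is None:
        rng = np.random.default_rng(0)
    counts = {c: int((labels == c).sum()) for c in np.unique(labels)}
    cap = int(min(counts.values()) * target_ratio)
    keep = []
    for c, n in counts.items():
        idx = np.flatnonzero(labels == c)
        if n > cap:
            idx = rng.choice(idx, size=cap, replace=False)
        keep.append(idx)
    return np.sort(np.concatenate(keep))

# Illustrative skewed label vector: class 2 is the rare "urban" class.
labels = np.array([0] * 3000 + [1] * 900 + [2] * 100)
kept = random_undersample(labels, target_ratio=7.0)
print({int(c): int((labels[kept] == c).sum()) for c in np.unique(labels)})
# {0: 700, 1: 700, 2: 100} -- majority classes capped at ~7x the rarest
```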
Here's where things got interesting. While Random Undersampling was a practical approach, we noticed that Synthetic Minority Over-sampling Technique (SMOTE) was frequently mentioned in blog posts and research papers.
But it seemed to be conspicuously absent in real-world applications.
SMOTE is a method for oversampling the minority class: it generates synthetic samples by interpolating between existing minority data points and their nearest neighbours. It's a fascinating concept, but in practice it often requires careful fine-tuning and doesn't always deliver the expected results, especially with satellite imagery.
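To make the comparison concrete, here is roughly what SMOTE looks like in code, using the imbalanced-learn package on synthetic per-pixel feature vectors. The data and parameters are purely illustrative, not from our project; SMOTE works on flat feature vectors, which is part of why applying it to segmentation imagery is awkward.

```python
import numpy as np
from imblearn.over_sampling import SMOTE

# Synthetic per-pixel feature vectors (e.g. 4 spectral bands), purely
# for illustration: 5000 samples of a common class, 100 of a rare one.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0.0, 1.0, size=(5000, 4)),
               rng.normal(3.0, 1.0, size=(100, 4))])
y = np.array([0] * 5000 + [1] * 100)

# SMOTE interpolates new minority samples between existing minority
# points and their nearest neighbours. Here it oversamples only to a
# 1 to 5 ratio (sampling_strategy=0.2) instead of a perfect balance.
smote = SMOTE(sampling_strategy=0.2, k_neighbors=5, random_state=0)
X_res, y_res = smote.fit_resample(X, y)
print(np.bincount(y_res))  # roughly [5000 1000]
```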
In the real world, where data is often messy, noisy, and complex, Random Undersampling proved to be a faster and more robust solution.
Sometimes the simplest solutions are the most effective!
@jesper I've thought a lot about random undersampling, and my intuition is that we should preserve the Zipf / Pareto distribution of classes but simply make it less pronounced. It should be possible to do this by choosing a line with a lower intercept and a shallower slope than the natural fit in log-log space, then finding the ratio in real space and using that as the probability of keeping a data point.
@drgroftehauge Yeah, there's a trade-off in making the signal "learnable" at all. I would reduce the imbalance just to the point where the model starts learning, then calibrate the predicted probabilities afterwards. (So basically the practical, "engineering heavy" approach.)
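For the curious, here is a minimal sketch of the idea from the comment above, assuming the class counts are sorted in rank order (most common first). The flattening factor and the example counts are made up for illustration.

```python
import numpy as np

def flattened_keep_probs(class_counts, flatten=0.5):
    """class_counts: per-class sample counts sorted descending (rank order).
    flatten: 0 roughly keeps the natural distribution, 1 flattens it to uniform.
    Returns one keep-probability per class."""
    counts = np.asarray(class_counts, dtype=float)
    ranks = np.arange(1, len(counts) + 1)

    # Natural fit in log-log space: log(count) = intercept + slope * log(rank).
    slope, intercept = np.polyfit(np.log(ranks), np.log(counts), 1)

    # Target line: shallower slope, anchored at the rarest class, which
    # also gives it a lower intercept than the natural fit.
    target_slope = slope * (1.0 - flatten)
    target_intercept = np.log(counts[-1]) - target_slope * np.log(ranks[-1])
    target_counts = np.exp(target_intercept + target_slope * np.log(ranks))

    # Back in real space, the ratio between the two curves becomes a
    # per-class probability of keeping a data point.
    return np.clip(target_counts / counts, 0.0, 1.0)

# Illustrative Zipf-ish class counts, most common class first.
counts = [50000, 12000, 4000, 1500, 600]
print(flattened_keep_probs(counts, flatten=0.5).round(3))
# Common classes get low keep probabilities, the rarest keeps everything,
# and the distribution stays Zipf-like but less pronounced.
```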