How to Tackle Imbalanced Data with DataRobot

June 22, 2021

· 2 min read

This post was originally part of the DataRobot Community. Visit now to browse discussions and ask questions about the DataRobot AI Platform, data science, and more.

Even with imbalanced data, DataRobot can build a good model. In this article, we explain imbalanced data and share how to best attack imbalanced data with DataRobot.

Imbalanced data occurs when there is an overrepresentation of a certain value inside the target variable. Imbalanced datasets for binary classification projects have an overrepresentation of the majority class compared to the minority class; for example, the target variable graph in Figure 1 shows the imbalance in is_bad. For regression projects, the imbalanced dataset tends to have an overrepresentation of zeros in the target variable as shown in the target variable graph for IncurredClaims (Figure 2). In these cases, when the minority class is significantly more important than the majority class, you need to take this imbalance into account when building models.

Figure 1. A binary classification project with imbalanced data

Figure 2. A regression project with imbalanced data

DataRobot automatically handles imbalanced data during its model building process by picking the best model optimization metric given the distribution of the values in the target variable.

For imbalanced binary classification datasets, DataRobot uses the LogLoss metric which is known to be robust to imbalanced data (Figure 3). For zero-inflated regression datasets, DataRobot uses Tweedie Deviance (Figure 4).

Figure 3. The Logloss metric recommended by DataRobot for an imbalanced binary classification projects

Figure 4. The Tweedie Deviance optimization metric recommended by DataRobot for zero inflated regression problems

Additionally, DataRobot includes a variety of modeling algorithms which are robust to imbalanced data, such as the different kinds of Tree-based models (Figure 5) and the frequency-severity models (Figure 6).

Figure 5. A sample of Tree based models found in DataRobot

Figure 6. A sample of Frequency Severity models found in DataRobot

You can also use the Smart Downsampling technique (from the Advanced Options tab) to intentionally downsample the majority class in order to build faster models with similar accuracy (Figure 7). This can be done for both classification or zero-inflated regression problems.

Figure 7. The Smart Downsampling tab in DataRobot

Finally, you could leverage information provided in the ROC curve to minimize the effect of imbalanced data to model performance. For instance, depending on the level of tolerance your business has for false positives versus false negatives, you could use the Prediction Distribution graph (Figure 8 ) to adjust the prediction thresholds for your classifiers. This could improve the predictions of the minority class.

Figure 8. The Prediction Distribution section inside the ROC Curve tab

You could also pay more attention to the Matthews Correlation Coefficient (Figure 9) for model evaluation because it will give you a better sense of the model’s performance over the other metrics in that tab. The Mathews Correlation Coefficient was specifically designed to evaluate models built from imbalanced datasets.

Figure 9. The Matthews Correlation Coefficient inside the ROC Curve tab

More Information

Wikipedia: Matthews correlation coefficient

DataRobot documentation portal: Smart Downsampling

About the author

Linda Haviland

Community Manager

Meet Linda Haviland

Share this post

Subscribe to DataRobot Blog

First Name

Last Name

Country

State

Yes! Please email me news and offers for DataRobot products and services.

DataRobot is committed to protecting your privacy. You can find full details of how we use your information, and directions on opting out from our marketing emails, in our Privacy Policy.

Share this post

Subscribe to DataRobot Blog

First Name

Last Name

Country

State

Yes! Please email me news and offers for DataRobot products and services.

DataRobot is committed to protecting your privacy. You can find full details of how we use your information, and directions on opting out from our marketing emails, in our Privacy Policy.

See other posts in AI & ML Expertise

Subscribe to our Blog

First Name

Last Name

Country

State

Yes! Please email me news and offers for DataRobot products and services.

DataRobot is committed to protecting your privacy. You can find full details of how we use your information, and directions on opting out from our marketing emails, in our Privacy Policy.

How to Tackle Imbalanced Data with DataRobot

More Information

Choosing the Right Vector Embedding Model for Your Generative AI Use Case

Reflecting on the Richness of Black Art

6 Reasons Why Generative AI Initiatives Fail and How to Overcome Them

Related Posts

Thanks! Check your inbox to confirm your subscription.