Robustness and Stability as Dimensions of Trusted AI
A model in production encounters all sorts of unclean, chaotic data, from typos to anomalous events, which can trigger unintended behavior. Find out how to test whether your model is ready for the real world.
Importance of Model Robustness and Stability for Trust in AI
Protecting your productionalized model from uncertain predictions is the subject of AI humility; identifying those weaknesses in advance and tailoring your modeling approach to minimize them is the core idea behind robust and stable model performance.
What Does It Mean for an AI System to Be Robust and Stable?
The answer to this question is intimately tied to another: How does your model respond to changes in the data? You can explore these questions by subjecting your model to testing.
For model stability, the relevant changes in your data are typically very small disturbances, against which you would ideally like to see stable and consistent behavior from your model. Unless there is a recognized hard threshold at which behavior should change, a small disturbance to an input variable should produce only a small change in the resulting prediction.
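One way to turn that expectation into a concrete test is to perturb a single feature by a small amount and compare the predictions. The sketch below assumes a scikit-learn-style model with a predict method; the feature index, perturbation size, and tolerance are illustrative choices, not prescribed values.

```python
# A minimal stability check: nudge one numeric feature by a small amount
# and confirm the prediction moves only a small amount. The model object,
# feature index, epsilon, and tolerance are illustrative assumptions.
import numpy as np

def check_stability(model, x, feature_idx, epsilon=1e-3, tolerance=1e-2):
    """Return True if a small perturbation of one feature produces
    only a small change in the model's prediction."""
    x = np.asarray(x, dtype=float)
    x_perturbed = x.copy()
    x_perturbed[feature_idx] += epsilon

    baseline = model.predict(x.reshape(1, -1))[0]
    perturbed = model.predict(x_perturbed.reshape(1, -1))[0]

    return abs(perturbed - baseline) <= tolerance
```

In practice you would repeat a check like this across many rows and features, and tune the tolerance to whatever change is acceptable for your use case.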
For model robustness, it is more appropriate to think about large disturbances or values that the model has never seen before, which are likely the result of one of three scenarios:
- A change in the process introduced new values
- Certain values are missing when they weren’t before
- The new values are simply unrealistic or erroneous inputs
Although additional protection against such outlying inputs should be operationalized as part of AI humility, it is worth understanding how your model’s predictions will respond to these data disturbances. Will your prediction blow up and assume a very high value? Will it plateau asymptotically or be driven to zero?
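A rough way to explore those questions is to deliberately push an input far outside its training range and watch what the model does. The probe below is a minimal sketch under that assumption; the baseline row, feature index, and scaling factors are illustrative, and the model is again assumed to expose a scikit-learn-style predict method.

```python
# A rough robustness probe: score the model on deliberately extreme inputs
# and inspect whether the prediction blows up, plateaus, or collapses to zero.
# The baseline row, feature index, and factors are illustrative assumptions.
import numpy as np

def probe_robustness(model, baseline_row, feature_idx, factors=(10, 100, 1000)):
    """Scale one feature far beyond its training range and report predictions."""
    results = {}
    for factor in factors:
        row = np.asarray(baseline_row, dtype=float).copy()
        row[feature_idx] *= factor
        results[factor] = model.predict(row.reshape(1, -1))[0]
    return results
```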
Sometimes you can anticipate these issues by processing data before you send it to the model for scoring. For example, consider a model exposed to a value it has never seen before. In an insurance pricing model, you input information on the make and model of the car. The training data has been cleaned, and instances of “Lexus” are properly capitalized. But a sales representative using the model to provide a quote on a call enters “lexus”. Ideally, the model will not treat this as a never-before-seen value, but will rely on a preprocessing step that makes matching case-insensitive so the prediction behaves as expected.
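A preprocessing step like that can be very small. The sketch below assumes categorical inputs arrive as free text; the set of known makes and the fallback token are illustrative placeholders, not part of any particular library.

```python
# A simple preprocessing step: normalize case and whitespace so "lexus",
# " Lexus ", and "LEXUS" all map to the category the model saw in training.
# KNOWN_MAKES and the unknown token are illustrative assumptions.
KNOWN_MAKES = {"lexus", "toyota", "honda"}

def normalize_make(raw_value, unknown_token="__unknown__"):
    """Lowercase and strip the input; fall back to a sentinel for unseen makes."""
    cleaned = str(raw_value).strip().lower()
    return cleaned if cleaned in KNOWN_MAKES else unknown_token
```

Routing genuinely unseen values to an explicit sentinel also gives downstream humility logic something concrete to react to.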
Reproducibility is a special case of data disturbance: no disturbance at all. An auditable, reliable model should return the same prediction for the same input values consistently.
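That property is easy to check directly. The sketch below scores the same row twice and demands an identical result; the model and sample row are assumptions, and exact equality is intentional here because reproducibility means byte-for-byte consistency, not approximate agreement.

```python
# A minimal reproducibility check: scoring the same row twice should
# return an identical prediction. The model and sample row are assumptions.
import numpy as np

def check_reproducibility(model, x):
    """Return True if repeated scoring of the same input is identical."""
    x = np.asarray(x, dtype=float).reshape(1, -1)
    first = model.predict(x)[0]
    second = model.predict(x)[0]
    return first == second
```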
How Do I Ensure That Model Behavior Is Stable, Robust, and Reproducible?
Although the details of these concerns might be unique to machine learning, best practices in software development already outline steps to help ensure that a final product is robust and stable.
Unit testing breaks an application down into its smallest discrete modules, each of which serves a distinct function, and then builds automated tests around them to safeguard their behavior when confronted with new inputs. Examples of such inputs include a value of the wrong data type, such as a miskeyed character, or a forbidden value, such as a negative number for a score on a 0-10 scale.
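Here is what that might look like for the two kinds of inputs just mentioned. The tests are a sketch in pytest style; validate_score is a hypothetical helper in the scoring service, not a function from the text.

```python
# Illustrative unit tests for the inputs described above: a wrong data type
# and a forbidden value. validate_score is a hypothetical helper.
import pytest

def validate_score(value):
    """Accept only numeric scores in the 0-10 range."""
    if isinstance(value, bool) or not isinstance(value, (int, float)):
        raise TypeError("score must be numeric")
    if not 0 <= value <= 10:
        raise ValueError("score must be between 0 and 10")
    return float(value)

def test_rejects_wrong_type():
    with pytest.raises(TypeError):
        validate_score("7a")  # miskeyed character

def test_rejects_forbidden_value():
    with pytest.raises(ValueError):
        validate_score(-3)  # negative score is out of range
```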
Integration testing then investigates whether these individual software modules behave and interact together as intended. You can extend these concepts directly to productionalizing a machine learning model, holding it to the same standards as any other application in your broader ecosystem.
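For a scoring service, an integration test might push a raw request through the whole preprocessing-plus-model pipeline and assert on the end-to-end result. The names below (preprocess, pricing_model) and the assertion are illustrative assumptions about how such a pipeline could be wired together.

```python
# A sketch of an integration test: run a raw request through the full
# preprocessing and scoring pipeline and check the end-to-end behavior.
# preprocess and pricing_model are hypothetical components of the service.
def test_end_to_end_quote():
    raw_request = {"make": "lexus", "model": "RX", "driver_age": 42}
    features = preprocess(raw_request)            # hypothetical preprocessing step
    quote = pricing_model.predict([features])[0]  # hypothetical trained model
    assert quote > 0                              # a quote should never be negative
```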
Robustness and Stability Are Just Pieces of the Puzzle
Robustness and stability are only two of the dimensions of model performance that contribute directly to the trustworthiness of predictive models. The full list includes the following: