The Demise of A/B Testing - ADEPT Decisions

In this post, Jarrod McElhinney journeys through his experience of A/B testing.

A/B (or Champion/Challenger) testing

When I started in the credit industry as a programmer working on FICO’s TRIAD product back in the late 1990s, I was introduced to a number of concepts that were firmly entrenched in the credit vetting and account management lexicon. My first introduction to the mathematics and prediction capability of scoring and scorecards blew my mind, but it was A/B (or Champion/Challenger) testing that would ultimately become a massively important part of my career.

Fast forward a few years to when I relocated back to South Africa and completed my migration from an IT professional to a more business focused role as a credit risk management consultant. There I would have the privilege of working with some of the most talented people in the industry, as we moved credit strategies from experiential best practice, through to basic outcomes data driven, and ultimately to outcome modelled solutions. It was an incredible time to be part of the industry, when credit risk consultants and credit risk managers were learning together, continually striving to empirically beat the returns of the previous generation of strategies, using increasingly advanced tools to do so.

A/B testing across the credit life cycle

Underpinning everything that we did back then was rigorous and clean A/B testing. We would design new strategies across all areas of the credit life cycle, and then painstakingly work out how to combine these strategies into the randomly allocated test groups that we had available. This approach enabled us to measure the impact of single strategies, but also their effect when applied in combination with other strategies in other account management areas. For example, we would combine aggressive limit management strategies on one side with rehabilitative collections strategies on the other, and all sorts of marketing and pricing tests in the middle.

Due to the properly structured A/B strategy assignment, and the statistically tested random number assignment that was used to ensure that all test groups were equal at the start of a test cycle, we were able to quantify to a very fine degree of detail the value of a strategy to the business. A variety of outcome criteria, including revenue, costs, bad debt, operational impact and ultimately profitability, were measured and analysed through the course of a strategy generation. This enabled us to crown new winners and move forward to the next generation of testing, knowing not only that our strategies had a positive impact on the portfolios, but also exactly how much value they had added.
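To make the idea of random test group allocation concrete, here is a minimal sketch in Python. It is not the method used in TRIAD or any other product; the hashing approach, the function names `random_digit` and `assign_group`, the salt value and the 20% challenger share are all assumptions chosen for illustration.

```python
import hashlib
from collections import Counter


def random_digit(customer_id: str, salt: str = "cycle-01") -> int:
    """Map a customer ID to a stable number in the range 0..99.

    Hashing (an assumption for this sketch) keeps the assignment
    repeatable: the same customer always lands on the same digit
    within a test cycle. Changing the salt reshuffles the groups.
    """
    digest = hashlib.sha256(f"{salt}:{customer_id}".encode()).hexdigest()
    return int(digest, 16) % 100


def assign_group(digit: int, challenger_share: int = 20) -> str:
    """Send digits below the share to the challenger, the rest to the champion."""
    return "challenger" if digit < challenger_share else "champion"


# Hypothetical customer base of 10,000 accounts.
customers = [f"C{n:06d}" for n in range(10_000)]
groups = Counter(assign_group(random_digit(c)) for c in customers)

# The split should be close to 80/20; in practice the groups would also be
# checked for balance on score, balance, delinquency and so on.
print(groups)
```

Because the allocation depends only on the customer identifier and not on any customer characteristic, differences in outcomes between the groups can be attributed to the strategies rather than to the make-up of the groups.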

Testing, testing and testing

For a number of the organisations that I worked with, these generations of strategies evolved over many years, exceeding 10 or even 20 testing cycles. For each testing cycle, we managed somehow to improve the results for the business with the new challengers that we created. That is not to say that all challengers were successful; we certainly had our missteps along the way. The key outcome was that we always knew what we had tested, why and how we did it, and what the impact was on the business, both good and bad. A tremendous amount of intellectual capital was built up amongst the credit risk teams as they really learnt what made their customer population tick.

At first glance, it seems that we were able to continue with this approach for so long because the environment was stable. We did not have the proliferation of data sources and computing power that we have now, which enables data analysts to continually refine their models and predictions. However, on reflection, the credit environment in the 2000s was not at all stable.

External impact on testing frequency

From a data standpoint, the credit bureaux were expanding massively, going from offering a single credit score to offering a vast range of data characteristics that organisations could bring into their strategies. The frequency of data retrieval increased too, with credit grantors moving from quarterly batches of data for account management use (referred to as FatMAN runs) to monthly data, and to real time trigger events that fed data in as and when pertinent changes occurred.

From a strategy design standpoint, the tools were advancing at an incredible rate. While we started the millennium iterating on best practice expert strategies, where we would incrementally push the boundaries for each generation, we closed the decade with highly accurate profit measuring models. These were used first with tree-based data prediction models, and later with strategy optimisation to define the next generation of strategy tests. As our tools improved, we adjusted our strategy design methodology and continued to iterate and improve our credit strategies.

Use your testing tools

In recent years, in my role with CRC, I have been responsible for working with our clients to ensure that they maximise the value and benefits of the ADEPT Decisions Platform (ADP) decision engine in which they have invested.

This decision engine is first rate and has all of the tools built from the ground up to support A/B testing: from the automatic allocation of representative random numbers to customers, through the allocation of strategies and reporting built around random digit groups, to the what-if analysis that empowers clients to test their latest ideas on historic data before committing them to tests.

However, as excited as I am about these capabilities, and as keen as my clients are to use this functionality when they first embark on the ADP integration, my experience has been that they do not often make use of this testing functionality once the system is live.

A/B Testing today?

So, what has happened between the first decade of the millennium and now? How have we arrived at a credit risk environment where A/B testing is no longer seen as the best way to advance strategies? And is my experience isolated, or is this an industry-wide phenomenon?

Although I cannot answer the last question, I would love to get some feedback from the credit risk community. Is your organisation still employing rigorous A/B testing (if it ever did)? If not, what approaches are you following to test and learn, and are these giving you the results that you are looking for? Please do comment on this article and let me know, as I will be very grateful for the feedback.

Below are some of my opinions as to why I think A/B testing has lost its popularity, and also what I see as the impact of the new approaches that are being followed. As always, it is just my opinion, and I would really appreciate other opinions and viewpoints in the comments.

From my observations over the last few years, it appears that many organisations have exchanged rigour for speed, even though the two do not have to be mutually exclusive. While the approach of running A/B tests for 6-12 months before assessing and crowning a winner (even longer in credit originations) was acceptable and understood by businesses ten years ago, it is simply not rapid enough now.

Do better strategies still require testing?

It also seems that many organisations have started equating better models with better strategies, believing (incorrectly in a lot of cases) that a better Gini in their predictive models directly translates to better credit risk management strategies. This approach is clearly flawed!

In these organisations, data scientists have become the new credit risk managers. Continually building new models and incorporating new data, they can statistically demonstrate an improvement in the predictive power of their models at an accelerated rate. The deployment of these models is taken to be strategic business improvement.

Fundamentally, the data scientists are not wrong in their approach. They are achieving their objectives by continually refining the predictive models that are deployed in the credit strategy. What seems to have been lost along the way is both how the models are used and the understanding that customer reaction to actions outside of what has been done historically cannot be modelled. Such new actions should therefore be thoroughly tested. In order to empirically test them, one needs to apply them to a portion of the customer base and track the performance against the current actions.

Removing test groups removes strategy accountability

The primary issue associated with applying strategy or model changes across the entire customer base without test groups is the lack of measurable outcomes, and thus of accountability for the new strategy and models. Just because a new model showed an increase in Gini in development does not mean that its application in a credit granting strategy will deliver positive results.

Even if we do assume that the model will deliver better results, there is a limit to how much value can be added to the organisation with a better segmenting model. To illustrate this point with an example from the originations area, if the Gini improvement of new models comes mainly from below the applications cut-off, and there is a limited swap set between the accepted and declined populations, the actual effect of a new improved model will be negligible to the profitability of the organisation.
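The swap set argument can be illustrated with a small Python sketch. The scores, the cut-off of 600 and the function name `swap_set` are invented for this example and do not come from any real portfolio.

```python
def swap_set(old_scores, new_scores, old_cutoff, new_cutoff):
    """Count applicants whose accept/decline decision changes between models.

    Returns (swap_in, swap_out):
      swap_in  - declined by the old model, accepted by the new one
      swap_out - accepted by the old model, declined by the new one
    """
    swap_in = swap_out = 0
    for old, new in zip(old_scores, new_scores):
        old_accept = old >= old_cutoff
        new_accept = new >= new_cutoff
        if new_accept and not old_accept:
            swap_in += 1
        elif old_accept and not new_accept:
            swap_out += 1
    return swap_in, swap_out


# Hypothetical applicants scored by the old and the new model.
# The new model re-ranks the three applicants below the cut-off,
# which may lift its Gini, but nobody crosses the cut-off.
old = [400, 450, 500, 620, 680, 740]
new = [300, 520, 410, 625, 690, 735]

print(swap_set(old, new, old_cutoff=600, new_cutoff=600))
```

In this made-up case the new model orders the declined applicants differently, yet every accept or decline decision stays the same, so the lending outcome for the business does not change at all.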

A/B Testing tests the strategy and guards your bottom line

It is the actual credit strategy where the majority of the value to the organisation resides. This is where we apply the tools and models to improve and measure the lender’s bottom line. While good predictive models are a crucial component of a credit strategy, amending the credit acceptance rate and loan or limit amount will have a far bigger impact on the business than incrementally better predictive power in models.

Furthermore, the only way to quantify the value of these changes is to deploy these strategies randomly to a portion of the customer base and measure them in parallel against the current strategy. Welcome back, A/B testing!
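As a rough sketch of what measuring in parallel can look like, the Python below compares average profit per account between a champion and a challenger group. The profit figures and the function name `compare_groups` are hypothetical, and the normal approximation used for the interval is an assumption that only becomes reasonable with the thousands of accounts that a real test group would contain.

```python
import math
import statistics


def compare_groups(champion, challenger, z=1.96):
    """Return the difference in mean outcome and an approximate 95% interval.

    Uses a simple normal approximation with unpooled variances
    (an assumption for this sketch, not a prescribed method).
    """
    diff = statistics.mean(challenger) - statistics.mean(champion)
    std_error = math.sqrt(
        statistics.variance(champion) / len(champion)
        + statistics.variance(challenger) / len(challenger)
    )
    return diff, diff - z * std_error, diff + z * std_error


# Hypothetical profit per account over the outcome period.
champion_profit = [10, 12, 9, 11, 13, 10]
challenger_profit = [12, 14, 11, 13, 15, 12]

diff, lower, upper = compare_groups(champion_profit, challenger_profit)
print(f"uplift per account: {diff:.2f} (approx. 95% interval {lower:.2f} to {upper:.2f})")
```

If the interval sits clearly above zero, the challenger has a measurable claim to becoming the new champion; if it straddles zero, the test has not yet shown a difference either way.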

That is enough on the subject for now. Next month I aim to look into A/B testing in a little more detail, so keep an eye out for that article. Until then, I would appreciate some reader feedback on the opinions expressed here. Let us get a discussion going on our LinkedIn page!