Testing Whether An IVF Prediction Model “Works”
Prediction modeling requires extensive interdisciplinary work. I’d like to remind readers that I’m not a statistician or prediction modeler myself. But after seven years of directing and working with a top-notch team to develop and apply prediction modeling to infertility, I have come to see myself as a guide to the meshing of these two disciplines. I thank our Chief Statistician, Dr. Bokyung Choi, for discussions and review of this blog post to ensure its accuracy.
Prediction modeling is downright exciting. I’ve already talked about our approach to building IVF prediction models, but for our statisticians, the more challenging task is to test whether a prediction model “works”. Anyone can build a prediction model these days, but how do we know if we can trust it?
There are many levels of validation. Many researchers perform a type of validation that is called internal validation, meaning that they test how well the prediction model works on a portion of the data that was used to develop or train that same model. This approach alone is not rigorous enough, because prediction models tend to do superbly when applied to the data that was used to build them. It’s like a self-fulfilling prophecy.
External Validation
Our research team performs external validation. This is one of those terms that mean completely different things to physicians, researchers, statisticians, and business people. First, I’ll talk about how most physicians and researchers interpret external validation, and then I will explain how statisticians think of this term.
Conventionally, researchers use external validation to mean that laboratory results (e.g., the predictive value of a molecular biomarker) that were established based on one clinical center’s experience can also be applied to another center’s patients. Thus, the word external in this context refers to a different patient population, or a healthcare facility that is geographically separate.
In our prediction modeling work, external validation means that we apply the model to an independent data set (called the test set) to test whether the predicted probabilities of outcomes are consistent with the true outcomes. The use of an independent data set is required to establish the accuracy and reproducibility of the IVF prediction model itself.
What exactly are we validating? In the validation work, we test how well the IVF prediction model performs on the independent test set across several measures: predictive power, discrimination, calibration, dynamic range, and reclassification. These quantitative measures allow us to compare different models using the same metrics. We cannot judge the performance or utility (usefulness) of a model unless we know how it performs in all of these areas.
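For readers who like to see the idea in code, here is a minimal sketch of what this kind of validation can look like. It is not Univfy’s actual pipeline: the synthetic data, the plain logistic regression model, and the simple hold-out split standing in for a truly independent test set are all illustrative assumptions. The point is only that the model is built on one data set and every measure is computed on a separate test set.

```python
# Illustrative sketch of external validation -- not Univfy's actual pipeline.
# Synthetic data and a plain logistic regression stand in for real clinical
# predictors and the real model.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.calibration import calibration_curve

# Synthetic "patients": 10 clinical predictors, binary live-birth outcome.
X, y = make_classification(n_samples=5000, n_features=10, random_state=0)

# A simple hold-out split stands in for a truly independent test set here.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0)

# Fit the model on the training data only.
model = LogisticRegression(max_iter=1000).fit(X_train, y_train)

# Predicted probabilities of a live birth on the held-out test set.
p_test = model.predict_proba(X_test)[:, 1]

# Discrimination: how well predictions rank live births above non-live births.
auc = roc_auc_score(y_test, p_test)

# Calibration: do predicted probabilities agree with observed outcome rates?
obs_rate, mean_pred = calibration_curve(y_test, p_test, n_bins=10)

print(f"Discrimination (AUC): {auc:.3f}")
print("Calibration (observed vs. predicted, per bin):")
print(np.round(np.c_[obs_rate, mean_pred], 2))

# Dynamic range: how widely the predictions spread across patients.
print(f"Dynamic range: {p_test.min():.2f} to {p_test.max():.2f}")
```

Reclassification, not shown in this sketch, would compare how two models sort the same patients into probability categories.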
Predictive Power
Predictive power measures how much more likely the test data are under one prediction model than under another (e.g., the control model). Non-statisticians would ask, “Which model is better? Which model gives a more accurate prediction?” Statisticians would ask whether the data fit the model well, or whether the fit is good. To answer these questions, we measure predictive power using a number called the “log-likelihood,” which quantifies how likely the observed data are under the model. We measure this fit with a method called posterior log-likelihood and obtain its odds ratio compared with the age control model. (Univfy’s research team has coined this measure PLORA, for posterior log-likelihood of the odds ratio against the age model.)
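To make this less abstract, the sketch below continues the example above and simply compares log-likelihoods on the test set. PLORA itself is our research team’s own measure and its exact formula isn’t reproduced here; treat the age-only control model and the plain log-likelihood difference below as illustrative assumptions rather than the real computation.

```python
# Continuing the sketch above. PLORA's exact formula is Univfy's own;
# this only illustrates the underlying idea of comparing log-likelihoods.
from sklearn.metrics import log_loss

# Age-only control model: pretend the first predictor column is patient age.
age_model = LogisticRegression(max_iter=1000).fit(X_train[:, [0]], y_train)
p_age = age_model.predict_proba(X_test[:, [0]])[:, 1]

# log_loss with normalize=False gives the total negative log-likelihood,
# so negate it to get each model's log-likelihood on the test set.
ll_full = -log_loss(y_test, p_test, normalize=False)
ll_age = -log_loss(y_test, p_age, normalize=False)

# The difference of log-likelihoods is the log of the likelihood ratio:
# how much more likely the test data are under the full model than under age alone.
print(f"Full model log-likelihood: {ll_full:.1f}")
print(f"Age-only model log-likelihood: {ll_age:.1f}")
print(f"Log-likelihood ratio (full vs. age): {ll_full - ll_age:.1f}")
```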
Let’s say we’re building a prediction model to predict the probability of having a live birth with the first IVF cycle, given a set of clinical factors. The actual technical measures (the log-likelihoods) of the prediction model and the age control model are negative numbers that may seem meaningless without a reference. Therefore, we establish a reference: the average live birth rate, or the probability of having a live birth without using any predictors, not even age. With this reference and a formula that we constructed, we can determine the improvement of a prediction model over the prediction of the age control model, and show this measure as a percentage of improvement. This percentage of improvement helps us determine whether one prediction model is “better”, “more accurate”, or “has higher predictive power” than another.
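Again, as a rough illustration only: the exact formula we constructed is not reproduced here, so the sketch below assumes one plausible version, in which each model is scored by how much it improves on the no-predictor reference, and the full model’s extra gain is then expressed as a percentage of the age model’s gain.

```python
# Continuing the sketch. The reference uses no predictors at all: it predicts
# the overall live-birth rate of the training set for every patient.
# NOTE: the percentage formula below is an illustrative assumption, not the
# exact formula described in the post.
base_rate = y_train.mean()
p_ref = np.full(len(y_test), base_rate)
ll_ref = -log_loss(y_test, p_ref, normalize=False)

# How much each model improves on the no-predictor reference.
gain_full = ll_full - ll_ref
gain_age = ll_age - ll_ref

# Express the full model's extra gain as a percentage of the age model's gain.
improvement_pct = 100 * (gain_full - gain_age) / gain_age
print(f"Improvement over the age control model: {improvement_pct:.1f}%")
```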
About the Author:
Mylene Yao, M.D. | Co-Founder and CEO of Univfy®
Dr. Mylene Yao is a board-certified OB/GYN with more than 20 years of experience in clinical and reproductive medicine research. Prior to founding Univfy, she was on the faculty at Stanford University, where she led NIH-funded fertility and embryo genetics research and developed The Univfy AI Platform with the academic founding team. See her full bio here.