In partnership with HelioCampus, BSU is using its Data Warehouse and predictive modeling to better understand why students leave BSU. The findings will help us develop processes and policies that retain more students through degree completion. At the start of this project in early 2023, we had the following set of core deliverables.

To reach these deliverables, we followed this set of steps:

We defined the problem as an observed decline in student retention. BSU had been experiencing declining first-Fall student retention for a number of terms. BSU leadership wanted to better understand why this decline was occurring and, ultimately, to get recommendations for improving the retention of first-year students. We started with the following definition of the problem.

The initial process included exploring BSU's data in descriptive form and looking for correlations that might be insightful. We were also guided by existing research on student success and on which factors are most likely to have predictive value. One mistake to avoid when building a data science model is the temptation to throw every available variable at the algorithm and hope the computer can sort it out. This almost always fails because the algorithm does not have enough context to understand the nuances of every field. A simple example is including student ID in model development. When that happens, and the time periods are not well defined, the algorithm can find student ID to be a predictor of persistence, but we understand that this is just an artifact and not a meaningful contributor to the model.

The development of the model led us through a process of additional data collection, data transformation, feature engineering, model development and feedback from subject matter experts. After testing several models, we decided that a random forest was the best way to predict student retention using the variables we had determined were important. A random forest is an ensemble of many decision trees, each trained on a random sample of the data and a random subset of the variables. The individual trees all “vote” on how to classify each record, and the majority vote becomes the result.
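
The voting idea can be illustrated with a minimal sketch. This is not the production model: it assumes a tiny made-up data set (high school GPA and credits attempted are hypothetical variables chosen only for illustration), and each "tree" is reduced to a single split so the example stays short. What it does show is the three ingredients described above: a random sample of the rows per tree, a random subset of the variables per tree, and a majority vote.

```python
import random
from collections import Counter

def fit_stump(rows, labels, feature_indices):
    """Fit a one-split tree: pick the (feature, threshold) with the fewest training errors."""
    best = None
    for f in feature_indices:
        for threshold in sorted({row[f] for row in rows}):
            left = [y for row, y in zip(rows, labels) if row[f] <= threshold]
            right = [y for row, y in zip(rows, labels) if row[f] > threshold]
            if not left or not right:
                continue  # not a real split
            left_label = Counter(left).most_common(1)[0][0]
            right_label = Counter(right).most_common(1)[0][0]
            errors = sum(y != left_label for y in left) + sum(y != right_label for y in right)
            if best is None or errors < best[0]:
                best = (errors, f, threshold, left_label, right_label)
    if best is None:
        # No usable split in this sample: always predict the majority label.
        majority = Counter(labels).most_common(1)[0][0]
        return (0, float("inf"), majority, majority)
    return best[1:]

def predict_stump(stump, row):
    feature, threshold, left_label, right_label = stump
    return left_label if row[feature] <= threshold else right_label

def fit_forest(rows, labels, n_trees=25, n_features_per_tree=1, seed=0):
    """Train each tree on a bootstrap sample of rows and a random subset of features."""
    rng = random.Random(seed)
    n_rows, n_features = len(rows), len(rows[0])
    forest = []
    for _ in range(n_trees):
        picks = [rng.randrange(n_rows) for _ in range(n_rows)]  # sample rows with replacement
        features = rng.sample(range(n_features), n_features_per_tree)
        forest.append(fit_stump([rows[i] for i in picks], [labels[i] for i in picks], features))
    return forest

def predict_forest(forest, row):
    """Every tree votes; the most common vote is the forest's prediction."""
    votes = Counter(predict_stump(stump, row) for stump in forest)
    return votes.most_common(1)[0][0]

# Hypothetical toy data: (high school GPA, credits attempted); 1 = returned, 0 = did not return.
rows = [(2.0, 9), (2.2, 10), (2.4, 12), (2.5, 11), (2.7, 9),
        (3.0, 15), (3.2, 16), (3.4, 14), (3.6, 15), (3.8, 17)]
labels = [0, 0, 0, 0, 0, 1, 1, 1, 1, 1]

forest = fit_forest(rows, labels)
low_risk_prediction = predict_forest(forest, (3.9, 18))
high_risk_prediction = predict_forest(forest, (1.8, 6))
```

In practice a library implementation would grow much deeper trees and handle many more variables; the sketch only makes the sampling and voting steps concrete.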


In the validation portion, we did quite a bit of work to make sure that the model did not produce misleading results. For example, we found that the FAFSA submission date was highly correlated with student retention. However, we later determined that the FAFSA was sometimes submitted after the Fall term had started. This field was therefore giving the retrospective model information that would not yet be available when running the model at the start of a Fall term. In testing the accuracy of the model, we focused on negative recall: the share of students who did not return that the model correctly identified. We wanted to choose the model that would best help us identify students who will not return. The reason is that if our retention rate is 85%, a model that predicts every student will return is 85% correct, which may seem good, but it would have misclassified every student who didn’t return.
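
The arithmetic behind that choice can be worked through in a short sketch. The cohort of 100 students and the second model's predictions are invented for illustration, and the function names are hypothetical. The point is that two models can have the same overall accuracy while differing sharply in negative recall.

```python
def accuracy(actual, predicted):
    """Share of all students whose outcome was predicted correctly."""
    return sum(a == p for a, p in zip(actual, predicted)) / len(actual)

def recall_for_class(actual, predicted, target):
    """Share of students whose true outcome is `target` that the model also labeled `target`."""
    relevant = [p for a, p in zip(actual, predicted) if a == target]
    if not relevant:
        raise ValueError("no students with the target outcome")
    return sum(p == target for p in relevant) / len(relevant)

# Hypothetical cohort at an 85% retention rate: 1 = returned, 0 = did not return.
actual = [1] * 85 + [0] * 15

# Baseline "model" that predicts every student will return.
always_return = [1] * 100
baseline_accuracy = accuracy(actual, always_return)
baseline_negative_recall = recall_for_class(actual, always_return, target=0)

# Hypothetical model: wrongly flags 10 returners, correctly flags 10 of the 15 non-returners.
model_predictions = [1] * 75 + [0] * 10 + [0] * 10 + [1] * 5
model_accuracy = accuracy(actual, model_predictions)
model_negative_recall = recall_for_class(actual, model_predictions, target=0)

print(f"baseline: accuracy={baseline_accuracy:.2f}, negative recall={baseline_negative_recall:.2f}")
print(f"model:    accuracy={model_accuracy:.2f}, negative recall={model_negative_recall:.2f}")
```

Both sets of predictions are 85% accurate, but the baseline finds none of the students who left, while the hypothetical model finds two thirds of them. This is why accuracy alone was not a useful selection criterion here.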

Now that work on the first iteration is complete, we have productionized the model so that it can be run on demand. Our first use of the model was to help guide outreach this summer to students who were registered for Fall 23 but who, according to the model, had a very low likelihood of returning. This was a small group, but the initial results are encouraging. We are now developing a plan for the cohort that started in Fall 23 and for what we can do, using this model, to support their success.