In our last post we recapped the year-one developments on BSU’s Data Warehouse (DWH). Going forward, we will be able to leverage that warehouse to produce more and better data for the institution. In this post we discuss the developments of year two (ended August 2023): specifically, the implementation of the Academic Performance Management (APM) tool and the first data science predictive model built on DWH data.
As a new tool for managing the institution, we are excited about the ongoing rollout of APM. The implementation involved staff from several departments and many hours of development and validation. The tool enables detailed analysis and optimization of course offerings across the institution. We are currently working with early adopters to refine it before making it more widely available.
In August 2023 we completed our first data science model of student retention using DWH data. This exciting project will let us apply machine learning over time to become better at supporting student success. Our first model examines first-year student retention and the factors that drive attrition. As we add more validated data to the DWH, we anticipate developing more nuanced models that surface the less obvious factors impeding student progress. For now, we have identified first-term GPA as, by far, the strongest predictor of whether a student will return. You can see the primary factors in the following chart:
This analysis is the result of a random forest, an ensemble method that builds hundreds of small decision trees, each trained on only a random subset of the model’s features and rows. The trees then “vote” on the final outcome, and this process increases accuracy without overfitting. When looking at student retention it is particularly important that the model do well on negative recall: how well it identifies the students who do not retain. Since nearly 80% of our students return for the fall after their freshman year, a model that simply predicted everyone would return would be 80% accurate, yet 100% wrong on negative recall. Tuning the model for these considerations requires careful design.
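The voting-and-recall ideas above can be sketched with a toy example. This is a minimal illustration assuming scikit-learn; the data, feature names, and thresholds are synthetic stand-ins, not our DWH fields or production model.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import recall_score
from sklearn.model_selection import train_test_split

# Synthetic cohort: GPA carries the signal, the second feature is pure noise
rng = np.random.default_rng(42)
n = 5000
first_term_gpa = rng.uniform(0.0, 4.0, n)
noise_feature = rng.normal(0.0, 1.0, n)
X = np.column_stack([first_term_gpa, noise_feature])
# Illustrative outcome: retention driven mostly by first-term GPA
retained = (first_term_gpa + rng.normal(0.0, 1.0, n) > 1.2).astype(int)

X_tr, X_te, y_tr, y_te = train_test_split(X, retained, random_state=0)

# class_weight="balanced" pushes the forest to care about the minority
# (non-retained) class rather than chasing overall accuracy
model = RandomForestClassifier(
    n_estimators=300, class_weight="balanced", random_state=0
).fit(X_tr, y_tr)
pred = model.predict(X_te)

# Negative recall: of the students who actually left, how many did we flag?
neg_recall = recall_score(y_te, pred, pos_label=0)
# The all-retain baseline scores 0.0 on this metric despite high accuracy
baseline_neg_recall = recall_score(y_te, np.ones_like(y_te), pos_label=0)
print(f"model negative recall:    {neg_recall:.2f}")
print(f"baseline negative recall: {baseline_neg_recall:.2f}")

# Feature importances are how a ranking like "first-term GPA on top" emerges
for name, imp in zip(["first_term_gpa", "noise_feature"],
                     model.feature_importances_):
    print(f"{name}: {imp:.3f}")
```

Because the synthetic outcome depends only on GPA, the GPA feature dominates the importance ranking, and the tuned forest catches a meaningful share of leavers where the all-retain baseline catches none.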
Here is a flow chart overview of the process we follow:
We are currently working with University College to put these results to work in supporting students.