College Basketball Model Part II

Brendan Cox
May 12, 2020

Dean Oliver's Four Factors and my recent Data Mining and Predictive Analytics master's class inspired me to create another D-I college basketball model. My first college basketball model was a convoluted multiple linear regression that, with some luck, was the best in the world in predictive accuracy over the last month of the 2018 season.

This time around, I used R to build four models (Logistic Four Factors, Boosting Four Factors, Logistic Full, and Boosting Full) to predict the winner of games between two D-I teams based on 2018-2019 season data.

I used Sports-reference.com NCAA men’s basketball data for the 2018-2019 season. Sports-reference.com was founded in 2000 and has since been one of the most reliable sports statistics databases available, if not the most reliable. Its college basketball data dates back to 1949.

Four groups of parameters for a sample size of 353 Division I schools were concatenated to generate the overall dataset, which included: 1) basic school stats; 2) basic opponent stats; 3) advanced school stats; and 4) advanced opponent stats.

The original dataset consists of 107 variables. For the "Full" models, almost all of the percentage variables were used, along with Strength of Schedule (SOS), which comes to 58 independent variables. The Four Factors models use Effective Field Goal Percentage, Turnover Rate, Total Rebounding Rate, and Free Throws per Field Goal Attempted, for the team and its opponents respectively, which are more or less the same variables used by Oliver. Only SOS was added to these, for a total of 16 independent variables.
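As a rough illustration of the four factors named above, here is a minimal Python sketch (not the R code behind the models). It uses the commonly published box-score definitions, including the 0.44 free-throw weight in the turnover rate. The `BoxScore` class, its field names, and the totals are made up for the example.

```python
from dataclasses import dataclass


@dataclass
class BoxScore:
    """Totals for one team. Field names are illustrative, not the site's columns."""
    fg: int    # field goals made
    fga: int   # field goals attempted
    fg3: int   # three-pointers made
    ft: int    # free throws made
    fta: int   # free throws attempted
    tov: int   # turnovers
    trb: int   # total rebounds


def four_factors(team: BoxScore, opp: BoxScore) -> dict:
    """Return the four factors for `team` as fractions (not percentages)."""
    return {
        # Effective FG%: a made three counts as 1.5 made field goals.
        "efg_pct": (team.fg + 0.5 * team.fg3) / team.fga,
        # Turnover rate: turnovers per estimated play.
        "tov_pct": team.tov / (team.fga + 0.44 * team.fta + team.tov),
        # Total rebounding rate: share of all available rebounds.
        "trb_pct": team.trb / (team.trb + opp.trb),
        # Free throws made per field goal attempted.
        "ft_per_fga": team.ft / team.fga,
    }


# Made-up totals, purely to exercise the function.
team = BoxScore(fg=30, fga=60, fg3=8, ft=12, fta=16, tov=11, trb=38)
opp = BoxScore(fg=26, fga=62, fg3=6, ft=10, fta=14, tov=14, trb=32)
factors = four_factors(team, opp)
print(factors)
```

Calling `four_factors(opp, team)` gives the opponent-side version of the same four numbers.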

The research question was about the independent variables as well as the dependent variable. With two models using all of the percentage variables and two using only the four factors, we could learn just how effective the four factors are in explaining who won games.

During the modeling process, I wanted to find the best model at each possible number of variables and compare their adjusted R^2 values. However, a search over that many variable subsets is computationally unrealistic, and the attempt ended in error messages. Therefore, I left all 58 variables in the full model and simply compared it to the four factors as the most logical reduced model.
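To give a sense of why that search is out of reach, the short Python sketch below counts the candidate models an exhaustive best-subset search would need to consider for 58 predictors. It assumes one fit per non-empty subset. The exact search routine used in R is not stated here, so this is only the arithmetic.

```python
from math import comb

N_VARIABLES = 58  # independent variables in the "Full" model

# An exhaustive search fits one model per non-empty subset of predictors.
total_subsets = 2 ** N_VARIABLES - 1

# Number of candidate models at a few specific sizes.
models_of_size = {k: comb(N_VARIABLES, k) for k in (1, 2, 29)}

print(f"{total_subsets:,} candidate models in total")
for k, count in models_of_size.items():
    print(f"models with {k} variables: {count:,}")
```

The total is on the order of 10^17 models, which is why falling back to one hand-picked reduced model (the four factors) is a practical choice.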

The idea was that differences among the four models would expose each one's strengths and weaknesses and show one or two models to be superior to the others. Logistic regression and boosting are structurally different methods. Logistic regression essentially states how likely a team is to win the game, whereas boosting can produce predictions outside the range of 0 to 1 while answering the same question.
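The difference in output scale can be sketched in a few lines of Python. This is only an illustration: the scores are hypothetical, and the "raw outputs" stand in for any model fit to 0/1 labels without a logistic link, which is an assumption about how the boosting predictions behaved.

```python
import math


def sigmoid(z: float) -> float:
    """Logistic function: maps any real-valued score to a value in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))


def pick_winner(score: float, threshold: float = 0.5) -> int:
    """Return 1 (predicted win) if the score clears the threshold, else 0."""
    return 1 if score >= threshold else 0


# Hypothetical linear scores (intercept plus coefficients times predictors).
linear_scores = [-3.0, 0.0, 2.5]
win_probabilities = [sigmoid(z) for z in linear_scores]

# Hypothetical raw outputs that are not squeezed through a logistic link.
# They can land below 0 or above 1 but are still thresholded the same way.
raw_outputs = [-0.12, 0.48, 1.07]
raw_picks = [pick_winner(s) for s in raw_outputs]

print(win_probabilities)
print(raw_picks)
```

Either way, the final pick comes from comparing the score to a cutoff, so both kinds of model can be judged on the same win/loss question.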

The measure used to compare all four models was accuracy. The two boosting models had strikingly similar results, and both were inferior to the logistic regression models. The best model of the four was the Four Factors logistic regression model. The four accuracy percentages were as follows:

Four Factors Logistic Regression - 76.22%
Full Logistic Regression - 75.55%
Four Factors Boosting - 71.85%
Full Boosting - 71.71%
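For readers unfamiliar with the metric, accuracy here is simply the share of games where the predicted winner matched the actual winner. A minimal Python sketch follows. The predictions and outcomes are toy values, not data from the models, and the 0.5 cutoff is an assumption.

```python
def accuracy(picks, outcomes):
    """Fraction of games where the predicted result equals the actual result."""
    if len(picks) != len(outcomes):
        raise ValueError("picks and outcomes must have the same length")
    correct = sum(1 for p, o in zip(picks, outcomes) if p == o)
    return correct / len(outcomes)


# Toy predicted win probabilities and actual results (1 = win, 0 = loss).
predicted = [0.81, 0.35, 0.62, 0.10, 0.55]
actual = [1, 0, 0, 0, 1]

# Turn each probability into a pick using an assumed 0.5 cutoff.
picks = [1 if p >= 0.5 else 0 for p in predicted]
acc = accuracy(picks, actual)
print(f"accuracy: {acc:.2%}")
```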

Four Factors Logistic Regression Data Examples

In this scenario, where the dependent variable was binary, logistic regression, a method designed for binary outcomes, performed better than the high-powered ensemble method of boosting. Also, despite using far fewer variables, the Four Factors models fared better in both cases. This suggests that the four factors alone explain at least a majority of how good a team is statistically, compared to all the statistics available. Regular fans can find real value in analyzing just the four factors to determine team performance.

Compared to other 2019 college basketball models, the 76.22% is once again the best in the world. However, a 1:1 comparison is not fair in this scenario, as my model was able to look at a season's worth of data to calculate a season's worth of results, whereas other models were evaluated a week at a time. That being said, I am very pleased with the results of my model, and I am hoping to add it back into the fold in the coming years.

Oliver notes that shooting captures 40% of the weight of the four factors. Our Four Factors boosting model compares well: the four variables with the highest relative influence are the Effective Field Goal Percentage variables, which add up to 36.2% if you exclude SOS. Free throw rate also has the lowest relative influence, in line with Oliver's weighting.

“Non-Division I games are excluded from the ratings.” Unfortunately, the independent variables used do include statistics from non-Division I games. However, these are a small minority of the total games (an estimated 5%) and only slightly skew the data for certain teams, so the model is reasonable and can be trusted on an overall scale. Another limitation was that neutral-site games listed the higher seed as the "home team". I was unable to filter this out of my data, but I would look to avoid it in the future.

Feel free to download my R Markdown results here.