Recent Methods from Statistics and Machine Learning for Credit Scoring

Anne Kraus

Munich 2014

Dissertation at the Faculty of Mathematics, Informatics and Statistics of Ludwig-Maximilians-Universität München, submitted by Anne Kraus from Schweinfurt

Munich, 10 March 2014

First reviewer: Prof. Dr. Helmut Küchenhoff
Second reviewer: Prof. Dr. Martin Missong
Date of the oral defense: 22 May 2014

My special thanks go to
• Prof. Dr. Helmut Küchenhoff, for the opportunity to pursue my doctorate under his supervision, his enthusiasm for the topic, and his excellent guidance over the past years
• Prof. Stefan Mittnik, Ph.D., for acting as second supervisor within the doctoral program
• Prof. Dr. Martin Missong, for his willingness to serve as second reviewer
• all doctoral students and staff at the Institut für Statistik, for the friendly welcome and support, especially Monia Mahling
• my employer, for the opportunity to take a leave of absence
• my colleagues, for their interest and many shared lunch breaks
• my friends, for much understanding and motivation, above all Karin Schröter
• my siblings Eva Schmitt and Wolfgang Kraus and their families, for encouragement and distraction
• my parents, for their boundless support in every respect
• Martin Tusch, for his unending support

Abstract

Credit scoring models are fundamental to financial institutions such as retail and consumer credit banks. The purpose of the models is to estimate the likelihood that a credit applicant will default, in order to decide whether to grant credit. The area under the receiver operating characteristic (ROC) curve (AUC) is one of the most commonly used measures of predictive performance in credit scoring. The aim of this thesis is to benchmark different methods for building scoring models with the goal of maximizing the AUC. The AUC is used not only to evaluate the predictive accuracy of the presented algorithms but is also introduced as a direct optimization criterion.

The logistic regression model is the most widely used method for creating credit scorecards and classifying applicants into risk classes. Since this logit-based development process is standard in retail banking practice, its predictive accuracy serves as the benchmark throughout this thesis.

The AUC approach is a main topic introduced in this work. Instead of maximum likelihood estimation, the AUC is taken as the objective function and optimized directly. The coefficients are estimated by computing the AUC as the Wilcoxon–Mann–Whitney statistic and optimizing it with the Nelder–Mead algorithm. AUC optimization is a distribution-free approach, and a simulation study is used to examine the theoretical considerations. The study shows that the approach still works when the underlying distribution is not logistic.

In addition to the AUC approach and classical, well-known methods such as generalized additive models, new methods from statistics and machine learning are evaluated for the credit scoring case. Conditional inference trees, model-based recursive partitioning methods, and random forests are presented as recursive partitioning algorithms.
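
The AUC approach described above can be illustrated with a minimal Python sketch. It is not the implementation used in the thesis: the function names `wmw_auc` and `fit_auc`, the synthetic data, and the starting values are assumptions made purely for illustration. The sketch computes the AUC as the Wilcoxon–Mann–Whitney statistic of a linear score and maximizes it with SciPy's Nelder–Mead routine.

```python
import numpy as np
from scipy.optimize import minimize


def wmw_auc(scores, y):
    """AUC as the Wilcoxon-Mann-Whitney statistic: the share of
    (class 1, class 0) pairs in which the class-1 score is higher,
    with ties counted as one half."""
    pos = scores[y == 1]
    neg = scores[y == 0]
    diff = pos[:, None] - neg[None, :]  # all pairwise score differences
    return (np.sum(diff > 0) + 0.5 * np.sum(diff == 0)) / (pos.size * neg.size)


def fit_auc(X, y, beta0):
    """Maximize the AUC of the linear score X @ beta with Nelder-Mead
    (hypothetical helper; minimizes the negative AUC)."""
    def objective(beta):
        return -wmw_auc(X @ beta, y)

    result = minimize(objective, beta0, method="Nelder-Mead")
    return result.x, -result.fun


# Synthetic illustration only (not the bank data used in the thesis).
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))
y = (X @ np.array([1.0, -2.0]) + rng.logistic(size=200) > 0).astype(int)

beta0 = np.array([0.1, 0.1])  # assumed starting values
beta_hat, auc_hat = fit_auc(X, y, beta0)
```

Because the AUC depends only on the ranking of the scores, it is unchanged when all coefficients are multiplied by the same positive constant, so a fit of this kind determines the coefficient vector only up to scale. The objective is also piecewise constant in the coefficients, which is one reason a derivative-free method such as Nelder–Mead is a natural choice here.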
Boosting algorithms are also explored, additionally using the AUC as a loss function. The empirical evaluation is based on data from a German bank. From the application scoring, 26 attributes are included in the analysis. Besides the AUC, other performance measures are used to evaluate the predictive performance of the scoring models. While classification trees do not improve predictive accuracy for the current credit scoring case, the AUC approach and special boosting methods outperform the robust classical scoring models in terms of predictive performance with the