MSBD 5012 Group 1 Project Presentation - Credit Card Default Prediction

00:01
Welcome. Today we are going to present our project on credit card default prediction. The presentation will be in four parts: the first part will be our problem definition, the second part will be about our dataset, the third part will be about the modeling, and the last part will be about the results and visualizations. Now let us introduce our problem. A credit card is a method of payment with which the cardholder can purchase goods or services without paying cash.
00:33
Using a credit card is convenient for paying for almost anything, and a credit card also allows you to pay when you shop online. However, a credit card is a type of unsecured loan, and an unsecured loan is a loan given without any collateral such as a house or a fixed deposit. When the bills come and you cannot pay them, they easily accumulate and the default risk becomes higher and higher. Our team focuses on finding the relationship between various cardholder attributes and the ability to repay.
01:04
We will use machine learning to learn from all available features of the cardholders in the dataset and predict whether or not they will default in the next month. In this project we investigate the results of predicting default using different models with different training methods and data preprocessing techniques. Now let me introduce our dataset. The source of our dataset is Kaggle, and the data can be downloaded from the link on the slide.
01:36
There are 23 trainable features and two non-trainable features in our dataset. The two non-trainable features are the ID number of the cardholder and the default status, which is also our label. The 23 trainable features include the credit given, the gender, the education level, the marriage status, and the age in years of our cardholders. They also include the repayment status, the amount of the bill statement, and the amount of the previous payment over six months for our cardholders. The data types include numeric, boolean, ordinal, and categorical values.
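A minimal sketch of loading the data and separating the label and the ID from the 23 trainable features; the file name UCI_Credit_Card.csv and the column names ID and default.payment.next.month are assumptions about the Kaggle CSV layout, not something stated in the presentation.

import pandas as pd

# Load the Kaggle credit card default data (assumed file and column names).
df = pd.read_csv("UCI_Credit_Card.csv")
y = df["default.payment.next.month"]                        # label: default status next month
X = df.drop(columns=["ID", "default.payment.next.month"])   # the 23 trainable features
print(X.shape, y.shape)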
02:07
The next step was analyzing our data and visualizing the results. First we check whether there are any nulls in the dataset, and luckily for us there are no nulls in any of the columns. Next we analyze the distribution of labels, and as we can see in the nested pie chart, there are more non-default samples than default samples. We can also see that the proportion of defaults among males is much higher than that among females.
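A minimal sketch of the two checks just described, continuing from the loading sketch above; the column name SEX is an assumption about the Kaggle CSV.

# Count missing values per column, then inspect the label distribution,
# overall and split by gender.
print(df.isnull().sum())                            # expect no nulls in any column
print(y.value_counts(normalize=True))               # non-default vs default proportion
print(df.groupby("SEX")[y.name].mean())             # default rate by gender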
02:40
We also investigated the dispersion of age in our sample and found that it mostly lies between 20 and 40. The line graph shows the credit given to the cardholders, and as we can see, the curve is much smoother on the left-hand side of the graph than on the right-hand side. Our next step was analyzing the feature-to-feature correlation, and the result is shown in the heat map on the left-hand side.
03:11
The absolute values of the pairwise Pearson correlations are mostly between 0 and 0.4, which means that the data are not very highly correlated in most cases, except for two groups of features: the repayment status and the bill amount. The pairwise correlations of the repayment status features are around 0.4 to 0.8, which is marginally acceptable for some pairs, but the correlations of the bill amounts are between 0.8 and 0.95,
03:42
which is unacceptable in some cases. However, from our observations, it makes more sense to adopt the repayment status and the bill amount when deciding whether or not a cardholder will default in the next month, even though this does make the logistic regression less reliable due to multicollinearity. The take of our team is that these features make more sense only when they are used together, so it is acceptable to use them even though they have high pairwise Pearson correlations.
04:14
We also analyzed the label-to-feature correlation, and the result is shown in the table on the left-hand side. We can see that the maximum correlation is about 0.3, but as mentioned on the previous slide, some features are useful only when they are used together, like the repayment status, bill amount, and amount paid, so it is hard to interpret these features when they are used alone, which also explains the low correlation.
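A minimal sketch of the correlation analysis, assuming the DataFrame from the loading sketch above and the seaborn/matplotlib libraries mentioned at the end of the presentation.

import matplotlib.pyplot as plt
import seaborn as sns

# Feature-to-feature pairwise Pearson correlations, shown as a heat map.
corr = X.corr(method="pearson")
sns.heatmap(corr.abs(), cmap="viridis")
plt.title("Absolute pairwise Pearson correlation")
plt.show()

# Label-to-feature Pearson correlations.
label_corr = X.corrwith(y, method="pearson")
print(label_corr.abs().sort_values(ascending=False).head(10))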
04:44
As mentioned before, the distribution of the label is quite imbalanced, since defaults are less frequent in reality, and that makes it harder for a classifier to learn the default class. So we used the package SMOTE, which stands for Synthetic Minority Oversampling Technique, to solve this issue. Here are some screenshots of how we applied SMOTE to our data. As we can see from the result below, the dataset becomes balanced after we apply SMOTE to it.
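A minimal sketch of the oversampling step; SMOTE from the imbalanced-learn package is one common implementation of the technique described above, applied here to the features and label from the earlier sketches.

from collections import Counter
from imblearn.over_sampling import SMOTE

print("before:", Counter(y))                     # imbalanced: far fewer default samples
X_res, y_res = SMOTE(random_state=42).fit_resample(X, y)
print("after:", Counter(y_res))                  # the two classes are now balanced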
05:17
We also applied normalization to our features, since some of our data have different ranges and variances, for example age and bill amounts, and some models may struggle to train and classify without scaling, so we employ the MinMaxScaler to normalize the data. The formula is as follows: first we subtract the minimum of the feature from the data, then we divide it by the difference between the maximum and the minimum of the feature.
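A minimal sketch of the min-max normalization just described, x_scaled = (x - min) / (max - min), using scikit-learn's MinMaxScaler on the resampled features from the sketch above.

from sklearn.preprocessing import MinMaxScaler

# Subtract each feature's minimum, then divide by (max - min) for that feature.
scaler = MinMaxScaler()
X_scaled = scaler.fit_transform(X_res)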
05:48
Now let's talk about the methodology we used in model training. The procedure of model training is as follows: first, we fit our dataset to different models; then we do grid search and cross-validation to find the best parameters; then we ensemble the models and do weight optimization; and finally we use our final ensembled model. For model optimization, we separate our dataset into two parts.
06:20
The first part is the training set and the second part is the testing set: the training set has 80 percent of the data and the testing set has 20 percent of the data. We first use the 80 percent training set for training, and during the training process we do cross-validation and grid search; then we evaluate the trained model on the testing set to find the testing error. We have selected seven models. The models we used are logistic regression, support vector machine, XGBoost, deep neural network, LightGBM, random forest, and gradient boosting tree.
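A minimal sketch of the 80/20 split, assuming the scaled and resampled data from the sketches above; stratifying keeps the label proportions similar in both parts, and random_state is an arbitrary choice. In practice the oversampling and scaling would often be fitted on the training portion only.

from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(
    X_scaled, y_res, test_size=0.2, stratify=y_res, random_state=42)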
06:50
For the optimization, we used hyperparameter tuning, cross-validation, model ensembling, and weight tuning. Let's talk about each model we used. The first model we used is logistic regression. Logistic regression calculates the probability of each class by using the sigmoid function, and then the model picks the class with the highest probability.
07:21
The second model is the support vector machine. The support vector machine finds the best hyperplane by maximizing the margin, which means that the chosen hyperplane should have the maximum distance to the nearest data points of each class. Random forest is a kind of ensemble learning: it constructs multiple decision trees, and each tree generates an output. The final random forest output is the mode of the decision tree outputs.
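A minimal sketch of the three models just described, using scikit-learn and the split from the earlier sketches; the hyperparameter values shown are placeholders, not the tuned ones from the presentation.

from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
from sklearn.ensemble import RandomForestClassifier

log_reg = LogisticRegression(max_iter=1000)      # sigmoid gives class probabilities
svm = SVC(kernel="rbf", probability=True)        # maximum-margin hyperplane
rf = RandomForestClassifier(n_estimators=200)    # mode of many decision trees

for model in (log_reg, svm, rf):
    model.fit(X_train, y_train)
    print(type(model).__name__, model.score(X_test, y_test))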
07:52
The individual trees in a random forest do not use all the features in the dataset; instead, each tree uses only a subset of the features. As a result, the trees have more variation, which ultimately results in lower correlation and higher diversification across the trees. Gradient boosting is a method which trains a strong learner by combining and improving a bunch of weak learners. A weak learner is a model which is better than random guessing by a small margin; however, a weak learner
08:26
has low capacity and low training cost, and it is not easy to overfit due to its simplicity. A gradient boosting tree model adds a new tree in each iteration, which targets the errors of the previous iteration. XGBoost is a kind of gradient boosting tree, but it also adds L2 regularization to prevent overfitting and control the capacity of the trees.
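A minimal sketch of an XGBoost model with an explicit L2 regularization term (reg_lambda); all values are placeholders rather than our tuned parameters.

from xgboost import XGBClassifier

xgb = XGBClassifier(
    n_estimators=300,      # number of boosting rounds, one new tree per iteration
    learning_rate=0.1,
    max_depth=4,           # limits the capacity of each tree
    reg_lambda=1.0,        # L2 regularization to reduce overfitting
    eval_metric="auc",
)
xgb.fit(X_train, y_train)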
08:56
LightGBM is also a kind of gradient boosting tree, but it is more efficient when dealing with a huge amount of data. The other difference is that instead of growing the tree level by level, LightGBM grows it leaf by leaf. The last model we used is the deep neural network. A deep neural network is composed of an input layer, hidden layers, and a final output layer. We tried different settings of this model, such as the number of hidden layers and the activation functions.
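A minimal sketch of the remaining two models: LightGBM, whose leaf-wise growth is controlled by num_leaves, and a small scikit-learn MLP standing in for the deep neural network; the settings shown are example placeholders.

from lightgbm import LGBMClassifier
from sklearn.neural_network import MLPClassifier

lgbm = LGBMClassifier(num_leaves=31, n_estimators=300, learning_rate=0.1)
dnn = MLPClassifier(hidden_layer_sizes=(64, 32),  # two hidden layers as an example
                    activation="relu",            # example activation function
                    max_iter=500)
lgbm.fit(X_train, y_train)
dnn.fit(X_train, y_train)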
09:26
The results will be explained in the next section. For each model, we use grid search with cross-validation to find the best hyperparameters. To do this, we separate our training set into 5 folds: 4 folds are used for training and 1 fold is used for validation. For each set of hyperparameters we get 5 AUC scores, and we calculate the final performance based on the average of the AUC scores. Finally, we pick the hyperparameters with the highest average AUC score.
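A minimal sketch of grid search with 5-fold cross-validation scored by AUC, shown here for the random forest; the parameter grid is only an example.

from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

param_grid = {"n_estimators": [100, 300], "max_depth": [4, 8, None]}
search = GridSearchCV(RandomForestClassifier(), param_grid,
                      cv=5, scoring="roc_auc")    # 4 folds train, 1 fold validates
search.fit(X_train, y_train)
print(search.best_params_, search.best_score_)    # highest average AUC over the 5 folds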
09:59
After fine-tuning the hyperparameters, we ensemble the models to perform prediction. During this process each model votes for each prediction, and the ensemble makes the final decision based on weights, that is, a weighted majority vote. We decided to keep only one gradient boosting tree, because they are similar models, and we chose to keep XGBoost because it performed better than the other two gradient boosting models. The final procedure is weight tuning: we tried different weights for the votes and chose the combination with the highest AUC.
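A minimal sketch of the weighted voting ensemble and the weight tuning step, assuming the fitted models from the earlier sketches; soft voting with a small grid of candidate weights is one way to implement it, and the candidate weight values are arbitrary examples.

from itertools import product
from sklearn.ensemble import VotingClassifier
from sklearn.metrics import roc_auc_score

estimators = [("lr", log_reg), ("svm", svm), ("rf", rf), ("xgb", xgb), ("dnn", dnn)]

best_auc, best_weights = 0.0, None
for weights in product([1, 2], repeat=len(estimators)):    # try different vote weights
    ens = VotingClassifier(estimators, voting="soft", weights=list(weights))
    ens.fit(X_train, y_train)
    auc = roc_auc_score(y_test, ens.predict_proba(X_test)[:, 1])
    if auc > best_auc:
        best_auc, best_weights = auc, weights
print(best_weights, best_auc)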
10:30
Let's talk about the results and visualizations. We compared the results between the default parameters and the best parameters generated by the GridSearchCV function. Among all the models, XGBoost gives us the best result, providing the highest AUC value of 0.927 on the testing set. The optimized models using the best hyperparameters perform better, as we expected.
11:04
For the future, perhaps we can use more parameters in the tuning. To evaluate the use of SMOTE, we also did some experiments: among the models, XGBoost using SMOTE gives us the best result, providing the highest AUC value, and the other models using SMOTE also perform better, as we expected. This is the summary table for the hyperparameter tuning: the first row is the parameters
11:35
we selected to tune for the models, the second row is the candidate values for the different parameters, and the last row is the best hyperparameters we found after training. This is the comparison table: we used the optimized hyperparameters on the testing dataset and compared the results, as shown below. We find that the optimized hyperparameters obtained with cross-validation, evaluated on the testing set, again for the XGBoost model give us the
12:06
best result, with the highest AUC value of 0.927. We also tried the ensemble model, as we mentioned before. We did experiments with the seven models and finally kept five of them: RF, DNN, LR, SVM, and XGBoost. We found that putting heavier weights on the weaker models definitely performs poorly, as we expected. You can see that the best score on the
12:38
testing dataset is 0.77. For the future, we can think about building a linear regression model that uses the five model outputs as features, so we can learn the best weighting ratio. We did experiments on each of the models, and we also plot the ROC curves, as shown below: this one is for the logistic regression, this one is for the SVM, this one is for XGBoost, this one is for the DNN, this one is for the random forest, and this one is for the GBT.
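A minimal sketch of how the ROC curves can be plotted, assuming the fitted models and the test split from the earlier sketches.

import matplotlib.pyplot as plt
from sklearn.metrics import roc_curve, auc

for name, model in [("LR", log_reg), ("SVM", svm), ("XGBoost", xgb),
                    ("DNN", dnn), ("RF", rf)]:
    scores = model.predict_proba(X_test)[:, 1]        # probability of the default class
    fpr, tpr, _ = roc_curve(y_test, scores)
    plt.plot(fpr, tpr, label=f"{name} (AUC = {auc(fpr, tpr):.3f})")
plt.plot([0, 1], [0, 1], "k--")                       # chance line
plt.xlabel("False positive rate")
plt.ylabel("True positive rate")
plt.legend()
plt.show()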
13:08
For the results, we compare the different methods: no validation, cross-validation, and the model ensemble. We find that cross-validation and the model ensemble give better results in general, with a higher testing AUC. As for the ensemble model, it does not necessarily perform better than a single model, so
13:41
we may not use it, or maybe it is also okay to use it. In conclusion, the XGBoost model gives us the best result among all seven models. The optimized models using the best hyperparameters generated by the GridSearchCV function perform better, as we expected. All models using SMOTE perform better, as we expected as well. Ensembling two good models does not give us a better model, so maybe a single model is already enough to
14:14
do this project, and overfitting occurs when there is no validation, as expected. For future work, we can try to use random search rather than the predefined grid search to see if there is any difference in the parameter selection; we may also use linear regression to find the best weighting when assigning the ensemble model weights; and finally, we can also try to compare the models using a different metric. This part is about k-means clustering. Sometimes we are not
14:46
sure how to choose the value of k, hence we use an elbow curve. We can see that when k is larger than 5 the curve becomes almost a straight line, so we choose k equal to 5 to cluster age against the credit given.
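A minimal sketch of the elbow method for choosing k, clustering age against the credit given; the column names AGE and LIMIT_BAL are assumptions about the Kaggle CSV.

import matplotlib.pyplot as plt
from sklearn.cluster import KMeans

data = df[["AGE", "LIMIT_BAL"]]
inertias = []
ks = range(1, 11)
for k in ks:
    km = KMeans(n_clusters=k, n_init=10, random_state=42).fit(data)
    inertias.append(km.inertia_)          # within-cluster sum of squares
plt.plot(ks, inertias, marker="o")        # the elbow is where the curve flattens out
plt.xlabel("k")
plt.ylabel("inertia")
plt.show()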
15:19
There are also some code demonstrations. To mention, we used Colab to do this project, and for the visualizations we mainly used seaborn and matplotlib; this is how we do the elbow curve and the clustering. Here is the script for one of the seven models in this project: we use the RandomForestClassifier from the scikit-learn library, and we also use the GridSearchCV function together with ShuffleSplit to do the hyperparameter tuning and cross-validation. For the DNN, we use the MLPClassifier from sklearn.neural_network, and we also
15:51
tried to plot the ROC curve as well. Here we provide the ensemble model script: in the left one we use an even weighting ratio, and in the right one we assign the weights by weight tuning for the different models. There are some references that we went through to finish this project. That's all for our presentation. Thank you.
