### SUBTITLES:

Subtitles generated automatically

00:01

Welcome. Today we are going to present our project on credit card default prediction. The presentation will be in four parts: the first part will be our problem definition, the second part will be about our dataset, the third part will be about the modeling, and the last part will be about the results and visualizations. Now let us introduce our problem. A credit card is a method of payment with which the cardholder can purchase goods or services without paying cash.

00:33

Using a credit card is convenient to pay for almost anything, and a credit card also allows you to pay when you shop online. However, a credit card is a type of unsecured loan, and an unsecured loan is a loan given without any collateral such as a house or a fixed deposit. When the bills come and you can't pay for them, they easily accumulate and the default risk becomes higher and higher. Our team focuses on finding the relationship between various cardholder attributes and the ability to repay.

01:04

We will use machine learning to learn from all available features of the cardholders in the dataset and predict whether or not they will default in the next month. In this project we will investigate the results of predicting default using different models with different training methods and data preprocessing techniques. Now let me introduce our dataset. The source of our dataset is Kaggle; we can download the data from the link in the slide.

01:36

There are 23 trainable features and two non-trainable features in our dataset. The two non-trainable features are the ID number of the cardholder and the default status, which is also our label. The 23 trainable features include the credit given, the gender, the education level, the marriage status, and the age in years of our cardholders. They also include the repayment status, the amount of the bill statement, and the amount of the previous payment over six months for our cardholders. The data types include

02:07

numeric, Boolean, ordinal, and categorical. The next step was analyzing our data and visualizing the results. First we checked whether there are any nulls in the dataset, and luckily for us there are no nulls in any of the columns.
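As a rough sketch of that null check (assuming the data is loaded into a pandas DataFrame called `df`; the file name here is an assumption based on the Kaggle dataset):

```python
import pandas as pd

# Load the Kaggle credit card default data (file name is an assumption)
df = pd.read_csv("UCI_Credit_Card.csv")

# Count missing values per column; in our case every count is zero
print(df.isnull().sum())
```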
Next we analyzed the distribution of the labels, and as we can see in the nested pie chart, there are more non-default samples than default samples. We can also see that the proportion of defaults among males is much higher than that among

02:40

females. We also investigated the dispersion of the age of our samples, and we find that it mostly lies between 20 and 40. The line graph shows the credit given to the cardholders, and as we can see, the curve is much smoother on the left-hand side of the graph than on the right-hand side. Our next step was analyzing the feature-to-feature correlation, and the result is shown in the heat map

03:11

on the left-hand side. The absolute values of the pairwise Pearson correlations are mostly between 0 and 0.4, which means that the data are not very highly correlated in most cases, except for two cases: the repayment status and the billing amount. The pairwise correlations of the repayment status are around 0.4 to 0.8, which is marginally acceptable for some pairs, but the correlations of the billing amounts are between 0.8 and 0.95,

03:42

which is unacceptable in some cases. However, there are some observations. For us it makes more sense to adopt the repayment status and the billing amount in deciding whether or not a cardholder will default in the next month, even though this may make the logistic regression less reliable due to multicollinearity. The take of our team is that these features make more sense only when they are used together, so it is acceptable to use them even though they have high pairwise

04:14

Pearson correlations.
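A minimal sketch of how such a heat map could be drawn with seaborn (the drop of the ID and label columns below uses the Kaggle column names, which is an assumption; `df` is the DataFrame from the earlier sketch):

```python
import seaborn as sns
import matplotlib.pyplot as plt

# Trainable features only (column names follow the Kaggle file; adjust if different)
X = df.drop(columns=["ID", "default.payment.next.month"])

# Absolute pairwise Pearson correlations, visualized as a heat map
corr = X.corr(method="pearson").abs()
plt.figure(figsize=(12, 10))
sns.heatmap(corr, cmap="viridis", vmin=0, vmax=1)
plt.title("Feature-to-feature Pearson correlation (absolute value)")
plt.show()
```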
We also analyzed the label-to-feature correlation, and the result is shown in a table on the left-hand side. We can see that the maximum correlation is about 0.3, but as mentioned in the previous slide, some features are useful only when they are used together, such as the repayment status, the billing amount, and the amount paid, so it is hard to interpret these features when they are used alone, which also explains the low correlation.
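Those label-to-feature correlations could be computed along these lines (again assuming `df` holds the data and the label column is named as in the Kaggle file):

```python
# Pearson correlation of each trainable feature with the default label
label = df["default.payment.next.month"]
label_corr = df.drop(columns=["ID", "default.payment.next.month"]).corrwith(label)

# Sort by absolute correlation; the maximum is only about 0.3
print(label_corr.abs().sort_values(ascending=False))
```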

04:44

As mentioned before, the distribution of the labels is quite unbalanced. As defaults are less frequent in reality, it is harder for a classifier to learn the default class, so we used SMOTE, which stands for Synthetic Minority Oversampling Technique, to solve this issue. Here are some screenshots of how we applied SMOTE to our data. As we can see from the result below, the dataset became balanced after we

05:17

applied SMOTE to it.
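A minimal sketch of how SMOTE can be applied with the imbalanced-learn package (variable names are illustrative; the presentation itself only shows screenshots):

```python
from collections import Counter
from imblearn.over_sampling import SMOTE

# X_train, y_train are the training features and labels (assumed to exist)
smote = SMOTE(random_state=42)
X_resampled, y_resampled = smote.fit_resample(X_train, y_train)

# The class counts are now equal, i.e. the dataset is balanced
print("before:", Counter(y_train))
print("after: ", Counter(y_resampled))
```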
We also applied normalization to our features, since some of our data have different ranges and variances, for example age and billing amounts, and some models may struggle to train and classify without scaling. So we employed the min-max scaler to normalize the data. The formula is as follows: first we subtract the minimum of the feature from the data, then we divide it by the difference between the maximum and the minimum of the feature.
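That is the min-max formula x' = (x - min) / (max - min). A small sketch with scikit-learn's MinMaxScaler (fitting only on the training data is our assumption of good practice, not something stated in the talk):

```python
from sklearn.preprocessing import MinMaxScaler

# x' = (x - min) / (max - min), computed column by column
scaler = MinMaxScaler()
X_train_scaled = scaler.fit_transform(X_train)   # fit on training data only
X_test_scaled = scaler.transform(X_test)         # reuse the same min/max
```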

05:48

Now let's talk about the methodology we have used in model training. The procedure of model training is as follows: first we fit our dataset into different models, then we do grid search and cross-validation to find the best parameters, then we ensemble the models and do weight optimization, and finally we use our final ensembled model. For model evaluation we separate our dataset into two parts,

06:20

the first part is the training set and the second part is the testing set. The training set has 80 percent of the data and the testing set has 20 percent of the data. We first take the 80 percent of the data in the training set for training, and during the training process we do cross-validation and grid search; then we fit the trained model on the testing set to find the testing error.
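A sketch of that split (the 80/20 ratio follows the description above; the label column name and the random seed are assumptions):

```python
from sklearn.model_selection import train_test_split

# Features and label (label column name follows the Kaggle file; an assumption)
X = df.drop(columns=["ID", "default.payment.next.month"])
y = df["default.payment.next.month"]

# 80% training (used with cross-validation and grid search), 20% held out for the testing error
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)
```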
We have selected seven models. The models we have used are logistic regression,

06:50

support vector machine, XGBoost, deep neural network, LightGBM, random forest, and gradient boosting tree. For the optimization we have used hyperparameter tuning, cross-validation, model ensembling, and weight tuning. Let's talk about each model we have used. The first model we have used is logistic regression. Logistic regression calculates the probability of each class by using the sigmoid function, then the model picks the class with the highest

07:21

probability.
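A small sketch of that idea with scikit-learn (variable names are carried over from the earlier sketches and are assumptions): `predict_proba` returns the sigmoid-based probability of each class, and `predict` picks the most probable one.

```python
from sklearn.linear_model import LogisticRegression

# p(default) = sigmoid(w·x + b); the class with the higher probability wins
logreg = LogisticRegression(max_iter=1000)
logreg.fit(X_train_scaled, y_train)

probabilities = logreg.predict_proba(X_test_scaled)  # per-class probabilities
predictions = logreg.predict(X_test_scaled)          # highest-probability class
```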
The second model is the support vector machine. The support vector machine finds the best hyperplane by maximizing the margin, which means that the chosen hyperplane should have the maximum distance to the nearest data point of each class. Random forest is a kind of ensemble learning. It constructs multiple decision trees, and each tree generates an output. The final random forest model output is the mode of the decision trees' outputs.

07:52

The individual trees in a random forest do not use all features in the dataset; instead they use only a subset of the features. As a result, the trees have more variation, which ultimately results in lower correlation and higher diversification across the trees. Gradient boosting is a method which builds a strong learner by combining and improving a bunch of weak learners. A weak learner is a model which is better than a random guess by a small margin; however, a weak learner

08:26

has low capacity and low training cost, and it is not easy to overfit due to its simplicity. A gradient boosting tree model adds a new tree in each iteration which targets the errors of the previous iteration. XGBoost (extreme gradient boosting) is a kind of gradient boosting tree, but it also adds L2 regularization to prevent overfitting by controlling the capacity of the trees. LightGBM is also a kind of gradient

08:56

boosting tree, but it is more efficient when dealing with a huge amount of data. The other difference is that instead of growing level-wise each time, LightGBM grows leaf-wise each time.
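A minimal sketch of how the two boosting libraries can be instantiated (the parameter values are illustrative, not the tuned values from the project):

```python
from xgboost import XGBClassifier
from lightgbm import LGBMClassifier

# XGBoost: gradient boosting with L2 regularization on the leaf weights (reg_lambda)
xgb = XGBClassifier(n_estimators=300, max_depth=4, learning_rate=0.1, reg_lambda=1.0)

# LightGBM: leaf-wise tree growth instead of level-wise growth
lgbm = LGBMClassifier(n_estimators=300, num_leaves=31, learning_rate=0.1)

xgb.fit(X_train_scaled, y_train)
lgbm.fit(X_train_scaled, y_train)
```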
The last model we have used is the deep neural network. A deep neural network is composed of an input layer, hidden layers, and a final output layer. We have tried different settings of this model, such as the number of hidden layers

09:26

and the activation function; the results will be explained in the next section. For each model we use grid search with cross-validation to find the best hyperparameters. To do this, we separate our training set into 5 folds: 4 folds are used for training and one fold is used for validation. For each set of hyperparameters we get 5 AUC scores, and we calculate the final performance based on the average of the AUC

09:59

scores. Finally, we pick the hyperparameters with the highest average AUC score.
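A sketch of that grid search with 5-fold cross-validation and AUC scoring, using the SVM as the example model (the parameter grid is illustrative; the real grids per model are in the slides):

```python
from sklearn.model_selection import GridSearchCV, StratifiedKFold
from sklearn.svm import SVC

# Illustrative grid; not the exact values used in the project
param_grid = {"C": [0.1, 1, 10], "kernel": ["rbf", "linear"]}

# 5 folds: 4 for training, 1 for validation; performance = mean AUC over the 5 folds
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
search = GridSearchCV(SVC(probability=True), param_grid, scoring="roc_auc", cv=cv)
search.fit(X_train_scaled, y_train)

print(search.best_params_, search.best_score_)
```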
After fine-tuning the hyperparameters, we ensemble the models to perform the prediction. During this process each model votes for each prediction, and the model ensemble makes the final decision based on a weighted majority vote. We decided to keep only one gradient boosting tree, because they are similar models, and we chose to keep XGBoost because it performed better than the other two

10:30

models. The final procedure is weight tuning: we tried different weights for the votes and chose the set with the highest AUC.
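A sketch of that weighted voting ensemble with scikit-learn's VotingClassifier (the weights shown are placeholders, not the tuned values from the project):

```python
from sklearn.ensemble import RandomForestClassifier, VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
from xgboost import XGBClassifier

# Each model votes; the ensemble decides by a weighted majority of the votes
ensemble = VotingClassifier(
    estimators=[
        ("lr", LogisticRegression(max_iter=1000)),
        ("svm", SVC(probability=True)),
        ("rf", RandomForestClassifier()),
        ("xgb", XGBClassifier()),
    ],
    voting="soft",
    weights=[1, 1, 2, 3],  # placeholder weights, tuned by trying different combinations
)
ensemble.fit(X_train_scaled, y_train)
```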
Let's talk about the results and visualization. We compared the results between the default parameters and the best parameters generated by the GridSearchCV function. Among all the models, XGBoost gives us the best result, providing the highest AUC value of 0.927 on the testing set. The models optimized using the best hyper-

11:04

parameters perform better, as we expected. For the future, we think perhaps we can use more parameters for the tuning. To validate the use of SMOTE we also did some experiments. Among all the models we find that XGBoost with SMOTE gives us the best result, providing the highest AUC value, and the other models using SMOTE also perform better, as we expected. This is a summary table for the hyperparameter tuning. The first row is the selected parameters

11:35

we chose to tune for the models, the second row is the candidate values for the different parameters, and the last row is the best hyperparameters we found after training. This is the comparison table: we used the optimized hyperparameters on the training and testing splits and compared the results as shown below. We find that the optimized hyperparameters with cross-validation, evaluated on the testing set, for the XGBoost model give us the

12:06

best result, which gives the highest AUC value of 0.927. We also tried to use the ensemble model. As we talked about before, we experimented with seven models here, and finally we keep five of them: random forest, DNN, logistic regression, SVM, and XGBoost. We found that putting heavier weights on the worse models definitely performs poorly, as we expected. You can see the best score on the

12:38

testing dataset is 0.77. For the future, we think maybe we can fit a linear regression model using the five models' outputs as features, so we can learn the best weight ratio. We did experiments on each of the models, and we also plot the ROC curves as shown below. This one is for the logistic regression, this one is for the SVM, and this one is for

13:08

XGBoost. This one is for the DNN, this one is for the random forest, and this one is for the GBT.
For the results, we compare among different methods: no validation, cross-validation, and the model ensemble. We find that cross-validation and the model ensemble have better general results, giving us a higher AUC on testing. As for the ensemble model, it doesn't necessarily perform better than the single best model, so

13:41

we may not use it, or maybe it's also okay. In conclusion, the XGBoost model gives us the best result among all seven models. The optimized models using the best hyperparameters generated by the GridSearchCV function perform better, as we expected. All models using SMOTE perform better, as we expected as well. And ensembling two good models doesn't give us a better model, so maybe a single model is already enough to

14:14

do this project. Overfitting occurs when there is no validation, as expected. For future work, we can try to use random search rather than the predefined grid search to see if there is any difference in the parameter selection, and we may also use linear regression to find the best weights when assigning the ensemble model weights. Finally, we can also try to compare the models using a different metric.
This one is for the k-means clustering. Sometimes we are not

14:46

sure how to choose the value of k; hence we use an elbow curve. We can see that when k is larger than 5 the curve becomes a straight line, so we choose k = 5 to cluster the age against the credit given.
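A sketch of that elbow curve for choosing k (the two clustered columns are assumed to be age and the credit given, using the Kaggle column names, which is an assumption):

```python
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans

# Two-column view used for clustering; the column names are assumptions
data = df[["AGE", "LIMIT_BAL"]]

# Inertia (within-cluster sum of squares) for each candidate k
inertias = [KMeans(n_clusters=k, n_init=10, random_state=42).fit(data).inertia_
            for k in range(1, 11)]

plt.plot(range(1, 11), inertias, marker="o")
plt.xlabel("k")
plt.ylabel("inertia")
plt.title("Elbow curve: the bend around k = 5 suggests 5 clusters")
plt.show()
```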
And here are some code demonstrations. This one is to mention that we used Colab to do this project, and this one shows how we do some visualizations; mainly we use seaborn and matplotlib.

15:19

Yeah, and this one is how we plot the elbow curve and do the clustering. Here is the script for one of the seven models we built in this project. We use the random forest classifier from the scikit-learn library to do the coding, and we also use the GridSearchCV function and ShuffleSplit to do the hyperparameter tuning and the cross-validation.
For the DNN we use the MLPClassifier from the scikit-learn neural network module, and we also

15:51

tried to plot the ROC curve as well.
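A small sketch of that DNN with the MLPClassifier and of plotting its ROC curve (the hidden-layer sizes and activation are placeholders for the settings we experimented with):

```python
import matplotlib.pyplot as plt
from sklearn.metrics import auc, roc_curve
from sklearn.neural_network import MLPClassifier

mlp = MLPClassifier(hidden_layer_sizes=(64, 32), activation="relu", max_iter=500)
mlp.fit(X_train_scaled, y_train)

# ROC curve from the predicted probability of the default class
fpr, tpr, _ = roc_curve(y_test, mlp.predict_proba(X_test_scaled)[:, 1])
plt.plot(fpr, tpr, label=f"DNN (AUC = {auc(fpr, tpr):.3f})")
plt.plot([0, 1], [0, 1], linestyle="--")
plt.xlabel("False positive rate")
plt.ylabel("True positive rate")
plt.legend()
plt.show()
```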
Here we provide the ensemble model script: in the left one we use an even weight ratio, and in the right one we assign the tuned weights to the different models. There are some references that we went through to finish this project. That's all for our presentation. Thank you.
