This week, we will start the first graduate course on graphs and networks. Slides are available online.

Later on, we will see additional applications, in the context of flows, matching, transportation, etc.

With Stéphane Tufféry, we have been working on credit scoring^{1}, using the popular German credit dataset,

> myVariableNames <- c("checking_status","duration","credit_history",
+ "purpose","credit_amount","savings","employment","installment_rate",
+ "personal_status","other_parties","residence_since","property_magnitude",
+ "age","other_payment_plans","housing","existing_credits","job",
+ "num_dependents","telephone","foreign_worker","class")

> credit = read.table(
+ "http://archive.ics.uci.edu/ml/machine-learning-databases/statlog/german/german.data",
+ header=FALSE,col.names=myVariableNames)
> credit$class <- credit$class-1

We wanted some nice code to produce a graph like the one below,

Yesterday, Stéphane came up with the following code, which can easily be adapted,

> library(RColorBrewer)
> CL=brewer.pal(6, "RdBu")
> varQuanti = function(base,y,x)
+ {
+ layout(matrix(c(1, 2), 2, 1, byrow = TRUE),heights=c(3, 1))
+ par(mar = c(2, 4, 2, 1))
+ base0 <- base[base[,y]==0,]
+ base1 <- base[base[,y]==1,]
+ xlim1 <- range(c(base0[,x],base1[,x]))
+ ylim1 <- c(0,max(max(density(base0[,x])$y),max(density(base1[,x])$y)))
+ plot(density(base0[,x]),main=" ",col=CL[1],ylab=paste("Density of ",x),
+ xlim = xlim1, ylim = ylim1 ,lwd=2)
+ par(new = TRUE)
+ plot(density(base1[,x]),col=CL[6],lty=1,lwd=2,
+ xlim = xlim1, ylim = ylim1,xlab = '', ylab = '',main=' ')
+ legend("topright",c(paste(y," = 0"),paste(y," = 1")),
+ lty=1,col=CL[c(1,6)],lwd=2)
+ texte <- c("Kruskal-Wallis'Chi² = \n\n",
+ round(kruskal.test(base[,x]~base[,y])$statistic*1000)/1000)
+ text(xlim1[2]*0.8, ylim1[2]*0.5, texte,cex=0.75)
+ boxplot(base[,x]~base[,y],horizontal = TRUE,xlab= y,col=CL[c(2,5)])
+}
> varQuanti(credit,"class","duration")

The code is not complex, but since I usually waste a lot of time on my graphs, I will try to post short pieces dedicated to graphs in R (without ggplot) more frequently.

^{1. For a chapter on statistical learning in the forthcoming Computational Actuarial Science with R.}

On Wednesday, in class, we saw how to visualize a multiple regression model (with two continuous explanatory variables). Here, the goal is to predict the average cost of an insurance claim, using some covariates, e.g. the age of the driver and the age of the car (recall that losses here are liability losses). The prediction is obtained from a (standard) generalized *linear* model, with a log link,

> reg1=glm(cout~ageconducteur+agevehicule,data=base,family=Gamma(link="log"))

The code to visualize the predicted average cost is the following: first, we have to compute predictions for specific values,

> pred=function(x,y){
+ predict(reg1,newdata=data.frame(ageconducteur=x,
+ agevehicule=y),type="response")
+ }

Then, we use this function to compute values on a grid,

> X=seq(20,80,by=5)
> Y=0:20
> Z=outer(X,Y,pred)
> image(X,Y,Z,col=rev(heat.colors(101)))
> contour(X,Y,Z,add=TRUE,
+ levels=c(1400,1800,2000,2200,2400,2600,2800,3000,3200,4000,5000))
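Since the dataset `base` is not provided with this post, here is a minimal self-contained sketch of the same two steps (a vectorised prediction function, then `outer` on a grid) on simulated data. The data frame `sim`, the coefficients used to simulate it, and the names `fit` and `pred_sim` are assumptions made up for illustration; only the variable names `ageconducteur`, `agevehicule` and `cout` come from the code above.

```r
# Simulated stand-in for the portfolio (NOT the data used in the post)
set.seed(1)
n <- 2000
sim <- data.frame(ageconducteur = sample(20:80, n, replace = TRUE),
                  agevehicule   = sample(0:20,  n, replace = TRUE))

# Hypothetical average cost, log-linear in both ages
mu <- exp(7.5 - 0.005 * sim$ageconducteur - 0.02 * sim$agevehicule)
sim$cout <- rgamma(n, shape = 2, rate = 2 / mu)   # Gamma losses with mean mu

# Gamma regression with a log link, as in reg1
fit <- glm(cout ~ ageconducteur + agevehicule, data = sim,
           family = Gamma(link = "log"))

# outer() calls the function once, with two long vectors of equal length,
# so the function has to be vectorised: predict() on a data frame is
pred_sim <- function(x, y) {
  predict(fit, newdata = data.frame(ageconducteur = x, agevehicule = y),
          type = "response")
}

X <- seq(20, 80, by = 5)
Y <- 0:20
Z <- outer(X, Y, pred_sim)   # length(X) x length(Y) matrix of predictions

image(X, Y, Z, col = rev(heat.colors(101)))
contour(X, Y, Z, add = TRUE)
```

Rows of `Z` correspond to the driver's age and columns to the age of the car, which is the layout `image` and `contour` expect.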

If we use factors, and not continuous variates (cut versions of those two variates),

> reg2=glm(cout~cut(ageconducteur,breaks=c(0,22,35,55,80,100))*
+ cut(agevehicule,breaks=c(-1,1,3,5,10,100)),
+ data=base,family=Gamma(link="log"))

(note that we consider the Cartesian product, so values are computed for each product of factors, age of the driver and age of the car) we obtain

Obviously, we’re missing something here: the most expensive class with one model is the cheapest with the other one! Of course, it might come from our classes (which were chosen a bit randomly), but it might be interesting to use nonlinear functions of the ages. So, let us use splines to smooth those two variables,

> library(splines)
> reg3=glm(cout~bs(ageconducteur)+bs(agevehicule),data=base,
+ family=Gamma(link="log"))
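With a log link and two separate spline terms, the linear predictor is a sum of a function of the driver's age and a function of the car's age, so on the log scale the effect of one variable does not depend on the other. A small sketch on simulated data can be used to check this; the data frame `sim`, the simulated mean and the name `fit_add` are hypothetical, chosen only for illustration.

```r
library(splines)

# Simulated stand-in for the portfolio (NOT the data used in the post)
set.seed(2)
n <- 2000
sim <- data.frame(ageconducteur = runif(n, 20, 80),
                  agevehicule   = runif(n, 0, 20))
mu <- exp(7 + ((sim$ageconducteur - 50) / 30)^2 - 0.03 * sim$agevehicule)
sim$cout <- rgamma(n, shape = 2, rate = 2 / mu)

# Additive spline model, same structure as reg3
fit_add <- glm(cout ~ bs(ageconducteur) + bs(agevehicule), data = sim,
               family = Gamma(link = "log"))

# Linear predictor (log of the predicted cost) on a grid inside the data range
X <- seq(25, 75, by = 5)
Y <- 1:19
logZ <- outer(X, Y, function(x, y)
  predict(fit_add, newdata = data.frame(ageconducteur = x, agevehicule = y),
          type = "link"))

# Additivity: the gap between two columns (two car ages) is the same
# for every row (every driver age), up to rounding error
d <- logZ[, 10] - logZ[, 1]
range(d)
```

On the response scale this means that every row of the prediction matrix is a rescaled copy of the others, something a bivariate smooth is not constrained to satisfy.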

With additive smoothed functions, we obtained a symmetric graph (due to the additive property)

while with a bivariate spline

> library(mgcv)
> reg4=gam(cout~s(ageconducteur,agevehicule),data=base,
+ family=Gamma(link="log"))

(for some odd reason, I could not easily use a bivariate spline in the generalized linear model, but it did work with a generalized additive model, which is by no means additive now). We can identify here some regions where the average cost can be extremely high… But, as mentioned on Wednesday, one should keep in mind that some parts of the square above are not reached. More precisely, the distribution of the portfolio, as a function of those two covariates, is the following

Thus, the proportion of young drivers driving a brand new car, and the proportion of old drivers driving a very old car, are rather small… If the goal is to find niches, one should look at the prediction more carefully, but if the goal is to make sure that everyone gets insurance cover, maybe we should accept that some drivers are under-priced (especially when they are rare in the portfolio). And one should keep in mind that average costs are *extremely* sensitive to large losses, as discussed previously http://freakonometrics.hypotheses.org/3490 (and in class).
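One simple way to look at the distribution of the portfolio over the two covariates is a two-way table of shares, using the same classes as in `reg2`. The sketch below runs on a simulated portfolio; the data frame `sim` and the distributions used to generate the ages are assumptions for illustration, not the actual portfolio.

```r
# Simulated stand-in for the portfolio (NOT the data used in the post)
set.seed(3)
n <- 5000
sim <- data.frame(
  ageconducteur = pmin(pmax(round(rnorm(n, 45, 14)), 18), 99),  # ages in 18..99
  agevehicule   = pmin(rpois(n, 7), 50)                         # ages in 0..50
)

# Same breaks as in reg2
age_d <- cut(sim$ageconducteur, breaks = c(0, 22, 35, 55, 80, 100))
age_v <- cut(sim$agevehicule,   breaks = c(-1, 1, 3, 5, 10, 100))

# Share of the portfolio in each (driver age, car age) cell
share <- prop.table(table(age_d, age_v))
round(100 * share, 1)   # in percent
```

Cells with a very small share are those where a predicted average cost relies on few policies and should be read with caution.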

In the univariate case, I have migrated an old post, where I tried to reproduce (in R and in French) some standard graphs of the insurance industry: it is always interesting to visualize not only the prediction obtained from our models, but also the size of each class in the portfolio,

The post is online here http://freakonometrics.hypotheses.org/1224