Home / Predictive Modeling & Machine Learning / 203.5.5 Practice : Implementing Intermediate outputs in R

# 203.5.5 Practice : Implementing Intermediate outputs in R

In this post we will learn how to implement the concept of intermediate outputs using R. We will cover many things in this session.

• Dataset: Emp_Productivity/ Emp_Productivity_All_Sites.csv
• Filter the data and take first 74 observations from above dataset . Filter condition is Sample_Set<3
• Build a logistic regression model to predict Productivity using age and experience
• Calculate the prediction probabilities for all the inputs. Store the probabilities in inter1 variable
• Filter the data and take observations from row 34 onwards. Filter condition is Sample_Set<1
• Build a logistic regression model to predict Productivity using age and experience
• Calculate the prediction probabilities for all the inputs. Store the probabilities in inter2 variable
• Build a consolidated model to predict productivity using inter-1 and inter-2 variables
• Create the confusion matrix and find the accuracy and error rates for the consolidated model

Our sampled data Emp_Productivity1 has first 74 observations. Lets build the model on this sample data(sample-1)

``````####The clasification graph Sample-1
library(ggplot2)
ggplot(Emp_Productivity1)+geom_point(aes(x=Age,y=Experience,color=factor(Productivity),shape=factor(Productivity)),size=5)``````

``````###Logistic Regerssion model1
Emp_Productivity_logit<-glm(Productivity~Age+Experience,data=Emp_Productivity1, family=binomial())

coef(Emp_Productivity_logit)``````
``````## (Intercept)         Age  Experience
##  -8.9361114   0.2762749   0.5923444``````
``````slope1 <- coef(Emp_Productivity_logit)[2]/(-coef(Emp_Productivity_logit)[3])
intercept1 <- coef(Emp_Productivity_logit)[1]/(-coef(Emp_Productivity_logit)[3])
####Decision boundary for model1 built on Sample-1
library(ggplot2)
base<-ggplot(Emp_Productivity1)+geom_point(aes(x=Age,y=Experience,color=factor(Productivity),shape=factor(Productivity)),size=5)
base+geom_abline(intercept = intercept1 , slope = slope1, color = "red", size = 2)``````

``#Base is the scatter plot. Then we are adding the decision boundary``
``````#Filter the data and take observations from row 34 onwards.
Emp_Productivity2<-Emp_Productivity_raw[Emp_Productivity_raw\$Sample_Set>1,]
####The clasification graph
library(ggplot2)
ggplot(Emp_Productivity2)+geom_point(aes(x=Age,y=Experience,color=factor(Productivity),shape=factor(Productivity)),size=5)``````

``````###Logistic Regerssion model2 built on Sample2
Emp_Productivity_logit2<-glm(Productivity~Age+Experience, data=Emp_Productivity2, family=binomial())
Emp_Productivity_logit2``````
``````##
## Call:  glm(formula = Productivity ~ Age + Experience, family = binomial(),
##     data = Emp_Productivity2)
##
## Coefficients:
## (Intercept)          Age   Experience
##     16.3184      -0.3994      -0.2440
##
## Degrees of Freedom: 85 Total (i.e. Null);  83 Residual
## Null Deviance:       119
## Residual Deviance: 34.08     AIC: 40.08``````
``coef(Emp_Productivity_logit2)``
``````## (Intercept)         Age  Experience
##  16.3183916  -0.3994172  -0.2439643``````
``````slope3 <- coef(Emp_Productivity_logit2)[2]/(-coef(Emp_Productivity_logit2)[3])
intercept3 <- coef(Emp_Productivity_logit2)[1]/(-coef(Emp_Productivity_logit2)[3])
####Drawing the Decison boundry
library(ggplot2)
base<-ggplot(Emp_Productivity2)+geom_point(aes(x=Age,y=Experience,color=factor(Productivity),shape=factor(Productivity)),size=5)
base+geom_abline(intercept = intercept3 , slope = slope3, color = "red", size = 2) ``````

``````####Accuracy of the model2
predicted_values<-round(predict(Emp_Productivity_logit2,type="response"),0)
conf_matrix<-table(predicted_values,Emp_Productivity_logit2\$y)
conf_matrix``````
``````##
## predicted_values  0  1
##                0 43  2
##                1  2 39``````
``````accuracy<-(conf_matrix[1,1]+conf_matrix[2,2])/(sum(conf_matrix))
accuracy``````
``## [1] 0.9534884``
``````#Calculate the prediction probabilities for all the inputs. Store the probabilities in inter1 variable
Emp_Productivity_raw\$inter1<-predict(Emp_Productivity_logit,type="response", newdata=Emp_Productivity_raw)

#Calculate the prediction probabilities for all the inputs. Store the probabilities in inter2 variable
Emp_Productivity_raw\$inter2<-predict(Emp_Productivity_logit2,type="response", newdata=Emp_Productivity_raw)

``````##    Age Experience Productivity Sample_Set     inter1    inter2
## 1 20.0        2.3            0          1 0.11423230 0.9995775
## 2 16.2        2.2            0          1 0.04080461 0.9999096
## 3 20.2        1.8            0          1 0.09202657 0.9995949
## 4 18.8        1.4            0          1 0.05152147 0.9997899
## 5 18.9        3.2            0          1 0.13955234 0.9996608
## 6 16.7        3.9            0          1 0.11793035 0.9998329``````
``````####Clasification graph with the two new coloumns
library(ggplot2)
ggplot(Emp_Productivity_raw)+geom_point(aes(x=inter1,y=inter2,color=factor(Productivity),shape=factor(Productivity)),size=5)``````

``````###Logistic Regerssion model with Intermediate outputs as input
Emp_Productivity_logit_combined<-glm(Productivity~inter1+inter2,data=Emp_Productivity_raw, family=binomial())
Emp_Productivity_logit_combined``````
``````##
## Call:  glm(formula = Productivity ~ inter1 + inter2, family = binomial(),
##     data = Emp_Productivity_raw)
##
## Coefficients:
## (Intercept)       inter1       inter2
##     -12.213        8.019        8.598
##
## Degrees of Freedom: 118 Total (i.e. Null);  116 Residual
## Null Deviance:       155.7
## Residual Deviance: 49.74     AIC: 55.74``````
``````slope4 <- coef(Emp_Productivity_logit_combined)[2]/(-coef(Emp_Productivity_logit_combined)[3])
intercept4<- coef(Emp_Productivity_logit_combined)[1]/(-coef(Emp_Productivity_logit_combined)[3])

####Drawing the Decison boundry
library(ggplot2)
base<-ggplot(Emp_Productivity_raw)+geom_point(aes(x=inter1,y=inter2,color=factor(Productivity),shape=factor(Productivity)),size=5)
base+geom_abline(intercept = intercept4 , slope = slope4, colour = "red", size = 2) ``````

``````####Accuracy of the combined
predicted_values<-round(predict(Emp_Productivity_logit_combined,type="response"),0)
conf_matrix<-table(predicted_values,Emp_Productivity_logit_combined\$y)
conf_matrix``````
``````##
## predicted_values  0  1
##                0 74  4
##                1  2 39``````
``````accuracy<-(conf_matrix[1,1]+conf_matrix[2,2])/(sum(conf_matrix))
accuracy``````
``## [1] 0.9495798``

We got an accuracy of 94.95% with an Intermediate model.