R Modeling (Step 4)

Summary: R is a data analysis and visualization platform.

[I] Regression

OLS regression

Description Function
simple linear regression lm(Y ~ X, data)
polynomial regression lm(Y ~ X + I(X^2), data)
multiple linear regression lm(Y ~ X1 + X2 + ... + Xk, data)
multiple linear regression with interaction terms lm(Y ~ X1 + X2 + X1:X2, data)
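
A minimal sketch of these calling patterns on the built-in mtcars data (dataset and variables chosen here purely for illustration):

> fit <- lm(mpg ~ wt, data = mtcars)                 # simple linear regression
> summary(fit)                                       # coefficients, R-squared, F statistic
> fit2 <- lm(mpg ~ wt + I(wt^2), data = mtcars)      # add a quadratic term
> fit3 <- lm(mpg ~ wt + hp + wt:hp, data = mtcars)   # interaction term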

selecting the optimal regression model

Function Description
anova(fit1,fit2) compare nested models
AIC(fit1,fit2) compare models by AIC
stepAIC(fit,direction=) stepwise selection: "forward"/"backward" (MASS package)
regsubsets() all-subsets regression (leaps package)
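
A brief sketch of the selection tools (stepAIC() is in MASS, regsubsets() in leaps; the mtcars model below is illustrative):

> library(MASS)
> fit <- lm(mpg ~ wt + hp + disp + drat, data = mtcars)
> stepAIC(fit, direction = "backward")     # drop terms while AIC improves
> library(leaps)
> subsets <- regsubsets(mpg ~ wt + hp + disp + drat, data = mtcars)
> plot(subsets, scale = "adjr2")           # compare subsets by adjusted R-squared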

regression diagnostics

Function Description
plot() diagnostic plots for a fitted model
qqPlot() quantile comparison plot (car)
durbinWatsonTest() Durbin-Watson test for autocorrelated errors (car)
crPlots() component-plus-residual plots (car)
ncvTest() score test for non-constant error variance (car)
spreadLevelPlot() spread-level plot (car)
outlierTest() Bonferroni outlier test (car)
avPlots() added-variable plots (car)
influencePlot() influence plot (car)
scatterplot() enhanced scatter plot (car)
scatterplotMatrix() enhanced scatter plot matrix (car)
vif() variance inflation factors (car)
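
Most of these diagnostics live in the car package; a minimal sketch on an illustrative mtcars fit:

> library(car)
> fit <- lm(mpg ~ wt + hp, data = mtcars)
> par(mfrow = c(2, 2)); plot(fit)     # base-R diagnostic plots
> qqPlot(fit)                         # normality of studentized residuals
> durbinWatsonTest(fit)               # independence of errors
> ncvTest(fit)                        # constant error variance
> vif(fit)                            # multicollinearity check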

unusual observations

  • unusual observations (see the sketch after this list)

1. outliers: outlierTest()
2. high-leverage points: hat statistic (hat.plot() in the source is a custom helper; base R's hatvalues() covers it)
3. influential observations: Cook's distance

  • corrective measures

1. delete the observation
2. transform variables
3. add or delete variables
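
A sketch of these checks, again on an illustrative mtcars fit, using base R's hatvalues() and cooks.distance() in place of the custom hat.plot():

> library(car)
> fit <- lm(mpg ~ wt + hp, data = mtcars)
> outlierTest(fit)                      # Bonferroni outlier test
> hatv <- hatvalues(fit)
> hatv[hatv > 2 * mean(hatv)]           # high-leverage points
> cookd <- cooks.distance(fit)
> cookd[cookd > 4 / (nrow(mtcars) - length(coef(fit)) - 1)]  # influential points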

Generalized Linear Models

generalized linear models

  • glm()
    Distribution Family | Default Link Function
    binomial | (link = "logit")
    gaussian | (link = "identity")
    Gamma | (link = "inverse")
    inverse.gaussian | (link = "1/mu^2")
    poisson | (link = "log")
    quasi | (link = "identity", variance = "constant")
    quasibinomial | (link = "logit")
    quasipoisson | (link = "log")
  • functions used with glm()
    Function | Description
    summary() | display details of the fitted model
    coefficients()/coef() | list the parameters of the fitted model
    confint() | confidence intervals for the model parameters
    residuals() | residuals of the fitted model
    anova() | analysis-of-variance table comparing two fitted models
    plot() | diagnostic plots for evaluating the fitted model
    predict() | predict on new data with the fitted model
    deviance() | deviance of the fitted model
    df.residual() | residual degrees of freedom of the fitted model
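
The general calling pattern, as a sketch (Y, X1, X2, and mydata are placeholders, not real objects):

> # logistic regression: binomial family, default logit link; mydata is a placeholder
> fit <- glm(Y ~ X1 + X2, family = binomial(), data = mydata)
> summary(fit)                                        # model details
> confint(fit)                                        # confidence intervals
> exp(coef(fit))                                      # odds ratios
> predict(fit, newdata = mydata, type = "response")   # predicted probabilities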

logistic

> data(Affairs,package="AER")
> summary(Affairs)
> table(Affairs$affairs)
> Affairs$ynaffair[Affairs$affairs > 0] <- 1
> Affairs$ynaffair[Affairs$affairs == 0] <- 0
> Affairs$ynaffair <- factor(Affairs$ynaffair,levels=c(0,1),labels=c("No","Yes"))
> table(Affairs$ynaffair)
> fit.full <- glm(ynaffair ~ gender + age + yearsmarried + children +
             religiousness + education + occupation + rating,
             family = binomial(),data = Affairs)
> fit.reduced <- glm(ynaffair ~ age + yearsmarried + religiousness + rating,
                      family = binomial(),data = Affairs)
> anova(fit.reduced,fit.full,test="Chisq")
> coef(fit.reduced)
> exp(coef(fit.reduced))
> testdata <- data.frame(rating=c(1,2,3,4,5),age=mean(Affairs$age),
                                   yearsmarried=mean(Affairs$yearsmarried),
                                   religiousness=mean(Affairs$religiousness))
> testdata
> testdata$prob <- predict(fit.reduced,newdata=testdata,type="response")
> testdata <- data.frame(rating=mean(Affairs$rating),
                                    age=seq(17,57,10),
                                    yearsmarried=mean(Affairs$yearsmarried),
                                    religiousness=mean(Affairs$religiousness))
> testdata
> testdata$prob <- predict(fit.reduced,newdata=testdata,type="response")
> testdata
> deviance(fit.reduced)/df.residual(fit.reduced)
> fit.od <- glm(ynaffair ~ age + yearsmarried + religiousness + rating,
                 family = quasibinomial(),data = Affairs)
> pchisq(summary(fit.od)$dispersion * fit.reduced$df.residual,
             fit.reduced$df.residual,lower=F)

poisson

> data(breslow.dat,package="robust")
> names(breslow.dat)
> summary(breslow.dat[c(6,7,8,10)])
> opar <- par(no.readonly=TRUE)
> par(mfrow=c(1,2))
> attach(breslow.dat)
> hist(sumY,breaks=20,xlab="Seizure Count",main="Distribution of Seizures")
> boxplot(sumY~Trt,xlab="Treatment",main="Group Comparisons")
> par(opar)
> fit <- glm(sumY ~ Base + Age + Trt,data=breslow.dat,family=poisson())
> summary(fit)
> coef(fit)
> exp(coef(fit))
> deviance(fit)/df.residual(fit)
> library(qcc)
> qcc.overdispersion.test(breslow.dat$sumY,type="poisson")
> fit.od <- glm(sumY ~ Base + Age + Trt,data=breslow.dat,
       family=quasipoisson())
> summary(fit.od)
> fit <- glm(sumY ~ Base + Age + Trt,data=breslow.dat,offset=log(time),family=poisson)

[II] Cluster

1. steps of a cluster analysis

1. choose appropriate variables
2. scale the data
3. screen for outliers
4. compute distances
5. select a clustering algorithm
6. obtain one or more cluster solutions
7. determine the number of clusters present
8. obtain the final cluster solution
9. visualize the results
10. interpret the clusters
11. validate the results

2. computing distances

> data(nutrient,package="flexclust")
> head(nutrient,4)
> d <- dist(nutrient)
> as.matrix(d)[1:4,1:4]

3. hierarchical cluster analysis

> data(nutrient,package="flexclust")
> row.names(nutrient) <- tolower(row.names(nutrient))
> nutrient.scaled <- scale(nutrient)

> d <- dist(nutrient.scaled)

> fit.average <- hclust(d,method="average")
> plot(fit.average,hang=-1,cex=.8,main="Average Linkage Clustering")
> library(NbClust)
> devAskNewPage(ask=TRUE)
> nc <- NbClust(nutrient.scaled,distance="euclidean",
                       min.nc=2,max.nc=15,method="average")
> table(nc$Best.n[1,])
> barplot(table(nc$Best.n[1,]),
              xlab="Number of Clusters",ylab="Number of Criteria",
              main="Number of Clusters Chosen by 26 Criteria")
> clusters <- cutree(fit.average,k=5)
> table(clusters)
> aggregate(nutrient,by=list(cluster=clusters),median)
> aggregate(as.data.frame(nutrient.scaled),by=list(cluster=clusters),median)
> plot(fit.average,hang=-1,cex=.8,
         main="Average Linkage Clustering\n5 Cluster Solution")
> rect.hclust(fit.average,k=5)

4. partitioning cluster analysis

  • K-means clustering
> wssplot <- function(data,nc=15,seed=1234){
      # plot total within-groups sum of squares against number of clusters
      wss <- (nrow(data)-1)*sum(apply(data,2,var))
      for (i in 2:nc){
          set.seed(seed)
          wss[i] <- sum(kmeans(data,centers=i)$withinss)}
      plot(1:nc,wss,type="b",xlab="Number of Clusters",
           ylab="Within groups sum of squares")}
> data(wine,package="rattle")
> head(wine)
> df <- scale(wine[-1])
> wssplot(df)
> library(NbClust)
> set.seed(1234)
> devAskNewPage(ask=TRUE)
> nc <- NbClust(df,min.nc=2,max.nc=15,method="kmeans")
> table(nc$Best.n[1,])
> barplot(table(nc$Best.n[1,]),
              xlab="Number of Clusters",ylab="Number of Criteria",
              main="Number of Clusters Chosen by 26 Criteria")
> set.seed(1234)
> fit.km <- kmeans(df,3,nstart=25)
> fit.km$size
> fit.km$centers
> aggregate(wine[-1],by=list(cluster=fit.km$cluster),mean)
  • partitioning around medoids (PAM)
> library(cluster)
> set.seed(1234)
> fit.pam <- pam(wine[-1],k=3,stand=TRUE)
> fit.pam$medoids
> clusplot(fit.pam,main="Bivariate Cluster Plot")
> ct.pam <- table(wine$Type,fit.pam$clustering)
> library(flexclust)            # provides randIndex()
> randIndex(ct.pam)

5. avoiding nonexistent clusters

> library(fMultivar)
> set.seed(1234)
> df <- rnorm2d(1000,rho=.5)
> df <- as.data.frame(df)
> plot(df,main="Bivariate Normal Distribution with rho=0.5")
> wssplot(df)
> library(NbClust)
> nc <- NbClust(df,min.nc=2,max.nc=15,method="kmeans")
> dev.new()
> barplot(table(nc$Best.n[1,]),
              xlab="Number of Clusters",ylab="Number of Criteria",
              main="Number of Clusters Chosen by 26 Criteria")
> library(ggplot2)
> library(cluster)
> fit <- pam(df,k=2)
> df$clustering <- factor(fit$clustering)
> ggplot(data=df,aes(x=V1,y=V2,color=clustering,shape=clustering)) +
             geom_point() + ggtitle("Clustering of Bivariate Normal Data")
> plot(nc$All.index[,4],type="o",ylab="CCC",xlab="Number of clusters",col="blue")

[III] Classification

data preparation

> loc<-"http://archive.ics.uci.edu/ml/machine-learning-databases/"
> ds<-"breast-cancer-wisconsin/breast-cancer-wisconsin.data"
> url<-paste(loc,ds,sep="")
> breast<-read.table(url,sep=",",header=FALSE,na.strings="?")
> names(breast)<-c("ID","clumpThickness","sizeUniformity",
                              "shapeUniformity","marginalAdhesion",
                              "singleEpithelialCellSize","bareNuclei",
                              "blandChromatin","normalNucleoli","mitosis","class")
> df<-breast[-1]
> df$class<-factor(df$class,levels=c(2,4),
                           labels=c("benign","malignant"))
> set.seed(1234)
> train<-sample(nrow(df),0.7*nrow(df))
> df.train<-df[train,]
> df.validate<-df[-train,]
> table(df.train$class)
> table(df.validate$class)

logistic regression

> fit.logit<-glm(class~.,data=df.train,family=binomial())
> summary(fit.logit)
> prob<-predict(fit.logit,df.validate,type="response")
> logit.pred<-factor(prob>.5,levels=c(FALSE,TRUE),
                             labels=c("benign","malignant"))
> logit.perf<-table(df.validate$class,logit.pred,
                            dnn=c("Actual","Predicted"))
> logit.perf

decision tree

  • classic decision tree
> library(rpart)
> set.seed(1234)
> dtree<-rpart(class~.,data=df.train,method="class",
                       parms=list(split="information"))
> dtree$cptable
> plotcp(dtree)
> dtree.pruned<-prune(dtree,cp=.0125)
> library(rpart.plot)
> prp(dtree.pruned,type=2,extra=104,
        fallen.leaves=TRUE,main="Decision Tree")
> dtree.pred<-predict(dtree.pruned,df.validate,type="class")
> dtree.perf<-table(df.validate$class,dtree.pred,
                             dnn=c("Actual","Predicted"))
> dtree.perf
  • conditional inference tree
> library(party)
> fit.ctree<-ctree(class~.,data=df.train)
> plot(fit.ctree,main="Conditional Inference Tree")
> ctree.pred<-predict(fit.ctree,df.validate,type="response")
> ctree.perf<-table(df.validate$class,ctree.pred,
                            dnn=c("Actual","Predicted"))
> ctree.perf

random forest

> library(randomForest)
> set.seed(1234)
> fit.forest<-randomForest(class~.,data=df.train,na.action=na.roughfix,importance=TRUE)
> fit.forest
> importance(fit.forest,type=2)
> forest.pred<-predict(fit.forest,df.validate)
> forest.perf<-table(df.validate$class,forest.pred,
                              dnn=c("Actual","Predicted"))
> forest.perf

support vector machines

  • svm
> library(e1071)
> set.seed(1234)
> fit.svm<-svm(class~.,data=df.train)
> fit.svm
> svm.pred<-predict(fit.svm,na.omit(df.validate))
> svm.perf<-table(na.omit(df.validate)$class,
                           svm.pred,dnn=c("Actual","Predicted"))
> svm.perf
  • svm model with RBF kernel
> set.seed(1234)
> tuned<-tune.svm(class~.,data=df.train,gamma=10^(-6:1),cost=10^(-10:10))
> tuned
> fit.svm<-svm(class~.,data=df.train,gamma=.01,cost=1)
> svm.pred<-predict(fit.svm,na.omit(df.validate))
> svm.perf<-table(na.omit(df.validate)$class,
                            svm.pred,dnn=c("Actual","Predicted"))
> svm.perf

choosing the best predictive solution

> performance<-function(table,n=2){
        if(!all(dim(table)==c(2,2)))
              stop("Must be a 2 × 2 table")
   tn=table[1,1]
   fp=table[1,2]
   fn=table[2,1]
   tp=table[2,2]
   sensitivity=tp/(tp+fn)
   specificity=tn/(tn+fp)
   ppp=tp/(tp+fp)
   npp=tn/(tn+fn)
   hitrate=(tp+tn)/(tp+tn+fp+fn)
   result<-paste("Sensitivity=",round(sensitivity,n),
           "\nSpecificity=",round(specificity,n),
           "\nPositive Predictive Value=",round(ppp,n),
           "\nNegative Predictive Value=",round(npp,n),
           "\nAccuracy=",round(hitrate,n),"\n",sep="")
   cat(result)
  }

data mining with the rattle package

> install.packages("rattle")
> rattle()
> loc<-"http://archive.ics.uci.edu/ml/machine-learning-databases/"
> ds<-"pima-indians-diabetes/pima-indians-diabetes.data"
> url <-paste(loc,ds,sep="")
> diabetes<-read.table(url,sep=",",header=FALSE)
> names(diabetes)<-c("npregant","plasma","bp","triceps",
                                  "insulin","bmi","pedigree","age","class")
> diabetes$class<-factor(diabetes$class,levels=c(0,1),
                                      labels=c("normal","diabetic"))
> library(rattle)
> rattle()
> cv<-matrix(c(145,50,8,27),nrow=2)
> performance(as.table(cv))

[IV] Time Series

Function Package Description
ts() stats create a time-series object
plot() graphics plot a line graph of the time series
start() stats return the start time of the time series
end() stats return the end time of the time series
frequency() stats return the number of observations per unit time
window() stats subset a time-series object
ma() forecast fit a simple moving-average model
stl() stats decompose a series into seasonal, trend, and irregular components using LOESS smoothing
monthplot() stats plot the seasonal components of a time series
seasonplot() forecast generate a season plot
HoltWinters() stats fit an exponential smoothing model
forecast() forecast predict future values of a series
accuracy() forecast return goodness-of-fit measures for a time-series model
ets() forecast fit an exponential smoothing model, automatically selecting the optimal model
lag() stats return the series shifted by a specified number of lags
Acf() forecast estimate the autocorrelation function
Pacf() forecast estimate the partial autocorrelation function
diff() base return lagged differences of the series
ndiffs() forecast find the optimal number of differences needed to remove trend from the series
adf.test() tseries run an ADF test to determine whether the series is stationary
arima() stats fit an ARIMA model
Box.test() stats run a Ljung-Box test of whether the model's residuals are independent
bds.test() tseries run a BDS test of whether the series is independent and identically distributed
auto.arima() forecast automatically select an ARIMA model
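
To tie these functions together, a brief sketch on the built-in AirPassengers series (chosen here for illustration; the forecast and tseries packages are assumed installed):

> library(forecast)
> library(tseries)
> plot(AirPassengers)                           # monthly airline passengers, 1949-1960
> ndiffs(AirPassengers)                         # differences needed to remove trend
> adf.test(diff(log(AirPassengers)))            # stationarity check after differencing
> fit <- auto.arima(AirPassengers)              # automatic ARIMA selection
> accuracy(fit)                                 # goodness-of-fit measures
> forecast(fit, 12)                             # forecast the next 12 months
> Box.test(residuals(fit), type="Ljung-Box")    # residual independence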

END!
