[I]Missing Value
1.identify missing value
x | is.na(x) | is.nan(x) | is.infinite(x) |
---|---|---|---|
x<-NA | TRUE | FALSE | FALSE |
x<-0/0 | TRUE | TRUE | FALSE |
x<-1/0 | FALSE | FALSE | TRUE |
- is.na(x):missing value
- is.nan(x):impossible value
- is.infinite(x):infinite value
complete.cases():missing values are NA and NaN;Inf and -Inf are valid values
> data(mydata,package="") #load the dataset mydata from a package
> mydata[complete.cases(mydata),] #rows with no missing values
> mydata[!complete.cases(mydata),] #rows with one or more missing values
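The checks above can be tried on a small data frame. This is a minimal sketch; the data frame `df` and its values are made up for illustration.

```r
# Hypothetical toy data containing NA, NaN, and Inf
df <- data.frame(a = c(1, NA, 3, 4),
                 b = c(2, 5, NaN, 8),
                 z = c(1, 2, 3, Inf))

is.na(df$a)        # TRUE only for the NA entry
is.nan(df$b)       # TRUE only for the NaN entry
is.infinite(df$z)  # TRUE only for the Inf entry

# complete.cases() treats NA and NaN as missing; Inf counts as a valid value
complete_rows   <- df[complete.cases(df), ]   # rows 1 and 4
incomplete_rows <- df[!complete.cases(df), ]  # rows 2 and 3
```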
2.missing pattern
Pattern | Package | Function | Description |
---|---|---|---|
list | mice | md.pattern(x) | 0:missing value;1:no missing value |
graphic | VIM | aggr(x,prop=FALSE,numbers=TRUE) | prop=FALSE:show counts instead of proportions;numbers=FALSE(default):omit numeric labels |
graphic | VIM | matrixplot(x,pch=,col=) | light color:small value,dark color:large value,red(default):missing value |
related | none | none | x<-as.data.frame(abs(is.na(data))) head(x,5) y<-x[which(apply(x,2,sum)>0)] cor(data,y,use="pairwise.complete.obs") |
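The correlation approach in the last row of the table can be sketched with base R only. The data frame `df` is hypothetical; the function calls are the ones listed in the table.

```r
# Hypothetical data: x1 and x2 contain missing values, x3 does not
df <- data.frame(x1 = c(1, NA, 3, NA, 5, 6),
                 x2 = c(2, 4, NA, 8, 10, 12),
                 x3 = c(1, 2, 3, 4, 5, 6))

x <- as.data.frame(abs(is.na(df)))    # indicator: 1 = missing, 0 = observed
head(x, 5)
y <- x[which(apply(x, 2, sum) > 0)]   # keep only variables that have missing values

cor(y)                                # do variables tend to be missing together?
# Relationship between observed values and missingness.
# Some entries are NA (with a warning) where an indicator is constant
# over the pairwise-complete rows.
cor(df, y, use = "pairwise.complete.obs")
```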
3.processing missing value
Method | Description |
---|---|
listwise deletion | newdata<-na.omit(mydata) |
MI (multiple imputation) | library(mice) imp<-mice(data,m) fit<-with(imp,analysis) pooled<-pool(fit) summary(pooled) |
mvnmle | maximum likelihood estimation for multivariate normal data with missing values |
cat | multiple imputation of multivariate categorical data under log-linear models |
arrayImpute, arrayMissPattern, SeqKnn | methods for microarray missing data |
longitudinalData | utility functions for longitudinal data, including imputation of missing values |
kmi | Kaplan-Meier multiple imputation |
mix | multiple imputation for mixed categorical and continuous data |
pan | multiple imputation for multivariate panel or clustered data |
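A minimal sketch of the first two methods in the table. The data frame `mydata` is hypothetical. The multiple imputation part assumes the mice package is installed and uses its bundled `nhanes` data; the model `lm(bmi ~ age)` is only an illustrative analysis.

```r
# Listwise deletion: drop every row that has at least one missing value
mydata <- data.frame(age = c(25, 31, NA, 47),
                     bmi = c(22.1, NA, 27.5, 30.2))
newdata <- na.omit(mydata)   # keeps rows 1 and 4

# Multiple imputation, run only if mice is available
if (requireNamespace("mice", quietly = TRUE)) {
  library(mice)
  imp    <- mice(nhanes, m = 5, seed = 1234, printFlag = FALSE)  # 5 imputed datasets
  fit    <- with(imp, lm(bmi ~ age))   # fit the model on each imputed dataset
  pooled <- pool(fit)                  # combine the 5 sets of estimates
  print(summary(pooled))
}
```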
[II]Date Value
Function | Description |
---|---|
date() | output current date and time |
Sys.Date() | output current date |
as.Date(x,"input_format") | character convert to date |
as.character(dates) | convert date to character |
difftime(date1,date2,units=) | time interval,units="auto"/"secs"/"mins"/"hours"/"days"/"weeks" |
format(x,format="output_format") | output date in the specified format |
input/output format
Symbol | Description | Example |
---|---|---|
%d | day of the month (01~31) | 01 |
%a | abbreviated weekday name | Mon |
%A | unabbreviated weekday name | Monday |
%m | month (01~12) | 01 |
%b | abbreviated month | Jan |
%B | non-abbreviated month | January |
%y | two-digit year | 19 |
%Y | four-digit year | 2019 |
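The date functions and format symbols above can be combined as follows. This is a small sketch with made-up dates; names such as `strDates` are illustrative.

```r
# Character strings in month/day/four-digit-year form (hypothetical dates)
strDates <- c("01/05/2019", "02/14/2019")
dates <- as.Date(strDates, "%m/%d/%Y")   # character -> Date

iso   <- format(dates, "%Y-%m-%d")       # "2019-01-05" "2019-02-14"
short <- format(dates, "%d/%m/%y")       # "05/01/19" "14/02/19"
chr   <- as.character(dates)             # Date -> character

gap_days  <- difftime(dates[2], dates[1], units = "days")   # 40 days
gap_weeks <- difftime(dates[2], dates[1], units = "weeks")  # 40/7 weeks

# %a, %A, %b, %B produce names in the current locale
format(Sys.Date(), "%A, %d %B %Y")
```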
[III]Type Conversion
Judgement | Conversion |
---|---|
is.numeric() | as.numeric() |
is.character() | as.character() |
is.vector() | as.vector() |
is.matrix() | as.matrix() |
is.data.frame() | as.data.frame() |
is.factor() | as.factor() |
is.logical() | as.logical() |
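A short sketch of how the test and conversion functions pair up. All values are made up. The factor case is included because converting a factor directly with as.numeric() returns the internal level codes instead of the labels.

```r
x <- c("1", "2", "3")
is.character(x)            # TRUE
y <- as.numeric(x)         # character -> numeric
is.numeric(y)              # TRUE

f <- as.factor(c("10", "20", "10"))
codes  <- as.numeric(f)                 # level codes: 1 2 1
values <- as.numeric(as.character(f))   # original values: 10 20 10

m  <- matrix(1:4, nrow = 2)
df <- as.data.frame(m)     # matrix -> data frame
is.data.frame(df)          # TRUE

as.logical(c(0, 1, 2))     # FALSE TRUE TRUE
```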
[IV]Data Sorting
> newdata<-dataframe[order(dataframe$x1,dataframe$x2),] #x:ascending;-x:descending
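A minimal sketch of sorting by two keys, using a made-up data frame. A minus sign in front of a numeric key reverses its order.

```r
dataframe <- data.frame(x1 = c(2, 1, 2, 1),
                        x2 = c(10, 30, 40, 20))

# x1 ascending, then x2 ascending within ties of x1
asc <- dataframe[order(dataframe$x1, dataframe$x2), ]

# x1 ascending, then x2 descending within ties of x1
mixed <- dataframe[order(dataframe$x1, -dataframe$x2), ]
```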
[V]Data merging
> total<-merge(dataframeA,dataframeB,by="x1") #merge columns by key variable x1
> total<-cbind(dataframeA,dataframeB) #direct column merger
> total<-rbind(dataframeA,dataframeB) #direct row merger
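The difference between the three calls is easiest to see on small data. This is a sketch with hypothetical data frames: merge() matches rows by key, while cbind() and rbind() combine by position.

```r
dataframeA <- data.frame(x1 = c(1, 2, 3), height = c(170, 165, 180))
dataframeB <- data.frame(x1 = c(2, 3, 4), weight = c(60, 75, 68))

# Keyed merge: by default only keys present in both are kept (x1 = 2, 3)
total <- merge(dataframeA, dataframeB, by = "x1")

# Positional column bind: needs the same number of rows, does not match keys
side <- cbind(dataframeA, weight = dataframeB$weight)

# Row bind: needs the same column names in both data frames
stacked <- rbind(dataframeA, data.frame(x1 = 4, height = 175))
```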
[VI]Subset of Dataset
> newdata<-dataframe[row_indices,column_indices] #keep the selected rows and variables
> dataframe$x1<-dataframe$x2<-NULL #delete variables x1,x2
> newdata<-subset(dataframe,condition) #keep rows that satisfy condition
> mysample<-dataframe[sample(1:nrow(dataframe),size,replace=FALSE),] #size:number of rows to draw;replace=FALSE:without replacement,TRUE:with replacement
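A sketch of the four subsetting idioms on a made-up data frame; column names and values are illustrative only.

```r
dataframe <- data.frame(id  = 1:6,
                        age = c(23, 35, 41, 29, 52, 38),
                        sex = c("M", "F", "F", "M", "F", "M"))

# Keep rows 1-3 and the columns id and age
newdata <- dataframe[1:3, c("id", "age")]

# Keep rows that satisfy a condition; select= chooses the columns
older <- subset(dataframe, age >= 35 & sex == "F", select = c(id, age))

# Drop a variable from a copy
trimmed <- dataframe
trimmed$sex <- NULL

# Random sample of 3 rows without replacement (seed fixed for repeatability)
set.seed(42)
mysample <- dataframe[sample(1:nrow(dataframe), 3, replace = FALSE), ]
```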
[VII]Processing Functions
1.math functions
Function | Description |
---|---|
abs(x) | absolute value |
sqrt(x) | square root |
ceiling(x) | smallest integer not less than x |
floor(x) | largest integer not greater than x |
trunc(x) | integer formed by truncating x toward 0 |
round(x,digits=n) | round x to n decimal places |
signif(x,digits=n) | round x to n significant digits |
cos(x),sin(x),tan(x) | cosine,sine,tangent |
acos(x),asin(x),atan(x) | arccosine,arcsine,arctangent |
cosh(x),sinh(x),tanh(x) | hyperbolic cosine,hyperbolic sine,hyperbolic tangent |
acosh(x),asinh(x),atanh(x) | inverse hyperbolic cosine,inverse hyperbolic sine,inverse hyperbolic tangent |
log(x,base=n) | base=n,logarithm of x;log(x):base value=e;log10(x):base value=10 |
exp(x) | exponential function |
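A few of the rounding and logarithm functions side by side. The input values are arbitrary; the negative number shows how ceiling(), floor(), and trunc() differ.

```r
x <- c(-2.567, 3.14159)

abs(x)                    # 2.567 3.14159
ceiling(x)                # -2  4
floor(x)                  # -3  3
trunc(x)                  # -2  3  (toward zero)
round(x, digits = 2)      # -2.57  3.14
signif(x, digits = 2)     # -2.6  3.1

sqrt(16)                  # 4
log(100, base = 10)       # 2
log10(100)                # 2
log(exp(1))               # 1 (natural logarithm)
```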
2.statistical function
Function | Description |
---|---|
mean(x) | mean |
median(x) | median |
sd(x) | standard deviation |
var(x) | variance |
mad(x) | median absolute deviation |
quantile(x,probs) | quantile |
range(x) | range |
sum(x) | sum |
diff(x,lag=n) | lagged differences |
min(x) | minimum |
max(x) | maximum |
scale(x,center=TRUE,scale=TRUE) | centering:center=TRUE,scale=FALSE;standardizing:center=TRUE,scale=TRUE |
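The statistical functions applied to one small, made-up vector:

```r
x <- c(2, 4, 4, 4, 5, 5, 7, 9)

mean(x)                            # 5
median(x)                          # 4.5
var(x)                             # sample variance, 32/7
sd(x)                              # square root of the variance
mad(x)                             # median absolute deviation
quantile(x, probs = c(0.25, 0.75)) # lower and upper quartiles
range(x)                           # 2 9
sum(x)                             # 40
diff(x, lag = 1)                   # 2 0 0 1 0 2 2
min(x); max(x)                     # 2 and 9

# Standardize: subtract the mean, divide by the standard deviation
z <- scale(x, center = TRUE, scale = TRUE)
```

After standardizing, `z` has mean 0 and standard deviation 1 (up to floating-point error).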
3.probability function
> [d/p/q/r]distribution_abbreviation()
- d=density
- p=distribution function
- q=quantile function
- r=random generation (random deviates)
Distribution | Abbreviation |
---|---|
Beta | beta |
Binomial | binom |
Cauchy | cauchy |
Chi-square | chisq |
Exponential | exp |
F | f |
Gamma | gamma |
Geometric | geom |
Hypergeometric | hyper |
Lognormal | lnorm |
Logistic | logis |
Multinomial | multinom |
Negative Binomial | nbinom |
Normal | norm |
Poisson | pois |
Wilcoxon signed rank | signrank |
T | t |
Uniform | unif |
Weibull | weibull |
Wilcoxon rank sum | wilcox |
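The prefix plus abbreviation scheme in practice, sketched for the normal and binomial distributions. The parameter values are arbitrary.

```r
dnorm(0)                    # density of N(0,1) at 0, about 0.399
pnorm(1.96)                 # P(Z <= 1.96), about 0.975
qnorm(0.975)                # 97.5th percentile of N(0,1), about 1.96

set.seed(1234)              # make the random draws reproducible
r <- rnorm(5, mean = 50, sd = 10)   # 5 draws from N(50, 10^2)

# Binomial: probability of exactly 3 successes in 10 fair trials
p3 <- dbinom(3, size = 10, prob = 0.5)   # 120/1024
```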
4.character processing function
Function | Description |
---|---|
nchar(x) | number of characters in x |
substr(x,start,stop) | extract or replace a substring in a character vector |
grep(pattern,x,ignore.case=FALSE,fixed=FALSE) | search for a pattern in x and return the matching indices.Regular expression:fixed=FALSE;Text string:fixed=TRUE |
sub(pattern,replacement,x,ignore.case=FALSE,fixed=FALSE) | search for a pattern in x and replace the first match with replacement |
strsplit(x,split,fixed=FALSE) | split the elements of x at split |
paste(...,sep="") | concatenate strings,separated by sep |
toupper(x) | convert to uppercase |
tolower(x) | convert to lowercase |
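A sketch of the character functions on a made-up vector. The `fixed=TRUE` call shows the difference between a regular expression and a literal text search.

```r
x <- c("apple pie", "Banana", "cherry")

n     <- nchar(x)                           # 9 6 6
first <- substr(x, 1, 3)                    # "app" "Ban" "che"

hit   <- grep("an", x)                      # 2 (index of "Banana")
hit_i <- grep("AN", x, ignore.case = TRUE)  # 2
dot   <- grep(".", x, fixed = TRUE)         # integer(0): no literal "." in x

under <- sub(" ", "_", x)                   # "apple_pie" "Banana" "cherry"
parts <- strsplit("a,b,c", ",", fixed = TRUE)[[1]]   # "a" "b" "c"
ids   <- paste("x", 1:3, sep = "")          # "x1" "x2" "x3"

toupper("abc")                              # "ABC"
tolower("ABC")                              # "abc"
```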
5.others
Function | Description |
---|---|
length(x) | the length of x |
seq(from,to,by) | generate a sequence |
rep(x,n) | repeat x n times |
cut(x,n) | divide the continuous variable x into a factor with n levels |
pretty(x,n) | create about n "pretty" break points that divide x into equal intervals |
cat(...,file="myfile",append=FALSE) | concatenate the objects in ... and output them to the screen or to a file |
apply(x,MARGIN,FUN,...) | x:data,MARGIN:dimension index(1=rows,2=columns),FUN:function to apply |
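The utility functions above on small, made-up inputs:

```r
length(c(2, 5, 6, 9))          # 4
s <- seq(1, 10, by = 2)        # 1 3 5 7 9
r <- rep(1:3, 2)               # 1 2 3 1 2 3

groups <- cut(c(1, 5, 10), 3)  # factor with 3 levels (equal-width intervals)
breaks <- pretty(c(0, 97), 5)  # rounded break points covering 0..97

cat("Hello", "world", "\n")    # prints to the screen; file= would redirect it

m <- matrix(1:6, nrow = 2)     # 2 x 3 matrix, filled by column
row_sums  <- apply(m, 1, sum)  # 9 12
col_means <- apply(m, 2, mean) # 1.5 3.5 5.5
```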
User-defined function
> myfunction<-function(arg1,arg2,...){
statements
return(object)
}
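A concrete instance of the template above. The function name `mystats` and its `robust` argument are hypothetical, chosen only to show arguments with defaults and a returned object.

```r
# Return a measure of center and spread; robust = TRUE switches to
# median and median absolute deviation
mystats <- function(x, robust = FALSE) {
  if (robust) {
    center <- median(x)
    spread <- mad(x)
  } else {
    center <- mean(x)
    spread <- sd(x)
  }
  return(list(center = center, spread = spread))
}

res_plain  <- mystats(c(1, 2, 3))                    # center = 2, spread = 1
res_robust <- mystats(c(1, 2, 100), robust = TRUE)   # center = 2
```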
[VIII]Control Flow
Description | Function |
---|---|
Repeat and Loop | for (var in seq) statement |
Repeat and Loop | while (cond) statement |
Conditional Execution | if (cond) statement |
Conditional Execution | if (cond) statement1 else statement2 |
Conditional Execution | ifelse(cond,statement1,statement2) |
Conditional Execution | switch(expr,...) |
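Each construct from the table in a minimal, made-up example:

```r
# for: sum the integers 1..5
total <- 0
for (i in 1:5) total <- total + i        # total is 15

# while: count up until the condition fails
n <- 0
while (n < 3) n <- n + 1                 # n is 3

# if / else: works on a single logical value
score <- 72
if (score >= 60) grade <- "pass" else grade <- "fail"

# ifelse: vectorized, one result per element
scores  <- c(45, 60, 88)
outcome <- ifelse(scores >= 60, "pass", "fail")   # "fail" "pass" "pass"

# switch: pick a branch by name; the unnamed last value is the fallback
feeling <- "sad"
msg <- switch(feeling,
              happy = "Glad to hear it",
              sad   = "Cheer up",
              "Unknown feeling")
```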
[IX]Aggregate and Reshape
Function | Description |
---|---|
t(x) | Transpose |
aggregate(x,by,FUN) | x=data,by:a list of variable name,FUN:function |
melt(x,id.vars) | reshape2 package,melt data into long format;id.vars:identifier variables |
dcast(md,formula,fun.aggregate) | md:melted data,formula:rowvar1+rowvar2+...~colvar1+colvar2+...,fun.aggregate:aggregate function |
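A sketch of aggregating and reshaping. The first part uses the built-in `mtcars` data. The second part assumes the reshape2 package is installed and uses a hypothetical data frame `mydata`.

```r
# Mean mpg and hp for each number of cylinders
agg <- aggregate(mtcars[c("mpg", "hp")],
                 by = list(cyl = mtcars$cyl), FUN = mean)
t(agg)   # transpose: groups become columns

# Melt to long format, then cast back with aggregation (needs reshape2)
if (requireNamespace("reshape2", quietly = TRUE)) {
  mydata <- data.frame(ID   = c(1, 1, 2, 2),
                       Time = c(1, 2, 1, 2),
                       X1   = c(5, 3, 6, 2),
                       X2   = c(6, 5, 1, 4))
  md   <- reshape2::melt(mydata, id.vars = c("ID", "Time"))  # one row per ID/Time/variable
  wide <- reshape2::dcast(md, ID ~ variable, fun.aggregate = mean)  # mean of X1, X2 per ID
  print(wide)
}
```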
[X]component analysis
step analysis diagram of principal component/exploratory factor
principal component analysis
1.determine the number of principal components
> library(psych)
> fa.parallel(Harman23.cor$cov,n.obs=302,fa="pc",n.iter=100,
show.legend=FALSE,main="Scree plot with parallel analysis")
2.extract the principal components
> pc<-principal(r=USJudgeRatings[,-1],nfactors=1,rotate=,scores=) #r:correlation matrix or raw data,rotate default:"varimax",scores:whether to compute component scores
3.principal component rotation
> rc<-principal(Harman23.cor$cov,nfactors=2,rotate="varimax")
4.get the principal component scoring coefficients
> round(unclass(rc$weights),2)
exploratory factor analysis
1.determine the number of common factors
> library(psych)
> covariances<-ability.cov$cov
> correlations<-cov2cor(covariances)
> fa.parallel(correlations,n.obs=112,fa="both",n.iter=100,
main="Scree plots with parallel analysis")
2.extracting common factor
> fa<-fa(correlations,nfactors=2,rotate="none",fm="pa")
3.factor rotation
> fa.varimax<-fa(correlations,nfactors=2,rotate="varimax",fm="pa") #orthogonal
> fa.promax<-fa(correlations,nfactors=2,rotate="promax",fm="pa") #oblique
> factor.plot(fa.promax,labels=rownames(fa.promax$loadings))
> fa.diagram(fa.promax,simple=FALSE)
4.factor score
> fa.promax$weights