R-Organize Data(step 2)

Introduction: R is a data analysis and visualization platform.

[I]Missing Value

1.identify missing value

x         is.na(x)   is.nan(x)   is.infinite(x)
x<-NA     TRUE       FALSE       FALSE
x<-0/0    TRUE       TRUE        FALSE
x<-1/0    FALSE      FALSE       TRUE
  • is.na(x):missing value
  • is.nan(x):impossible value
  • is.infinite(x):infinite value

complete.cases(): treats NA and NaN as missing values; Inf and -Inf are valid values

> data(mydata,package="")        #load a dataset from a package
> mydata[complete.cases(mydata),]        #rows with no missing value
> mydata[!complete.cases(mydata),]        #rows with one or more missing values
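The three tests and complete.cases() can be tried on a small example. This is a minimal sketch; the data frame `df` and its columns are hypothetical and not from the original.

```r
# Hypothetical toy data frame (not from the original article)
df <- data.frame(id    = 1:4,
                 score = c(90, NA, 75, 0/0),   # NA and NaN both count as missing
                 ratio = c(1, 2, 1/0, 4))      # Inf is treated as a valid value

is.na(df$score)         # FALSE  TRUE FALSE  TRUE
is.nan(df$score)        # FALSE FALSE FALSE  TRUE
is.infinite(df$ratio)   # FALSE FALSE  TRUE FALSE

complete_rows   <- df[complete.cases(df), ]    # rows 1 and 3
incomplete_rows <- df[!complete.cases(df), ]   # rows 2 and 4
```

Row 3 stays in `complete_rows` even though it contains Inf, because complete.cases() only looks for NA and NaN.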

2.missing pattern

Pattern   Package   Function                           Description
list      mice      md.pattern(x)                      0: missing value; 1: no missing value
graphic   VIM       aggr(x,prop=FALSE,numbers=TRUE)    numbers=FALSE (default): omit the numeric labels
graphic   VIM       matrixplot(x,pch=,col=)            light color: small value; dark color: large value; red: missing value (default)
related   none      none                               correlate missing-value indicators with the observed variables:

> x<-as.data.frame(abs(is.na(data)))        #1:missing value;0:observed
> head(x,5)
> y<-x[which(apply(x,2,sum)>0)]        #keep variables with at least one missing value
> cor(data,y,use="pairwise.complete.obs")
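The indicator-correlation approach can be run with base R only. A minimal sketch, assuming a hypothetical data frame `df` in which `income` tends to be missing for older respondents:

```r
# Hypothetical toy data (not from the original article)
df <- data.frame(age    = c(25, 32, 47, 51, 62, 68),
                 income = c(30, 42, NA, 55, NA, NA),
                 hours  = c(40, NA, 38, 45, 20, 10))

shadow <- as.data.frame(abs(is.na(df)))        # 1 = missing, 0 = observed
shadow <- shadow[which(colSums(shadow) > 0)]   # keep variables that have missing values
colSums(shadow)                                # number of missing values per variable

# Rows: observed variables; columns: missingness indicators.
# A pair with zero variance among its complete cases gives NA (with a warning).
miss_cor <- cor(df, shadow, use = "pairwise.complete.obs")
miss_cor["age", "income"]                      # positive: income is missing more often at higher ages
```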

3.processing missing value

Method                      Description
row deletion                newdata<-na.omit(mydata)
MI (multiple imputation)    library(mice)
                            imp<-mice(data,m)
                            fit<-with(imp,analysis)
                            pooled<-pool(fit)
                            summary(pooled)

Other packages for missing data

Package                                  Description
mvnmle                                   maximum likelihood estimation for multivariate normal data with missing values
cat                                      multiple imputation of multivariate categorical data under log-linear models
arrayImpute, arrayMissPattern, SeqKnn    functions for missing microarray data
longitudinalData                         related utility functions
kmi                                      Kaplan-Meier multiple imputation
mix                                      multiple imputation for mixed categorical and continuous data
pan                                      multiple imputation for multivariate panel or clustered data
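Row deletion is the simplest of these methods and needs no extra package. A minimal sketch with a hypothetical data frame `df`; the mice workflow is not run here because it depends on an external package.

```r
# Hypothetical toy data (not from the original article)
df <- data.frame(x = c(1, 2, NA, 4),
                 y = c(10, NA, 30, 40))

newdata <- na.omit(df)      # listwise deletion: only rows 1 and 4 remain
nrow(newdata)               # 2

# Alternative for a single statistic: skip missing values inside the call
mean(df$x, na.rm = TRUE)    # mean of 1, 2 and 4
```

Listwise deletion drops half of this toy data set, which is why imputation methods are often preferred when many rows are incomplete.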

[II]Date Value

Function                            Description
date()                              output the current date and time
Sys.Date()                          output the current date
as.Date(x,"input_format")           convert a character string to a date
as.character(dates)                 convert a date to a character string
difftime(date1,date2,units=)        time interval; units="weeks"/"days"/"hours"/"mins"/"secs"
format(x,format="output_format")    output a date in the specified format

input/output format

Symbol   Description                 Example
%d       day of the month (01~31)    01
%a       abbreviated weekday name    Mon
%A       full weekday name           Monday
%m       month (01~12)               01
%b       abbreviated month name      Jan
%B       full month name             January
%y       two-digit year              19
%Y       four-digit year             2019
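The date functions and format symbols combine as follows. A minimal sketch with arbitrarily chosen dates; only numeric symbols are used so the output does not depend on the locale.

```r
d1 <- as.Date("01/15/2019", "%m/%d/%Y")    # character -> Date using an input format
d2 <- as.Date("2019-03-01")                # the default input format is yyyy-mm-dd

short <- format(d1, format = "%d.%m.%y")   # "15.01.19"

gap <- difftime(d2, d1, units = "days")    # time interval between the two dates
as.numeric(gap)                            # 45
as.numeric(difftime(d2, d1, units = "weeks"))

as.character(d2)                           # "2019-03-01"
```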

[III]Type Conversion

Test               Conversion
is.numeric()       as.numeric()
is.character()     as.character()
is.vector()        as.vector()
is.matrix()        as.matrix()
is.data.frame()    as.data.frame()
is.factor()        as.factor()
is.logical()       as.logical()
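A short illustration of the test/convert pairs, using made-up values. Note that a string that cannot be parsed becomes NA, and that a factor should be converted through its labels rather than its internal codes.

```r
x <- c("1", "2", "three")
is.character(x)                            # TRUE

n <- suppressWarnings(as.numeric(x))       # "three" cannot be parsed and becomes NA
is.numeric(n)                              # TRUE

f <- as.factor(c("low", "high", "low"))
levels(f)                                  # "high" "low" (sorted alphabetically)

g <- factor(c("10", "20"))
vals <- as.numeric(as.character(g))        # 10 20: converts the labels, not the codes
```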

[IV]Data Sorting

> newdata<-dataframe[order(dataframe$x1,dataframe$x2),]        #x:ascending;-x:descending
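A minimal sketch of sorting by two keys, with a hypothetical data frame `df`: ascending by `group`, then descending by `score` within each group.

```r
# Hypothetical toy data (not from the original article)
df <- data.frame(name  = c("a", "b", "c", "d"),
                 group = c(2, 1, 2, 1),
                 score = c(70, 85, 90, 60))

sorted <- df[order(df$group, -df$score), ]   # minus sign reverses the order of a numeric key
sorted$name                                  # "b" "d" "c" "a"
```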

[V]Data merging

> total<-merge(dataframeA,dataframeB,by="x1")        #merge columns by the key variable x1
> total<-cbind(dataframeA,dataframeB)        #direct column merge (same number of rows, same order)
> total<-rbind(dataframeA,dataframeB)        #direct row merge (same variables)
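The difference between the three merges shows up clearly on two small tables. A minimal sketch; `A`, `B` and their columns are hypothetical.

```r
# Hypothetical toy data (not from the original article)
A <- data.frame(id = c(1, 2, 3), height = c(170, 165, 180))
B <- data.frame(id = c(2, 3, 4), weight = c(60, 75, 82))

total   <- merge(A, B, by = "id")               # keeps only ids present in both: 2 and 3
all_ids <- merge(A, B, by = "id", all = TRUE)   # keeps every id, filling gaps with NA

side    <- cbind(A, flag = c(TRUE, FALSE, TRUE))        # adds a column, row by row
stacked <- rbind(A, data.frame(id = 5, height = 158))   # adds a row with the same variables
```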

[VI]Subset of Dataset

> newdata<-dataframe[row_indices,column_indices]        #keep the selected rows and variables
> dataframe$x1<-dataframe$x2<-NULL        #delete variables x1,x2
> newdata<-subset(dataframe,condition)        #keep the rows that satisfy condition
> mysample<-dataframe[sample(1:nrow(dataframe),size,replace=),]        #size:number of rows to draw;replace=FALSE/TRUE(without/with replacement)
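A minimal sketch of the three ways to take a subset, with a hypothetical data frame `df`. The seed value is arbitrary and only makes the random sample repeatable.

```r
# Hypothetical toy data (not from the original article)
df <- data.frame(id  = 1:6,
                 age = c(23, 35, 41, 29, 52, 38),
                 sex = c("M", "F", "F", "M", "F", "M"))

kept <- df[1:3, c("id", "age")]                                  # by row and column indices
sub  <- subset(df, age >= 35 & sex == "F", select = c(id, age))  # by condition: ids 2, 3, 5

set.seed(1)                                                      # arbitrary seed
mysample <- df[sample(1:nrow(df), 3, replace = FALSE), ]         # 3 distinct random rows
```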

[VII]Processing Functions

1.math functions

Function                      Description
abs(x)                        absolute value
sqrt(x)                       square root
ceiling(x)                    smallest integer not less than x
floor(x)                      largest integer not greater than x
trunc(x)                      integer part of x, truncated toward 0
round(x,digits=n)             round x to n decimal places
signif(x,digits=n)            round x to n significant digits
cos(x),sin(x),tan(x)          cosine, sine, tangent
acos(x),asin(x),atan(x)       arccosine, arcsine, arctangent
cosh(x),sinh(x),tanh(x)       hyperbolic cosine, hyperbolic sine, hyperbolic tangent
acosh(x),asinh(x),atanh(x)    inverse hyperbolic cosine, inverse hyperbolic sine, inverse hyperbolic tangent
log(x,base=n)                 logarithm of x to base n; log(x): natural logarithm (base e); log10(x): base 10
exp(x)                        exponential function
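The rounding functions are easy to confuse for negative numbers, so a short worked example may help. The input value is arbitrary.

```r
x <- -3.567

abs(x)                   # 3.567
ceiling(x)               # -3  (smallest integer not less than x)
floor(x)                 # -4  (largest integer not greater than x)
trunc(x)                 # -3  (drops the fractional part, toward 0)
round(x, digits = 2)     # -3.57
signif(x, digits = 2)    # -3.6

log(8, base = 2)         # 3
log10(1000)              # 3
log(exp(1))              # 1 (natural logarithm)
```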

2.statistical function

Function                           Description
mean(x)                            mean
median(x)                          median
sd(x)                              standard deviation
var(x)                             variance
mad(x)                             median absolute deviation
quantile(x,probs)                  quantiles
range(x)                           range (minimum and maximum)
sum(x)                             sum
diff(x,lag=n)                      lagged differences
min(x)                             minimum
max(x)                             maximum
scale(x,center=TRUE,scale=TRUE)    centering: center=TRUE; standardizing: center=TRUE,scale=TRUE
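A minimal sketch applying the statistical functions to a small made-up vector. sd() and var() use the sample (n-1) definition.

```r
x <- c(2, 4, 4, 4, 5, 5, 7, 9)

mean(x)                          # 5
median(x)                        # 4.5
var(x)                           # 32/7, sample variance
sd(x)                            # square root of the variance
range(x)                         # 2 9
quantile(x, c(0.25, 0.75))       # lower and upper quartiles
diff(c(1, 4, 9, 16), lag = 1)    # 3 5 7

z <- scale(x)                    # centered and standardized: mean 0, sd 1
```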

3.probability function

> [d/p/q/r]distribution_abbreviation()
  • d=density
  • p=distribution function
  • q=quantile function
  • r=random generation

Distribution            Abbreviation
Beta                    beta
Binomial                binom
Cauchy                  cauchy
Chi-square              chisq
Exponential             exp
F                       f
Gamma                   gamma
Geometric               geom
Hypergeometric          hyper
Lognormal               lnorm
Logistic                logis
Multinomial             multinom
Negative Binomial       nbinom
Normal                  norm
Poisson                 pois
Wilcoxon signed rank    signrank
T                       t
Uniform                 unif
Weibull                 weibull
Wilcoxon rank sum       wilcox
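Combining a prefix with an abbreviation gives the function name, for example d + norm = dnorm. A minimal sketch; the seed and the parameter values are arbitrary.

```r
dnorm(0)                         # density of the standard normal at 0: 1/sqrt(2*pi)
pnorm(1.96)                      # P(X <= 1.96), about 0.975
qnorm(0.975)                     # 97.5% quantile, about 1.96

set.seed(1234)                   # arbitrary seed so the draws are repeatable
r <- rnorm(5, mean = 50, sd = 10)   # 5 random values from N(50, 10^2)

dbinom(2, size = 4, prob = 0.5)  # P(exactly 2 successes in 4 trials) = 0.375
```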

4.character processing function

Function                                                    Description
nchar(x)                                                    number of characters in x
substr(x,start,stop)                                        extract or replace a substring in a character vector
grep(pattern,x,ignore.case=FALSE,fixed=FALSE)               search for pattern in x; regular expression: fixed=FALSE; text string: fixed=TRUE
sub(pattern,replacement,x,ignore.case=FALSE,fixed=FALSE)    search for pattern in x and replace it with the text replacement
strsplit(x,split,fixed=FALSE)                               split the elements of x at split
paste(...,sep="")                                           concatenate strings with the separator sep
toupper(x)                                                  convert to uppercase
tolower(x)                                                  convert to lowercase
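A minimal sketch of the character functions on made-up strings, showing both a fixed text search and a regular expression.

```r
x <- c("apple pie", "Banana", "cherry")

nchar(x)                                     # 9 6 6
substr(x[1], 1, 5)                           # "apple"
grep("an", x, fixed = TRUE)                  # 2: index of the element containing "an"
sub("\\s", "_", x[1])                        # "apple_pie" (regular expression: first whitespace)
strsplit("a,b,c", ",", fixed = TRUE)[[1]]    # "a" "b" "c"
paste("x", 1:3, sep = "")                    # "x1" "x2" "x3"
toupper(x[3])                                # "CHERRY"
```

strsplit() returns a list with one element per input string, hence the `[[1]]`.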

5.others

Function                                 Description
length(x)                                length of x
seq(from,to,by)                          generate a sequence
rep(x,n)                                 repeat x n times
cut(x,n)                                 divide the continuous variable x into a factor with n levels
pretty(x,n)                              create about n "pretty" break points covering x
cat(...,file="myfile",append=FALSE)      concatenate the objects in ... and output them to the screen or to a file
apply(x,MARGIN,FUN,...)                  x: data; MARGIN: dimension index (1: rows, 2: columns); FUN: function to apply

User-defined function

> myfunction<-function(arg1,arg2,...){
      statements
      return(object)
  }
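The helper functions and the function template can be put together in a short sketch. The function name `mystats` and its `robust` argument are hypothetical, chosen only for illustration.

```r
s    <- seq(1, 10, by = 3)          # 1 4 7 10
r    <- rep(c("a", "b"), 2)         # "a" "b" "a" "b"
bins <- cut(c(1, 5, 9), 3)          # factor with 3 levels

m <- matrix(1:6, nrow = 2)
row_sums <- apply(m, 1, sum)        # MARGIN = 1: one result per row -> 9 12
col_max  <- apply(m, 2, max)        # MARGIN = 2: one result per column -> 2 4 6

# Hypothetical user-defined function: center and spread of a numeric vector
mystats <- function(x, robust = FALSE) {
  if (robust) {
    center <- median(x)
    spread <- mad(x)
  } else {
    center <- mean(x)
    spread <- sd(x)
  }
  return(list(center = center, spread = spread))
}

res <- mystats(c(1, 2, 3, 4, 100), robust = TRUE)
res$center                          # 3: the median ignores the outlier 100
```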

[VIII]Control Flow

Description              Function
Repeat and Loop          for (var in seq) statement
                         while (cond) statement
Conditional Execution    if (cond) statement
                         if (cond) statement1 else statement2
                         ifelse(cond,statement1,statement2)
                         switch(expr,...)
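Each construct in the table can be seen in a few lines. A minimal sketch with made-up values; note that ifelse() is vectorized while if/else tests a single condition.

```r
total <- 0
for (i in 1:5) total <- total + i           # 1+2+3+4+5 = 15

n <- 0
while (2^n < 100) n <- n + 1                # stops at n = 7, since 2^7 = 128

if (total > 10) size <- "large" else size <- "small"

score <- c(45, 80, 62)
grade <- ifelse(score >= 60, "pass", "fail")   # "fail" "pass" "pass"

feeling <- "sad"
msg <- switch(feeling,
              happy = "glad to hear it",
              sad   = "cheer up",
              "unknown")                       # unnamed last value is the fallback
```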

[IX]Aggregate and Reshape

Function                           Description
t(x)                               transpose
aggregate(x,by,FUN)                x: data; by: a list of grouping variables; FUN: function
melt(x,id=)                        reshape2 package; melt the data; id: identifier variables
dcast(md,formula,fun.aggregate)    reshape2 package; md: melted data; formula: rowvar1+rowvar2~colvar1+colvar2; fun.aggregate: aggregate function
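A minimal sketch of aggregating and reshaping with a hypothetical data frame `df`. The reshape2 part only runs if that package happens to be installed; `id.vars` is the full name of the identifier argument of melt().

```r
# Hypothetical toy data (not from the original article)
df <- data.frame(group = c("a", "a", "b", "b"),
                 time  = c(1, 2, 1, 2),
                 value = c(10, 20, 30, 50))

agg <- aggregate(df["value"], by = list(group = df$group), FUN = mean)   # a: 15, b: 40
tr  <- t(df[c("time", "value")])                                         # 2 x 4 matrix

# Optional: melt to long form, then cast to one column per time point
if (requireNamespace("reshape2", quietly = TRUE)) {
  md   <- reshape2::melt(df, id.vars = c("group", "time"))
  wide <- reshape2::dcast(md, group ~ time, fun.aggregate = mean)
}
```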

[X]component analysis

[Figure: analysis steps for principal component analysis / exploratory factor analysis]

principal component analysis

1.determine the number of principal components

> library(psych)
> fa.parallel(Harman23.cor$cov,n.obs=302,fa="pc",n.iter=100,
                  show.legend=FALSE,main="Scree plot with parallel analysis")
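As a quick cross-check without the psych package, the eigenvalues of the correlation matrix can be inspected directly. This is a swapped-in base-R sketch of the eigenvalue-greater-than-1 rule of thumb, not the parallel analysis that fa.parallel() performs; it uses the built-in USJudgeRatings data that appears in the next step.

```r
# Eigenvalues of the correlation matrix (first column, CONT, is dropped as in the article)
ratings <- USJudgeRatings[, -1]
ev <- eigen(cor(ratings))$values

n_keep <- sum(ev > 1)        # rule of thumb: keep components with eigenvalue > 1
prop   <- ev / sum(ev)       # proportion of variance explained by each component
round(prop[1], 2)            # share of the first component
```

The eigenvalues of a correlation matrix sum to the number of variables, so `prop` can be read directly as explained variance.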

2.extracting the main component

> pc<-principal(r=USJudgeRatings[,-1],nfactors=1,rotate=,scores=)        #r:correlation matrix or raw data;rotate:rotation method(default:"varimax");scores:whether to compute component scores

3.principal component rotation

> rc<-principal(Harman23.cor$cov,nfactors=2,rotate="varimax")

4.get the scoring coefficients of the principal components

> round(unclass(rc$weights),2)

exploratory factor analysis

1.determine the number of common factors

> library(psych)
> covariances<-ability.cov$cov
> correlations<-cov2cor(covariances)
> fa.parallel(correlations,n.obs=112,fa="both",n.iter=100,
                   main="Scree plots with parallel analysis")

2.extracting common factor

> fa<-fa(correlations,nfactors=2,rotate="none",fm="pa")

3.factor rotation

> fa.varimax<-fa(correlations,nfactors=2,rotate="varimax",fm="pa")        #orthogonal
> fa.promax<-fa(correlations,nfactors=2,rotate="promax",fm="pa")        #oblique
> factor.plot(fa.promax,labels=rownames(fa.promax$loadings))
> fa.diagram(fa.promax,simple=FALSE)

4.factor score

> fa.promax$weights
