Visual Exploration of Used-Car Sales Data in R: Preprocessing, Smoothed Density Plots, and Geospatial Visualization (Part 3)

Visual Exploration of Used-Car Sales Data in R: Preprocessing, Smoothed Density Plots, and Geospatial Visualization (Part 2): https://developer.aliyun.com/article/1491735


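Question #13: Is there a relationship between odometer reading and the age of the car? Between odometer reading and price? Interpret the results. Are odometer reading and age related?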


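We should really spend some time cleaning the odometer readings. For example, the largest reading, 1234567890, is just filler from an advertisement. To keep things simple, though, we note that the 99th percentile of the odometer readings is about 2.6 × 10^5, so trimming the data at 500,000 retains almost the entire distribution.

The bulk of the data does trend upward with age. Note, however, the shaded region between roughly 5 and 20 years of age where the odometer readings are low.

quantile(vposts$odometer, probs = 0.99, na.rm = TRUE)

As the smoothed scatter plot below shows, odometer reading and price are generally negatively correlated. Note, though, that some very expensive cars have low odometer readings; many of these are antique cars.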

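idx = ( vposts$odometer < 500000 & vposts$price >= 500 & vposts$price <= 100000 &
          !is.na(vposts$odometer) & !is.na(vposts$price) )
smoothScatter(x = vposts$odometer[idx], y = vposts$price[idx])  # call is truncated in the source; arguments inferred from the text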
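As a rough sketch of the trimming step, the snippet below builds a small synthetic data frame, computes the 99th percentile of the odometer, applies the same style of filter, and checks the direction of the relationship with a rank correlation. The data frame `toy` and all of its values are invented stand-ins for `vposts`, not part of the original analysis.

```r
set.seed(1)
n <- 2000
# Synthetic stand-in for vposts; these numbers are invented for illustration.
toy <- data.frame(odometer = round(runif(n, 0, 3e5)))
toy$price <- pmax(500, 30000 - 0.08 * toy$odometer + rnorm(n, sd = 3000))
toy$odometer[1] <- 1234567890      # an implausible "advertising" value
toy$price[2] <- NA                 # a missing price

# The 99th percentile is barely affected by the single wild value.
q99 <- quantile(toy$odometer, probs = 0.99, na.rm = TRUE)

# Same style of filter as in the text: trim, then drop missing values.
idx <- !is.na(toy$odometer) & !is.na(toy$price) &
  toy$odometer < 500000 & toy$price >= 500 & toy$price <= 100000

# A rank correlation summarizes the direction of the relationship.
rho <- cor(toy$odometer[idx], toy$price[idx], method = "spearman")
print(q99)
print(rho)
```

A percentile-based look at the tail is a cheap way to pick a trimming point without inspecting every outlier by hand.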

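Question #14: Identify the “old” cars. Which manufacturers made them? What is the price distribution for these cars?


From the first smoothScatter plot below, we take cars more than 35 years old to be “old” cars.

As the table below shows, Chevrolet and Ford account for more than 50% of the “old” cars. In particular, because Japanese cars were not imported in large numbers until the oil crisis of the 1970s, there are few Japanese “old” cars, and our cutoff for “old” cars is around 1970.

Comparing the price distribution of “old” cars with that of all cars, the “old” cars appear to have more of their density at higher prices.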

idx = ( vposts$price >= 500 & vposts$price <= 100000 &
          !is.na(vposts$price) & !is.na(vposts$age) )
smoothScatter(x = vposts$age[idx], y = vposts$price[idx])

# Look at the manufacturers and the price distribution of the "old" cars.
idx = ( vposts$age >= 35 & !is.na(vposts$age) )
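One possible way to tabulate the makers of the “old” cars and compare their prices with the rest is sketched below. The column names `maker`, `age`, and `price` and every row of `toy` are assumptions made up for illustration; the 35-year cutoff is the one used in the text.

```r
# Hypothetical miniature of vposts; the rows are invented.
toy <- data.frame(
  maker = c("chevrolet", "ford", "toyota", "ford",
            "chevrolet", "honda", "ford", "dodge"),
  age   = c(45, 50, 10, 38, 60, 5, 12, NA),
  price = c(18000, 25000, 9000, 12000, 40000, 15000, 7000, 3000),
  stringsAsFactors = FALSE
)

# Flag cars at least 35 years old, treating a missing age as "not old".
old <- !is.na(toy$age) & toy$age >= 35

# Share of each maker among the old cars, largest first.
maker_share <- sort(prop.table(table(toy$maker[old])), decreasing = TRUE)
print(round(maker_share, 2))

# Compare price quartiles for old cars against all cars.
q_old <- quantile(toy$price[old], probs = c(0.25, 0.5, 0.75))
q_all <- quantile(toy$price, probs = c(0.25, 0.5, 0.75))
print(q_old)
print(q_all)
```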

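Question #15: I have omitted one important variable from this data set. What do you think it is? Can we derive it from the other variables?


When you search for a car on a website, you usually specify the year, make, and model, in that order. Year and make (i.e., manufacturer) are already separate variables in the data set. The header variable, however, typically lists the year, make, and model together, so if we can parse each header string to extract the model, we can derive our own separate model variable.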

head(vposts$header, 20)
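A first attempt at pulling the model out of the header might look like the sketch below. It assumes headers follow a "year make model ..." layout with a single-word maker, which will not hold for every post (for example, two-word makers); the sample `header` strings are invented.

```r
# Hypothetical headers in the "year make model ..." layout described above.
header <- c("2008 honda civic lx", "1999 Ford F-150 XLT 4x4",
            "toyota camry 2012", "2005 bmw")

# A leading 4-digit year, one maker word, then capture the next word.
pattern <- "^\\s*(19|20)[0-9]{2}\\s+[[:alnum:]-]+\\s+([[:alnum:]-]+).*$"

# Headers that do not fit the layout are left as NA rather than guessed.
model <- ifelse(grepl(pattern, header),
                tolower(sub(pattern, "\\2", header)),
                NA_character_)
print(data.frame(header, model))
```

Leaving non-matching headers as NA makes it easy to count how many posts still need a second extraction strategy.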

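Question #16: Show how condition and odometer are related, how condition and price are related, and how condition and age of the car are related. Briefly explain your findings.


conditions = levels(vposts$condition)
conditions = sprintf('"%s",\n', conditions)
cat(conditions)

# We will build the new categories on top of the most common existing ones.
sort(table(vposts$condition))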

vposts$
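The recoding step itself does not survive in the text. A minimal sketch of one way to collapse free-text condition labels into a few categories with a named lookup vector follows; the raw labels, the mapping, and the level order are all assumptions, and only the name `new_cond` is taken from the later code.

```r
# Invented condition labels of the messy, free-text kind; not from the data.
condition <- factor(c("excellent", "like new", "good", "fair", "salvage",
                      "very good", "mint", "new", "parts", NA))

# Hypothetical mapping from raw labels to a smaller set of categories.
lookup <- c("new" = "new", "like new" = "like new", "mint" = "like new",
            "excellent" = "excellent", "very good" = "good", "good" = "good",
            "fair" = "fair", "salvage" = "salvage", "parts" = "salvage")

# Index the lookup by label; unmatched labels and NA become NA.
new_cond <- factor(unname(lookup[as.character(condition)]),
                   levels = c("new", "like new", "excellent",
                              "good", "fair", "salvage"))
print(table(new_cond, useNA = "ifany"))
```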


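sane_odo = subset(vposts, odometer < 5e5)  # cutoff is truncated in the source; 5e5 is assumed from the Question #13 trim
odo_by_cond = split(sane_odo$odometer, sane_odo$new_cond)  # reconstructed by analogy with price_by_cond
boxplot(odo_by_cond, ylab = "Miles")

# Draw a second plot to show the distribution more clearly.
boxplot(odo_by_cond, ylab = "Miles")  # the remaining arguments are truncated in the source

We can now see that the highest odometer readings appear in the "fair" and "good" conditions, which is somewhat surprising. People may overstate the condition of a high-mileage car to make it sound more attractive. The widest spread of odometer readings is in the "salvage" category, which makes sense: a salvage car may be very old, or it may be a newer car that has been damaged.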

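sane_price = subset(vposts, price < 2e5)
price_by_cond = split(sane_price$price, sane_price$new_cond)
boxplot(price_by_cond)  # the col = "..." argument is truncated in the source

age_by_cond = split(sane_age$age, sane_age$new_cond)  # sane_age is not defined in the code that survives
boxplot(age_by_cond)  # the col = "..." argument is truncated in the source

Neither the price nor the age distribution shows anything very surprising. Price and condition appear to be directly related: “like new” cars are sometimes listed at very high prices, which is uncommon for cars in worse condition. Age and condition are inversely related: older cars tend to be in worse condition.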
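The split-then-boxplot pattern used for these comparisons can be tried on synthetic data, as in the sketch below. The condition labels, the group means in `base`, and the noise level are invented; `plot = FALSE` is used only so the example runs without a graphics device.

```r
set.seed(2)
# Synthetic example: odometer depends loosely on an invented condition label.
cond <- factor(sample(c("like new", "good", "fair"), 300, replace = TRUE),
               levels = c("like new", "good", "fair"))
base <- c("like new" = 30000, "good" = 90000, "fair" = 140000)
odometer <- pmax(0, unname(base[as.character(cond)]) + rnorm(300, sd = 20000))

# split() returns one numeric vector per condition level.
odo_by_cond <- split(odometer, cond)
medians <- sapply(odo_by_cond, median)
print(round(medians))

# boxplot() accepts the list directly; plot = FALSE returns the statistics only.
stats <- boxplot(odo_by_cond, plot = FALSE)
print(stats$names)
```

Printing the group medians alongside the plot gives a numeric check on what the boxes appear to show.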


Self-Test Questions


Question #1 How many observations are there in the data set?

Question #2 What are the names of the variables, and what is the class of each variable?

Question #3 What is the average price of all the vehicles? The median price? The deciles? Display these on a plot of the distribution of vehicle prices.

Question #4 What are the different categories of vehicles, i.e., the type variable/column? What is the proportion for each category?

Question #5 Display the relationship between fuel type and vehicle type. Does this depend on transmission type?

Question #6 How many different cities are represented in the dataset?

Question #7 Visually display how the number/proportion of “for sale by owner” and “for sale by dealer” posts varies across cities.

Question #8 What is the largest price for a vehicle in this data set? Examine this and fix the value. Now examine the new highest value for price.

Question #9 What are the three most common makes of cars in each city for “sale by owner” and for “sale by dealer”? Are they similar or quite different?

Question #10 Visually compare the distribution of the age of cars for different cities and for “sale by owner” and “sale by dealer”. Provide an interpretation of the plots, i.e., what are the key conclusions and insights?

Question #11 Plot the locations of the posts on a map. What do you notice?

Question #12 Summarize the distribution of fuel type, drive, transmission, and vehicle type. Find a good way to display this information.

Question #13 Plot odometer reading against age of car. Is there a relationship? Similarly, plot odometer reading against price. Interpret the result(s). Are odometer reading and age of car related?

Question #14 Identify the “old” cars. What manufacturers made these? What is the price distribution for these?

Question #15 I have omitted one important variable in this data set. What do you think it is? Can we derive this from the other variables? If so, sketch possible ideas as to how we would compute this variable.

Question #16 Display how condition and odometer are related. Also how condition and price are related. And condition and age of the car. Provide a brief interpretation of what you find.

The data set consists of posts by people selling vehicles. The important variable that I did not give you is the model of the vehicle being sold, which is very important for determining the price. For example, a new Volvo V60 has a suggested price of $35,000 but a new S60 has a price of $43,000, and the new Toyota Yaris and Avalon are $15,000 and $32,000 respectively, a factor of 2. So we need to determine the model of the vehicle.

We also want to verify some of the data and fix it if possible. And we also want to be able to programmatically extract other information from the posts if it is present.


  1. Extract the price being asked for the vehicle from the body column, if it is present, and check whether it agrees with the actual price in the price column.

  2. Extract a Vehicle Identification Number (VIN) from the body, if it is present. We could use this both to identify details of the car (year it was built, type and model of the car, safety features, body style, engine type, etc.) and to get historical information about the particular car. Add the VIN, if available, to the data frame. How many postings include the VIN?

  3. Extract phone numbers from the body column, and again add these as a new column. How many posts include a phone number?

  4. Extract email addresses from the body column, and again add these as a new column. How many posts include an email address?

  5. Find the year in the description or body and compare it with the value in the year column.

  6. Determine the model of the car, e.g., S60, Boxster, Cayman, 911, Jetta. This includes correcting misspelled or abbreviated model names. You may find the agrep() function useful. You should also use statistics, i.e., counts, to see how often a word occurs in other posts, whether such a spelling is reasonable, and whether this model name has often been seen with that maker.
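The extraction tasks in the list above could start from patterns like the ones sketched here. The `body` strings, the VIN, the phone numbers, and the email address are all invented, the helper `first_match()` is hypothetical, and the patterns are deliberately simple first drafts that real posts will break.

```r
# Invented posting bodies; none of this text comes from the data set.
body <- c("Asking $4,500 obo. VIN 1ABCD23EFGH456789. Call (530) 555-0123.",
          "Clean title, email seller99@example.com for details",
          "2004 Jetta, runs great, $3200 firm, 530-555-0199")

# Hypothetical helper: first match of a pattern in each string, NA if absent.
first_match <- function(pattern, x) {
  m <- regexpr(pattern, x, perl = TRUE)
  out <- rep(NA_character_, length(x))
  out[m > 0] <- regmatches(x, m)
  out
}

# A dollar sign followed by digits and optional thousands separators.
price_txt <- first_match("\\$\\s?[0-9][0-9,]*", body)
price <- as.numeric(gsub("[$,[:space:]]", "", price_txt))

# A VIN has 17 characters and never uses the letters I, O, or Q.
vin <- first_match("\\b[A-HJ-NPR-Z0-9]{17}\\b", body)

# US-style phone numbers with an optional parenthesized area code.
phone <- first_match("\\(?[0-9]{3}\\)?[-. ]?[0-9]{3}[-. ][0-9]{4}", body)

# A loose email pattern: local part, @, domain, dot, alphabetic suffix.
email <- first_match("[[:alnum:]._%+-]+@[[:alnum:].-]+\\.[[:alpha:]]{2,}", body)

print(data.frame(price, vin, phone, email))
```

Counting the non-missing entries, for example with `sum(!is.na(vin))`, answers the "how many posts include..." parts of the questions.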

When doing these questions, you will very likely have to iterate: develop a regular expression, see what results it gives you, and adapt it. Furthermore, you will probably have to use two or more strategies when looking for a particular piece of information. This is expected; the data are not nicely and regularly formatted.

Modeling

Pick two models of cars, each from a different car maker, e.g., Toyota or Volvo. For each of these, separately explore the relationship between the price being asked for the vehicle, the number of miles (odometer), the age of the car, and its condition. Does location (city) have an effect on this? Use a statistical model to suggest the appropriate price for such a car given its age, mileage, and condition. You might consider a linear model, k-nearest neighbors, or a regression tree.

You need to describe why the method you chose is appropriate, what assumptions it requires and how reasonable they are, and how well it performs and how you determined this. Would you use it if you were buying or selling this type of car?
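As one illustration of the linear-model option, the sketch below fits `lm()` to synthetic listings and estimates out-of-sample error on a held-out quarter of the rows. The data frame `cars`, its coefficients, and the noise level are invented, so the numbers say nothing about real prices; a real analysis would also check residuals and compare against the other suggested methods.

```r
set.seed(3)
n <- 400
# Synthetic listings for one hypothetical model; all values are invented.
cars <- data.frame(age = runif(n, 0, 15), odometer = runif(n, 0, 2e5))
cars$condition <- factor(sample(c("fair", "good", "excellent"), n,
                                replace = TRUE))
cars$price <- 25000 - 900 * cars$age - 0.05 * cars$odometer +
  1500 * (cars$condition == "excellent") + rnorm(n, sd = 1500)

# Hold out a quarter of the rows to estimate out-of-sample error.
test_rows <- sample(n, n / 4)
fit <- lm(price ~ age + odometer + condition, data = cars[-test_rows, ])
pred <- predict(fit, newdata = cars[test_rows, ])
rmse <- sqrt(mean((cars$price[test_rows] - pred)^2))

print(round(coef(fit), 3))
print(rmse)

# Suggested price for a hypothetical 5-year-old car with 60,000 miles.
new_car <- data.frame(age = 5, odometer = 60000,
                      condition = factor("good",
                                         levels = levels(cars$condition)))
suggested <- predict(fit, newdata = new_car, interval = "prediction")
print(suggested)
```

The prediction interval is worth reporting alongside the point estimate, since a buyer or seller cares about the plausible range as much as the center.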

Useful Functions

strsplit(), grep(), grepl(), gregexpr(), sub(), gsub().

agrep(), adist(), nchar(), substring()

The stringi and stringr packages.
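A small sketch of how adist() and agrep() might be used to correct misspelled model names follows. The vectors `known` and `seen` are invented, and the threshold of two edits is an arbitrary assumption that would need tuning against counts from the real posts.

```r
# Hypothetical list of known model names and some misspelled candidates.
known <- c("jetta", "boxster", "cayman", "camry", "corolla")
seen <- c("jeta", "boxter", "camery", "corola", "mustang")

# adist() gives the edit distance from each candidate to each known name.
d <- adist(seen, known)
best <- apply(d, 1, which.min)

# Accept the nearest known name only when it is at most two edits away.
nearest <- d[cbind(seq_along(seen), best)]
corrected <- ifelse(nearest <= 2, known[best], NA)
print(data.frame(seen, corrected))

# agrep() does approximate matching of a single pattern against a vector.
hits <- agrep("boxter", known, max.distance = 1, value = TRUE)
print(hits)
```

Names with no close match, such as the last candidate here, stay NA so they can be reviewed as possible new models rather than forced onto a wrong one.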

