Visual exploration of used-car sales data in R: preprocessing, smoothed density plots, and geospatial visualization (Part 2): https://developer.aliyun.com/article/1491735
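Question #13 Is there a relationship between odometer reading and age of the car? Between odometer reading and price? Interpret the results. Are odometer reading and age of the car related?
We should spend some time cleaning the odometer readings. For example, the maximum odometer reading, 1234567890, is just an advertisement rather than a real reading. For simplicity, though, we note that the 99th percentile of the odometer readings is 2.6 × 10^5, so we trim the data at 500,000 to keep almost the entire distribution.
The vast majority of the data does seem to trend upward. Note, however, the shaded region between roughly 5 and 20 years of age where the odometer readings are lower.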
quantile(vposts$odometer, probs = 0.99, na.rm = TRUE)
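As the smoothed scatter plot below shows, odometer reading and price are generally negatively related. Note, however, that some very expensive cars have low odometer readings; many of these are antique cars.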
idx = (vposts$odometer < 500000 & vposts$price >= 500 & vposts$price <= 100000 & !is.na(vposts$odometer) & !is.na(vposts$price))
smoothScatter(x = vposts$odometer[idx], y = vposts$price[idx])
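The trimming and the direction of both relationships can be checked numerically as well as visually. The following is a minimal sketch on synthetic data: the data frame `toy` and all of its values are made up for illustration and merely stand in for `vposts`; only the 500,000 odometer cutoff and the 500 to 100,000 price range come from the text above.

```r
# Synthetic stand-in for vposts (hypothetical values, for illustration only)
set.seed(1)
n <- 1000
toy <- data.frame(age = sample(0:40, n, replace = TRUE))
toy$odometer <- pmax(0, toy$age * 10000 + rnorm(n, sd = 20000))
toy$price <- pmax(500, 30000 - 0.08 * toy$odometer + rnorm(n, sd = 3000))
toy$odometer[1] <- 1234567890   # an implausible reading like the one in the text

# Inspect the upper tail before choosing a trimming threshold
q99 <- quantile(toy$odometer, probs = 0.99, na.rm = TRUE)

# Keep rows with plausible odometer and price values
keep <- !is.na(toy$odometer) & !is.na(toy$price) &
  toy$odometer < 500000 & toy$price >= 500 & toy$price <= 100000
trimmed <- toy[keep, ]

# Rank correlation is less sensitive to any remaining extreme values
rho_age   <- cor(trimmed$odometer, trimmed$age,   method = "spearman")
rho_price <- cor(trimmed$odometer, trimmed$price, method = "spearman")
```

On the real data, `smoothScatter()` on the trimmed rows gives the visual counterpart of these two coefficients.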
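Question #14 Identify the “old” cars. Which manufacturers made them? What is their price distribution?
Based on the first smoothScatter plot below, we treat cars more than 35 years old as “old” cars.
The table below shows that Chevrolet and Ford account for more than 50% of the “old” cars. In particular, because Japanese cars were not imported on a large scale until the oil crisis of the 1970s, and our cutoff for “old” cars is around 1970, there are few Japanese “old” cars.
Comparing the price distribution of “old” cars with that of all cars, the “old” cars seem to have more density at higher prices.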
idx = (vposts$price >= 500 & vposts$price <= 100000 &
!is.na(vposts$price) & !is.na(vposts$age))
smoothScatter(x = vposts$age[idx], y = vposts$price[idx])
# Look at the manufacturers and the price distribution of the old cars
idx = (vposts$age >= 35 & !is.na(vposts$age))
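Once the “old” cars are flagged, the maker shares and a price comparison follow from `table()` and `quantile()`. This is a rough sketch on a made-up data frame: `toy`, the column name `maker`, and every value in it are assumptions for illustration, not taken from the dataset.

```r
# Hypothetical toy data standing in for vposts
toy <- data.frame(
  maker = c("chevrolet", "ford", "chevrolet", "toyota",
            "ford", "honda", "chevrolet", "dodge"),
  age   = c(45, 50, 38, 5, 36, 10, 60, 3),
  price = c(25000, 18000, 30000, 12000, 9000, 8000, 42000, 15000),
  stringsAsFactors = FALSE
)

# Flag cars at or above the 35-year cutoff, ignoring missing ages
old <- !is.na(toy$age) & toy$age >= 35

# Share of each maker among old cars, largest first
maker_share <- sort(prop.table(table(toy$maker[old])), decreasing = TRUE)

# Compare price quartiles of old cars against all cars
price_summary <- rbind(
  old = quantile(toy$price[old], probs = c(0.25, 0.5, 0.75)),
  all = quantile(toy$price,      probs = c(0.25, 0.5, 0.75))
)
```

For the full dataset, overlaying `density()` estimates for the two groups would show the same comparison as a plot.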
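Question #15 I omitted one important variable from this dataset. What do you think it is? Can we derive it from the other variables?
When you search for a car on a website, a listing usually gives the year, make, and model, in that order. Year and make (i.e., manufacturer) are already separate variables in the dataset. Note, however, that the variable called header contains the year, make, and model. So if we can parse the text string of each header to extract the model, we can derive our own separate variable for model.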
head(vposts$header, 20)
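One possible way to pull the model out of such a header is to anchor a regular expression on the year and the already-known maker, then capture the token that follows. The sketch below assumes headers in "year make model" order; the example strings and the helper `extract_model()` are hypothetical, and real headers will need more than one pattern.

```r
# Hypothetical headers in "year make model" order (not taken from the dataset)
header <- c("2008 Toyota Camry LE", "1999 honda civic",
            "2012 Volvo S60 T5", "Ford F-150 2004")
maker  <- c("toyota", "honda", "volvo", "ford")

# Hypothetical helper: capture the token following "<year> <maker>"
extract_model <- function(header, maker) {
  h <- tolower(header)
  # Four-digit year, the known maker, then the next token is the candidate model
  pattern <- sprintf("^\\s*(19|20)[0-9]{2}\\s+%s\\s+([a-z0-9-]+).*$", maker)
  out <- rep(NA_character_, length(h))
  for (i in seq_along(h)) {
    if (grepl(pattern[i], h[i])) out[i] <- sub(pattern[i], "\\2", h[i])
  }
  out
}

model <- extract_model(header, maker)
```

Headers that do not follow the assumed order, such as the last example, are left as `NA` so they can be handled by a second strategy instead of being guessed.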
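Question #16 Show how condition and odometer are related. Also show how condition and price are related, and condition and age of the car. Briefly interpret your findings.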
conditions = levels(vposts$condition)
conditions = sprintf('"%s",\n', conditions)
cat(conditions)
# We will build new categories based on the most common existing categories.
sort(table(vposts$condition))
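Collapsing many free-text condition labels into a few categories can be done with a named lookup vector. The following is only a sketch of that idea: the raw labels, the mapping in `lookup`, and the order of the resulting levels are all assumptions made for illustration and are not the recoding used in the original analysis.

```r
# Hypothetical raw condition labels; the real ones come from levels(vposts$condition)
condition <- c("excellent", "good", "like new", "fair", "salvage", "very good",
               "mint", "needs work", "good", NA)

# Assumed lookup from raw label to a smaller set of categories (illustrative only)
lookup <- c(
  "excellent" = "excellent", "very good" = "excellent", "mint" = "like new",
  "like new" = "like new", "good" = "good", "fair" = "fair",
  "salvage" = "salvage", "needs work" = "salvage"
)

# Unmatched or missing labels become NA rather than being guessed
new_cond <- factor(unname(lookup[condition]),
                   levels = c("like new", "excellent", "good", "fair", "salvage"))
table(new_cond, useNA = "ifany")
```

Keeping the mapping in one named vector makes it easy to review and extend as new spellings turn up in the frequency table.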
vposts$
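sane_odo = subset(vposts, odometer < 500000)  # cutoff assumed; the source value is garbled
odo_distribution = split(sane_odo$odometer, sane_odo$new_cond)
boxplot(odo_distribution, ylab = "Miles")
# Make a second plot to better show the distribution.
boxplot(odo_distribution, outline = FALSE, ylab = "Miles")  # outline = FALSE is assumed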
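# We can now see that the highest odometer readings appear in the "fair" and "good" conditions, which is somewhat surprising. People may overstate the condition when the odometer reading is high, to make the car sound more appealing. The condition with the widest spread is "salvage", which makes sense because a salvage car may be very old or may be a newer car that was damaged.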
sane_price = subset(vposts, price < 2e5)
price_by_cond = split(sane_price$price, sane_price$new_cond)
boxplot(price_by_cond)
age_by_cond = split(sane_age$age, sane_age$new_cond)
boxplot(age_by_cond)
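The price and age distributions show nothing too surprising. Price and condition appear to be directly related: “like new” cars are sometimes offered at extremely high prices, which is uncommon for cars in worse condition. Age and condition are inversely related: older cars tend to be in worse condition.
Self-test questions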
Question #1 How many observations are there in the data set?
Question #2 What are the names of the variables, and what is the class of each variable?
Question #3 What is the average price of all the vehicles? The median price? The deciles? Display these on a plot of the distribution of vehicle prices.
Question #4 What are the different categories of vehicles, i.e. the type variable/column? What is the proportion for each category?
Question #5 Display the relationship between fuel type and vehicle type. Does this depend on transmission type?
Question #6 How many different cities are represented in the dataset?
Question #7 Visually display how the number/proportion of “for sale by owner” and “for sale by dealer” varies across cities.
Question #8 What is the largest price for a vehicle in this data set? Examine this and fix the value. Now examine the new highest value for price.
Question #9 What are the three most common makes of cars in each city for “sale by owner” and for “sale by dealer”? Are they similar or quite different?
Question #10 Visually compare the distribution of the age of cars for different cities and for “sale by owner” and “sale by dealer”. Provide an interpretation of the plots, i.e., what are the key conclusions and insights?
Question #11 Plot the locations of the posts on a map. What do you notice?
Question #12 Summarize the distribution of fuel type, drive, transmission, and vehicle type. Find a good way to display this information.
Question #13 Plot odometer reading against age of car. Is there a relationship? Similarly, plot odometer reading against price. Interpret the result(s). Are odometer reading and age of car related?
Question #14 Identify the “old” cars. What manufacturers made these? What is the price distribution for these?
Question #15 I have omitted one important variable in this data set. What do you think it is? Can we derive this from the other variables? If so, sketch possible ideas as to how we would compute this variable.
Question #16 Display how condition and odometer are related. Also how condition and price are related. And condition and age of the car. Provide a brief interpretation of what you find.
posts by people selling vehicles. The important variable that I did not give you was the model/type of the vehicle being sold. This is very important for determining the price of the vehicle. For example, a new Volvo V60 has a suggested price of $35,000, but a new S60 has a price of $43,000, and the new Toyota Yaris and Avalon are $15,000 and $32,000 respectively - a factor of 2. So we need to determine the model of the vehicle.
We also want to verify some of the data and fix it if possible. And we also want to be able to programmatically extract other information from the posts if it is present.
- Extract the price being asked for the vehicle from the body column, if it is present, and check if it agrees with the actual price in the price column.
- Extract a Vehicle Identification Number (VIN) from the body, if it is present. We could use this to both identify details of the car (year it was built, type and model of the car, safety features, body style, engine type, etc.) and also use it to get historical information about the particular car. Add the VIN, if available, to the data frame. How many postings include the VIN?
- Extract phone numbers from the body column, and again add these as a new column. How many posts include a phone number?
- Extract email addresses from the body column, and again add these as a new column. How many posts include an email address?
- Find the year in the description or body and compare it with the value in the year column.
- Determine the model of the car, e.g., S60, Boxster, Cayman, 911, Jetta. This includes correcting misspelled or abbreviated model names. You may find the agrep() function useful. You should also use statistics, i.e., counts, to see how often a word occurs in other posts, whether such a spelling is reasonable, and whether this model name has often been seen with that maker.
When doing these questions, you will very likely have to iterate by developing a regular expression, seeing what results it gives you, and adapting it. Furthermore, you will probably have to use two or more strategies when looking for a particular piece of information. This is expected; the data are not nice and regularly formatted.
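As a starting point for the extraction tasks in the list above, here is a minimal sketch using `regexpr()` and `regmatches()`. The posting texts are invented, the helper `first_match()` is hypothetical, and each pattern is a deliberately simple first attempt that would need refining against real posts.

```r
# Hypothetical posting bodies (not taken from the dataset)
body <- c(
  "Clean title, asking $4,500 obo. VIN 1HGCM82633A004352. Call 530-555-0147.",
  "Runs great! Email me at seller@example.com, price is $12000 firm.",
  "No contact details here."
)

# Hypothetical helper: first match of `pattern` in each string, or NA if none
first_match <- function(pattern, x) {
  m <- regexpr(pattern, x, perl = TRUE)
  out <- rep(NA_character_, length(x))
  out[m > 0] <- regmatches(x, m)
  out
}

# A dollar sign followed by digits and optional thousands separators
price_txt <- first_match("\\$\\s?[0-9][0-9,]*", body)
price_num <- as.numeric(gsub("[$,[:space:]]", "", price_txt))

# 17 characters from the VIN alphabet (the letters I, O, and Q are not used)
vin <- first_match("\\b[A-HJ-NPR-Z0-9]{17}\\b", body)

# US-style 10-digit phone numbers with optional separators
phone <- first_match("\\(?[0-9]{3}\\)?[-. ]?[0-9]{3}[-. ]?[0-9]{4}", body)

# A simple email pattern; it will miss obfuscated addresses such as "name at site dot com"
email <- first_match("[[:alnum:]._%+-]+@[[:alnum:].-]+\\.[[:alpha:]]{2,}", body)
```

The extracted vectors can be added as new columns, and `sum(!is.na(vin))` style counts answer the "how many posts include ..." questions. `price_num` can then be compared with the price column.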
Modeling
Pick two models of cars, each for a different car maker, e.g., Toyota or Volvo. For each of these, separately explore the relationship between the price being asked for the vehicle, the number of miles (odometer), age of the car and condition. Does location (city) have an effect on this? Use a statistical model to be able to suggest the appropriate price for such a car given its age, mileage, and condition. You might consider a linear model, k-nearest neighbors, or a regression tree.
You need to describe why the method you chose is appropriate, what assumptions are needed and how reasonable they are, and how well it performs and how you determined this. Would you use it if you were buying or selling this type of car?
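To make the modeling task concrete, the sketch below fits a linear model to synthetic listings and judges it on held-out data. Everything here is an assumption for illustration: the data frame `cars`, the coefficients used to simulate prices, and the choice of `lm()` over k-nearest neighbors or a regression tree. It shows one way to evaluate a model, not a recommended answer.

```r
# Synthetic listings for a single hypothetical model (all values are made up)
set.seed(42)
n <- 300
cars <- data.frame(
  age = runif(n, 0, 20),
  condition = factor(sample(c("fair", "good", "excellent"), n, replace = TRUE),
                     levels = c("fair", "good", "excellent"))
)
cars$odometer <- pmax(0, cars$age * 12000 + rnorm(n, sd = 15000))
cars$price <- 25000 - 700 * cars$age - 0.03 * cars$odometer +
  1500 * (as.integer(cars$condition) - 1) + rnorm(n, sd = 1500)

# Hold out a test set so the error is not estimated on the training data
test_idx <- sample(n, 60)
train <- cars[-test_idx, ]
test  <- cars[test_idx, ]

fit <- lm(price ~ age + odometer + condition, data = train)

# Root mean squared error on held-out listings versus a mean-only baseline
rmse <- function(obs, pred) sqrt(mean((obs - pred)^2))
rmse_lm   <- rmse(test$price, predict(fit, newdata = test))
rmse_base <- rmse(test$price, mean(train$price))

# Suggested price, with a prediction interval, for a hypothetical car
new_car <- data.frame(age = 5, odometer = 60000,
                      condition = factor("good", levels = levels(cars$condition)))
suggested <- predict(fit, newdata = new_car, interval = "prediction")
```

The linear model assumes additive, roughly linear effects and constant error variance; residual plots from `plot(fit)` are one way to check how reasonable that is on real listings.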
Useful Functions
strsplit(), grep(), grepl(), gregexpr(), sub(), gsub().
agrep(), adist(), nchar(), substring()
The stringi and stringr packages.
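As an illustration of how `adist()` can support the model-name correction task, the sketch below maps misspelled names to the nearest entry in a reference list. The lists `known` and `typed` and the threshold `max_dist` are hypothetical; in practice the reference list could be built from the most frequent model names for each maker.

```r
# Hypothetical reference list of model names and some misspelled inputs
known <- c("camry", "corolla", "civic", "accord", "jetta", "boxster")
typed <- c("camery", "corola", "jeta", "boxter", "xyz")

# Edit distance between every typed name and every known name
d <- adist(typed, known)

# Accept the nearest known name only if it is within max_dist edits
max_dist <- 2
nearest <- apply(d, 1, which.min)
corrected <- ifelse(d[cbind(seq_along(typed), nearest)] <= max_dist,
                    known[nearest], NA_character_)
```

Leaving distant strings as `NA` avoids forcing every unknown word onto some model; `agrep()` offers a similar approximate match when only a yes/no answer per pattern is needed.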