数据科学与统计学之时空建模-空间数据建模

简介: 时空建模-空间数据建模

空间数据建模

In real world datasets, particularly in climate and environmental problems, we often have measurements at a specified number of locations (in space and/or time), and will often need to make predictions of the output at unmeasured locations (for example, at the next several time steps, or for a spatial region).
We can approach the spatial problem by constructing a fine grid over the area in question and then predicting at each point on the grid, interpolating between the observed data points. If we have a set of explanatory/predictor variables at the locations of the measurements and for each point of the grid, then we could use a regression model to produce predictions. Our model should also provide a measure of uncertainty around any predictions, as we don’t observe the desired output at all possible locations.
In this course, we’ll be learning and applying methods for handling spatial and temporal measurement data. In this practical, we’ll consider a spatial dataset, and start by using methods that we already know, attempting to fit linear models to this spatial data, potentially identifying any issues with this approach and motivating the need to go beyond such models, as we will later in this course.

在现实世界的数据集中,特别是在气候和环境问题中,我们经常在指定数量的位置(空间或时间)进行测量,并且通常需要对未测量位置的输出进行预测。
我们可以通过在相关区域上构建精细网格,然后在网格上的每个点进行预测,在观察到的数据点之间进行插值来解决空间问题。如果我们在测量位置和网格的每个点都有一组解释或预测变量,那么我们可以使用回归模型来生成预测。我们的模型还应该提供任何预测的不确定性度量,因为我们没有在所有可能的位置观察到所需的输出。
在这个实践中,我们将考虑一个空间数据集,并从使用我们已知的方法开始,尝试将线性模型拟合到这个空间数据,识别这种方法的任何问题并激发超越这些模型的需要。

The Meuse dataset

We’re going to look at the meuse dataset, contained in the sp package. From its description (in ?meuse): “This data set gives locations and topsoil heavy metal concentrations, along with a number of soil and landscape variables at the observation locations, collected in a flood plain of the river Meuse, near the village of Stein (NL). Heavy metal concentrations are from composite samples of an area of approximately 15m x 15m.”
Let’s load in the data, and do some initial data analysis:

我们将使用sp包中包含的meuse数据集。在R中市容?meuse查看它的描述:“该数据集提供了位置和表土重金属浓度,以及观察地点的许多土壤和景观变量,收集在Stein村附近的默兹河洪泛区(荷兰)。 重金属浓度来自面积约为 15m x 15m 的复合样品。”让我们加载数据,并进行一些初步的数据分析:

# 加载包和数据
library(sp)
library(ggplot2)
data(meuse)
head(meuse)

image.png

meuse$x<- meuse$x/ 1000 # 整理坐标
meuse$y<- meuse$y/ 1000
summary(meuse)

image.png

We have (x, y) locations (in some non-standard coordinate system, we won’t worry about this), and want to model the soil concentration of various metals (cadmium, copper, lead, zinc) in $mg.kg^{-1}$. The dataset also contains some other variables which may be useful for prediction, such as elevation, and two measures of distance to the river (dist, dist.m). We’ll use the normalized version, dist, as this choice will come in useful later on. In all that follows, we’ll be attempting to model zinc concentration.
A sensible place to start is plotting the spatial locations of the points, to see the domain we’re working with (we’ll generally be using ggplot2, but everything here could be done similarly with base graphics):

我们有(x,y)位置,并且想要以$mg.kg^{-1}$为单位模拟各种金属(镉、铜、铅、锌)在土壤中的浓度.该数据集还包含一些可能对预测有用的其他变量,例如海拔和到河流的两个距离度量(dist、dist.m)。我们将使用规范化版本dist,因为这个选择稍后会派上用场。 在接下来的内容中,我们将尝试模拟锌浓度。
一个明智的开始是绘制空间点位置,以查看我们正在处理的域(我们通常使用 ggplot2,但这里的所有内容都可以使用基本图形来完成):

ggplot(data= meuse, aes(x= x, y= y))+
    geom_point()

image.png

This isn’t that exciting, so let’s plot the points with their size corresponding to distance from the river and zinc concentration:

这不是那么令人兴奋,所以接下来我们让点大小与与河流的距离和锌浓度相对应:

ggplot(data= meuse, aes(x= x, y= y, size= dist))+
    geom_point()

ggplot(data= meuse, aes(x= x, y= y, size= zinc))+
    geom_point()

image.png

image.png

From these plots, we see that the river is running from bottom left to top right, along our cloud of points (and we’re probably lying in a meander, as suggested by the 3 points away from the rest, with a low distance). There also seems to be a correlation between the zinc concentration and the distance from the river, with higher concentrations generally along the river. We can combine the above two plots using something like:

从这些图中,我们看到河流从左下角流向右上角,云锌浓度与河流距离之间似乎也存在相关性,通常沿河浓度较高。 我们可以使用以下内容组合上述两个图:

ggplot(data= meuse, aes(x= x, y= y))+
    geom_point(aes(size= dist, color= zinc))+
    scale_colour_viridis_c()

image.png

So, our data appears to have some correlation spatially, but there is also a clear inverse relationship between the distance from the river and the zinc concentration. We see that if we apply transformations (taking the square root of dist, and the logarithm of zinc) we have a reasonably strong negative correlation. Plotting these transformations:

因此,我们的数据在空间上似乎有一些相关性,但与河流的距离与锌浓度之间也存在明显的反比关系。我们看到,如果我们应用变换(取dist的平方根和锌的对数),我们会得到相当强的负相关。 绘制这些变换:

ggplot(data= meuse, aes(x= dist, y= zinc))+
    geom_point()
ggplot(data= meuse, aes(x= sqrt(dist), y= log(zinc)))+
    geom_point()

image.png

image.png

Fitting a model for zinc concentration-拟合锌浓度模型

Our initial data analysis has highlighted the relatively strong relationship between distance from the river and zinc concentration, given appropriate transformations. Let’s fit a linear regression using this relationship, with the aim of using this model to predict zinc concentration on a grid of points around the river:

经过适当的转换后,我们的初始数据分析显示了了到河流的距离与锌浓度之间相对较强的关系。 让我们使用此关系拟合线性回归,目的是使用此模型来预测河流周围点网格上的锌浓度:

$$log(zinc_i) = \beta_0 + \beta_1\sqrt{dist_i} + \epsilon_i$$

linear_model<- lm(log(zinc)~ sqrt(dist), data= meuse)
summary(linear_model)
![image.png](https://ucc.alicdn.com/pic/developer-ecology/44427633zfisq_2771a707b0f8436e8afae8486408e9b0.png)

For linear regression models, recall that we are assuming that the residuals $\epsilon_i$ are independent and identically distributed with mean $\theta$ and variance $\sigma^2$:

对于线性回归模型,回想一下,我们假设残差$\epsilon_i$是独立且同分布的,均值 $\theta$ 和方差 $\sigma^2$:

$$\epsilon \sim N(0,\sigma^2)$$

To assess whether this assumption is valid, we can consider the usual diagnostic plots:

为了评估这个假设是否有效,我们可以考虑通常的 diagnostic plots:

```{r}
par(mfrow= c(2, 2))
plot(linear_model)

image.png

We see some potential problems here: the QQ plot shows some deviation from Normality, particularly in the upper tail. In the residuals vs fitted plot, the mean of the residuals should be zero regardless of the fitted value, and the spread of these points should be constant, both of which are violated to some extent, particulary for lower values of (log) concentration.

我们在这里看到了一些潜在的问题:QQ图显示了一些与正态性的偏差,特别是在upper-tail。在residuals vs fittedplot中 ,无论拟合值如何,残差的平均值都应为零,并且这些点的分布应该是恒定的,这两者都在一定程度上被违反,特别是对于较低的 (log) 浓度值。

(At this point, we may want to consider adding extra covariates, or using different transformations, to try to improve the linear model fit, but we assume that this model is ‘good enough’ for now.)
We can also consider how the residuals change with distance from the river, and with x and y coordinates, again expecting no patterns if our assumptions hold:

(此时,我们可能要考虑添加额外的协变量,或使用不同的变换,以尝试改进线性模型拟合,但我们假设该模型目前“足够好”。)
我们还可以考虑残差如何随与河流的距离以及x和y坐标的变化而变化,如果我们的假设成立,则再次期望没有其他模式:

ggplot(data= meuse, aes(x= sqrt(dist), y= resid(linear_model)))+
    geom_point()+
    geom_smooth(se= FALSE)
ggplot(data= meuse, aes(x= x, y= resid(linear_model)))+
    geom_point()+
    geom_smooth(se= FALSE)
ggplot(data= meuse, aes(x= y, y= resid(linear_model)))+
    geom_point()+
    geom_smooth(se= FALSE)

image.png

image.png

image.png

And clearly here we have an issue, with the spread of residuals much narrower for high values of the distance,and a systematic pattern in terms of the x coordinate.
The problem becomes even clearer if we plot the residuals on our original spatial map, with green and redindicating whether the zinc measurements are higher or lower than the modelled values, with the size of the points scaled by the size of the residual:

很明显,这里我们有一个问题,相对于较远距离,残差的分布要窄得多,并且在残差个X坐标值存在一定的关系。
如果我们在原始空间地图上绘制残差,则问题变得更加清晰,用绿色和重新指示锌测量值是高于还是低于建模值,点的大小按残差的大小进行缩放:

ggplot(data= meuse, aes(x= x, y= y))+
    geom_point(size= 5*abs(resid(linear_model)), 
               col= ifelse(resid(linear_model)> 0, 'green', 'red'))

image.png

This plot shows that the errors are correlated when considered spatially, with the model systematically over and underestimating the measurements in different regions, rather than the errors being distributed randomly. However, we set out to produce a map of zinc concentration, so despite the inadequacies of this particular model, let’s use it to interpolate and predict on a grid. The sp package also contains a dataset called meuse.grid, which gives a grid containing x and y coordinates, along with the distance to the river for every point. Using this in combination with the linear model, we can colour in our map with predictions:

该图显示,当从空间角度考虑时,误差是相关的,模型系统地高估和低估了不同区域的测量值,而不是随机分布的误差。然而,我们开始制作锌浓度图,所以尽管这个特定模型有不足之处,让我们用它在网格上进行插值和预测。sp包还包含一个名为meuse.grid的数据集,它给出了一个包含x和y坐标的网格,以及每个点到河流的距离。将此与线性模型结合使用,我们可以在地图中进行预测着色:

data(meuse.grid)
meuse.grid$x<- meuse.grid$x/ 1000
meuse.grid$y<- meuse.grid$y/ 1000
meuse.grid$linear_preds<- predict(linear_model, meuse.grid)
ggplot(data= meuse.grid, aes(x= x, y= y, fill= linear_preds))+
    geom_tile()+
    scale_fill_viridis_c(limits= c(4.4, 7.6))+
    labs(fill= "log conc")

image.png

Unsurprisingly, given the form of the model, the zinc concentration is highest next to the river, and decreases as the distance from the river increases.
So we’ve attempted to handle the spatial component of this data set by only considering distance to the river, which is strongly related to the zinc concentration. But whilst the residuals are reasonably Normal overall, there are clear systematic biases spatially:

不出所料,在该模型下,靠近河流的锌浓度最高,并随着与河流距离的增加而降低。
因此,我们试图通过仅考虑到河流的距离来处理此数据集的空间分量,这与锌浓度密切相关。 但是,虽然残差总体上是合理的,但在空间上存在明显的系统偏差:

Looking ahead

We saw above that assuming uncorrelated residuals was a problem, so the following code implements a model where we assume that the residuals at close points are related. Then, we could understand how to use the geoR package and what the functions below are actually doing, but for now it doesn’t matter:

我们在上面看到假设不相关的残差是一个问题,所以下面的代码实现了一个模型,我们假设近点的残差是相关的。然后,我们可以了解如何使用geoR包以及下面的函数实际上在做什么,但现在没关系:

library(geoR)
meuse_new<- as.geodata(meuse, data.col= which(colnames(meuse)== "zinc"))
str(meuse_new)

image.png

meuse_new
meuse_new[2]$data<- log(meuse_new[2]$data)
## likfit 参数:
## geodata: a list containing elements coords and data
## ini: initial values
## fix.nug: logical, indicating whether tau^2 should be regard as fixed
## lik.met: frmely method, ML for maximum likelihood, REML for restricted maximum likelihood
ml<- likfit(meuse_new, ini= c(10, 1), fix.nug= FALSE, lik.met= "REML")

image.png

## krige.conv参数:
## geodata:
## coords:
## loc: 2-D coordiantes
## krige: can take a call to krige.control(Auxiliary function辅助函数)
preds<- krige.conv(meuse_new, loc= meuse.grid[,1:2], krige = krige.control(obj.model = ml))

image.png

meuse.grid$preds <- preds$predict
ggplot(data = meuse.grid, aes(x = x, y = y, fill = preds)) +
    geom_tile() +
    scale_fill_viridis_c(limits = c(4.4,7.6)) +
    labs(fill = 'log conc')

image.png

So now we have an alternative set of predictions of zinc concentration spatially. There’s clearly a lot more going on here - there isn’t just a linear relationship as distance from the river increases, and we’re clearly able to model the data with much greater flexibility, picking up more local variability in the zinc concentration. There is a reasonably strong correlation between the predictions from the two models overall, but particularly
where the linear model predicts high concentrations (i.e. close to the river), the new model can return a much wider range of values, suggesting that distance from the river is not the only important factor:

所以现在我们有一组替代的锌浓度的空间预测。显然这里还有很多事情要做—,锌浓度与河流距离不仅仅是线性关系,,而且我们显然能够以更大的灵活性对数据进行建模,获取更多局部锌浓度变化。总体而言,两个模型的预测之间存在相当强的相关性,但特别是在线性模型预测高浓度(即靠近河流)的情况下,新模型可以返回更广泛的值,这表明与河流的距离并不是唯一的重要因素:

ggplot(data= meuse.grid, aes(x= linear_preds, y= preds))+
    geom_point()+
    labs(x= "Linear model", "Spatial model")

image.png

We can also look at the uncertainty on these predictions, and plot this spatially:

我们还可以查看这些预测的不确定性,并在空间上进行绘制:

meuse.grid$unc<- preds$krige.var
ggplot(data= meuse.grid, aes(x= x, y= y, fill= unc))+
    geom_tile()

image.png

目录
相关文章
|
存储 数据挖掘 数据处理
Pandas中explode()函数的应用与实战
Pandas中explode()函数的应用与实战
677 0
|
Java Docker 容器
|
7月前
|
XML 机器学习/深度学习 监控
高级检索增强生成系统:LongRAG、Self-RAG 和 GraphRAG 的实现与选择
检索增强生成(RAG)已超越简单向量匹配,迈向LongRAG、Self-RAG与GraphRAG等高级形态。LongRAG通过大块重叠分片保留长上下文,提升连贯性;Self-RAG引入反思机制,动态判断检索必要性与内容相关性,增强可信度;GraphRAG构建知识图谱,支持多跳推理与复杂关系挖掘。三者分别应对上下文断裂、检索盲目性与关系表达缺失难题,代表2025年RAG工程化核心进展,可依场景组合使用以平衡准确性、成本与复杂度。
786 57
高级检索增强生成系统:LongRAG、Self-RAG 和 GraphRAG 的实现与选择
|
10月前
lol脚本,dnf脚本,挑一挑脚本源码分享
LOL脚本实现技能冷却检测和自动补兵功能,采用多线程运行 DNF脚本包含预设技能序列和智能移动逻辑,支持BOSS血条检测
|
9月前
|
边缘计算 缓存 人工智能
EdgeShard:通过协作边缘计算实现高效的大语言模型推理——论文解读
EdgeShard是一种基于协作边缘计算的大语言模型(LLM)推理框架,旨在解决LLM在云端部署面临的延迟高、带宽压力大和隐私泄露等问题。通过将LLM分片部署在多个边缘设备上,结合云边协同与设备间协作,EdgeShard实现了高效的模型推理。其核心创新包括:联合设备选择与模型划分优化、支持流水线并行与微批处理、提出EdgeShard-No-Bubbles策略以减少设备空闲时间,从而显著提升推理吞吐量并降低延迟。实验表明,EdgeShard在异构边缘设备上可实现高达50%的延迟降低和2倍的吞吐量提升,支持全精度模型推理而无精度损失,为资源受限的边缘环境提供了高效的LLM部署方案。
1694 2
|
JSON 数据可视化 API
GraphRAG+Ollama,构建本地精准全局问答系统!
RAG 是目前大语言模型相关最知名的工具之一,从外部知识库中检索事实,以便为大型语言模型 (LLM) 提供最准确、最新的信息。
|
存储 机器学习/深度学习 自然语言处理
知识图谱和向量数据库的关系
知识图谱和向量数据库在处理不同类型的数据和任务中具有各自的优势。它们可以在大语言模型中相互结合,以处理复杂的需求,提供更全面和准确的信息检索和推理能力。这种结合为我们在自然语言处理和机器学习领域中解决实际问题提供了有力的工具和方法。
2751 1
|
机器学习/深度学习 算法 PyTorch
基于图神经网络的大语言模型检索增强生成框架研究:面向知识图谱推理的优化与扩展
本文探讨了图神经网络(GNN)与大型语言模型(LLM)结合在知识图谱问答中的应用。研究首先基于G-Retriever构建了探索性模型,然后深入分析了GNN-RAG架构,通过敏感性研究和架构改进,显著提升了模型的推理能力和答案质量。实验结果表明,改进后的模型在多个评估指标上取得了显著提升,特别是在精确率和召回率方面。最后,文章提出了反思机制和教师网络的概念,进一步增强了模型的推理能力。
1032 4
基于图神经网络的大语言模型检索增强生成框架研究:面向知识图谱推理的优化与扩展

热门文章

最新文章