The following Python code examples can help improve the accuracy of data collection:
```python
import re

import requests


# Validate a data source by checking that the URL responds with HTTP 200.
def validate_source(url, timeout=10):
    try:
        # A timeout keeps an unresponsive source from blocking the run indefinitely.
        response = requests.get(url, timeout=timeout)
        return response.status_code == 200
    except requests.exceptions.RequestException as e:
        print(f"Error validating source: {e}")
        return False


# Clean the data by removing duplicates; the first occurrence of each value is kept.
def clean_data(data):
    # dict.fromkeys preserves the original order, which set() does not guarantee.
    return list(dict.fromkeys(data))


# Keep only the values that fall within [min_val, max_val].
def check_range(data, min_val, max_val):
    return [item for item in data if min_val <= item <= max_val]


# Keep only the strings that match the pattern in full.
def validate_format(data, pattern):
    # re.fullmatch rejects trailing characters that re.match would let through.
    return [item for item in data if re.fullmatch(pattern, item)]


# Example usage
data = [10, 20, 30, 20, 40]
print("Original data:", data)
print("After removing duplicates:", clean_data(data))

min_val = 15
max_val = 35
filtered_data = check_range(data, min_val, max_val)
print(f"Data within the range {min_val} - {max_val}:", filtered_data)

data_str = ["abc123", "def456", "ghi789", "jkl012"]
pattern = r"[a-z]{3}\d{3}"
valid_str_data = validate_format(data_str, pattern)
print("Correctly formatted data:", valid_str_data)
```
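This code validates the data source, removes duplicate values, checks that values fall within an expected range, and validates the data format, all of which help improve data accuracy during collection.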
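In practice these checks are often run together over whole records rather than over separate lists. The sketch below is one possible way to chain deduplication, format validation, and range checking in a single pass while counting why records were rejected. The record layout (dicts with `id` and `value` keys), the `CleaningReport` class, and the `clean_records` function are hypothetical names introduced for illustration; the ID pattern simply reuses the one from the example.

```python
import re
from dataclasses import dataclass, field

# Assumed ID format, reusing the pattern from the example (three lowercase letters, three digits).
ID_PATTERN = re.compile(r"[a-z]{3}\d{3}")


@dataclass
class CleaningReport:
    """Records that passed every check, plus a rejection count per reason."""
    kept: list = field(default_factory=list)
    rejected: dict = field(
        default_factory=lambda: {"duplicate": 0, "format": 0, "range": 0}
    )


def clean_records(records, min_val, max_val):
    """Apply deduplication, format validation, and range checking in one pass.

    Each record is assumed to be a dict with an "id" string and a numeric "value".
    """
    report = CleaningReport()
    seen_ids = set()
    for record in records:
        record_id, value = record["id"], record["value"]
        # An ID that was already seen is a duplicate, whatever happened to the first one.
        if record_id in seen_ids:
            report.rejected["duplicate"] += 1
            continue
        seen_ids.add(record_id)
        # The whole ID must match the pattern, not just a prefix.
        if not ID_PATTERN.fullmatch(record_id):
            report.rejected["format"] += 1
            continue
        # The value must lie within the inclusive range [min_val, max_val].
        if not (min_val <= value <= max_val):
            report.rejected["range"] += 1
            continue
        report.kept.append(record)
    return report


# Made-up sample records, for illustration only.
sample_records = [
    {"id": "abc123", "value": 10},
    {"id": "def456", "value": 20},
    {"id": "def456", "value": 20},
    {"id": "GHI789", "value": 30},
    {"id": "jkl012", "value": 30},
]
sample_report = clean_records(sample_records, min_val=15, max_val=35)
print("Kept:", [r["id"] for r in sample_report.kept])
print("Rejected:", sample_report.rejected)
```

Counting rejections by reason, instead of silently dropping records, makes it easier to notice when a source starts producing bad data, for example a sudden rise in format failures.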