
Summary: Compare HDFS and OSS files

Requirements Document for Verifying Data Consistency Between HDFS and OSS


The purpose of this document is to provide a clear set of requirements for verifying data consistency between HDFS and OSS (Object Storage Service) using Python. This check matters because discrepancies between the data stored in HDFS and the data stored in OSS can cause errors in downstream data analysis or processing.

Functional Requirements

1. The solution should perform a binary comparison of the files in HDFS and OSS to verify if they are identical.

2. The solution should compare the number of files in HDFS and OSS to ensure that they are equal.

3. The solution should compare the size of files in HDFS and OSS to ensure that they are equal.

4. The solution should handle errors and edge cases gracefully and log them in a file.

5. The solution should be configurable to allow for customization of parameters such as the directories to compare and the log file location.
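As an illustration of requirement 1, the following is a minimal sketch of a binary comparison that reads both files in fixed-size chunks instead of loading them fully into memory. The names streams_equal and _read_exactly are hypothetical helpers introduced here, not part of any SDK; the sketch only assumes that both sides expose a file-like object with a read(n) method.

```python
import io
from typing import BinaryIO


def _read_exactly(stream: BinaryIO, size: int) -> bytes:
    """Read up to `size` bytes, looping because network streams may return short reads."""
    parts = []
    remaining = size
    while remaining > 0:
        chunk = stream.read(remaining)
        if not chunk:  # end of stream
            break
        parts.append(chunk)
        remaining -= len(chunk)
    return b"".join(parts)


def streams_equal(left: BinaryIO, right: BinaryIO, chunk_size: int = 1024 * 1024) -> bool:
    """Return True only if both streams yield exactly the same bytes."""
    while True:
        a = _read_exactly(left, chunk_size)
        b = _read_exactly(right, chunk_size)
        if a != b:
            return False  # differing content, or one stream ended early
        if not a:
            return True  # both streams ended at the same point


if __name__ == "__main__":
    # In-memory streams stand in for the HDFS and OSS file objects.
    print(streams_equal(io.BytesIO(b"same bytes"), io.BytesIO(b"same bytes")))
```

Comparing chunk by chunk also lets the check stop at the first difference, which helps with large files.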

Non-functional Requirements

1. The solution should be implemented in Python3.

2. The solution should use the OSS SDK for Python (the oss2 package) and a Python HDFS client (the code below uses pyhdfs).

3. The solution should be optimized to minimize the time and resources required for the comparison process.

4. The solution should be easy to deploy and run on different environments.

5. The solution should be documented and maintainable.
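One possible way to approach requirement 3 is to compare several files concurrently, since most of the time is likely spent waiting on network I/O. The sketch below uses a thread pool from the standard library. compare_in_parallel and the compare_one callback are hypothetical names, and the sketch assumes that whatever clients compare_one uses are either safe to share between threads or created inside each call.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed
from typing import Callable, Dict, Iterable


def compare_in_parallel(
    names: Iterable[str],
    compare_one: Callable[[str], bool],
    max_workers: int = 8,
) -> Dict[str, str]:
    """Run compare_one(name) for every name and collect a status per file."""
    results: Dict[str, str] = {}
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = {pool.submit(compare_one, name): name for name in names}
        for future in as_completed(futures):
            name = futures[future]
            try:
                results[name] = "match" if future.result() else "mismatch"
            except Exception as exc:
                # One failing file should not abort the whole run.
                results[name] = f"error: {exc}"
    return results
```

The worker count is a tuning parameter; a suitable value depends on the bandwidth and request limits of the HDFS cluster and the OSS bucket.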


Assumptions

1. The solution assumes that the HDFS and OSS directories to compare are accessible.

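2. The solution assumes that the HDFS and OSS directories have the same file structure, that is, each file's path relative to the HDFS directory equals its object key relative to the OSS prefix.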
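The second assumption can be checked before any content is compared. The sketch below is a hypothetical helper (diff_listings is not an SDK function) that takes file names relative to the HDFS directory and full OSS object keys, strips the OSS prefix, and reports names present on only one side. It assumes OSS may contain zero-byte placeholder keys ending in "/" that represent folders, and it skips them.

```python
from typing import Iterable, Set, Tuple


def diff_listings(
    hdfs_names: Iterable[str],
    oss_keys: Iterable[str],
    oss_prefix: str,
) -> Tuple[Set[str], Set[str]]:
    """Return (names only in HDFS, names only in OSS), relative to each root."""
    hdfs_set = set(hdfs_names)
    oss_set = set()
    for key in oss_keys:
        if not key.startswith(oss_prefix):
            continue
        relative = key[len(oss_prefix):]
        # Skip the prefix itself and folder placeholder keys ending with "/".
        if relative and not relative.endswith("/"):
            oss_set.add(relative)
    return hdfs_set - oss_set, oss_set - hdfs_set
```

Reporting the names that are missing on each side is more informative than comparing counts alone, because two listings can have the same length and still contain different files.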

Implementation Plan

1. Install the OSS SDK for Python and the HDFS Python API (for example, pip install oss2 pyhdfs).

2. Define variables for the HDFS and OSS directories to compare.

3. Read the files in the HDFS and OSS directories, and verify that the number of files in each directory is equal.

4. For each file, compare the file size and binary content in HDFS and OSS to ensure they are equal.

5. Log any errors or discrepancies in a file.

6. Allow for customization of parameters such as directories to compare and the log file location.

7. Test the solution in different environments to ensure its compatibility and functionality.

8. Document the solution and make it maintainable.
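Steps 5 and 6 of the plan could be sketched with the standard library as follows. The option names (--hdfs-path, --oss-path, --log-file), the default log file name, and the logger name are assumptions made for this illustration, not part of the original solution.

```python
import argparse
import logging
from typing import List, Optional


def parse_args(argv: Optional[List[str]] = None) -> argparse.Namespace:
    """Read the directories to compare and the log file location from the command line."""
    parser = argparse.ArgumentParser(description="Compare files in HDFS and OSS")
    parser.add_argument("--hdfs-path", required=True, help="HDFS directory to compare")
    parser.add_argument("--oss-path", required=True, help="OSS key prefix to compare")
    parser.add_argument("--log-file", default="compare.log", help="where to write discrepancies")
    return parser.parse_args(argv)


def build_logger(log_file: str) -> logging.Logger:
    """Create a logger that appends timestamped messages to log_file."""
    logger = logging.getLogger("hdfs_oss_compare")
    logger.setLevel(logging.INFO)
    if not logger.handlers:  # avoid adding duplicate handlers on repeated calls
        handler = logging.FileHandler(log_file, encoding="utf-8")
        handler.setFormatter(logging.Formatter("%(asctime)s %(levelname)s %(message)s"))
        logger.addHandler(handler)
    return logger


if __name__ == "__main__":
    args = parse_args()
    log = build_logger(args.log_file)
    log.info("Comparing %s with %s", args.hdfs_path, args.oss_path)
```

With this in place, a discrepancy would be recorded with a call such as log.error("Size mismatch: %s", name) rather than printed to the console.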

Code Implementation in Python3


import oss2
import pyhdfs

# Define the HDFS and OSS directories to compare (placeholder values)
hdfs_path = "/hdfs/path/"
oss_endpoint = "oss_endpoint"
oss_bucket = "oss_bucket"
oss_path = "oss_path/"

# Initialize the HDFS client and retrieve the file list in the directory.
# pyhdfs talks to WebHDFS on the NameNode HTTP port (50070 on Hadoop 2.x, 9870 on Hadoop 3.x).
client = pyhdfs.HdfsClient(hosts='localhost:50070')
hdfs_files = client.listdir(hdfs_path)

# Initialize the OSS client and retrieve the file list in the directory.
# This assumes a flat directory: listdir() is not recursive, while
# ObjectIterator returns every key under the prefix.
auth = oss2.Auth('oss_accessKeyId', 'oss_accessKeySecret')
bucket = oss2.Bucket(auth, oss_endpoint, oss_bucket)
oss_files = [obj.key for obj in oss2.ObjectIterator(bucket, prefix=oss_path)]

# Verify that the number of files in HDFS and OSS directories are equal
if len(hdfs_files) != len(oss_files):
    print("Error: Number of files in HDFS and OSS directories are different")
else:
    # Iterate over each file in the HDFS directory
    for file in hdfs_files:
        # Retrieve the file size from HDFS and open the file on both sides
        hdfs_size = client.get_file_status(hdfs_path + file).length
        hdfs_file = client.open(hdfs_path + file)
        oss_file = bucket.get_object(oss_path + file)
        try:
            # Verify that the size and content of the files are equal
            if (hdfs_size != oss_file.content_length) or (hdfs_file.read() != oss_file.read()):
                print("Error: Content of file %s differs between HDFS and OSS" % file)
        finally:
            # Close the files
            hdfs_file.close()
            oss_file.close()





In conclusion, verifying data consistency between HDFS and OSS is necessary to ensure accurate and reliable data processing. The requirements document outlines the key functional and non-functional requirements for implementing a solution in Python3. The code implementation demonstrates the steps required to compare the directories and files in HDFS and OSS and prints a message for each discrepancy it finds; the requirements call for writing these to a log file instead. The solution can be customized to suit different environments and is easy to maintain.
