Deploy and Run Apache Airflow on Alibaba Cloud


This tutorial shows how to deploy and run the open source project Apache Airflow on Alibaba Cloud with ApsaraDB (Alibaba Cloud Database). It also demonstrates a simple data migration task, deployed and run in Airflow, that migrates data between two databases.

You can access the tutorial artifacts, including the deployment script (Terraform), related source code, sample data, and instructions, from the GitHub project:
https://github.com/alibabacloud-howto/opensource_with_apsaradb/tree/main/apache-airflow

For more tutorials on Alibaba Cloud Database, please refer to:
https://github.com/alibabacloud-howto/database


Overview

Apache Airflow (https://airflow.apache.org/) is a platform created by the community to programmatically author, schedule and monitor workflows.

Airflow requires a database. If you're just experimenting and learning Airflow, you can stick with the default SQLite option or the single-node PostgreSQL instance built into the Docker edition. To get high availability for the database behind Apache Airflow, this tutorial shows the steps of deploying Airflow with an Alibaba Cloud database.
Airflow supports PostgreSQL and MySQL (https://airflow.apache.org/docs/apache-airflow/stable/howto/set-up-database.html), so on Alibaba Cloud you can use either of the following databases: ApsaraDB RDS for PostgreSQL or ApsaraDB RDS for MySQL.

In this tutorial, we use the RDS PostgreSQL High-availability Edition to replace the single-node PostgreSQL instance built into the Docker edition, which is more suitable for stable production use.

Deployment architecture: one ECS instance runs Airflow in Docker, backed by two RDS PostgreSQL instances (one hosting the Airflow backend database, the other hosting the demo source and target databases).


Index

  • Step 1. Use Terraform to provision ECS and database on Alibaba Cloud
  • Step 2. Deploy and set up Airflow on ECS with RDS PostgreSQL
  • Step 3. Prepare the source and target databases for the Airflow data migration task demo
  • Step 4. Deploy and run the data migration task in Airflow

Step 1. Use Terraform to provision ECS and database on Alibaba Cloud

Run the Terraform script to provision the resources. In this tutorial, we use one RDS PostgreSQL instance as the backend database of Airflow and another RDS PostgreSQL instance as the demo database for the data migration task, so the Terraform script creates one ECS instance and two RDS PostgreSQL instances. Please specify the necessary information and the region to deploy in.
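If you haven't used Terraform before, a typical run looks like the following (a sketch; run it in the tutorial directory from the GitHub project, and note that the exact variables to fill in depend on that script):

cd opensource_with_apsaradb/apache-airflow    # directory holding the Terraform script
terraform init     # downloads the Alibaba Cloud provider
terraform apply    # review the plan, then type "yes" to create the ECS and 2 RDS instances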

After the Terraform script finishes, the connection information of the ECS and RDS PostgreSQL instances is listed in the output:

  • rds_pg_url_airflow_database: the connection URL of the backend database for Airflow
  • rds_pg_url_airflow_demo_database: the connection URL of the demo database used by the Airflow migration task

The database port for RDS PostgreSQL is 1921 by default.

Step 2. Deploy and set up Airflow on ECS with RDS PostgreSQL

Please log on to the ECS instance via its EIP.

ssh root@<ECS_EIP>

Download and execute the script setup.sh via the following commands to set up Airflow on the ECS instance.

cd ~
wget https://raw.githubusercontent.com/alibabacloud-howto/opensource_with_apsaradb/main/apache-airflow/setup.sh
sh setup.sh
cd ~/airflow
mkdir ./dags ./logs ./plugins
echo -e "AIRFLOW_UID=$(id -u)\nAIRFLOW_GID=0" > .env

Edit the downloaded docker-compose.yaml file to use the RDS PostgreSQL instance as the backend database.

cd ~/airflow
vim docker-compose.yaml

Set the metadata database connection string to the rds_pg_url_airflow_database value from Step 1, and comment out the parts related to the built-in postgres service, which is no longer needed.
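The change looks roughly like the following fragment (a sketch based on the official Airflow 2.x docker-compose.yaml; the exact keys and the database account depend on the file downloaded by setup.sh, so treat the names below as placeholders):

x-airflow-common:
  environment:
    # Point Airflow at the RDS PostgreSQL instance instead of the local container
    AIRFLOW__CORE__SQL_ALCHEMY_CONN: postgresql+psycopg2://<user>:<password>@<rds_pg_url_airflow_database>:1921/airflow
    # If the file uses CeleryExecutor, update AIRFLOW__CELERY__RESULT_BACKEND the same way

services:
  # The built-in single-node postgres service (and depends_on references to it)
  # can be commented out, since RDS PostgreSQL replaces it:
  # postgres:
  #   image: postgres:13
  #   ...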

Then execute the following command to initialize the Airflow metadata database. When it completes, the airflow-init container should exit with code 0.

docker-compose up airflow-init

Then execute the following command to start Airflow.

docker-compose up

Now Airflow has started successfully. Please visit the following URL (replace <ECS_EIP> with the EIP of the ECS instance) to access the Airflow web console.

http://<ECS_EIP>:8080

The default account has the login airflow and the password airflow.

Next, let's move on to the first data migration task on Airflow.

Step 3. Prepare the source and target databases for the Airflow data migration task demo

Please log on to the ECS instance via its EIP in another terminal window.

ssh root@<ECS_EIP>

Download and set up the PostgreSQL client to communicate with the demo database.

cd ~
wget http://mirror.centos.org/centos/8/AppStream/x86_64/os/Packages/compat-openssl10-1.0.2o-3.el8.x86_64.rpm
rpm -i compat-openssl10-1.0.2o-3.el8.x86_64.rpm
wget http://docs-aliyun.cn-hangzhou.oss.aliyun-inc.com/assets/attach/181125/cn_zh/1598426198114/adbpg_client_package.el7.x86_64.tar.gz
tar -xzvf adbpg_client_package.el7.x86_64.tar.gz

Fetch the database DDL and DML SQL files.

cd ~/airflow
wget https://raw.githubusercontent.com/alibabacloud-howto/opensource_with_apsaradb/main/apache-airflow/northwind_ddl.sql
wget https://raw.githubusercontent.com/alibabacloud-howto/opensource_with_apsaradb/main/apache-airflow/northwind_data_source.sql
wget https://raw.githubusercontent.com/alibabacloud-howto/opensource_with_apsaradb/main/apache-airflow/northwind_data_target.sql

Two databases (source: northwind_source, target: northwind_target) serve as the source and target, respectively, in this data migration demo.

Connect to the demo source database northwind_source, create the tables (northwind_ddl.sql), and load the sample data (northwind_data_source.sql).
Replace <rds_pg_url_airflow_demo_database> with the demo RDS PostgreSQL connection string.
The demo database account has been set up with username demo and password N1cetest.

Execute the following commands for the source database:

cd ~/adbpg_client_package/bin
./psql -h<rds_pg_url_airflow_demo_database> -p1921 -Udemo northwind_source

\i ~/airflow/northwind_ddl.sql
\i ~/airflow/northwind_data_source.sql

select tablename from pg_tables where schemaname='public';
select count(*) from products;
select count(*) from orders;

Execute the following commands for the target database:

./psql -h<rds_pg_url_airflow_demo_database> -p1921 -Udemo northwind_target

\i ~/airflow/northwind_ddl.sql
\i ~/airflow/northwind_data_target.sql

select tablename from pg_tables where schemaname='public';
select count(*) from products;
select count(*) from orders;

We can see that tables products and orders in the target database are empty. Later we will use the migration task running in Airflow to migrate data from the source database to the target database.

Step 4. Deploy and run the data migration task in Airflow

First, go to the Airflow web console (Admin -> Connections) to add database connections to the source and target databases, respectively. Create a Postgres-type connection for each, using the demo database host, database name (northwind_source or northwind_target), username demo, password N1cetest, and port 1921 from Step 3.
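Alternatively, the connections can be created with the Airflow CLI inside a running container. A sketch, assuming the connection IDs northwind_source and northwind_target (check northwind_migration.py for the IDs the DAG actually expects):

docker-compose exec airflow-webserver airflow connections add northwind_source \
    --conn-type postgres --conn-host <rds_pg_url_airflow_demo_database> --conn-port 1921 \
    --conn-schema northwind_source --conn-login demo --conn-password N1cetest

docker-compose exec airflow-webserver airflow connections add northwind_target \
    --conn-type postgres --conn-host <rds_pg_url_airflow_demo_database> --conn-port 1921 \
    --conn-schema northwind_target --conn-login demo --conn-password N1cetest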

Download the migration task Python script https://github.com/alibabacloud-howto/opensource_with_apsaradb/blob/main/apache-airflow/northwind_migration.py and deploy it into Airflow by placing it in the dags directory.

cd ~/airflow/dags
wget https://raw.githubusercontent.com/alibabacloud-howto/opensource_with_apsaradb/main/apache-airflow/northwind_migration.py

The DAG task in this demo finds the new product_id and order_id values in the database northwind_source, and then updates the same products and orders tables in the database northwind_target with the rows whose IDs are greater than the current maximum IDs there. The job is scheduled to run every minute, starting on today's date (when you run this demo, please update the start date accordingly).
The demo Airflow DAG Python script originates from https://dzone.com/articles/part-2-airflow-dags-for-migrating-postgresql-data; we've made some modifications.
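The core of such a migration DAG can be sketched as follows (a minimal illustration of the same incremental pattern, not the tutorial script itself; connection IDs, table names, and the schedule should be taken from northwind_migration.py):

from datetime import datetime, timedelta

from airflow import DAG
from airflow.operators.python import PythonOperator
from airflow.providers.postgres.hooks.postgres import PostgresHook

def migrate_table(table, id_column):
    src = PostgresHook(postgres_conn_id="northwind_source")   # assumed connection ID
    dst = PostgresHook(postgres_conn_id="northwind_target")   # assumed connection ID
    # Highest ID already present in the target table (0 if it is empty)
    max_id = dst.get_first(f"SELECT COALESCE(MAX({id_column}), 0) FROM {table}")[0]
    # Copy only the rows the target has not seen yet
    rows = src.get_records(
        f"SELECT * FROM {table} WHERE {id_column} > %s ORDER BY {id_column}",
        parameters=(max_id,),
    )
    if rows:
        dst.insert_rows(table=table, rows=rows)

with DAG(
    dag_id="northwind_migration_sketch",
    start_date=datetime(2021, 8, 1),          # update to your run date
    schedule_interval=timedelta(minutes=1),   # run every minute
    catchup=False,
) as dag:
    PythonOperator(
        task_id="migrate_products",
        python_callable=migrate_table,
        op_kwargs={"table": "products", "id_column": "product_id"},
    )
    PythonOperator(
        task_id="migrate_orders",
        python_callable=migrate_table,
        op_kwargs={"table": "orders", "id_column": "order_id"},
    )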

If the script loads successfully, the DAG task is shown on the web console.

Since the migration task runs every minute, we can go to the target database and check that the data has been migrated.
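For example, re-run the row counts from Step 3 against the target database; products and orders should no longer be empty:

./psql -h<rds_pg_url_airflow_demo_database> -p1921 -Udemo northwind_target

select count(*) from products;
select count(*) from orders;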
