GT-Scan2: Bringing Bioinformatics to Alibaba Cloud

Summary: Learn how Alibaba Cloud powers GT-Scan2, a cutting-edge genome sequence search tool, with its suite of big data products and its serverless computing platform.

CRISPR-Cas9 is a genome editing tool that is creating a buzz in the science world. It is faster, cheaper and more accurate than previous techniques for editing the genome of living cells, and therefore has the potential to revolutionize a wide range of applications.

CRISPR-Cas9 is especially promising in the health space, where it could allow the treatment of medical conditions that have a genetic component, including cancer, hepatitis B or even high cholesterol. Clinical trials have already started for patients with specific blood and solid cancer types.

CRISPR-Cas9 is suitable for these applications because it can be programmed to recognize and edit specific locations in the genome by pattern-matching unique sequences of DNA. However, for robust application in the clinic, the efficiency of CRISPR-Cas9 needs to be increased, as does the speed with which target sites can be designed.

Researchers in the eHealth program of the Commonwealth Scientific and Industrial Research Organization (CSIRO) in Australia developed GT-Scan2, a novel software tool that addresses both issues.

GT-Scan2 helps researchers find the most effective CRISPR-Cas9 targets in a genomic region by ranking them by predicted cutting efficiency. You can think of it as a "search engine for the genome". GT-Scan2 also reports the number of potential off-targets for each target, where a potential off-target is any other region in the genome with 0-3 mismatches to the target.

  • Identifies optimal CRISPR-Cas9 targets in the human genome.
  • Combines information about the chromatin environment and sequence of the target site.

Architecture

A web application front end is used to access GT-Scan2 and to submit jobs.

Image_1

When a user submits a job, GT-Scan2 inserts the job parameters as an item into a TableStore table via an API call. This allows the solution to scale freely without creating a bottleneck. The database entry triggers the first Function Compute function, which finds all putative CRISPR targets in the user-specified DNA sequence (fetched automatically upon user submission). Potential CRISPR target sites follow fixed rules, so they can be found with a regular expression in a matter of seconds. The matches are then inserted into a second TableStore table.
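The regular-expression step can be sketched as follows. This is a minimal illustration, not GT-Scan2's actual code: it assumes the commonly used SpCas9 rule (a 20-nucleotide protospacer followed by an NGG PAM), which the article does not spell out, and the function names are hypothetical.

```python
import re

# Assumption: SpCas9-style sites, i.e. 20 nt followed by an "NGG" PAM (23 nt total).
# The zero-width lookahead lets finditer report overlapping candidates.
TARGET = re.compile(r"(?=([ACGT]{20}[ACGT]GG))")
SITE_LEN = 23

_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq: str) -> str:
    """Return the reverse complement of an upper-case DNA string."""
    return seq.translate(_COMPLEMENT)[::-1]


def find_targets(seq: str):
    """Return sorted (start, strand, site) tuples for candidates on both strands.

    `start` is always given in forward-strand coordinates.
    """
    seq = seq.upper()
    hits = [(m.start(), "+", m.group(1)) for m in TARGET.finditer(seq)]
    for m in TARGET.finditer(reverse_complement(seq)):
        # Map the offset on the reverse strand back to the forward strand.
        start = len(seq) - m.start() - SITE_LEN
        hits.append((start, "-", m.group(1)))
    return sorted(hits)
```

Because the pattern is anchored only by the PAM, a single linear scan per strand is enough, which is consistent with the step completing in seconds for a gene-sized input.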

Image_2

GT-Scan2 is served directly from OSS, making it a static web app with no server-side processing. A JavaScript framework in the browser retrieves the dynamic content (such as job results and parameters) from a NoSQL database (TableStore) through API calls routed via API Gateway.

Applying Serverless Computing

All potential targets need to be evaluated for their off-target risk using the efficient string matching tool Bowtie. Although Bowtie requires only a reduced representation of the 3-billion-letter genomic sequence, the index files for the human genome still total 915 MB. Alibaba Cloud Function Compute supports temporary storage of this size, but the implementation nevertheless divides the genome into smaller blocks to enable parallel processing. For an average run, GT-Scan2 therefore triggers 200-500 individual Function Compute functions, which simultaneously update the scores for the different putative targets in TableStore. During this process, the front end polls this table via API Gateway and updates the web page as results come in, eliminating the need for server-side compute.
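The divide-and-process-in-parallel idea can be illustrated with a small local sketch. It is not how GT-Scan2 or Bowtie work internally: a naive sliding-window mismatch count stands in for Bowtie, a thread pool stands in for the fan-out of Function Compute invocations, and all names and the block size are assumptions chosen for illustration.

```python
from concurrent.futures import ThreadPoolExecutor


def hamming(a: str, b: str) -> int:
    """Number of mismatching positions between two equal-length strings."""
    return sum(x != y for x, y in zip(a, b))


def split_blocks(genome: str, block_size: int, site_len: int):
    """Yield blocks that overlap by site_len - 1 characters.

    The overlap ensures a site spanning a block boundary is seen exactly once.
    Requires block_size >= site_len.
    """
    step = block_size - (site_len - 1)
    for start in range(0, len(genome) - site_len + 1, step):
        yield genome[start:start + block_size]


def count_in_block(block: str, target: str, max_mismatches: int) -> int:
    """Count windows in one block within max_mismatches of the target."""
    n = len(target)
    return sum(
        hamming(block[i:i + n], target) <= max_mismatches
        for i in range(len(block) - n + 1)
    )


def count_matches(genome: str, target: str, block_size: int = 1000,
                  max_mismatches: int = 3) -> int:
    """Sum per-block counts; each block is an independent unit of work."""
    blocks = list(split_blocks(genome, block_size, len(target)))
    with ThreadPoolExecutor() as pool:
        counts = pool.map(
            lambda b: count_in_block(b, target, max_mismatches), blocks
        )
        return sum(counts)
```

Note that the total includes the intended site itself if it occurs in the searched sequence, so an off-target count would subtract it. Since blocks share no state, each one could equally be handled by a separate function invocation that writes its partial result to a table.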

Alibaba Cloud Function Compute provides a framework for developing a future-ready software package that can support medical genome engineering applications. It scales instantly at run time by spawning the appropriate number of functions to cope with the varying complexity of different genes. Other benefits include paying only for storage when no compute is triggered; jobs not competing with web server resources, because the website is a static page whose dynamic content is updated through Angular 2 and API Gateway; and not having to maintain compute instances (for example, applying OS security patches).

Improvements

The GT-Scan deployment benefited from architectural patterns and services specific to Alibaba Cloud. Some of them are listed below.

  • Uses the asynchronous invoke method instead of queue-based triggers. This shortens invoke times and removes the dependency on a message queue.
  • Applies batch reads and writes when accessing data in the NoSQL database, making I/O more efficient.
  • Streams all logs to Alibaba Cloud Log Service, which makes it easier to troubleshoot issues with the workflow operations. Having the logs in a single location lets users pinpoint issues without spending time logging in to servers or individual service consoles.
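The batching pattern from the list above can be sketched generically. This is an assumption-laden illustration rather than the deployment's code: `write_batch` is a hypothetical callable standing in for whatever batch-write call the database client provides, and the default batch size of 200 is an assumed value that should be checked against the service's current per-request limits.

```python
from itertools import islice
from typing import Callable, Iterable, Iterator, List, TypeVar

T = TypeVar("T")


def chunked(items: Iterable[T], size: int) -> Iterator[List[T]]:
    """Yield consecutive lists of at most `size` items."""
    it = iter(items)
    while True:
        batch = list(islice(it, size))
        if not batch:
            return
        yield batch


def write_in_batches(rows: Iterable[T],
                     write_batch: Callable[[List[T]], None],
                     batch_size: int = 200) -> int:
    """Send rows through `write_batch` in groups; return the number of calls.

    `write_batch` is a hypothetical placeholder for a real batch-write request.
    """
    calls = 0
    for batch in chunked(rows, batch_size):
        write_batch(batch)
        calls += 1
    return calls
```

For example, writing 450 target rows this way takes three requests instead of 450 single-row writes, which is where the I/O saving comes from.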

Image_3

Automated Deployments

The open-source Fun tool (Fun with Serverless) will enable automated deployments of API Gateway and Function Compute resources, making deployments of new GT-Scan versions a breeze. The tool deploys components that are defined in a simple YAML file.

What's Next?

Analytics

Leveraging Alibaba Cloud's award-winning big data platform to create a machine learning pipeline will enable sophisticated analyses to be integrated into the application. This is of specific relevance for personalized health applications, which identify editing strategies for individual patients.

Image_4

Log Analysis

Alibaba Cloud Log Service allows log files to be exported for later analysis, using either Alibaba Cloud's big data platform or the existing open-source analysis platforms at CSIRO's disposal. The exported logs can then be plugged into an existing machine learning pipeline to learn from the usage patterns of the GT-Scan application.

Image_5

Ref

https://community.alibabacloud.com/blog/gt-scan2%253A-bringing-bioinformatics-to-alibaba-cloud_593841?spm=a2c65.11461537.0.0.62ef5355hBhpcO
