How AHI Fintech and DataVisor are Securing Data through AI and Big Data

Summary: With the growing threat of cyber-attacks, organizations such as AHI Fintech and DataVisor are using Big Data and AI to help customers in China protect their data.

Competition in the field of financial risk control has increased sharply over the past year. Many budding enterprises now find themselves fighting a battle on two fronts: data acquisition capabilities and algorithm technology.

In June 2017, China's Cyber Security Act took effect. Companies that crawl users' mobile phones for data without prior authorization may now face serious legal consequences, which can include a 7-year jail sentence for the legal representatives of a convicted company. Furthermore, many businesses in the field of data acquisition and transactional data are now facing thorough investigation.

With the gray data industry gone, the risk control industry has an opportunity to move toward healthy compliance, despite the technical challenges it still faces.

A Close Race: Algorithms or Data?

Despite increasingly thorough regulations, cyber-attack rates are still on the rise. This makes user data even more precious, especially for businesses that rely on integrating external data. Instead of circumventing regulations, enterprises have now shifted their focus to two major aspects of big data analysis: algorithms and modeling. These two areas are crucial in the field of big data, and their emergence has given rise to a group of new risk-control companies in China.

Huang Ling, CEO of AHI FinTech, holds a Ph.D. in computer science from the University of California, Berkeley, and is a part-time professor at the Interdisciplinary Information Institute at Tsinghua University. He described his work as "a global war": "In the risk control industry, our opponents are huge and include a worldwide black-market production chain."

Since the very beginning, risk control has faced a global opponent in the form of a community of hackers. These hackers invade other people's phones and computers through malicious software. On the one hand, they can access confidential data; on the other hand, they can use this compromised information to open fake accounts. They can also fake all kinds of social interactions, such as leaving reviews or making purchases. Eventually they have a seemingly normal account with a lot of friends and a good credit history. Ultimately, they use these accounts to apply for a variety of financial products.

Meanwhile, the core of risk control is to use relevant data for modeling analysis that eliminates fake users, and then to evaluate real users' repayment ability, repayment willingness, and overall risk.

The data risk control companies that have sprung up over the past few years compete on the two resources this work relies on: algorithms and data acquisition capabilities.

Currently, in the face of massive worldwide "gangs" of hackers and an enormous black-market production chain, the risk control and anti-fraud solutions on the market lag an obvious step behind in algorithmic technology. Solution providers often use device fingerprints, blacklists and whitelists, rule systems, or machine learning models trained on labeled data to detect fraudulent activities. Some of these methods conduct only shallow analyses, so malicious opponents can easily circumvent and deceive them. Others use machine learning but rely on labeled historical data to train models. Such labeled data is often scarce and represents only fraudulent activities that have occurred in the past, so the models trained on it are not accurate enough to cope with ever-changing fraudulent practices.

Meanwhile, a vast number of risk control companies rely primarily on strong data acquisition and integration capabilities. But because of malicious crawling, the purchase of hacked data, and similar practices, the consolidated data ends up including a substantial amount of individuals' confidential information, such as ID card numbers, phone numbers, bank cards, savings accounts, and exact home addresses. The solutions that currently exist in the industry are extremely dependent on this kind of data, even though its use infringes on personal privacy and its legality has been heavily criticized. At the same time, the amount of data available has a profound impact on modeling accuracy, so omitting this sensitive data will put the algorithms' ability to perform to the test.

Acquisition of Risk Control Data: Is Magnitude or Scenario More Important?

Can more sophisticated algorithms make up for the loss of a large amount of sensitive data? Last month, in an exclusive interview with Big Data Digest in New York, HC Financial Service Group CFO Shen Yutong (Tony Shen) said we should be more cautious when using data that is not directly related to lending behavior or credit behavior.

"Some people frequently shop online, but because of their frequent shopping habits, they find themselves short of cash and in need of a loan. It is necessary to realize that regular online payments do not always mean that the buyer is a good person to give credit to. Therefore, they fail to get loans when applying for greater amounts of credit."

Social and online-shopping data, while valuable, are not necessarily more useful than data that is directly related to a person's financial situation. Mr. Shen's attitude reflects the caution of traditional financial experts toward Internet risk control. It also embodies a problem the industry currently faces: when acquiring risk control data, is the volume of data more important than the data scenarios, or vice versa?

On this issue, Huang Ling clearly favors scenarios over volume: "I think it depends on what kind of data we're talking about and how we use those data. The data which we're crawling isn't necessarily going to make much of a difference here. For customer data and application scenarios, it is important to help clients mine the data more accurately and closer to the goal of the data."


Core customers for financial risk control are Internet companies, Internet financial enterprises, and other financial institutions. What these institutions have in common is a large number of accounts. With accounts at the center, we can acquire a lot of personal information including bank balances, purchasing history, borrowing history, and more. The work of risk control then is to build models around this data and make assessments of the user’s ability and willingness to make repayments, proper loan amount, etc.

Huang Ling believes that desensitizing data for user behavior models can also help realize the goal of risk control and fraud prevention. "When executing behavior analysis, we usually look at the person's social relationships, phone records, and e-commerce behavior. Here, behavior refers to where, when, and on what device the user registers and logs on to the website, and what they did after logging in (the pages they visited, the products they purchased, the friends they added, who they spoke to, etc.). Even though this data includes some sensitive information (such as who your friends are), the data is desensitized. This data is then fed into a graph algorithm, and user association analysis is used to identify the hidden information related to the interaction between users."
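As a rough illustration of the idea described above, the following sketch hashes raw identifiers before analysis and then groups accounts that share a device or IP address. It is only a minimal example under stated assumptions, not AHI FinTech's actual method: the salt, the `(account_id, device_id, ip)` event layout, the function names, and the `min_size` threshold are all hypothetical, and a salted hash is just one simple form of desensitization.

```python
import hashlib
from collections import defaultdict

SALT = b"demo-salt"  # hypothetical value; a real system would manage this secret carefully


def desensitize(value: str) -> str:
    """Replace a raw identifier with a truncated, salted SHA-256 digest."""
    return hashlib.sha256(SALT + value.encode("utf-8")).hexdigest()[:12]


def find_linked_groups(events, min_size=3):
    """Group accounts that share any device or IP, using union-find.

    events: iterable of (account_id, device_id, ip) tuples (assumed layout).
    Returns a list of sets of desensitized account tokens, keeping only
    groups with at least min_size accounts.
    """
    parent = {}

    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving keeps trees shallow
            x = parent[x]
        return x

    def union(a, b):
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_b] = root_a

    accounts = set()
    for account_id, device_id, ip in events:
        acct = "acct:" + desensitize(account_id)
        accounts.add(acct)
        # Only hashed tokens enter the graph; raw identifiers are never stored.
        union(acct, "dev:" + desensitize(device_id))
        union(acct, "ip:" + desensitize(ip))

    groups = defaultdict(set)
    for acct in accounts:
        groups[find(acct)].add(acct)
    return [group for group in groups.values() if len(group) >= min_size]
```

For example, if accounts `u1` and `u2` share a device and `u2` and `u3` share an IP address, all three land in one group even though `u1` and `u3` never appear together, which is the kind of hidden association the quote refers to.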

Huang Ling and his team at AHI FinTech Quest

"We have almost no sensitive data belonging to users. What matters more to us is using non-sensitive data, specifically the client's behavior data. We combine it with the user's usage scenarios and apply AI and Big Data methods to help the client get value out of their data, then create the most appropriate risk control model for the client's user scenario. Subsequently, we help them achieve the best possible testing results on their platform. This way, we can automatically detect abnormal connections among tens of millions of users, produce risk control warnings, and guard against organized and systematic risks. We do all of this without infringing on users' privacy and without needing prior knowledge of the type or characteristics of the attack."

Redefining Risk Control: Starting from the Data Source

This kind of data acquisition method also places stricter demands on companies in the industry.

"When we do risk control modeling, the first thing we look at is the quality of the data, including whether or not the data is complete and whether or not it includes data that is relevant to risk control."

Huang Ling believes that risk control happens not only during modeling and testing but also on the client's side, beginning with the collection of data. When dealing with customers, AHI FinTech focuses on helping its clients elevate their related abilities from a service perspective:

First, after a risk control signal is sent to the client's platform, the platform can block users with a high risk value. For users with lower but not negligible risk values, the system can merge data from other dimensions into its rules and models, perform further processing and refining, and then re-process the user.


Furthermore, the system provides feedback during each step of data collection as well as the discovery of fraudulent trends and modeling.

The system carries out these two aspects in parallel. If the collected data does not meet quality standards, the client is asked to adjust its collection and is given feedback on the specific problems. Even if the gaps in the data cannot be filled right away, they have to be fixed as soon as possible.
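The first of the two steps described above, blocking high-risk users and re-processing borderline ones with data from other dimensions, can be sketched as follows. This is only an illustration: the thresholds, the action names, and the simple averaging used to blend in extra signals are assumptions made for the example, not details given by AHI FinTech.

```python
def route_user(risk_score, block_at=0.9, review_at=0.5):
    """Map a risk score in [0, 1] to an action.

    block_at and review_at are hypothetical thresholds.
    """
    if not 0.0 <= risk_score <= 1.0:
        raise ValueError("risk_score must be between 0 and 1")
    if risk_score >= block_at:
        return "block"
    if risk_score >= review_at:
        return "review"  # borderline: gather more data and score again
    return "allow"


def rescore(risk_score, extra_signals):
    """Blend the original score with extra signals by plain averaging.

    extra_signals: iterable of scores in [0, 1] from other data dimensions.
    Averaging is a placeholder for whatever rules or models a platform uses.
    """
    values = [risk_score, *extra_signals]
    return sum(values) / len(values)
```

A borderline user with a score of 0.6 would first be routed to "review"; after blending in additional signals, the new score is passed through `route_user` again to reach a final decision.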

"We make recommendations to the client on how to collect data, based on our own risk control and fraud prevention experience. Working with us is not just a matter of us helping you meet fraud detection requirements. We also give the client a lot of feedback and keep open lines of communication. We offer them comprehensive consulting and service, from system applications to data collection and risk control."

The opponent of financial risk control is the enormous black-market production chain, which makes the matter extremely complicated. Organizations are introducing innovative technologies like AI and machine learning to the field in droves, but using them correctly is certainly not a simple matter.

Most of the solutions currently on the market rely on collecting massive amounts of data and then applying rule systems or models generated by supervised machine learning. These solutions have an obvious shortcoming: the models always depend on training with historical labeled data. However, we can produce labels only after we have suffered a fraud attack. We create them at the cost of our own sweat, blood, and tears. As our goal is for these kinds of attacks to be increasingly rare, we find ourselves lacking data for training models. Models trained on these labels are never good enough, because they can only represent fraudulent behavior that we've seen in the past. When fraudsters invent new methods, models that rely on labeled training data have difficulty stopping them quickly and accurately, which often leads to massive losses.

Huang Ling's team uses semi-supervised learning to generate models from data with only a few, or even no, labels. This significantly reduces the cost of acquiring new labels, increases data usage, and produces higher quality models. The team uses an active machine learning platform, the massive data processing capabilities of a Big Data system that combines human and artificial intelligence, and the experience of risk control experts to help the artificial intelligence automatically learn previously unknown fraud tactics. The system can also track new fraud methods and constantly adapt to an ever-changing environment to create anti-fraud machine learning models. This makes it significantly more difficult for fraudsters to evade detection.
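To make the idea of learning from only a few labels (tags) more concrete, here is a small sketch of label propagation, one common semi-supervised technique. The article does not say which algorithm the team uses, so this is a stand-in chosen for illustration; the function name, the edge list format, and the neutral starting score of 0.5 are all assumptions.

```python
from collections import defaultdict


def propagate_labels(edges, seeds, iterations=20):
    """Spread a few known labels across a user graph.

    edges: iterable of (a, b) undirected links between users (assumed format).
    seeds: dict mapping a user to a known label, 1.0 for fraud, 0.0 for legitimate.
    Returns a dict mapping every user to a fraud score in [0, 1].
    """
    neighbors = defaultdict(set)
    for a, b in edges:
        neighbors[a].add(b)
        neighbors[b].add(a)

    # Unlabeled users start at a neutral 0.5; labeled users keep their label.
    scores = {node: seeds.get(node, 0.5) for node in neighbors}
    for node, label in seeds.items():
        scores.setdefault(node, label)  # labeled users that have no links

    for _ in range(iterations):
        updated = {}
        for node, score in scores.items():
            linked = neighbors.get(node, ())
            if node in seeds:
                updated[node] = seeds[node]  # clamp known labels
            elif linked:
                # An unlabeled user takes the average score of its neighbors.
                updated[node] = sum(scores[n] for n in linked) / len(linked)
            else:
                updated[node] = score
        scores = updated
    return scores
```

With a single confirmed fraud account as a seed, users closely linked to it drift toward a high score while users linked only to known-good accounts drift toward zero, so a handful of labels can inform scores for many unlabeled users.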

The Black-Market Production Chain in the Risk Control Industry

In addition to AHI FinTech, a Silicon Valley company in the risk control field, DataVisor, takes a similar approach. In 2014, Huang Ling left his 7-year career as a senior researcher at Intel to become a founding member and Director of Data at DataVisor, where he led the company's entire machine learning, user behavior analysis, and credit modeling system. There he was part of the next generation of risk control in Silicon Valley and became one of the best-known experts in unsupervised risk control methods.


Huang Ling has always believed that risk control in China is not particularly different from that in Silicon Valley. The black-market operations faced by the anti-fraud industry form an entire production chain, run by gangs of sorts spread around the globe. This chain stretches from Eastern Europe to America to China to India. It runs from the security attack software at the top, to the people who use this software to take control of people and phones around the world, to those who use these people and phones to create fake users that carry out all kinds of fraudulent activity and reap the benefits.

Therefore, to a certain extent, you can say that risk control and anti-fraud work are universal. Several Internet companies and financial institutions in China are also facing attacks from abroad, and a lot of attacks perpetrated in America are conducted via China, India, Africa, or one of several South East Asian countries.

As a result, the most significant differences between America and China likely lie in political policy and industry development:

Because the credit system in America is sound, the cost of committing fraud is relatively high. Whether the user is defrauding a bank or an online merchant, these kinds of activities usually affect the user's credit score through a variety of channels. In China, this system is still under-developed, so in many situations a user's credit rating with the central bank does not reflect fraudulent online financial activity, and the cost of committing fraud is comparatively low. As a result, examples of large-scale fraud are more abundant in China than they are in the States, and they tend to be harder to handle effectively.

Furthermore, the development of the industry has been different in China and America. China’s mobile applications and Internet financial industries have grown to be larger than their counterparts in America, so there is more fraudulent activity surrounding these two sectors than in America.

"After coming back, we noticed that in China, especially in fields related to finance, these fraud gangs are larger and craftier than they are in America. They also use more real people to commit fraud, making them harder to detect and requiring more machine learning and AI modeling methods," Huang Ling said.

On a global battlefield such as this, the addition of artificial intelligence algorithm experts and security scientists is all the more valuable.

Speaking about his entrepreneurial mindset, Huang Ling said, "I have been researching and practicing in the fields of artificial intelligence algorithms and Internet security for several years. I hope that the skills and experience I have acquired over this time will be useful in the fields of financial risk control and fraud prevention. I also wish to provide a complete set of systems and services to accompany financial and Internet products and thereby achieve a more secure, honest, and fair industry environment."

Aside from Huang Ling, the other co-founder of AHI FinTech—chief scientist Xu Wei—also hails from academia. Xu Wei served as a cross-institute adjunct professor at Tsinghua University.


Huang Ling regards the entrepreneurship of AI scientists in the field of risk control as a good thing, because they possess the skills and an understanding of algorithms. He also wants to give them a chance to really participate in the industry rather than just being cogs in the machine.
