ETH Zurich DS3Lab: Building Data-Centric Machine Learning Systems


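The 机器之心 (Jiqizhixin) Knowledge Station is partnering with leading international labs and research teams to launch a series of technical livestreams that systematically present each lab's work, offering another entry point into top international teams and their frontier research. Click "Read the original" to follow the series.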


From December 15 to December 22, the latest installment of "机器之心走近全球顶尖实验室" (Jiqizhixin Visits the World's Top Labs) features talks from the DS3Lab at ETH Zurich.

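The DS3Lab at ETH Zurich (https://ds3lab.inf.ethz.ch/) consists of Assistant Professor Ce Zhang (张策) and 16 PhD students and postdocs. The lab is part of the Systems Group (https://systems.ethz.ch/) in the Department of Computer Science. We believe that enabling everyone to develop safe and trustworthy machine learning applications economically and efficiently is a key step for machine learning to drive social progress. With this as its core, DS3Lab's research focuses on building next-generation machine learning platforms and systems that are data-centric, human-centric, and efficiently scalable.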


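DS3Lab pursues two main research directions that together cover the whole lifecycle of machine learning development, execution, and operations: 1) the Ease.ML project, which studies how to design, manage, and accelerate data-centric ML development, execution, and operations workflows; 2) the ZipML project, which designs and implements efficient, scalable ML systems for new software and hardware environments (such as cloud computing, distributed settings, databases, and new hardware).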


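DS3Lab's research has influenced the design of several enterprise-grade ML platforms or their subsystems, such as the DB4AI system of the openGauss database, the Bagua distributed ML platform, and the Persia distributed recommender system. The lab also collaborates with industry (for example Microsoft, Google, Alibaba, eBay, Oracle, IBM, and Kuaishou) to jointly develop prototype systems. DS3Lab is also committed to promoting the positive social impact of AI, working with international organizations (such as the World Wide Fund for Nature) to apply machine learning to climate, geography, healthcare, and other fields.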

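In recent years, DS3Lab members have received research awards including the Google Focused Research Award, ICLR Outstanding Paper Award, SIGMOD Best Paper Award, and SIGMOD Research Highlight Award. After graduating or completing their postdoc, many DS3Lab PhD students and postdocs have gone on to found their own research groups at well-known universities, such as the University of Copenhagen in Denmark, and begin new academic careers.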

From December 15 to December 22, 11 speakers from the DS3Lab at ETH Zurich will give 6 sessions on building data-centric machine learning systems. Details follow:

December 15, 20:00-21:30

Topic 1: Building ML Systems in the Era of Data-centric AI

Speaker: Ce Zhang

Biography: Ce is an Assistant Professor in Computer Science at ETH Zurich. The mission of his research is to make machine learning techniques widely accessible, while being cost-efficient and trustworthy, to everyone who wants to use them to make our world a better place. He believes in a systems approach to enabling this goal, and his current research focuses on building next-generation machine learning platforms and systems that are data-centric, human-centric, and declaratively scalable. Before joining ETH, Ce was advised by Christopher Ré. He finished his PhD at the University of Wisconsin-Madison and spent another year as a postdoctoral researcher at Stanford. His work has received recognition such as the SIGMOD Best Paper Award, SIGMOD Research Highlight Award, and Google Focused Research Award, and has been featured and reported by Science, Nature, the Communications of the ACM, and various media outlets such as The Atlantic, WIRED, and Quanta Magazine.

Abstract: In this talk, I will provide a broad overview of the research activities of DS3Lab, centered on improving the scalability and usability of machine learning systems and platforms. Many of these activities will be discussed in detail in the following six sessions.


Topic 2: Ease.ML/DataScope: Data Debugging for ML

Speaker: Bojan Karlas

Biography: Bojan is a fourth-year PhD student at the Systems Group of ETH Zurich advised by Prof. Ce Zhang. His research revolves around data management systems for machine learning. Particularly, he works on building principled tools for managing machine learning workflows, data debugging and data cleaning for ML. His industry experience involves research internships at Microsoft, Oracle and Logitech, as well as a two-year software engineer position at the Microsoft Development Center Serbia where he worked in the SQL Server Parallel Data Warehouse team. Prior to joining ETH, he got a computer science master's degree at EPFL in Lausanne and a software engineering bachelor's degree at ETF Belgrade.

Abstract: Which training examples in my training set are most "guilty" of making my trained machine learning (ML) model inaccurate, unfair, or non-robust to adversarial attacks? If some of my training examples contain missing information, which incomplete training example should I repair first? Answering questions like these will have a profound impact on future ML systems that guide users through both development (MLDevs) and operation (MLOps). This requires us to understand the fundamental concept of data importance with respect to model training.

In this talk, we focus on two types of data errors (missing values and wrong values) and examine principled methods for guiding the attention of our data debugging effort. For each type of error, we go over the representation of data importance that we focus on in our work: information gain for missing data, and the Shapley value for wrong data. There have been intensive recent efforts to speed up and scale up the calculation of these importance measures via MCMC or proxy models such as k-nearest neighbors. However, all these efforts deal with ML training in isolation and ignore the data processing pipelines (or queries) that precede ML training in most, if not all, ML applications. This significantly limits the application of these techniques in real-world scenarios. Our work takes the first step towards jointly analyzing the data processing pipeline together with ML training, and we develop novel algorithms for computing these importance measures in polynomial time.
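To make the idea of Shapley-based data importance concrete, here is a minimal sketch of the sampling-style estimation the abstract alludes to. It is not the DataScope algorithm: it uses plain permutation sampling, and the 1-nearest-neighbor utility, the toy one-dimensional data, and the function names are all assumptions made for illustration.

```python
import random

def knn1_accuracy(train, valid):
    """Accuracy of a 1-nearest-neighbor classifier; 0.0 for an empty training set."""
    if not train:
        return 0.0
    correct = 0
    for x, y in valid:
        nearest = min(train, key=lambda p: abs(p[0] - x))
        correct += int(nearest[1] == y)
    return correct / len(valid)

def monte_carlo_shapley(train, valid, rounds=2000, seed=0):
    """Estimate each training point's Shapley value by averaging its
    marginal gain in validation accuracy over random orderings."""
    rng = random.Random(seed)
    values = [0.0] * len(train)
    for _ in range(rounds):
        order = list(range(len(train)))
        rng.shuffle(order)
        subset, prev = [], 0.0  # utility of the empty set is 0.0
        for i in order:
            subset.append(train[i])
            cur = knn1_accuracy(subset, valid)
            values[i] += cur - prev
            prev = cur
    return [v / rounds for v in values]

# Toy data (hypothetical): (feature, label); the last training point is mislabeled.
train = [(0.0, 0), (0.4, 0), (1.0, 1), (1.4, 1), (0.3, 1)]
valid = [(0.1, 0), (0.32, 0), (0.95, 1), (1.3, 1)]
values = monte_carlo_shapley(train, valid)
for point, value in zip(train, values):
    print(point, round(value, 3))
```

In this toy setup the mislabeled point receives the lowest (negative) value, which is the signal a data debugging tool would use to decide what to inspect first. Each round retrains the model once per training point, which is why faster estimators and proxy models matter in practice.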


December 16, 20:00-21:00

Topic: Ease.ML/ModelPicker and Ease.ML/CI: Model Testing and CI/CD for Machine Learning

Speakers: Cedric Renggli and Merve Gurel

Biography:

  • Cedric is a PhD student supervised by Ce Zhang in the Systems Group at ETH Zurich. His main research interest lies in the foundations of usable, scalable and efficient data-centric systems that support all kinds of interactions in a machine learning model lifecycle, broadly known as MLOps. This notably led to the definition of new engineering principles, such as how to run a feasibility study for ML application development, or how to perform continuous integration (CI) of ML models with statistical guarantees. Cedric is furthermore working on efficient methods to enable model search in pre-trained model collections, which typically serves as a starting point for solving new machine learning tasks using transfer learning. Cedric holds a bachelor's degree from the Bern University of Applied Sciences and received his MSc in Computer Science from ETH Zurich in 2018. His work on Efficient Sparse AllReduce For Scalable Machine Learning was awarded the silver medal of ETH Zurich for an outstanding master's thesis. During his PhD, Cedric has worked as a research intern and student research consultant at Google Brain in Zurich.
  • Merve is a PhD candidate in the Systems Group in the Department of Computer Science at ETH Zurich. Her research goal is to form a principled understanding of many aspects of machine learning systems, from hardware to label efficiency and robustness. Using this knowledge, she aims to build theoretically sound, usable, robust and scalable machine learning systems. Recently, her efforts have been recognized by Google, from which she received the 2021 Google Scholarship award. She obtained her master's degree from the School of Computer and Communication Sciences at EPFL, where she was a research scholar at the Information Theory Laboratory. There she worked on topics in advanced probability and information theory, such as convergence rates of bounded martingales and rate-distortion theory. During and after her master's studies, she was also a researcher at IBM Research in Zurich, where she worked on building denoising algorithms for modern radio telescopes.

Abstract: Machine learning projects are usually not complete once a model that satisfies the required performance indicators (e.g., accuracy) has been trained. Rather, the model enters the operational phase, where multiple new challenges can occur. In this talk, we focus on two main data-centric challenges in the operational stage of an ML model lifecycle: (1) how to continuously integrate and test a model so that it fulfills given requirements with strong statistical guarantees, and (2) how to select the best-generalizing model for a target classification task by querying as few labels as possible. We illustrate the benefits of both in several practical scenarios.

In the first part of this talk, we focus on continuous integration, an indispensable step of modern software engineering practice for systematically managing the life cycle of system development. Developing a machine learning model is no different: it is an engineering process with a life cycle that includes design, implementation, tuning, testing, and deployment. However, most, if not all, existing continuous integration engines do not support machine learning as a first-class citizen. In this talk we present ease.ml/ci, to the best of our knowledge the first continuous integration system for machine learning that provides statistical guarantees. The challenge of building ease.ml/ci is to provide rigorous guarantees, e.g., an error tolerance of a single accuracy point with 0.999 reliability, with a practical amount of labeling effort, e.g., 2K labels per test. We design a domain-specific language that allows users to specify integration conditions with reliability constraints, and develop simple, novel optimizations that can lower the number of labels required by up to two orders of magnitude for test conditions popularly used in real production systems.
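To see why labeling effort is the crux, it helps to work out what a textbook bound demands for the example guarantee above. The sketch below applies the standard two-sided Hoeffding inequality; it is a baseline calculation under that assumption, not the estimator used inside ease.ml/ci, and the function name is hypothetical.

```python
import math

def hoeffding_labels(epsilon, delta):
    """Number of i.i.d. labeled test examples needed so that the measured
    accuracy is within `epsilon` of the true accuracy with probability at
    least 1 - delta, using the two-sided Hoeffding bound
    n >= ln(2 / delta) / (2 * epsilon^2)."""
    return math.ceil(math.log(2.0 / delta) / (2.0 * epsilon ** 2))

# One accuracy point of tolerance (0.01) with 0.999 reliability (delta = 0.001).
baseline = hoeffding_labels(epsilon=0.01, delta=0.001)
print(baseline)  # 38005
```

Under this bound a single test needs roughly 38K fresh labels, which illustrates the gap between a naive guarantee and the practical budget of about 2K labels per test mentioned in the abstract.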

In the second part, we present Model Picker, a set of active model selection strategies for choosing the model with the best generalization capability for a target task. Specifically, we ask: “Given k pre-trained classifiers and a stream or pool of unlabeled data examples, how can we actively decide whose label to query so that we can distinguish the best model from the rest while making a small number of queries?” To that end, we introduce two active model selection strategies under the umbrella of Model Picker, for pool-based and stream-based settings. We also establish their theoretical guarantees and extensively demonstrate their effectiveness in our experimental studies. We show that competing methods often require up to 2.5x more labels than Model Picker to reach the same performance.
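The intuition behind active model selection can be sketched with a much simpler rule than the one Model Picker uses: a label is only informative when the candidate models disagree on the example. The following is a deliberately simplified, deterministic illustration under that assumption; the function name, the `label_oracle` callback, and the toy predictions are hypothetical, and the actual Model Picker strategies and their guarantees are not reproduced here.

```python
def select_model_by_disagreement(predictions, label_oracle, budget):
    """predictions[i][m] is model m's predicted label for example i.
    Query the true label only when the models disagree, and return the
    model with the fewest mistakes on the queried examples."""
    num_models = len(predictions[0])
    mistakes = [0] * num_models
    queries = 0
    for i, preds in enumerate(predictions):
        if len(set(preds)) == 1 or queries >= budget:
            continue  # agreement carries no ranking signal, or budget is spent
        label = label_oracle(i)
        queries += 1
        for m, p in enumerate(preds):
            mistakes[m] += int(p != label)
    best = min(range(num_models), key=lambda m: mistakes[m])
    return best, queries, mistakes

# Toy stream: three classifiers, six unlabeled examples, hidden true labels.
true_labels = [0, 1, 1, 0, 1, 0]
predictions = [
    [0, 0, 0],
    [1, 0, 1],
    [1, 1, 1],
    [0, 0, 1],
    [1, 0, 1],
    [0, 0, 0],
]
best, queries, mistakes = select_model_by_disagreement(
    predictions, lambda i: true_labels[i], budget=10
)
print(best, queries, mistakes)  # 0 3 [0, 2, 1]
```

Only three of the six examples are labeled here, yet the ranking of the models is the same as if all six had been labeled.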


December 17, 20:00-21:00

Topic: Ease.ML/Snoopy: Automatic Feasibility Study for Machine Learning

Speaker: Luka Rimanic

Biography: Luka Rimanic earned his PhD in 2018 in the mathematics department at the University of Bristol, after completing Part III at the University of Cambridge. At the time, he was exploring the field of additive combinatorics, using a number of other fields in conjunction. Upon completing his PhD, Luka had a spell in industry as a consultant for machine learning tasks before joining the DS3Lab, Systems Group at ETH, in October 2019. He currently works on several projects concerning differential privacy, transfer learning and the usability of machine learning systems, with a particular focus on the theory behind such systems. In recent years his work has been published in various top-tier conferences.

Abstract: A common problem that domain experts encounter when using today's AutoML systems is what we call unrealistic expectations: users face a very challenging task with a noisy data acquisition process, while being expected to achieve startlingly high accuracy with machine learning. Many of these projects are predestined to fail from the beginning. In traditional software engineering, this problem is addressed via a feasibility study, an indispensable step before developing any software system. In practice, if such a study were done by human machine learning consultants, they would first analyze a representative dataset for the defined task and assess the feasibility of the target accuracy. If the target is not achievable, one can then explore alternatives by refining the dataset or the acquisition process, or by investigating different task definitions. In this talk, we ask: Can we automate this feasibility study process?

We will present a system whose goal is to perform an automatic feasibility study before building machine learning applications. We approach this problem by estimating the irreducible error of the underlying task, also known as the Bayes error. Over the years, we have conducted a series of works that provide important insights into many decisions taken when designing our system. These consist of a novel framework for evaluating Bayes error estimators, theory behind applying feature transformations to improve the performance of certain estimators, and system optimizations that involve a multi-armed bandit approach together with an improved version of the successive-halving algorithm under certain assumptions. We will describe each of these ingredients, culminating in end-to-end experiments that show how users are able to save substantial time and money.
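One classical way to reason about the Bayes error from data is through the 1-nearest-neighbor error and the Cover-Hart inequality. The sketch below uses that route as an assumption; the abstract does not say which estimator the system relies on, the bound is asymptotic and only approximate on small samples, and the function names and toy data are hypothetical.

```python
import math

def loo_1nn_error(points):
    """Leave-one-out error of a 1-nearest-neighbor classifier on (x, label) pairs."""
    errors = 0
    for i, (x, y) in enumerate(points):
        others = points[:i] + points[i + 1:]
        nearest = min(others, key=lambda p: abs(p[0] - x))
        errors += int(nearest[1] != y)
    return errors / len(points)

def cover_hart_lower_bound(err_1nn, num_classes):
    """Invert R_1nn <= R*(2 - C/(C-1) R*) to get a lower bound on the Bayes error R*."""
    c = num_classes
    inner = max(0.0, 1.0 - c / (c - 1) * err_1nn)
    return (c - 1) / c * (1.0 - math.sqrt(inner))

def target_is_plausible(points, num_classes, target_accuracy):
    """A target is plausible only if its error budget is at least the estimated Bayes error."""
    estimate = cover_hart_lower_bound(loo_1nn_error(points), num_classes)
    return 1.0 - target_accuracy >= estimate

# Toy noisy binary task: two of six points sit in the "wrong" cluster.
data = [(0.0, 0), (0.1, 0), (0.25, 1), (1.0, 1), (1.1, 1), (1.25, 0)]
print(loo_1nn_error(data))                 # 2 of 6 points misclassified
print(target_is_plausible(data, 2, 0.95))  # False
print(target_is_plausible(data, 2, 0.75))  # True
```

A feasibility check of this kind runs before any model is trained, which is what lets a user revisit the data or the task definition early instead of after an expensive AutoML run.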


December 20, 20:00-21:00

Topic: Bagua & Persia: Distributed ML Systems

Speakers: Shaoduo Gan and Binhang Yuan

Biography:

  • Shaoduo Gan received his PhD from ETH Zurich in 2021. His research focuses on distributed learning systems and machine learning frameworks. His work has been published at VLDB, SIGMOD, IEEE TPDS, ICML and NeurIPS.
  • Binhang Yuan is a postdoctoral researcher in the Computer Science Department, ETH Zürich. He obtained his Ph.D. in Computer Science from the Computer Science Department, Rice University. His research interests include databases for machine learning, distributed machine learning, and recommendation systems.

Abstract: The increasing scalability and performance of distributed machine learning systems has been one of the main driving forces behind the rapid advancement of machine learning techniques. In this talk, we will introduce two recently released open-source distributed learning toolkits, Bagua and Persia, developed in collaboration between DS3Lab and Kwai Seattle AI Lab.

Bagua is a general-purpose distributed data-parallel training system designed to bridge the gap between the current landscape of learning systems and optimization theory. Recent years have witnessed many algorithmic advances for data parallelism that lower communication cost via system relaxations: quantization, decentralization, and communication delay. However, existing systems rely only on standard synchronous and asynchronous stochastic gradient (SG) based optimization and therefore cannot take advantage of all the optimizations that the machine learning community has been developing recently. We therefore build BAGUA, an MPI-style communication library that provides a collection of flexible and modular primitives to support state-of-the-art system relaxation techniques for distributed training. Powered by this design, BAGUA can implement and extend various state-of-the-art distributed learning algorithms. In a production cluster with up to 128 GPUs, BAGUA can outperform PyTorch-DDP, Horovod and BytePS in end-to-end training time by a significant margin (up to 2 times) across a diverse range of tasks.
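Of the three relaxations named above, quantization is the easiest to show in a few lines. The sketch below is a generic min-max uniform quantizer written in plain Python; it does not use or mirror the Bagua API, and the function names and the 8-bit default are assumptions chosen for illustration.

```python
def quantize(grad, bits=8):
    """Map each gradient entry to an integer code in [0, 2^bits - 1]
    using uniform min-max quantization."""
    lo, hi = min(grad), max(grad)
    levels = (1 << bits) - 1
    scale = (hi - lo) / levels if hi > lo else 1.0
    codes = [round((g - lo) / scale) for g in grad]
    return codes, lo, scale

def dequantize(codes, lo, scale):
    """Reconstruct approximate gradient values from the integer codes."""
    return [lo + c * scale for c in codes]

grad = [-0.5, 0.0, 0.25, 1.0]
codes, lo, scale = quantize(grad)
restored = dequantize(codes, lo, scale)
max_error = max(abs(a - b) for a, b in zip(grad, restored))
print(codes, max_error)
```

A worker would send the small integer codes plus `lo` and `scale` instead of full-precision floats, trading a reconstruction error of at most half a quantization step for less traffic. Whether training still converges under that error is the optimization-theory side of the gap the abstract describes.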

Persia is a distributed training system specialized for extremely large-scale recommender systems. Deep learning based models dominate the current landscape of production recommender systems. Furthermore, recent years have witnessed exponential growth in model scale, from Google's 2016 model with 1 billion parameters to Facebook's latest model with 12 trillion parameters. A significant quality boost has come with each jump in model capacity, which makes us believe the era of 100 trillion parameters is around the corner. However, training such models is challenging even within industrial-scale data centers. The difficulty stems from the staggering heterogeneity of the training computation: the model's embedding layer can account for more than 99.99% of the total model size and is extremely memory-intensive, while the rest of the neural network is increasingly computation-intensive. We resolve this challenge by careful co-design of both the optimization algorithm and the distributed system architecture. Specifically, to ensure both training efficiency and training accuracy, we design a novel hybrid training algorithm in which the embedding layer and the dense neural network are handled by different synchronization mechanisms. Both a theoretical analysis and an empirical study at up to 100 trillion parameters have been conducted to justify the system design and implementation of Persia.
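The hybrid idea, different synchronization for the sparse and dense parts of the model, can be illustrated with a toy single-process simulation. This is only a sketch of the general pattern and not Persia's implementation or API: the data layout, the function name, and the learning rate are hypothetical, and real asynchrony is replaced by applying each worker's sparse update in turn without any averaging.

```python
def hybrid_step(embedding, dense, worker_batches, lr=0.1):
    """One simulated training step with two synchronization styles."""
    # Asynchronous style: each worker updates only the embedding rows it
    # touched, as soon as its gradient is ready, without waiting for others.
    for batch in worker_batches:
        for row_id, grad in batch["emb_grads"].items():
            embedding[row_id] -= lr * grad
    # Synchronous style: dense gradients are averaged across all workers
    # (as an all-reduce would do) before a single shared update.
    num_workers = len(worker_batches)
    dense_grads = [b["dense_grad"] for b in worker_batches]
    averaged = [sum(gs) / num_workers for gs in zip(*dense_grads)]
    for j, g in enumerate(averaged):
        dense[j] -= lr * g

embedding = {0: 1.0, 1: 1.0, 2: 1.0}  # one scalar per row, for brevity
dense = [0.0, 0.0]
worker_batches = [
    {"emb_grads": {0: 1.0}, "dense_grad": [1.0, 2.0]},
    {"emb_grads": {0: 2.0, 2: -1.0}, "dense_grad": [3.0, 4.0]},
]
hybrid_step(embedding, dense, worker_batches)
print(embedding, dense)
```

The split follows the heterogeneity described above: embedding updates are sparse and touch few rows per batch, so they tolerate loose coordination, while the small dense network is updated with consistent averaged gradients.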


December 21, 20:00-21:00

Topic: Data Ecosystem Integration for Machine Learning: Serverless and Databases

Speakers: Jiawei Jiang and Lijie Xu

Biography:

  • Jiawei Jiang is a postdoctoral researcher in the Department of Computer Science of ETH Zürich. He obtained his Ph.D. in Computer Science from Peking University in 2018. His research interests include scalable machine learning, distributed optimization, and automatic machine learning.
  • Lijie Xu is a postdoctoral researcher in the Department of Computer Science of ETH Zürich. He obtained his PhD from the Institute of Software, Chinese Academy of Sciences. His current research interests include in-database machine learning, big data systems, and distributed systems.

Abstract: Integrating machine learning into today's data ecosystems effectively and efficiently is a challenging problem. These ecosystems have different data storage and computation paradigms, which can degrade the performance and scalability of ML workloads implemented on top of them. In this talk, we focus on integrating ML into two types of data ecosystems:

(1) ML on serverless: The appeal of serverless (FaaS) has triggered growing interest in how to use it for data-intensive applications such as ETL, query processing, or machine learning (ML). Several systems exist for training large-scale ML models on top of serverless infrastructures (e.g., AWS Lambda), but with inconclusive results in terms of their performance and relative advantage over serverful infrastructures (IaaS). In this talk, we will present a systematic, comparative study of distributed ML training over FaaS and IaaS. We present a design space covering design choices such as optimization algorithms and synchronization protocols, and implement a platform, LambdaML, that enables a fair comparison between FaaS and IaaS. We present experimental results using LambdaML, and further develop an analytic model to capture the cost/performance tradeoffs that must be considered when opting for a serverless infrastructure. Our results indicate that ML training pays off in serverless only for models with efficient (i.e., reduced) communication that converge quickly. In general, FaaS can be much faster, but it is never significantly cheaper than IaaS.

(2) In-DB ML (DB4AI): One key problem for in-DB ML is how to implement an effective and efficient Stochastic Gradient Descent (SGD) paradigm inside a database. SGD requires random data access, which is inherently inefficient in database systems that rely on block-addressable secondary storage such as HDD and SSD. In this talk, we will present a new SGD paradigm for in-DB ML with a novel data shuffling approach. Compared with existing approaches, our approach avoids a full data shuffle while maintaining a convergence rate comparable to that of SGD with a full shuffle. We have integrated it into PostgreSQL and openGauss by introducing three new physical operators. The experimental results show that our approach achieves a convergence rate comparable to full-shuffle SGD and outperforms state-of-the-art in-DB ML tools.
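One way to get randomness without a full shuffle on block-addressable storage is to shuffle at two levels: pick blocks in random order (sequential reads inside each block), then shuffle tuples inside a small in-memory buffer. The sketch below illustrates that general idea only; the abstract does not spell out the shuffling scheme, so treating it as two-level is an assumption, and the function name and parameters are hypothetical.

```python
import random

def two_level_shuffle(blocks, buffer_blocks, seed=0):
    """Yield tuples in a partially random order: blocks are visited in
    random order, and tuples are shuffled within a buffer that holds
    `buffer_blocks` blocks at a time."""
    rng = random.Random(seed)
    order = list(range(len(blocks)))
    rng.shuffle(order)  # block-level shuffle
    for start in range(0, len(order), buffer_blocks):
        buffer = [t for b in order[start:start + buffer_blocks] for t in blocks[b]]
        rng.shuffle(buffer)  # tuple-level shuffle inside the buffer
        yield from buffer

# Four storage blocks of three tuples each; tuple t lives in block t // 3.
blocks = [[0, 1, 2], [3, 4, 5], [6, 7, 8], [9, 10, 11]]
stream = list(two_level_shuffle(blocks, buffer_blocks=2))
print(stream)
```

Every tuple is still read exactly once per epoch, and the only random I/O is at block granularity, which is the access pattern that HDDs and SSDs handle well.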


December 22, 20:00-21:00

Topic: ML and Modern Hardware

Speakers: Zeke Wang and Wenqi Jiang

Biography:

  • Dr. Zeke Wang is currently a research professor affiliated with the Artificial Intelligence Collaborative Innovation Center, School of Computer Science, Zhejiang University. In 2011, he received his Ph.D. from Zhejiang University's School of Instrumental Studies, and he worked as an assistant researcher there from 2012 to 2013. From 2013 to 2017, he was a postdoctoral fellow at Nanyang Technological University and the National University of Singapore. From 2017 to December 2019, he was a postdoctoral fellow in the Systems Group of ETH Zurich. He has served as a program committee member for multiple international conferences (such as KDD) and as a reviewer for multiple international journals (such as TPDS, TC, TCAD).
  • Wenqi Jiang is a PhD student at ETH Zurich advised by Prof. Gustavo Alonso. His research centers on improving and replacing state-of-the-art software-based systems with emerging heterogeneous hardware. More specifically, he is exploring how reconfigurable hardware (FPGAs) can be applied and deployed efficiently in data centers. This research ranges from application-specific designs, such as recommendation systems and information retrieval systems, to infrastructure development, such as distributed frameworks and virtualization in the cloud.

Abstract: Various kinds of hardware have been adopted for (distributed) machine learning. In this talk, we will first discuss recent attempts to combine SmartNICs and FPGAs to efficiently support machine learning workflows. In the second half of the talk, we will cover advances in leveraging FPGAs to accelerate inference in recommendation systems.

As Moore's Law nears its end, the computing power of traditional CPUs can no longer increase rapidly, while network speeds continue to do so. As a result, current CPUs cannot meet the data processing demands of the network, and we need to offload the network protocol stack to the SmartNIC to reduce the load on the server CPU. We can also offload selected computing tasks of the machine learning training system to the SmartNIC to improve the overall performance of the training system. At present, ASIC-based SmartNICs support only specific and limited application offloading, and ARM processor-based SmartNICs have very limited computing power and cannot support the offloading of complex functions. For this reason, we think that a SmartNIC based on a high-end FPGA can flexibly support machine learning training systems. Our SmartNIC has enough computing power to offload communication-related tasks such as compression and decompression.

Deep neural networks are widely used in personalized recommendation systems. Unlike regular DNN inference workloads, which are typically bound by computation, recommendation inference is largely bound by memory because of the many random accesses needed to look up the embedding tables. To this end, we first present MicroRec (MLSys'21), a high-performance FPGA inference engine for recommendation systems that tackles the memory bottleneck with both computer architecture and data structure solutions. Once the memory bottleneck is removed, the DNN computation becomes the main bottleneck again because of the limited computing power delivered by FPGAs. Thus, we further design and implement a high-performance, heterogeneous recommendation inference cluster named FleetRec (KDD'21) that takes advantage of the strengths of both FPGAs and GPUs. Experiments on three production models of up to 114 GB show that FleetRec outperforms an optimized CPU baseline by more than one order of magnitude in throughput while achieving significantly lower latency.
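To illustrate how a data structure change can cut random memory accesses, consider merging two small embedding tables into their Cartesian product, so that one lookup returns both vectors already concatenated. The abstract does not specify which data structure MicroRec uses, so this is an assumed example of the general idea; the function names and table contents are hypothetical.

```python
def cartesian_table(table_a, table_b):
    """Precompute one row per (i, j) pair holding the concatenated vectors."""
    return [row_a + row_b for row_a in table_a for row_b in table_b]

def lookup_separate(table_a, table_b, i, j):
    """Baseline: two random memory accesses, one per table."""
    return table_a[i] + table_b[j]

def lookup_combined(combined, len_b, i, j):
    """Merged table: a single random access at a computed offset."""
    return combined[i * len_b + j]

table_a = [[0.1, 0.2], [0.3, 0.4], [0.5, 0.6]]  # 3 rows, dimension 2
table_b = [[1.0], [2.0]]                        # 2 rows, dimension 1
combined = cartesian_table(table_a, table_b)    # 3 * 2 = 6 rows, dimension 3

print(lookup_separate(table_a, table_b, 2, 1))
print(lookup_combined(combined, len(table_b), 2, 1))
```

The merged table has as many rows as the product of the two table sizes, so the trade only makes sense for small tables: it spends extra storage to halve the number of random accesses for that pair.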
