[DISCUSS] What parts of the Python API should we focus on next?
As release-1.10 is under feature freeze (stateless Python UDFs are already supported), it is time for us to plan the PyFlink features for the next release.
To make sure the features supported in PyFlink are the ones most in demand in the community, we'd like to get more people involved. It would be best if all devs and users joined the discussion about which kinds of features are more important and urgent.
We have already listed some features from different aspects, which you can find below. However, this is not the final plan. We appreciate any suggestions from the community, whether on functionality, performance improvements, or anything else. If you want to suggest a feature, it would be great to include the following information:
- Feature description: xxxx
- Benefits of the feature: xxxx
- Use cases (optional): xxxx
----Features in my mind----
- Integration with the most popular Python libraries
  - fromPandas/toPandas API
    Description: Support conversion between Table and pandas.DataFrame.
    Benefits: Users could switch between the Flink and Pandas APIs. For example, they could do some analysis using Flink and then continue with the Pandas API if the result data is small enough to fit into memory, and vice versa.
  - Support scalar Pandas UDF
    Description: Support scalar Pandas UDFs in the Python Table API & SQL. Both the input and the output of the UDF are pandas.Series.
    Benefits: 1) Scalar Pandas UDFs perform better than row-at-a-time UDFs, by 3x to over 100x (figures from PySpark). 2) Users could use the Pandas/NumPy APIs in the Python UDF implementation when the input/output data type is pandas.Series.
  - Support Pandas UDAF in batch GroupBy aggregation
    Description: Support Pandas UDAFs in batch GroupBy aggregation of the Python Table API & SQL. Both the input and the output of the UDAF are pandas.DataFrame.
    Benefits: 1) Pandas UDAFs can outperform row-at-a-time UDAFs by more than 10x in certain scenarios. 2) Users could use the Pandas/NumPy APIs in the Python UDAF implementation when the input/output data type is pandas.DataFrame.
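To make the difference between row-at-a-time and Pandas-style functions concrete, here is a rough sketch in plain pandas. It does not use any PyFlink API (none has been designed for this yet); all function names are hypothetical, and run_grouped only imitates what a runtime might do when it feeds each group to a Pandas UDAF.

```python
import pandas as pd


def add_one_row_at_a_time(value: int) -> int:
    # Row-at-a-time style: a runtime would call this once per input row.
    return value + 1


def add_one_vectorized(values: pd.Series) -> pd.Series:
    # Scalar Pandas UDF style: a runtime would pass a whole batch of rows
    # as one pandas.Series and get a Series of the same length back.
    return values + 1


def total_per_key(group: pd.DataFrame) -> pd.DataFrame:
    # Pandas UDAF style: one pandas.DataFrame per group in,
    # one single-row pandas.DataFrame out.
    return pd.DataFrame(
        {"key": [group["key"].iloc[0]], "total": [group["amount"].sum()]}
    )


def run_grouped(table: pd.DataFrame, udaf) -> pd.DataFrame:
    # Hypothetical stand-in for the runtime: split by key, apply the UDAF
    # to each group, and concatenate the per-group results.
    parts = [udaf(group) for _, group in table.groupby("key")]
    return pd.concat(parts, ignore_index=True)
```

Calling add_one_vectorized on a Series gives the same values as mapping add_one_row_at_a_time over its elements. The vectorized form just hands the whole batch to pandas in one call, which is where the expected speedup would come from.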
- Fully support all kinds of Python UDF
  - Support Python UDAF (stateful) in GroupBy aggregation (NOTE: please give us some use cases if you want this feature to be included in the next release)
    Description: Support UDAFs in GroupBy aggregation.
    Benefits: Users could define Python UDAFs and use them in GroupBy aggregation. Without this, users have to use Java/Scala UDAFs.
  - Support Python UDTF
    Description: Support Python UDTFs in the Python Table API & SQL.
    Benefits: Users could define and use Python UDTFs in the Python Table API & SQL. Without this, users have to use Java/Scala UDTFs.
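For readers less familiar with the terms, the following plain-Python sketch shows the shape of the two function kinds discussed above. It is only an illustration: the Python interfaces are not decided, the method names are loosely borrowed from the Java AggregateFunction (createAccumulator, accumulate, getValue), and aggregate_by_key is a hypothetical stand-in for the runtime.

```python
from typing import Dict, Iterable, Iterator, List, Optional, Tuple


def split_words(line: str) -> Iterator[Tuple[str, int]]:
    # UDTF style: one input row may produce zero or more output rows.
    for word in line.split():
        yield word, len(word)


class WeightedAvg:
    # UDAF style: many rows of a group are folded into one value.
    # The accumulator is the state a runtime would have to keep per group,
    # which is why this kind of function is called stateful.

    def create_accumulator(self) -> List[float]:
        # [weighted sum, total weight]
        return [0.0, 0.0]

    def accumulate(self, acc: List[float], value: float, weight: float) -> None:
        acc[0] += value * weight
        acc[1] += weight

    def get_value(self, acc: List[float]) -> Optional[float]:
        return acc[0] / acc[1] if acc[1] else None


def aggregate_by_key(
    rows: Iterable[Tuple[str, float, float]], udaf: WeightedAvg
) -> Dict[str, Optional[float]]:
    # Hypothetical stand-in for a GroupBy aggregation: one accumulator per key.
    accumulators: Dict[str, List[float]] = {}
    for key, value, weight in rows:
        if key not in accumulators:
            accumulators[key] = udaf.create_accumulator()
        udaf.accumulate(accumulators[key], value, weight)
    return {key: udaf.get_value(acc) for key, acc in accumulators.items()}
```

An empty input line makes split_words emit no rows at all, which is the main thing a scalar UDF cannot express.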
- Debugging and monitoring of Python UDF
  - Support user-defined metrics
    Description: Allow users to define custom metrics and global job parameters in Python UDFs.
    Benefits: UDFs need metrics to monitor business or technical indicators, so this is a basic requirement for UDFs.
  - Make the log level configurable
    Description: Allow users to configure the log level of Python UDFs.
    Benefits: Users could configure different log levels for debugging and for deployment.
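As a rough illustration of what these two items could mean for a UDF author, here is a sketch that uses only the Python standard library. The Counter class, the ParseAmount UDF, and the logger name are all hypothetical; how PyFlink would actually expose metrics and the log level is exactly what is open for discussion.

```python
import logging
from typing import Optional


class Counter:
    # Hypothetical stand-in for a metric that a runtime would report.
    def __init__(self) -> None:
        self.count = 0

    def inc(self, n: int = 1) -> None:
        self.count += n


class ParseAmount:
    # UDF-style callable that counts bad records and logs them at DEBUG level.

    def __init__(self, log_level: str = "INFO") -> None:
        self.bad_records = Counter()
        self.logger = logging.getLogger("example.udf.parse_amount")
        # Logger.setLevel accepts a level name such as "DEBUG" or "INFO".
        self.logger.setLevel(log_level.upper())

    def __call__(self, raw: str) -> Optional[float]:
        try:
            return float(raw)
        except ValueError:
            self.bad_records.inc()
            self.logger.debug("could not parse %r", raw)
            return None
```

With log_level set to "debug" the unparsable values show up in the log while debugging; with the default "INFO" they are only visible through the counter.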
- Enrich the Python execution environment
  - Docker mode support
    Description: Support running Python UDFs in Docker workers.
    Benefits: Support a wider variety of deployments to meet more users' requirements.
- Expand the usage scope of Python UDF
  - Support using Python UDF via the SQL client
    Description: Support registering and using Python UDFs via the SQL client.
    Benefits: The SQL client is a very important interface for SQL users. This feature allows SQL users to use Python UDFs via the SQL client.
  - Integrate Python UDF with notebooks
    Description: Integrate with notebooks such as Zeppelin (especially the handling of Python dependencies).
  - Support registering Python UDF into the catalog
    Description: Support registering Python UDFs into the catalog.
    Benefits: 1) The catalog is the centralized place to manage metadata such as tables and UDFs. With it, users could register a UDF once and use it anywhere. 2) It's an important part of the SQL functionality. If Python UDFs cannot be registered and used in the catalog, they cannot be shared between jobs.
- Performance improvements of Python UDF
  - Cython improvements
    Description: Use Cython to improve the coders and operations.
    Benefits: Initial tests show that Cython speeds up coder serialization/deserialization by 3x or more.
- Add Python ML API
  - Add Python ML Pipeline API
    Description: Align the Python ML Pipeline API with the Java/Scala one.
    Benefits: 1) We already have the Pipeline APIs for ML, so it would be good to also have the corresponding Python APIs. 2) In many cases, algorithm engineers prefer Python.
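To show the kind of API shape the ML Pipeline item refers to, here is a minimal plain-Python sketch of the usual estimator/model/pipeline pattern. It is not the Flink ML API; every class here is hypothetical and works on plain lists rather than Tables.

```python
from typing import List, Sequence


class MinMaxScalerModel:
    # Fitted stage: transforms data using the range learned during fit().
    def __init__(self, lo: float, hi: float) -> None:
        self.lo = lo
        self.hi = hi

    def transform(self, data: Sequence[float]) -> List[float]:
        span = (self.hi - self.lo) or 1.0  # avoid dividing by zero
        return [(x - self.lo) / span for x in data]


class MinMaxScaler:
    # Estimator stage: learns parameters from data and returns a model.
    def fit(self, data: Sequence[float]) -> MinMaxScalerModel:
        return MinMaxScalerModel(min(data), max(data))


class Pipeline:
    # Chains stages; fit() turns every estimator into its fitted model.
    def __init__(self, stages: list) -> None:
        self.stages = stages

    def fit(self, data: Sequence[float]) -> "Pipeline":
        fitted = []
        for stage in self.stages:
            model = stage.fit(data) if hasattr(stage, "fit") else stage
            data = model.transform(data)
            fitted.append(model)
        return Pipeline(fitted)

    def transform(self, data: Sequence[float]) -> List[float]:
        for stage in self.stages:
            data = stage.transform(data)
        return list(data)
```

A pipeline is fitted once and the returned pipeline is then used for transforming new data, for example Pipeline([MinMaxScaler()]).fit(train).transform(test).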
BTW, PyFlink is a new component, and there is still a lot of work to do. Everybody is therefore cordially welcome to contribute to PyFlink, whether by asking questions, filing bug reports, proposing new features, joining discussions, or contributing code or documentation.
Hope to see your feedback!
*From the Flink mailing list archive, compiled by volunteers
- Integrate PyFlink with Jupyter Notebook
  - Description: Users should be able to run PyFlink seamlessly in Jupyter.
  - Benefits: Jupyter is the industry-standard notebook for data scientists. I've talked to a few companies in North America, and they think Jupyter is the #1 way to empower their internal data scientists with Flink.