1. Background, Problem, Requirements, and Key Challenges
As the title suggests, GBDT (Gradient Boosted Decision Trees) is a family of machine learning algorithms and a flagship of the Boosting paradigm. It is widely used across tasks, particularly regression and classification: it iteratively adds weak learners (usually decision trees), each fitted to the residuals of the current predictions, to improve the overall model. Microsoft's LightGBM and Dr. Tianqi Chen's XGBoost are both efficient engineering implementations of GBDT.
During the scale-up phase of the customer's algorithm workload, they followed the open-source documentation to get LightGBM 2.x running with PySpark in the ODPS Spark environment, and used DataWorks to schedule periodic model training and inference. Their offline models, however, were trained with LightGBM 3.x, which the 2.x runtime cannot load for model updates or inference. After swapping in the 3.x jars, a succession of cross-language errors appeared in turn: ModuleNotFoundError, NoClassDefFoundError, and .so version mismatches. By digging into the source of mmlspark and the newer synapseml, we found that the library's implementation and usage changed along with its architectural overhaul and rename; combined with some custom-built shared libraries for the ODPS Spark runtime, this resolved the blockers for scaling up the LightGBM 3.x workload.
1.1. References
Official MaxCompute-Spark open-source documentation: https://github.com/aliyun/MaxCompute-Spark/wiki/06.-PySpark-%E4%BD%BF%E7%94%A8mmlspark%E5%92%8Canalytics-zoo
Official mmlspark / SynapseML documentation: https://github.com/microsoft/SynapseML?tab=readme-ov-file#python-standalone
Open-source example of a PySpark LightGBM pipeline: https://openscoring.io/blog/2023/05/26/pyspark_lightgbm_pipeline/
1.2. Versions
mmlspark:
- pyspark2.x
- spark-2.4.5-odps0.33.0
- public.python-3.7.9-ucs4.tar.gz
- mmlspark_2.11:0.18.0 (domestic Maven mirrors only carry versions up to 0.18.1; the v1.0.0 tag on GitHub cannot be downloaded directly)
synapseml:
- pyspark3.x
- spark-3.1.1-odps0.33.0
- public.python-3.7.9-ucs4.tar.gz
- synapseml-lightgbm_2.12:0.10.1
1.3. Environment and Approach
Scheduling and job submission are based on a DataWorks ODPS Spark node.
For other development and submission environment setups, see the link below; they are not demonstrated in this article: https://help.aliyun.com/zh/maxcompute/user-guide/set-up-a-spark-on-maxcompute-development-environment/?spm=a2c4g.11186623.0.0.6a7344daA81lvw
2. LightGBM 2.x Implementation with mmlspark
2.1. Preparing Dependencies
Two approaches are provided here; pick whichever fits your setup: option 1 if you have a spark or pyspark client, option 2 if you have a Java + IDEA environment.
When packaging, include only the jars you need rather than everything at once (for example, the Hadoop runtime dependencies): some jars are already provided by the platform, and duplicates can cause conflicts that prevent the job from being submitted. See 4.3 for the detailed diagnosis, and the pruning sketch after the two options below.
- Official approach
pyspark --packages com.microsoft.ml.spark:mmlspark_2.11:<version TAG> --repositories https://mmlspark.azureedge.net/maven
# The project's jars are downloaded to: $HOME/.ivy2/jars
# Package them:
cd $HOME/.ivy2/jars
zip -r <package_name>.zip .
- Maven approach
<dependencies>
    <dependency>
        <groupId>com.microsoft.ml.spark</groupId>
        <artifactId>mmlspark_2.11</artifactId>
        <version>0.18.0</version>
    </dependency>
</dependencies>
# Copy the project's jars (by default they land in target/dependency):
cd <project_dir>
mvn dependency:copy-dependencies
# Package them:
cd target/dependency
zip -r <package_name>.zip .
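Following the advice above about packaging selectively, here is a minimal Python sketch that prunes jars the platform is assumed to already provide before zipping. The prefix list is illustrative, not authoritative; cross-check against the reference pom linked in 4.3.

import os

# Jar name prefixes assumed to be provided by the ODPS Spark runtime already;
# shipping duplicates risks the submit failure described in 4.3.
# This list is illustrative; cross-check against the reference pom in 4.3.
PLATFORM_PREFIXES = ('hadoop-', 'spark-core_', 'spark-kvstore_')

jar_dir = os.path.expanduser('~/.ivy2/jars')  # or <project_dir>/target/dependency for the Maven route
for name in sorted(os.listdir(jar_dir)):
    if name.endswith('.jar') and name.startswith(PLATFORM_PREFIXES):
        print('dropping platform-provided jar:', name)
        os.remove(os.path.join(jar_dir, name))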
2.2. Preparing the Test Code
The code is a fairly standard training-and-inference pipeline. Note that the sys.path.append call near the top must add the mmlspark path under the work dir to the Python path so the mmlspark library can be imported; without it you get a Python ModuleNotFoundError. See 4.1 for the root-cause analysis.
import sys
import os

print(os.getcwd())
sys.path.append(os.getcwd() + '/mmlspark')

from pyspark.sql import SparkSession
from pyspark.ml.linalg import Vectors
from pyspark.sql.functions import lit
from mmlspark.lightgbm import LightGBMClassifier

print('import success sunyf')

# Create the SparkSession
spark = SparkSession.builder \
    .appName("LightGBM Example") \
    .getOrCreate()
print('spark version is:' + spark.version)

# Create sample data
data = [
    (0.0, Vectors.dense([0.1, 0.2, 0.3])),
    (1.0, Vectors.dense([1.1, 1.2, 1.3])),
    (0.0, Vectors.dense([2.1, 2.2, 2.3])),
    (1.0, Vectors.dense([3.1, 3.2, 3.3])),
    (0.0, Vectors.dense([4.1, 4.2, 4.3]))
]
df = spark.createDataFrame(data, ["label", "features"])

# Show the sample data
df.show()

# Create the LightGBMClassifier
classifier = LightGBMClassifier(labelCol="label", featuresCol="features")

# Train the model
model = classifier.fit(df)

# Predict
predictions = model.transform(df)
predictions.show()
2.3. Job Configuration
For the configuration items and the corresponding spark-submit options, see: https://help.aliyun.com/zh/dataworks/user-guide/create-an-odps-spark-node?spm=a2c4g.11186623.0.i3#5f2d4630408bz
Spark version: Spark2.x
Language: Python
Main Python resource: main.py (see 2.2)
Configuration:
# Upload the extra jars needed by the project as an archive and add the extracted path to the extraClassPath
spark.executor.extraClassPath=./archive2/*
spark.driver.extraClassPath=./archive2/*
# Use the default Python 3.7.9 environment in the ODPS Spark runtime
spark.hadoop.odps.cupid.resources=public.python-3.7.9-ucs4.tar.gz
spark.pyspark.python=./public.python-3.7.9-ucs4.tar.gz/python-3.7.9-ucs4/bin/python3
spark.hadoop.odps.spark.version=spark-2.4.5-odps0.33.0
Archive resources:
- mmlspark.zip
- This is the Python portion extracted from mmlspark_2.11-0.18.0.jar and re-zipped (the mmlspark folder shown in Figure 1 below).
Because DataWorks only accepts Python resources ending in .py, and the ODPS Spark node's --py-files ("Python resources") setting likewise only accepts .py resources, the Python part of the library is zipped separately and uploaded as an archive; after extraction, main.py adds that path to the Python path via sys.path (a sketch for checking the zip layout follows this list).
- archive2.zip
- mmlspark_2.11-0.18.0.jar
- lightgbmlib-2.2.350.jar
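As mentioned in the note on mmlspark.zip above, the import only works if the package directory sits at the top level of the extracted archive. A quick local check before uploading, assuming the archive is named mmlspark.zip:

import zipfile

# The runtime does sys.path.append(cwd + '/mmlspark') and then imports `mmlspark`,
# so the zip must contain the mmlspark/ package directory at its top level.
with zipfile.ZipFile('mmlspark.zip') as zf:
    names = zf.namelist()
    assert 'mmlspark/__init__.py' in names, 'package is not at the top level of the zip'
    print('top-level entries:', sorted({n.split('/')[0] for n in names}))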
2.4. Run Screenshots
3. LightGBM 3.x Implementation with synapseml
3.1. Preparing Dependencies
- Prepare the jars largely as in 2.1, substituting the desired groupId:artifactId:version. Here we pick the version matching the customer's requirement:
<dependency>
    <groupId>com.microsoft.azure</groupId>
    <artifactId>synapseml-lightgbm_2.12</artifactId>
    <version>0.10.1</version>
</dependency>
- Note that after the version upgrade and rename, the Python library (formerly mmlspark) is no longer bundled inside synapseml-*.jar, but it can be downloaded directly from PyPI or a domestic mirror with the command below:
python -m pip install synapseml==${synapseml.version}
The pip show command reveals where the package is installed; zip it up from that Location and upload it (the Python 3.7.9 built into ODPS Spark does not bundle this package by default; see: https://help.aliyun.com/zh/maxcompute/user-guide/develop-a-spark-on-maxcompute-application-by-using-pyspark?spm=a2c4g.11186623.0.i2#section-0q7-5da-5vt). Then, following the approach in 4.1, upload the dependency as an archive and add it to the Python runtime's sys.path. A packaging sketch follows.
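For illustration, a small sketch that locates the installed package the same way pip show does and zips it into python_synapse_low.zip (the archive name matches 3.3; using shutil.make_archive is this article's assumption, not an official tool):

import os
import shutil
import synapse  # installed via `python -m pip install synapseml==...`

# Locate the installed package; this matches the Location reported by `pip show synapseml`.
site_packages = os.path.dirname(synapse.__path__[0])
print('packaging synapse from:', site_packages)

# Zip the `synapse` directory itself so it sits at the top level of the archive,
# matching the sys.path.append(os.getcwd() + '/python_synapse_low') in the test code.
shutil.make_archive('python_synapse_low', 'zip', root_dir=site_packages, base_dir='synapse')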
3.2. Preparing the Test Code
The test code is essentially the same as 2.2, covering LightGBMClassifier training and inference. Note that after the upgrade, the relevant classes must be imported from the synapse package for all subsequent algorithm development.
import sys
import os

print(os.getcwd())
sys.path.append(os.getcwd() + '/python_synapse_low')

import findspark
findspark.init()

from pyspark.sql import SparkSession
from pyspark.ml.linalg import Vectors
from pyspark.sql.functions import lit
from synapse.ml.lightgbm import LightGBMClassifier

print('import success sunyf')

# Create the SparkSession
spark = SparkSession.builder \
    .appName("LightGBM Example") \
    .getOrCreate()
print('spark version is:' + spark.version)

# Create sample data
data = [
    (0.0, Vectors.dense([0.1, 0.2, 0.3])),
    (1.0, Vectors.dense([1.1, 1.2, 1.3])),
    (0.0, Vectors.dense([2.1, 2.2, 2.3])),
    (1.0, Vectors.dense([3.1, 3.2, 3.3])),
    (0.0, Vectors.dense([4.1, 4.2, 4.3]))
]

# Create the DataFrame
df = spark.createDataFrame(data, ["label", "features"])

# Show the sample data
df.show()

# Create the LightGBMClassifier
classifier = LightGBMClassifier(labelCol="label", featuresCol="features")

# Train the model
model = classifier.fit(df)

# Predict
predictions = model.transform(df)
predictions.show()
3.3. Job Configuration
Spark version: Spark3.x
Language: Python
Main Python resource: synapseml_low_main.py (see 3.2)
Configuration:
# Upload the extra jars needed by the project as an archive and add the extracted path to the extraClassPath
spark.executor.extraClassPath=./archive4/*
spark.driver.extraClassPath=./archive4/*
# Use the default Python 3.7.9 environment in the ODPS Spark runtime
spark.hadoop.odps.cupid.resources=public.python-3.7.9-ucs4.tar.gz
spark.pyspark.python=./public.python-3.7.9-ucs4.tar.gz/python-3.7.9-ucs4/bin/python3
Archive resources:
- python_synapse_low.zip
- This is the zip packaged from the installed synapseml==${synapseml.version} package
- archive4.zip
- lightgbmlib-3.2.110.jar
- spray-json_2.12-1.3.5.jar
- synapseml-core_2.12-0.10.1.jar
- synapseml-lightgbm_2.12-0.10.1.jar
3.4. Run Screenshots
4. Pitfalls Encountered
4.1. ModuleNotFoundError: No module named 'mmlspark'
The mmlspark Python library fails to load.
- If you submit from a client, you can add the jar's path directly after --py-files.
- The ODPS Spark node's --py-files ("Python resources") setting only accepts (manually selected) resources ending in .py, and this Python library is packaged inside mmlspark_2.11-0.18.0.jar. So its Python part must be zipped separately and uploaded as an archive; after extraction, main.py adds that path to the Python path via sys.path (see the sketch below).
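If the error persists, a defensive sketch like the following, placed at the top of main.py, makes the driver logs show whether the archive was actually extracted where expected (the path assumes the archive layout from 2.3):

import os
import sys

# Extraction directory of the mmlspark.zip archive (named as in 2.3); adjust if yours differs.
archive_path = os.path.join(os.getcwd(), 'mmlspark')
print('cwd contents:', os.listdir(os.getcwd()))

if not os.path.isdir(archive_path):
    raise RuntimeError(f'archive not extracted at {archive_path}; check the archives setting')

sys.path.append(archive_path)
from mmlspark.lightgbm import LightGBMClassifier  # should now resolve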
4.2. ModuleNotFoundError: No module named 'synapseml'
The synapseml Python library fails to load.
- Find the install path with pip show synapseml, zip it, upload it as an archive, and add it to the Python path.
- Or build a custom Python environment, upload it with odpscmd, reference it via spark.hadoop.odps.cupid.resources, and point spark.pyspark.python at the concrete Python executable. (One caveat: the examples here use Python 3.7.9, but the official docs recommend 3.8+, and in practice some features do fail under 3.7.9. A 3.8+ virtual environment is quite large, though; the pyspark package alone is a few hundred MB, so it is not demonstrated here.)
4.3. Retrying connect to server: 0.0.0.0/0.0.0.0:8030. Already tried 0 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1000 MILLISECONDS)
The stack trace is shown below. The cause is jar conflicts: the mmlspark project pulls in open-source Hadoop artifacts such as hadoop-yarn-client-2.6.5.jar and spark-core_2.11-2.4.3.jar, so add jars selectively.
Reference pom files for ODPS Spark 2.x and 3.x:
https://github.com/aliyun/MaxCompute-Spark/blob/master/spark-2.x/pom.xml
2024-06-15 15:40:02,460 WARN org.spark_project.jetty.servlet.ServletHandler - /jobs/
java.util.NoSuchElementException: null
	at java.util.Collections$EmptyIterator.next(Collections.java:4189) ~[?:1.8.0_111-AliJVM]
	at org.apache.spark.util.kvstore.InMemoryStore$InMemoryIterator.next(InMemoryStore.java:281) ~[spark-kvstore_2.11-2.4.3.jar:2.4.3]
	at org.apache.spark.status.AppStatusStore.applicationInfo(AppStatusStore.scala:38) ~[spark-core_2.11-2.4.3.jar:2.4.3]
	at org.apache.spark.ui.jobs.AllJobsPage.render(AllJobsPage.scala:275) ~[spark-core_2.11-2.4.3.jar:2.4.3]
	at org.apache.spark.ui.WebUI$$anonfun$2.apply(WebUI.scala:84) ~[spark-core_2.11-2.4.3.jar:2.4.3]
	at org.apache.spark.ui.WebUI$$anonfun$2.apply(WebUI.scala:84) ~[spark-core_2.11-2.4.3.jar:2.4.3]
	at org.apache.spark.ui.JettyUtils$$anon$3.doGet(JettyUtils.scala:90) ~[spark-core_2.11-2.4.3.jar:2.4.3]
	at javax.servlet.http.HttpServlet.service(HttpServlet.java:687) ~[javax.servlet-api-3.1.0.jar:3.1.0]
	at javax.servlet.http.HttpServlet.service(HttpServlet.java:790) ~[javax.servlet-api-3.1.0.jar:3.1.0]
	at org.spark_project.jetty.servlet.ServletHolder.handle(ServletHolder.java:848) ~[spark-core_2.11-2.4.3.jar:2.4.3]
	at org.spark_project.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1772) ~[spark-core_2.11-2.4.3.jar:2.4.3]
	at org.apache.hadoop.yarn.server.webproxy.amfilter.AmIpFilter.doFilter(AmIpFilter.java:20) ~[cupid-runtime-1.0.0-shaded.jar:?]
	at org.spark_project.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1759) ~[spark-core_2.11-2.4.3.jar:2.4.3]
	at org.spark_project.jetty.servlet.ServletHandler.doHandle(ServletHandler.java:582) [spark-core_2.11-2.4.3.jar:2.4.3]
	at org.spark_project.jetty.server.handler.ContextHandler.doHandle(ContextHandler.java:1180) [spark-core_2.11-2.4.3.jar:2.4.3]
	at org.spark_project.jetty.servlet.ServletHandler.doScope(ServletHandler.java:512) [spark-core_2.11-2.4.3.jar:2.4.3]
	at org.spark_project.jetty.server.handler.ContextHandler.doScope(ContextHandler.java:1112) [spark-core_2.11-2.4.3.jar:2.4.3]
	at org.spark_project.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:141) [spark-core_2.11-2.4.3.jar:2.4.3]
	at org.spark_project.jetty.server.handler.gzip.GzipHandler.handle(GzipHandler.java:493) [spark-core_2.11-2.4.3.jar:2.4.3]
	at org.spark_project.jetty.server.handler.ContextHandlerCollection.handle(ContextHandlerCollection.java:213) [spark-core_2.11-2.4.3.jar:2.4.3]
	at org.spark_project.jetty.server.handler.HandlerWrapper.handle(HandlerWrapper.java:134) [spark-core_2.11-2.4.3.jar:2.4.3]
	at org.spark_project.jetty.server.Server.handle(Server.java:539) [spark-core_2.11-2.4.3.jar:2.4.3]
	at org.spark_project.jetty.server.HttpChannel.handle(HttpChannel.java:333) [spark-core_2.11-2.4.3.jar:2.4.3]
	at org.spark_project.jetty.server.HttpConnection.onFillable(HttpConnection.java:251) [spark-core_2.11-2.4.3.jar:2.4.3]
	at org.spark_project.jetty.io.AbstractConnection$ReadCallback.succeeded(AbstractConnection.java:283) [spark-core_2.11-2.4.3.jar:2.4.3]
	at org.spark_project.jetty.io.FillInterest.fillable(FillInterest.java:108) [spark-core_2.11-2.4.3.jar:2.4.3]
	at org.spark_project.jetty.io.SelectChannelEndPoint$2.run(SelectChannelEndPoint.java:93) [spark-core_2.11-2.4.3.jar:2.4.3]
	at org.spark_project.jetty.util.thread.strategy.ExecuteProduceConsume.executeProduceConsume(ExecuteProduceConsume.java:303) [spark-core_2.11-2.4.3.jar:2.4.3]
	at org.spark_project.jetty.util.thread.strategy.ExecuteProduceConsume.produceConsume(ExecuteProduceConsume.java:148) [spark-core_2.11-2.4.3.jar:2.4.3]
	at org.spark_project.jetty.util.thread.strategy.ExecuteProduceConsume.run(ExecuteProduceConsume.java:136) [spark-core_2.11-2.4.3.jar:2.4.3]
	at org.spark_project.jetty.util.thread.QueuedThreadPool.runJob(QueuedThreadPool.java:671) [spark-core_2.11-2.4.3.jar:2.4.3]
	at org.spark_project.jetty.util.thread.QueuedThreadPool$2.run(QueuedThreadPool.java:589) [spark-core_2.11-2.4.3.jar:2.4.3]
	at java.lang.Thread.run(Thread.java:745) [?:1.8.0_111-AliJVM]
2024-06-15 15:40:02,601 INFO org.apache.spark.ui.JettyUtils - Adding filter org.apache.hadoop.yarn.server.webproxy.amfilter.AmIpFilter to /metrics/json.
2024-06-15 15:40:02,604 INFO org.spark_project.jetty.server.handler.ContextHandler - Started o.s.j.s.ServletContextHandler@2ed5f71e{/metrics/json,null,AVAILABLE,@Spark}
2024-06-15 15:40:02,756 INFO org.apache.hadoop.fs.aliyun.volume.InternalVolumeFileSystem - Initializing volume to pangu map, using file: volume_pangu_map.conf
2024-06-15 15:40:02,909 INFO org.apache.hadoop.fs.aliyun.pangu.HadoopPanguFileSystem - Begin initializing pangu fs. Input URI: pangu://ODPS-CN-SHANGHAI-L:10240/product/odps/admin_task_project/volumes/20180627-5f2253b35f4e42f5a9a4b15094d8485c/sparkhistory
2024-06-15 15:40:02,914 INFO org.apache.hadoop.conf.Configuration.deprecation - dfs.block.size is deprecated. Instead, use dfs.blocksize
2024-06-15 15:40:03,093 WARN org.apache.spark.scheduler.EventLoggingListener - Event log odps://admin_task_project/cupidhistory/sparkhistory/application_1718436478070_685575130_1.lz4.inprogress already exists. Overwriting...
2024-06-15 15:40:03,623 INFO org.apache.spark.scheduler.EventLoggingListener - Logging events to odps://admin_task_project/cupidhistory/sparkhistory/application_1718436478070_685575130_1.lz4
2024-06-15 15:40:03,814 INFO org.apache.hadoop.yarn.client.RMProxy - Connecting to ResourceManager at /0.0.0.0:8030
2024-06-15 15:40:04,014 INFO org.apache.spark.deploy.yarn.YarnRMClient - Registering the ApplicationMaster
2024-06-15 15:40:05,256 INFO org.apache.hadoop.ipc.Client - Retrying connect to server: 0.0.0.0/0.0.0.0:8030. Already tried 0 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1000 MILLISECONDS)
2024-06-15 15:40:06,261 INFO org.apache.hadoop.ipc.Client - Retrying connect to server: 0.0.0.0/0.0.0.0:8030. Already tried 1 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1000 MILLISECONDS)
2024-06-15 15:40:07,266 INFO org.apache.hadoop.ipc.Client - Retrying connect to server: 0.0.0.0/0.0.0.0:8030. Already tried 2 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1000 MILLISECONDS)
2024-06-15 15:40:08,270 INFO org.apache.hadoop.ipc.Client - Retrying connect to server: 0.0.0.0/0.0.0.0:8030. Already tried 3 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1000 MILLISECONDS)
2024-06-15 15:40:09,275 INFO org.apache.hadoop.ipc.Client - Retrying connect to server: 0.0.0.0/0.0.0.0:8030. Already tried 4 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1000 MILLISECONDS)
2024-06-15 15:40:10,283 INFO org.apache.hadoop.ipc.Client - Retrying connect to server: 0.0.0.0/0.0.0.0:8030. Already tried 5 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1000 MILLISECONDS)
2024-06-15 15:40:11,289 INFO org.apache.hadoop.ipc.Client - Retrying connect to server: 0.0.0.0/0.0.0.0:8030. Already tried 6 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1000 MILLISECONDS)
2024-06-15 15:40:12,294 INFO org.apache.hadoop.ipc.Client - Retrying connect to server: 0.0.0.0/0.0.0.0:8030. Already tried 7 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1000 MILLISECONDS)
2024-06-15 15:40:13,347 INFO org.apache.hadoop.ipc.Client - Retrying connect to server: 0.0.0.0/0.0.0.0:8030. Already tried 8 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1000 MILLISECONDS)
2024-06-15 15:40:14,352 INFO org.apache.hadoop.ipc.Client - Retrying connect to server: 0.0.0.0/0.0.0.0:8030. Already tried 9 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1000 MILLISECONDS)
2024-06-15 15:40:45,376 INFO org.apache.hadoop.ipc.Client - Retrying connect to server: 0.0.0.0/0.0.0.0:8030. Already tried 0 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1000 MILLISECONDS)
4.4. ODPS-0421111: Resource not found - 'xxxx'
The resource of that name, referenced via spark.hadoop.odps.cupid.resources, does not exist in the ODPS project. Use odpscmd or DataWorks to confirm whether the resource exists in the specified ODPS project.
4.5. TypeError: 'JavaPackage' object is not callable
The stack trace is shown below: self._new_java_obj("com.microsoft.ml.spark.lightgbm.LightGBMClassifier"). This is a common error meaning the corresponding Java class cannot be found. Search the uploaded archives for it with the command below (a Python alternative is sketched after the stack trace), then add the matching jar to the archive and the extraClassPath.
grep -i "<class from the error>" ./*.jar
Traceback (most recent call last):
  File "main.py", line 28, in <module>
    classifier = LightGBMClassifier(labelCol="label", featuresCol="features")
  File "/worker/pyspark.zip/pyspark/__init__.py", line 110, in wrapper
  File "/worker/mmlspark/mmlspark/lightgbm/_LightGBMClassifier.py", line 65, in __init__
    self._java_obj = self._new_java_obj("com.microsoft.ml.spark.lightgbm.LightGBMClassifier")
  File "/worker/pyspark.zip/pyspark/ml/wrapper.py", line 67, in _new_java_obj
TypeError: 'JavaPackage' object is not callable
	at org.apache.hadoop.yarn.client.api.impl.YarnClientImpl.applicationReportTransform(YarnClientImpl.java:414)
	at org.apache.hadoop.yarn.client.api.impl.YarnClientImpl.getApplicationReport(YarnClientImpl.java:609)
	at org.apache.spark.deploy.yarn.Client.getApplicationReport(Client.scala:332)
	at org.apache.spark.deploy.yarn.Client.monitorApplication(Client.scala:1056)
	at org.apache.spark.deploy.yarn.Client.run(Client.scala:1161)
	at org.apache.spark.deploy.yarn.YarnClusterApplication.start(Client.scala:1543)
	at org.apache.spark.deploy.SparkSubmit.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:845)
	at org.apache.spark.deploy.SparkSubmit.doRunMain$1(SparkSubmit.scala:161)
	at org.apache.spark.deploy.SparkSubmit.submit(SparkSubmit.scala:184)
	at org.apache.spark.deploy.SparkSubmit.doSubmit(SparkSubmit.scala:86)
	at org.apache.spark.deploy.SparkSubmit$$anon$2.doSubmit(SparkSubmit.scala:920)
	at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:929)
	at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
	at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
	at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
	at java.lang.reflect.Method.invoke(Method.java:498)
	at com.aliyun.odps.SubmitJob.main(SubmitJob.java:78)
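As an alternative to the grep above (which works because class names appear as plain strings inside the jar), here is a small Python sketch that inspects jar entries directly; the search directory and class name are just this example's values:

import os
import zipfile

def find_class(jar_dir, class_name):
    """Print every jar under jar_dir that contains the given class."""
    entry = class_name.replace('.', '/') + '.class'
    for name in sorted(os.listdir(jar_dir)):
        if name.endswith('.jar'):
            with zipfile.ZipFile(os.path.join(jar_dir, name)) as jar:  # a jar is a zip file
                if entry in jar.namelist():
                    print(name, 'contains', class_name)

find_class('.', 'com.microsoft.ml.spark.lightgbm.LightGBMClassifier')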
4.6. libgomp.so.1: version `GOMP_4.5' not found
Some of the system's shared libraries are too old. Algorithm libraries such as LightGBM implement their core logic in C++ and ship it as an .so library so that parallel training is not throttled by Python's GIL (e.g. XGBoost's core libxgboost.so). These libraries in turn link against system-provided .so libraries, and on older systems the available versions fall short, producing the error above.
The ODPS documentation provides a Dockerfile based on centos:7.6.1810; use it as a base to build shared libraries of the versions your algorithm library needs, and use the strings command to check whether a given library meets the requirement. Open-source libraries have many dependencies, so it is best to validate everything in an open-source test environment first, then package the lib or lib64 folder and point spark.executorEnv.LD_LIBRARY_PATH at the custom shared-library path. (A verification sketch follows the stack trace below.)
Traceback (most recent call last):
  File "synapseml_main.py", line 34, in <module>
    model = classifier.fit(df)
  File "/worker/pyspark.zip/pyspark/ml/base.py", line 161, in fit
  File "/worker/python_synapse/synapse/ml/lightgbm/LightGBMClassifier.py", line 2148, in _fit
    java_model = self._fit_java(dataset)
  File "/worker/pyspark.zip/pyspark/ml/wrapper.py", line 332, in _fit_java
  File "/worker/py4j-0.10.9-src.zip/py4j/java_gateway.py", line 1305, in __call__
  File "/worker/pyspark.zip/pyspark/sql/utils.py", line 111, in deco
  File "/worker/py4j-0.10.9-src.zip/py4j/protocol.py", line 328, in get_return_value
py4j.protocol.Py4JJavaError: An error occurred while calling o81.fit.
: java.lang.UnsatisfiedLinkError: /worker/tmp/mml-natives549789301188954945/lib_lightgbm.so: /usr/ali/alicpp/built/gcc-4.9.2/gcc-4.9.2/lib64/libgomp.so.1: version `GOMP_4.5' not found (required by /worker/tmp/mml-natives549789301188954945/lib_lightgbm.so)
	at java.lang.ClassLoader$NativeLibrary.load(Native Method)
	at java.lang.ClassLoader.loadLibrary0(ClassLoader.java:1941)
	at java.lang.ClassLoader.loadLibrary(ClassLoader.java:1824)
	at java.lang.Runtime.load0(Runtime.java:809)
	at java.lang.System.load(System.java:1086)
	at com.microsoft.azure.synapse.ml.core.env.NativeLoader.loadLibraryByName(NativeLoader.java:66)
	at com.microsoft.azure.synapse.ml.lightgbm.LightGBMUtils$.initializeNativeLibrary(LightGBMUtils.scala:33)
	at com.microsoft.azure.synapse.ml.lightgbm.LightGBMBase.train(LightGBMBase.scala:37)
	at com.microsoft.azure.synapse.ml.lightgbm.LightGBMBase.train$(LightGBMBase.scala:36)
	at com.microsoft.azure.synapse.ml.lightgbm.LightGBMClassifier.train(LightGBMClassifier.scala:27)
	at com.microsoft.azure.synapse.ml.lightgbm.LightGBMClassifier.train(LightGBMClassifier.scala:27)
	at org.apache.spark.ml.Predictor.fit(Predictor.scala:151)
	at org.apache.spark.ml.Predictor.fit(Predictor.scala:115)
	at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
	at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
	at java.lang.reflect.Method.invoke(Method.java:498)
	at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
	at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
	at py4j.Gateway.invoke(Gateway.java:282)
	at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
	at py4j.commands.CallCommand.execute(CallCommand.java:79)
	at py4j.GatewayConnection.run(GatewayConnection.java:238)
	at java.lang.Thread.run(Thread.java:745)
	at com.aliyun.odps.cupid.CupidUtil.errMsg2SparkException(CupidUtil.java:50)
	at com.aliyun.odps.cupid.CupidUtil.getResult(CupidUtil.java:131)
	at com.aliyun.odps.cupid.requestcupid.YarnClientImplUtil.pollAMStatus(YarnClientImplUtil.java:108)
	at org.apache.hadoop.yarn.client.api.impl.YarnClientImpl.applicationReportTransform(YarnClientImpl.java:379)
	... 17 more
24/06/18 17:38:46 ERROR Client: Application diagnostics message: Failed to contact YARN for application application_1718703407900_591906351.
java.lang.reflect.InvocationTargetException
	at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
	at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
	at java.lang.reflect.Method.invoke(Method.java:498)
	at com.aliyun.odps.SubmitJob.main(SubmitJob.java:78)
Caused by: org.apache.spark.SparkException: Application application_1718703407900_591906351 finished with failed status
	at org.apache.spark.deploy.yarn.Client.run(Client.scala:1242)
	at org.apache.spark.deploy.yarn.YarnClusterApplication.start(Client.scala:1634)
	at org.apache.spark.deploy.SparkSubmit.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:949)
	at org.apache.spark.deploy.SparkSubmit.doRunMain$1(SparkSubmit.scala:180)
	at org.apache.spark.deploy.SparkSubmit.submit(SparkSubmit.scala:203)
	at org.apache.spark.deploy.SparkSubmit.doSubmit(SparkSubmit.scala:90)
	at org.apache.spark.deploy.SparkSubmit$$anon$2.doSubmit(SparkSubmit.scala:1028)
	at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:1037)
	at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
	... 5 more
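To verify a rebuilt library before wiring it into LD_LIBRARY_PATH, the check that `strings libgomp.so.1 | grep GOMP` performs can be sketched in Python as follows (the library path is hypothetical):

import re

def so_version_tags(path, prefix=b'GOMP_'):
    """Return version strings (e.g. GOMP_4.5) embedded in a shared library."""
    with open(path, 'rb') as f:
        data = f.read()
    return sorted({m.group(0).decode() for m in re.finditer(prefix + rb'[0-9.]+', data)})

# Hypothetical path; point this at the libgomp.so.1 built from the Dockerfile above.
print(so_version_tags('/usr/lib64/libgomp.so.1'))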