Changing the AWS credentials in PySpark's Hadoop configuration at runtime, after the Spark context has been initialized

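I have looked through the solutions to related questions on Stack Overflow, but this problem seems fairly unique. For context, company procedure requires me to refresh my AWS security credentials every hour, and I am struggling to get the newly refreshed credentials into Spark. Everything works during the first hour (I can access and read tables from S3, and so on), but once the credentials are refreshed at the end of that hour, I cannot successfully update the credentials Spark uses.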

Once I have refreshed my AWS credentials, this is the code I use to update Spark so that it uses the new ones:

import configparser
import os

sc = spark.sparkContext

def getAWSKeys(profile):
    # Read the keys for the given profile from ~/.aws/credentials.
    awsCreds = {}
    Config = configparser.ConfigParser()
    Config.read(os.path.join(os.getenv("HOME"), '.aws', 'credentials'))
    if profile in Config.sections():
        awsCreds["aws_access_key_id"] = Config.get(
            profile, "aws_access_key_id")
        awsCreds["aws_secret_access_key"] = Config.get(
            profile, "aws_secret_access_key")
        awsCreds["aws_session_token"] = Config.get(
            profile, "aws_session_token")
    return awsCreds

awsKeys = getAWSKeys(profile)
sc._jsc.hadoopConfiguration().set("fs.s3.awsAccessKeyId", awsKeys["aws_access_key_id"])
sc._jsc.hadoopConfiguration().set("fs.s3.awsSecretAccessKey", awsKeys["aws_secret_access_key"])
sc._jsc.hadoopConfiguration().set("fs.s3.session.token", awsKeys["aws_session_token"])
sc._jsc.hadoopConfiguration().set("fs.s3.enableServerSideEncryption", "true")
sc._jsc.hadoopConfiguration().set("fs.s3.access.key", awsKeys["aws_access_key_id"])
sc._jsc.hadoopConfiguration().set("fs.s3.secret.key", awsKeys["aws_secret_access_key"])
sc._jsc.hadoopConfiguration().set("fs.s3.endpoint", "s3.us-east-1.amazonaws.com")

sc._jsc.hadoopConfiguration().set("fs.s3a.awsAccessKeyId", awsKeys["aws_access_key_id"])
sc._jsc.hadoopConfiguration().set("fs.s3a.awsSecretAccessKey", awsKeys["aws_secret_access_key"])
sc._jsc.hadoopConfiguration().set("fs.s3a.session.token", awsKeys["aws_session_token"])
sc._jsc.hadoopConfiguration().set("fs.s3a.enableServerSideEncryption", "true")
sc._jsc.hadoopConfiguration().set("fs.s3a.access.key", awsKeys["aws_access_key_id"])
sc._jsc.hadoopConfiguration().set("fs.s3a.secret.key", awsKeys["aws_secret_access_key"])
sc._jsc.hadoopConfiguration().set("fs.s3a.endpoint", "s3.us-east-1.amazonaws.com")

sc._jsc.hadoopConfiguration().set("fs.s3n.awsAccessKeyId", awsKeys["aws_access_key_id"])
sc._jsc.hadoopConfiguration().set("fs.s3n.awsSecretAccessKey", awsKeys["aws_secret_access_key"])
sc._jsc.hadoopConfiguration().set("fs.s3n.session.token", awsKeys["aws_session_token"])
sc._jsc.hadoopConfiguration().set("fs.s3n.enableServerSideEncryption", "true")
sc._jsc.hadoopConfiguration().set("fs.s3n.access.key", awsKeys["aws_access_key_id"])
sc._jsc.hadoopConfiguration().set("fs.s3n.secret.key", awsKeys["aws_secret_access_key"])
sc._jsc.hadoopConfiguration().set("fs.s3n.endpoint", "s3.us-east-1.amazonaws.com")

sc.setSystemProperty("com.amazonaws.services.s3.enableV4", "true")
sc.setSystemProperty("com.amazonaws.services.s3n.enableV4", "true")
sc.setSystemProperty("com.amazonaws.services.s3a.enableV4", "true")

sc._jsc.hadoopConfiguration().set("fs.s3.aws.credentials.provider",
                                  "org.apache.hadoop.fs.s3.TemporaryAWSCredentialsProvider")

os.environ['AWS_ACCESS_KEY_ID'] = awsKeys["aws_access_key_id"]
os.environ['AWS_SECRET_ACCESS_KEY'] = awsKeys["aws_secret_access_key"]
os.environ['AWS_SESSION_TOKEN'] = awsKeys["aws_session_token"]

I tried to be exhaustive in my approach, but unfortunately nothing worked. The error I get is:

Py4JJavaError Traceback (most recent call last)
in ()
      3 table = (
      4     spark.read.option("delimiter", "|")
----> 5     .csv(f"s3n://{s3_path}/{file1}", header = True, inferSchema=True)
      6     .select("col1", "col2", "col3", "col4")
      7 )

/usr/lib/spark/python/pyspark/sql/readwriter.py in csv(self, path, schema, sep, encoding, quote, escape, comment, header, inferSchema, ignoreLeadingWhiteSpace, ignoreTrailingWhiteSpace, nullValue, nanValue, positiveInf, negativeInf, dateFormat, timestampFormat, maxColumns, maxCharsPerColumn, maxMalformedLogPerPartition, mode, columnNameOfCorruptRecord, multiLine)
    408         if isinstance(path, basestring):
    409             path = [path]
--> 410         return self._df(self._jreader.csv(self._spark._sc._jvm.PythonUtils.toSeq(path)))
    411
    412     @since(1.5)

/usr/lib/spark/python/lib/py4j-0.10.4-src.zip/py4j/java_gateway.py in __call__(self, *args)
   1131         answer = self.gateway_client.send_command(command)
   1132         return_value = get_return_value(
-> 1133             answer, self.gateway_client, self.target_id, self.name)
   1134
   1135         for temp_arg in temp_args:

/usr/lib/spark/python/pyspark/sql/utils.py in deco(*a, **kw)
     61     def deco(*a, **kw):
     62         try:
---> 63             return f(*a, **kw)
     64         except py4j.protocol.Py4JJavaError as e:
     65             s = e.java_exception.toString()

/usr/lib/spark/python/lib/py4j-0.10.4-src.zip/py4j/protocol.py in get_return_value(answer, gateway_client, target_id, name)
    317                 raise Py4JJavaError(
    318                     "An error occurred while calling {0}{1}{2}.\n".
--> 319                     format(target_id, ".", name), value)
    320             else:
    321                 raise Py4JError(

Py4JJavaError: An error occurred while calling o12923.csv.
: com.amazon.ws.emr.hadoop.fs.shaded.com.amazonaws.services.s3.model.AmazonS3Exception: Bad Request (Service: Amazon S3; Status Code: 400; Error Code: 400 Bad Request; Request ID: 9A4F6DDEA3BD8AA6), S3 Extended Request ID: xg9ZiPjfV3h4rGgs5emsUiWl8xQdv0OMhK/91qdAs/iIvapWgIlWh9m1qLTGj3ODFM9MtEnuueg=

at com.amazon.ws.emr.hadoop.fs.shaded.com.amazonaws.http.AmazonHttpClient$RequestExecutor.handleErrorResponse(AmazonHttpClient.java:1588)
at com.amazon.ws.emr.hadoop.fs.shaded.com.amazonaws.http.AmazonHttpClient$RequestExecutor.executeOneRequest(AmazonHttpClient.java:1258)
at com.amazon.ws.emr.hadoop.fs.shaded.com.amazonaws.http.AmazonHttpClient$RequestExecutor.executeHelper(AmazonHttpClient.java:1030)
at com.amazon.ws.emr.hadoop.fs.shaded.com.amazonaws.http.AmazonHttpClient$RequestExecutor.doExecute(AmazonHttpClient.java:742)
at com.amazon.ws.emr.hadoop.fs.shaded.com.amazonaws.http.AmazonHttpClient$RequestExecutor.executeWithTimer(AmazonHttpClient.java:716)
at com.amazon.ws.emr.hadoop.fs.shaded.com.amazonaws.http.AmazonHttpClient$RequestExecutor.execute(AmazonHttpClient.java:699)
at com.amazon.ws.emr.hadoop.fs.shaded.com.amazonaws.http.AmazonHttpClient$RequestExecutor.access$500(AmazonHttpClient.java:667)
at com.amazon.ws.emr.hadoop.fs.shaded.com.amazonaws.http.AmazonHttpClient$RequestExecutionBuilderImpl.execute(AmazonHttpClient.java:649)
at com.amazon.ws.emr.hadoop.fs.shaded.com.amazonaws.http.AmazonHttpClient.execute(AmazonHttpClient.java:513)
at com.amazon.ws.emr.hadoop.fs.shaded.com.amazonaws.services.s3.AmazonS3Client.invoke(AmazonS3Client.java:4169)
at com.amazon.ws.emr.hadoop.fs.shaded.com.amazonaws.services.s3.AmazonS3Client.invoke(AmazonS3Client.java:4116)
at com.amazon.ws.emr.hadoop.fs.shaded.com.amazonaws.services.s3.AmazonS3Client.getObjectMetadata(AmazonS3Client.java:1237)
at com.amazon.ws.emr.hadoop.fs.s3.lite.call.GetObjectMetadataCall.perform(GetObjectMetadataCall.java:24)
at com.amazon.ws.emr.hadoop.fs.s3.lite.call.GetObjectMetadataCall.perform(GetObjectMetadataCall.java:10)
at com.amazon.ws.emr.hadoop.fs.s3.lite.executor.GlobalS3Executor.execute(GlobalS3Executor.java:82)
at com.amazon.ws.emr.hadoop.fs.s3.lite.AmazonS3LiteClient.invoke(AmazonS3LiteClient.java:176)
at com.amazon.ws.emr.hadoop.fs.s3.lite.AmazonS3LiteClient.getObjectMetadata(AmazonS3LiteClient.java:94)
at com.amazon.ws.emr.hadoop.fs.s3.lite.AbstractAmazonS3Lite.getObjectMetadata(AbstractAmazonS3Lite.java:39)
at com.amazon.ws.emr.hadoop.fs.s3n.Jets3tNativeFileSystemStore.retrieveMetadata(Jets3tNativeFileSystemStore.java:211)
at sun.reflect.GeneratedMethodAccessor42.invoke(Unknown Source)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:191)
at org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:102)
at com.sun.proxy.$Proxy36.retrieveMetadata(Unknown Source)
at com.amazon.ws.emr.hadoop.fs.s3n.S3NativeFileSystem.getFileStatus(S3NativeFileSystem.java:768)
at org.apache.hadoop.fs.FileSystem.exists(FileSystem.java:1430)
at com.amazon.ws.emr.hadoop.fs.EmrFileSystem.exists(EmrFileSystem.java:311)
at org.apache.spark.sql.execution.datasources.DataSource$$anonfun$14.apply(DataSource.scala:359)
at org.apache.spark.sql.execution.datasources.DataSource$$anonfun$14.apply(DataSource.scala:348)
at scala.collection.TraversableLike$$anonfun$flatMap$1.apply(TraversableLike.scala:241)
at scala.collection.TraversableLike$$anonfun$flatMap$1.apply(TraversableLike.scala:241)

at scala.collection.immutable.List.foreach(List.scala:381)
at scala.collection.TraversableLike$class.flatMap(TraversableLike.scala:241)
at scala.collection.immutable.List.flatMap(List.scala:344)
at org.apache.spark.sql.execution.datasources.DataSource.resolveRelation(DataSource.scala:348)
at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:178)
at org.apache.spark.sql.DataFrameReader.csv(DataFrameReader.scala:533)
at sun.reflect.GeneratedMethodAccessor118.invoke(Unknown Source)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
at py4j.Gateway.invoke(Gateway.java:280)
at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
at py4j.commands.CallCommand.execute(CallCommand.java:79)
at py4j.GatewayConnection.run(GatewayConnection.java:214)
at java.lang.Thread.run(Thread.java:748)

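To reiterate: everything works during the first hour, but as soon as I refresh my AWS credentials I get the 400 Bad Request error. I have tried to pass the new credentials to Spark, but nothing I have tried has worked.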

社区小助手 2019-01-02 15:05:55
1 answer
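  • 社区小助手 is an administrator of the Spark China community who regularly posts livestream recaps, articles, and Spark questions and answers collected from the DingTalk group.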

    I can't see an easy way to do this, because the credentials get bound to the filesystem instance and are then frozen.

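    If I were trying to do this, I'd write my own implementation of AWSCredentialsProvider to supply credentials for AWS calls. The default chain is something like: Spark config, environment variables, then a GET request to the EC2 metadata service. You could add a new provider that picks up fresh values somehow. You would also need a way to propagate the new session credentials to every host in the cluster.

    Another thing worth knowing is that the maximum lifetime of AWS assumed roles has been increased from 1 hour to 12 hours. If you can get your IT team to raise the roles assigned to you to 12 hours, you may be able to get a whole day's work out of one session. Try that first.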
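
To illustrate the idea of a provider that "picks up fresh values somehow": a real provider would be a Java class implementing AWSCredentialsProvider, which is not shown here. The Python sketch below only mirrors the refresh logic such a provider might use, re-reading a profile from the credentials file whenever the file's modification time changes. The class name FileBackedCredentials and its path parameter are hypothetical, and the sketch does not cover distributing the file to every host in the cluster.

```python
import configparser
import os


class FileBackedCredentials:
    """Hypothetical helper: re-reads one profile when the credentials file changes."""

    def __init__(self, profile, path=None):
        self.profile = profile
        # Default location is an assumption matching the question's ~/.aws/credentials.
        self.path = path or os.path.join(
            os.path.expanduser("~"), ".aws", "credentials")
        self._mtime = None
        self._creds = {}

    def get(self):
        """Return the current credentials, reloading only if the file was modified."""
        mtime = os.stat(self.path).st_mtime_ns
        if mtime != self._mtime:
            parser = configparser.ConfigParser()
            parser.read(self.path)
            if parser.has_section(self.profile):
                self._creds = dict(parser[self.profile])
            else:
                self._creds = {}
            self._mtime = mtime
        # Hand out a copy so callers cannot mutate the cached values.
        return dict(self._creds)
```

Each call to get() costs one stat of the file, so it can be called before every batch of S3 work without re-parsing the file each time.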

    PS: for CSV, inferSchema=True means "read through the entire CSV file once just to work out the schema".
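
One way to avoid that extra pass is to supply the schema explicitly instead of inferring it. A minimal sketch, assuming Spark 2.3 or later (where the schema argument accepts a DDL-formatted string) and assuming, purely for illustration, that the file contains exactly the four columns named in the question and that all of them are strings. The helper names schema_ddl and read_table are hypothetical.

```python
# Hypothetical layout: the real column names and types must match the file.
COLUMNS = [
    ("col1", "STRING"),
    ("col2", "STRING"),
    ("col3", "STRING"),
    ("col4", "STRING"),
]


def schema_ddl(columns):
    """Render (name, type) pairs as a DDL schema string such as 'a STRING, b INT'."""
    return ", ".join(f"{name} {sql_type}" for name, sql_type in columns)


def read_table(spark, path):
    """Read a pipe-delimited CSV with a fixed schema, so no inference pass is needed."""
    return (
        spark.read.option("delimiter", "|")
        .csv(path, header=True, schema=schema_ddl(COLUMNS))
    )
```

With a fixed schema, Spark does not need to scan the data to guess column types before building the DataFrame.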

    2019-07-17 23:24:24