Spark Configuration Properties

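Spark on Kubernetes adds a set of Kubernetes-specific configuration properties on top of upstream Spark. In the table below, every property whose name contains kubernetes is new and takes effect only in Spark builds that support Kubernetes.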

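The properties below are specific to Spark on Kubernetes. All other configuration options work the same way as when running on YARN or Mesos. See the configuration page for more information.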

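Property Name Default Meaning
spark.kubernetes.namespace default The namespace in which the driver and executor pods run. When submitting a job with spark-submit in cluster mode, the namespace can also be set on the command line with the --kubernetes-namespace argument.
spark.kubernetes.driver.docker.image spark-driver:2.2.0 Docker image to use for the driver. Specify this using the standard Docker tag format.
spark.kubernetes.executor.docker.image spark-executor:2.2.0 Docker image to use for the executors. Specify this using the standard Docker tag format.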
spark.kubernetes.initcontainer.docker.image spark-init:2.2.0 Docker image to use for the init-container that is run before the driver and executor containers. Specify this using the standard Docker tag format. The init-container is responsible for fetching application dependencies from both remote locations like HDFS or S3, and from the resource staging server, if applicable.
spark.kubernetes.shuffle.namespace default Namespace in which the shuffle service pods are present. The shuffle service must be created in the cluster prior to attempts to use it.
spark.kubernetes.shuffle.labels (none) Labels that will be used to look up shuffle service pods. This should be a comma-separated list of label key-value pairs, where each label is in the format key=value. The labels chosen must be such that they match exactly one shuffle service pod on each node that executors are launched.
spark.kubernetes.allocation.batch.size 5 Number of pods to launch in each round of executor pod allocation.
spark.kubernetes.allocation.batch.delay 1 Number of seconds to wait between each round of executor pod allocation.
spark.kubernetes.authenticate.submission.caCertFile (none) Path to the CA cert file for connecting to the Kubernetes API server over TLS when starting the driver. This file must be located on the submitting machine's disk. Specify this as a path as opposed to a URI (i.e. do not provide a scheme).
spark.kubernetes.authenticate.submission.clientKeyFile (none) Path to the client key file for authenticating against the Kubernetes API server when starting the driver. This file must be located on the submitting machine's disk. Specify this as a path as opposed to a URI (i.e. do not provide a scheme).
spark.kubernetes.authenticate.submission.clientCertFile (none) Path to the client cert file for authenticating against the Kubernetes API server when starting the driver. This file must be located on the submitting machine's disk. Specify this as a path as opposed to a URI (i.e. do not provide a scheme).
spark.kubernetes.authenticate.submission.oauthToken (none) OAuth token to use when authenticating against the Kubernetes API server when starting the driver. Note that unlike the other authentication options, this is expected to be the exact string value of the token to use for the authentication.
spark.kubernetes.authenticate.driver.caCertFile (none) Path to the CA cert file for connecting to the Kubernetes API server over TLS from the driver pod when requesting executors. This file must be located on the submitting machine's disk, and will be uploaded to the driver pod. Specify this as a path as opposed to a URI (i.e. do not provide a scheme).
spark.kubernetes.authenticate.driver.clientKeyFile (none) Path to the client key file for authenticating against the Kubernetes API server from the driver pod when requesting executors. This file must be located on the submitting machine's disk, and will be uploaded to the driver pod. Specify this as a path as opposed to a URI (i.e. do not provide a scheme). If this is specified, it is highly recommended to set up TLS for the driver submission server, as this value is sensitive information that would be passed to the driver pod in plaintext otherwise.
spark.kubernetes.authenticate.driver.clientCertFile (none) Path to the client cert file for authenticating against the Kubernetes API server from the driver pod when requesting executors. This file must be located on the submitting machine's disk, and will be uploaded to the driver pod. Specify this as a path as opposed to a URI (i.e. do not provide a scheme).
spark.kubernetes.authenticate.driver.oauthToken (none) OAuth token to use when authenticating against the Kubernetes API server from the driver pod when requesting executors. Note that unlike the other authentication options, this must be the exact string value of the token to use for the authentication. This token value is uploaded to the driver pod. If this is specified, it is highly recommended to set up TLS for the driver submission server, as this value is sensitive information that would be passed to the driver pod in plaintext otherwise.
spark.kubernetes.authenticate.driver.serviceAccountName default Service account that is used when running the driver pod. The driver pod uses this service account when requesting executor pods from the API server. Note that this cannot be specified alongside a CA cert file, client key file, client cert file, and/or OAuth token.
spark.kubernetes.authenticate.resourceStagingServer.caCertFile (none) Path to the CA cert file for connecting to the Kubernetes API server over TLS from the resource staging server when it monitors objects in determining when to clean up resource bundles.
spark.kubernetes.authenticate.resourceStagingServer.clientKeyFile (none) Path to the client key file for authenticating against the Kubernetes API server from the resource staging server when it monitors objects in determining when to clean up resource bundles. The resource staging server must have credentials that allow it to view API objects in any namespace.
spark.kubernetes.authenticate.resourceStagingServer.clientCertFile (none) Path to the client cert file for authenticating against the Kubernetes API server from the resource staging server when it monitors objects in determining when to clean up resource bundles. The resource staging server must have credentials that allow it to view API objects in any namespace.
spark.kubernetes.authenticate.resourceStagingServer.oauthToken (none) OAuth token value for authenticating against the Kubernetes API server from the resource staging server when it monitors objects in determining when to clean up resource bundles. The resource staging server must have credentials that allow it to view API objects in any namespace. Note that this cannot be set at the same time as spark.kubernetes.authenticate.resourceStagingServer.oauthTokenFile.
spark.kubernetes.authenticate.resourceStagingServer.oauthTokenFile (none) File containing the OAuth token to use when authenticating against the Kubernetes API server from the resource staging server, when it monitors objects in determining when to clean up resource bundles. The resource staging server must have credentials that allow it to view API objects in any namespace. Note that this cannot be set at the same time as spark.kubernetes.authenticate.resourceStagingServer.oauthToken.
spark.kubernetes.authenticate.resourceStagingServer.useServiceAccountCredentials true Whether or not to use a service account token and a service account CA certificate when the resource staging server authenticates to Kubernetes. If this is set to true, interactions with Kubernetes will authenticate using a token located at /var/run/secrets/kubernetes.io/serviceaccount/token and the CA certificate located at /var/run/secrets/kubernetes.io/serviceaccount/ca.crt. Note that if spark.kubernetes.authenticate.resourceStagingServer.oauthTokenFile is set, it takes precedence over the usage of the service account token file. Also, if spark.kubernetes.authenticate.resourceStagingServer.caCertFile is set, it takes precedence over using the service account's CA certificate file. This generally should be set to true (the default value) when the resource staging server is deployed as a Kubernetes pod, but should be set to false if the resource staging server is deployed by other means (e.g. when running the staging server process outside of Kubernetes). The resource staging server must have credentials that allow it to view API objects in any namespace.
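spark.kubernetes.executor.memoryOverhead executor memory * 0.10, with a minimum of 384M The amount of off-heap memory to allocate per executor as additional overhead. Units such as k, m, and g may be used. This memory accounts for VM overhead and other native overhead. Size it according to the executor size (typically 6% to 10%).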
spark.kubernetes.driver.label.[LabelName] (none) Add the label specified by LabelName to the driver pod. For example, spark.kubernetes.driver.label.something=true. Note that Spark also adds its own labels to the driver pod for bookkeeping purposes.
spark.kubernetes.driver.annotation.[AnnotationName] (none) Add the annotation specified by AnnotationName to the driver pod. For example, spark.kubernetes.driver.annotation.something=true.
spark.kubernetes.executor.label.[LabelName] (none) Add the label specified by LabelName to the executor pods. For example, spark.kubernetes.executor.label.something=true. Note that Spark also adds its own labels to the executor pods for bookkeeping purposes.
spark.kubernetes.executor.annotation.[AnnotationName] (none) Add the annotation specified by AnnotationName to the executor pods. For example, spark.kubernetes.executor.annotation.something=true.
spark.kubernetes.driver.pod.name (none) Name of the driver pod. If not set, the driver pod name is set to the value of spark.app.name with the current timestamp appended as a suffix to avoid name conflicts.
spark.kubernetes.submission.waitAppCompletion true In cluster mode, whether to wait for the application to finish before exiting the launcher process. When changed to false, the launcher has a "fire-and-forget" behavior when launching the Spark job.
spark.kubernetes.resourceStagingServer.port 10000 Port on which the resource staging server listens once it is deployed.
spark.kubernetes.resourceStagingServer.uri (none) URI of the resource staging server that Spark should use to distribute the application's local dependencies. Note that by default, this URI must be reachable by both the submitting machine and the pods running in the cluster. If one URI is not simultaneously reachable both by the submitter and the driver/executor pods, configure the pods to access the staging server at a different URI by setting spark.kubernetes.resourceStagingServer.internal.uri as discussed below.
spark.kubernetes.resourceStagingServer.internal.uri Value of spark.kubernetes.resourceStagingServer.uri URI of the resource staging server to communicate with when init-containers bootstrap the driver and executor pods with submitted local dependencies. Note that this URI must be reachable by the pods running in the cluster. This is useful to set if the resource staging server has a separate "internal" URI that must be accessed by components running in the cluster.
spark.ssl.kubernetes.resourceStagingServer.internal.trustStore Value of spark.ssl.kubernetes.resourceStagingServer.trustStore Location of the trustStore file to use when communicating with the resource staging server over TLS, as init-containers bootstrap the driver and executor pods with submitted local dependencies. This can be a URI with a scheme of local://, which denotes that the file is pre-mounted on the pod's disk. A URI without a scheme or with a scheme of file:// will result in this file being mounted from the submitting machine's disk as a secret into the init-containers.
spark.ssl.kubernetes.resourceStagingServer.internal.trustStorePassword Value of spark.ssl.kubernetes.resourceStagingServer.trustStorePassword Password of the trustStore file that is used when communicating with the resource staging server over TLS, as init-containers bootstrap the driver and executor pods with submitted local dependencies.
spark.ssl.kubernetes.resourceStagingServer.internal.trustStoreType Value of spark.ssl.kubernetes.resourceStagingServer.trustStoreType Type of the trustStore file that is used when communicating with the resource staging server over TLS, when init-containers bootstrap the driver and executor pods with submitted local dependencies.
spark.ssl.kubernetes.resourceStagingServer.internal.clientCertPem Value of spark.ssl.kubernetes.resourceStagingServer.clientCertPem Location of the certificate file to use when communicating with the resource staging server over TLS, as init-containers bootstrap the driver and executor pods with submitted local dependencies. This can be a URI with a scheme of local://, which denotes that the file is pre-mounted on the pod's disk. A URI without a scheme or with a scheme of file:// will result in this file being mounted from the submitting machine's disk as a secret into the init-containers.
spark.kubernetes.mountdependencies.jarsDownloadDir /var/spark-data/spark-jars Location to download jars to in the driver and executors. This path is mounted as an empty dir volume in the driver and executor containers.
spark.kubernetes.mountdependencies.filesDownloadDir /var/spark-data/spark-files Location to download files to in the driver and executors. This path is mounted as an empty dir volume in the driver and executor containers.
spark.kubernetes.report.interval 1s Interval between reports of the current Spark job status in cluster mode.
spark.kubernetes.docker.image.pullPolicy IfNotPresent Docker image pull policy to use in Kubernetes.
spark.kubernetes.driver.limit.cores (none) Hard CPU limit for the driver pod.
spark.kubernetes.executor.limit.cores (none) Hard CPU limit for a single executor pod.
spark.kubernetes.node.selector.[labelKey] (none) Adds to the node selector of the driver pod and executor pods, with key labelKey and the value as the configuration's value. For example, setting spark.kubernetes.node.selector.identifier to myIdentifier will result in the driver pod and executors having a node selector with key identifier and value myIdentifier. Multiple node selector keys can be added by setting multiple configurations with this prefix.
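spark.executorEnv.[EnvironmentVariableName] (none) Add the environment variable specified by EnvironmentVariableName to the executor process. Multiple environment variables can be set by specifying multiple properties with this prefix.
spark.kubernetes.driverEnv.[EnvironmentVariableName] (none) Add the environment variable specified by EnvironmentVariableName to the driver process. Multiple environment variables can be set by specifying multiple properties with this prefix.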
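
As an illustration of how several of these properties might be passed together, the following minimal sketch turns a dictionary of properties into --conf arguments for spark-submit. The helper build_conf_args, the namespace spark-jobs, the label team=analytics, and the environment variable LOG_LEVEL are hypothetical values chosen for this example; only the property names come from the table.

```python
import shlex


def build_conf_args(conf):
    """Turn a dict of Spark properties into a flat list of --conf arguments.

    Hypothetical helper for illustration; it does not call spark-submit.
    """
    args = []
    for key in sorted(conf):
        args.extend(["--conf", f"{key}={conf[key]}"])
    return args


conf = {
    "spark.kubernetes.namespace": "spark-jobs",  # hypothetical namespace
    "spark.kubernetes.driver.docker.image": "spark-driver:2.2.0",
    "spark.kubernetes.executor.docker.image": "spark-executor:2.2.0",
    "spark.kubernetes.driver.label.team": "analytics",  # hypothetical label
    "spark.kubernetes.node.selector.identifier": "myIdentifier",
    "spark.executorEnv.LOG_LEVEL": "INFO",  # hypothetical environment variable
}

conf_args = build_conf_args(conf)
# Print the arguments in a form that could be pasted after "spark-submit".
print(" ".join(shlex.quote(arg) for arg in conf_args))
```

Keeping the properties in a dictionary makes it straightforward to add one entry per label, annotation, node selector key, or environment variable, which is how the prefixed properties such as spark.kubernetes.node.selector.[labelKey] are meant to be repeated.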
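
The value of spark.kubernetes.shuffle.labels is described as a comma-separated list of key=value pairs. The sketch below shows one way such a string could be split into a label dictionary. The function parse_labels and the sample labels are hypothetical; this is not Spark's own parser.

```python
def parse_labels(value):
    """Parse 'k1=v1,k2=v2' into a dict, rejecting entries without a key or '='."""
    labels = {}
    for pair in value.split(","):
        pair = pair.strip()
        if not pair:
            continue  # tolerate stray commas and surrounding whitespace
        key, sep, val = pair.partition("=")
        if not sep or not key.strip():
            raise ValueError(f"label must be in key=value format: {pair!r}")
        labels[key.strip()] = val.strip()
    return labels


# Hypothetical labels for a shuffle service pod.
shuffle_labels = parse_labels("app=spark-shuffle-service,spark-version=2.2.0")
print(shuffle_labels)
```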
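
The interaction between spark.kubernetes.allocation.batch.size and spark.kubernetes.allocation.batch.delay can be estimated with simple arithmetic. The sketch below assumes that each round requests at most batch.size pods and that the delay is applied only between consecutive rounds. It ignores scheduling and image pull time, so it gives a lower bound rather than a prediction. The function names are hypothetical.

```python
import math


def allocation_rounds(total_executors, batch_size=5):
    """Number of allocation rounds needed to request all executor pods."""
    if total_executors <= 0:
        return 0
    return math.ceil(total_executors / batch_size)


def min_allocation_seconds(total_executors, batch_size=5, batch_delay=1):
    """Lower bound on the time spent waiting between rounds (assumption:
    no wait is needed before the first round)."""
    rounds = allocation_rounds(total_executors, batch_size)
    return max(rounds - 1, 0) * batch_delay


# With the defaults, 12 executors need 3 rounds (5 + 5 + 2 pods).
print(allocation_rounds(12), min_allocation_seconds(12))
```

Under these assumptions, raising the batch size reduces the number of rounds, while raising the delay spreads the load on the API server over a longer period.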
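
The default for spark.kubernetes.executor.memoryOverhead is given as executor memory * 0.10 with a minimum of 384M. A minimal sketch of that rule, working in MiB, is shown below. Truncating the fractional part is an assumption of this example, and the function name is hypothetical.

```python
MIN_OVERHEAD_MIB = 384   # minimum stated in the table
OVERHEAD_FACTOR = 0.10   # fraction of executor memory stated in the table


def default_memory_overhead_mib(executor_memory_mib):
    """Default overhead: 10% of executor memory, but never below 384 MiB."""
    # int() truncation is an assumption made for this sketch.
    return max(int(executor_memory_mib * OVERHEAD_FACTOR), MIN_OVERHEAD_MIB)


for memory_mib in (1024, 4096, 8192):
    print(memory_mib, default_memory_overhead_mib(memory_mib))
```

Because of the 384M floor, the default overhead only starts growing once executor memory exceeds roughly 3.75 GiB.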
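
The precedence rules described for the spark.kubernetes.authenticate.resourceStagingServer.* properties can be summarized in a short sketch. The function resolve_credentials is hypothetical and only mirrors the rules stated in the table: oauthToken and oauthTokenFile are mutually exclusive, an explicit oauthTokenFile or caCertFile takes precedence, and the service account files are used as fallbacks when useServiceAccountCredentials is true. Treating an explicit oauthToken as also overriding the service account token is an assumption of this example.

```python
SA_DIR = "/var/run/secrets/kubernetes.io/serviceaccount"
PREFIX = "spark.kubernetes.authenticate.resourceStagingServer."


def resolve_credentials(conf):
    """Return the effective token and CA settings for the staging server."""
    token = conf.get(PREFIX + "oauthToken")
    token_file = conf.get(PREFIX + "oauthTokenFile")
    ca_file = conf.get(PREFIX + "caCertFile")
    if token is not None and token_file is not None:
        raise ValueError("oauthToken and oauthTokenFile cannot both be set")

    flag = conf.get(PREFIX + "useServiceAccountCredentials", "true")
    use_service_account = str(flag).lower() == "true"
    if use_service_account:
        # Explicit settings win; service account files are only fallbacks.
        if token is None and token_file is None:  # oauthToken case is assumed
            token_file = SA_DIR + "/token"
        if ca_file is None:
            ca_file = SA_DIR + "/ca.crt"
    return {"oauthToken": token, "oauthTokenFile": token_file, "caCertFile": ca_file}


# Defaults: both values come from the mounted service account.
print(resolve_credentials({}))
# A hypothetical explicit token file overrides the service account token.
print(resolve_credentials({PREFIX + "oauthTokenFile": "/etc/rss/token"}))
```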
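
Several resourceStagingServer.internal.* properties default to the value of their non-internal counterpart. The sketch below models that fallback with a lookup table. The function effective_value and the two URIs are hypothetical; the property names and the fallback relationships are taken from the table.

```python
RSS = "kubernetes.resourceStagingServer."

# Each "internal" property falls back to its non-internal counterpart.
FALLBACKS = {
    "spark." + RSS + "internal.uri": "spark." + RSS + "uri",
    "spark.ssl." + RSS + "internal.trustStore": "spark.ssl." + RSS + "trustStore",
    "spark.ssl." + RSS + "internal.trustStorePassword": "spark.ssl." + RSS + "trustStorePassword",
    "spark.ssl." + RSS + "internal.trustStoreType": "spark.ssl." + RSS + "trustStoreType",
    "spark.ssl." + RSS + "internal.clientCertPem": "spark.ssl." + RSS + "clientCertPem",
}


def effective_value(conf, key):
    """Return conf[key] if set, otherwise the value of its fallback property."""
    if key in conf:
        return conf[key]
    fallback = FALLBACKS.get(key)
    return conf.get(fallback) if fallback is not None else None


internal_key = "spark." + RSS + "internal.uri"
external_only = {"spark." + RSS + "uri": "https://staging.example.com:10000"}  # hypothetical URI
print(effective_value(external_only, internal_key))

# A hypothetical in-cluster address overrides the external URI for pods.
both = dict(external_only)
both[internal_key] = "https://staging.internal:10000"
print(effective_value(both, internal_key))
```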