Alibaba Cloud Data Lake Analytics (DLA) provides solutions that support the Spark read-eval-print loop (REPL) feature. You can either install JupyterLab and the Livy proxy of DLA on your on-premises machine, or use the Docker image provided by DLA to quickly start JupyterLab. Both solutions connect JupyterLab to the serverless Spark engine of DLA. After the connection is established, you can perform interactive testing and process data by using the elastic resources of DLA.
Usage notes
- The serverless Spark engine of DLA supports JupyterLab interactive jobs that are programmed in Python 3.0 or Scala 2.11.
- JupyterLab of the latest version supports Python 3.6 and later.
- To develop a JupyterLab interactive job, we recommend that you use the Docker image to quickly start JupyterLab. For more information, see Use the Docker image to quickly start JupyterLab.
- JupyterLab interactive jobs are automatically released after they are idle for a specified period of time. By default, a JupyterLab interactive job is released 1200 seconds after the last code block of the job is executed. You can use the spark.dla.session.ttl parameter to configure the idle period after which a JupyterLab interactive job is automatically released.
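For example, the idle timeout could be raised from a notebook cell by passing the parameter in the session configuration through the %%configure magic described later in this topic. The following cell is a minimal sketch: the 3600-second value is only an illustration, and the plain-seconds value format is an assumption based on the 1200-second default mentioned above.

    %%configure -f
    {
      "conf": {
        "spark.dla.connectors": "oss",
        "spark.dla.session.ttl": "3600"
      }
    }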
Install the Livy proxy of DLA and JupyterLab on your on-premises machine
- Install the Livy proxy of DLA.
- Install SDK for Python aliyun-python-sdk-openanalytics-open 2.0.5.
Note: The version of SDK for Python must be 2.0.4 or later.
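A sketch of the corresponding pip command, assuming the package is installed from PyPI under the name and version given above:

    pip install aliyun-python-sdk-openanalytics-open==2.0.5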
- Run the following command to install the Livy proxy of DLA:
pip install aliyun-dla-livy-proxy-0.0.4.zip
Note: You must install the Livy proxy of DLA as the root user. Otherwise, the dlaproxy command may not be registered in the command search path. After the Livy proxy of DLA is installed, the dlaproxy command is available in the command line.
- Start the Livy proxy of DLA.
The Livy proxy of DLA translates an interface of DLA into the Apache Livy interface that Sparkmagic requires. When you start the Livy proxy of DLA, it deploys a local HTTP proxy that listens on a port and forwards requests. By default, port 5000 is used for listening.

    # View the usage of the dlaproxy command.
    $ dlaproxy -h
    usage: dlaproxy [-h] --vcname VCNAME -i AK -k SECRET --region REGION
                    [--host HOST] [--port PORT] [--loglevel LOGLEVEL]

    Proxy AliYun DLA as Livy

    optional arguments:
      -h, --help            show this help message and exit
      --vcname VCNAME       Virtual Cluster Name
      -i AK, --access-key-id AK
                            Aliyun Access Key Id
      -k SECRET, --access-key-secret SECRET
                            Aliyun Access Key Secret
      --region REGION       Aliyun Region Id
      --host HOST           Proxy Host Ip
      --port PORT           Proxy Host Port
      --loglevel LOGLEVEL   python standard log level

    # Start the Livy proxy of DLA.
    dlaproxy --vcname <vcname> -i akid -k aksec --region <regionid>
The following table describes the parameters in the preceding code.
| Parameter | Description |
| --- | --- |
| --vcname | The name of the Spark virtual cluster in DLA. Note: To query the cluster name, log on to the DLA console and click Virtual Cluster management in the left-side navigation pane. On the Virtual Cluster management page, find your cluster and click Details in the Actions column. |
| -i | The AccessKey ID of the Resource Access Management (RAM) user. Note: If you have created an AccessKey pair for the RAM user, you can view the value in the RAM console. For more information about how to create and view an AccessKey pair, see Create an AccessKey pair for a RAM user. |
| -k | The AccessKey secret of the RAM user. Note: If you have created an AccessKey pair for the RAM user, you can view the value in the RAM console. For more information about how to create and view an AccessKey pair, see Create an AccessKey pair for a RAM user. |
| --region | The ID of the region. For more information, see Regions and zones. |
| --host | The proxy host IP address of DLA. Default value: 127.0.0.1. This address accepts only local requests. You can change the value to 0.0.0.0 or another address to listen for requests from the Internet or an internal network. We recommend that you use the default value. |
| --port | The listening port. Default value: 5000. We recommend that you use the default value, but you can change it if needed. |
| --loglevel | The log level. Valid values: ERROR, WARNING, INFO, and DEBUG. Default value: INFO. We recommend that you use the default value. |
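Before you connect Sparkmagic, you can optionally confirm that the proxy is reachable from another terminal. The check below is a sketch that assumes the proxy exposes the standard Livy REST endpoint for listing sessions (GET /sessions); adjust the host and port if you changed the defaults.

    # List Livy sessions through the local DLA proxy (default 127.0.0.1:5000).
    curl http://127.0.0.1:5000/sessions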
- Install JupyterLab.
- Optional: Install a virtual environment, as shown in the sketch after this note.
Note: We recommend that you install JupyterLab in a virtual environment. This prevents subsequent installations from affecting the system-wide Python environment on your machine.
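A minimal sketch that uses Python's built-in venv module; the document does not prescribe a specific virtual environment tool, and the environment name below is only a placeholder.

    # Create and activate a virtual environment for JupyterLab.
    python3 -m venv dla-jupyter-env
    source dla-jupyter-env/bin/activate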
- Run the following commands to install JupyterLab:
    # Install JupyterLab.
    pip install jupyterlab
    # Check whether JupyterLab is installed. If the start log of JupyterLab is displayed, JupyterLab is installed.
    jupyter lab
- Perform the following steps to install Sparkmagic:
- Install the Sparkmagic library.
pip install sparkmagic
- Enable nbextension.
jupyter nbextension enable --py --sys-prefix widgetsnbextension
- If you use JupyterLab, run the following command to install JupyterLab labextension:
jupyter labextension install "@jupyter-widgets/jupyterlab-manager"
- Run the pip show sparkmagic command to query the path in which Sparkmagic is installed. Then, run the following commands in that path to install the kernels (see the sketch after these commands for one way to locate the path):

    jupyter-kernelspec install sparkmagic/kernels/sparkkernel
    jupyter-kernelspec install sparkmagic/kernels/pysparkkernel
    jupyter-kernelspec install sparkmagic/kernels/sparkrkernel
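The following shell sketch shows one way to change into the Sparkmagic installation directory before running the jupyter-kernelspec commands; it assumes that pip show prints a Location: line, which is standard pip behavior.

    # Locate the Sparkmagic installation path reported by pip and change into it.
    SPARKMAGIC_DIR="$(pip show sparkmagic | grep '^Location' | awk '{print $2}')"
    cd "$SPARKMAGIC_DIR"
    # Then run the jupyter-kernelspec install commands shown above from this directory.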
- Modify the configuration file config.json in the ~/.sparkmagic/ path. For more information about sample configurations, see example_config.json.
- Enable Sparkmagic.
jupyter serverextension enable --py sparkmagic
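Optionally, you can confirm that the Sparkmagic kernels were registered. The jupyter kernelspec list command is a standard Jupyter command; if the previous steps succeeded, its output should include entries such as sparkkernel, pysparkkernel, and sparkrkernel.

    # List the installed Jupyter kernels to verify that the Sparkmagic kernels are registered.
    jupyter kernelspec list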
After you install Sparkmagic, you must manually create the configuration file config.json in the ~/.sparkmagic path and direct the url settings to the local proxy server. Sample code in the config.json file:

    {
      "kernel_python_credentials": {"username": "", "password": "", "url": "http://127.0.0.1:5000", "auth": "None"},
      "kernel_scala_credentials": {"username": "", "password": "", "url": "http://127.0.0.1:5000", "auth": "None"},
      "kernel_r_credentials": {"username": "", "password": "", "url": "http://localhost:5000"},
      "logging_config": {
        "version": 1,
        "formatters": {"magicsFormatter": {"format": "%(asctime)s\t%(levelname)s\t%(message)s", "datefmt": ""}},
        "handlers": {"magicsHandler": {"class": "hdijupyterutils.filehandler.MagicsFileHandler", "formatter": "magicsFormatter", "home_path": "~/.sparkmagic"}},
        "loggers": {"magicsLogger": {"handlers": ["magicsHandler"], "level": "DEBUG", "propagate": 0}}
      },
      "wait_for_idle_timeout_seconds": 15,
      "livy_session_startup_timeout_seconds": 600,
      "fatal_error_suggestion": "The code failed because of a fatal error:\n\t{}.\n\nSome things to try:\na) Make sure Spark has enough available resources for Jupyter to create a Spark context.\nb) Contact your Jupyter administrator to make sure the Spark magics library is configured correctly.\nc) Restart the kernel.",
      "ignore_ssl_errors": false,
      "session_configs": {"conf": {"spark.dla.connectors": "oss"}},
      "use_auto_viz": true,
      "coerce_dataframe": true,
      "max_results_sql": 2500,
      "pyspark_dataframe_encoding": "utf-8",
      "heartbeat_refresh_seconds": 30,
      "livy_server_heartbeat_timeout_seconds": 0,
      "heartbeat_retry_seconds": 10,
      "server_extension_default_kernel_name": "pysparkkernel",
      "custom_headers": {},
      "retry_policy": "configurable",
      "retry_seconds_to_sleep_list": [0.2, 0.5, 1, 3, 5],
      "configurable_retry_policy_max_retries": 8
    }
Note: The settings of session_configs in the sample code are the same as the settings of conf in the configurations of a job that you submit to the serverless Spark engine of DLA. If you want to load JAR packages for the job or use the serverless Spark engine to access the metadata service of DLA, see Configure a Spark job.

After you start the Livy proxy of DLA, the default URL for listening is http://127.0.0.1:5000. If you change the host IP address or port number in the default URL, you must change the value of the url parameter in the sample code. For example, if you set the --host parameter to 192.168.1.3 and the --port parameter to 8080 when you start the Livy proxy of DLA, you must change the value of the url parameter to http://192.168.1.3:8080. The following code shows the new configurations in the config.json file:

    {
      "kernel_python_credentials": {"username": "", "password": "", "url": "http://192.168.1.3:8080", "auth": "None"},
      "kernel_scala_credentials": {"username": "", "password": "", "url": "http://192.168.1.3:8080", "auth": "None"},
      "kernel_r_credentials": {"username": "", "password": "", "url": "http://192.168.1.3:8080"},
      "logging_config": {
        "version": 1,
        "formatters": {"magicsFormatter": {"format": "%(asctime)s\t%(levelname)s\t%(message)s", "datefmt": ""}},
        "handlers": {"magicsHandler": {"class": "hdijupyterutils.filehandler.MagicsFileHandler", "formatter": "magicsFormatter", "home_path": "~/.sparkmagic"}},
        "loggers": {"magicsLogger": {"handlers": ["magicsHandler"], "level": "DEBUG", "propagate": 0}}
      },
      "wait_for_idle_timeout_seconds": 15,
      "livy_session_startup_timeout_seconds": 600,
      "fatal_error_suggestion": "The code failed because of a fatal error:\n\t{}.\n\nSome things to try:\na) Make sure Spark has enough available resources for Jupyter to create a Spark context.\nb) Contact your Jupyter administrator to make sure the Spark magics library is configured correctly.\nc) Restart the kernel.",
      "ignore_ssl_errors": false,
      "session_configs": {"conf": {"spark.dla.connectors": "oss"}},
      "use_auto_viz": true,
      "coerce_dataframe": true,
      "max_results_sql": 2500,
      "pyspark_dataframe_encoding": "utf-8",
      "heartbeat_refresh_seconds": 30,
      "livy_server_heartbeat_timeout_seconds": 0,
      "heartbeat_retry_seconds": 10,
      "server_extension_default_kernel_name": "pysparkkernel",
      "custom_headers": {},
      "retry_policy": "configurable",
      "retry_seconds_to_sleep_list": [0.2, 0.5, 1, 3, 5],
      "configurable_retry_policy_max_retries": 8
    }
- Start JupyterLab.
    # Restart JupyterLab.
    jupyter lab
    # Start the Livy proxy of DLA.
    dlaproxy --vcname <vcname> -i akid -k aksec --region <regionid>
After you start JupyterLab, the URL that is used to access JupyterLab is displayed in the start log of JupyterLab, as shown in the following figure.
If the message Aliyun DLA Proxy is ready appears, the Livy proxy of DLA is started. After the Livy proxy of DLA is started, you can use JupyterLab. For more information about how to use JupyterLab, see the JupyterLab official documentation.

When you run a JupyterLab task, DLA automatically creates a Spark job. To view and manage the Spark job, log on to the DLA console and choose Serverless Spark > Submit job in the left-side navigation pane. In the following figure, the Spark jobs whose names start with notebook_ are JupyterLab interactive jobs.

After you start JupyterLab, you can still modify the configurations of the Spark job by using the magic command. If you run the magic command, the new configurations overwrite the original configurations, and JupyterLab restarts the Spark job based on the new configurations. For example:
%%configure -f{ "conf": { "spark.sql.hive.metastore.version": "dla", "spark.dla.connectors": "oss" }}
To use custom dependencies, you can use the following format:
%%configure -f{ "conf": { ... }, "pyFiles": "oss://{your bucket name}/{path}/*.zip" # module}
- Terminate a JupyterLab job.
In the top menu bar of JupyterLab, choose Kernel > Restart Kernel.
Use the Docker image to quickly start JupyterLab
You can use the Docker image provided by DLA to quickly start JupyterLab. For more information about how to install and use a Docker image, see Docker Documentation.
- After you install and start Docker, run the following command to pull the JupyterLab image of DLA:
docker pull registry.cn-hangzhou.aliyuncs.com/dla_spark/dla-jupyter:0.4
- After you pull the image, run the following command to view the help information of the image:
    docker run -ti registry.cn-hangzhou.aliyuncs.com/dla_spark/dla-jupyter:0.2
    Used to run jupyter lab for Aliyun DLA
    Usage example: docker run -it -p 8888:8888 dla-jupyter:0.1 -i akid -k aksec -r cn-hanghzou -c spark-vc -l INFO
      -i Aliyun AkId
      -k Aliyun AkSec
      -r Aliyun Region Id
      -c Aliyun DLA Virtual cluster name
      -l LogLevel
The parameters in the preceding code are similar to the parameters of the DLA Livy proxy. The following table describes the parameters in the preceding code.
| Parameter | Description |
| --- | --- |
| -c | The name of the Spark virtual cluster in DLA. Note: To query the cluster name, log on to the DLA console and click Virtual Cluster management in the left-side navigation pane. On the Virtual Cluster management page, find your cluster and click Details in the Actions column. |
| -i | The AccessKey ID of the RAM user. Note: If you have created an AccessKey pair for the RAM user, you can view the value in the RAM console. For more information about how to create and view an AccessKey pair, see Create an AccessKey pair for a RAM user. |
| -k | The AccessKey secret of the RAM user. Note: If you have created an AccessKey pair for the RAM user, you can view the value in the RAM console. For more information about how to create and view an AccessKey pair, see Create an AccessKey pair for a RAM user. |
| -r | The ID of the region. For more information, see Regions and zones. |
| -l | The log level. Valid values: ERROR, WARNING, INFO, and DEBUG. Default value: INFO. We recommend that you use the default value. |

- After you set the parameters to appropriate values, run the following command to start JupyterLab:
docker run -it -p 8888:8888 registry.cn-hangzhou.aliyuncs.com/dla_spark/dla-jupyter:0.2 -i {AkId} -k {AkSec} -r {RegionId} -c {VcName}
If the information in the following figure is displayed, JupyterLab is started. Copy the URL in the red box and paste it into the address bar of a browser to connect to DLA by using JupyterLab.
- When you troubleshoot issues, check the dlaproxy.log file. If the information in the following figure appears in the log file, JupyterLab is started.
- You must mount a host path into the Docker container. Otherwise, the notebooks that you are editing are automatically deleted when the container is stopped. When the container is stopped, the system also automatically attempts to terminate all JupyterLab interactive jobs that are running. To avoid losing your work, you can use one of the following solutions:
- Before you stop the Docker container, back up all the files that you need.
- Mount a host path into the Docker container and save your notebook files to that path inside the container.
For example, in Linux, if you want to mount the host path /home/admin/notebook to the container path /root/notebook, run the following command:

    docker run -it --privileged=true -p 8888:8888 -v /home/admin/notebook:/root/notebook registry.cn-hangzhou.aliyuncs.com/dla_spark/dla-jupyter:0.2 -i {AkId} -k {AkSec} -r {RegionId} -c {VcName}

Save the notebooks that you are editing to the /root/notebook path in the container. This way, you can view the files in the /home/admin/notebook path on the host and continue to use the notebooks the next time the container starts.
Note: For more information, see Use volumes.
FAQ
- Problem description: JupyterLab fails to start and the following error messages appear:

      [C 09:53:15.840 LabApp] Bad config encountered during initialization:
      [C 09:53:15.840 LabApp] Could not decode '\xe6\x9c\xaa\xe5\x91\xbd\xe5\x90\x8d' for unicode trait 'untitled_notebook' of a LargeFileManager instance.

  Solution: Run LANG=zn jupyter lab.
- Problem description: The following error message appears:

      $ jupyter nbextension enable --py --sys-prefix widgetsnbextension
      Enabling notebook extension jupyter-js-widgets/extension...
            - Validating: problems found:
              - require?  X jupyter-js-widgets/extension

  Solution: Run the jupyter nbextension install --py widgetsnbextension --user and jupyter nbextension enable widgetsnbextension --user --py commands.
- Problem description: The following error message appears:

      ValueError: Please install nodejs >=12.0.0 before continuing. nodejs may be installed using conda or directly from the nodejs website.

  Solution: Run the conda install nodejs command. For more information about how to install Conda, see the Conda official documentation.
- Problem description: Sparkmagic fails to be installed and the error message in the following figure appears.

  Solution: Install Rust.
- Problem description: I fail to create charts by using Matplotlib. The error information in the following figure is displayed after I run the %matplotlib inline command.

  Solution: If you use PySpark in the cloud, run the %matplot plt command instead of the plt.show() function to create a chart, as shown in the following figure.
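For reference, the following notebook cell is a minimal sketch of this pattern in a PySpark session; the data points are arbitrary placeholders, and %matplot plt is the Sparkmagic magic mentioned above.

    import matplotlib.pyplot as plt

    # Build a simple chart; the data points are placeholders.
    plt.plot([1, 2, 3, 4], [10, 20, 15, 30])
    plt.title("Sample chart")
    # Render the figure in the notebook output instead of calling plt.show().
    %matplot plt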