Configuring your Microsoft Windows environment for Apache Spark in 2023
Python Edition
Apache Spark™ is a multi-language engine for executing data engineering, data science, and machine learning on single-node machines or clusters.
The supported languages are Python, SQL, Scala, Java, and R. This article will focus on Python.
PySpark is the Python API for Apache Spark. PySpark combines Python’s learnability and ease of use with the power of Apache Spark to enable processing and analysis of data at any size for everyone familiar with Python.
PySpark supports all of Spark’s features such as Spark SQL, DataFrames, Structured Streaming, Machine Learning (MLlib) and Spark Core.
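To give a flavor of the API before the setup steps, here is a minimal sketch of a PySpark program (the data and column names are purely illustrative, and it assumes a working installation such as the one configured below):

```python
from pyspark.sql import SparkSession

# The SparkSession is the entry point to all PySpark functionality
spark = SparkSession.builder.appName("demo").getOrCreate()

# Build a small DataFrame in memory and run a simple aggregation
df = spark.createDataFrame(
    [("alice", 34), ("bob", 29), ("carol", 41)],
    ["name", "age"],
)
df.groupBy().avg("age").show()

spark.stop()
```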
There are two ways to set up a Microsoft Windows environment for PySpark:
1. For a cloud-managed instance involving notebooks, Databricks offers a limited free version called the Community Edition.
2. For a local install involving a Python Integrated Development Environment (IDE), follow the configuration steps below.
Step one. Check Prerequisites
Before installing PySpark, ensure that you have Python and a Java Development Kit (JDK) installed on your machine. You can download the latest version of Python from the official Python website (https://www.python.org/downloads/) and a JDK from Oracle (https://www.oracle.com/java/technologies/downloads/) or from the OpenJDK builds site (https://jdk.java.net/). Note that the Spark 3.x documentation lists Java 8, 11, and 17 as the supported versions, so an LTS release is the safest choice.
As of June 2023, these are the versions installed in my local environment for Python and Java respectively:
```
Microsoft Windows [Version 10.0.19044.2965]
© Microsoft Corporation. All rights reserved.

C:\Users\badesanya>python --version
Python 3.11.4

C:\Users\badesanya>java --version
java 20 2023-03-21
Java(TM) SE Runtime Environment (build 20+36-2344)
Java HotSpot(TM) 64-Bit Server VM (build 20+36-2344, mixed mode, sharing)
```
Step two. Download Apache Spark
Visit the Apache Spark official website (https://spark.apache.org/downloads.html) and select the latest stable release of Spark. Choose the package pre-built for Apache Hadoop and download the .tgz archive.
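If you prefer the command line, recent Windows 10/11 builds ship with curl, so a download sketch looks like the following (the version number in the URL is illustrative and will change with new releases; older releases move to https://archive.apache.org/dist/spark/):

```
:: Download Spark 3.4.0 pre-built for Hadoop 3 (adjust versions as needed)
curl -L -o spark-3.4.0-bin-hadoop3.tgz https://dlcdn.apache.org/spark/spark-3.4.0/spark-3.4.0-bin-hadoop3.tgz
```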
Step three. Extract the Spark Package
Extract the downloaded .tgz archive to a directory of your choice, for example C:\spark.
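Windows 10 and later also bundle a tar command that understands .tgz archives, so the extraction can be scripted (paths are illustrative):

```
:: Create the target folder and unpack the archive into it
mkdir C:\spark
tar -xvzf spark-3.4.0-bin-hadoop3.tgz -C C:\spark
```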
Step four. Set up Environment Variables
To use Apache Spark, you need to set up the environment variables:
- Right-click on the “This PC” or “My Computer” icon and select “Properties.”
- Click on “Advanced system settings” on the left-hand side.
- In the System Properties window, click on the “Environment Variables” button.
- In the “User variables” section, click on “New.”
- Set the variable name as SPARK_HOME and the variable value as the path where you extracted Spark (e.g., C:\spark\spark-3.4.0-bin-hadoop3).
- Locate the Path variable in the “System variables” section and click on “Edit.”
- Add a new entry with %SPARK_HOME%\bin to the list of existing paths.
- Click “OK” to save the changes.
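Alternatively, SPARK_HOME can be set from a Command Prompt with setx, which persists a user-level variable (it only takes effect in newly opened terminals). Editing Path is still best done through the GUI, because setx truncates values longer than 1024 characters:

```
:: Persist SPARK_HOME for the current user (open a new terminal afterwards)
setx SPARK_HOME "C:\spark\spark-3.4.0-bin-hadoop3"
```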
Also set up the Java environment variables as follows, if not done already (a command-line sketch follows this list):
- Set the JAVA_HOME environment variable to the location of your JDK (version 11 or later).
- Ensure %JAVA_HOME%\bin is included in the Path environment variable.
- Confirm that the java --version command prints the Java version.
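A command-line sketch of the same Java setup (the JDK path is an example; adjust it to wherever your installer put the JDK):

```
:: Persist JAVA_HOME for the current user; the path below is illustrative
setx JAVA_HOME "C:\Program Files\Java\jdk-17"
:: Open a NEW terminal, then confirm the JDK is picked up
java --version
```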
Step five. Download Hadoop WinUtils
Download winutils from GitHub (https://github.com/cdarlint/winutils), choosing the folder that matches the Hadoop version of your Spark build (e.g., a hadoop-3.3.x folder for spark-3.4.0-bin-hadoop3).
Extract it to a folder and set the environment variables as follows (see the sketch after this list):
- Set the HADOOP_HOME environment variable to that folder location.
- Ensure %HADOOP_HOME%\bin is included in the Path environment variable.
- Also copy the winutils.exe file into the bin folder of your Spark installation (%SPARK_HOME%\bin).
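For example, assuming you extracted winutils to C:\hadoop (so that winutils.exe sits in C:\hadoop\bin; both paths are illustrative):

```
:: Persist HADOOP_HOME, then sanity-check winutils from a new terminal
setx HADOOP_HOME "C:\hadoop"
:: Listing C:\ proves winutils.exe runs and can find its DLLs
C:\hadoop\bin\winutils.exe ls C:\
```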
Step six. Verify the Installation
Open a new Command Prompt window (so the updated environment variables take effect) and run the pyspark command. If everything is set up correctly, you should see the Spark banner and a Python prompt (>>>).
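As a quick smoke test, you can type a one-liner at the >>> prompt. The pyspark shell creates a SparkSession named spark for you, and forcing a small computation confirms that Spark can actually schedule work:

```python
# `spark` is the SparkSession the pyspark shell creates automatically.
# Counting a generated range forces Spark to run a real job.
spark.range(5).count()  # should return 5
```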
Congratulations! You have successfully installed PySpark on your Windows computer. You can now start leveraging the power of PySpark to perform distributed data processing and analysis using Python.
Note: Apache Spark is typically used in a distributed environment, but you can run it in standalone mode on your local machine for development and testing purposes.
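Concretely, running in local mode just means pointing the session at the local machine. A minimal sketch for a standalone script (the app name is arbitrary):

```python
from pyspark.sql import SparkSession

# local[*] runs Spark in-process, with one worker thread per CPU core
spark = (
    SparkSession.builder
    .master("local[*]")
    .appName("local-dev")  # arbitrary name, shown in the Spark UI
    .getOrCreate()
)

print(spark.version)  # the Spark version you installed
spark.stop()
```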
Explore PySpark’s documentation (https://spark.apache.org/docs/latest/api/python/index.html) to learn more about its features and how to use it effectively for data engineering tasks.