Configuring your Microsoft Windows environment for Apache Spark in 2023

Bola Adesanya
Jun 30, 2023 · 3 min read


Python Edition


Apache Spark™ is a multi-language engine for executing data engineering, data science, and machine learning on single-node machines or clusters.

The supported languages are Python, SQL, Scala, Java, and R. This article will focus on Python.

PySpark is the Python API for Apache Spark. PySpark combines Python’s learnability and ease of use with the power of Apache Spark to enable processing and analysis of data at any size for everyone familiar with Python.

PySpark supports all of Spark’s features such as Spark SQL, DataFrames, Structured Streaming, Machine Learning (MLlib) and Spark Core.
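
As a quick taste of the API, here is a minimal sketch (runnable once the installation below is complete) that starts a local session, builds a DataFrame, and runs an aggregation:

from pyspark.sql import SparkSession

# Start a local SparkSession, the entry point to all PySpark functionality
spark = SparkSession.builder.master("local[*]").appName("demo").getOrCreate()

# Build a small DataFrame and compute an average with Spark SQL
df = spark.createDataFrame([("alice", 34), ("bob", 29)], ["name", "age"])
df.groupBy().avg("age").show()

spark.stop()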

There are two ways to set up a Microsoft Windows environment for PySpark.

1. For a cloud-managed, notebook-based experience, Databricks offers a free, limited version called the Community Edition (https://community.cloud.databricks.com).

2. For a local install used with a Python Integrated Development Environment (IDE), the configuration steps are as follows:

Step one. Check Prerequisites

Before installing PySpark, ensure that you have Python and a Java Development Kit (JDK) installed on your machine. You can download the latest version of Python from the official Python website (https://www.python.org/downloads/) and the JDK from Oracle's OpenJDK builds page (https://jdk.java.net/).

As of June 2023, these are the Python and Java versions installed in my local environment:

Microsoft Windows [Version 10.0.19044.2965]
(c) Microsoft Corporation. All rights reserved.

C:\Users\badesanya>python --version
Python 3.11.4

C:\Users\badesanya>java --version
java 20 2023-03-21
Java(TM) SE Runtime Environment (build 20+36-2344)
Java HotSpot(TM) 64-Bit Server VM (build 20+36-2344, mixed mode, sharing)

Step two. Download Apache Spark

Visit the Apache Spark official website (https://spark.apache.org/downloads.html) and select the latest stable release of Spark. Choose the package pre-built for Apache Hadoop and download the .tgz archive.

Step three. Extract the Spark Package

Extract the downloaded .tgz archive to a directory of your choice, for example C:\spark. (7-Zip or a similar tool can handle the .tgz format, or use the built-in tar command shown below.)
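
On Windows 10 and later, the built-in tar command can unpack the archive from a Command Prompt (a sketch; the archive name below assumes the Spark 3.4.0 / Hadoop 3 build and will vary with the release you downloaded):

mkdir C:\spark
tar -xvzf spark-3.4.0-bin-hadoop3.tgz -C C:\spark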

Step four. Set up Environment Variables

To use Apache Spark, you need to set up the following environment variables (a command-line alternative is sketched after this list):

  • Right-click on the “This PC” or “My Computer” icon and select “Properties.”
  • Click on “Advanced system settings” on the left-hand side.
  • In the System Properties window, click on the “Environment Variables” button.
  • In the “User variables” section, click on “New.”
  • Set the variable name as SPARK_HOME and the variable value as the path where you extracted Spark (e.g., C:\spark\spark-3.4.0-bin-hadoop3).
  • Locate the Path variable in the “System variables” section and click on “Edit.”
  • Add a new entry with %SPARK_HOME%\bin to the list of existing paths.
  • Click “OK” to save the changes.
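
If you prefer the command line, the same variables can be set with setx (a sketch, assuming the example path above; note that setx only affects newly opened console windows, and it truncates values longer than 1024 characters, so editing Path through the dialog is safer for long paths):

setx SPARK_HOME "C:\spark\spark-3.4.0-bin-hadoop3"
setx PATH "%PATH%;C:\spark\spark-3.4.0-bin-hadoop3\bin"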

If not already done, also set up the Java environment variables as follows (a quick verification is sketched after this list):

  • JAVA_HOME environment variable set to the JDK installation directory (JDK 11 or later).
  • %JAVA_HOME%\bin is included in the Path environment variable.
  • The java --version command shows the Java version.
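
In a fresh Command Prompt, both of these should resolve (a quick sanity check, assuming JAVA_HOME was set as above):

echo %JAVA_HOME%
java --version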

Step five. Download Hadoop WinUtils

Download the winutils.exe that matches your Hadoop version (for the pre-built package above, a Hadoop 3.3.x build) from GitHub (https://github.com/cdarlint/winutils).

Extract it to a folder and set the environment variables as follows (a scripted version follows the list):

  • HADOOP_HOME environment variable set to the folder containing bin (e.g., C:\hadoop).
  • %HADOOP_HOME%\bin is included in the Path environment variable.
  • Also copy the winutils.exe file to Spark's bin folder (%SPARK_HOME%\bin).
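
As with Spark, these steps can be scripted from a Command Prompt (a sketch; C:\hadoop is an assumed layout in which C:\hadoop\bin\winutils.exe exists, so substitute whatever folder you used):

setx HADOOP_HOME "C:\hadoop"
copy "C:\hadoop\bin\winutils.exe" "C:\spark\spark-3.4.0-bin-hadoop3\bin\"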

Open a new Command Prompt and run pyspark. If everything is set up correctly, you should see the Spark logo and a Python prompt (>>>).
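
From that prompt, a one-line job confirms Spark can actually execute work (the pyspark shell pre-creates a SparkSession named spark):

>>> spark.range(10).count()
10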

Congratulations! You have successfully installed PySpark on your Windows computer. You can now start leveraging the power of PySpark to perform distributed data processing and analysis using Python.

Note: Apache Spark is typically used in a distributed environment, but you can run it in local mode on your machine for development and testing purposes.
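
As an illustration (a minimal sketch; the file name wordcount.py and its contents are hypothetical), a local-mode script builds its own session and can be launched with spark-submit wordcount.py:

# wordcount.py
from pyspark.sql import SparkSession

# local[*] runs Spark in a single JVM using all available cores
spark = SparkSession.builder.master("local[*]").appName("wordcount").getOrCreate()

# Count occurrences of each word in a tiny in-memory dataset
words = spark.sparkContext.parallelize(["spark", "python", "spark"])
print(dict(words.countByValue()))  # {'spark': 2, 'python': 1}

spark.stop()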

Explore PySpark’s documentation (https://spark.apache.org/docs/latest/api/python/index.html) to learn more about its features and how to use it effectively for data engineering tasks.
