Apache Spark is an open-source distributed computing framework designed for fast, large-scale data processing. It is an in-memory engine: where possible it keeps working data in memory instead of writing intermediate results to disk.
Spark provides libraries for streaming, graph processing, SQL, and machine learning (MLlib), with APIs in Java, Python, Scala, and R. It is commonly installed on Hadoop clusters, but you can also install and configure Spark in standalone mode.
1) Install Java
java -version
# If Java is not installed, install it and check again:
sudo apt update
sudo apt install default-jre
java -version
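The check-then-install logic of this step can be scripted. The following is a minimal sketch, assuming a Debian/Ubuntu system; have_cmd is a hypothetical helper name, not part of any Java or Spark tooling.

```shell
#!/bin/sh
# have_cmd NAME (hypothetical helper): succeeds if NAME is found on PATH.
have_cmd() {
    command -v "$1" >/dev/null 2>&1
}

if have_cmd java; then
    # java prints its version to stderr, so merge it into stdout.
    java -version 2>&1
else
    echo "java not found; install it with: sudo apt install default-jre"
fi
```

The same helper works for the scala check in the next step.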
2) Install Scala
sudo apt install scala
scala -version
3) Create “spark” user
sudo addgroup spark
sudo adduser --ingroup spark spark
# Only if this machine also runs Hadoop and a hadoop group already exists:
sudo usermod -a -G hadoop spark
4) Install Spark
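# Older releases such as 3.1.2 are served from the Apache archive rather than dlcdn.apache.org:
wget https://archive.apache.org/dist/spark/spark-3.1.2/spark-3.1.2-bin-hadoop3.2.tgz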
tar -xvzf spark-3.1.2-bin-hadoop3.2.tgz
# Move the extracted directory (not the .tgz archive) to /opt/spark:
sudo mv spark-3.1.2-bin-hadoop3.2 /opt/spark
sudo chmod -R 755 /opt/spark/
sudo chown -R spark:spark /opt/spark
# Give the spark user sudo rights by adding the spark line below the root entry:
sudo visudo
##------------------------------
# User privilege specification
root ALL=(ALL) ALL
spark ALL=(ALL) ALL
##------------------------------
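Before extracting a downloaded archive, it is worth comparing it with the SHA-512 checksum that Apache publishes next to each release file. The sketch below assumes GNU coreutils (sha512sum); verify_sha512 is a hypothetical helper, and the expected digest has to be copied by hand from the release's .sha512 file.

```shell
#!/bin/sh
# verify_sha512 FILE EXPECTED_HEX (hypothetical helper)
# Succeeds only if FILE exists and its SHA-512 digest equals EXPECTED_HEX.
# The comparison ignores the case of the hex digits.
verify_sha512() {
    actual=$(sha512sum "$1" 2>/dev/null | cut -d ' ' -f 1)
    expected=$(printf '%s' "$2" | tr 'A-F' 'a-f')
    [ -n "$actual" ] && [ "$actual" = "$expected" ]
}

# Usage (the second argument is a placeholder, not a real checksum):
# verify_sha512 spark-3.1.2-bin-hadoop3.2.tgz "<digest from the .sha512 file>" && echo "checksum OK"
```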
5) Configure Environment Variables for Spark
sudo su - spark
# Single quotes keep $PATH literal, so it is expanded at login rather than frozen when echo runs.
echo 'export SPARK_HOME=/opt/spark' >> ~/.profile
echo 'export PATH=$PATH:/opt/spark/bin:/opt/spark/sbin' >> ~/.profile
echo 'export PYSPARK_PYTHON=/usr/bin/python3' >> ~/.profile
source ~/.profile
OR edit the file by hand:
sudo su - spark
vim ~/.profile
# Add the following at the end of the file, then run: source ~/.profile
##------------------------------
export SPARK_HOME=/opt/spark
export PATH=$PATH:/opt/spark/bin:/opt/spark/sbin
export PYSPARK_PYTHON=/usr/bin/python3
##------------------------------
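Running the echo commands in this step more than once appends duplicate lines to ~/.profile. A minimal sketch of an idempotent alternative follows; append_once is a hypothetical helper, not a standard command.

```shell
#!/bin/sh
# append_once LINE FILE (hypothetical helper)
# Appends LINE to FILE only if an identical line is not already present.
append_once() {
    touch "$2"
    # -q: quiet, -x: match the whole line, -F: treat LINE as a fixed string.
    grep -qxF -- "$1" "$2" || printf '%s\n' "$1" >> "$2"
}

# Usage, with single quotes so that $PATH is written literally:
# append_once 'export SPARK_HOME=/opt/spark' ~/.profile
# append_once 'export PATH=$PATH:/opt/spark/bin:/opt/spark/sbin' ~/.profile
# append_once 'export PYSPARK_PYTHON=/usr/bin/python3' ~/.profile
```

Because each line is matched as a whole, re-running the setup leaves the file unchanged.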
6) SSH Config: needed because start-workers.sh launches the Spark workers (formerly called slaves) over SSH
mkdir -p ~/.ssh
chmod 700 ~/.ssh
cd ~/.ssh/
ssh-keygen -t rsa -P "" -f ~/.ssh/id_rsa
cat ~/.ssh/id_rsa.pub >> ~/.ssh/authorized_keys
chmod 600 ~/.ssh/authorized_keys
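# Review the SSH daemon settings if key-based login fails (for example, PubkeyAuthentication must not be set to "no"):
sudo vi /etc/ssh/sshd_config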
sudo /etc/init.d/ssh reload
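Key-based login commonly fails because of loose file permissions, so it can help to check them before starting Spark. The sketch below assumes GNU stat and the conventional modes 700 for ~/.ssh and 600 for authorized_keys; ssh_perms_ok is a hypothetical helper.

```shell
#!/bin/sh
# ssh_perms_ok DIR (hypothetical helper)
# Succeeds if DIR has mode 700 and DIR/authorized_keys has mode 600.
# GNU stat -c '%a' prints the permission bits in octal.
ssh_perms_ok() {
    [ "$(stat -c '%a' "$1" 2>/dev/null)" = "700" ] &&
    [ "$(stat -c '%a' "$1/authorized_keys" 2>/dev/null)" = "600" ]
}

# Usage:
# ssh_perms_ok ~/.ssh || echo "fix with: chmod 700 ~/.ssh; chmod 600 ~/.ssh/authorized_keys"
# BatchMode makes ssh fail instead of prompting for a password:
# ssh -o BatchMode=yes localhost true && echo "passwordless SSH works"
```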
7) Start Apache Spark
sudo su - spark
start-master.sh
start-workers.sh spark://localhost:7077
The Spark master web UI should now be available at http://<your-ip-address>:8080
Also check that spark-shell starts correctly:
spark-shell
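For a non-interactive check, the bundled SparkPi example can be run from a script. This is a minimal sketch, assuming /opt/spark/bin is on PATH as configured in step 5; spark_smoke_test is a hypothetical helper. Unless a master is configured, the example runs in local mode, so it checks the installation rather than the standalone cluster.

```shell
#!/bin/sh
# spark_smoke_test (hypothetical helper)
# Prints the Spark version and runs the bundled SparkPi example.
# Returns non-zero if the Spark scripts are not on PATH.
spark_smoke_test() {
    if ! command -v spark-submit >/dev/null 2>&1; then
        echo "spark-submit not found on PATH; check SPARK_HOME and ~/.profile" >&2
        return 1
    fi
    spark-submit --version
    # run-example ships in the Spark bin directory; 10 is the number of partitions.
    run-example SparkPi 10
}

# Usage:
# spark_smoke_test
```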