As a continuation of the last post, we now look at how to make our program deployable on a proper Spark/Hadoop cluster. We will not go into the details of setting up such a cluster itself, but rather into how to make sure that the program we developed earlier can run as a job on a cluster.
We will continue with the setup on our local machine. I am using a Mac, so the instructions are written with that in mind, but most of them apply equally to other platforms.
If Spark is processing data from a database and writing into Hive, pretty much what we did in the last post would work. The problem arises when some of the data being processed exists as flat files. If we want to submit our jobs to a Spark cluster, we cannot use local files, because the jobs running on the cluster nodes cannot see our local file system.
The best approach is to either use an existing HDFS cluster or deploy a single-node HDFS instance on your machine. Here I am enumerating the steps to set up a single-node HDFS cluster on a Mac OS X machine.
- Download the Hadoop distribution for your machine from the Apache Hadoop releases page.
- The Hadoop distribution comes as a .tar.gz file that you can expand into a directory on your machine. The expansion creates a directory of the form hadoop-x.y.z, assuming your Hadoop version is x.y.z. Set the environment variable HADOOP_HOME to the full pathname of this Hadoop directory.
- Add $HADOOP_HOME/bin to the PATH variable (see the example exports right after this list).
- Now we need to update the configuration files for Hadoop.
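For reference, the two environment-variable steps above correspond to entries like these in a shell profile (~/.bash_profile or ~/.zshrc); the install path is a placeholder for wherever you expanded the archive:

export HADOOP_HOME=/path/to/hadoop-x.y.z
export PATH=$PATH:$HADOOP_HOME/bin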
$ cd $HADOOP_HOME/etc/hadoop
$ vi core-site.xml
We update the file with the following properties.
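For illustration, a minimal pseudo-distributed core-site.xml points the default file system at a local NameNode; the hdfs://localhost:9000 URI is an assumed value and can be changed, as long as it is used consistently elsewhere:

<configuration>
  <property>
    <name>fs.defaultFS</name>
    <value>hdfs://localhost:9000</value>
  </property>
</configuration>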
$ vi hdfs-site.xml
We update the file with the following properties.
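A typical single-node hdfs-site.xml simply drops the replication factor to 1, since there is only one DataNode to replicate to:

<configuration>
  <property>
    <name>dfs.replication</name>
    <value>1</value>
  </property>
</configuration>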
$ vi mapred-site.xml
We update the file with the following properties.
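For mapred-site.xml, the usual single-node setting is to run MapReduce on YARN:

<configuration>
  <property>
    <name>mapreduce.framework.name</name>
    <value>yarn</value>
  </property>
</configuration>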
$ vi yarn-site.xml
We update the file with the following properties.
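And a minimal yarn-site.xml enables the shuffle auxiliary service that MapReduce jobs need:

<configuration>
  <property>
    <name>yarn.nodemanager.aux-services</name>
    <value>mapreduce_shuffle</value>
  </property>
</configuration>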
Now we start Hadoop. On a fresh install, format the NameNode once with hdfs namenode -format before the first start.

$ cd $HADOOP_HOME
$ sbin/start-all.sh

Now we can access files stored in HDFS from our Spark jobs.
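As a quick check, we can copy a file into HDFS and read it back from Spark. The directory and file names below are only illustrative, and the hdfs:// URI assumes the fs.defaultFS value used earlier:

$ hdfs dfs -mkdir -p /data
$ hdfs dfs -put people.csv /data/people.csv

From a Spark job or spark-shell, the file can then be referenced by its HDFS URI, for example:

scala> val df = spark.read.csv("hdfs://localhost:9000/data/people.csv")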
The next post will go into more detail about how to process files in Spark.