Monday, February 24, 2020

Hadoop and Spark Locally

As a continuation of the last post, we now look at how to make our program deployable in a proper Spark/Hadoop cluster. We will not go into the details of setting up these clusters themselves, but rather into how to make sure a program we developed earlier can run as a job in a cluster.
We will continue with the setup on our local machine. I am using a Mac, so the instructions are written with that in mind, but most of them apply to any other platform.
If Spark is processing data from a database and writing into Hive, pretty much what we did in the last post would work. The problem arises if some of the data being processed exists as flat files. If we want to submit our jobs to a Spark cluster, we cannot use local files, because the jobs are not running on the local file system.
The best approach is to either use an existing HDFS cluster or deploy a single-node HDFS on your machine. Here I am enumerating the steps to set up a single-node HDFS cluster on a Mac OS X machine.
  • Download the Hadoop distribution for your machine from the Apache Hadoop releases page.
  • The Hadoop distribution is available as a .tar.gz file, which you can expand into some directory on your machine. The expansion will create a directory of the form hadoop-x.y.z, assuming your Hadoop version is x.y.z. Set the environment variable HADOOP_HOME to the full pathname of this Hadoop directory.
  • Add $HADOOP_HOME/bin to the PATH variable.
  • Now we need to update the configuration files for Hadoop.
$ cd $HADOOP_HOME/etc/hadoop
$ vi core-site.xml

We update the file with the following properties.
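For a single-node setup, pointing fs.defaultFS at a local NameNode is enough. The host and port below (localhost:9000) are a common convention, not a requirement, and can be changed to suit your setup.

<configuration>
  <property>
    <name>fs.defaultFS</name>
    <value>hdfs://localhost:9000</value>
  </property>
</configuration>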

$ vi hdfs-site.xml

We update the file with the following properties.
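A replication factor of 1 is appropriate for a single node. The NameNode and DataNode directories below are placeholder paths on the local disk; point them at a location you actually want to use.

<configuration>
  <property>
    <name>dfs.replication</name>
    <value>1</value>
  </property>
  <property>
    <name>dfs.namenode.name.dir</name>
    <value>file:///usr/local/hadoop-data/namenode</value>
  </property>
  <property>
    <name>dfs.datanode.data.dir</name>
    <value>file:///usr/local/hadoop-data/datanode</value>
  </property>
</configuration>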
$ vi mapred-site.xml

We update the file with the following properties.
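Here we only tell MapReduce to run on top of YARN.

<configuration>
  <property>
    <name>mapreduce.framework.name</name>
    <value>yarn</value>
  </property>
</configuration>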
$ vi yarn-site.xml

We update the file with the following properties.
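Here we enable the shuffle auxiliary service that MapReduce jobs need.

<configuration>
  <property>
    <name>yarn.nodemanager.aux-services</name>
    <value>mapreduce_shuffle</value>
  </property>
</configuration>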
Now we start Hadoop.
$ cd $HADOOP_HOME
$ sbin/start-all.sh
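If this is the first time HDFS is being brought up, the NameNode also needs to be formatted before start-all.sh is run, and jps is a quick way to confirm that the daemons came up.
$ hdfs namenode -format
$ jps
The jps output should list NameNode, DataNode, SecondaryNameNode, ResourceManager and NodeManager.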
Now we can access files stored in HDFS from our Spark jobs.
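As a quick check, assume we have a local file named people.csv (the file name and the /data directory below are just placeholders). We copy it into HDFS:
$ hdfs dfs -mkdir -p /data
$ hdfs dfs -put people.csv /data/
and then, inside the Spark job, refer to it through an hdfs:// URL instead of a local path:
Dataset<Row> df = spark.read()
        .option("header", "true")
        .csv("hdfs://localhost:9000/data/people.csv");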
The next post will go into more detail on how to process files in Spark.

Tuesday, February 18, 2020

Hive and Spark

In this blog post, we take a slight deviation from core issues related to the Spring framework and look at a problem that Spring programmers might face regularly. Recently I was working with Spark and needed to read data from MySQL, do some processing, and write the results back to a Hive instance. We will look at how to do this in this post.
We start by making sure our Hive instance's metastore is backed by a database. To do this, we do the following.
$ cp $HIVE_HOME/conf/hive-default.xml.template $HIVE_HOME/conf/hive-site.xml
$ vi $HIVE_HOME/conf/hive-site.xml

We edit the hive-site.xml file and make sure it is configured as below.
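The property names below are the standard Hive metastore settings; the database name (metastore_db), the user hiveuser and the password hivepass123 are only examples and should be replaced with values that exist in your MySQL installation. Depending on the setup, the metastore schema may also need to be initialised once with schematool -dbType mysql -initSchema before the first use.

<configuration>
  <property>
    <name>javax.jdo.option.ConnectionURL</name>
    <value>jdbc:mysql://localhost:3306/metastore_db?createDatabaseIfNotExist=true</value>
  </property>
  <property>
    <name>javax.jdo.option.ConnectionDriverName</name>
    <value>com.mysql.cj.jdbc.Driver</value>
  </property>
  <property>
    <name>javax.jdo.option.ConnectionUserName</name>
    <value>hiveuser</value>
  </property>
  <property>
    <name>javax.jdo.option.ConnectionPassword</name>
    <value>hivepass123</value>
  </property>
</configuration>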
We can configure the values to suit our needs, but we must make sure the MySQL user, password and database in the URL actually exist and that the user has the relevant permissions.
Now we create another database in MySQL which will contain the data we need to process. I am calling this database mystuff, with username mystuff and password mystuff123. We need to run the following commands in MySQL to make sure everything exists and the permissions are appropriate.
mysql> create database mystuff;
Query OK, 1 row affected (0.00 sec)

mysql> create user mystuff@localhost identified by 'mystuff123';
Query OK, 0 rows affected (0.00 sec)

mysql> create user mystuff@'%' identified by 'mystuff123';
Query OK, 0 rows affected (0.01 sec)

mysql> grant all on mystuff.* to mystuff@localhost;
Query OK, 0 rows affected (0.00 sec)

mysql> grant all on mystuff.* to mystuff@'%';
Query OK, 0 rows affected (0.00 sec)

Now we create a plain Java project in IntelliJ with the following pom.xml file.
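The original pom.xml is not reproduced here; a minimal one along the following lines should work. The group and artifact ids are placeholders, and the Spark and MySQL connector versions are assumptions that can be changed to whatever you are running.

<project xmlns="http://maven.apache.org/POM/4.0.0">
  <modelVersion>4.0.0</modelVersion>
  <groupId>com.example</groupId>
  <artifactId>spark-hive-demo</artifactId>
  <version>1.0-SNAPSHOT</version>
  <dependencies>
    <dependency>
      <groupId>org.apache.spark</groupId>
      <artifactId>spark-sql_2.12</artifactId>
      <version>2.4.5</version>
    </dependency>
    <dependency>
      <groupId>org.apache.spark</groupId>
      <artifactId>spark-hive_2.12</artifactId>
      <version>2.4.5</version>
    </dependency>
    <dependency>
      <groupId>mysql</groupId>
      <artifactId>mysql-connector-java</artifactId>
      <version>8.0.19</version>
    </dependency>
  </dependencies>
</project>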
Now to look at the problem at hand. We have a table in MySQL with the following structure.
mysql> desc mydata;
+-------+--------------+------+-----+---------+-------+
| Field | Type         | Null | Key | Default | Extra |
+-------+--------------+------+-----+---------+-------+
| id    | int          | YES  |     | NULL    |       |
| k     | varchar(10)  | YES  |     | NULL    |       |
| v     | varchar(255) | YES  |     | NULL    |       |
+-------+--------------+------+-----+---------+-------+
3 rows in set (0.00 sec)

We want to flatten this table such that each key gets converted to a column for each id. So assume our current data is as below.
mysql> select * from mydata;
+------+-------+-------------------+
| id   | k     | v                 |
+------+-------+-------------------+
|    1 | NAME  | John Doe          |
|    1 | EMAIL | jd@example.com    |
|    2 | NAME  | Jane Doe          |
|    2 | EMAIL | janed@example.com |
+------+-------+-------------------+
4 rows in set (0.00 sec)
We want to load this data and convert it into a flattened table with three columns, i.e. id, name, and email. Then we want to populate a table named person in Hive with the flattened data.
hive (default)> select * from person;
OK
person.id person.email person.name
1 jd@example.com  John Doe
2 janed@example.com Jane Doe
Time taken: 0.094 seconds, Fetched: 2 row(s)

The following code will perform the above conversion.
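The original listing is not reproduced here, so the following is a minimal sketch of the same flow; the class name FlattenJob, the local master setting and the JDBC connection details are assumptions and should be adapted to your environment.

import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;
import static org.apache.spark.sql.functions.first;

public class FlattenJob {
    public static void main(String[] args) {
        // SparkSession with Hive support, used for writing the final table
        SparkSession hiveSession = SparkSession.builder()
                .appName("flatten-mydata")
                .master("local[*]")          // assumption: running locally from the IDE
                .enableHiveSupport()
                .getOrCreate();

        // A second session (sharing the same context) used for the MySQL side
        SparkSession sqlSession = hiveSession.newSession();

        // Load the complete key/value table from MySQL over JDBC
        Dataset<Row> mydata = sqlSession.read()
                .format("jdbc")
                .option("url", "jdbc:mysql://localhost:3306/mystuff")
                .option("driver", "com.mysql.cj.jdbc.Driver")
                .option("dbtable", "mydata")
                .option("user", "mystuff")
                .option("password", "mystuff123")
                .load();

        // Group by id and pivot on k so that NAME and EMAIL become columns
        Dataset<Row> flattened = mydata.groupBy("id")
                .pivot("k")
                .agg(first("v"));

        // Re-create the data frame against the Hive-enabled session and save it as table person
        Dataset<Row> hiveDf = hiveSession.createDataFrame(flattened.javaRDD(), flattened.schema());
        hiveDf.write().mode("overwrite").saveAsTable("person");

        hiveSession.stop();
    }
}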
The code first creates a SparkSession with Hive support enabled; the key call here is enableHiveSupport. A second SparkSession is then created for the MySQL operations. Next, the complete contents of the mydata table are loaded over JDBC, the rows are grouped by id and pivoted on the field k so that each key becomes a column, and finally a data frame is created against the Hive session and its contents are written into a table named person.