Manipulating Big Data

In this video, we’ll show you some of the ways that you can transform your data with the power of Hive and Spark, using freely available nodes in KNIME Analytics Platform. We’ll demonstrate in-database ETL processing on a Hive backend, and then show an approach for tackling the same tasks using both Spark nodes and Spark SQL queries. We’ll also discuss Spark partitioning, and briefly present the Spark Web UI, which is useful for monitoring your Spark jobs. To follow along with the workflows we’ll build in this video, it will be helpful to install both the KNIME Big Data Connectors extension and the KNIME Extension for Apache Spark. Ready? Let’s get started.
Let’s consider an example where you need to connect to a dataset that exists on a distributed Hive cluster and extract a subset of transformed data. The dataset is much too large to handle on your local machine, so it’s an absolute must that all computation be “pushed down” to the database for efficient operation. Is this something KNIME can handle? Absolutely!

For remote connections to a Hive cluster, we could start by using a Hive Connector node like the one shown here to make an initial connection to the cluster. In the node’s configuration, we would just need to provide the host name, database name, and appropriate credentials; additional JDBC parameters can be set in the JDBC Parameters tab.
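As a rough illustration of what that configuration amounts to, the connection details for a remote HiveServer2 instance boil down to a JDBC URL like the one sketched below; the host, port, and database names here are placeholders, not values from the video.

    # Hypothetical connection details for a remote Hive cluster (placeholders only)
    hive_host = "hive.example.com"
    hive_port = 10000                      # default HiveServer2 port
    database = "football"
    jdbc_url = f"jdbc:hive2://{hive_host}:{hive_port}/{database}"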
For this example, I’ll instead use a Local Big Data Environment node to create a Hive instance on my laptop. Here we’ll read in publicly available data on European football matches that I’ve already loaded into Hive. To do this, I’ll connect the Local Big Data Environment node to a couple of DB Table Selector nodes to choose two tables from my database: one containing records of the individual events taking place during these games, such as shots on goal and fouls, and the other listing the times and descriptions of those events.

I’d like to combine these results into a single record set, so I’ll next use a DB Joiner node and configure it to perform an inner join on the event_id. At this point, I have more columns in my data extract than I would like, so I’ll use a DB Column Filter node to remove some of them. Next, I’ll count the events in the dataset by their type. I can then choose to write these results into a separate Hive table using the DB Connection Table Writer node.
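To make those in-database steps concrete, the logic the DB nodes push down to Hive is roughly the join and grouped count sketched below. It is written in PySpark purely for illustration (the DB nodes themselves generate SQL that runs over JDBC), and the table and column names, events, event_details, event_id, and event_type, are assumptions rather than the real schema.

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.getOrCreate()

    # Hypothetical table names standing in for the two Hive tables selected above
    events = spark.table("events")
    details = spark.table("event_details")

    # Inner join on the shared event_id key, drop the columns we don't need,
    # then count events by type, mirroring the DB nodes above
    counts = (
        events.join(details, on="event_id", how="inner")
              .select("event_id", "event_type")
              .groupBy("event_type")
              .count()
    )

    # Write the aggregate back to a separate Hive table
    counts.write.mode("overwrite").saveAsTable("event_counts_by_type")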
Finally, since the result of my aggregation is relatively small, I can pull it directly into KNIME using a DB Reader node and create a bar chart to visualize my football events.

It’s important to note that if I were applying these database operations in a cloud environment, no data would be processed on my local machine until I explicitly requested a summary extract via the DB Reader node! Prior to that, all computation would take place in the Hive cluster itself; I’m just using KNIME to visually orchestrate the analysis in a seamless way.
Now that we’ve done a bit of ETL with KNIME’s DB nodes on a Hive cluster, let’s see what we can do with Spark instead. KNIME has a large number of dedicated Spark nodes to help you manipulate and transform your data, covering everything from simple filtering, grouping, and pivoting, to conversion between distributed file formats like Parquet and ORC, and even advanced use of PySpark. Apart from these extensive Spark ETL capabilities, KNIME also supports many of Spark’s machine learning algorithms.
To explore Spark in KNIME, let’s continue our previous example, on a different branch of our workflow, using Hive to Spark nodes to begin data processing. Let’s connect two of these nodes to the DB Table Selector nodes as before. You’ll notice that the square output ports have changed from a magenta color to black, indicating that we can now connect to Spark nodes downstream.

Let’s join our football events data as we did before, but this time using a Spark Joiner node. We can then use a Spark Row Filter node to choose just the records associated with a particular team, in this case Real Madrid.
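In DataFrame terms, those two Spark nodes do something like the following minimal sketch; the table names, the join key, and the team column are assumptions carried over from the earlier example.

    from pyspark.sql import SparkSession, functions as F

    spark = SparkSession.builder.getOrCreate()

    # Hypothetical table and column names for this example
    events = spark.table("events")
    details = spark.table("event_details")

    # Spark Joiner: the same inner join on event_id as in the DB branch
    joined = events.join(details, on="event_id", how="inner")

    # Spark Row Filter: keep only the rows for one team
    madrid = joined.filter(F.col("team") == "Real Madrid")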
Now, perhaps there is an operation that you would like to perform on your data by writing a SQL query? That’s easy enough: just use the Spark SQL Query node to create a new first half / second half feature in our dataset by applying a simple mathematical function, like this.
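The exact expression used in the video isn’t shown in the transcript, but a query along these lines would create such a half indicator, assuming the events carry a match-minute column called time; treat all names here as placeholders.

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.getOrCreate()

    # Hypothetical input: the filtered events, with the match minute in a "time" column
    events = spark.table("events")
    events.createOrReplaceTempView("events")

    # Derive a half indicator: 1 for minutes up to 45, 2 for everything after
    with_half = spark.sql("""
        SELECT *,
               CASE WHEN time <= 45 THEN 1 ELSE 2 END AS half
        FROM events
    """)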
To summarize our data extract, let’s connect a Spark Pivoting node to aggregate events by player for each half, and look at the results. Finally, we can write our results to a new table on our cluster using a Spark to Parquet node. We could also write to ORC, or back to Hive, if preferred.
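A minimal PySpark sketch of that pivot-and-write step is shown below; the column names player and half, the input table, and the output path are assumptions for illustration, not the exact settings from the video.

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.getOrCreate()

    # Hypothetical input: one row per event, with "player" and the derived "half" column
    events = spark.table("events_with_half")

    # Spark Pivoting: count events per player, with one column per half
    by_player = events.groupBy("player").pivot("half").count()

    # Spark to Parquet: persist the result on the cluster (ORC or a Hive table would also work)
    by_player.write.mode("overwrite").parquet("/tmp/events_by_player_and_half")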
Note that throughout this process Spark is performing lazy evaluation: it does not apply the transformations until the data is actually needed, in this case when it is written to the file system. This can sometimes cause the node that triggers the evaluation to run longer than expected, because all of the previous transformations are executed in that step as well.
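Here is a minimal sketch of what lazy evaluation means in practice, with made-up table and column names: the first two operations only build up an execution plan, and nothing is computed until the final write.

    from pyspark.sql import SparkSession, functions as F

    spark = SparkSession.builder.getOrCreate()
    df = spark.table("events")   # hypothetical table name

    # Transformations: Spark only records the execution plan here, nothing runs yet
    filtered = df.filter(F.col("team") == "Real Madrid")
    counted = filtered.groupBy("event_type").count()

    # Action: writing the result forces the whole plan to execute, which is why
    # the node that triggers it can run longer than the earlier ones
    counted.write.mode("overwrite").parquet("/tmp/real_madrid_event_counts")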
Also worth noting is that Spark handles data in partitions. Usually a DataFrame or RDD is split into several partitions and distributed across the worker nodes in the cluster. This is done to parallelize execution and avoid network traffic. Because of the distributed storage, data in a Hadoop cluster is spread across several worker nodes, and Spark tries to process the data on nodes that are close to it, only transferring it to other nodes if needed. If you would like to adjust the Spark partitions yourself, you can do so using a “PARTITIONED BY” statement in a DB Table Creator node.
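As a rough illustration, a partitioned table definition of the kind the DB Table Creator node would generate looks like the DDL below, and on the Spark side a DataFrame can also be redistributed explicitly; the table name, columns, and partition count are made up for this example.

    from pyspark.sql import SparkSession

    # Roughly the kind of HiveQL DDL a "PARTITIONED BY" clause in the DB Table Creator
    # node produces; table and column names are made up for this example
    create_table_ddl = """
        CREATE TABLE events_by_half (
            player STRING,
            n_events INT
        )
        PARTITIONED BY (half INT)
        STORED AS PARQUET
    """

    # On the Spark side, a DataFrame can also be redistributed explicitly,
    # for example into 8 partitions keyed on the same column (the number is arbitrary)
    spark = SparkSession.builder.getOrCreate()
    df = spark.table("events_by_half").repartition(8, "half")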
Now that we’ve loaded and processed our data, we can take a look at the Spark Web UI within KNIME, to see what’s going on behind the scenes and monitor the progress of our operations. By right-clicking on the Create Local Big Data Environment node and selecting Spark context, like this, KNIME displays a window showing the options we used when executing the node. It also contains a blue hyperlink to open the Spark Web UI. Let’s do that now. When we do, we can see that we have access to a number of different summaries of the Spark operations that have been performed, including a summary of completed jobs and their associated timeline, along with the stages for those jobs, environment properties, and even the specific SQL queries that have run, along with their execution times.
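As a side note, if you are working with a Spark session outside of KNIME, the same Web UI address can be retrieved programmatically; this is a generic PySpark detail rather than a KNIME feature.

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.getOrCreate()

    # The address of the Spark Web UI for the current context
    print(spark.sparkContext.uiWebUrl)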
In this video, we showed how you can use KNIME to orchestrate many different types of ETL functions in both Hive and Spark, with a focus on pushing resource-intensive processing to your cluster. KNIME provides you with a convenient visual representation of your workflow, while still bringing the full computational capabilities of your cluster to bear on your Big Data environment.