Apache Spark and Apache Hadoop are two of the most popular big data frameworks. Contrary to the common perception that they are competitors, they are best used together as a team: they do different things and have complementary responsibilities.
As shown below:
(1) Hadoop provides the data infrastructure: clusters of commodity hardware across which massive data volumes can be distributed and processed.
(2) This cluster of nodes uses Hadoop's storage component, HDFS (Hadoop Distributed File System), which is optimized for storing such massive collections of data.
(3) A data processing tool runs on top of this infrastructure; it can be either Apache Spark or Hadoop MapReduce.
So, Spark rivals only the Hadoop MapReduce component, not the whole Hadoop framework.
It must be mentioned that Spark and Hadoop are not bound to each other and can be used independently: Hadoop can use MapReduce instead of Spark for data processing, and Spark can easily be integrated into another data processing infrastructure. However, using them together still remains the best practice.
Spark’s main component is its Core API, a general-purpose engine responsible for basic I/O operations, task scheduling, memory management, and so on. On top of this core are built libraries for specific types of data processing operations:
– Spark SQL – for processing structured data using SQL queries
– Spark Streaming – for processing real-time data streams
– MLlib – for machine learning algorithms
– GraphX – for building, transforming, and analyzing graph-structured data
All these modules are rich in algorithms, easy to use, and integrate well with each other, so you can build complex data processing workflows that combine functionality from several Spark modules.
Advantages of Apache Spark
These are some of the advantages Spark has over MapReduce and other data processing frameworks:
1) Spark is a high-level framework with an easy API, which makes building complex algorithms comparatively simple, whereas MapReduce has a low-level API that requires much more expertise in data analytics and thus has a steep learning curve.
2) A significant advantage of Spark is that it is a general-purpose framework. As mentioned above, it combines different libraries to cover more or less all kinds of data processing scenarios, including batch processing, streaming, interactive SQL querying, machine learning, and graph data processing, using the same data in the same infrastructure instead of splitting tasks across different platforms. MapReduce, by contrast, handles only batch processing.
3) Spark can be up to 100x faster, because it is optimized to take advantage of in-memory analysis, which makes accessing the same data repeatedly much faster; the trade-off is that the cluster needs enough memory to hold the data being processed. Even for disk-based batch processing, Spark is up to 10x faster.
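To make point 1 concrete, here is a pure-Python sketch (not the actual Spark or Hadoop APIs) contrasting the two programming styles on a word count. The high-level version is one chained expression, the way Spark code typically reads; the low-level version spells out the explicit map, shuffle, and reduce phases a MapReduce program must implement by hand.

```python
from collections import defaultdict
from functools import reduce

lines = ["spark is fast", "hadoop is scalable", "spark is easy"]

# High-level, Spark-like style: one chained functional expression.
counts_high = reduce(
    lambda acc, w: {**acc, w: acc.get(w, 0) + 1},
    (word for line in lines for word in line.split()),
    {},
)

# Low-level, MapReduce-like style: explicit map, shuffle, and reduce phases.
def mapper(line):
    for word in line.split():
        yield (word, 1)          # emit a (key, value) pair per word

def shuffle(pairs):
    groups = defaultdict(list)   # group values by key between phases
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reducer(key, values):
    return key, sum(values)      # aggregate each key's values

mapped = (pair for line in lines for pair in mapper(line))
counts_low = dict(reducer(k, v) for k, v in shuffle(mapped).items())

print(counts_high == counts_low)  # → True: same result, very different effort
```

Both compute the same counts; the difference is how much plumbing the programmer has to write and maintain.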
The following diagram shows the difference in how MapReduce and Apache Spark process data:
MapReduce reads data, performs the first operation, and writes the result back to disk. It then reads that intermediate data again, performs the next operation, and writes it back; the cycle continues until the last operation is executed and the final results are written back.
Spark reads data once, performs all the operations in memory, and writes only the end results back.
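The two execution models above can be simulated in pure Python (a sketch of the idea, not the real engines): the MapReduce-style pipeline reads and writes a file around every operation, while the Spark-style pipeline reads the input once and chains every operation in memory. Counting the disk round trips shows where the speed difference comes from.

```python
import json
import os
import tempfile

# Stage the input data on disk, as both engines would find it.
workdir = tempfile.mkdtemp()
input_path = os.path.join(workdir, "input.json")
with open(input_path, "w") as f:
    json.dump(list(range(10)), f)

# Three chained operations applied to every element.
ops = [lambda x: x + 1, lambda x: x * 2, lambda x: x - 3]

def mapreduce_style(input_path, ops):
    """Each operation reads its input from disk and writes its output back."""
    disk_io = 0
    path = input_path
    for i, op in enumerate(ops):
        with open(path) as f:
            stage = json.load(f)                  # read the previous stage
        disk_io += 1
        path = os.path.join(workdir, f"stage{i}.json")
        with open(path, "w") as f:
            json.dump([op(x) for x in stage], f)  # write the result back
        disk_io += 1
    with open(path) as f:
        return json.load(f), disk_io + 1          # final read of the end result

def spark_style(input_path, ops):
    """Read once, chain every operation in memory."""
    with open(input_path) as f:
        stage = json.load(f)
    disk_io = 1
    for op in ops:
        stage = [op(x) for x in stage]            # intermediates never touch disk
    return stage, disk_io

mr_result, mr_io = mapreduce_style(input_path, ops)
sp_result, sp_io = spark_style(input_path, ops)
print(mr_result == sp_result, mr_io, sp_io)       # → True 7 1
```

The results are identical, but the MapReduce-style run performs a disk read and write per operation, while the Spark-style run touches disk only for the initial read; with real datasets and real disks, that gap is what the 10x–100x figures above reflect.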
Side note: the Apache Spark framework provides APIs for the Java, Scala, Python, and R languages.
The Hadoop infrastructure will be explained in more detail in the next blog post.