Tech Kaizen: Apache Hadoop vs Apache Spark

Apache Hadoop is an open-source software framework written in Java for distributed storage and distributed processing of very large data sets on computer clusters built from commodity hardware. All the modules in Hadoop are designed with a fundamental assumption that hardware failures (of individual machines, or racks of machines) are commonplace and thus should be automatically handled in software by the framework.

Apache Spark is an open-source cluster computing framework originally developed in the AMPLab at UC Berkeley. In contrast to Hadoop's two-stage disk-based MapReduce paradigm, Spark's in-memory primitives provide performance up to 100 times faster for certain applications. By allowing user programs to load data into a cluster's memory and query it repeatedly, Spark is well suited to machine learning algorithms.

Hadoop vs Spark

Hadoop is a parallel data processing framework that has traditionally been used to run MapReduce jobs. These are long-running batch jobs that take minutes or hours to complete. Spark can run on top of Hadoop (for example on YARN, reading data from HDFS) and is an alternative to the traditional batch MapReduce model; it can be used for near-real-time stream processing and for fast interactive queries that finish within seconds. A Hadoop cluster can therefore support both traditional MapReduce and Spark. We should look at Hadoop as a general-purpose framework that supports multiple processing models, and at Spark as an alternative to Hadoop MapReduce rather than a replacement for Hadoop.
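To make the map/reduce model concrete, here is a minimal single-process sketch in plain Python. It does not use Hadoop's API; the function names (map_phase, shuffle, reduce_phase, word_count) and the sample input are made up for illustration, and a real job would spread each phase across many machines and write intermediate results to disk.

```python
from collections import defaultdict


def map_phase(lines):
    """Map stage: emit a (word, 1) pair for every word in every input line."""
    for line in lines:
        for word in line.lower().split():
            yield word, 1


def shuffle(pairs):
    """Group values by key, the step a framework performs between map and reduce."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(grouped):
    """Reduce stage: collapse each key's list of values into a single count."""
    return {key: sum(values) for key, values in grouped.items()}


def word_count(lines):
    """Run the three steps in order on an in-memory list of lines."""
    return reduce_phase(shuffle(map_phase(lines)))


# Hypothetical sample input, standing in for files stored on the cluster.
sample_lines = ["Hadoop runs MapReduce jobs", "Spark runs on Hadoop"]
counts = word_count(sample_lines)
print(counts)
```

The point of the split is that the map and reduce functions contain no coordination logic, which is what lets a framework rerun a failed piece of work on another machine.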

Because Spark works in RAM rather than relying on network and disk I/O, it is relatively fast compared to Hadoop MapReduce. However, because it uses a lot of RAM, it typically needs dedicated, high-end machines with ample memory to produce good results. Which one to choose depends on the workload, and the factors behind that decision keep changing over time.

Hadoop MapReduce is designed to run batch jobs that address every file in the data set. Since that process takes time, MapReduce is well suited for large distributed data processing where fast performance is not an issue, such as running end-of-day transactional reports. MapReduce is also a good fit for scanning historical data and performing analytics where a short time-to-insight isn't vital. Spark, by contrast, was purposely designed to support in-memory processing. The net benefit of keeping the working data in memory is the ability to perform iterative computations very quickly, something MapReduce is not designed to do.
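The difference for iterative work can be illustrated with a toy model in plain Python. This is not Spark or Hadoop code: CountingStore is a hypothetical stand-in for disk or HDFS that simply counts how many times the data set is read, and the update rule is an arbitrary placeholder for an iterative algorithm.

```python
class CountingStore:
    """Hypothetical stand-in for disk/HDFS that counts full reads of the data."""

    def __init__(self, values):
        self._values = list(values)
        self.reads = 0

    def load(self):
        self.reads += 1
        return list(self._values)


def iterate_with_reload(store, steps, rate=0.5):
    """Disk-style loop: every iteration reads the full data set again."""
    estimate = 0.0
    for _ in range(steps):
        data = store.load()
        mean = sum(data) / len(data)
        estimate += rate * (mean - estimate)
    return estimate


def iterate_in_memory(store, steps, rate=0.5):
    """Memory-style loop: read once, then reuse the in-memory copy."""
    data = store.load()
    estimate = 0.0
    for _ in range(steps):
        mean = sum(data) / len(data)
        estimate += rate * (mean - estimate)
    return estimate


disk_store = CountingStore([1.0, 2.0, 3.0, 4.0])
memory_store = CountingStore([1.0, 2.0, 3.0, 4.0])
disk_result = iterate_with_reload(disk_store, steps=10)
memory_result = iterate_in_memory(memory_store, steps=10)
print(disk_store.reads, memory_store.reads)
```

Both loops compute the same estimate, but the first touches storage once per iteration and the second only once in total. On a real cluster each of those reads involves disk and network I/O, which is where the gap in running time comes from.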

Unlike MapReduce, Spark is designed for advanced, near-real-time analytics and has the framework and tools to deliver when a shorter time-to-insight is critical. Spark's integrated framework includes the Machine Learning Library (MLlib), the graph engine GraphX, the Spark Streaming analytics engine, and Shark, a SQL query engine that has since been superseded by Spark SQL. With this all-in-one platform, Spark is said to deliver more consistent results across various types of analysis.

Over the years, Hadoop MapReduce has enjoyed widespread adoption in the enterprise, and that is likely to continue. Going forward, as the need for advanced real-time analytics tools grows, Spark is positioned to meet that challenge.

References:

http://aptuz.com/blog/is-apache-spark-going-to-replace-hadoop/

http://www.qubole.com/blog/big-data/spark-vs-mapreduce/

http://www.computerworld.com/article/2856063/enterprise-software/hadoop-successor-sparks-a-data-analysis-evolution.html

Apache Hadoop videos - https://www.youtube.com/watch?v=A02SRdyoshM&list=PL9ooVrP1hQOFrYxqxb0NJCdCABPZNo0pD

Apache Spark videos - https://www.youtube.com/watch?v=7k_9sdTOdX4&list=PL9ooVrP1hQOGyFc60sExNX1qBWJyV5IMb

Hadoop Training Videos:

Hortonworks Apache Hadoop popular videos - https://www.youtube.com/watch?v=OoEpfb6yga8&list=PLoEDV8GCixRe-JIs4rEUIkG0aTe3FZgXV
Cloudera Apache Hadoop popular videos - https://www.youtube.com/watch?v=eo1PwSfCXTI&list=PLoEDV8GCixRddiUuJzEESimo1qP5tZ1Kn
Stanford Hadoop Training material - https://www.youtube.com/watch?v=d2xeNpfzsYI&list=PLxRwCyObqFr3OZeYsI7X5Mq6GNjzH10e1
Edureka Hadoop Training material - https://www.youtube.com/watch?v=A02SRdyoshM&list=PL9ooVrP1hQOFrYxqxb0NJCdCABPZNo0pD
Durga Solutions Hadoop Training material - https://www.youtube.com/watch?v=Pq3OyQO-l3E&list=PLpc4L8tPSURCdIXH5FspLDUesmTGRQ39I
