Apache Hadoop is an open-source software framework written in Java for distributed storage and distributed processing of very large data sets on computer clusters built from commodity hardware.
MapReduce is a programming model and implementation developed by Google for processing massive, distributed data sets. Apache Hadoop is an open source software framework implementing MapReduce that supports running data-intensive distributed applications on large clusters built of commodity hardware.
HDFS (Hadoop Distributed File System): a distributed file system that stores data on commodity machines, providing very high aggregate bandwidth across the cluster. HDFS is well suited for storing very large files, written once and read many times, on clusters of commodity hardware.
Pig is a platform for analyzing large data sets: it couples a high-level language for expressing data analysis programs with infrastructure for evaluating those programs. Pig scripts are translated into a series of MapReduce jobs that run on the Hadoop cluster, and Pig is extensible through user-defined functions that can be written in Java and other languages.
Apache Hive provides a data warehouse function on top of the Hadoop cluster. Using HiveQL, you can view your data as tables and write queries much as you would in a relational database. To make it easy to interact with Hive, the Hortonworks Sandbox includes a tool called Beeswax, which gives us an interactive interface to Hive: we can type in queries and have Hive evaluate them for us using a series of MapReduce jobs.
Pig is a procedural data-flow language: the programmer defines a step-by-step approach and can control the optimization of every step. Hive looks like SQL and is therefore a declarative language: you specify what should be done rather than how it should be done. Fine-grained optimization is harder in Hive, since you depend on Hive's own optimizer.
References:
MapReduce: Simplified Data Processing on Large Clusters - http://static.usenix.org/events/osdi04/tech/full_papers/dean/dean.pdf
Apache Hadoop Goes Realtime at Facebook - http://borthakur.com/ftp/RealtimeHadoopSigmod2011.pdf
Java development 2.0: Big data analysis with Hadoop MapReduce - http://www.ibm.com/developerworks/java/library/j-javadev2-15/index.html
To Hadoop, or not to Hadoop - https://www.ibm.com/developerworks/mydeveloperworks/blogs/theTechTrek/entry/to_hadoop_or_not_to_hadoop2?lang=en
What is Hadoop - http://www-01.ibm.com/software/data/infosphere/hadoop/
Distributed data processing with Hadoop, Part 1: Getting started - http://www.ibm.com/developerworks/linux/library/l-hadoop-1/
Distributed data processing with Hadoop, Part 2: Going further - http://www.ibm.com/developerworks/linux/library/l-hadoop-2/
Distributed data processing with Hadoop, Part 3: Application development - http://www.ibm.com/developerworks/linux/library/l-hadoop-3/
An introduction to the Hadoop Distributed File System - http://www.ibm.com/developerworks/web/library/wa-introhdfs/
Scheduling in Hadoop - http://www.ibm.com/developerworks/linux/library/os-hadoop-scheduling/index.html
Using MapReduce and load balancing on the cloud - http://www.ibm.com/developerworks/cloud/library/cl-mapreduce/
Intel Big Data - http://www.intel.com/bigdata
Apache Hadoop Framework Spotlights - http://www.intel.com/content/www/us/en/big-data/big-data-apache-hadoop-framework-spotlights-landing.html
Hadoop Books - https://drive.google.com/folderview?id=0B_hC-3L4eq17VFZfdVE0Z2NCUzQ&usp=sharing_eid
Miscellaneous:
Apache Hadoop is an open source software framework, licensed under the Apache v2 license, that supports data-intensive distributed applications. It enables applications to work with thousands of computationally independent computers and petabytes of data. Hadoop was derived from Google's MapReduce and Google File System (GFS) papers. Hadoop is a top-level Apache project, written in the Java programming language, built and used by a global community of contributors. Yahoo! has been the largest contributor to the project and uses Hadoop extensively across its businesses.
Apache Hadoop is a framework for running applications on large clusters built of commodity hardware. The Hadoop framework transparently provides applications with both reliability and data motion. Hadoop implements a computational paradigm named Map/Reduce, in which the application is divided into many small fragments of work, each of which may be executed or re-executed on any node in the cluster. In addition, it provides a distributed file system (HDFS) that stores data on the compute nodes, providing very high aggregate bandwidth across the cluster. Both MapReduce and HDFS are designed so that node failures are handled automatically by the framework.
As a conceptual framework for processing huge data sets, MapReduce is highly optimized for distributed problem-solving using a large number of computers. The framework consists of two functions, as its name implies. The map function is designed to take a large data input and divide it into smaller pieces, which it then hands off to other processes that can do something with them. The reduce function digests the individual answers collected by map and renders them to a final output.
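This split can be made concrete with a toy word count in plain Java (no Hadoop dependency; the class and method names below are illustrative, not part of any Hadoop API, and a HashMap stands in for the shuffle step that groups intermediate values by key):

```java
import java.util.AbstractMap.SimpleEntry;
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// A single-process sketch of the MapReduce dataflow: map() emits
// (word, 1) pairs, a HashMap groups values by key (the "shuffle"),
// and reduce() folds each group of values into a final sum.
public class MiniMapReduce {

    // map: take one line of input and emit intermediate key/value pairs
    static List<Map.Entry<String, Integer>> map(String line) {
        List<Map.Entry<String, Integer>> pairs = new ArrayList<>();
        for (String word : line.toLowerCase().split("\\s+")) {
            if (!word.isEmpty()) pairs.add(new SimpleEntry<>(word, 1));
        }
        return pairs;
    }

    // reduce: digest all values collected for one key into a final value
    static int reduce(String key, List<Integer> values) {
        int sum = 0;
        for (int v : values) sum += v;
        return sum;
    }

    // the "framework": run map over every input line, group by key, reduce
    static Map<String, Integer> run(List<String> lines) {
        Map<String, List<Integer>> grouped = new HashMap<>();
        for (String line : lines) {
            for (Map.Entry<String, Integer> pair : map(line)) {
                grouped.computeIfAbsent(pair.getKey(), k -> new ArrayList<>())
                       .add(pair.getValue());
            }
        }
        Map<String, Integer> result = new HashMap<>();
        for (Map.Entry<String, List<Integer>> e : grouped.entrySet()) {
            result.put(e.getKey(), reduce(e.getKey(), e.getValue()));
        }
        return result;
    }

    public static void main(String[] args) {
        Map<String, Integer> counts =
            run(List.of("hello hadoop", "hello mapreduce hello"));
        System.out.println(counts.get("hello"));   // 3
        System.out.println(counts.get("hadoop"));  // 1
    }
}
```

In real Hadoop, the grouping step is performed by the framework across many machines, and each map or reduce task may be re-executed on another node after a failure; the toy version only illustrates the shape of the two functions.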
In Hadoop, you define map and reduce implementations by extending Hadoop's own base classes. The implementations are tied together by a configuration that specifies them, along with input and output formats. Hadoop is well suited for processing huge files containing structured data. One particularly handy aspect of Hadoop is that it handles the raw parsing of an input file, so that you can deal with one line at a time. Defining a map function is thus really just a matter of determining what you want to grab from an incoming line of text.

HDFS is good for:
- Storing large files
  - Terabytes, petabytes, etc.
  - Millions rather than billions of files, 100MB or more per file
- Streaming data
  - Write-once, read-many access patterns
  - Optimized for streaming reads rather than random reads
  - An append operation was added in Hadoop 0.21
- “Cheap” commodity hardware
  - No need for supercomputers

HDFS is not good for:
- Low-latency reads
  - Optimized for high throughput rather than low latency on small chunks of data
  - HBase addresses this issue
- Large amounts of small files
  - Better for millions of large files (for example, 100MB or more each) than for billions of small files
- Multiple writers
  - Single writer per file
  - Writes only at the end of a file; no support for writing at arbitrary offsets
Hadoop Training Videos -
- Hortonworks Apache Hadoop popular videos - https://www.youtube.com/watch?v=OoEpfb6yga8&list=PLoEDV8GCixRe-JIs4rEUIkG0aTe3FZgXV
- Cloudera Apache Hadoop popular videos - https://www.youtube.com/watch?v=eo1PwSfCXTI&list=PLoEDV8GCixRddiUuJzEESimo1qP5tZ1Kn
- Stanford Hadoop Training material - https://www.youtube.com/watch?v=d2xeNpfzsYI&list=PLxRwCyObqFr3OZeYsI7X5Mq6GNjzH10e1
- Edureka Hadoop Training material - https://www.youtube.com/watch?v=A02SRdyoshM&list=PL9ooVrP1hQOFrYxqxb0NJCdCABPZNo0pD
- Durga Solutions Hadoop Training material - https://www.youtube.com/watch?v=Pq3OyQO-l3E&list=PLpc4L8tPSURCdIXH5FspLDUesmTGRQ39I
- The Hadoop wiki provides community input related to Hadoop and HDFS.
- The Hadoop API site documents the Java classes and interfaces that are used to program to Hadoop and HDFS.
- Wikipedia's MapReduce page is a great place to begin your research into the MapReduce framework.
- Visit Amazon S3 to learn about Amazon's S3 infrastructure.
- The developerWorks Web development zone specializes in articles covering various web-based solutions.
Get products and technologies
- The Hadoop project site contains valuable resources pertaining to the Hadoop architecture and the MapReduce framework.
- The Hadoop Distributed File System project site offers downloads and documentation about HDFS.
- Venture to the CloudStore site for downloads and documentation about the integration between CloudStore, Hadoop, and HDFS.
Discuss
- Create your My developerWorks profile today and set up a watch list on Hadoop. Get connected and stay connected with the developerWorks community.