Tech Kaizen

passion + usefulness = success .. change is the only constant in life

Apache Hadoop - An open source implementation of the MapReduce programming model

Apache Hadoop is an open-source software framework written in Java for distributed storage and distributed processing of very large data sets on computer clusters built from commodity hardware.

MapReduce is a programming model and implementation developed by Google for processing massive, distributed data sets. Apache Hadoop is an open-source implementation of MapReduce: a software framework for running data-intensive distributed applications on large clusters built from commodity hardware.

Apache Hadoop is an open-source software framework, licensed under the Apache License 2.0, that supports data-intensive distributed applications. It enables applications to work with thousands of computationally independent computers and petabytes of data. Hadoop's design was derived from Google's MapReduce and Google File System (GFS) papers. It is a top-level Apache project, written in the Java programming language and built and used by a global community of contributors. Yahoo! has been the largest contributor to the project and uses Hadoop extensively across its businesses.

Apache Hadoop is a framework for running applications on large clusters built from commodity hardware. The Hadoop framework transparently provides applications with both reliability and data motion. Hadoop implements a computational paradigm named Map/Reduce, where the application is divided into many small fragments of work, each of which may be executed or re-executed on any node in the cluster. In addition, it provides a distributed file system (HDFS) that stores data on the compute nodes, providing very high aggregate bandwidth across the cluster. Both MapReduce and the Hadoop Distributed File System are designed so that node failures are automatically handled by the framework.
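
The idea that a fragment of work "may be executed or re-executed" can be sketched in a few lines of Python. This is only a toy, single-process illustration, not Hadoop's scheduler: the names `run_fragments` and `flaky_sum`, the retry limit, and the simulated failure are all assumptions made up for the example.

```python
def run_fragments(fragments, worker, max_attempts=3):
    """Run each fragment of work, re-executing it if the worker fails.

    Toy model only: a real cluster would reschedule the fragment on
    another node; here we simply call the worker again.
    """
    results = {}
    attempts = {}
    for frag_id, data in fragments.items():
        for attempt in range(1, max_attempts + 1):
            attempts[frag_id] = attempt
            try:
                results[frag_id] = worker(frag_id, data, attempt)
                break  # fragment finished, move on to the next one
            except RuntimeError:
                if attempt == max_attempts:
                    raise  # give up after the last allowed attempt
    return results, attempts


def flaky_sum(frag_id, data, attempt):
    """Hypothetical worker: fragment "b" fails on its first attempt only."""
    if frag_id == "b" and attempt == 1:
        raise RuntimeError("simulated node failure")
    return sum(data)


results, attempts = run_fragments({"a": [1, 2], "b": [3, 4], "c": [5]}, flaky_sum)
```

Re-execution is safe here because each fragment is independent and has no side effects, so running it twice gives the same answer as running it once.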

As a conceptual framework for processing huge data sets, MapReduce is designed for distributed problem-solving across a large number of computers. As its name implies, the framework consists of two functions. The input is divided into smaller pieces, and the map function processes each piece independently, emitting intermediate key/value pairs that many processes can produce in parallel. The reduce function digests the intermediate results collected from map, grouped by key, and renders them into a final output.
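
A minimal sketch of the two functions, using the classic word-count example in plain Python. It runs in a single process and only mimics the model; the function names (`map_words`, `shuffle`, `reduce_counts`, `word_count`) are illustrative and are not part of any Hadoop API.

```python
from collections import defaultdict


def map_words(line):
    """Map: turn one input line into (word, 1) pairs."""
    for word in line.lower().split():
        yield word, 1


def shuffle(pairs):
    """Group intermediate values by key (the framework does this step)."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_counts(word, counts):
    """Reduce: collapse all values for one key into a single result."""
    return word, sum(counts)


def word_count(lines):
    pairs = (pair for line in lines for pair in map_words(line))
    return dict(reduce_counts(w, c) for w, c in shuffle(pairs).items())


counts = word_count(["the cat", "the dog"])
```

In a real cluster the map calls would run on many nodes at once and the grouping step would move data across the network; the shape of the computation stays the same.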

In Hadoop, you define map and reduce implementations by extending Hadoop's own base classes. The implementations are tied together by a configuration that specifies them, along with the input and output formats. Hadoop is well-suited for processing huge files containing structured data. One particularly handy aspect of Hadoop is that, for text input, it handles the raw parsing of an input file, so that you can deal with one line at a time. Defining a map function is thus really just a matter of determining what you want to grab from an incoming line of text.
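
To make the "one line at a time" point concrete, here is a rough Python sketch rather than real Hadoop Java code. The `date,symbol,close` line format, the sample values, and the names `map_line`, `reduce_max`, `JOB` and `run_job` are all invented for illustration; the `JOB` dictionary merely stands in for the configuration that ties a mapper and reducer together.

```python
from collections import defaultdict


def map_line(line):
    """Called once per input line; grabs only the fields we care about."""
    fields = line.strip().split(",")
    if len(fields) != 3:
        return  # skip malformed lines
    _date, symbol, close = fields
    try:
        yield symbol, float(close)
    except ValueError:
        return  # skip the header row or non-numeric prices


def reduce_max(symbol, closes):
    """Keep the highest closing price seen for each symbol."""
    return symbol, max(closes)


# Stand-in for a job configuration: which mapper and reducer to use.
JOB = {"mapper": map_line, "reducer": reduce_max}


def run_job(job, lines):
    grouped = defaultdict(list)
    for line in lines:
        for key, value in job["mapper"](line):
            grouped[key].append(value)
    return dict(job["reducer"](k, v) for k, v in grouped.items())


sample = [
    "date,symbol,close",      # header row, ignored by the mapper
    "2012-07-02,AAA,10.50",   # made-up sample data
    "2012-07-03,AAA,11.25",
    "not a valid line",
    "2012-07-02,BBB,7.00",
]
highest = run_job(JOB, sample)
```

The mapper never sees the file as a whole; it only decides what to extract from each line, which is the part the paragraph above says you have to write yourself.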

HDFS (Hadoop Distributed File System): a distributed file system that stores data on commodity machines, providing very high aggregate bandwidth across the cluster.

HDFS is good for:
  • Storing large files
    • Terabytes, petabytes, etc.
    • Millions rather than billions of files, each 100 MB or more
  • Streaming data
    • Write-once, read-many-times access patterns
    • Optimized for streaming reads rather than random reads
    • Append operation added in Hadoop 0.21
  • “Cheap” commodity hardware
    • No need for supercomputers
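
The write-once, append-only, streaming-read pattern in the list above can be modelled with a toy in-memory class. This is not the HDFS client API; `AppendOnlyFile` and its methods are hypothetical names used only to show which operations the access pattern allows.

```python
class AppendOnlyFile:
    """Toy model of a file that supports appends and sequential reads only."""

    def __init__(self):
        self._chunks = []

    def append(self, data):
        """New data can only be added at the end of the file."""
        self._chunks.append(bytes(data))

    def write_at(self, offset, data):
        """Writes at an arbitrary offset are deliberately unsupported."""
        raise NotImplementedError("random writes are not supported")

    def stream(self):
        """Read the file front to back, one chunk at a time."""
        yield from self._chunks


log = AppendOnlyFile()
log.append(b"first record\n")
log.append(b"second record\n")
contents = b"".join(log.stream())
```

Giving up in-place updates is what lets a file system of this kind favor large sequential reads over low-latency random access.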

HDFS is not so good for:
  • Low-latency reads
    • Built for high throughput rather than low latency on small chunks of data
    • HBase addresses this issue
  • Large numbers of small files
    • Better for millions of large files than for billions of small files
    • For example, each file can be 100 MB or more
  • Multiple writers
    • Single writer per file
    • Writes only at the end of a file; no support for writing at an arbitrary offset
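
A back-of-the-envelope calculation shows why many small files hurt. The sketch below rests on two loudly stated assumptions that do not come from this post: a 128 MB block size, and a commonly quoted rule of thumb of roughly 150 bytes of NameNode memory per file or block object. Treat the numbers as illustrative, not as measurements.

```python
import math

KB = 1024
MB = 1024 * KB


def namenode_objects(num_files, file_size, block_size=128 * MB):
    """Count metadata objects: one per file plus one per block (assumed model)."""
    blocks_per_file = max(1, math.ceil(file_size / block_size))
    return num_files * (1 + blocks_per_file)


def estimated_metadata_bytes(num_objects, bytes_per_object=150):
    """Rule-of-thumb memory estimate; 150 bytes per object is an assumption."""
    return num_objects * bytes_per_object


# The same amount of data stored two ways.
large = namenode_objects(10_000, 100 * MB)         # 10,000 files of 100 MB
small = namenode_objects(10_000 * 10_240, 10 * KB)  # the same bytes as 10 KB files

large_bytes = estimated_metadata_bytes(large)
small_bytes = estimated_metadata_bytes(small)
```

Under these assumptions the small-file layout needs 10,240 times as many metadata objects for exactly the same data, which is why the list above favors fewer, larger files.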

Pig vs Hive:

Pig is a platform for analyzing large data sets. It consists of a high-level language (Pig Latin) for expressing data analysis programs, coupled with infrastructure for evaluating those programs. Pig scripts are translated into a series of MapReduce jobs that run on the Hadoop cluster, so they give you a high-level way to create the jobs needed to process your data. Pig is extensible through user-defined functions, which can be written in Java and other languages.

Apache Hive provides data warehouse functionality on top of a Hadoop cluster. Through HiveQL, you can view your data as tables and write queries much as you would against a database. To make it easy to interact with Hive, the Hortonworks Sandbox includes a tool called Beeswax, which provides an interactive interface to Hive: you type in queries, and Hive evaluates them using a series of MapReduce jobs.


Pig is a procedural data-flow language: a script executes step by step, in the order the programmer defines, so you can control the optimization of every step. Hive looks like SQL, which makes it a declarative language: you specify what should be done rather than how it should be done. Hand-tuning is harder in Hive, because the execution plan depends on Hive's own optimizer.
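
The procedural versus declarative distinction can be illustrated in plain Python. Neither function below is Pig Latin or HiveQL; `dataflow_style`, `QUERY` and `run_query` are hypothetical names, and the tiny "engine" is only a stand-in for a real query optimizer.

```python
from collections import Counter

RECORDS = [
    {"user": "ann", "status": 200},
    {"user": "bob", "status": 500},
    {"user": "ann", "status": 500},
    {"user": "ann", "status": 200},
]


def dataflow_style(records):
    """Procedural data flow: the author spells out each step, in order."""
    loaded = list(records)                               # step 1: load
    ok_only = [r for r in loaded if r["status"] == 200]  # step 2: filter
    return dict(Counter(r["user"] for r in ok_only))     # step 3: group and count


# Declarative: describe WHAT is wanted and let an engine decide HOW.
QUERY = {"count_by": "user", "where": ("status", 200)}


def run_query(query, records):
    """Toy engine: it chooses the plan (here, one pass that filters and counts)."""
    field, wanted = query["where"]
    counts = Counter()
    for record in records:
        if record[field] == wanted:
            counts[record[query["count_by"]]] += 1
    return dict(counts)


procedural_result = dataflow_style(RECORDS)
declarative_result = run_query(QUERY, RECORDS)
```

Both produce the same answer. In the first style you can reorder or tune each step yourself; in the second, the only way to change the execution strategy is to change the engine.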


References:

MapReduce: Simplified Data Processing on Large Clusters - http://static.usenix.org/events/osdi04/tech/full_papers/dean/dean.pdf


Apache Hadoop Goes Realtime at Facebook - http://borthakur.com/ftp/RealtimeHadoopSigmod2011.pdf


Java development 2.0: Big data analysis with Hadoop MapReduce - http://www.ibm.com/developerworks/java/library/j-javadev2-15/index.html


To Hadoop, or not to Hadoop - https://www.ibm.com/developerworks/mydeveloperworks/blogs/theTechTrek/entry/to_hadoop_or_not_to_hadoop2?lang=en


What is Hadoop - http://www-01.ibm.com/software/data/infosphere/hadoop/


Distributed data processing with Hadoop, Part 1: Getting started - http://www.ibm.com/developerworks/linux/library/l-hadoop-1/


Distributed data processing with Hadoop, Part 2: Going further - http://www.ibm.com/developerworks/linux/library/l-hadoop-2/


Distributed data processing with Hadoop, Part 3: Application development - http://www.ibm.com/developerworks/linux/library/l-hadoop-3/


An introduction to the Hadoop Distributed File System - http://www.ibm.com/developerworks/web/library/wa-introhdfs/


Scheduling in Hadoop - http://www.ibm.com/developerworks/linux/library/os-hadoop-scheduling/index.html


Using MapReduce and load balancing on the cloud - http://www.ibm.com/developerworks/cloud/library/cl-mapreduce/


Intel Big Data - http://www.intel.com/bigdata


Apache Hadoop Framework Spotlights - http://www.intel.com/content/www/us/en/big-data/big-data-apache-hadoop-framework-spotlights-landing.html


Hadoop Books - https://drive.google.com/folderview?id=0B_hC-3L4eq17VFZfdVE0Z2NCUzQ&usp=sharing_eid


Hadoop Training Videos -


  1. Hortonworks Apache Hadoop popular videos - https://www.youtube.com/watch?v=OoEpfb6yga8&list=PLoEDV8GCixRe-JIs4rEUIkG0aTe3FZgXV
  2. Cloudera Apache Hadoop popular videos - https://www.youtube.com/watch?v=eo1PwSfCXTI&list=PLoEDV8GCixRddiUuJzEESimo1qP5tZ1Kn
  3. Stanford Hadoop Training material - https://www.youtube.com/watch?v=d2xeNpfzsYI&list=PLxRwCyObqFr3OZeYsI7X5Mq6GNjzH10e1
  4. Edureka Hadoop Training material - https://www.youtube.com/watch?v=A02SRdyoshM&list=PL9ooVrP1hQOFrYxqxb0NJCdCABPZNo0pD
  5. Durga Solutions Hadoop Training material - https://www.youtube.com/watch?v=Pq3OyQO-l3E&list=PLpc4L8tPSURCdIXH5FspLDUesmTGRQ39I
Miscellaneous:

  • The Hadoop wiki provides community input related to Hadoop and HDFS.
  • The Hadoop API site documents the Java classes and interfaces that are used to program to Hadoop and HDFS.
  • Wikipedia's MapReduce page is a great place to begin your research into the MapReduce framework.
  • Visit Amazon S3 to learn about Amazon's S3 infrastructure.
  • The Hadoop project site contains valuable resources pertaining to the Hadoop architecture and the MapReduce framework.
  • The Hadoop Distributed File System project site offers downloads and documentation about HDFS.
  • Venture to the CloudStore site for downloads and documentation about the integration between CloudStore, Hadoop, and HDFS.

Labels: BIG DATA ANALYTICS, CLOUD COMPUTING, DATABASE, TECHNICAL MISCELLANEOUS, TECHNOLOGY INTEGRATION
Krishna Kishore Koney