Apache Hadoop is an open-source software framework written in Java for distributed storage and distributed processing of very large data sets on computer clusters built from commodity hardware.
MapReduce is a programming model and implementation developed by Google for processing massive, distributed data sets. Apache Hadoop is an open source software framework implementing MapReduce that supports running data-intensive distributed applications on large clusters built of commodity hardware.
HDFS (Hadoop Distributed File System): a distributed file system that stores data on commodity machines, providing very high aggregate bandwidth across the cluster. HDFS is well suited for storing very large files, written once and read many times, on clusters of commodity hardware.
Pig is a platform for analyzing large data sets: it couples a high-level language for expressing data analysis programs with infrastructure for evaluating those programs. Pig scripts are translated into a series of MapReduce jobs that run on the Hadoop cluster, and Pig is extensible through user-defined functions that can be written in Java and other languages.
Apache Hive provides a data warehouse function on top of the Hadoop cluster. Using HiveQL, you can view your data as tables and write queries much as you would in a relational database. To make it easy to interact with Hive, the Hortonworks Sandbox includes a tool called Beeswax, which gives us an interactive interface to Hive: we can type in queries and have Hive evaluate them for us using a series of MapReduce jobs.
Pig is a procedural data-flow language: the programmer defines a step-by-step approach and can control the optimization of every step. Hive looks like SQL and is therefore a declarative language: you specify what should be done rather than how it should be done. Fine-grained optimization is harder in Hive, since you depend on Hive's own optimizer.
References:
MapReduce: Simplified Data Processing on Large Clusters - http://static.usenix.org/events/osdi04/tech/full_papers/dean/dean.pdf
Apache Hadoop Goes Realtime at Facebook - http://borthakur.com/ftp/RealtimeHadoopSigmod2011.pdf
Java development 2.0: Big data analysis with Hadoop MapReduce - http://www.ibm.com/developerworks/java/library/j-javadev2-15/index.html
To Hadoop, or not to Hadoop - https://www.ibm.com/developerworks/mydeveloperworks/blogs/theTechTrek/entry/to_hadoop_or_not_to_hadoop2?lang=en
What is Hadoop - http://www-01.ibm.com/software/data/infosphere/hadoop/
Distributed data processing with Hadoop, Part 1: Getting started - http://www.ibm.com/developerworks/linux/library/l-hadoop-1/
Distributed data processing with Hadoop, Part 2: Going further - http://www.ibm.com/developerworks/linux/library/l-hadoop-2/
Distributed data processing with Hadoop, Part 3: Application development - http://www.ibm.com/developerworks/linux/library/l-hadoop-3/
An introduction to the Hadoop Distributed File System - http://www.ibm.com/developerworks/web/library/wa-introhdfs/
Scheduling in Hadoop - http://www.ibm.com/developerworks/linux/library/os-hadoop-scheduling/index.html
Using MapReduce and load balancing on the cloud - http://www.ibm.com/developerworks/cloud/library/cl-mapreduce/
Intel Big Data - http://www.intel.com/bigdata
Apache Hadoop Framework Spotlights - http://www.intel.com/content/www/us/en/big-data/big-data-apache-hadoop-framework-spotlights-landing.html
Hadoop Books - https://drive.google.com/folderview?id=0B_hC-3L4eq17VFZfdVE0Z2NCUzQ&usp=sharing_eid
Miscellaneous:
Apache Hadoop is an open source software framework, licensed under the Apache v2 license, that supports data-intensive distributed applications. It enables applications to work with thousands of computationally independent computers and petabytes of data. Hadoop was derived from Google's MapReduce and Google File System (GFS) papers. Hadoop is a top-level Apache project, written in the Java programming language, built and used by a global community of contributors. Yahoo! has been the largest contributor to the project and uses Hadoop extensively across its businesses.
Apache Hadoop is a framework for running applications on large clusters built of commodity hardware. The Hadoop framework transparently provides applications with both reliability and data motion. Hadoop implements a computational paradigm named Map/Reduce, in which the application is divided into many small fragments of work, each of which may be executed or re-executed on any node in the cluster. In addition, it provides a distributed file system (HDFS) that stores data on the compute nodes, providing very high aggregate bandwidth across the cluster. Both MapReduce and HDFS are designed so that node failures are handled automatically by the framework.
As a conceptual framework for processing huge data sets, MapReduce is highly optimized for distributed problem-solving using a large number of computers. The framework consists of two functions, as its name implies. The map function is designed to take a large data input and divide it into smaller pieces, which it then hands off to other processes that can do something with them. The reduce function digests the individual answers collected by map and renders them to a final output.
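This split can be made concrete with a toy word count in plain Java (no Hadoop dependency; the class and method names below are illustrative, not part of any Hadoop API, and a HashMap stands in for the shuffle step that groups intermediate values by key):

```java
import java.util.AbstractMap.SimpleEntry;
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// A single-process sketch of the MapReduce dataflow: map() emits
// (word, 1) pairs, a HashMap groups values by key (the "shuffle"),
// and reduce() folds each group of values into a final sum.
public class MiniMapReduce {

    // map: take one line of input and emit intermediate key/value pairs
    static List<Map.Entry<String, Integer>> map(String line) {
        List<Map.Entry<String, Integer>> pairs = new ArrayList<>();
        for (String word : line.toLowerCase().split("\\s+")) {
            if (!word.isEmpty()) pairs.add(new SimpleEntry<>(word, 1));
        }
        return pairs;
    }

    // reduce: digest all values collected for one key into a final value
    static int reduce(String key, List<Integer> values) {
        int sum = 0;
        for (int v : values) sum += v;
        return sum;
    }

    // the "framework": run map over every input line, group by key, reduce
    static Map<String, Integer> run(List<String> lines) {
        Map<String, List<Integer>> grouped = new HashMap<>();
        for (String line : lines) {
            for (Map.Entry<String, Integer> pair : map(line)) {
                grouped.computeIfAbsent(pair.getKey(), k -> new ArrayList<>())
                       .add(pair.getValue());
            }
        }
        Map<String, Integer> result = new HashMap<>();
        for (Map.Entry<String, List<Integer>> e : grouped.entrySet()) {
            result.put(e.getKey(), reduce(e.getKey(), e.getValue()));
        }
        return result;
    }

    public static void main(String[] args) {
        Map<String, Integer> counts =
            run(List.of("hello hadoop", "hello mapreduce hello"));
        System.out.println(counts.get("hello"));   // 3
        System.out.println(counts.get("hadoop"));  // 1
    }
}
```

In real Hadoop, the grouping step is performed by the framework across many machines, and each map or reduce task may be re-executed on another node after a failure; the toy version only illustrates the shape of the two functions.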
In Hadoop, you define map and reduce implementations by extending Hadoop's own base classes. The implementations are tied together by a configuration that specifies them, along with input and output formats. Hadoop is well suited for processing huge files containing structured data. One particularly handy aspect of Hadoop is that it handles the raw parsing of an input file, so that you can deal with one line at a time. Defining a map function is thus really just a matter of determining what you want to grab from an incoming line of text.

HDFS is good for:
- Storing large files
  - Terabytes, petabytes, etc.
  - Millions rather than billions of files, 100MB or more per file
- Streaming data
  - Write-once, read-many access patterns
  - Optimized for streaming reads rather than random reads
  - An append operation was added in Hadoop 0.21
- “Cheap” commodity hardware
  - No need for supercomputers

HDFS is not good for:
- Low-latency reads
  - Optimized for high throughput rather than low latency on small chunks of data
  - HBase addresses this issue
- Large amounts of small files
  - Better for millions of large files (for example, 100MB or more each) than for billions of small files
- Multiple writers
  - Single writer per file
  - Writes only at the end of a file; no support for writing at arbitrary offsets
Hadoop Training Videos -
- Hortonworks Apache Hadoop popular videos - https://www.youtube.com/watch?v=OoEpfb6yga8&list=PLoEDV8GCixRe-JIs4rEUIkG0aTe3FZgXV
- Cloudera Apache Hadoop popular videos - https://www.youtube.com/watch?v=eo1PwSfCXTI&list=PLoEDV8GCixRddiUuJzEESimo1qP5tZ1Kn
- Stanford Hadoop Training material - https://www.youtube.com/watch?v=d2xeNpfzsYI&list=PLxRwCyObqFr3OZeYsI7X5Mq6GNjzH10e1
- Edureka Hadoop Training material - https://www.youtube.com/watch?v=A02SRdyoshM&list=PL9ooVrP1hQOFrYxqxb0NJCdCABPZNo0pD
- Durga Solutions Hadoop Training material - https://www.youtube.com/watch?v=Pq3OyQO-l3E&list=PLpc4L8tPSURCdIXH5FspLDUesmTGRQ39I
- The Hadoop wiki provides community input related to Hadoop and HDFS.
- The Hadoop API site documents the Java classes and interfaces that are used to program to Hadoop and HDFS.
- Wikipedia's MapReduce page is a great place to begin your research into the MapReduce framework.
- Visit Amazon S3 to learn about Amazon's S3 infrastructure.
- The developerWorks Web development zone specializes in articles covering various web-based solutions.
Get products and technologies
- The Hadoop project site contains valuable resources pertaining to the Hadoop architecture and the MapReduce framework.
- The Hadoop Distributed File System project site offers downloads and documentation about HDFS.
- Venture to the CloudStore site for downloads and documentation about the integration between CloudStore, Hadoop, and HDFS.
Discuss
- Create your My developerWorks profile today and set up a watch list on Hadoop. Get connected and stay connected with the developerWorks community.