HDFS lacks random read/write capability; it is good for sequential data access, and this is where HBase comes into the picture. HBase is a NoSQL database that runs on top of your Hadoop cluster and provides random, real-time read/write access to your data.
Hadoop on its own can perform only batch processing, and data is accessed only in a sequential manner. That means one has to scan the entire dataset even for the simplest of jobs. A huge dataset, when processed, results in another huge dataset, which should also be processed sequentially. At this point, a new solution is needed to access any point of data in a single unit of time (random access).
Like all other file systems, HDFS provides us storage, but in a fault-tolerant manner with high throughput and a lower risk of data loss (because of replication). But, being a file system, HDFS lacks random read and write access, and that is the gap HBase fills. HBase is a distributed, scalable, big data store, modelled after Google's BigTable. Cassandra is somewhat similar to HBase.
You can store both structured and unstructured data in Hadoop, and in HBase as well. Both of them provide multiple mechanisms for accessing the data, such as the shell and other APIs. HBase stores data as key/value pairs in a columnar fashion, while HDFS stores data as flat files. Some of the salient features of both systems are:
HDFS:
- Optimized for streaming access to large files.
- Follows a write-once, read-many ideology.
- Doesn't support random read/write.
HBase:
- Stores key/value pairs in a columnar fashion (columns are grouped into column families).
- Provides low-latency access to small amounts of data within a large dataset.
- Provides a flexible data model.
Hadoop is best suited for offline batch processing, while HBase is used when you have real-time needs. An analogous comparison would be between MySQL and ext4.
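To make the random-access point concrete, here is a minimal sketch of a single-row write and read using the HBase Java client API; the "users" table, "info" column family and row key are made-up examples, not something from the answers above:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;

public class HBaseRandomAccess {
    public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        try (Connection connection = ConnectionFactory.createConnection(conf);
             // "users" table with column family "info" is invented for this sketch
             Table table = connection.getTable(TableName.valueOf("users"))) {

            // random write: put a single cell for row key "user42"
            Put put = new Put(Bytes.toBytes("user42"));
            put.addColumn(Bytes.toBytes("info"), Bytes.toBytes("name"), Bytes.toBytes("Krishna"));
            table.put(put);

            // random read: fetch just that row, no scan over the whole dataset
            Result result = table.get(new Get(Bytes.toBytes("user42")));
            byte[] name = result.getValue(Bytes.toBytes("info"), Bytes.toBytes("name"));
            System.out.println(Bytes.toString(name));
        }
    }
}
```

Note how the read targets one row key directly; doing the same against a flat file in HDFS would mean scanning it sequentially.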
HDFS: the Hadoop Distributed File System. It is optimized for streaming access to large files, stores files that are hundreds of MB and upwards, and is accessed through MapReduce. It is optimized for use cases where you write once and read many times.
HBase: HBase is an open-source, non-relational, distributed database and is part of the Apache Software Foundation's Apache Hadoop project. It runs on top of HDFS. HBase does not support a structured query language like SQL.
Hadoop is a set of integrated technologies. The most notable parts are:
- HDFS - a distributed file system built specifically for massive data processing.
- MapReduce - a framework implementing the MapReduce paradigm over distributed file systems, of which HDFS is one. It can work over other distributed file systems, for example Amazon S3.
- HBase - a distributed, sorted key-value map built on top of a distributed file system. To the best of my knowledge, HDFS is the only DFS implementation compatible with HBase, because HBase needs an append capability to write its write-ahead log; a DFS over Amazon's S3, for example, does not support it.
"If you want to know about Hadoop and Hbase in deatil,
you can visit the respective home pages -"hadoop.apache.org" and
"hbase.apache.org". you can also go through the following books if
you want to learn in depth "Hadoop.The.Definitive.Guide" and
"HBase.The.Definitive.Guide".
I recommend this talk by Todd Lipcon (Cloudera):
"Apache HBase: an introduction" - http://www.slideshare.net/cloudera/chicago-data-summit-apache-hbase-an-introduction
“Hadoop: The Definitive Guide” by Tom White (book)
Distributed programming – partial failure & recovery
Data analytics everywhere
Hadoop clusters => all Linux commodity nodes
Google GFS => the model for Hadoop's HDFS
Nutch search engine – where Hadoop began; creator: Doug Cutting
Hadoop is all about Java (other language support is minimal, meaning only map & reduce, not other functionality)
HDFS is a virtual distributed file system implemented in Java
Hadoop pipelining => this is how Hadoop provides replication/mirroring
Default replication factor (redundancy) of Hadoop is 3
Classic (or) high-availability Hadoop cluster
STONITH algorithm (Shoot The Other Node In The Head, used for fencing in HA setups)
Hadoop Federation => namespaces for various clusters
Bit Rot
Hadoop partial failure – MTBF (mean time between failures)
Hadoop Rack Awareness
Windows Hadoop => Microsoft HDInsight
Hadoop node terminology =>
- MN = master nodes
- NN = name node
- Secondary name node = offload node; does housekeeping (not a back-up node; it belongs to the classic cluster)
- Stand-by name node = back-up name node
- DN = data nodes
- WN = worker nodes
- SN = slave nodes
- TN = task nodes
Hadoop applications => text mining, sentiment analysis, prediction models, index building, collaborative filtering, graph creation and analysis
Nature of Hadoop => batch processing, huge volumes of data
HDFS => storage (files in HDFS are write-once), batch processing & sequential reads (no random reads)
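As a small illustration of that sequential-read model, something like the following streams a file out of HDFS with the Java FileSystem API (the path is a made-up example):

```java
import java.io.InputStream;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IOUtils;

// Sequentially stream an HDFS file to stdout; there is no API here for
// jumping to and rewriting an arbitrary record, which is HBase's job.
public class HdfsCat {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);
        try (InputStream in = fs.open(new Path("/user/krishna/input.txt"))) {
            IOUtils.copyBytes(in, System.out, 4096, false);
        }
    }
}
```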
Map-Reduce => processing
Two master nodes =>
- Name Node: manages HDFS (metadata about files and blocks); the Name Node daemon always runs
- Job Tracker: manages MapReduce
hadoop fs => the Hadoop shell
example: hadoop fs -ls
Hadoop has its own HDFS file system & Hadoop users (example: /user/krishna)
MRv1 daemons => Job Tracker, Task Tracker
MRv2 (MapReduce v2 / YARN) daemons => Resource Manager, Application Master, Node Manager, Job History Server
Impala is based on C++ & it's pretty fast. Impala is similar to Hive, but it does not use MapReduce; it has its own Impala agents.
Streaming API – only Mappers & Reducers can be written in Python, Ruby, etc.; all the other pieces (combiners, partitioners, ...) have to be in Java only.
We need the Hadoop streaming jar file.
MRUnit is built on JUnit and uses the mockito framework.
LocalJobRunner
InputSplit means which HDFS blocks are allocated.
MRUnit gives us InputSplit (it's nothing but blocks), MapDriver, ReduceDriver, MapReduceDriver.
Key driver calls: withInput(), withOutput(), runTest(), resetOutput() (reset, then chain withInput()/withOutput()/runTest() again for another case).
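A minimal MRUnit test might look like the sketch below; WordCountMapper is an assumed mapper class, not something defined in these notes:

```java
import java.io.IOException;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mrunit.mapreduce.MapDriver;
import org.junit.Before;
import org.junit.Test;

// MRUnit test sketch: the MapDriver feeds one record to the mapper and
// checks the exact (key, value) pairs it emits, no cluster needed.
public class WordCountMapperTest {

    private MapDriver<LongWritable, Text, Text, IntWritable> mapDriver;

    @Before
    public void setUp() {
        mapDriver = MapDriver.newMapDriver(new WordCountMapper());
    }

    @Test
    public void outputsOneCountPerWord() throws IOException {
        mapDriver
            .withInput(new LongWritable(0), new Text("hadoop hbase"))
            .withOutput(new Text("hadoop"), new IntWritable(1))
            .withOutput(new Text("hbase"), new IntWritable(1))
            .runTest();
    }
}
```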
Inheritance => an “is a” kind of relationship
Interface => an “is a capability” kind of relationship
hadoop fs => talking to HDFS
hadoop jar => talking to the job tracker
ToolRunner => the ability to pass command-line arguments; it gives you the generic options.
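A typical ToolRunner driver skeleton looks roughly like this (the class name and the example option are illustrative):

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.conf.Configured;
import org.apache.hadoop.util.Tool;
import org.apache.hadoop.util.ToolRunner;

// Implementing Tool lets the driver accept generic options such as
// -D key=value or -files on the command line without extra parsing code.
public class MyDriver extends Configured implements Tool {

    @Override
    public int run(String[] args) throws Exception {
        Configuration conf = getConf();   // already populated with the generic options
        // set up and submit the Job here using conf ...
        return 0;
    }

    public static void main(String[] args) throws Exception {
        // e.g. hadoop jar myjob.jar MyDriver -D mapreduce.job.reduces=2 in out
        int exitCode = ToolRunner.run(new Configuration(), new MyDriver(), args);
        System.exit(exitCode);
    }
}
```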
A Combiner is a mini-reducer; it runs on the mapper node. The distributed cache is READ-ONLY and is available in the task's LOCAL working directory.
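A driver sketch wiring both ideas together; the WordCountMapper/WordCountReducer classes and the stopwords.txt path are assumptions made up for illustration:

```java
import java.net.URI;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

// Driver sketch: the combiner is usually the same class as the reducer,
// and cache files show up as read-only files in each task's working directory.
public class WordCountDriver {
    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "word count");
        job.setJarByClass(WordCountDriver.class);

        job.setMapperClass(WordCountMapper.class);
        job.setCombinerClass(WordCountReducer.class);   // mini-reducer, runs map-side
        job.setReducerClass(WordCountReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);

        // copied to every task node; readable locally as "stopwords.txt"
        job.addCacheFile(new URI("/user/krishna/stopwords.txt#stopwords.txt"));

        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
```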
To debug MapReduce, go to pseudo mode (everything on the same machine). Use LocalJobRunner for debugging code: no name node, job tracker or HDFS – everything on the same machine (it's just like executing your Java program on the local machine, since the driver code executes on the client and has the main() API).
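One way to force that local mode from driver code is a configuration along these lines (standard Hadoop 2.x property names):

```java
import org.apache.hadoop.conf.Configuration;

// Configuration that forces the LocalJobRunner and the local file system,
// so a job can be stepped through in an IDE without any daemons running.
public class LocalDebugConfig {
    public static Configuration create() {
        Configuration conf = new Configuration();
        conf.set("mapreduce.framework.name", "local");  // use LocalJobRunner, no ResourceManager/JobTracker
        conf.set("fs.defaultFS", "file:///");           // read from the local file system, no HDFS
        return conf;
    }
}
```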
Each Hadoop node runs a small web server, so you can see all the logs.
If you do not need reducers, then call setNumReduceTasks(0).
Never try to do RDBMS-style jobs in Hadoop MapReduce; use Pig and Hive instead (example: joins). Hadoop is not good for relational processing; an RDBMS is meant for that kind of processing.
If you have to JOIN, there are two options: a map-side join or a reduce-side join (a map-side sketch follows below).
Map-side join – keep the side data in memory (loaded in the setup() API) & do the comparison in the map() API.
Reduce-side join – complex & weird; it uses a composite key, a sorting comparator & a grouping comparator.
Gzip is not splittable.
LZO and Snappy are the usual splittable options (LZO once it is indexed; Snappy inside container formats such as SequenceFiles). As Hadoop needs to distribute files across HDFS and process them in parallel, it needs a splittable compression format.
Hadoop strives for DATA LOCALITY.
TeraSort (benchmark)
Hadoop is good for OLAP (analytical); it's kind of offline.
OLTP is real-time; it needs instantaneous responses.
Sqoop => SQL to Hadoop & Hadoop to SQL. Cloudera Sqoop has connectors for all the RDBMS vendors.
Sqoop starts 4 mappers by default when we import a database.
Sqoop user guide: https://sqoop.apache.org/docs/1.4.0-incubating/SqoopUserGuide.html
The standard pattern is:
- Use Hadoop for ETL operations (OLAP)
- Populate the Hadoop output into an RDBMS (OLTP)
- Use BI tools on the resulting RDBMS (OLTP)
BI tools => Cognos, Tableau, Informatica, Teradata
Cloudera Impala is trying to bridge the gap between Hadoop & RDBMS performance. Cloudera Impala does not use MapReduce.
Spark is a replacement for MapReduce for developers (technical). It sits above HBase (or) Hadoop.
Hadoop MapReduce map() is called for each line in your file, so never connect to or load from a database in the map() API; do it in the setup() API.
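For instance, a mapper might open a lookup connection once in setup() and reuse it for every call to map(); the JDBC URL and credentials below are placeholders, not from these notes:

```java
import java.io.IOException;
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.SQLException;

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

// Do expensive one-time work (opening a connection) in setup(), not in map(),
// because map() runs once per input record.
public class EnrichMapper extends Mapper<LongWritable, Text, Text, NullWritable> {

    private Connection conn;   // hypothetical lookup connection

    @Override
    protected void setup(Context context) throws IOException {
        try {
            // placeholder connection details for the sketch
            conn = DriverManager.getConnection("jdbc:mysql://lookup-host/db", "user", "pass");
        } catch (SQLException e) {
            throw new IOException("could not open lookup connection", e);
        }
    }

    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        // use the already-open connection here; never open it per record
        context.write(value, NullWritable.get());
    }

    @Override
    protected void cleanup(Context context) throws IOException {
        try {
            if (conn != null) conn.close();
        } catch (SQLException e) {
            throw new IOException(e);
        }
    }
}
```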
Cloudera Flume => gathers all the log files (syslog, web logs, ...) and submits them to Hadoop.
Example: gather syslog from all slaves & submit it to Hadoop for processing.
Hive & Impala use SCHEMA on read
not write.
Hive has a metadata database (Apache Derby). Hive is closely integrated with Java/Python; I mean Hive can use the Java APIs that we already have!
Hive has a UI tool called Hue.
How Hive works: it creates a table pointing to a file in HDFS; all HQL queries are executed against that table on the HDFS file using MapReduce (all the MapReduce joins use composite keys in the background).
Cloudera Impala does not use MapReduce and has its own agents. Hive & Impala are for SQL developers.
Pig => all Pig scripts get converted to MapReduce jobs. It works directly on the HDFS file.
Oozie => the Apache workflow scheduler tool (all configuration is in XML files).
Miscellaneous:
MapReduce => for programmers & full control (Java code)
Pig, Hive and Impala => for business analysts
Pig, Hive and Impala are used in conjunction with data visualization tools like QlikView, Tableau, ...
Hadoop/Hive web UI tools - Hue, Beeswax
Hadoop pipelining is what replicates/mirrors the data (blocks are written through a pipeline of data nodes).
If a task fails 4 times, the job is taken out (failed) by the MapReduce framework.
Avro supports schema evolution.
waitForCompletion() => synchronous
submit() => asynchronous
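In driver code the two styles look roughly like this, for an already-configured Job (the polling interval is arbitrary):

```java
import org.apache.hadoop.mapreduce.Job;

// Sketch of the two submission styles.
public class SubmitStyles {

    // synchronous: blocks until the job finishes (true = print progress)
    static boolean runAndWait(Job job) throws Exception {
        return job.waitForCompletion(true);
    }

    // asynchronous: submit() returns immediately; the caller polls for completion
    static void runInBackground(Job job) throws Exception {
        job.submit();
        while (!job.isComplete()) {
            Thread.sleep(5000);   // do other work, or just poll
        }
    }
}
```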
Any language that can read standard input & emit standard output can be used to write MapReduce jobs.
The Streaming API (Python, ...) can be used only for the map & reduce APIs (not for combiners, partitioners, ...).
MRUnit => gives its own Driver & InputSplit
ToolRunner => allows passing command-line args
Combiner => reduces network traffic.
Default partitioner => hash partitioner; override the getPartition() API to change it.
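Overriding it might look like the sketch below; the routing rule is invented purely to show the mechanics, and the class would be registered with job.setPartitionerClass(...):

```java
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Partitioner;

// Custom partitioner sketch: send keys that start with a digit to partition 0
// and hash everything else over the remaining reducers. The default
// HashPartitioner simply hashes every key.
public class DigitFirstPartitioner extends Partitioner<Text, IntWritable> {

    @Override
    public int getPartition(Text key, IntWritable value, int numPartitions) {
        String k = key.toString();
        if (numPartitions <= 1) {
            return 0;
        }
        if (!k.isEmpty() && Character.isDigit(k.charAt(0))) {
            return 0;
        }
        return 1 + (k.hashCode() & Integer.MAX_VALUE) % (numPartitions - 1);
    }
}
```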
Mapper/Reducer => a call to cleanup() is not guaranteed.
The InputFormat gives data blocks (input splits) to the Mapper; the Mapper outputs key/value pairs; the framework sorts & merges them so the Reducer receives each key with an iterable of values.
Apache Flume => gets log files into Hadoop
Apache Sqoop => gets RDBMS data into Hadoop