HDFS lacks random read/write capability; it is good for sequential data access, and this is where HBase comes into the picture. HBase is a NoSQL database that runs on top of your Hadoop cluster and provides random, real-time read/write access to your data.
Hadoop on its own can perform only batch processing, and data is accessed only in a sequential manner. That means one has to scan the entire dataset even for the simplest of jobs. A huge dataset, when processed, results in another huge dataset, which should also be processed sequentially. At this point, a new solution is needed to access any point of data in a single unit of time (random access).
Like all other file systems, HDFS provides us storage, but in a fault-tolerant manner with high throughput and a lower risk of data loss (because of replication). But, being a file system, HDFS lacks random read and write access, and that is the gap HBase fills. HBase is a distributed, scalable, big data store, modelled after Google's BigTable. Cassandra is somewhat similar to HBase.
You can store both structured and unstructured data in Hadoop, and in HBase as well. Both of them provide multiple mechanisms for accessing the data, such as the shell and other APIs. HBase stores data as key/value pairs in a columnar fashion, while HDFS stores data as flat files. Some of the salient features of both systems are:
HDFS:
- Optimized for streaming access to large files.
- Follows a write-once, read-many ideology.
- Doesn't support random read/write.
HBase:
- Stores key/value pairs in a columnar fashion (columns are grouped into column families).
- Provides low-latency access to small amounts of data within a large dataset.
- Provides a flexible data model.
Hadoop is best suited for offline batch processing, while HBase is used when you have real-time needs. An analogous comparison would be between MySQL and ext4.
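To make the random-access point concrete, here is a minimal sketch of a single-row write and read using the HBase Java client API; the "users" table, "info" column family and row key are made-up examples, not something from the answers above:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;

public class HBaseRandomAccess {
    public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        try (Connection connection = ConnectionFactory.createConnection(conf);
             // "users" table with column family "info" is invented for this sketch
             Table table = connection.getTable(TableName.valueOf("users"))) {

            // random write: put a single cell for row key "user42"
            Put put = new Put(Bytes.toBytes("user42"));
            put.addColumn(Bytes.toBytes("info"), Bytes.toBytes("name"), Bytes.toBytes("Krishna"));
            table.put(put);

            // random read: fetch just that row, no scan over the whole dataset
            Result result = table.get(new Get(Bytes.toBytes("user42")));
            byte[] name = result.getValue(Bytes.toBytes("info"), Bytes.toBytes("name"));
            System.out.println(Bytes.toString(name));
        }
    }
}
```

Note how the read targets one row key directly; doing the same against a flat file in HDFS would mean scanning it sequentially.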
HDFS: the Hadoop Distributed File System. It is optimized for streaming access to large files, stores files that are hundreds of MB and upwards, and is accessed through MapReduce. It is optimized for use cases where you write once and read many times.
HBase: HBase is an open-source, non-relational, distributed database and is part of the Apache Software Foundation's Apache Hadoop project. It runs on top of HDFS. HBase does not support a structured query language like SQL.
Hadoop is a set of integrated technologies. The most notable parts are:
- HDFS - a distributed file system built specifically for massive data processing.
- MapReduce - a framework implementing the MapReduce paradigm over distributed file systems, of which HDFS is one. It can work over other distributed file systems, for example Amazon S3.
- HBase - a distributed, sorted key-value map built on top of a distributed file system. To the best of my knowledge, HDFS is the only DFS implementation compatible with HBase, because HBase needs an append capability to write its write-ahead log; a DFS over Amazon's S3, for example, does not support it.
"If you want to know about Hadoop and Hbase in deatil,
you can visit the respective home pages -"hadoop.apache.org" and
"hbase.apache.org". you can also go through the following books if
you want to learn in depth "Hadoop.The.Definitive.Guide" and
"HBase.The.Definitive.Guide".
I recommend this talk by Todd Lipcon (Cloudera):
"Apache HBase: an introduction" - http://www.slideshare.net/cloudera/chicago-data-summit-apache-hbase-an-introduction
“Hadoop: The Definitive Guide” by Tom White (book)
Distributed programming – partial failure & recovery
Data analytics everywhere
Hadoop clusters => all Linux commodity nodes
Google GFS => the model for Hadoop's HDFS
Nutch search engine – where Hadoop began; creator: Doug Cutting
Hadoop is all about Java (other language support is minimal, meaning only map & reduce, not other functionality)
HDFS is a virtual distributed file system implemented in Java
Hadoop pipelining => this is how Hadoop provides replication/mirroring
Default replication factor (redundancy) of Hadoop is 3
Classic (or) high-availability Hadoop cluster
STONITH algorithm (Shoot The Other Node In The Head, used for fencing in HA setups)
Hadoop Federation => namespaces for various clusters
Bit Rot
Hadoop partial failure – MTBF (mean time between failures)
Hadoop Rack Awareness
Windows Hadoop => Microsoft HDInsight
Hadoop node terminology =>
- MN = master nodes
- NN = name node
- Secondary name node = offload node; does housekeeping (not a back-up node; it belongs to the classic cluster)
- Stand-by name node = back-up name node
- DN = data nodes
- WN = worker nodes
- SN = slave nodes
- TN = task nodes
Hadoop applications => text mining, sentiment analysis, prediction models, index building, collaborative filtering, graph creation and analysis
Nature of Hadoop => batch processing, huge volumes of data
HDFS => storage (files in HDFS are write-once), batch processing & sequential reads (no random reads)
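As a small illustration of that sequential-read model, something like the following streams a file out of HDFS with the Java FileSystem API (the path is a made-up example):

```java
import java.io.InputStream;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IOUtils;

// Sequentially stream an HDFS file to stdout; there is no API here for
// jumping to and rewriting an arbitrary record, which is HBase's job.
public class HdfsCat {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);
        try (InputStream in = fs.open(new Path("/user/krishna/input.txt"))) {
            IOUtils.copyBytes(in, System.out, 4096, false);
        }
    }
}
```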
Map-Reduce => processing
Two master nodes =>
- Name Node: manages HDFS (metadata about files and blocks); the Name Node daemon always runs
- Job Tracker: manages MapReduce
hadoop fs => the Hadoop shell
example: hadoop fs -ls
Hadoop has its own HDFS file system & Hadoop users (example: /user/krishna)
MRv1 daemons => Job Tracker, Task Tracker
MRv2 (MapReduce v2 / YARN) daemons => Resource Manager, Application Master, Node Manager, Job History Server
Impala is based on C++ & it's pretty fast. Impala is similar to Hive, but it does not use MapReduce; it has its own Impala agents.
Streaming API – only Mappers & Reducers can be written in Python, Ruby, etc.; all the other pieces (combiners, partitioners, ...) have to be in Java only.
We need the Hadoop streaming jar file.
MRUnit is built on JUnit and uses the mockito framework.
LocalJobRunner
InputSplit means which HDFS blocks are allocated.
MRUnit gives us InputSplit (it's nothing but blocks), MapDriver, ReduceDriver, MapReduceDriver.
Key driver calls: withInput(), withOutput(), runTest(), resetOutput() (reset, then chain withInput()/withOutput()/runTest() again for another case).
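A minimal MRUnit test might look like the sketch below; WordCountMapper is an assumed mapper class, not something defined in these notes:

```java
import java.io.IOException;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mrunit.mapreduce.MapDriver;
import org.junit.Before;
import org.junit.Test;

// MRUnit test sketch: the MapDriver feeds one record to the mapper and
// checks the exact (key, value) pairs it emits, no cluster needed.
public class WordCountMapperTest {

    private MapDriver<LongWritable, Text, Text, IntWritable> mapDriver;

    @Before
    public void setUp() {
        mapDriver = MapDriver.newMapDriver(new WordCountMapper());
    }

    @Test
    public void outputsOneCountPerWord() throws IOException {
        mapDriver
            .withInput(new LongWritable(0), new Text("hadoop hbase"))
            .withOutput(new Text("hadoop"), new IntWritable(1))
            .withOutput(new Text("hbase"), new IntWritable(1))
            .runTest();
    }
}
```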
Inheritance => an “is a” kind of relationship
Interface => an “is a capability” kind of relationship
hadoop fs => talking to HDFS
hadoop jar => talking to the job tracker
ToolRunner => the ability to pass command-line arguments; it gives you the generic options.
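A typical ToolRunner driver skeleton looks roughly like this (the class name and the example option are illustrative):

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.conf.Configured;
import org.apache.hadoop.util.Tool;
import org.apache.hadoop.util.ToolRunner;

// Implementing Tool lets the driver accept generic options such as
// -D key=value or -files on the command line without extra parsing code.
public class MyDriver extends Configured implements Tool {

    @Override
    public int run(String[] args) throws Exception {
        Configuration conf = getConf();   // already populated with the generic options
        // set up and submit the Job here using conf ...
        return 0;
    }

    public static void main(String[] args) throws Exception {
        // e.g. hadoop jar myjob.jar MyDriver -D mapreduce.job.reduces=2 in out
        int exitCode = ToolRunner.run(new Configuration(), new MyDriver(), args);
        System.exit(exitCode);
    }
}
```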
A Combiner is a mini-reducer; it runs on the mapper node. The distributed cache is READ-ONLY and is available in the task's LOCAL working directory.
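A driver sketch wiring both ideas together; the WordCountMapper/WordCountReducer classes and the stopwords.txt path are assumptions made up for illustration:

```java
import java.net.URI;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

// Driver sketch: the combiner is usually the same class as the reducer,
// and cache files show up as read-only files in each task's working directory.
public class WordCountDriver {
    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "word count");
        job.setJarByClass(WordCountDriver.class);

        job.setMapperClass(WordCountMapper.class);
        job.setCombinerClass(WordCountReducer.class);   // mini-reducer, runs map-side
        job.setReducerClass(WordCountReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);

        // copied to every task node; readable locally as "stopwords.txt"
        job.addCacheFile(new URI("/user/krishna/stopwords.txt#stopwords.txt"));

        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
```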
To debug MapReduce, go to pseudo mode (everything on the same machine). Use LocalJobRunner for debugging code: no name node, job tracker or HDFS – everything on the same machine (it's just like executing your Java program on the local machine, since the driver code executes on the client and has the main() API).
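One way to force that local mode from driver code is a configuration along these lines (standard Hadoop 2.x property names):

```java
import org.apache.hadoop.conf.Configuration;

// Configuration that forces the LocalJobRunner and the local file system,
// so a job can be stepped through in an IDE without any daemons running.
public class LocalDebugConfig {
    public static Configuration create() {
        Configuration conf = new Configuration();
        conf.set("mapreduce.framework.name", "local");  // use LocalJobRunner, no ResourceManager/JobTracker
        conf.set("fs.defaultFS", "file:///");           // read from the local file system, no HDFS
        return conf;
    }
}
```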
Each Hadoop node runs a small web server, so you can see all the logs.
If you do not need reducers, then call setNumReduceTasks(0).
Never try to do RDBMS-style jobs in Hadoop MapReduce; use Pig and Hive instead (example: joins). Hadoop is not good for relational processing; an RDBMS is meant for that kind of processing.
If you have to JOIN, there are two options: a map-side join or a reduce-side join (a map-side sketch follows below).
Map-side join – keep the side data in memory (loaded in the setup() API) & do the comparison in the map() API.
Reduce-side join – complex & weird; it uses a composite key, a sorting comparator & a grouping comparator.
Gzip is not splittable.
LZO and Snappy are the usual splittable options (LZO once it is indexed; Snappy inside container formats such as SequenceFiles). As Hadoop needs to distribute files across HDFS and process them in parallel, it needs a splittable compression format.
Hadoop strives for DATA LOCALITY.
TeraSort (benchmark)
Hadoop is good for OLAP (analytical); it's kind of offline.
OLTP is real-time; it needs instantaneous responses.
Sqoop => SQL to Hadoop & Hadoop to SQL. Cloudera Sqoop has connectors for all the RDBMS vendors.
Sqoop starts 4 mappers by default when we import a database.
Sqoop user guide: https://sqoop.apache.org/docs/1.4.0-incubating/SqoopUserGuide.html
The standard pattern is:
- Use Hadoop for ETL operations (OLAP)
- Populate the Hadoop output into an RDBMS (OLTP)
- Use BI tools on the resulting RDBMS (OLTP)
BI tools => Cognos, Tableau, Informatica, Teradata
Cloudera Impala is trying to bridge the gap between Hadoop & RDBMS performance. Cloudera Impala does not use MapReduce.
Spark is a replacement for MapReduce for developers (technical). It sits above HBase (or) Hadoop.
Hadoop MapReduce map() is called for each line in your file, so never connect to or load from a database in the map() API; do it in the setup() API.
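For instance, a mapper might open a lookup connection once in setup() and reuse it for every call to map(); the JDBC URL and credentials below are placeholders, not from these notes:

```java
import java.io.IOException;
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.SQLException;

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

// Do expensive one-time work (opening a connection) in setup(), not in map(),
// because map() runs once per input record.
public class EnrichMapper extends Mapper<LongWritable, Text, Text, NullWritable> {

    private Connection conn;   // hypothetical lookup connection

    @Override
    protected void setup(Context context) throws IOException {
        try {
            // placeholder connection details for the sketch
            conn = DriverManager.getConnection("jdbc:mysql://lookup-host/db", "user", "pass");
        } catch (SQLException e) {
            throw new IOException("could not open lookup connection", e);
        }
    }

    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        // use the already-open connection here; never open it per record
        context.write(value, NullWritable.get());
    }

    @Override
    protected void cleanup(Context context) throws IOException {
        try {
            if (conn != null) conn.close();
        } catch (SQLException e) {
            throw new IOException(e);
        }
    }
}
```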
Cloudera Flume => gathers all the log files (syslog, web logs, ...) and submits them to Hadoop.
Example: gather syslog from all slaves & submit it to Hadoop for processing.
Hive & Impala use SCHEMA on read
not write.
Hive has a metadata database (Apache Derby). Hive is closely integrated with Java/Python; I mean Hive can use the Java APIs that we already have!
Hive has a UI tool called Hue.
How Hive works: it creates a table pointing to a file in HDFS; all HQL queries are executed against that table on the HDFS file using MapReduce (all the MapReduce joins use composite keys in the background).
Cloudera Impala does not use MapReduce and has its own agents. Hive & Impala are for SQL developers.
Pig => all Pig scripts get converted to MapReduce jobs. It works directly on the HDFS file.
Oozie => the Apache workflow scheduler tool (all configuration is in XML files).
Miscellaneous:
MapReduce => for programmers & full control (Java code)
Pig, Hive and Impala => for business analysts
Pig, Hive and Impala are used in conjunction with data visualization tools like QlikView, Tableau, ...
Hadoop/Hive web UI tools - Hue, Beeswax
Hadoop pipelining is what replicates/mirrors the data (blocks are written through a pipeline of data nodes).
If a task fails 4 times, the job is taken out (failed) by the MapReduce framework.
Avro supports schema evolution.
waitForCompletion() => synchronous
submit() => asynchronous
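In driver code the two styles look roughly like this, for an already-configured Job (the polling interval is arbitrary):

```java
import org.apache.hadoop.mapreduce.Job;

// Sketch of the two submission styles.
public class SubmitStyles {

    // synchronous: blocks until the job finishes (true = print progress)
    static boolean runAndWait(Job job) throws Exception {
        return job.waitForCompletion(true);
    }

    // asynchronous: submit() returns immediately; the caller polls for completion
    static void runInBackground(Job job) throws Exception {
        job.submit();
        while (!job.isComplete()) {
            Thread.sleep(5000);   // do other work, or just poll
        }
    }
}
```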
Any language that can read standard input & emit standard output can be used to write MapReduce jobs.
The Streaming API (Python, ...) can be used only for the map & reduce APIs (not for combiners, partitioners, ...).
MRUnit => gives its own Driver & InputSplit
ToolRunner => allows passing command-line args
Combiner => reduces network traffic.
Default partitioner => hash partitioner; override the getPartition() API to change it.
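Overriding it might look like the sketch below; the routing rule is invented purely to show the mechanics, and the class would be registered with job.setPartitionerClass(...):

```java
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Partitioner;

// Custom partitioner sketch: send keys that start with a digit to partition 0
// and hash everything else over the remaining reducers. The default
// HashPartitioner simply hashes every key.
public class DigitFirstPartitioner extends Partitioner<Text, IntWritable> {

    @Override
    public int getPartition(Text key, IntWritable value, int numPartitions) {
        String k = key.toString();
        if (numPartitions <= 1) {
            return 0;
        }
        if (!k.isEmpty() && Character.isDigit(k.charAt(0))) {
            return 0;
        }
        return 1 + (k.hashCode() & Integer.MAX_VALUE) % (numPartitions - 1);
    }
}
```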
Mapper/Reducer => a call to cleanup() is not guaranteed.
The InputFormat gives data blocks (input splits) to the Mapper; the Mapper outputs key/value pairs; the framework sorts & merges them so the Reducer receives each key with an iterable of values.
Apache Flume => gets log files into Hadoop
Apache Sqoop => gets RDBMS data into Hadoop