Tech Kaizen

passion + usefulness = success .. change is the only constant in life

Showing posts with label BIG DATA ANALYTICS. Show all posts

Apache Hadoop Info Dump ..

HDFS is built for sequential, high-throughput access to large files; it is not designed for low-latency random reads and does not support random writes. This is where HBase comes into the picture. It is a NoSQL database that runs on top of your Hadoop cluster and provides random, real-time read/write access to your data. Hadoop MapReduce performs only batch processing, and data is accessed sequentially, which means a job has to scan the entire dataset even for the simplest lookup. Processing a huge dataset often produces another huge dataset, which again has to be processed sequentially. At that point you need a way to reach any record in roughly constant time (random access). Like other file systems, HDFS provides storage, but in a fault-tolerant manner with high throughput and a lower risk of data loss (because of replication). HBase fills the random-access gap: it is a distributed, scalable big data store modelled after Google's BigTable. Cassandra is somewhat similar to HBase.
You can store both structured and unstructured data in Hadoop, and in HBase as well. Both provide multiple mechanisms to access the data, such as the shell and other APIs. HBase stores data as key/value pairs in a columnar fashion, while HDFS stores data as flat files. Some of the salient features of both systems are:

Hadoop:
  1. Optimized for streaming access of large files.
  2. Follows write-once read-many ideology.
  3. Doesn't support random read/write.
HBase:
  1. Stores key/value pairs in columnar fashion (columns are clubbed together as column families).
  2. Provides low latency access to small amounts of data from within a large data set.
  3. Provides flexible data model.
Hadoop is most suited for offline batch-processing kinda stuff while HBase is used when you have real-time needs.
An analogous comparison would be between MySQL and Ext4.
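
To make the "key/value pairs in column families" idea concrete, here is a toy in-memory sketch in Python. It is not the HBase API; the class name `TinySortedStore` and its methods are made up for illustration. It only shows the shape of the data model: rows kept sorted by row key, each row holding `family:qualifier` columns, with direct lookup by key and range scans over neighbouring keys.

```python
import bisect


class TinySortedStore:
    """Toy model (hypothetical, not HBase): sorted row keys, 'family:qualifier' columns."""

    def __init__(self):
        self._keys = []   # row keys, kept in sorted order
        self._rows = {}   # row key -> {"family:qualifier": value}

    def put(self, row_key, family, qualifier, value):
        if row_key not in self._rows:
            bisect.insort(self._keys, row_key)
            self._rows[row_key] = {}
        self._rows[row_key][f"{family}:{qualifier}"] = value

    def get(self, row_key):
        # Random access by key: no scan over the whole data set.
        return dict(self._rows.get(row_key, {}))

    def scan(self, start_key, stop_key):
        # Range scan over the sorted keys: start inclusive, stop exclusive.
        lo = bisect.bisect_left(self._keys, start_key)
        hi = bisect.bisect_left(self._keys, stop_key)
        return [(k, dict(self._rows[k])) for k in self._keys[lo:hi]]


store = TinySortedStore()
store.put("user#002", "info", "name", "bob")
store.put("user#001", "info", "name", "alice")
store.put("user#001", "stats", "logins", 7)
store.put("video#001", "info", "title", "intro")
```

Because the row keys are sorted, `store.scan("user#", "user$")` returns only the two user rows without touching the video row, which is the kind of small, targeted read that a flat file in HDFS cannot serve cheaply.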
HDFS: the Hadoop Distributed File System. It is optimized for streaming access to large files, typically files of hundreds of MB and upwards, which are accessed through MapReduce. It suits use cases where data is written once and read many times.
HBase: an open-source, non-relational distributed database from the Apache Software Foundation that runs on top of HDFS. HBase does not support a structured query language like SQL.
Hadoop is a set of integrated technologies. Most notable parts are: 
  1. HDFS - distributed file system specially built for massive data processing 
  2. MapReduce - framework implementing the MapReduce paradigm over distributed file systems, of which HDFS is one. It can work over other DFSs, for example Amazon S3. 
  3. HBase - distributed sorted key-value map built on top of a DFS. To the best of my knowledge, HDFS is the only DFS implementation compatible with HBase. HBase needs append capability to write its write-ahead log; a DFS over Amazon S3, for example, does not support it.
If you want to know about Hadoop and HBase in detail, you can visit the respective home pages, "hadoop.apache.org" and "hbase.apache.org". You can also go through the books "Hadoop: The Definitive Guide" and "HBase: The Definitive Guide" if you want to learn in depth.

I recommend you this talk by Todd Lipcon (Cloudera): "Apache HBase: an introduction" - http://www.slideshare.net/cloudera/chicago-data-summit-apache-hbase-an-introduction

“Hadoop: The Definitive Guide” by Tom White

Distributed programming – Partial failure & recovery
Data analytics everywhere

Hadoop Clusters => all Linux commodity nodes
Google GFS => Google's distributed file system
HDFS => Hadoop's open-source counterpart, derived from the GFS paper

Nutch engine – Hadoop creator(doug cutting)

Hadoop is all about JAVA (support for other languages is minimal, meaning only map & reduce, not other functionality)

HDFS is virtual distributed file system implemented in JAVA

Hadoop pipelining => this is how hadoop provides replication/mirroring

Default replication factor(redundancy) of hadoop 3

Classic (or) high availability hadoop cluster

STONITH algorithm

Hadoop Federation => multiple independent name nodes, each managing its own namespace, within one cluster

Bit Rot

Hadoop partial failure - MTBF (mean time between failures)

Hadoop Rack Awareness
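
A rough Python sketch of what rack awareness means for the default replication factor of 3. As I understand the default HDFS placement policy, the first replica goes on the writer's node, the second on a node in a different rack, and the third on another node in that same remote rack. The function name `place_replicas` and the topology dictionary are made up for illustration, and the sketch assumes at least one other rack with two or more nodes.

```python
import random


def place_replicas(writer_node, topology, rng=random):
    """topology: dict of rack name -> list of node names. Returns 3 node names.

    Simplified sketch of rack-aware placement; assumes some other rack
    has at least two nodes (otherwise rng.choice raises IndexError).
    """
    rack_of = {node: rack for rack, nodes in topology.items() for node in nodes}
    first = writer_node                                   # replica 1: local node
    remote_racks = [rack for rack, nodes in topology.items()
                    if rack != rack_of[first] and len(nodes) >= 2]
    remote_rack = rng.choice(remote_racks)                # pick a different rack
    second, third = rng.sample(topology[remote_rack], 2)  # replicas 2 and 3 share it
    return [first, second, third]


topology = {"rack1": ["n1", "n2"], "rack2": ["n3", "n4"]}
placement = place_replicas("n1", topology, random.Random(0))
```

With this layout the data survives the loss of a whole rack, while only one of the two extra copies has to cross the rack boundary during the write.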

Windows Hadoop => Microsoft HDInsight

Hadoop =>

MN = master nodes

NN =  name node
Secondary name node = offload node; House keeping (not a back-up node ; it’s classic code)
Stand-by name node = back-up name node

DN = data nodes
WN = worker nodes
SN = slave nodes
TN = task nodes

Hadoop Applications => Text mining, Sentiment analysis, prediction models, index building, collaborative filtering, graph creation and analysis
Nature of Hadoop => Batch Processing, huge volume of data

HDFS => storage(Files in HDFS are write-once), Batch Processing & sequential Reads( no random reads )
Map-Reduce => Processing

Two master nodes =>

Name Node : manages HDFS (meta data about files , blocks); Name Node daemon always runs;
Job Tracker : manages Mapreduce

Hadoop fs => Hadoop shell
example: hadoop fs -ls
Hadoop has its own HDFS file system & Hadoop users ( example: /user/krishna)

MRv1 daemons => Job Tracker, Task Tracker

MRv2(MapReduce V2) => Resource Manager, Application master, Node Manager, JobHistory

Impala is based on C++ & it’s pretty fast. Impala is similar to Hive. It does not use MapReduce but has its own Impala agents.

Streaming API - Mappers & Reducers only in python, ruby … all other stuff (partitioners ..) has to be in JAVA only ..

We need Hadoop streaming jar file ..

MRUnit is built on JUnit and uses mockito framework
LocalJobRunner

InputSplit => the logical chunk of input handed to one mapper; by default it usually lines up with one HDFS block ..

MRUnit gives us InputSplit (by default roughly one HDFS block), MapDriver, ReduceDriver, MapReduceDriver ..

withInput()
withOutput()
runTest()
resetOutput()
withInput()
withOutput()
runTest()

inheritance => “is a” kind of relationship
interface => “is a capability” kind of relationship

Hadoop fs => talking to HDFS
Hadoop jar => talking to job tracker

ToolRunner => ability to pass commandline arguments. It gives Generic Options

Combiner is mini-reducer. It resides on Mapper node. Distributed cache is READ-ONLY. Distributed cache is available in LOCAL working directory.
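
A small Python sketch of why a combiner helps. This is a plain simulation, not Hadoop code; `map_words` and `combine` are illustrative names. The combiner runs on the mapper's node and pre-aggregates the mapper's output, so fewer key/value pairs have to travel over the network to the reducers.

```python
from collections import Counter


def map_words(line):
    # Mapper: emit (word, 1) for every word in the line.
    return [(word, 1) for word in line.split()]


def combine(pairs):
    # Combiner: sum the counts per word locally, before the shuffle.
    totals = Counter()
    for word, count in pairs:
        totals[word] += count
    return sorted(totals.items())


mapper_output = map_words("to be or not to be")
combined_output = combine(mapper_output)
```

Here the mapper emits 6 pairs but only 4 leave the node after combining. This only works because summing is associative and commutative, so the reducer gets the same final totals either way.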

To debug MapReduce, go to pseudo-distributed mode (all daemons on the same machine). Use LocalJobRunner for debugging code: no name-node, job-tracker, HDFS … everything runs on the same machine (it’s just like executing your Java program on the local machine, as the Driver code executes on the client and has a main() API) ..

Each Hadoop node runs a small web-server so you can see all logs …

If you do not need Reducers then setNumReduceTasks(0)

Never try to do RDBMS jobs in Hadoop MapReduce but use more of PIG, HIVE.  Example : join
    Hadoop is not good for relational processing; RDBMS is meant for this kind of processing.

If you have to JOIN - map side join , reduce side join

Map side join – keep side data in memory(under setup()) & comparison in map() api
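
The map-side join can be sketched in Python like this. It is a simulation of the pattern, not the Hadoop Mapper API; the class name, the department/employee records, and the inner-join behaviour are assumptions for illustration. The small "side" table is loaded into memory once in setup(), and every call to map() just does a dictionary lookup.

```python
class MapSideJoinMapper:
    """Sketch of a map-side (in-memory) join; names and data are hypothetical."""

    def __init__(self, side_records):
        self._side_records = side_records
        self.lookup = None

    def setup(self):
        # Called once per task: load the small table into memory.
        self.lookup = {dept_id: name for dept_id, name in self._side_records}

    def map(self, record):
        # Called once per input record: join by looking up the side table.
        emp_name, dept_id = record
        dept_name = self.lookup.get(dept_id)
        if dept_name is not None:      # inner join: drop unmatched records
            yield (emp_name, dept_name)


mapper = MapSideJoinMapper([(10, "sales"), (20, "eng")])
mapper.setup()
employees = [("ann", 10), ("raj", 20), ("lee", 30)]
joined = [pair for record in employees for pair in mapper.map(record)]
```

No reducer is needed for this kind of join, but it only works when the side data comfortably fits in the memory of every mapper.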

Reduce side join – complex & weird; It used composite key; sorting comparator & grouping comparator
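
A Python sketch of the composite-key idea behind a reduce-side join, again as a simulation rather than Hadoop code (the function name and the sample records are made up). Each record is tagged with its source, the sort uses the full composite key (join key, tag) so the department record arrives first, and the grouping uses only the join key so both sources land in the same reduce call.

```python
from itertools import groupby


def reduce_side_join(departments, employees):
    # Tag 0 makes department records sort ahead of employee records (tag 1).
    tagged = [((dept_id, 0), name) for dept_id, name in departments]
    tagged += [((dept_id, 1), name) for name, dept_id in employees]

    # "Sorting comparator": order by the full composite key (dept_id, tag).
    tagged.sort(key=lambda kv: kv[0])

    joined = []
    # "Grouping comparator": group by the natural key (dept_id) only.
    for _dept_id, group in groupby(tagged, key=lambda kv: kv[0][0]):
        dept_name = None
        for (_, tag), value in group:
            if tag == 0:
                dept_name = value                 # department row comes first
            elif dept_name is not None:
                joined.append((value, dept_name))  # employee joined to its department
    return joined


result = reduce_side_join(
    departments=[(20, "eng"), (10, "sales")],
    employees=[("ann", 10), ("raj", 20), ("lee", 30), ("bo", 10)],
)
```

Because the department row is guaranteed to arrive before the employee rows of the same key, the reducer never has to buffer a whole group in memory.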

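Gzip is not splittable

bzip2 is splittable; LZO becomes splittable once the file is indexed; Snappy is not splittable on its own and is normally used inside a splittable container such as a SequenceFile or Avro file. Hadoop needs splittable input so that several mappers can work on one large file in parallel.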
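
A quick way to see why a gzip file cannot be split, using only Python's standard library (the helper name `can_decode_from` is mine). A mapper that is handed the second half of a gzip file has no header and no decompressor state to start from, so decoding from an arbitrary offset fails.

```python
import gzip
import zlib


def can_decode_from(compressed, offset):
    """Return True if the gzip stream can be decompressed starting at offset."""
    try:
        gzip.decompress(compressed[offset:])
        return True
    except (OSError, EOFError, zlib.error):
        return False


data = b"hadoop line\n" * 10_000
compressed = gzip.compress(data)

whole_file_ok = can_decode_from(compressed, 0)
middle_ok = can_decode_from(compressed, len(compressed) // 2)
```

In practice this means one big .gz file is processed by a single mapper, no matter how many HDFS blocks it spans.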

Hadoop strives for DATA LOCALITY.

Terasort

Hadoop is good for OLAP(analytical). It’s kind of offline.
OLTP is real-time. It needs instantaneous responses.

Sqoop => SQL to Hadoop & Hadoop to SQL. Cloudera Sqoop has connectors for most major RDBMS vendors.
Sqoop starts 4 mappers by default when we import a database table.
Sqoop user-guide: https://sqoop.apache.org/docs/1.4.0-incubating/SqoopUserGuide.html

Standard is –
  1. Use Hadoop for ETL operations (OLAP)
  2. Populate the Hadoop output to a RDBMS (OLTP)
  3. Use BI tools on the RDBMS generated (OLTP)
BI Tools => Cognos, Tableau, Informatica, Teradata

Cloudera Impala is trying to bridge the gap between Hadoop & RDBMS performance. Cloudera Impala does not use MapReduce.

Spark is a replacement for MapReduce aimed at developers (technical). It sits above HBase (or) Hadoop.

Hadoop MapReduce map() is called once for every record (each line, with the default text input format), so never connect to or load a database in the map() API; do it in the setup() API
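
The setup() vs map() advice can be illustrated with a small Python simulation. Nothing here is Hadoop API; `FakeDatabase`, `LookupMapper` and `run_task` are stand-ins invented for the example. The counter shows that the expensive resource is opened once per task, not once per record.

```python
class FakeDatabase:
    """Stand-in for an expensive connection; counts how often it is opened."""
    connections_opened = 0

    def __init__(self):
        FakeDatabase.connections_opened += 1
        self._rows = {"IN": "India", "US": "United States"}

    def lookup(self, code):
        return self._rows.get(code, "unknown")


class LookupMapper:
    def setup(self):
        # Runs once per task: open the connection here.
        self.db = FakeDatabase()

    def map(self, line):
        # Runs once per record: only cheap work here.
        code = line.strip()
        return (code, self.db.lookup(code))


def run_task(mapper, lines):
    # Mimics the task lifecycle: setup() once, then map() for every record.
    mapper.setup()
    return [mapper.map(line) for line in lines]


results = run_task(LookupMapper(), ["IN", "US", "IN", "FR"])
```

Four records are processed with a single connection. Had the connection been opened inside map(), a file with a million lines would have opened a million connections.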

Cloudera Flume => Gather all Log files( syslog, web logs …) and submit to Hadoop.
example: gather syslog from all slaves & submit to Hadoop for processing.

Hive & Impala use SCHEMA on read not write.
Hive has a metastore database (Apache Derby by default). Hive is closely integrated with Java/Python; I mean Hive can use the Java APIs that we already have!
Hive has a UI tool called Hue.
How Hive works: it creates a table pointing to a file in HDFS; all HQL queries are executed via the table on the HDFS file using MapReduce (all MapReduce joins use composite keys in the background).
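
"Schema on read" is easier to see with a tiny Python example. This is only an illustration of the idea, not how Hive is implemented; the file contents, the column names, and the choice to turn unparseable values into None (similar in spirit to the NULLs a SQL-on-Hadoop engine would show) are my assumptions. The raw file is stored untouched, and a schema is applied only when the data is read.

```python
import csv
import io
from datetime import date

# Raw text as it would sit in HDFS: nothing was validated when it was written.
RAW_FILE = "1,alice,2014-01-05\n2,bob,not-a-date\n"


def read_with_schema(raw_text, schema):
    """schema: list of (column_name, parser). Unparseable values become None."""
    rows = []
    for fields in csv.reader(io.StringIO(raw_text)):
        row = {}
        for (name, parser), value in zip(schema, fields):
            try:
                row[name] = parser(value)
            except ValueError:
                row[name] = None   # bad value surfaces at read time, not write time
        rows.append(row)
    return rows


typed_rows = read_with_schema(
    RAW_FILE, [("id", int), ("name", str), ("joined", date.fromisoformat)]
)
string_rows = read_with_schema(RAW_FILE, [("id", str), ("name", str)])
```

The same bytes are read twice with two different schemas, which is exactly what a table definition pointing at an HDFS file allows.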

Cloudera Impala is not using MapReduce and has its own agents. Hive & Impala are for SQL-developers.

Pig => All pig scripts get converted to MapReduce jobs. It works directly on the HDFS file.

Oozie => Apache workflow scheduler tool ( all configuration in XML file).

Miscellaneous:

MapReduce => For programmers & Full control (JAVA code)
Pig, Hive and Impala => For Business Analysts

Pig, Hive and Impala are used in conjunction with data visualization tools like QlikView, Tableau ..

Hadoop Hive Web UI tools  - Hue, Beeswax

Hadoop Pipelining is what replicates/mirrors the blocks across data nodes.

If a task fails 4 times (the default number of attempts), the whole job is failed by the MapReduce framework

Avro supports schema evolution

waitForCompletion => Sync
submit => Async

Any language that can read standard-input & emit standard-output can be used to write MapReduce jobs.
Streaming API(python …) can be used only with MAP REDUCE api ( not for combiners, partitioners ..).
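
A minimal word-count sketch of the streaming contract in Python. The mapper turns input lines into tab-separated key/value lines, the framework sorts them by key, and the reducer reads the sorted lines. The function names are mine, and the `sorted(...)` call is only a local stand-in for Hadoop's shuffle; in a real streaming script each function would be fed `sys.stdin` and its output lines printed to standard output.

```python
from itertools import groupby


def mapper(lines):
    # Emit one "word<TAB>1" line per word.
    for line in lines:
        for word in line.split():
            yield f"{word}\t1"


def reducer(sorted_lines):
    # Input lines arrive sorted by key, so equal words are adjacent.
    pairs = (line.rstrip("\n").split("\t", 1) for line in sorted_lines)
    for word, group in groupby(pairs, key=lambda kv: kv[0]):
        yield f"{word}\t{sum(int(count) for _, count in group)}"


# Local stand-in for: cat input | mapper | sort | reducer
shuffled = sorted(mapper(["to be or not", "to be"]))
word_counts = list(reducer(shuffled))
```

The same two functions can be tested on a laptop with a shell pipeline before the job is ever submitted to a cluster, which is one of the nice things about streaming.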

MRUnit => gives its own Driver & input split

ToolRunner => allows passing commandline args

Combiner => reduce network traffic.

Default partitioner => hash partition; override the getPartition() API for a custom one
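
Hash partitioning boils down to one line of arithmetic: take the key's hash, clear the sign bit, and take the remainder by the number of reducers. The Python sketch below shows that formula. Note the assumptions: `zlib.crc32` is only a deterministic stand-in for the key's Java hashCode(), so the actual partition numbers will differ from a real cluster, and `get_partition` here is an illustrative function, not the Hadoop method.

```python
import zlib

INT_MAX = 0x7FFFFFFF  # same role as Java's Integer.MAX_VALUE: clears the sign bit


def stable_hash(key):
    # Stand-in for hashCode(); chosen because it is deterministic across runs.
    return zlib.crc32(key.encode("utf-8"))


def get_partition(key, num_reduce_tasks, hash_fn=stable_hash):
    # Same key -> same hash -> same reducer, every time.
    return (hash_fn(key) & INT_MAX) % num_reduce_tasks


partitions = {key: get_partition(key, 4) for key in ["apple", "banana", "cherry"]}
```

A custom partitioner just swaps the hash for some other rule, for example `get_partition("abc", 4, hash_fn=len)` routes keys by their length.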

Mapper/Reducer => does not guarantee call to cleanup() ..

InputFormat gives input splits (records) to the Mapper; the Mapper outputs key/value pairs; the framework sorts & merges them by key; the Reducer receives each key with an iterable of its values ..

Apache Flume => gets LOG files to Hadoop

Apache sqoop => gets RDBMS to Hadoop

Posted by Krishna Kishore Koney
Labels: BIG DATA ANALYTICS, CLOUD COMPUTING, PYTHON PROGRAMMING

Apache Hadoop vs Apache Spark

Apache Hadoop is an open-source software framework written in Java for distributed storage and distributed processing of very large data sets on computer clusters built from commodity hardware. All the modules in Hadoop are designed with a fundamental assumption that hardware failures (of individual machines, or racks of machines) are commonplace and thus should be automatically handled in software by the framework.

Apache Spark is an open-source cluster computing framework originally developed in the AMPLab at UC Berkeley. In contrast to Hadoop's two-stage disk-based MapReduce paradigm, Spark's in-memory primitives provide performance up to 100 times faster for certain applications. By allowing user programs to load data into a cluster's memory and query it repeatedly, Spark is well suited to machine learning algorithms.


Hadoop vs Spark


Hadoop is a parallel data processing framework that has traditionally been used to run map/reduce jobs. These are long-running jobs that take minutes or hours to complete. Spark was designed to run on top of Hadoop, and it is an alternative to the traditional batch map/reduce model that can be used for real-time stream data processing and fast interactive queries that finish within seconds. So, Hadoop supports both traditional map/reduce and Spark. We should look at Hadoop as a general-purpose framework that supports multiple models, and we should look at Spark as an alternative to Hadoop MapReduce rather than a replacement for Hadoop.


Spark leans on RAM instead of network and disk I/O, so it is relatively fast compared to Hadoop MapReduce. But because it uses a lot of RAM, it needs dedicated high-end physical machines to produce effective results. Which one to choose depends on the workload, and the variables behind that decision keep changing over time.


MapReduce Hadoop is designed to run batch jobs that address every file in the system. Since that process takes time, MapReduce is well suited for large distributed data processing where fast performance is not an issue, such as running end-of day transactional reports. MapReduce is also ideal for scanning historical data and performing analytics where a short time-to-insight isn’t vital. Spark was purposely designed to support in-memory processing. The net benefit of keeping everything in memory is the ability to perform iterative computations at blazing fast speeds—something MapReduce is not designed to do. 


Unlike MapReduce, Spark is designed for advanced, real-time analytics and has the framework and tools to deliver when shorter time-to-insight is critical. Included in Spark’s integrated framework are the Machine Learning Library (MLlib), the graph engine GraphX, the Spark Streaming analytics engine, and the real-time analytics tool, Shark. With this all-in-one platform, Spark is said to deliver greater consistency in product results across various types of analysis.


Over the years, MapReduce Hadoop has enjoyed widespread adoption in the enterprise, and that will continue to be the case. Going forward, as the need for advanced real-time analytics tools escalates, Spark is positioned to meet that challenge.


ref:


http://aptuz.com/blog/is-apache-spark-going-to-replace-hadoop/

http://www.qubole.com/blog/big-data/spark-vs-mapreduce/


http://www.computerworld.com/article/2856063/enterprise-software/hadoop-successor-sparks-a-data-analysis-evolution.html


Apache Hadoop videos - https://www.youtube.com/watch?v=A02SRdyoshM&list=PL9ooVrP1hQOFrYxqxb0NJCdCABPZNo0pD


Apache Spark videos - https://www.youtube.com/watch?v=7k_9sdTOdX4&list=PL9ooVrP1hQOGyFc60sExNX1qBWJyV5IMb


Hadoop Training Videos -

  1. Hortonworks Apache Hadoop popular videos - https://www.youtube.com/watch?v=OoEpfb6yga8&list=PLoEDV8GCixRe-JIs4rEUIkG0aTe3FZgXV
  2. Cloudera Apache Hadoop popular videos - https://www.youtube.com/watch?v=eo1PwSfCXTI&list=PLoEDV8GCixRddiUuJzEESimo1qP5tZ1Kn
  3. Stanford Hadoop Training material - https://www.youtube.com/watch?v=d2xeNpfzsYI&list=PLxRwCyObqFr3OZeYsI7X5Mq6GNjzH10e1
  4. Edureka Hadoop Training material - https://www.youtube.com/watch?v=A02SRdyoshM&list=PL9ooVrP1hQOFrYxqxb0NJCdCABPZNo0pD
  5. Durga Solutions Hadoop Training material - https://www.youtube.com/watch?v=Pq3OyQO-l3E&list=PLpc4L8tPSURCdIXH5FspLDUesmTGRQ39I

Posted by Krishna Kishore Koney
Labels: BIG DATA ANALYTICS, CLOUD COMPUTING, DATA SCIENCE

Big Data

Big data is an all-encompassing term for any collection of data sets so large and complex that it becomes difficult to process using on-hand data management tools or traditional data processing applications. The trend to larger data sets is due to the additional information derivable from analysis of a single large set of related data, as compared to separate smaller sets with the same total amount of data.

MapReduce is a programming model and an associated implementation for processing and generating large data sets with a parallel, distributed algorithm on a cluster. A MapReduce program is composed of a Map() procedure that performs filtering and sorting (such as sorting students by first name into queues, one queue for each name) and a Reduce() procedure that performs a summary operation (such as counting the number of students in each queue, yielding name frequencies). The "MapReduce System" (also called "infrastructure" or "framework") orchestrates the processing by marshalling the distributed servers, running the various tasks in parallel, managing all communications and data transfers between the various parts of the system, and providing for redundancy and fault tolerance.


MapReduce actually refers to two separate and distinct tasks that Hadoop programs perform. The first is the map job, which takes a set of data and converts it into another set of data, where individual elements are broken down into tuples (key/value pairs). The reduce job takes the output from a map as input and combines those data tuples into a smaller set of tuples. As the sequence of the name MapReduce implies, the reduce job is always performed after the map job.
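
The map, shuffle and reduce steps described above can be simulated in a few lines of Python, using the "count students per first name" example. This is a single-process sketch, not Hadoop; the function names and the student list are made up, and the `shuffle` function stands in for the grouping and sorting that the framework does between the two phases.

```python
from collections import defaultdict


def map_phase(records, map_fn):
    # Apply the map function to every record and collect its (key, value) pairs.
    for record in records:
        yield from map_fn(record)


def shuffle(pairs):
    # Group values by key and sort by key, as the framework does between phases.
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return sorted(groups.items())


def reduce_phase(grouped, reduce_fn):
    # Call the reduce function once per key with all of that key's values.
    return [reduce_fn(key, values) for key, values in grouped]


def first_name_mapper(full_name):
    yield (full_name.split()[0], 1)


def count_reducer(name, ones):
    return (name, sum(ones))


students = ["Asha Rao", "Ben Lee", "Asha Singh", "Ben Roy", "Chen Wu"]
name_counts = reduce_phase(shuffle(map_phase(students, first_name_mapper)), count_reducer)
```

On a real cluster the three functions run on different machines, but the data flow is the same: pairs out of map, grouped by key, one reduce call per key.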


ref:

Big Data - http://en.wikipedia.org/wiki/Big_data, http://en.wikipedia.org/wiki/MapReduce


Hadoop tutorial - http://www.coreservlets.com/hadoop-tutorial/


What is Hadoop - http://www-01.ibm.com/software/data/infosphere/hadoop/

MapReduce: Simplified Data Processing on Large Clusters - http://static.googleusercontent.com/media/research.google.com/en/us/archive/mapreduce-osdi04.pdf

Google’s MapReduce Programming Model(Revisited) - http://userpages.uni-koblenz.de/~laemmel/MapReduce/paper.pdf

MapReduce: Simplified Data Processing on Large Clusters - http://www.cs.utexas.edu/~pingali/CS395T/2012sp/lectures/MR-nikhil-panpalia.pdf


Hadoop/MapReduce - http://www.cs.colorado.edu/~kena/classes/5448/s11/presentations/hadoop.pdf


Apache's implementation of Google's MapReduce framework - https://www.defcon.org/images/defcon-17/dc-17-presentations/defcon-17-calca-anguiano-hadoop.pdf


Intel big data - http://www.intel.com/bigdata


Apache Hadoop Framework Spotlights - http://www.intel.com/content/www/us/en/big-data/big-data-apache-hadoop-framework-spotlights-landing.html

Posted by Krishna Kishore Koney
Labels: BIG DATA ANALYTICS, CLOUD COMPUTING

Apache Cassandra: An open source distributed database management system

Apache Cassandra is an open source distributed database management system. It is an Apache Software Foundation top-level project designed to handle very large amounts of data spread out across many commodity servers while providing a highly available service with no single point of failure. It is a NoSQL solution that was initially developed by Facebook and powered their Inbox Search feature until late 2010. Jeff Hammerbacher, who led the Facebook Data team at the time, has described Cassandra as a BigTable data model running on an Amazon Dynamo-like infrastructure.

Apache Cassandra is a distributed storage system for managing structured/unstructured data while providing reliability at a massive scale. Cassandra database is the right choice when you need scalability and high availability without compromising performance. Linear scalability and proven fault-tolerance on commodity hardware or cloud infrastructure make it the perfect platform for mission-critical data. Cassandra's support for replicating across multiple datacenters is best-in-class, providing lower latency for your users and the peace of mind of knowing that you can survive regional outages.

Cassandra's Column Family data model offers the convenience of column indexes with the performance of log-structured updates, strong support for materialized views, and powerful built-in caching.


Cassandra is designed to scale to a very large size across many commodity servers, with no single point of failure.  The philosophy behind the design of the storage portion of Cassandra is that it be able to satisfy the requirements of applications that demand storage of large amounts of structured data. Cassandra aims to run on top of an infrastructure of hundreds of nodes (possibly spread across different datacenters). At this scale, small and large components fail continuously; the way Cassandra manages the persistent state in the context of these failures enables the reliability and scalability of the software systems relying on this service. 


HBase vs Cassandra:
  • HBase is based on BigTable (Google)
  • Cassandra is based on Dynamo (Amazon). It was initially developed at Facebook by former Amazon engineers. This is one reason why Cassandra supports multiple data centers. Rackspace is a big contributor to Cassandra due to its multi-data-center support.
Prominent users:
  • Cisco's WebEx uses Cassandra to store user feed and activity in near real time.
  • Facebook used Cassandra to power Inbox Search, with over 200 nodes deployed. This was abandoned in late 2010 when they built Facebook Messaging platform on HBase.
  • IBM has done research in building a scalable email system based on Cassandra
  • Netflix uses Cassandra as their back-end database for their streaming services
  • Formspring uses Cassandra to count responses, as well as store Social Graph data 
  • Twitter announced it is planning to use Cassandra because it can be run on large server clusters and is capable of taking in very large amounts of data at a time. Twitter continues to use it, but not for Tweets themselves.
  • WalmartLabs (previously Kosmix) uses Cassandra with SSD
ref:

Apache Cassandra - http://cassandra.apache.org/, http://en.wikipedia.org/wiki/Apache_Cassandra

Cassandra - https://wiki.intuit.com/display/ARCH/Cassandra

Cassandra NoSQL Database: Getting Started - http://msdn.microsoft.com/en-us/magazine/jj553519.aspx

HBase vs Cassandra - http://bigdatanoob.blogspot.com/2012/11/hbase-vs-cassandra.html

Use Cassandra to Run Hadoop MapReduce - http://architects.dzone.com/articles/use-cassandra-run-hadoop

Cassandra vs HBase - http://ria101.wordpress.com/2010/02/24/hbase-vs-cassandra-why-we-moved/

Running Hadoop MapReduce With Cassandra NoSQL - http://allthingshadoop.com/2010/04/24/running-hadoop-mapreduce-with-cassandra-nosql/

Posted by Krishna Kishore Koney
Labels: BIG DATA ANALYTICS, CLOUD COMPUTING, DATABASE

Apache Mahout - Scalable Machine Learning Algorithms


Apache Mahout is a new open source project by the Apache Software Foundation (ASF) with the primary goal of creating scalable machine-learning algorithms that are free to use under the Apache license. The project is entering its second year, with one public release under its belt. Mahout contains implementations for clustering, categorization, CF, and evolutionary programming. Furthermore, where prudent, it uses the Apache Hadoop library to enable Mahout to scale effectively in the cloud.

ref:

Apache Mahout Wikipedia - http://en.wikipedia.org/wiki/Apache_Mahout

Apache Mahout Wiki - https://cwiki.apache.org/MAHOUT/mahout-wiki.html

What is Apache Mahout - http://mahout.apache.org/ 

Introducing Apache Mahout - http://www.ibm.com/developerworks/java/library/j-mahout/

Introduction to Apache Mahout - http://www.slideshare.net/gsingers/intro-to-mahout-dc-hadoop

Data Mining using Mahout - http://search.iiit.ac.in/cloud/presentations/8.pdf

Apache Mahout: a formidable collection of Data mining algos on the top of Hadoop (Map Reduce) - http://codingplayground.blogspot.com/2010/08/apache-mahout-formidable-collection-of.html

Machine Learning using Mahout - http://www.slideshare.net/gsingers/intro-to-apache-mahout

IBM Hadoop and Mahout online resources - http://www.ibm.com/developerworks/java/library/j-mahout/#resources

Posted by Krishna Kishore Koney
Labels: BIG DATA ANALYTICS, CLOUD COMPUTING, DATA SCIENCE

Apache Hadoop - An open source implementation of MapReduce programming model

Apache Hadoop is an open-source software framework written in Java for distributed storage and distributed processing of very large data sets on computer clusters built from commodity hardware.

MapReduce is a programming model and implementation developed by Google for processing massive-scale, distributed data sets. Apache Hadoop is an open-source software framework that implements MapReduce and supports running data-intensive distributed applications on large clusters built of commodity hardware.

Apache Hadoop is an open source software framework that supports data-intensive distributed applications, licensed under the Apache v2 license. It enables applications to work with thousands of computationally independent computers and petabytes of data. Hadoop was derived from Google's MapReduce and Google File System (GFS) papers. Hadoop is a top-level Apache project being built and used by a global community of contributors, written in the Java programming language. Yahoo! has been the largest contributor to the project, and uses Hadoop extensively across its businesses.

Apache Hadoop is a framework for running applications on large cluster built of commodity hardware. The Hadoop framework transparently provides applications both reliability and data motion. Hadoop implements a computational paradigm named Map/Reduce, where the application is divided into many small fragments of work, each of which may be executed or re-executed on any node in the cluster. In addition, it provides a distributed file system (HDFS) that stores data on the compute nodes, providing very high aggregate bandwidth across the cluster. Both MapReduce and the Hadoop Distributed File System are designed so that node failures are automatically handled by the framework.

As a conceptual framework for processing huge data sets, MapReduce is highly optimized for distributed problem-solving using a large number of computers. The framework consists of two functions, as its name implies. The map function is designed to take a large data input and divide it into smaller pieces, which it then hands off to other processes that can do something with it. The reduce function digests the individual answers collected by map and renders them to a final output.

In Hadoop, you define map and reduce implementations by extending Hadoop's own base classes. The implementations are tied together by a configuration that specifies them, along with input and output formats. Hadoop is well-suited for processing huge files containing structured data. One particularly handy aspect of Hadoop is that it handles the raw parsing of an input file, so that you can deal with one line at a time. Defining a map function is thus really just a matter of determining what you want to grab from an incoming line of text.

HDFS(Hadoop Distributed File System): a distributed file-system that stores data on commodity machines, providing very high aggregate bandwidth across the cluster.

HDFS is so good for -
  • Storing large files
    • Terabytes, Petabytes, etc...
    • Millions rather than billions of files, 100MB or more per file
  • Streaming data
    • Write once and read-many times patterns
    • Optimized for streaming reads rather than random reads
    • Append operation added to Hadoop 0.21
  • “Cheap” Commodity Hardware
    • No need for super-computers
HDFS is not so good for - 
  • Low-latency reads
    • High-throughput rather than low latency for small chunks of data
    • HBase addresses this issue
  • Large amount of small files
    • Better for millions of large files instead of billions of small files
    • For example each file can be 100MB or more
  • Multiple Writers
    • Single writer per file
    • Writes only at the end of file, no-support for arbitrary offset
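
Some back-of-the-envelope arithmetic in Python for the "millions of large files instead of billions of small files" point. The name node keeps an entry for every file and every block, so what hurts is the number of objects, not the number of bytes. The function below is my own rough model; the 128 MB block size is an assumption (the usual default in Hadoop 2 and later), and directories and replicas are ignored.

```python
import math


def namenode_objects(total_bytes, file_bytes, block_bytes=128 * 1024**2):
    """Rough count of file entries plus block entries the name node must track."""
    files = math.ceil(total_bytes / file_bytes)
    blocks_per_file = max(1, math.ceil(file_bytes / block_bytes))
    return files + files * blocks_per_file


TB = 1024**4
large_file_objects = namenode_objects(TB, 1024**3)       # 1 TB stored as 1 GB files
small_file_objects = namenode_objects(TB, 100 * 1024)    # 1 TB stored as 100 KB files
```

The same terabyte needs roughly nine thousand name node entries as 1 GB files, but over twenty million entries as 100 KB files, all of which live in the name node's memory.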
pig vs hive:

Pig is a language for expressing data analysis and infrastructure processes. Pig is a platform for analyzing large data sets that consists of a high-level language for expressing data analysis programs, coupled with infrastructure for evaluating these programs. Pig is translated into a series of MapReduce jobs that are run by the Hadoop cluster. Pig is extensible through user-defined functions that can be written in Java and other languages. Pig scripts provide a high level language to create the MapReduce jobs needed to process data in a Hadoop cluster.

Apache Hive provides a data warehouse function to the Hadoop cluster. Through the use of HiveQL you can view your data as a table and create queries like you would in a database. To make it easy to interact with Hive we use a tool in the Hortonworks Sandbox called Beeswax. Beeswax gives us an interactive interface to Hive. We can type in queries and have Hive evaluate them for us using a series of MapReduce jobs.


Pig is a procedural data-flow language. A procedural language executes step by step, in the order defined by the programmer, so you can control the optimization of every step. Hive looks like SQL, which makes it a declarative language: you specify what should be done rather than how it should be done. Hand optimization is difficult in Hive, since Hive depends on its own optimizer.


ref:

MapReduce: Simplified Data Processing on Large Clusters - http://static.usenix.org/events/osdi04/tech/full_papers/dean/dean.pdf


Apache Hadoop Goes Realtime at Facebook - http://borthakur.com/ftp/RealtimeHadoopSigmod2011.pdf


Java development 2.0: Big data analysis with Hadoop MapReduce - http://www.ibm.com/developerworks/java/library/j-javadev2-15/index.html


To Hadoop, or not to Hadoop - https://www.ibm.com/developerworks/mydeveloperworks/blogs/theTechTrek/entry/to_hadoop_or_not_to_hadoop2?lang=en


What is Hadoop - http://www-01.ibm.com/software/data/infosphere/hadoop/


Distributed data processing with Hadoop, Part 1: Getting started - http://www.ibm.com/developerworks/linux/library/l-hadoop-1/


Distributed data processing with Hadoop, Part 2: Going further - http://www.ibm.com/developerworks/linux/library/l-hadoop-2/


Distributed data processing with Hadoop, Part 3: Application development - http://www.ibm.com/developerworks/linux/library/l-hadoop-3/


An introduction to the Hadoop Distributed File System - http://www.ibm.com/developerworks/web/library/wa-introhdfs/


Scheduling in Hadoop - http://www.ibm.com/developerworks/linux/library/os-hadoop-scheduling/index.html


Using MapReduce and load balancing on the cloud - http://www.ibm.com/developerworks/cloud/library/cl-mapreduce/


Intel Big Data - http://www.intel.com/bigdata


Apache Hadoop Framework Spotlights - http://www.intel.com/content/www/us/en/big-data/big-data-apache-hadoop-framework-spotlights-landing.html


Hadoop Books - https://drive.google.com/folderview?id=0B_hC-3L4eq17VFZfdVE0Z2NCUzQ&usp=sharing_eid


Hadoop Training Videos -


  1. Hortonworks Apache Hadoop popular videos - https://www.youtube.com/watch?v=OoEpfb6yga8&list=PLoEDV8GCixRe-JIs4rEUIkG0aTe3FZgXV
  2. Cloudera Apache Hadoop popular videos - https://www.youtube.com/watch?v=eo1PwSfCXTI&list=PLoEDV8GCixRddiUuJzEESimo1qP5tZ1Kn
  3. Stanford Hadoop Training material - https://www.youtube.com/watch?v=d2xeNpfzsYI&list=PLxRwCyObqFr3OZeYsI7X5Mq6GNjzH10e1
  4. Edureka Hadoop Training material - https://www.youtube.com/watch?v=A02SRdyoshM&list=PL9ooVrP1hQOFrYxqxb0NJCdCABPZNo0pD
  5. Durga Solutions Hadoop Training material - https://www.youtube.com/watch?v=Pq3OyQO-l3E&list=PLpc4L8tPSURCdIXH5FspLDUesmTGRQ39I
Miscellaneous:

  • The Hadoop wiki provides community input related to Hadoop and HDFS.
  • The Hadoop API site documents the Java classes and interfaces that are used to program to Hadoop and HDFS.
  • Wikipedia's MapReduce page is a great place to begin your research into the MapReduce framework.
  • Visit Amazon S3 to learn about Amazon's S3 infrastructure.
  • The developerWorks Web development zone specializes in articles covering various web-based solutions.
       Get products and technologies
  • The Hadoop project site contains valuable resources pertaining to the Hadoop architecture and the MapReduce framework.
  • The Hadoop Distributed File System project site offers downloads and documentation about HDFS.
  • Venture to the CloudStore site for downloads and documentation about the integration between CloudStore, Hadoop, and HDFS.

Posted by Krishna Kishore Koney
Labels: BIG DATA ANALYTICS, CLOUD COMPUTING, DATABASE, TECHNICAL MISCELLANEOUS, TECHNOLOGY INTEGRATION
Disclaimer:
1. Please note that my blog posts reflect my own understanding of the subject matter and do not reflect the views of my employer.

2. Most of the time, the content of a blog post is aggregated from Internet articles and other blogs that inspired me. Credit is given by listing the referenced URLs below each post.
