Apache Hadoop Info Dump ..

HDFS lacks random read/write capability; it is good for sequential data access. Hadoop by itself performs only batch processing, and data is accessed in a sequential manner, which means even the simplest job has to scan the entire dataset. A huge dataset, when processed, results in another huge dataset, which must also be processed sequentially. At this point a new solution is needed to access any point of the data in a single unit of time (random access). Like any other file system, HDFS provides storage, but in a fault-tolerant manner with high throughput and a lower risk of data loss (because of replication). Being a file system, though, HDFS lacks random read and write access, and this is where HBase comes into the picture. HBase is a distributed, scalable, NoSQL big-data store, modelled after Google's BigTable, that runs on top of your Hadoop cluster and gives you random, real-time read/write access to your data. Cassandra is somewhat similar to HBase.
You can store both structured and unstructured data in Hadoop, and in HBase as well. Both of them provide multiple mechanisms to access the data, such as the shell and other APIs. HBase stores data as key/value pairs in a columnar fashion, while HDFS stores data as flat files. Some of the salient features of the two systems are:

Hadoop:
  1. Optimized for streaming access of large files.
  2. Follows write-once read-many ideology.
  3. Doesn't support random read/write.
HBase:
  1. Stores key/value pairs in columnar fashion (columns are clubbed together as column families).
  2. Provides low latency access to small amounts of data from within a large data set.
  3. Provides flexible data model.
Hadoop is best suited for offline batch processing, while HBase is used when you have real-time needs.
An analogous comparison would be between MySQL and Ext4.
HDFS (Hadoop Distributed File System): optimized for streaming access of large files. Files stored on HDFS are typically hundreds of MB and upwards and are accessed through MapReduce. It is optimized for write-once, read-many use cases.
HBASE: HBase is an open-source, non-relational, distributed database that is part of the Apache Software Foundation's Apache Hadoop project and runs on top of HDFS. HBase does not support a structured query language like SQL.
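
For concreteness, here is a minimal sketch of that random read/write access using the HBase Java client API. The table name "users", the column family "info", and the row key are hypothetical; the cluster configuration is assumed to come from hbase-site.xml on the classpath.

// Minimal sketch: random write and random read of a single row via the HBase Java client.
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;

public class HBaseRandomAccess {
    public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();   // reads hbase-site.xml
        try (Connection connection = ConnectionFactory.createConnection(conf);
             Table table = connection.getTable(TableName.valueOf("users"))) {   // hypothetical table

            // Random write: one row keyed by "row1".
            Put put = new Put(Bytes.toBytes("row1"));
            put.addColumn(Bytes.toBytes("info"), Bytes.toBytes("name"), Bytes.toBytes("krishna"));
            table.put(put);

            // Random read: fetch that single row back without scanning the whole data set.
            Result result = table.get(new Get(Bytes.toBytes("row1")));
            byte[] name = result.getValue(Bytes.toBytes("info"), Bytes.toBytes("name"));
            System.out.println(Bytes.toString(name));
        }
    }
}
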
Hadoop is a set of integrated technologies. Most notable parts are: 
  1. HDFS - distributed file system specially built for massive data processing 
  2. MapReduce - a framework implementing the MapReduce paradigm over distributed file systems, of which HDFS is one. It can work over other DFS implementations - for example Amazon S3.
  3. HBase - a distributed sorted key-value map built on top of a DFS. To the best of my knowledge, HDFS is the only DFS implementation compatible with HBase, because HBase needs an append capability to write its write-ahead log. For example, a DFS over Amazon's S3 does not support it.
"If you want to know about Hadoop and Hbase in deatil, you can visit the respective home pages -"hadoop.apache.org" and "hbase.apache.org". you can also go through the following books if you want to learn in depth "Hadoop.The.Definitive.Guide" and "HBase.The.Definitive.Guide".

I recommend this talk by Todd Lipcon (Cloudera): "Apache HBase: an introduction" - http://www.slideshare.net/cloudera/chicago-data-summit-apache-hbase-an-introduction

“Hadoop: The Definitive Guide” by Tom White

Distributed programming – Partial failure & recovery
Data analytics everywhere

Hadoop Clusters => all Linux commodity nodes
Google GFS => the Google file system on which Hadoop's HDFS is modelled

Nutch search engine => the project out of which Hadoop's creator (Doug Cutting) built Hadoop

Hadoop is all about Java (support for other languages is minimal, meaning only map & reduce, not other functionality)

HDFS is a virtual distributed file system implemented in Java

Hadoop pipelining => this is how hadoop provides replication/mirroring

Default replication factor (redundancy) in Hadoop is 3

Classic (or) high-availability Hadoop cluster

STONITH (shoot the other node in the head) algorithm

HDFS Federation => multiple independent NameNodes/namespaces serving one cluster

Bit Rot

Hadoop partial failure - MTBF (mean time between failures)

Hadoop Rack Awareness

Hadoop on Windows => Microsoft HDInsight

Hadoop =>

MN = master nodes

NN = name node
Secondary name node = offload node; does housekeeping/checkpointing (it is not a backup node; this is the classic-cluster setup)
Stand-by name node = back-up name node

DN = data nodes
WN = worker nodes
SN = slave nodes
TN = task nodes

Hadoop Applications => Text mining, Sentiment analysis, prediction models, index building, collaborative filtering, graph creation and analysis
Nature of Hadoop => Batch Processing, huge volume of data

HDFS => storage (files in HDFS are write-once), batch processing & sequential reads (no random reads)
Map-Reduce => Processing

Two master nodes =>

Name Node: manages HDFS (metadata about files, blocks); the Name Node daemon always runs
Job Tracker : manages Mapreduce

Hadoop fs => Hadoop shell
example: hadoop fs -ls
Hadoop has its own HDFS file system & Hadoop users (example: /user/krishna)

MRv1 daemons => Job Tracker, Task Tracker

MRv2(MapReduce V2) => Resource Manager, Application master, Node Manager, JobHistory

Impala is based on C++ & it's pretty fast. Impala is similar to Hive. It does not use MapReduce but has its own Impala agents.

Streaming API - only mappers & reducers can be written in Python, Ruby, etc.; all the other stuff (combiners, partitioners, ...) has to be in Java only.

We need the Hadoop streaming JAR file.

MRUnit is built on JUnit and uses the Mockito framework
LocalJobRunner

InputSplit => describes which HDFS blocks are allocated to a single mapper

MRUnit gives us an InputSplit (it's nothing but the input blocks), MapDriver, ReduceDriver, and MapReduceDriver.

Typical driver call sequence: withInput(), withOutput(), runTest(); then resetOutput() and repeat withInput(), withOutput(), runTest() to reuse the same driver (see the sketch below).
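
A minimal sketch of that withInput()/withOutput()/runTest() flow using MapDriver, assuming a hypothetical WordCountMapper (a Mapper<LongWritable, Text, Text, IntWritable> that emits (word, 1) for every word in a line):

// Minimal MRUnit sketch with the new-API MapDriver; WordCountMapper is hypothetical.
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mrunit.mapreduce.MapDriver;
import org.junit.Before;
import org.junit.Test;

public class WordCountMapperTest {
    private MapDriver<LongWritable, Text, Text, IntWritable> mapDriver;

    @Before
    public void setUp() {
        mapDriver = MapDriver.newMapDriver(new WordCountMapper());
    }

    @Test
    public void emitsOneCountPerWord() throws Exception {
        mapDriver.withInput(new LongWritable(0), new Text("hadoop hbase"))   // fake input record
                 .withOutput(new Text("hadoop"), new IntWritable(1))
                 .withOutput(new Text("hbase"), new IntWritable(1))
                 .runTest();   // runs the mapper in memory and checks the expected output
    }
}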

inheritance => “is a” kind of relationship
interface => “is a capability” kind of relationship

Hadoop fs => talking to HDFS
Hadoop jar => talking to job tracker

ToolRunner => ability to pass command-line arguments. It gives Generic Options.
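
A minimal ToolRunner sketch (the driver class name MyDriver is hypothetical). ToolRunner parses the generic options (-D key=value, -files, -libjars, ...) into the Configuration before handing the remaining arguments to run():

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.conf.Configured;
import org.apache.hadoop.util.Tool;
import org.apache.hadoop.util.ToolRunner;

public class MyDriver extends Configured implements Tool {

    @Override
    public int run(String[] args) throws Exception {
        Configuration conf = getConf();   // already populated with any -D overrides
        // ... build and submit the Job here using conf and the remaining args ...
        return 0;
    }

    public static void main(String[] args) throws Exception {
        System.exit(ToolRunner.run(new MyDriver(), args));
    }
}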

The Combiner is a mini-reducer; it runs on the mapper node. The distributed cache is READ-ONLY and is available in the task's LOCAL working directory.
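
A driver-side sketch of wiring both of these up; the mapper/reducer class names and the cache-file path are hypothetical:

// Register a combiner and push a read-only side file into the distributed cache.
import java.net.URI;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.Job;

public class CombinerAndCacheSetup {
    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "with combiner and cache");
        job.setJarByClass(CombinerAndCacheSetup.class);

        job.setMapperClass(WordCountMapper.class);      // hypothetical mapper
        job.setCombinerClass(WordCountReducer.class);   // mini-reducer, runs on the mapper node
        job.setReducerClass(WordCountReducer.class);    // hypothetical reducer

        // Copied to each task's local working directory; tasks read it but never write it.
        job.addCacheFile(new URI("/user/krishna/lookup.txt"));

        // ... set input/output paths and key/value classes as usual ...
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}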

To debug MapReduce, go to pseudo mode (everything on the same machine) and use LocalJobRunner for debugging code. No name node, no job tracker, no HDFS: it all runs on the same machine (it's just like executing your Java program on the local machine, as the driver code executes on the client and has the main() method).

Each Hadoop node runs a small web server, so you can see all the logs.

If you do not need reducers, call setNumReduceTasks(0).
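
A sketch of such a map-only job (the mapper class is hypothetical, and input/output paths come from the command line). With zero reduce tasks the shuffle/sort phase is skipped and mapper output is written straight to the output path:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class MapOnlyJob {
    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "map-only");
        job.setJarByClass(MapOnlyJob.class);
        job.setMapperClass(WordCountMapper.class);   // hypothetical mapper
        job.setNumReduceTasks(0);                    // no reducers at all
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}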

Never try to do RDBMS-style jobs (example: joins) in raw Hadoop MapReduce; use Pig or Hive for that instead.
    Hadoop is not good for relational processing; an RDBMS is meant for this kind of processing.

If you have to JOIN, there is the map-side join and the reduce-side join.

Map-side join – keep the side data in memory (loaded in setup()) & do the comparison in the map() method (see the sketch after the next note).

Reduce-side join – complex & weird; it uses a composite key, a sorting comparator & a grouping comparator.
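
A sketch of the map-side join idea, assuming a hypothetical side file countries.txt (lines of "code,name") shipped via the distributed cache, and input records of the form "userId,countryCode":

import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;
import java.util.HashMap;
import java.util.Map;

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class MapSideJoinMapper extends Mapper<LongWritable, Text, Text, Text> {

    private final Map<String, String> countryByCode = new HashMap<>();

    @Override
    protected void setup(Context context) throws IOException, InterruptedException {
        // "countries.txt" was shipped with job.addCacheFile(...) and shows up
        // in the task's local working directory; load it once per task.
        try (BufferedReader reader = new BufferedReader(new FileReader("countries.txt"))) {
            String line;
            while ((line = reader.readLine()) != null) {
                String[] parts = line.split(",", 2);
                countryByCode.put(parts[0], parts[1]);
            }
        }
    }

    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        // Join each "userId,countryCode" record against the in-memory side table.
        String[] fields = value.toString().split(",");
        String countryName = countryByCode.getOrDefault(fields[1], "UNKNOWN");
        context.write(new Text(fields[0]), new Text(countryName));
    }
}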

Gzip is not splittable

LZO (when indexed) is a splittable compression format; Snappy on its own is not splittable but is commonly used inside splittable container formats (e.g. SequenceFiles). Because Hadoop needs to distribute files across HDFS and process blocks in parallel, it prefers splittable compression.

Hadoop strives for DATA LOCALITY.

Terasort

Hadoop is good for OLAP (analytical); it's kind of offline.
OLTP is real-time; it needs instantaneous responses.

Sqoop => SQL to Hadoop & Hadoop to SQL. Cloudera Sqoop has connectors for all the major RDBMS vendors.
Sqoop starts 4 mappers by default when we import a database.

Standard is –
  1. Use Hadoop for ETL operations (OLAP)
  2. Populate the Hadoop output to a RDBMS (OLTP)
  3. Use BI tools on the RDBMS generated (OLTP)
BI Tools => Cognos, Tableau, Informatica, Teradata

Cloudera Impala is trying to bridge the gap between Hadoop & RDBMS performance. Cloudera Impala does not use MapReduce.

Spark is a replacement for MapReduce for (technical) developers. It sits above HBASE (or) HADOOP.

Hadoop MapReduce's map() is called for each line of your file, so never connect to or load from a database in the map() method; do it in the setup() method.
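
A sketch of that pattern (the JDBC URL, credentials, and output are hypothetical): the connection is opened once per task in setup(), reused in map(), and closed in cleanup():

import java.io.IOException;
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.SQLException;

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class DbAwareMapper extends Mapper<LongWritable, Text, Text, Text> {

    private Connection connection;

    @Override
    protected void setup(Context context) throws IOException {
        try {
            // One connection per map task, not one per record.
            connection = DriverManager.getConnection("jdbc:mysql://dbhost/mydb", "user", "secret");
        } catch (SQLException e) {
            throw new IOException("Could not open DB connection", e);
        }
    }

    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        // Reuse 'connection' here for lookups; never call DriverManager.getConnection() in map().
        context.write(value, new Text("processed"));
    }

    @Override
    protected void cleanup(Context context) throws IOException {
        try {
            if (connection != null) {
                connection.close();
            }
        } catch (SQLException e) {
            throw new IOException(e);
        }
    }
}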

Cloudera Flume => Gather all Log files( syslog, web logs …) and submit to Hadoop.
example: gather syslog from all slaves & submit to Hadoop for processing.

Hive & Impala use SCHEMA on read not write.
Hive has meta database (of Apache Derby). Hive is closely integrated with Java/Python; I mean Hive can use the JAVA APIs that we already have !
Hive has UI tool called Hue.
How Hive works: It create as table pointing to file in HDFS; All HQL queries are executed via table on HDFS file using MapReduce ( All MapReduce joins using composite keys in the background).

Cloudera Impala does not use MapReduce and has its own agents. Hive & Impala are for SQL developers.

Pig => all Pig scripts get converted to MapReduce jobs. Pig works directly on files in HDFS.

Oozie => Apache workflow scheduler tool ( all configuration in XML file).

Miscellaneous:

MapReduce => for programmers & full control (Java code)
Pig, Hive and Impala => for business analysts

Pig, Hive and Impala are used in conjunction with data visualization tools like QlikView, Tableau, ...

Hadoop Hive Web UI tools  - Hue, Beeswax

Hadoop pipelining is what replicates/mirrors HDFS blocks across data nodes.

If a task fails 4 times (the default limit), the job is failed by the MapReduce framework (this is handled by MapReduce, not HDFS).

Avro supports schema evolution

waitForCompletion() => synchronous
submit() => asynchronous
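
A small sketch of the two launch styles; the Job passed in is assumed to be fully configured, as in the driver sketches above:

import org.apache.hadoop.mapreduce.Job;

public class SubmitModes {

    public static void runBlocking(Job job) throws Exception {
        // Synchronous: blocks, printing progress, until the job succeeds or fails.
        boolean ok = job.waitForCompletion(true);
        System.exit(ok ? 0 : 1);
    }

    public static void runAsync(Job job) throws Exception {
        // Asynchronous: submit() returns immediately; we poll for completion ourselves.
        job.submit();
        while (!job.isComplete()) {
            Thread.sleep(5000);   // do other work, or just wait and poll again
        }
        System.out.println("success = " + job.isSuccessful());
    }
}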

Any language that can read standard-input & emit standard-output can be used to write MapReduce jobs.
The Streaming API (Python, ...) can be used only for the map & reduce functions (not for combiners, partitioners, ...).

MRUnit => gives its own driver & input split

ToolRunner => allows passing command-line args

Combiner => reduces network traffic.

Default partitioner => hash partitioner; override the getPartition() method for custom partitioning
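
A sketch of a custom partitioner; the route-by-first-letter rule is just illustrative. The default HashPartitioner effectively does (key.hashCode() & Integer.MAX_VALUE) % numPartitions, where numPartitions is the number of reducers.

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Partitioner;

public class FirstLetterPartitioner extends Partitioner<Text, IntWritable> {

    @Override
    public int getPartition(Text key, IntWritable value, int numPartitions) {
        // Route keys to reducers by their first character instead of by hash code.
        if (key.getLength() == 0) {
            return 0;
        }
        char first = Character.toLowerCase(key.toString().charAt(0));
        return first % numPartitions;   // char is non-negative, so this stays in range
    }
}
// In the driver: job.setPartitionerClass(FirstLetterPartitioner.class);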

Mapper/Reducer => a call to cleanup() is not guaranteed

InputFormat gives input splits (data blocks) to the mappers; each mapper emits key/value pairs; the framework sorts & merges them so that the reducer receives each key with an iterable of its values.
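
A sketch of the reduce side of that flow (word-count style, names hypothetical): reduce() sees one key plus an Iterable of all the values the mappers emitted for it.

import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

public class WordCountReducer extends Reducer<Text, IntWritable, Text, IntWritable> {

    @Override
    protected void reduce(Text key, Iterable<IntWritable> values, Context context)
            throws IOException, InterruptedException {
        int sum = 0;
        for (IntWritable count : values) {   // all counts for this word, already grouped and sorted
            sum += count.get();
        }
        context.write(key, new IntWritable(sum));
    }
}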

Apache Flume => gets LOG files to Hadoop

Apache Sqoop => gets RDBMS data into Hadoop

Windows 10 Universal Windows Platform (UWP) ..

The Universal Windows Platform (UWP) provides a guaranteed core API layer across devices. You can create a single app package that can be installed onto a wide range of devices, and a single store makes it easy to publish apps across all device types. Because your UWP app runs on a wide variety of devices with different form factors and input modalities, you want it to be tailored to each device and be able to unlock the unique capabilities of each device. Devices add their own unique APIs to the guaranteed API layer. You can write code to access those unique APIs conditionally so that your app lights up features specific to one type of device while presenting a different experience on other devices. Adaptive UI controls and new layout panels help you tailor your UI across a broad range of screen resolutions.

The Windows Runtime (WinRT) is the technology that lets you build Universal Windows Platform (UWP) apps. A Universal Windows app is a Windows experience built upon the Universal Windows Platform (UWP), which was first introduced in Windows 8 as the Windows Runtime. Universal Windows apps are most often distributed via the Windows Store (but can also be side-loaded), and are most often packaged and distributed using the .APPX packaging format. Windows on ARM will only support WinRT, not Win32: Metro (Store) applications go through WinRT, while classic desktop applications go through the Win32 APIs.

Visual Studio provides a Universal Windows app template that lets you create a Windows Store app (for PCs, tablets, and laptops) and a Windows Phone Store app in the same project. When your work is finished, you can produce app packages for the Windows Store and Windows Phone Store with a single action to get your app out to customers on any Windows device. You can create Universal Windows apps using the programming languages you're most familiar with, like JavaScript, C#, Visual Basic, or C++. You can even write components in one language and use them in an app that's written in another language. Universal Windows apps can use the Windows Runtime, a native API built into the operating system. This API is implemented in C++ and supported in JavaScript, C#, Visual Basic, and C++ in a way that feels natural for each language.

The following are the conditional compilation constants that you can use to write platform-specific code:
C#
WINDOWS_APP
WINDOWS_PHONE_APP
C++
WINAPI_FAMILY_PC_APP
WINAPI_FAMILY_PHONE_APP

ref:

Guide to Universal Windows Platform (UWP) apps - https://msdn.microsoft.com/en-us/library/Dn894631.aspx


Best practices in developing (universal) apps for Windows Runtime - https://github.com/futurice/windows-app-development-best-practices

How to: Use Existing C++ Code in a Universal Windows Platform App - https://msdn.microsoft.com/en-us/library/mt186162.aspx


Microsoft Virtual Academy resources -

Apache Hadoop vs Apache Spark

Apache Hadoop is an open-source software framework written in Java for distributed storage and distributed processing of very large data sets on computer clusters built from commodity hardware. All the modules in Hadoop are designed with a fundamental assumption that hardware failures (of individual machines, or racks of machines) are commonplace and thus should be automatically handled in software by the framework.

Apache Spark is an open-source cluster computing framework originally developed in the AMPLab at UC Berkeley. In contrast to Hadoop's two-stage disk-based MapReduce paradigm, Spark's in-memory primitives provide performance up to 100 times faster for certain applications. By allowing user programs to load data into a cluster's memory and query it repeatedly, Spark is well suited to machine learning algorithms.


Hadoop vs Spark


Hadoop is a parallel data processing framework that has traditionally been used to run map/reduce jobs. These are long-running jobs that take minutes or hours to complete. Spark is designed to run on top of Hadoop, and it is an alternative to the traditional batch map/reduce model that can be used for real-time stream data processing and fast interactive queries that finish within seconds. So Hadoop supports both traditional map/reduce and Spark. We should look at Hadoop as a general-purpose framework that supports multiple models, and we should look at Spark as an alternative to Hadoop MapReduce rather than a replacement for Hadoop.


Because Spark uses RAM instead of network and disk I/O, it is relatively fast compared to Hadoop. But since it uses a lot of RAM, it needs dedicated high-end physical machines to produce effective results. It all depends, and the variables on which this decision depends keep changing dynamically with time.


MapReduce Hadoop is designed to run batch jobs that address every file in the system. Since that process takes time, MapReduce is well suited for large distributed data processing where fast performance is not an issue, such as running end-of-day transactional reports. MapReduce is also ideal for scanning historical data and performing analytics where a short time-to-insight isn't vital. Spark was purposely designed to support in-memory processing. The net benefit of keeping everything in memory is the ability to perform iterative computations at blazing fast speeds, something MapReduce is not designed to do.


Unlike MapReduce, Spark is designed for advanced, real-time analytics and has the framework and tools to deliver when shorter time-to-insight is critical. Included in Spark’s integrated framework are the Machine Learning Library (MLlib), the graph engine GraphX, the Spark Streaming analytics engine, and the real-time analytics tool, Shark. With this all-in-one platform, Spark is said to deliver greater consistency in product results across various types of analysis.


Over the years, MapReduce Hadoop has enjoyed widespread adoption in the enterprise, and that will continue to be the case. Going forward, as the need for advanced real-time analytics tools escalates, Spark is positioned to meet that challenge.


ref:


http://aptuz.com/blog/is-apache-spark-going-to-replace-hadoop/

http://www.qubole.com/blog/big-data/spark-vs-mapreduce/


http://www.computerworld.com/article/2856063/enterprise-software/hadoop-successor-sparks-a-data-analysis-evolution.html


Apache Hadoop videos - https://www.youtube.com/watch?v=A02SRdyoshM&list=PL9ooVrP1hQOFrYxqxb0NJCdCABPZNo0pD


Apache Spark videos - https://www.youtube.com/watch?v=7k_9sdTOdX4&list=PL9ooVrP1hQOGyFc60sExNX1qBWJyV5IMb


Hadoop Training Videos -

  1. Hortonworks Apache Hadoop popular videos - https://www.youtube.com/watch?v=OoEpfb6yga8&list=PLoEDV8GCixRe-JIs4rEUIkG0aTe3FZgXV
  2. Cloudera Apache Hadoop popular videos - https://www.youtube.com/watch?v=eo1PwSfCXTI&list=PLoEDV8GCixRddiUuJzEESimo1qP5tZ1Kn
  3. Stanford Hadoop Training material - https://www.youtube.com/watch?v=d2xeNpfzsYI&list=PLxRwCyObqFr3OZeYsI7X5Mq6GNjzH10e1
  4. Edureka Hadoop Training material - https://www.youtube.com/watch?v=A02SRdyoshM&list=PL9ooVrP1hQOFrYxqxb0NJCdCABPZNo0pD
  5. Durga Solutions Hadoop Training material - https://www.youtube.com/watch?v=Pq3OyQO-l3E&list=PLpc4L8tPSURCdIXH5FspLDUesmTGRQ39I

Secure coding guidelines ...

The Security Development Lifecycle (SDL) is a software development process that helps developers build more secure software and address security compliance requirements while reducing development cost.

Common Secure Coding Guidelines:
  1. Input Validation
  2. Output Encoding
  3. Authentication and Password Management (includes secure handling of credentials by external services/scripts)
  4. Session Management
  5. Access Control
  6. Cryptography Practices
  7. Error Handling and Logging
  8. Data Protection
  9. Communication Security
  10. System Configuration
  11. Database Security
  12. File Management
  13. Memory Management
  14. General Coding Practices
Secure Coding Books:
  1. The CERT Oracle Secure Coding Standard for Java (SEI Series in Software Engineering) 
  2. Java Coding Guidelines: 75 Recommendations for Reliable and Secure Programs (SEI Series in Software Engineering)
  3. Secure Coding in C and C++ (SEI Series in Software Engineering)
ref:

1. OWASP Security Reference Guide -