Tech Kaizen: DATABASE

Apache Cassandra: An open source distributed database management system

Apache Cassandra is an open source distributed database management system. It is an Apache Software Foundation top-level project designed to handle very large amounts of data spread out across many commodity servers while providing a highly available service with no single point of failure. It is a NoSQL solution that was initially developed by Facebook and powered their Inbox Search feature until late 2010. Jeff Hammerbacher, who led the Facebook Data team at the time, has described Cassandra as a BigTable data model running on an Amazon Dynamo-like infrastructure.

Apache Cassandra is a distributed storage system for managing structured/unstructured data while providing reliability at a massive scale. Cassandra database is the right choice when you need scalability and high availability without compromising performance. Linear scalability and proven fault-tolerance on commodity hardware or cloud infrastructure make it the perfect platform for mission-critical data. Cassandra's support for replicating across multiple datacenters is best-in-class, providing lower latency for your users and the peace of mind of knowing that you can survive regional outages.

Cassandra's Column Family data model offers the convenience of column indexes with the performance of log-structured updates, strong support for materialized views, and powerful built-in caching.

Cassandra is designed to scale to a very large size across many commodity servers, with no single point of failure. The philosophy behind the design of the storage portion of Cassandra is that it be able to satisfy the requirements of applications that demand storage of large amounts of structured data. Cassandra aims to run on top of an infrastructure of hundreds of nodes (possibly spread across different datacenters). At this scale, small and large components fail continuously; the way Cassandra manages the persistent state in the context of these failures enables the reliability and scalability of the software systems relying on this service.

HBase vs Cassandra:

HBase is based on BigTable (Google)
Cassandra is based on Dynamo (Amazon), combined with a BigTable-style data model. It was initially developed at Facebook by engineers who included a co-author of Amazon's Dynamo paper. This is one reason why Cassandra supports multiple data centers. Rackspace is a big contributor to Cassandra due to its multi-data-center support.

Prominent users:

Cisco's WebEx uses Cassandra to store user feed and activity in near real time.
Facebook used Cassandra to power Inbox Search, with over 200 nodes deployed. This was abandoned in late 2010 when they built Facebook Messaging platform on HBase.
IBM has done research in building a scalable email system based on Cassandra
Netflix uses Cassandra as their back-end database for their streaming services
Formspring uses Cassandra to count responses, as well as store Social Graph data
Twitter announced it was planning to use Cassandra because it can be run on large server clusters and is capable of taking in very large amounts of data at a time. Twitter continues to use it, but not for Tweets themselves.
WalmartLabs (previously Kosmix) uses Cassandra with SSD

ref:

Apache Cassandra - http://cassandra.apache.org/, http://en.wikipedia.org/wiki/Apache_Cassandra

Cassandra - https://wiki.intuit.com/display/ARCH/Cassandra

Cassandra NoSQL Database: Getting Started - http://msdn.microsoft.com/en-us/magazine/jj553519.aspx

HBase vs Cassandra - http://bigdatanoob.blogspot.com/2012/11/hbase-vs-cassandra.html

Use Cassandra to Run Hadoop MapReduce - http://architects.dzone.com/articles/use-cassandra-run-hadoop

Cassandra vs HBase - http://ria101.wordpress.com/2010/02/24/hbase-vs-cassandra-why-we-moved/

Running Hadoop MapReduce With Cassandra NoSQL - http://allthingshadoop.com/2010/04/24/running-hadoop-mapreduce-with-cassandra-nosql/

Apache Hadoop - An open source implementation of MapReduce programming model

Apache Hadoop is an open-source software framework written in Java for distributed storage and distributed processing of very large data sets on computer clusters built from commodity hardware.

MapReduce is a programming model and implementation developed by Google for processing massive-scale, distributed data sets. Apache Hadoop is an open source software framework that implements MapReduce and supports running data-intensive distributed applications on large clusters built of commodity hardware.

Apache Hadoop is an open source software framework, licensed under the Apache v2 license, that supports data-intensive distributed applications. It enables applications to work with thousands of computationally independent computers and petabytes of data. Hadoop was derived from Google's MapReduce and Google File System (GFS) papers. Hadoop is a top-level Apache project being built and used by a global community of contributors, written in the Java programming language. Yahoo! has been the largest contributor to the project and uses Hadoop extensively across its businesses.

Apache Hadoop is a framework for running applications on large clusters built of commodity hardware. The Hadoop framework transparently provides applications with both reliability and data motion. Hadoop implements a computational paradigm named Map/Reduce, where the application is divided into many small fragments of work, each of which may be executed or re-executed on any node in the cluster. In addition, it provides a distributed file system (HDFS) that stores data on the compute nodes, providing very high aggregate bandwidth across the cluster. Both MapReduce and the Hadoop Distributed File System are designed so that node failures are automatically handled by the framework.

As a conceptual framework for processing huge data sets, MapReduce is highly optimized for distributed problem-solving using a large number of computers. The framework consists of two functions, as its name implies. The map function is designed to take a large data input and divide it into smaller pieces, which it then hands off to other processes that can do something with it. The reduce function digests the individual answers collected by map and renders them to a final output.

In Hadoop, you define map and reduce implementations by extending Hadoop's own base classes. The implementations are tied together by a configuration that specifies them, along with input and output formats. Hadoop is well-suited for processing huge files containing structured data. One particularly handy aspect of Hadoop is that it handles the raw parsing of an input file, so that you can deal with one line at a time. Defining a map function is thus really just a matter of determining what you want to grab from an incoming line of text.
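The map and reduce roles described above can be sketched in plain Python. This is a single-process illustration of the programming model only, not Hadoop's Java API; the function names (map_line, shuffle, reduce_counts, word_count) and the sample lines are made up for this example.

```python
from collections import defaultdict
from typing import Dict, Iterable, Iterator, List, Tuple


def map_line(line: str) -> Iterator[Tuple[str, int]]:
    """Map step: emit a (word, 1) pair for every word on one input line."""
    for word in line.lower().split():
        yield word, 1


def shuffle(pairs: Iterable[Tuple[str, int]]) -> Dict[str, List[int]]:
    """Group emitted values by key, the job a framework does between map and reduce."""
    grouped: Dict[str, List[int]] = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_counts(word: str, counts: List[int]) -> Tuple[str, int]:
    """Reduce step: collapse all values collected for one key into a total."""
    return word, sum(counts)


def word_count(lines: Iterable[str]) -> Dict[str, int]:
    """Run map, shuffle and reduce in sequence inside one process."""
    pairs = (pair for line in lines for pair in map_line(line))
    grouped = shuffle(pairs)
    return dict(reduce_counts(word, counts) for word, counts in grouped.items())


if __name__ == "__main__":
    sample = ["the quick brown fox", "the lazy dog", "the fox"]
    print(word_count(sample))
```

In a real cluster the map calls would run on many nodes in parallel and the grouping step would move data across the network; here everything stays in memory so the flow of (key, value) pairs is easy to follow.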

HDFS (Hadoop Distributed File System): a distributed file system that stores data on commodity machines, providing very high aggregate bandwidth across the cluster.

HDFS is good for -

Storing large files

Terabytes, Petabytes, etc...
Millions rather than billions of files, at 100MB or more per file

Streaming data

Write once and read-many times patterns
Optimized for streaming reads rather than random reads
Append operation added to Hadoop 0.21

“Cheap” Commodity Hardware

No need for supercomputers

HDFS is not so good for -

Low-latency reads

High-throughput rather than low latency for small chunks of data
HBase addresses this issue

Large amount of small files

Better for millions of large files instead of billions of small files
For example each file can be 100MB or more

Multiple Writers

Single writer per file
Writes only at the end of the file; no support for writing at an arbitrary offset

Pig vs Hive:

Pig is a platform for analyzing large data sets that consists of a high-level language for expressing data analysis programs, coupled with infrastructure for evaluating these programs. Pig scripts are translated into a series of MapReduce jobs that are run by the Hadoop cluster. Pig is extensible through user-defined functions that can be written in Java and other languages.

Apache Hive provides a data warehouse function to the Hadoop cluster. Through the use of HiveQL you can view your data as a table and create queries like you would in a database. To make it easy to interact with Hive we use a tool in the Hortonworks Sandbox called Beeswax. Beeswax gives us an interactive interface to Hive. We can type in queries and have Hive evaluate them for us using a series of MapReduce jobs.

Pig is a procedural data-flow language: a script executes step by step in the order the programmer defines, so you can control the optimization of every step. Hive looks like SQL, which makes it a declarative language: you specify what should be done rather than how it should be done. Hand-tuning is harder in Hive because the execution plan is chosen by Hive's own optimizer.
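The procedural versus declarative contrast can be illustrated by analogy in Python. The sketch below is neither Pig Latin nor HiveQL: the first function spells out filter, group and aggregate steps by hand (the data-flow style), while the second hands a SQL statement to SQLite and lets the engine decide how to run it (the declarative style). The click data and function names are hypothetical.

```python
import sqlite3
from collections import defaultdict

# Hypothetical sample data: (username, page) click records.
CLICKS = [("ann", "home"), ("bob", "home"), ("ann", "cart"), ("ann", "home")]


def procedural_counts(clicks):
    """Data-flow style: the programmer writes each step and fixes its order."""
    # Step 1: keep only clicks on the home page.
    home_only = [click for click in clicks if click[1] == "home"]
    # Step 2: group the remaining clicks by user.
    grouped = defaultdict(list)
    for username, page in home_only:
        grouped[username].append(page)
    # Step 3: count the clicks in each group.
    return sorted((username, len(pages)) for username, pages in grouped.items())


def declarative_counts(clicks):
    """SQL style: state the result wanted and let the engine pick the plan."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE clicks (username TEXT, page TEXT)")
    conn.executemany("INSERT INTO clicks VALUES (?, ?)", clicks)
    rows = conn.execute(
        "SELECT username, COUNT(*) FROM clicks "
        "WHERE page = 'home' GROUP BY username ORDER BY username"
    ).fetchall()
    conn.close()
    return rows


if __name__ == "__main__":
    print(procedural_counts(CLICKS))
    print(declarative_counts(CLICKS))
```

Both functions return the same rows; the difference is who decides the order of the filter, group and count steps.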

ref:

Hadoop Training Videos -

Hortonworks Apache Hadoop popular videos - https://www.youtube.com/watch?v=OoEpfb6yga8&list=PLoEDV8GCixRe-JIs4rEUIkG0aTe3FZgXV
Cloudera Apache Hadoop popular videos - https://www.youtube.com/watch?v=eo1PwSfCXTI&list=PLoEDV8GCixRddiUuJzEESimo1qP5tZ1Kn
Stanford Hadoop Training material - https://www.youtube.com/watch?v=d2xeNpfzsYI&list=PLxRwCyObqFr3OZeYsI7X5Mq6GNjzH10e1
Edureka Hadoop Training material - https://www.youtube.com/watch?v=A02SRdyoshM&list=PL9ooVrP1hQOFrYxqxb0NJCdCABPZNo0pD
Durga Solutions Hadoop Training material - https://www.youtube.com/watch?v=Pq3OyQO-l3E&list=PLpc4L8tPSURCdIXH5FspLDUesmTGRQ39I

Miscellaneous:

The Hadoop wiki provides community input related to Hadoop and HDFS.
The Hadoop API site documents the Java classes and interfaces that are used to program to Hadoop and HDFS.
Wikipedia's MapReduce page is a great place to begin your research into the MapReduce framework.
Visit Amazon S3 to learn about Amazon's S3 infrastructure.
The developerWorks Web development zone specializes in articles covering various web-based solutions.

Get products and technologies

The Hadoop project site contains valuable resources pertaining to the Hadoop architecture and the MapReduce framework.
The Hadoop Distributed File System project site offers downloads and documentation about HDFS.
Venture to the CloudStore site for downloads and documentation about the integration between CloudStore, Hadoop, and HDFS.

NoSQL Databases

What is NoSQL?

NoSQL database management systems are useful when working with a huge quantity of data whose nature does not require a relational model for the data structure. The data could be structured, but that is of minimal importance; what really matters is the ability to store and retrieve great quantities of data, not the relationships between the elements. For example, they suit storing millions of key-value pairs in one or a few associative arrays, or storing millions of data records. This is particularly useful for statistical or real-time analyses of growing lists of elements (such as Twitter posts or the Internet server logs from a big group of users).

Advantages of NoSQL databases:

Horizontally Scalable
Schema-less
Cloud Model

NoSQL Categories:

The current NoSQL world fits into 4 basic categories -

Key-value stores are based primarily on Amazon's Dynamo paper, which was published in 2007. The main idea is the existence of a hash table where there is a unique key and a pointer to a particular item of data. These mappings are usually accompanied by cache mechanisms to maximize performance.
Column family stores were created to store and process very large amounts of data distributed over many machines. There are still keys, but they point to multiple columns. In the case of BigTable (Google's column family NoSQL model), rows are identified by a row key, with the data sorted and stored by this key. The columns are arranged by column family.
Document databases were inspired by Lotus Notes and are similar to key-value stores. The model is basically versioned documents that are collections of other key-value collections. The semi-structured documents are stored in formats like JSON.
Graph databases are built with nodes, relationships between nodes, and the properties of nodes. Instead of tables of rows and columns and the rigid structure of SQL, a flexible graph model is used which can scale across many machines.
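As a rough illustration, the four categories above can be mimicked with ordinary Python data structures. This is only a sketch of the shape of the data in each model, not of how any real product stores it; every key, field name and value below is invented for the example.

```python
# Key-value store: a unique key points at one opaque item of data.
key_value = {"session:42": b"serialized-session-bytes"}

# Column family store: row key -> column family -> columns.
column_family = {
    "user:1002": {
        "profile": {"name": "Grace", "city": "New York"},
    },
    "user:1001": {
        "profile": {"name": "Ada", "city": "London"},
        "activity": {"last_login": "2012-11-01"},
    },
}

# Document database: a semi-structured, JSON-like document that nests
# further key-value collections.
document = {
    "_id": "post-7",
    "title": "NoSQL categories",
    "tags": ["nosql", "database"],
    "comments": [{"author": "bob", "text": "Nice summary"}],
}

# Graph database: nodes with properties, plus relationships between nodes.
nodes = {"alice": {"age": 30}, "bob": {"age": 25}, "carol": {"age": 41}}
edges = [("alice", "FOLLOWS", "bob"), ("bob", "FOLLOWS", "carol")]


def rows_in_key_order(table):
    """Return row keys sorted, mirroring BigTable-style storage by row key."""
    return sorted(table)


def neighbours(node, relation):
    """Follow one kind of relationship outward from a node."""
    return [dst for src, rel, dst in edges if src == node and rel == relation]


if __name__ == "__main__":
    print(key_value["session:42"])
    print(rows_in_key_order(column_family))
    print(column_family["user:1001"]["profile"]["city"])
    print(document["comments"][0]["author"])
    print(neighbours("alice", "FOLLOWS"))
```

The access patterns differ more than the storage: a key-value lookup needs the exact key, a column family read names a row and a family, a document read walks into nested fields, and a graph query follows relationships.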

Major NoSQL Players

The major players in NoSQL have emerged primarily because of the organizations that have adopted them. Some of the largest NoSQL technologies include:

CouchDB: CouchDB is a database that uses JSON for documents, JavaScript for MapReduce queries, and regular HTTP for an API.
MongoDB: MongoDB (from "humongous") is a scalable, high-performance, open source NoSQL database written in C++.
SimpleDB: SimpleDB is a highly available and flexible non-relational data store that offloads the work of database administration. Developers simply store and query data items via web services requests and Amazon SimpleDB does the rest.
BigTable: BigTable is Google's proprietary column oriented database. Google allows the use of BigTable but only for the Google App Engine.
Dynamo: Dynamo was created by Amazon.com and is the most prominent key-value NoSQL database. Amazon was in need of a highly scalable distributed platform for their e-commerce businesses, so they developed Dynamo. Amazon uses Dynamo internally for services such as the shopping cart.
Cassandra: Cassandra was open sourced by Facebook and is a column oriented NoSQL database.
Neo4j: Neo4j is an open source graph database.

ref:

NoSQL Databases - http://en.wikipedia.org/wiki/NoSQL

Document Oriented Databases - http://en.wikipedia.org/wiki/Document-oriented_database

NoSQL and Document Oriented Databases - http://ruby.about.com/od/nosqldatabases/a/nosql1.htm

NoSQL Databases - http://newtech.about.com/od/databasemanagement/a/Nosql.htm

10 things you should know about NoSQL databases - http://www.techrepublic.com/blog/10things/10-things-you-should-know-about-nosql-databases/1772

What is a document database - http://dssresources.com/faq/index.php?action=artikel&id=236

Top NoSQL Databases - http://opensourcebyte.blogspot.com/2012/01/top-nosql-databases.html

Sybase Database(SQL Anywhere 11) Binaries Description

Sybase Database(SQL Anywhere 11) Binaries:

dbsrv11.exe => Sybase Network Database Server

dbeng11.exe => Sybase Personal Database Server

dbmlsync.exe => Use the dbmlsync utility to synchronize SQL Anywhere remote databases with a consolidated database.

dblib.dll => Sybase C library(Interface library)

ctlib.dll => Sybase C library(Interface library)

Details: The Client Library (CTlib) syntax has been unified and simplified compared to DBlibrary, and the number of API calls has been greatly reduced: the Sybase::DBlib module implements 93 calls, the Sybase::CTlib module fewer than 20, with similar functionality. In addition, native handling of dates and other special data types can make things a lot easier when writing scripts. Missing features in the current version of the Sybase::CTlib module include access to the bulk copy library (it is available in the Sybase::DBlib version), asynchronous programming and dynamic SQL. The bulk copy library will certainly be added in the near future.

dbextclr11.exe => Allows the Sybase database server to make calls into the CLR

Details: dbextclr11.exe has two versions, a 32-bit version and a 64-bit version; one is provided in \bin32 and one in \bin64. These allow 32-bit/64-bit database servers to make calls into the CLR, but both dbextclr11 processes will launch the "same" system CLR underneath the covers. This is where the "Any CPU" setting of the .NET assembly is used (as the CLR then chooses to run in 64-bit mode, even if it was launched from a 32-bit version of dbextclr11).

sqlpp.exe => The Sybase IQ SQL preprocessor utility translates the SQL statements in an input file (.sqc) into C language source that is put into an output file (.c).

Details: Embedded SQL is a database programming interface for the C and C++ programming languages. Embedded SQL consists of SQL statements intermixed with (embedded in) C or C++ source code. These SQL statements are translated by a SQL preprocessor into C or C++ source code, which you then compile.

dbisqlc.exe => Executes SQL commands against a database

Miscellaneous Sybase database Info:

List of Sybase global variables =>

@@error => Commonly used to check the error status (succeeded or failed) of the most recently executed statement. Contains 0 if the previous transaction succeeded; otherwise, contains the last error number generated by the system. A statement such as if @@error != 0 return causes an exit if an error occurs. Every SQL statement resets @@error, so the status check must immediately follow the statement whose success is in question.

@@fetch_status => Contains status information resulting from the last fetch statement. @@fetch_status may contain the following values

0 The fetch statement completed successfully.

-1 The fetch statement resulted in an error.

-2 There is no more data in the result set.

This feature is the same as @@sqlstatus, except that it returns different values. It is for Microsoft SQL Server compatibility.

@@identity => The last value inserted into an Identity/Autoincrement column by an insert, load or update statement. @@identity is reset each time a row is inserted into a table. If a statement inserts multiple rows, @@identity reflects the Identity/Autoincrement value for the last row inserted. If the affected table does not contain an Identity/Autoincrement column, @@identity is set to 0. The value of @@identity is not affected by the failure of an insert, load, or update statement, or the rollback of the transaction that contained the failed statement. @@identity retains the last value inserted into an Identity/Autoincrement column, even if the statement that inserted that value fails to commit.

@@isolation => Current isolation level. @@isolation takes the value of the active level.

@@procid => Stored procedure ID of the currently executing procedure.

@@servername => Name of the current database server.

@@sqlstatus => Contains status information resulting from the last FETCH statement.

@@version => Version number of the current version of Sybase IQ.

ref:

Sybase documentation - http://infocenter.sybase.com/help/index.jsp, http://infocenter.sybase.com/help/index.jsp?topic=/com.sybase.dc37774_1150/html/apptech/BGBCBACD.htm

SQL Anywhere 11 - http://dcx.sybase.com/index.html#1100/en/saintro_en11/saintro_en11.html

Sybase Transactions - http://manuals.sybase.com/onlinebooks/group-as/asg1250e/sqlug/@ebt-link;pt=19681?target=%25N%15_52735_START_RESTART_N%25

Sybase SQL Variables - http://infocenter.sybase.com/help/index.jsp?topic=/com.sybase.infocenter.dc38151.1510/html/iqrefbb/CACGCGBI.htm

Pessimistic vs Optimistic Locking

Pessimistic concurrency control (or pessimistic locking) is called "pessimistic" because the system assumes the worst — it assumes that two or more users will want to update the same record at the same time, and then prevents that possibility by locking the record, no matter how unlikely conflicts actually are.

The locks are placed as soon as any piece of the row is accessed, making it impossible for two or more users to update the row at the same time. Depending on the lock mode (shared, exclusive, or update), other users might be able to read the data even though a lock has been placed. For more details on the lock modes, see Lock modes: shared, exclusive, and update.

Optimistic concurrency control (or optimistic locking) assumes that although conflicts are possible, they will be very rare. Instead of locking every record every time that it is used, the system merely looks for indications that two users actually did try to update the same record at the same time. If that evidence is found, then one user's updates are discarded and the user is informed.

For example, if User1 updates a record and User2 only wants to read it, then User2 simply reads whatever data is on the disk and then proceeds, without checking whether the data is locked. User2 might see slightly out-of-date information if User1 has read the data and updated it, but has not yet committed the transaction.
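One common way to look for that evidence is a version column that every update must match and then increment. The sketch below uses Python's built-in sqlite3 module; the version-column technique, the account table, and the helper names are assumptions made for illustration, since the text above does not say how a particular database detects the conflict.

```python
import sqlite3


def setup() -> sqlite3.Connection:
    """Create an in-memory table with a version column (hypothetical schema)."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE account (id INTEGER PRIMARY KEY, balance INTEGER, version INTEGER)"
    )
    conn.execute("INSERT INTO account VALUES (1, 100, 0)")
    conn.commit()
    return conn


def read_account(conn, account_id):
    """Read the row without taking a lock; remember the version that was seen."""
    return conn.execute(
        "SELECT balance, version FROM account WHERE id = ?", (account_id,)
    ).fetchone()


def update_balance(conn, account_id, new_balance, expected_version) -> bool:
    """Write only if nobody changed the row since it was read.

    Returns False when the version no longer matches, which is the
    evidence of a conflicting update; the caller's change is discarded.
    """
    cur = conn.execute(
        "UPDATE account SET balance = ?, version = version + 1 "
        "WHERE id = ? AND version = ?",
        (new_balance, account_id, expected_version),
    )
    conn.commit()
    return cur.rowcount == 1


if __name__ == "__main__":
    conn = setup()
    balance1, version1 = read_account(conn, 1)  # User1 reads the row
    balance2, version2 = read_account(conn, 1)  # User2 reads the same row
    print(update_balance(conn, 1, balance1 + 50, version1))  # True: User1 wins
    print(update_balance(conn, 1, balance2 - 30, version2))  # False: stale version
    print(read_account(conn, 1))  # (150, 1)
```

The losing writer typically re-reads the row and retries, or reports the conflict to the user; no lock is ever held between the read and the write.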

Choosing concurrency control mechanism:

In most scenarios, optimistic concurrency control is more efficient and offers higher performance. When choosing between pessimistic and optimistic locking, consider the following:

Pessimistic locking is useful if there are a lot of updates and relatively high chances of users trying to update data at the same time.
For example, if each operation can update a large number of records at a time (the bank might add interest earnings to every account at the end of each month), and two applications are running such operations at the same time, they will have conflicts.

Pessimistic concurrency control is also more appropriate in applications that contain small tables that are frequently updated. In the case of these so-called hotspots, conflicts are so probable that optimistic concurrency control wastes effort in rolling back conflicting transactions.
Optimistic locking is useful if the possibility for conflicts is very low – there are many records but relatively few users, or very few updates and mostly read-type operations.

ref:

Database Concurrency Control - http://publib.boulder.ibm.com/infocenter/soliddb/v6r3/index.jsp?topic=/com.ibm.swg.im.soliddb.sql.doc/doc/the.purpose.of.concurrency.control.html

Database Transactions - http://publib.boulder.ibm.com/infocenter/soliddb/v6r3/index.jsp?topic=/com.ibm.swg.im.soliddb.sql.doc/doc/the.purpose.of.concurrency.control.html

MySQL - Popular Opensource Database

MySQL is a relational database management system (RDBMS) named after Monty Widenius's daughter My. The program runs as a server providing multi-user access to a number of databases. MySQL is officially pronounced My S-Q-L, but often pronounced My SeQueL. MySQL is often used in free software projects that require a full-featured database management system built on the LAMP software stack.

The MySQL server and official libraries are mostly implemented in ANSI C/ANSI C++. Libraries for accessing MySQL databases are available for all major programming languages through language-specific APIs. In addition, an ODBC interface called MyODBC allows additional programming languages that support the ODBC interface to communicate with a MySQL database.

To administer MySQL databases one can use the included command-line tools (commands: mysql and mysqladmin). Users may also download GUI administration tools from the MySQL site: MySQL Administrator, MySQL Migration Toolkit and MySQL Query Browser. The GUI tools are now included in one package called MySQL GUI Tools.