BigData Analysis

Big Data Analysis Tools & Methods Spring 2015 OCCC - Tehran


Personal Profile:

● Ehsan Derakhshan

● Founder & CEO at innfinision Cloud & BigData Solutions

● More than 15 years of experience (Telecom & Datacom)

[email protected]

● Innfinision.net

About innfinision:

● Providing Cloud, Virtualization and Data Center Solutions

● BigData Management - Analysis & Development Solutions

● Developing Software for Cloud Environments

● Providing Services to Telecom, Education, Banking & more...

● Supporting OpenStack Foundation as the First Iranian Company

● Partner of : Docker - MongoDB - RedHat

BigData Analysis Tools & Methods innfinision.net

Agenda:

● What is Data & BigData?

● Important Questions

● Tools & Solutions

● Advantages - Why & Where

What is Data & BigData?


What is Data?

Data is a collection of facts, such as numbers, words, measurements, observations or even just descriptions of things.

Data can exist in a variety of forms -- as numbers or text on pieces of paper, as bits and bytes stored in electronic memory, or as facts stored in a person's mind. Strictly speaking, data is the plural of datum, a single piece of information.

Big data can be described by the following characteristics:

1- Volume
2- Velocity
3- Variety
4- Variability
5- Veracity
6- Complexity
7- And others

These characteristics describe information assets that demand cost-effective, innovative forms of information processing for enhanced insight and decision making.


Important Questions


Important Question:

Can a database really deliver quantifiable business advantage?

To some, the database is a low-level infrastructure component of a much larger application -- something that only developers, DBAs and operations staff need to care or worry about.

However, in the digital economy, data is the raw currency. How an organization stores, manages, analyzes and uses data has a direct impact on its success -- and its costs. Its choice of database affects how quickly it can deliver new applications to market, support business growth and improve customer experience.


Consider these examples:

- After trying for eight years to build a single view of its customers, one of the world's leading insurance companies changed its database and delivered the project in just three months

- A leading telecommunications provider adopted a new database technology and was able to accelerate time to market by 4x, reduce engineering costs by 50% and improve customer experience by 10x

- A Tier 1 investment bank rebuilt its globally-distributed reference data platform on a new database technology, enabling it to save an estimated $40M over five years

- Singles can now find their ideal partner 95% faster after one of the world's leading relationship providers switched data and machine learning to a new platform


So why is database selection becoming so critical?

Because the requirements of modern applications and the demands of sophisticated, data-savvy users are changing.

Data is being generated at much faster rates than ever before and can yield insights never previously possible. The data no longer fits neatly into structured rows and columns. Windows of market opportunity are getting smaller. Underlying infrastructure is being commoditized, with powerful systems available for just pennies per hour.

The database chosen by a project team can be the enabler -- or the blocker -- to success. All of the assumptions that have dictated database selection over the past 30 years are being revisited as a result of the factors discussed above.

Challenges for Database Selection:

- Risk tolerance for bugs and unmapped behaviors
- High availability (HA)
- Redundancy
- Access- and location-based requirements
- Security requirements
- Skill sets and tooling
- Architecture and infrastructure
- Growth expectations and the timeline therein (Scalable)
- Support? Community?
- Free Schema (Flexible Data Model)
- Scale Out
- Real-time
- Rich Queries
- Migration
- Drivers
- Faster
- Agile
- Backup/Restore
- Monitoring & …


Tools & Solutions


Innfinision BigData Solutions:

1- MongoDB :

MongoDB (from 'humongous') is a scalable, high-performance, open-source, schema-free, document-oriented database. It provides high performance, high availability, and easy scalability. Documents (objects) map nicely to programming language data types. Embedded documents and arrays reduce the need for joins. Dynamic schema makes polymorphism easier.

2- PyTables :

PyTables is a package for managing hierarchical datasets, designed to cope efficiently with extremely large amounts of data. It is built on top of the HDF5 library and the NumPy package. It features an object-oriented interface that, combined with C extensions for the performance-critical parts of the code (generated using Cython), makes it a fast yet extremely easy-to-use tool for interactively saving and retrieving very large amounts of data. One important feature of PyTables is that it optimizes memory and disk resources so that data takes much less space (by a factor of 3 to 5, and more if the data is compressible) than with other solutions such as relational or object-oriented databases.


3- Blosc :

Blosc is a high performance compressor optimized for binary data. It has been designed to transmit data to the processor cache faster than the traditional, non-compressed, direct memory fetch approach via a memcpy OS call. Blosc is the first compressor (that I'm aware of) that is meant not only to reduce the size of large datasets on-disk or in-memory, but also to accelerate memory-bound computations.

4- Blaze :

Blaze is a high-level user interface for databases and array computing systems. It consists of the following components:

- A symbolic expression system to describe and reason about analytic queries
- A set of interpreters from that query system to various databases / computational engines

This architecture allows a single piece of Blaze code to run against several computational backends. Blaze interacts rapidly with the user and only communicates with the database when necessary. Blaze is also able to analyze and optimize queries to improve the interactive experience.
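The two components described for Blaze (a symbolic expression, plus interpreters that run it on different backends) can be pictured in miniature as follows. This is not Blaze's API: the `Filter` class, `run_on_list`, `to_sql`, `run_on_sqlite`, and the `payments` sample data are all hypothetical names invented for this sketch, which uses only the Python standard library.

```python
import operator
import sqlite3
from dataclasses import dataclass

# Supported comparison operators (hypothetical, kept tiny on purpose).
_OPS = {">": operator.gt, "<": operator.lt, "=": operator.eq}


@dataclass(frozen=True)
class Filter:
    """Symbolic description: rows of `table` where `column` `op` `value`."""
    table: str
    column: str
    op: str
    value: float


def run_on_list(expr, rows):
    """Interpreter 1: evaluate the expression on in-memory dict rows."""
    compare = _OPS[expr.op]
    return [row for row in rows if compare(row[expr.column], expr.value)]


def to_sql(expr):
    """Interpreter 2, step 1: translate the expression into SQL text."""
    if expr.op not in _OPS:
        raise ValueError(f"unsupported operator: {expr.op!r}")
    # Table and column names are trusted here; only the value is a parameter.
    sql = f'SELECT * FROM "{expr.table}" WHERE "{expr.column}" {expr.op} ?'
    return sql, (expr.value,)


def run_on_sqlite(expr, conn):
    """Interpreter 2, step 2: ship the SQL to the database and collect rows."""
    sql, params = to_sql(expr)
    cursor = conn.execute(sql, params)
    names = [col[0] for col in cursor.description]
    return [dict(zip(names, row)) for row in cursor.fetchall()]


# Made-up sample data, loaded into both "backends".
rows = [
    {"name": "a", "amount": 10.0},
    {"name": "b", "amount": 250.0},
    {"name": "c", "amount": 75.5},
]
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE payments (name TEXT, amount REAL)")
conn.executemany(
    "INSERT INTO payments VALUES (?, ?)",
    [(r["name"], r["amount"]) for r in rows],
)

# One expression, two backends.
expr = Filter("payments", "amount", ">", 50.0)
from_list = run_on_list(expr, rows)
from_sql = run_on_sqlite(expr, conn)
print(sorted(r["name"] for r in from_list))
print(sorted(r["name"] for r in from_sql))
```

The query is written once as data (`expr`) and each interpreter decides how to execute it, which is the separation the component list describes. A real system would cover far more operations and would also optimize the expression before running it.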

Advantages - Why - Where


MongoDB Advantages :

Any relational database has a typical schema design that shows the number of tables and the relationships between them. MongoDB, by contrast, has no built-in concept of relationships.

Advantages of MongoDB over RDBMS

- Schema-less: MongoDB is a document database in which one collection can hold different documents. The number of fields, the content, and the size of a document can differ from one document to another.
- Structure of a single object is clear.
- No complex joins.
- Deep query-ability: MongoDB supports dynamic queries on documents using a document-based query language that's nearly as powerful as SQL.
- Tuning.
- Ease of scale-out: MongoDB is easy to scale.
- Conversion / mapping of application objects to database objects is not needed.
- Uses internal memory for storing the (windowed) working set, enabling faster access to data.
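The schema-less and no-join points can be pictured with plain Python dictionaries standing in for documents. This is only a sketch: it does not use MongoDB or its driver, and the `find` helper, the sample collection, and the field names are hypothetical. The filter shape (`{"field": {"$gt": value}}`, with dotted paths for embedded fields) imitates MongoDB's query-document style.

```python
# Illustrative only: Python dicts stand in for MongoDB documents.
# No MongoDB server or driver is involved; `find` is a hypothetical helper.

_MISSING = object()  # sentinel for "field not present in this document"

# One "collection" whose documents differ in fields and size.
customers = [
    {"name": "Sara", "age": 31, "tags": ["vip"]},
    {
        "name": "Ali",
        "age": 45,
        "address": {"city": "Tehran"},             # embedded document
        "orders": [{"item": "router", "qty": 2}],  # embedded array, no join
    },
    {"name": "Mina"},  # a document with fewer fields is still valid
]


def get_path(doc, dotted):
    """Follow a dotted path such as 'address.city' into nested dicts."""
    value = doc
    for part in dotted.split("."):
        if not isinstance(value, dict) or part not in value:
            return _MISSING
        value = value[part]
    return value


def matches(doc, query):
    """Return True if the document satisfies every condition in the query."""
    for field, cond in query.items():
        value = get_path(doc, field)
        if value is _MISSING:
            return False
        if isinstance(cond, dict) and "$gt" in cond:
            if not value > cond["$gt"]:
                return False
        elif value != cond:
            return False
    return True


def find(collection, query):
    """Return all documents in the collection that match the query."""
    return [doc for doc in collection if matches(doc, query)]


older = find(customers, {"age": {"$gt": 40}})
in_tehran = find(customers, {"address.city": "Tehran"})
print([doc["name"] for doc in older])
print([doc["name"] for doc in in_tehran])
```

Because the address and orders live inside the customer document, answering "which customers are in Tehran" needs no second table and no join. Documents that lack the queried field are simply skipped.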


Why use MongoDB?

- Document-oriented storage: data is stored in the form of JSON-style documents
- Index on any attribute
- Replication & high availability
- Auto-sharding
- Rich queries
- Fast in-place updates
- Professional support

Where to use MongoDB?

- Big Data
- Content Management and Delivery
- Mobile and Social Infrastructure
- User Data Management
- Data Hub


Why use PyTables?

PyTables can be used in any scenario where you need to save and retrieve large amounts of data and provide metadata (that is, data about the actual data) for it. Whether you want to work with large datasets of (potentially multidimensional) data, save and structure your NumPy datasets, or just provide a categorized structure for some portions of your cluttered RDBMS, give PyTables a try. It works well for storing data from data acquisition systems, sensors in geosciences, simulation software, and network data monitoring systems, or as a centralized repository for system logs, to name only a few possible uses.

However, it's important to emphasize that PyTables is not designed to compete with relational databases, but rather to work alongside them as a teammate. For example, if you have very large tables in your existing relational database, you can move those tables to PyTables to reduce the burden on that database while efficiently keeping those huge tables on-disk.


Why use Blosc?

- Multi-threaded compressor that can transmit data from caches to memory, and back

- Speed can exceed that of an OS memcpy()
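As a rough picture of compressing binary numeric data block by block, here is a minimal sketch. It does not use Blosc: `zlib` from the Python standard library is a stand-in, so it reproduces neither the multi-threading nor the memcpy()-level speed claimed above. The block size, the helper names `compress_blocks` and `decompress_blocks`, and the sample data are assumptions made for illustration.

```python
import zlib
from array import array

# Assumed block size, chosen only for illustration (not a Blosc default).
BLOCK_BYTES = 32 * 1024


def compress_blocks(raw, block_bytes=BLOCK_BYTES):
    """Compress a bytes object as a list of independently compressed blocks."""
    return [
        zlib.compress(raw[start:start + block_bytes], 1)  # level 1: favor speed
        for start in range(0, len(raw), block_bytes)
    ]


def decompress_blocks(blocks):
    """Decompress every block and join the pieces back together."""
    return b"".join(zlib.decompress(block) for block in blocks)


# Made-up binary data: 100,000 consecutive 64-bit signed integers.
values = array("q", range(100_000))
raw = values.tobytes()

blocks = compress_blocks(raw)
compressed_size = sum(len(block) for block in blocks)

# Round trip back to an array to confirm nothing was lost.
restored = array("q")
restored.frombytes(decompress_blocks(blocks))

print("raw bytes:       ", len(raw))
print("compressed bytes:", compressed_size)
print("round trip ok:   ", restored == values)
```

Working in small blocks means any one block can be decompressed without touching the rest, which is the property a cache-oriented compressor relies on. Measuring real speed against memcpy() would require Blosc itself and a benchmark on the target hardware.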

Why use Blaze?

Because Blaze is a query system that looks like NumPy/Pandas. You write Blaze queries, Blaze translates those queries to something else (like SQL), and ships those queries to various databases to run on other people's fast code. It smooths out this process to make interacting with foreign data as accessible as using Pandas. This is actually quite difficult.

Ehsan Derakhshan

[email protected]

innfinision.net

Thank you