Team 2 Big Data Presentation

Big Data and Hadoop Team 2: Stephen Allegretto, Jeffery Daly, Christopher Rizza, Matthew Urdan

Transcript of Team 2 Big Data Presentation

The business world is filled with acronyms, buzzwords, new theories, approaches and technologies. Among the most important of those being written about and increasingly put to use, however, is Big Data.

WHAT IS BIG DATA?

Big Data is a popular term used to describe the exponential growth and availability of data, both structured and unstructured. Big Data can be characterized by the three Vs: Variety, Volume and Velocity. Big Data may become as important to business and society as the Internet has. Why? More data may lead to more accurate analyses (SAS Institute, 2015, para 1). However, analyzing Big Data requires tremendous computing power. Hadoop makes Big Data accessible.

WHAT IS HADOOP?

Apache Hadoop is an open source software project that enables distributed processing of large data sets across clusters of servers. It is designed to scale up from a single server to thousands of machines, with a very high degree of fault tolerance (IBM, 2015, para 1). Through the process of MapReduce, Hadoop can analyze extremely large data sets quickly because MapReduce brings the software to the data, rather than relying on the time-consuming process of serving vast amounts of data to the software. This presentation will examine Big Data and Hadoop in detail, especially as they apply to business applications.
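To make the MapReduce idea concrete, here is a minimal single-machine sketch of the map, shuffle and reduce steps, using a word count as the job. It is an illustration only and does not use Hadoop's own API; the function names and the sample text chunks are assumptions made for this example, and each chunk simply stands in for a block of data that a real cluster would hold on a separate machine.

```python
from collections import defaultdict


def map_phase(chunk):
    """Map: emit a (word, 1) pair for every word in one chunk of text."""
    return [(word.lower(), 1) for word in chunk.split()]


def shuffle(pairs):
    """Shuffle: group all emitted values under their key."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(grouped):
    """Reduce: sum the values collected for each key."""
    return {key: sum(values) for key, values in grouped.items()}


def word_count(chunks):
    # Each chunk is mapped independently, which is what lets a real
    # cluster run the map step on the machine that already holds the data.
    pairs = []
    for chunk in chunks:
        pairs.extend(map_phase(chunk))
    return reduce_phase(shuffle(pairs))


# Hypothetical input: two chunks standing in for two data blocks.
counts = word_count(["big data big", "data hadoop"])
print(counts)
```

Because the map step never needs to see more than one chunk at a time, only the small (word, count) pairs have to travel across the network, not the raw data.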

WHAT CAN BIG DATA DO FOR BUSINESS?

Quite simply, Big Data has the power to transform any business organization in three significant ways. Big Data can transform a company's business processes. Big Data can transform a company's understanding of its market and its customers. Finally, Big Data can give any company better forecasting tools through predictive analytics.

TRANSFORMATION OF BUSINESS PROCESSES

Big Data provides opportunities to improve efficiencies on scales not previously envisioned (Vera-Baquero, Palacios, Stantchev, & Molloy, 2015). For example, in a project named Orion, UPS leverages Big Data to create the most efficient route possible for its delivery trucks (Noyes, 2014). UPS determined that the enormous effort needed to optimize its routes would pay huge dividends. UPS estimates that a reduction of one mile per day per driver could save the company as much as $50M annually.

TRANSFORMATION OF BUSINESS UNDERSTANDING

Big Data can transform a company's understanding of its position in its market and create complete profiles of its customers. Through descriptive analytics applied to Big Data sources, organizations are able to develop a more robust and complete understanding of their market niche. In the past, limited data generation, limited storage capacity, and limitations on processing power forced organizations to sample populations in small doses (Duan & Xiong, 2015). Today those historic limits have largely fallen away. For example, Facebook examines what used to be disparate and disconnected data points to form a more complete view of the individual (Newman, 2015). Facebook data points include a person's friends, family, photos, companies liked, posts, comments, shared content and much more. Facebook uses this information to target advertising to the individual user. Not surprisingly, companies are willing to pay large amounts of money to learn from this complete customer profile.

PREDICTIVE ANALYTICS

Another strength of Big Data is predictive analytics. The ability to store and analyze great amounts of data allows companies to learn quickly from past experiences and apply those lessons to future situations (Duan & Xiong, 2015). Amara Health Analytics, for example, focuses on the early diagnosis of sepsis, a condition that is notoriously difficult to identify. Amara searches the data points from previous diagnoses for commonalities, uses them to predict in real time which patients currently being monitored in hospitals might be at risk, and alerts clinicians accordingly (Clancy, 2015).

CHALLENGES TO UTILIZING BIG DATA

But before Big Data can transform a business, the data has to be collected and put into a usable form. There are three major challenges to the utilization of Big Data: challenges with the data itself, process challenges and management challenges.

DATA CHALLENGES

Data challenges range from redundancy and data discovery to data quality, data availability and scalability. Any of these can make implementing Big Data difficult. Many databases have a redundancy problem, and one key challenge is reducing redundancy throughout the database. Redundancy reduction and data compression can lower the cost of the entire database in the long run. The difficulty lies in noticing redundancies and avoiding them; left alone, they clutter up a database and waste money and time. Finding available data that is useful can also be a challenge for companies, and once data is found it can be even harder to determine whether it is beneficial and of good quality. Scalability is another challenge: the analytical system must support present and future datasets, and the analytical algorithm must be able to process ever larger and more complex datasets (Chen, 2014). Infrastructure limitations affect all companies in some way. As hardware ages it is likely to become unreliable, and companies can't afford to lose data they have gathered over past years (Adrian, 2013), so overcoming this challenge is critical.
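One way to notice redundant records, sketched here under simplifying assumptions, is to compute a fingerprint of each record after normalizing its fields and keep only the first record seen for each fingerprint. The field names, the sample rows and the normalization rule (trim whitespace, ignore case) are hypothetical choices for this illustration; a real system would need rules suited to its own data.

```python
import hashlib


def record_fingerprint(record):
    """Hash a record's normalized fields so equivalent rows share a fingerprint."""
    # Assumption: two rows are "the same" if they match after trimming
    # whitespace and ignoring case in every field.
    normalized = "|".join(str(record[key]).strip().lower() for key in sorted(record))
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()


def drop_redundant(records):
    """Keep the first occurrence of each distinct record, preserving order."""
    seen = set()
    unique = []
    for record in records:
        fingerprint = record_fingerprint(record)
        if fingerprint not in seen:
            seen.add(fingerprint)
            unique.append(record)
    return unique


# Hypothetical sample data: the first two rows differ only in case and spacing.
rows = [
    {"name": "Ada Lopez", "city": "Boston"},
    {"name": "ada lopez ", "city": "boston"},
    {"name": "Sam Reed", "city": "Hartford"},
]
unique_rows = drop_redundant(rows)
print(len(rows), "rows in,", len(unique_rows), "rows out")
```

Storing only fingerprints in the `seen` set keeps memory use modest even when the records themselves are large.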

PROCESS CHALLENGES

Process challenges entail capturing data and making it useful. Data is useless if it is not interpreted and analyzed, and it can take significant trial and error to find the right model for analysis (Akerkar, n.d.). IT specialists say that they spend more time cleaning up data than analyzing it, and sorting and cleaning data is a challenge that is rarely overcome (Adrian, 2013). There are ways to speed up the process, but if they are not implemented properly, sorting through big data can cause delays for a company. Once results are produced, it can be difficult to share them in the right manner and with the right people, yet getting the output to the right people is imperative. Big data provides valuable information that can be beneficial if shared. Unfortunately, a number of companies do not want to share information for reasons other than security, and this is a challenge that most companies refuse to overcome (Adrian, 2013).
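The clean-up work described above can be pictured with a small sketch that repairs what it can and sets aside what it cannot. The record layout (a customer name and a purchase amount), the sample values and the validation rules are all assumptions made for illustration, not a description of any particular company's process.

```python
def clean_record(raw):
    """Return a cleaned copy of one raw record, or None if it cannot be repaired."""
    name = (raw.get("customer") or "").strip().title()
    try:
        # Strip common formatting such as "$" and thousands separators.
        amount = float(str(raw.get("amount", "")).replace("$", "").replace(",", ""))
    except ValueError:
        return None
    if not name or amount < 0:
        return None
    return {"customer": name, "amount": amount}


def clean_all(raw_records):
    """Split raw records into cleaned rows and rejected rows for later review."""
    cleaned, rejected = [], []
    for raw in raw_records:
        result = clean_record(raw)
        if result is None:
            rejected.append(raw)
        else:
            cleaned.append(result)
    return cleaned, rejected


# Hypothetical raw input with typical problems: stray spaces, currency
# formatting, a missing name and a non-numeric amount.
raw_records = [
    {"customer": "  jane doe ", "amount": "$1,250.50"},
    {"customer": "", "amount": "40"},
    {"customer": "Raj Patel", "amount": "n/a"},
]
cleaned, rejected = clean_all(raw_records)
print(cleaned)
print(len(rejected), "records set aside")
```

Keeping the rejected rows, rather than silently dropping them, lets analysts see how much data was lost and why.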

SECURITY AND PRIVACY CHALLENGES

The security challenge for Big Data lies in providing an effective security model across the life cycle of the process without impeding Volume, Variety and Velocity or compromising the rest of the information estate (Morton, 2014).

A significant challenge for management is securing a company's data. Data administrators face multiple security challenges when protecting large amounts of data, and management has to focus on data security and privacy to ensure databases are used and protected properly. Like all other forms of information technology, big data is subject to misuse and criminal activity. Big data can be misused through abuse of privilege by those with access to the data and analysis tools; curiosity may lead to unauthorized access, and information may be deliberately leaked. Mistakes can also cause problems where corner-cutting leads to disclosure or incorrect analysis (Morton, 2014). Managers need control over their databases to ensure they are protected against intruders and unauthorized users.

3 RISKS TO BIG DATA ASSETS

Three major risks to a company's big data assets are the information life cycle, data provenance, and technology unknowns. The information life cycle is always different when big data is involved. In certain cases the owner of the data might not be known. In other cases it is unknown what type of useful information might be discovered, even after analysis. Data provenance is another security concern that managers need to pay attention to. Big data might not come from a reliable source, and it might be compiled from a number of different areas. Big data involves absorbing and analyzing large amounts of data that may have originated outside the organization that is using it. If you don't control the data creation and collection process, then how can you be sure of the data source and the integrity of the data? (Morton, 2014). The final risk involves technology unknowns. The technology designed and used to process big data is focused on "massive scalability." Security has not been the main focus, which can lead to problems in the long run. Focusing on security is vital to protecting sensitive data.

PRIVACY CONCERNS

Along with the issues of data security come the challenges of data privacy. Databases contain large amounts of personal information. Management has to be able to use that information for the company's benefit while making sure it stays private and out of the hands of criminals. The challenges are: ensuring that data are used correctly (abiding by its intended uses and relevant laws), tracking how the data are used, transformed, derived, etc., and managing its lifecycle (Akerkar, n.d.). Home Depot and Target are examples of large, well-known companies that have suffered massive breaches in which sensitive data was stolen. Many data warehouses contain sensitive data such as personal data. There are legal and ethical concerns with accessing such data. So the data must be secured and access controlled (Akerkar, n.d.). The way data is accessed and secured needs to be constantly monitored.
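The two requirements named above, controlling access and tracking how data is used, can be sketched together as a permission check that records every attempt in an audit log. The roles, permissions, user name and dataset name below are hypothetical, and a production system would rely on its platform's own security features, not on a few lines like these.

```python
from datetime import datetime, timezone

# Hypothetical mapping of roles to the actions they may perform.
PERMISSIONS = {
    "analyst": {"read"},
    "admin": {"read", "write"},
}

# Every access attempt is recorded here, whether or not it was allowed.
audit_log = []


def access(user, role, action, dataset):
    """Check whether a role may perform an action, and log the attempt."""
    allowed = action in PERMISSIONS.get(role, set())
    audit_log.append({
        "time": datetime.now(timezone.utc).isoformat(),
        "user": user,
        "role": role,
        "action": action,
        "dataset": dataset,
        "allowed": allowed,
    })
    return allowed


# Hypothetical usage: an analyst may read customer data but not change it.
ok = access("pat", "analyst", "read", "customers")
denied = access("pat", "analyst", "write", "customers")
print(ok, denied, len(audit_log))
```

Logging denied attempts as well as successful ones is what makes the constant monitoring mentioned above possible.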

SO WHERE DOES HADOOP COME IN?

Again, Apache Hadoop is an open source software framework for storage and large scale processing of data sets on clusters of commodity hardware (Bappalige, 2014). Being open source means that its code is free and accessible to all programmers and users to edit, comment on and improve. The Hadoop framework is composed of several modules: 1) Hadoop Common, which contains libraries and utilities needed by the other Hadoop modules; 2) Hadoop Distributed File System (HDFS), a distributed file system that stores data on commodity machines (off-the-shelf devices built from large numbers of already-available computing components) and provides high aggregate bandwidth across the cluster; 3) Hadoop YARN, a resource-management platform that manages cluster resources and schedules user applications; and 4) Hadoop MapReduce, a programming model for large scale data processing (Bappalige, 2014). Overall, Hadoop is a very powerful and versatile tool that can help meet the challenges of dealing with large amounts of data spread across clusters and distributed file systems.
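The way a distributed file system such as HDFS spreads one large file over many commodity machines can be sketched as follows. The 128 MB block size and three copies per block are commonly cited HDFS defaults but are assumptions here, since the presentation does not state them; the node names are hypothetical, and the simple round-robin placement is a stand-in for HDFS's real, rack-aware placement policy.

```python
import math

BLOCK_SIZE_MB = 128  # assumed block size (a commonly cited HDFS default)
REPLICATION = 3      # assumed number of copies kept of each block


def place_blocks(file_size_mb, nodes, block_size_mb=BLOCK_SIZE_MB,
                 replication=REPLICATION):
    """Split a file into blocks and assign each block to distinct nodes."""
    if replication > len(nodes):
        raise ValueError("need at least as many nodes as replicas")
    num_blocks = math.ceil(file_size_mb / block_size_mb)
    placement = {}
    for block in range(num_blocks):
        # Round-robin placement: a simplification of the real policy.
        placement[block] = [
            nodes[(block + offset) % len(nodes)] for offset in range(replication)
        ]
    return placement


# Hypothetical cluster of four machines storing one 500 MB file.
placement = place_blocks(500, ["node-a", "node-b", "node-c", "node-d"])
for block, holders in placement.items():
    print("block", block, "->", holders)
```

Because every block lives on several machines, losing one machine does not lose data, which is the fault tolerance described earlier in the presentation.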

HADOOP AND DATA PROCESSING

At the most basic level, Hadoop is able to store many files that, individually, are larger than an individual PC's capacity. The benefits this presents for businesses that need to store large amounts of data are readily apparent. Utilization of Hadoop clusters easily removes the constraints [companies] had on storing and processing data (Bertolucci, 2013). So Hadoop helps to store large amounts of data, but it is also very efficient at processing it. Bertolucci's article draws a comparison with trying to open a very large file on a PC, which can take an extremely long time. This is because, in most cases, data flows to the software for processing; Hadoop instead brings the software to the data, which allows it to process extremely large amounts of data very quickly. Just these two basic principles of storing and processing data can help a business become more efficient quickly and at very little cost.

HADOOP AS A BUSINESS

Because of the success that Hadoop has had in the market since 2002, a whole new industry has emerged around it. Allied Market Research calls this Hadoop-as-a-Service and anticipates that it will grow to $50.2 billion by the year 2020 ("Top 6 Hadoop Vendors providing Big Data Solutions in Open Data Platform," 2015). Many large companies have become Hadoop vendors, including Amazon, IBM, and Microsoft. These companies have helped package Hadoop and distribute it to users. For example, IBM Hadoop users can set up and move data to Hadoop clusters in approximately 30 minutes, with data processing rates of 60 cents per cluster per hour ("Top 6 Hadoop Vendors providing Big Data Solutions in Open Data Platform," 2015). The benefit is that customers are able to get to market rapidly, and IBM also incorporates advanced Big Data analytics by harnessing the power of Hadoop.

CHALLENGES IN HADOOP UTILIZATION

Despite Hadoop's rapid growth and increasing popularity, the technology is still in its developmental stages when it comes to management and deployment tools. Additionally, installation and implementation are time consuming. Andrew Oliver, a Strategic Developer, cites four challenges that companies face when attempting to centralize Hadoop: 1) Hadoop isn't a single thing, meaning that many pieces make up Hadoop as a whole and each piece is packaged and implemented separately; 2) diverse workloads, which make balancing systems difficult; 3) partitioning, which presents an issue when differentiating between streaming jobs and batch jobs because they require different levels of service (this can result in the need for multiple Hadoop clusters, which would need to be managed separately); and finally, 4) priorities, which Oliver explains as a situation where the fact that your company or organization requires a certain amount of resources doesn't guarantee that you will receive them, because of the way the database is stored (Oliver, 2015). Overall, there is not yet a large selection of solutions to these challenges, but solutions are slowly being developed and will aid in the deployment and maintenance of Hadoop within larger organizations.

THE FUTURE OF BIG DATA

We are only just beginning to realize the potential of Big Data. Jennifer McGinn of the IBM Big Data and Analytics Hub hints at the tip of the iceberg of what we will be able to accomplish with Big Data and the analytical capabilities of Hadoop:

Nothing will ever be out of stock because companies will be able to better predict what we want and where we want to buy it (McGinn, 2015, para 5).

Cars, trucks and equipment won't break down as often because predictive maintenance will tell you when and where to get things fixed before they break (McGinn, 2015, para 5).

Roads will be free from pot holes because sensors will know where they are and tell crews to fix them (McGinn, 2015, para 5).

The common flu won't stand a chance of spreading because healthcare workers will be able to track outbreaks and treat them on the spot (McGinn, 2015, para 5).

CONCLUSION

Big data (data from many sources, of varying formats, both structured and unstructured) means different things in different industries. But as different as their needs and usage of big data may be, there is one commonality among all industries: the opportunity to plumb big data for better, more informed perspectives on their customers, products, partners, competitors and strategies. As organizations begin to explore the possibilities enabled by big data and analytics, they need new ways to store and access data, fast. Apache Hadoop provides an answer to that challenge (IBM Software, 2015, p. 3).

References

All images used in this presentation are copyright free and fully licensed from Adobe Stock Images.

Akerkar, R. (n.d.). Big data computing. Boca Raton: CRC. Retrieved September 15, 2015, from Arnold Bernhard Library Database.

Bappalige, S. (2014, August 26). An introduction to Apache Hadoop for big data. Retrieved September 16, 2015, from http://opensource.com/life/14/8/intro-apache-hadoop-big-data

Bertolucci, J. (2013, November 19). How to explain Hadoop to non-geeks. Retrieved September 16, 2015, from Information Week: http://www.informationweek.com/big-data/software-platforms/how-to-explain-hadoop-to-non-geeks/d/d-id/899721

Chen, M., Mao, S., Zhang, Y., & Leung, V. C. M. (2014). Big data: Related technologies, challenges and future prospects. Cham: Springer International. Retrieved September 15, 2015, from Arnold Bernhard Library Database.

Clancy, H. (2015, January 5). Predictive analytics, a potent prescription for health care. Retrieved September 14, 2015, from Fortune: http://fortune.com/2015/01/05/predictive-analytics-health-care/

Collins, K. (2015, March 18). A quick guide to the worst corporate hack attacks. Retrieved September 17, 2015, from Bloomberg: Bloomberg.com

Davenport, T. H., & Dyche, J. (2013). Big data in big companies. SAS Institute. International Institute for Analytics.

Duan, L., & Xiong, Y. (2015, March 19). Big data analytics and business analytics. Journal of Management Analytics, 2(1), 1-21.

IBM. (2015). What is Hadoop? Retrieved September 16, 2015, from IBM: http://www-01.ibm.com/software/data/infosphere/hadoop/

IBM Software. (2015). Making the case for big data and Hadoop in the enterprise. Retrieved September 16, 2015, from IBM: http://www-01.ibm.com/common/ssi/cgi-bin/ssialias?subtype=BK&infotype=PM&appname=SWGE_IM_DD_USEN&htmlfid=IMM14161USEN&attachment=IMM14161USEN.PDF#loaded

McGinn, J. (2015, February 17). The future of data potential is here. Retrieved September 16, 2015, from IBM Big Data Hub: http://www.ibmbigdatahub.com/blog/future-data-potential-here

Morton, J. (2014). Big data: Opportunities and challenges. Swindon: BCS, The Chartered Institute for IT. Retrieved September 14, 2015, from Arnold Bernhard Library Database.

Newman, D. (2015, February 24). Big Data: Why Facebook knows us better than our therapist. Retrieved September 14, 2015, from Forbes: http://www.forbes.com/sites/danielnewman/2015/02/24/big-data-why-facebook-knows-us-better-than-our-therapist/

Noyes, K. (2014, July 25). The shortest distance between two points? At UPS, it's complicated. Retrieved September 14, 2015, from Fortune: http://fortune.com/2014/07/25/the-shortest-distance-between-two-points-at-ups-its-complicated/

Oliver, A. (2015, July 2). Big data, big challenges: Hadoop in the enterprise. Retrieved September 16, 2015, from http://www.infoworld.com/article/2943252/application-development/the-challenges-of-deploying-hadoop-in-the-enterprise.html

SAS Institute. (2015). What is big data? Retrieved September 16, 2015, from SAS: http://www.sas.com/en_us/insights/big-data/what-is-big-data.html

Top 6 Hadoop Vendors providing Big Data Solutions in Open Data Platform. (2015, April 8). Retrieved September 16, 2015, from http://www.dezyre.com/article/-top-6-hadoop-vendors-providing-big-data-solutions-in-open-data-platform/93

Vera-Baquero, A., Palacios, R. C., Stantchev, V., & Molloy, O. (2015). Leveraging big-data for business process analytics. The Learning Organization. Emerald Group Publishing Limited.
