
Big Data Opportunities and Challenges: Discussions from Data Analytics Perspectives

Zhi-Hua Zhou, Nitesh V. Chawla, Yaochu Jin, and Graham J. Williams

∙ Z.-H. Zhou is with the National Key Laboratory for Novel Software Technology, Nanjing University, Nanjing 210023, China ([email protected])

∙ N. V. Chawla is with the Department of Computer Science and Engineering and the Interdisciplinary Center for Network Science, University of Notre Dame, United States ([email protected])

∙ Y. Jin is with the Department of Computing, University of Surrey, Guildford, Surrey, GU2 7XH, United Kingdom ([email protected])

∙ G. J. Williams is with Togaware Pty Ltd and the Australian Tax Office, Canberra, Australia ([email protected])

Abstract—“Big Data” as a term has been among the biggest trends of the last three years, leading to an upsurge of research, as well as industry and government applications. Data is deemed a powerful raw material that can impact multidisciplinary research endeavors as well as government and business performance. The goal of this discussion paper is to share the data analytics opinions and perspectives of the authors relating to the new opportunities and challenges brought forth by the big data movement. The authors bring together diverse perspectives, coming from different geographical locations with different core research expertise and different affiliations and work experiences. The aim of this paper is to evoke discussion rather than to provide a comprehensive survey of big data research.

Index Terms—Big data, data analytics, machine learning, data mining, global optimization, application


1 INTRODUCTION

Big data is one of the “hottest” phrases being used today. Everyone is talking about big data, and it is believed that science, business, industry, government, society, etc. will undergo a thorough change under the influence of big data. Technically speaking, the process of handling big data encompasses collection, storage, transportation and exploitation. There is no doubt that the collection, storage and transportation stages are necessary precursors for the ultimate goal of exploitation through data analytics, which is the core of big data processing.

Turning to a data analytics perspective, we note that “big data” has come to be defined by the four V’s — Volume, Velocity, Veracity, and Variety. It is assumed that either all or any one of them needs to be met for a problem to be classified as a big data problem. Volume indicates the size of the data, which might be too big to be handled by the current state of algorithms and/or systems. Velocity implies that data are streaming in at rates faster than can be handled by traditional algorithms and systems; sensors are rapidly reading and communicating streams of data, and we are approaching the world of the quantified self, which is presenting data that was not available hitherto. Veracity suggests that despite the data being available, the quality of the data is still a major concern. That is, we cannot assume that with big data comes higher quality; in fact, with size come quality issues, which need to be tackled either at the data pre-processing stage or by the learning algorithm. Variety is the most compelling of all the V’s as it presents data of different types and modalities for a given object under consideration.

Each of the V’s is certainly not new. Machine learning and data mining researchers have been tackling these issues for decades. However, the emergence of Internet-based companies has challenged many of the traditional process-oriented companies—they now need to become knowledge-based companies driven by data rather than by process.

The goal of this article is to share the authors’ opinions about big data from their data analytics perspectives. The four authors bring quite different perspectives with different research experiences and expertise, spanning computational intelligence, machine learning, data mining and data science, and interdisciplinary research. The authors represent academia and industry across four different continents. This diversity brings together an interesting perspective and coverage in exploring data analytics in the context of today’s big data.

It is worth emphasizing that this article does not intend to provide a comprehensive review of the state-of-the-art of big data research, nor to provide a future big data research agenda. The aim is to expose the authors’ personal opinions and present their perspectives of the future based on their views. As such there will be necessarily limited evidential argument or literary support, given the rapidly changing landscape and the significant lag of academic research reporting. Indeed, many important issues and relevant techniques are not specifically covered in this article, and are best left to survey papers.

While all authors have contributed to the overall paper, each author has focused on their particular specialities in the following discussions. Zhou covers machine learning, while Chawla brings a data mining and data science perspective. Jin provides a view from computational intelligence and meta-heuristic global optimization, and Williams draws upon a machine learning and data mining background applied as a practicing data scientist and consultant to industry internationally.

2 MACHINE LEARNING WITH BIG DATA

Machine learning is among the core techniques for data analytics. In this section we will first clarify three common but unfortunately misleading arguments about learning systems in the big data era. Then we will discuss some issues that demand attention.

2.1 Three Misconceptions

2.1.1 “Models are not important any more”

Many people today talk about the replacement of sophisticated models by big data, where we have massive amounts of data available. The argument goes that in the “small data era” models were important, but in the “big data era” this might not be the case.

Such arguments are claimed to be based on empirical observations of the type illustrated in Fig. 1. With small data (e.g., a data size of 10), the best model is about x% better than the worst model in the figure, whereas the performance improvement brought by big data (e.g., a data size of 10^4) is y% ≫ x%. Such observations can be traced over many years, as in [7], and predate the use of “big data”. It is interesting to see that in the “big data era”, many people take such a figure (or similar figures) to claim that having big data is enough to get better performance. Such a superficial observation, however, neglects the fact that even with big data (e.g., a data size of 10^4 in the figure), there are still significant differences between the different models—models are still important.

Also, we often hear such arguments: as the figure shows, the simplest model with small data achieves the best performance with big data, and thus one does not need to have a sophisticated model because the simple model is enough. Unfortunately, this argument is also incorrect.

Fig. 1. Illustration: Model performance vs. data size (predictive accuracy, %, of Algo_1, Algo_2 and Algo_3 for data sizes from 10 to 10^4).

First, there is no reason to conclude that the worst-performing model on the small data is really the “simplest”, and vice versa. Second, even if we assumed that the worst-performing model on the small data is really the simplest one, there is no support for the generalization of the argument that the simplest model will definitely achieve the best performance with big data in tasks other than the current empirical study.

If we take a look into [7] we can find that Algo 1 in Fig. 1 corresponds to a memory-based method, Algo 2 corresponds to a Perceptron or Naïve Bayes, and Algo 3 corresponds to Winnow. It is hard to conclude that Winnow is simpler than the memory-based method; at least, the “simpler will be better” argument cannot explain why the performance of a Perceptron is better than that of the memory-based method on big data. A more reasonable explanation as to why the memory-based method is better than Winnow on small data but worse on big data may lie in its requirement of loading the data into memory. This is a memory issue, not a matter of whether a model is sophisticated or not.

The recent interest in deep learning [10], [26] provides strong evidence that on big data, sophisticated models are able to achieve much better performance than simple models. We want to emphasize that deep learning techniques are not really new, and many of the ideas can be traced back to the 1990s [25], [32]. However, there were two serious problems that encumbered their development at that time. First, the computational facilities available at that time could hardly handle models with thousands of parameters to tune, whereas current deep learning models involve millions or even billions of parameters. Second, the data scale at that time was relatively small, and thus models with high complexity were very likely to overfit. We can see that with the rapid increase of computational power, training sophisticated models becomes more and more feasible, whereas the big data size greatly reduces the overfitting risk of sophisticated models. In this sense, one can even conclude that in the big data era, sophisticated models become more favored, since simple models are usually incapable of fully exploiting the data.

2.1.2 “Correlation is enough”

Some popular books on big data, including [37], claim that it is enough to discover “correlation” from big data. The importance of “causality”, they argue, will be overtaken by “correlation”, with some advocating that we are entering an “era of correlation”.

Trying to discover causality reflects the intention of searching for an in-depth understanding of the data. This is usually challenging in many real domains [44]. However, we have to emphasize that correlation is far from sufficient, and the role of causality can never be replaced by correlation. The reason lies in the fact that one invests in data analytics because one wants information helpful for making wiser decisions and/or taking suitable actions, whereas the abuse of correlation can be misleading or even disastrous.

We can easily find many examples, even in classic statistical textbooks, illustrating that correlation cannot replace the role of causality. For example, empirical data analysis on public security cases in a number of cities disclosed that the number of hospitals and the number of car thefts are highly positively correlated. Indeed, car thefts increase almost linearly with the construction of new hospitals. With such a correlation identified, how would the mayor react to reduce car thefts? An “obvious” solution is to cease the construction of new hospitals. Unfortunately, this is an abuse of the correlation information. It will only decrease the opportunity for patients to get timely medical attention, whereas it is extremely unlikely to have anything to do with the incidence of car thefts. Instead, the increase in both the incidence of car thefts and the number of hospitals is actually driven by a latent variable, i.e., the residential population. If one believes that correlation is sufficient and never goes deeper into the analysis of the data, one might end up like a mayor who plans to reduce car thefts by restricting the construction of hospitals.
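
To make the confounding effect concrete, the following minimal sketch simulates the hospital/car-theft scenario: both quantities are driven by a hypothetical latent population variable, so they correlate strongly even though neither causes the other, and the correlation vanishes once the confounder is controlled for. All numbers and coefficients are purely illustrative.

import numpy as np

# Hypothetical city-level data: residential population (the latent variable)
# drives both the number of hospitals and the number of car thefts.
rng = np.random.default_rng(0)
population = rng.uniform(0.1, 5.0, size=200)             # millions of residents
hospitals = 10 * population + rng.normal(0, 2, 200)      # scales with population
car_thefts = 400 * population + rng.normal(0, 80, 200)   # also scales with population

# Strong correlation between hospitals and car thefts, despite no causal link.
print("corr(hospitals, car thefts) =", np.corrcoef(hospitals, car_thefts)[0, 1])

# Controlling for the confounder (partial correlation via residuals) removes it.
resid_h = hospitals - np.poly1d(np.polyfit(population, hospitals, 1))(population)
resid_c = car_thefts - np.poly1d(np.polyfit(population, car_thefts, 1))(population)
print("corr after controlling for population =", np.corrcoef(resid_h, resid_c)[0, 1])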

Sometimes computational challenges may encumber the discovery of causality, and in such cases discovering valid correlations can still provide some helpful information. However, exaggerating the importance of “correlation” and taking the replacement of causality by correlation as a feature of the “big data era” can be detrimental, and lead to unnecessary, negative consequences.

2.1.3 “Previous methodologies do not work any more”

Another popular argument claims that previous research methodologies were designed for small data and cannot work well on big data. This argument is often held by people who are highly enthusiastic about newly proposed techniques and thus seek “totally new” paradigms.

We appreciate the search for new paradigms as this is one of the driving forces for innovative research. However, we highlight the importance of “past” methodologies.

Firstly, we should emphasize that researchers have always been trying to work with “big” data, such that what is regarded as big data today might not be regarded as big data in the future (e.g., in ten years). For example, in a famous article [31], the author wrote that “For learning tasks with 10,000 training examples and more it becomes impossible ...”. The title of the paper, “Making Large-Scale SVM Training Practical”, implies that the goal of the article was “large-scale”—yet the experimental datasets in the paper mostly contained thousands of samples, and the biggest one contained 49,749 samples. This was deemed “amazingly big data” at that time. Nowadays, few people would regard fifty thousand samples as big data.

Secondly, many past research methodologies still hold much value. We might consider [60], the proceedings of a KDD 1999 workshop. On the second page it is emphasized that “implementation ... in high-performance parallel and distributed computing ... is becoming crucial for ensuring system scalability and interactivity as data continues to grow inexorably in size and complexity”. Indeed, most of the “current” facilitation for handling big data, such as high-performance computing, parallel and distributed computing, high efficiency storage, etc., has been used in data analytics for many years and will remain popular into the future.

2.2 Opportunities and Challenges

It is difficult to identify “totally new” issues brought about by big data. Nonetheless, there are always important aspects to which one hopes to see greater attention and efforts channeled.

First, although we have always been trying to handle (increasingly) big data, we have usually assumed that the core computation can be held in memory seamlessly. The current data size, however, has reached such a scale that the data become hard to store and even hard to scan multiple times. Yet many important learning objectives or performance measures are non-linear, non-smooth, non-convex and non-decomposable over samples; for example, AUC (Area Under the ROC Curve) [24] and its optimization inherently require repeated scans of the entire dataset. Is such an objective learnable by scanning the data only once, with a storage requirement that is small and independent of the data size? We call this “one-pass learning”, and it is important because in many big data applications the data is not only big but also accumulated over time, hence it is impossible to know the eventual size of the dataset. Fortunately, there are some recent efforts in this direction, including [22]. On the other hand, although we have big data, are all the data crucial? The answer is very likely that they are not. Then the question becomes: can we identify valuable data subsets from the original big dataset?
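
As a rough illustration of the one-pass constraint, the sketch below scans a data stream exactly once and keeps only O(d) statistics (per-class sums and counts), from which a simple linear scorer is formed. This is not the method of [22]; it is only a toy, with a hypothetical stream, showing storage that is independent of the number of examples.

import numpy as np

def one_pass_scorer(stream):
    """stream yields (x, y) pairs with y in {0, 1}; each example is seen exactly once."""
    sums = {0: None, 1: None}      # O(d) storage, independent of the stream length
    counts = {0: 0, 1: 0}
    for x, y in stream:
        x = np.asarray(x, dtype=float)
        sums[y] = x.copy() if sums[y] is None else sums[y] + x
        counts[y] += 1
    w = sums[1] / counts[1] - sums[0] / counts[0]   # class-mean difference direction
    return lambda x: float(np.dot(w, x))            # higher score -> more likely positive

# Usage on a synthetic stream of 100,000 examples (never stored as a whole).
rng = np.random.default_rng(0)
labels = rng.integers(0, 2, 100000)
stream = ((rng.normal(loc=y, size=5), y) for y in labels)
score = one_pass_scorer(stream)
print(score(np.ones(5)), score(-np.ones(5)))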

Second, a benefit of big data to machine learning lies in the fact that with more and more samples available for learning, the risk of overfitting becomes smaller. We all understand that controlling overfitting is one of the central concerns in the design of machine learning algorithms as well as in the application of machine learning techniques in practice. The concern with overfitting led to a natural favor for simple models with fewer parameters to tune. However, the parameter tuning constraints may change with big data. We can now try to train a model with billions of parameters, because we have sufficiently big data, facilitated by powerful computational facilities that enable the training of such models. The great success of deep learning [10] during the past few years serves as a good showcase. However, most deep learning work strongly relies on engineering tricks that are difficult for anyone other than the original authors to repeat and study. It is important to study the mysteries behind deep learning; for example, why and when are some ingredients of current deep learning techniques, e.g., pre-training and dropout, helpful, and how can they be made more helpful? There have been some recent efforts in this direction [6], [23], [52]. Moreover, we might ask whether it is possible to develop a parameter tuning guide to replace the current almost-exhaustive search.

Third, we need to note that big data usually contains too many “interests”, and from such data we may be able to get “anything we want”; in other words, we can find supporting evidence for any argument we favor. Thus, how do we judge/evaluate the “findings”? One important solution is to turn to statistical hypothesis testing. The use of statistical tests can help in at least two aspects: first, we need to verify that what we have done is really what we wanted to do; second, we need to verify that what we have attained is not caused by small perturbations that exist in the data, particularly due to the non-thorough exploitation of the whole data. Although statistical tests have been studied for centuries and have been used in machine learning for decades, the design and deployment of adequate statistical tests is non-trivial, and in fact there have been misuses of statistical tests [17]. Moreover, statistical tests suitable for big data analysis, not only for computational efficiency but also given the concern of using only part of the data, remain an interesting but under-explored area of research. Another way to check the validity of analysis results is to derive interpretable models. Although many machine learning models are black boxes, there have been studies on improving the comprehensibility of models, such as rule extraction [62]. Visualization is another important approach, although it is often difficult with dimensions higher than three.
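
As one concrete (and deliberately simple) illustration of using a statistical test to vet a “finding”, the following sketch runs a paired permutation test on hypothetical per-fold accuracies of two models, estimating how often an advantage at least as large as the observed one would arise under random sign flips. The data and fold counts are invented.

import numpy as np

def paired_permutation_test(acc_a, acc_b, n_perm=10000, seed=0):
    """Two-sided p-value for the mean paired difference between two models."""
    rng = np.random.default_rng(seed)
    diffs = np.asarray(acc_a, dtype=float) - np.asarray(acc_b, dtype=float)
    observed = abs(diffs.mean())
    hits = 0
    for _ in range(n_perm):
        signs = rng.choice([-1.0, 1.0], size=diffs.size)   # randomly swap A/B per fold
        if abs((signs * diffs).mean()) >= observed:
            hits += 1
    return (hits + 1) / (n_perm + 1)

# Hypothetical cross-validation accuracies of models A and B on the same folds.
acc_a = [0.91, 0.89, 0.92, 0.90, 0.93]
acc_b = [0.88, 0.90, 0.89, 0.87, 0.91]
print("p-value:", paired_permutation_test(acc_a, acc_b))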

Moreover, big data usually exists in a distributed manner; that is, different parts of the data may be held by different owners, and no one holds the entire data. It is often the case that some sources are crucial for a given analytics goal, whereas other sources are less important. Given that different data owners might grant the analyst different access rights, can we leverage the sources without access to the whole data? What information must we have for this purpose? Even if the owners agree to provide some data, it might be too challenging to transport the data due to its enormous size. Thus, can we exploit the data without transporting it? Moreover, data at different places may have different label quality, and may have significant label noise, perhaps due to crowdsourcing. Can we learn with low-quality and/or even contradictory label information? Furthermore, we usually assume that the data is identically and independently distributed; however, this fundamental i.i.d. assumption can hardly hold across different data sources. Can we learn effectively and efficiently beyond the i.i.d. assumption? There are a few preliminary studies on these important issues for big data, including [34], [38], [61].

In addition, given the same data, different users might have different demands. For example, in product recommendation, some users might demand that the most highly recommended items are good, some might demand that all the recommended items are good, while others might demand that all the good items are returned. The computational and storage loads of big data may prohibit constructing a separate model for each of the various demands. Can we build one model (a “general model” which can be adapted to other demands with cheap, minor modifications) to satisfy the various demands? Some efforts have been reported recently in [35].

Another issue is whether, in the “big data era”, we can really avoid violating privacy [2]. This is a long-standing problem that still remains open.

3 DATA MINING/SCIENCE WITH BIG DATA

We posit again that big data is not a new concept. Rather, aspects of it have been studied and considered by a number of data mining researchers over the past decade and beyond. Mining massive data with scalable algorithms leveraging parallel and distributed architectures has been a focus topic of numerous workshops and conferences, including [1], [14], [43], [50], [60]. However, the embrace of the Volume aspect of data is coming to a realization now, largely through the rapid availability of datasets that exceed terabytes and now petabytes—whether through scientific simulations and experiments, business transactional data or the digital footprints of individuals. Astronomy, for example, is a fantastic application of big data driven by advances in astronomical instruments. Each pixel captured by the new instruments can have a few thousand attributes, translating quickly to a peta-scale problem. This rapid growth in data is creating a new field called Astro-informatics, which is forging partnerships between computer scientists, statisticians and astronomers. This rapid growth of data from various domains, whether in business or science or the humanities or engineering, is presenting novel challenges in the scale and provenance of data, requiring a new rigor and interest among the data mining community to translate their algorithms and frameworks for data-driven discoveries.

A similar caveat applies to the Veracity of data. The issue of data quality or veracity has been considered by a number of researchers [39], including data complexity [9], missing values [19], noise [58], imbalance [13], and dataset shift [39]. The latter, dataset shift, is most profound in the case of big data, as the unseen data may present a distribution that is not seen in the training data. This problem is tied to the problem of Velocity, which presents the challenge of developing streaming algorithms that are able to cope with shocks in the distributions of the data. Again, this is an established area of research in the data mining community in the form of learning from streaming data [3], [48]. A challenge has been that the methods developed by the data mining community have not necessarily been translated to industry. But times are changing, as seen by the resurgence of deep learning in industry.

The issue with Variety is, undoubtedly, unique and interesting. A rapid influx of unstructured and multimodal data, such as social media, images, audio, and video, in addition to the structured data, is providing novel opportunities for data mining researchers. We are seeing such data rapidly being collected into organizational data hubs, where the unstructured and structured cohabit and provide the source for all data mining. A fundamental question relates to integrating these varied streams or inputs of data into a single feature vector representation for traditional learning algorithms.
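
A minimal sketch of one common answer to that question is shown below: structured numeric fields and unstructured text are mapped into a single feature vector for a conventional learner via scikit-learn's ColumnTransformer. The column names and toy records are hypothetical, not drawn from any dataset discussed here.

import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Hypothetical records mixing structured fields with free text.
df = pd.DataFrame({
    "age": [34, 51, 29, 62],
    "num_visits": [2, 7, 1, 9],
    "notes": ["mild symptoms", "chronic condition", "routine check", "acute episode"],
    "label": [0, 1, 0, 1],
})

# Fuse numeric columns (scaled) and text (TF-IDF) into one feature vector.
fuse = ColumnTransformer([
    ("numeric", StandardScaler(), ["age", "num_visits"]),
    ("text", TfidfVectorizer(), "notes"),
])
model = Pipeline([("features", fuse), ("clf", LogisticRegression())])
model.fit(df[["age", "num_visits", "notes"]], df["label"])
print(model.predict(df[["age", "num_visits", "notes"]]))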

The last decade has witnessed the boom of social media websites, such as Facebook, LinkedIn, and Twitter. Together they facilitate an increasingly wide range of human interactions that also generate big data. The ubiquity of social networks manifests as complex relationships among individuals. It is generally believed that research in this field will enhance our understanding of the topology of social networks and the patterns of human interactions [8], [18], [33], [36], [41], [54]. The relations among people affect not only social dynamics but also the broader dynamics of a variety of physical, biological, infrastructural and economic systems. While network-theoretic techniques provide efficient means for the analysis of data with complex underlying relationships, limitations in existing diffusion models are perhaps one of the main causes restricting the extension of these methods to more sophisticated application domains. These limitations are mainly due to the lack of capacity to adequately represent and process the imperfect data that are characteristic of such applications.

Our call to the community is to revisit some of the traditional methods and identify their performance benchmarks on “big data”. This is not about reinventing the wheel, but rather about creating new paths and directions for groundbreaking research built on the foundations we have already developed.

3.1 From Data to Knowledge to Discovery to Action

Recent times have greatly increased our ability to gather massive amounts of data, presenting us with an opportunity to induce transformative changes in the way we analyze and understand data. These data exhibit a number of traits that have the potential not only to complement hypothesis-driven research but also to enable the discovery of new hypotheses or phenomena from the rich data, which could include spatial data, temporal data, observational data, diverse data sources, text data, unstructured data, etc.

Data of such extent and longitudinal character bring novel challenges for data-driven science in charting the path from data to knowledge to insight. This process of data-guided knowledge discovery will entail an integrated plan of descriptive analysis and predictive modeling for useful insights or hypotheses. These hypotheses are not just correlational but help explain an underlying phenomenon or help validate an observed phenomenon.

These discovered hypotheses or predictive analytics can help inform decisions, which include certain actions that can be appropriately weighed by the cost and impact of the action. The set of alternative hypotheses leads to scenarios that can be weighted situationally. Brynjolfsson et al. [11] studied 179 large companies and found that the companies that embraced data-driven decision making experienced a 5 to 6 percent higher level of productivity. The key difference was that these companies relied on data and analytics rather than solely on experience and intuition.

Healthcare is another area witnessing a significant application of big data. United Healthcare, for example, is expending effort on mining customer attitudes as gleaned from recorded voice files. The company is leveraging natural language processing along with text data to identify customer sentiment and satisfaction. It is a clear example of taking disparate big data, developing analytical models, and discovering quantifiable and actionable insights.

Big data presents unparalleled opportunities: to accelerate scientific discovery and innovation; to improve health and well-being; to create novel fields of study that hitherto might not have been possible; to enhance decision making by provisioning the power of data analytics; to understand the dynamics of human behavior; and to affect commerce in a globally integrated economy.

3.2 Opportunities and Challenges

Big data is clearly presenting us with exciting opportunities and challenges in data mining research.

First, data-driven science and discovery should try to find action-oriented insights that lead to charting new discoveries or impacts. Without understanding the nuances of one’s data and the domain, one can fall into the chasm of simple and misleading correlation, sometimes leading to false discovery and insight. It is critical to fully understand and appreciate the domain that one is working in, and all observations and insights need to be appropriately situated in that domain. It requires the immersion of an individual in a domain to conduct feature engineering, data exploration and machine learning, to inform system design and database design, and to conduct what-if analysis. This is not to say that a data scientist will be an expert in every aspect. Rather, a data scientist may be an innovator in machine learning but well-versed in system design or databases or visualization or quick prototyping. But data scientists cannot be divorced from the domain lest they risk the peril of failing.

Second, algorithms are important, but before we jump on a journey of novel algorithms to tackle the four V’s of big data, it is important for the community to consider the advances made hitherto, conduct a thorough empirical survey of them, and then identify the potential bottlenecks, challenges and pitfalls of the existing state-of-the-art.

Third, any advances in scalable algorithms should be tied to advances in architecture, systems, and new database constructs. We are witnessing a shift towards NoSQL databases, given the schema-free environment of the new data types and the prevalence of unstructured data. It is an opportunity for algorithmic researchers to collaborate with systems/database researchers to integrate machine learning or data mining algorithms as part of the pipeline, to naturally exploit the lower constructs of data storage and the computational fabric.

Fourth, a fundamental paradigm that is present in front of us is data-driven discovery. The data scientist must be the curious outsider who can ask questions of data, poke at the limitations placed on the available data and identify additional data that may enhance the performance of the algorithms at a given task. The hypothesis here is that there may be data external to the data captured by a given company, which may provide significant value. For example, consider the problem of predicting readmission for a patient on discharge. This problem of reducing readmission may find significant value by considering lifestyle data, which is outside of the patient’s Electronic Medical Record (EMR).

We see these as some of the key opportunities and challenges that are specifically presented within the context of data mining with big data research.

4 GLOBAL OPTIMIZATION WITH BIG DATA

Another key area where big data offers opportunities and challenges is global optimization. Here we aim to optimize decision variables over specific objectives. Meta-heuristic global search methods such as evolutionary algorithms have been successfully applied to optimize a wide range of complex, large-scale systems, ranging from engineering design to the reconstruction of biological networks. Typically, optimization of such complex systems needs to handle a variety of challenges, as identified here [12].

4.1 Global Optimization of Complex Systems

Complex systems often have a large number of decision variables and involve a large number of objectives, where the correlation between the decision variables may be highly nonlinear and the objectives are often conflicting. Optimization problems with a large number of decision variables, known as large-scale optimization problems, are very challenging. For example, the performance of most global search algorithms will seriously degrade as the number of decision variables increases, especially when there is a complex correlational relationship between the decision variables. Divide-and-conquer is a widely adopted strategy for dealing with large-scale optimization, where the key issue is to detect the correlational relationships between the decision variables so that correlated variables are grouped into the same sub-population and independent variables are assigned to different sub-populations.
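
A rough sketch of that detection step, under the assumption that pairwise interaction can be probed with finite differences of the objective (in the spirit of differential grouping), is given below. The objective function, step size and threshold are hypothetical placeholders.

import numpy as np

def interacts(f, x, i, j, delta=1.0, eps=1e-6):
    """Return True if variables i and j appear non-separable for f around x."""
    x = np.asarray(x, dtype=float)
    xi = x.copy(); xi[i] += delta
    xj = x.copy(); xj[j] += delta
    xij = xi.copy(); xij[j] += delta
    d1 = f(xi) - f(x)            # effect of moving variable i alone
    d2 = f(xij) - f(xj)          # effect of moving variable i after moving j
    return abs(d1 - d2) > eps

def group_variables(f, x0):
    """Greedy grouping: put a variable into the first group it interacts with."""
    groups = []
    for i in range(len(x0)):
        placed = False
        for g in groups:
            if any(interacts(f, x0, i, j) for j in g):
                g.append(i); placed = True; break
        if not placed:
            groups.append([i])
    return groups

# Example: variables 0 and 1 interact, variable 2 is separable.
f = lambda x: (x[0] * x[1]) ** 2 + x[2] ** 2
print(group_variables(f, np.zeros(3)))   # expected: [[0, 1], [2]]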

Over the past two decades, meta-heuristics have been shown to be efficient in solving multi-objective optimization problems, where the objectives often conflict with each other. The main reason is that in a population-based search method, different individuals can capture different trade-off relationships between the conflicting objectives, e.g., in complex structural design optimization [12]. As a result, it is possible to achieve a representative subset of the whole Pareto-optimal solution set in one single run, in particular for bi- or tri-objective optimization problems. The multi-objective optimization meta-heuristics developed thus far can largely be divided into three categories, namely weighted aggregation based methods [28], Pareto-dominance based approaches [16] and performance indicator-based algorithms [5].

Unfortunately, none of these methods work efficiently when the number of objectives becomes much higher than three. This is mainly because the number of Pareto-optimal solutions becomes large and achieving a representative subset of them is no longer tractable. For the weighted aggregation approaches, it can become difficult to create a limited number of weight combinations to represent the Pareto-optimal solutions in a very high-dimensional objective space. For the Pareto-based approaches, most solutions in a population of limited size are non-comparable; thus, only a few individuals dominate others and the selection pressure for better solutions is lost. An additional difficulty is the increasingly large computational cost of computing the dominance relations as the number of objectives increases. Performance indicator-based approaches also suffer from high computational complexity, e.g., in calculating the hypervolume.
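
The loss of dominance-based selection pressure is easy to demonstrate empirically. The toy experiment below, with arbitrary random objective vectors and population size, shows the fraction of non-dominated solutions in a random population approaching one as the number of objectives grows.

import numpy as np

def dominates(a, b):
    """Minimization: a dominates b if it is no worse in all objectives and better in one."""
    return np.all(a <= b) and np.any(a < b)

def nondominated_fraction(pop):
    flags = np.ones(len(pop), dtype=bool)
    for i in range(len(pop)):
        for j in range(len(pop)):
            if i != j and dominates(pop[j], pop[i]):
                flags[i] = False
                break
    return flags.mean()

rng = np.random.default_rng(1)
for m in (2, 3, 5, 10, 15):
    pop = rng.random((100, m))          # 100 random solutions with m objectives
    print(m, "objectives ->", nondominated_fraction(pop))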

The second main challenge associated with the optimization of complex systems is the computationally expensive process of evaluating the quality of solutions. For most complex optimization problems, either time-consuming numerical simulations or expensive experiments need to be conducted for fitness evaluations. The prohibitively high computational or experimental costs make it intractable to apply global population-based search algorithms to such complex optimization problems. One approach that has been shown to be promising is the use of computationally efficient models, known as surrogates, to replace part of the expensive fitness evaluations [29]. However, constructing surrogates can become extremely challenging for large-scale problems with very limited data samples that are expensive to collect.
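
To give a flavor of surrogate-assisted search, the sketch below pre-screens randomly generated candidates on a cheap Gaussian process surrogate and spends the expensive evaluation budget only on the most promising candidate per iteration. The expensive_fitness function, budget and candidate generator are hypothetical placeholders rather than any specific published algorithm.

import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor

def expensive_fitness(x):                  # stand-in for a costly simulation
    return float(np.sum((x - 0.3) ** 2))

rng = np.random.default_rng(0)
dim, budget = 5, 30
X = rng.random((10, dim))                  # small initial design of experiments
y = np.array([expensive_fitness(x) for x in X])

for _ in range(budget - len(X)):
    surrogate = GaussianProcessRegressor().fit(X, y)
    candidates = rng.random((200, dim))    # cheap candidate pool (e.g., offspring)
    pred = surrogate.predict(candidates)
    best = candidates[np.argmin(pred)]     # screen on the surrogate
    X = np.vstack([X, best])
    y = np.append(y, expensive_fitness(best))   # one true evaluation per iteration

print("best fitness found:", y.min())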

Complex optimization problems are often subject to large amounts of uncertainty, such as varying environmental conditions, system degeneration, or changing customer demand [29]. Two basic ideas can be adopted to address uncertainty in optimization. One is to find solutions that are relatively insensitive to small changes in the decision variables or fitness functions, known as robust optimal solutions [29]. However, if the changes are large and continuous, meta-heuristics for tracking the moving optima are often developed, which is known as dynamic optimization [42]. Different from the robustness approach to handling uncertainty, dynamic optimization aims to track the optimum whenever it changes. Theoretically this sounds perfect, but in practice it is often not desirable, for two reasons. First, tracking a moving optimum is computationally intensive, particularly if the fitness evaluations are expensive. Second, a change in the design or solution may be expensive, and frequent changes are not allowed in many cases. To take these two factors into account, a new approach to coping with uncertainty, termed robustness over time, has been suggested [30]. The main idea is to reach a realistic trade-off between finding a robust optimal solution and tracking the moving optimum. That is, a design or solution will be changed only if the solution currently in use is no longer acceptable, and a new optimal solution that changes slowly over time, which is not necessarily the best solution at that time instant, will be sought.
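
A toy sketch of that decision rule (not the algorithm of [30]; the drifting fitness function, acceptance threshold and re-optimization routine are invented for illustration) is shown below: the solution in use is replaced only when its quality in the current environment falls below an acceptance threshold.

import numpy as np

def fitness(x, t):                         # environment drifts slowly with time t
    return -np.sum((x - 0.01 * t) ** 2)

def reoptimize(t, rng, n=500):             # stand-in for an expensive search
    cand = rng.random((n, 2))
    return cand[np.argmax([fitness(c, t) for c in cand])]

rng = np.random.default_rng(0)
threshold = -0.05                          # minimum acceptable quality
x_in_use, switches = reoptimize(0, rng), 0
for t in range(1, 101):
    if fitness(x_in_use, t) < threshold:   # only act when no longer acceptable
        x_in_use = reoptimize(t, rng)
        switches += 1
print("number of solution changes:", switches)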

4.2 Big Data in Optimization

Meta-heuristic global optimization of complex systems cannot be accomplished without data generated in numerical simulations and physical experiments. For example, the design optimization of a racing car is extremely challenging since it involves many subsystems such as the front wing, rear wing, chassis and tires. A huge number of decision variables are involved, which may seriously degrade the search performance of meta-heuristics. To alleviate this difficulty, data generated by aerodynamic engineers in their daily work will be very helpful for determining which subsystem, or even which part of a subsystem, is critical for enhancing the aerodynamics and drivability of the car. Analysis and mining of such data is, however, a challenging task, because the amount of data is huge, and the data might be stored in different forms and polluted with noise. In other words, these data are fully characterized by the four V’s of big data. In addition, as fitness evaluations of racing car designs are highly time-consuming, surrogates are indispensable in the optimization of racing vehicles.

Another example is the computational reconstruction of biological gene regulatory networks. Reconstruction of gene regulatory networks can be seen as a complex optimization problem, where a large number of parameters and the connectivity of the network need to be determined. While meta-heuristic optimization algorithms have been shown to be very promising, the gene expression data used for reconstruction is substantially big data in nature [51]. The data available from gene expression is increasing at an exponential rate [59], and the volume of data is ever increasing with developments in next-generation sequencing techniques such as high-throughput experiments. In addition, data from experimental biology, such as microarray data, is noisy, and gene expression experiments rarely have the same growth conditions and thus produce heterogeneous data sets. Data variety is also significantly increased through the use of deletion data, where a gene is deleted in order to determine its regulatory targets. Perturbation experiments are useful in the reconstruction of gene regulatory networks, but they are another source of variety in biological data. Data collected from different labs for the same genes in the same biological network are often different.

It also becomes very important to develop optimization algorithms that are able to gain problem-specific knowledge during optimization. Acquisition of problem-specific knowledge can help capture the problem structure to perform a more efficient search. For large-scale problems that have a large number of objectives, such knowledge can be used to guide the search through the most promising regions of the search space, and to specify preferences over the objectives so that the search will focus on the most important trade-offs. Unfortunately, sometimes only limited a-priori knowledge is available for the problem to be solved. It is therefore also interesting to discover knowledge from similar optimization problems or objectives that have been solved previously [20]. In this case, the proper re-use of the knowledge can be very challenging. The relationship between the challenges in complex systems optimization and the nature of big data is illustrated in Fig. 2.

Fig. 2. Relationship between the challenges in complex engineering optimization and the nature of big data.

4.3 Opportunities and Challenges

As discussed above, big data is widely seen as essential for the success of the design optimization of complex systems. Much effort has been dedicated to the use of data to enhance the performance of meta-heuristic optimization algorithms for solving large-scale problems in the presence of large amounts of uncertainty. It is believed that the boom in big data research can create new opportunities as well as impose new challenges on data-driven optimization. Answering the following questions can be central to converting the challenges posed by big data into opportunities.

First, how can we seamlessly integrate modern learning and optimization techniques? Many advanced learning techniques, such as semi-supervised learning [63], incremental learning [15], active learning [47] and deep learning [10], have been developed over the past decade. However, these techniques have rarely been taken advantage of within optimization, with few exceptions, and they are critical for acquiring domain knowledge from a large amount of heterogeneous and noisy data. For optimization using meta-heuristics, such knowledge is decisive in setting up a flexible and compact problem representation, designing efficient search operators, constructing high-quality surrogates, and refining user preferences in multi-objective optimization.

Second, how can we formulate the optimization problem so that new techniques developed in big data research can be more efficiently leveraged? The traditional formulation of optimization problems consists of defining objective functions, decision variables and constraints. This works perfectly for small, well-defined problems. Unfortunately, the formulation of complex problems is itself an iterative learning process. The new approaches to data analysis in big data can be of interest in simplifying the formulation of complex optimization problems. For example, for surrogates to be used for rank prediction in population-based optimization, exact fitness prediction is less important than figuring out the relative order of the candidate designs. Might it also be possible to find meta-decision variables that are more effective in guiding the search process than the original decision variables?
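
The rank-prediction point can be made concrete with a small sketch: a hypothetical surrogate whose absolute predictions are badly biased can still order candidates almost perfectly, so for selection purposes it is better judged by a rank correlation (here Kendall's tau) than by squared error. The functions and numbers are invented for illustration.

import numpy as np
from scipy.stats import kendalltau

rng = np.random.default_rng(0)
true_fitness = lambda x: np.sum(x ** 2, axis=1)
# Biased, rescaled surrogate: poor absolute accuracy, good ordering.
surrogate = lambda x: 2.0 * np.sum(x ** 2, axis=1) + 5.0 + 0.01 * rng.normal(size=len(x))

candidates = rng.random((50, 10))
y_true, y_pred = true_fitness(candidates), surrogate(candidates)

mse = np.mean((y_true - y_pred) ** 2)      # large: absolute predictions are poor
tau, _ = kendalltau(y_true, y_pred)        # close to 1: the ranking is preserved
print(f"MSE = {mse:.2f}, Kendall tau = {tau:.2f}")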

Third, how do we visualize the high-dimensional decision space as well as the high-dimensional solution space to understand the achieved solutions and make a choice [27], [53]? How can techniques developed in big data analytics be used in optimization?

Overcoming these challenges in a big data framework will deliver significant advances to global optimization over the coming years.

5 INDUSTRY, GOVERNMENT AND PEOPLE WITH BIG DATA

We have presented above some of the technical challenges for research around the disciplines impacted and challenged by big data. In the end, we also must focus on the delivery of benefit and outcomes within industry, business and government. Over the decades, many of the technologies we have covered above, in machine learning, data mining, and global optimization, have found their way into a variety of large-scale applications. But what are the impacts of big data we see today in industry and government, how is this affecting and changing society, and how might these changes affect our research across all these disciplines?

In this section we present a perspective on data analytics from experiences in industry and government. The discussion is purposefully presented as a point of view of the future in practice rather than as a scientifically rigorous argument. We identify here areas where a focus from research might deliver impact to industry and government.

It is useful to reflect that over the past two decades we have witnessed an era where society has seen the mass collection of personal data by commercial interests and government. As users, we have been enticed by significant benefits to hand our data over to these organizations, and these organizations now house the big data that we have come to understand as the concept or the marketing term of the moment.

Google, Apple and Facebook, together with many other Internet companies that exist today, provide services ranging from the discovery of old friends to the ability to share our thoughts, personal details and daily activities publicly. With much of our email, our diaries and calendars, and our photos, thoughts and personal activities now hosted by Google, for example, there is tremendous opportunity to identify and deal with a whole-of-client view on a massive scale. Combine that with our web logs, updates on our location and storage of our documents on Google Drive, and we start to understand the massive scope of the data collected about each of us, individually. These data can be used to better target the services advertised to us, using an impressive variety of algorithmic technologies to deliver new insights and knowledge.

Together, these crowdsourced data stores entice us to deliver our personal data to the data collectors in return for the sophisticated services they offer. The enticement is, of course, amazingly attractive, evidenced by the sheer number of users in each of the growing Internet ecosystems.

The customers that drive this data collection by these Internet companies are not the users of the services but are, for example, commercial advertisers and government services. The data is also made available to other organizations, purposefully and/or inappropriately.

As data have been gradually collected through centralized cloud services over time, there is an increasing need for broad discussion and understanding of the privacy, security, and societal issues that such collections of personal big data present. We have, as a society, reduced our focus on these issues and have come to understand that the centralized store of data is the only way in which we can deliver these desirable services.

However, this concept of a centralized collection of personal and private data should be challenged. The centralized model primarily serves the interests of the organizations collecting the data—more so than the individuals. We are seeing a growing interest in, and opportunity to, instead turn this centralized mode of data collection on its head, and to serve the interests of the individuals first, and those of the organizations second. The emergence of OwnCloud as a personally hosted replacement for Google Drive and Dropbox is but one example of this trend.

The slowly growing rediscovery of the importance of privacy is leading to a shakeup of how we store data. We will see this leading to governments introducing new governance regimes for the Internet companies over time. We will also begin to see a migration of data away from being stored centrally towards being stored with the person to whom the data relates—the appropriate data owner. This extreme data distribution presents one of the greatest challenges to data scientists in the near future of big data.

We discuss three challenges of big data that relate to this emerging paradigm of extreme data distribution. The two initial challenges are the scale of our model building and the timeliness of our model building. The main focus is the challenge to society over the coming years as data are migrated from centralized to extremely distributed data stores. This is one of the more significant challenges that will be presented in the big data future, and one that has a major impact on how we think about machine learning, data mining and global optimization.

5.1 Scaled Down Targeted Sub-Models

There has been considerable focus on the need to scale up our traditional machine learning algorithms to build models over the whole of the population available—and to build them quickly. Indeed, since the dawn of data mining [45] a key focus has been to scale algorithms. This is what we used to identify as the distinguishing characteristic between data mining (or knowledge discovery from databases) and the home disciplines of most of the algorithms we used: machine learning and statistics [56]. Our goal was to make use of all the data available, to avoid the need for sampling, and to ensure we capture knowledge from the whole population.

This goal remains with us today, as we continue to be obsessed with and able to collect data—masses of data—and thereby introduce new businesses and refine old ones. But of course we have for the past decades talked about big, large, huge, enormous, massive, humongous data. The current fad is to refer to it as big data or massive data [40]. Irrespective, it is simply a lot of data.

In business and in government our datasets today consist of anything from 100 observations of 20,000 variables, to 20 million observations of 1,000 variables, to 1 billion observations of 10,000 variables. Such large datasets challenge any algorithm. While our research focus is generally on the algorithms, the challenges presented by data collection, storage, management, manipulation, cleansing, and transformation are often much bigger (i.e., more time consuming) than those presented by the actual model building.

Challenged with a massive dataset, what is the task, in practice, of the data scientist? In a future world, the learning algorithms will trawl through the massive datasets for us—they will slice and dice the data, and will identify anomalies, patterns, and behaviors. But how will this be delivered?

A productive approach will be to build on the successful early concept of ensemble model building [55], but taken to a new massive scale. We are seeing the development of approaches that massively partition our datasets into many overlapping subsets, representing many different subspaces, often identifying behavioral archetypes. Within these subspaces we can more accurately understand and model the multiple behaviors exhibited by the entities of interest. The idea is not new [57] but has received scant attention over the years until now, when we are realizing the need to slice and dice big data in sensible ways to uncover these multitudes of behavioral patterns. Today in industry and government this approach is delivering new models that are demonstrating surprisingly good results based on ensembles of thousands of smaller, very different models.

The challenge here is how to identify the behavioral subspaces within which we build our models. Instead of building a single predictive model over the whole dataset we build a community of models that operate over the whole population as a single entity. This is done by understanding and dealing with the nuances and idiosyncrasies of the different sub-populations. It is within the much smaller sub-populations where the predictive models, for example, are built. The final model is then the ensemble of individual models applied to each new observation.

An example of this approach, deployed in practice, has analyzed over 2 million observations of 1000 variables to identify 20,000 such behavioral subspaces. The subspaces are created using a combination of cluster analysis and decision tree induction, and each subspace is described symbolically, each identifying and representing new concepts. The subspaces can overlap and individual observations can belong to multiple subspaces (i.e., a single observation may exhibit multiple behaviors). For each of the 20,000 subspaces, micro predictive models can be built, and by then combining these into an ensemble — using global optimization of multiple complex objective functions — we deliver an empirically effective global model, deployed, for example, to risk score the whole tax paying population as they interact with the taxation system.
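
At a much smaller scale than the deployed system described above, the following sketch shows the general pattern: partition the data into behavioral subspaces by clustering, fit a micro model within each subspace, and score a new observation by routing it to its subspace model (a softer ensemble could average over several nearby subspaces). The data, cluster count and model choices are hypothetical.

import numpy as np
from sklearn.cluster import KMeans
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)
X = rng.random((5000, 20))
y = (X[:, 0] * X[:, 1] > 0.25).astype(int)        # synthetic target

k = 10                                            # number of behavioral subspaces
clusters = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
micro_models = {}
for c in range(k):
    mask = clusters.labels_ == c                  # observations in this subspace
    micro_models[c] = DecisionTreeClassifier(max_depth=4).fit(X[mask], y[mask])

def risk_score(x):
    """Route the observation to its subspace and return that micro model's score."""
    c = clusters.predict(x.reshape(1, -1))[0]
    proba = micro_models[c].predict_proba(x.reshape(1, -1))[0]
    return float(dict(zip(micro_models[c].classes_, proba)).get(1, 0.0))

print(risk_score(rng.random(20)))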

We can (and no doubt will) continue to explore new computational paradigms like in-database analytics to massively scale machine learning and statistical algorithms to big data. But the big game will be in identifying and modeling the multiple facets of the variety of behaviors we all individually exhibit, and share with sizeable sub-populations, over which the modeling itself will be undertaken.

5.2 Right Time, Real Time, Online Analytics

The traditional data miner, in practice, has generally been involved in batch-oriented model building, using machine learning and statistical algorithms. From our massive store of historical data we use algorithms such as logistic regression, decision tree induction, random forests, neural networks, and support vector machines. Once we have built our model(s) we then proceed to deploy the models. In the business context we migrate our models into production. The models then run on new transactions as they arrive, perhaps scoring each transaction and then deciding on a treatment for that transaction—that is, based on the score, how should the transaction be processed by our systems?

The process is typical of how data mining is delivered in many large organizations today, including government, financial institutions, insurance companies, health providers, marketing, and so on. In the Australian Taxation Office, for example, every day a suite of data mining models risk scores every transaction (tax return) received. The Australian Immigration Department [4], as another example, has developed a risk model that assesses all passengers when they check in for their international flight to Australia (known as Advance Passenger Processing) using data mining models. Such examples abound throughout industry and government.


Today’s agile context now requires more than this. The larger and older organizations, worldwide, have tended to be much less agile in their ability to respond to the rapid changes delivered through the Internet and our data-rich world. Organizations no longer have the luxury of spending a few months building models for scoring transactions in batch mode. We need to be able to assess each transaction as it happens, and to dynamically learn as the model interacts with the massive volumes of transactions we are faced with as they occur. We need to build models in real time that respond in real time and that learn and change their behavior in real time.

Research in incremental learning is certainly not new. Incremental learning [15], [46], just-in-time or any-time learning [49], and data stream mining [21] have all addressed similar issues in different ways over the past decades. There is now increasing opportunity to capitalize on these approaches. The open question remains: how do we maintain and improve our knowledge store over time, while working to forget old, possibly incorrect, knowledge?
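As one hedged illustration of the incremental style of learning these lines of work pursue, the sketch below uses scikit-learn's partial_fit interface to score and then update a linear model on each arriving mini-batch; the stream() generator and the choice of SGDClassifier are assumptions for the example only.

```python
# Incremental (online) learning sketch: the model is updated as data arrives,
# rather than being rebuilt from scratch in batch.
import numpy as np
from sklearn.linear_model import SGDClassifier

model = SGDClassifier(loss="log_loss")   # logistic regression trained online
classes = np.array([0, 1])

def process_stream(stream):
    # stream yields (X_batch, y_batch) pairs as transactions arrive.
    for X_batch, y_batch in stream:
        # Score arriving transactions with the current model (once it exists) ...
        if hasattr(model, "coef_"):
            yield model.predict_proba(X_batch)[:, 1]
        # ... then immediately revise the model on the observed outcomes,
        # so the knowledge store adapts rather than being rebuilt.
        model.partial_fit(X_batch, y_batch, classes=classes)
```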

The development of dynamic, agile learners working in real time—that is, learning as they interact with the real world—remains difficult and will remain a central challenge for our big data world.

5.3 Extreme Data Distribution—Privacy and Ownership

Having considered two intermediate challenges around big data, we now consider a game-changing challenge. The future holds for us the prospect of individuals regaining control of their data from the hands of the now common centralized massive stores. We expect to see this as an orderly evolution of our understanding of what is best for society as we progress and govern our civil society in this age of big data collection and surveillance, with its consequent serious risk to privacy.

Data ownership has become a challenging issue in our data-rich world. Data collectors in the corporate and government spheres are learning to efficiently collect increasingly larger holdings of big data. However, with the help of civil libertarians, philosophers and whistleblowers, society is gradually realizing the need for better governance over the collection and use of data. Recent events involving Wikileaks1 and Edward Snowden2 have helped raise the level of discussion, providing insight into the dangers of aggregating data centrally—with little regard for who owns the data.

We are well aware of the dangers of single points of failure—the reliance on our data being held centrally, as massive datasets, stored securely, and used by the data collectors only for the benefit of society and of the individuals within it. Yet a single point of failure means that just one flaw or one breach can lead to devastating consequences on a massive scale. And with increasingly sophisticated methods of attack, such breaches are increasingly happening. Even without sophistication, Snowden demonstrated that simply having all the data stored in one location increases the risks significantly, to the detriment of governments, industry, and society as a whole.

After identifying risks, we often work towards strategies to mitigate those risks. An obvious strategy is to recoil from the inherently insecure centralization of data collection. Personal data needs to move back to the individuals to whom it belongs. We can increasingly collect and retain such data ourselves, under our own control, as individuals, reducing the overall societal risk.

The services that are so attractive and that we have come to rely upon, the services provided by Google, Apple, Facebook, and many other Internet ecosystems, must still be delivered. The corporations must retain their ability to profit from the data, while the data itself is retained by the data owners. Under this future scenario, instead of centralizing all the computation, we can bring the computation to intelligent agents running on our own personal devices and communicating with the service providers. Business models can still be profitable and users can regain their privacy.

1. http://en.wikipedia.org/wiki/Wikileaks
2. http://en.wikipedia.org/wiki/Edward_Snowden

Our personal data will be locked behind encryption technology, and the individuals will hold the key to unlock the data as they wish. The data will be served to our portable smart devices where it is decrypted and provides the services (mail, photos, music, instant messaging, shopping, search, web logs, etc.) we require. The data will be hosted in these personal encrypted clouds, running on mesh-network connected commodity smart devices. We see the beginnings of this change happening now with projects like the FreedomBox3, OwnCloud4, and the IndieWeb, and the widespread and massive adoption of powerful smartphones!
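A toy illustration of the personal-encryption idea, assuming the Python cryptography package's Fernet recipe; a real personal data store would of course need key management, sharing and revocation well beyond this sketch.

```python
# The key stays with the individual; the hosting provider only ever sees
# ciphertext. Record contents here are placeholders.
from cryptography.fernet import Fernet

key = Fernet.generate_key()          # held by the data owner, not the service
vault = Fernet(key)

record = b'{"mail": "...", "photos": ["..."], "messages": ["..."]}'
ciphertext = vault.encrypt(record)   # what the personal cloud stores

# Only on the owner's device, where the key lives, is the data readable.
plaintext = vault.decrypt(ciphertext)
assert plaintext == record
```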

This view of distributed massive data brings with it the challenge to develop technology and algorithms that work over an extreme data distribution. How can we distribute the computation over this extreme level of data distribution and still build models that learn in a big data context?
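One possible shape for such computation, sketched here only as an assumption and not as a proposal from this paper, is a federated-averaging style loop in which each personal device fits a local model and only model parameters ever travel to the coordinator.

```python
# Bring the computation to the data: raw records never leave each device.
import numpy as np

def local_update(weights, X_local, y_local, lr=0.1, epochs=5):
    # On-device logistic-regression SGD starting from the shared weights.
    w = weights.copy()
    for _ in range(epochs):
        for x, y in zip(X_local, y_local):
            p = 1.0 / (1.0 + np.exp(-x @ w))
            w -= lr * (p - y) * x
    return w

def federated_round(global_w, devices):
    # Each device trains locally; the coordinator only averages parameters.
    updates = [local_update(global_w, X, y) for X, y in devices]
    return np.mean(updates, axis=0)

# devices = [(X_1, y_1), (X_2, y_2), ...]   # one pair per personal device
# w = np.zeros(n_features)
# for _ in range(50):
#     w = federated_round(w, devices)
```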

The challenge of extremely distributed big data and learning is one that will quickly grow over the coming years. It will require quite a different approach to the development of machine learning, data mining and global optimization. Compare the approach to how society functions through an ensemble of humans: each of us is personally a store of a massive amount of data, and we share, learn from and use that data as we interact with the world and with other humans. So must the learning algorithms of the future.

3. http://en.wikipedia.org/wiki/FreedomBox
4. http://en.wikipedia.org/wiki/Owncloud

5.4 Opportunities and Challenges

Identifying key challenges of big data leads one to question how the future of data collection might evolve over the coming years. As the pendulum reaches the limit of centralized massive data collection, in time and possibly sooner than we might expect, we will see the pendulum begin to swing back. It must swing back to a scenario where we restore data to the ownership and control of the individuals about whom the data relates. The data will be extremely distributed, with individual records distributed far and wide, just as the data owners are distributed far and wide across our planet.

With an extreme data distribution we will be challenged to provide the services we have come to expect with massive centralized data storage. Those challenges are surely not insurmountable, but will take considerable research, innovation, and software development to deliver.

The first challenge presented then becomes one of appropriately partitioning big data (eventually massive, extremely distributed data), to identify behavioral groups within which we learn, and even to model and learn at the individual level. The second challenge is to refocus on delivering learning algorithms that self-learn (or self-modify) in real time, or at least at the right time, and to do this online. Finally, how do we deliver this in the context of extreme data distribution, where the database records are now distributed far and wide and privacy protected, and how might we deliver learning agents that look after the interests of their “owner”?

6 WRAP UP

From the data analytics perspectives we have presented here, there are many new opportunities and challenges brought by big data. Some of these are not necessarily new, but are issues that have not received the attention that they deserve. Here we recall some of the important/interesting issues:

∙ Data size: On the one hand, we develop “one-pass learning” algorithms that require only a single scan of the data, with limited storage independent of the data size (see the one-pass sketch after this list); on the other hand, we try to identify smaller partitions of the really valuable data from the original big data.

∙ Data variety: Data presents itself in varied forms for a given concept. This presents a new notion to learning systems and computational intelligence algorithms for classification, where the feature vector is multi-modal, with structured and unstructured text, and still the goal is to classify one concept from another. How do we create a feature vector, and then a learning algorithm with an appropriate objective function, to learn from such varied data?

∙ Data trust: While data is rapidly and increasingly available, it is also important to consider the data source and whether the data can be trusted. More data is not necessarily correct data, and more data is not necessarily valuable data. A keen filter for the data is key.

∙ Distributed existence: Owners of different parts of the data might warrant different access rights. We must aim to leverage data sources without access to the whole data, and exploit them without transporting the data. We will need to pay attention to the fact that different sources may come with different label quality, there may be serious noise in the data due to crowdsourcing, and the i.i.d. assumption may not hold across the sources.

∙ Extreme distribution: Taking this idea even further, data may end up distributed down to the level of the individual unit, as we deal with issues of privacy and security. New approaches to modeling will be required to work with such distributed data.

∙ Diverse demands: People may have diverse demands, whereas the high cost of big data processing may preclude constructing a separate model for each demand. Can we build one model to satisfy the various demands? We also need to note that, with big data, it is possible to find supporting evidence for any argument we want; how, then, do we judge/evaluate our “findings”?

∙ Sub-models: Diverse demands might also relate to the diversity of the behaviors that we might be modeling within our application domains. Rather than one single model to cover it all, the model will consist of ensembles of smaller models that together deliver better understandings and predictions than the single, complex model.

∙ Intuition importance: Data is going to power novel discoveries and action-oriented business insights. It is important to still attach intuition, curiosity and domain knowledge, without which one may become myopic and fall into the chasm of “correlation is enough”. Computational intelligence should be tied to human intuition.

∙ Rapid modeling: As the world continues to “speed up”, decisions need to be made more quickly, and fraudsters, in such an agile environment, can more quickly find new methods; model building must therefore become more agile and real-time.

∙ Big optimization: Global optimization algorithms such as meta-heuristics have achieved great success in academic research, but have rarely been employed in industry. One major obstacle is the huge computational cost required for evaluating the quality of candidate designs of complex engineering systems. The emerging big data analytics technologies will remove this obstacle to a certain degree by reusing knowledge extracted from the huge amounts of high-dimensional, heterogeneous and noisy data. Such knowledge can also be acquired with new visualization techniques. Big data driven optimization will also play a key role in the reconstruction of large-scale biological systems.

∙ Complex optimization: Defining the decision variables, setting up the objectives and articulating the constraints are the three main steps in formulating optimization problems before solving them. For the optimization of complex systems, the formulation of the optimization problem itself becomes a complex optimization problem. The big data approach might provide us with new insights and methodologies for formulating optimization problems, thus leading to more efficient solutions.
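To make the one-pass constraint from the Data size item concrete, the sketch below computes a running mean and variance with Welford's algorithm in a single scan and constant storage; one-pass learners such as one-pass AUC optimization [22] operate under the same constraint when fitting models.

```python
# One scan, O(1) memory, regardless of how large the stream grows.
class RunningStats:
    def __init__(self):
        self.n, self.mean, self.m2 = 0, 0.0, 0.0

    def update(self, x):
        # Welford's update: revise mean and sum of squared deviations in place.
        self.n += 1
        delta = x - self.mean
        self.mean += delta / self.n
        self.m2 += delta * (x - self.mean)

    @property
    def variance(self):
        return self.m2 / (self.n - 1) if self.n > 1 else 0.0
```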

In closing the discussion, we emphasize that the opportunities and challenges brought by big data are very broad and diverse, and it is clear that no single technique can meet all demands. In this sense, big data also brings a chance of “big combination” of techniques and of research.

ACKNOWLEDGMENTS

The authors want to thank the editor and anonymous reviewers for helpful comments and suggestions. This article presents a consolidation of some of the presentations and discussion of the panel “Big Data: Real Challenges and Status” hosted at IDEAL 2013; the very positive reflections from the audience for the panel motivated the writing of this article. The authors thank Hui Xiong, Kai Yu, Xin Yao and other participants for comments and discussion.

REFERENCES

[1] J. Abello, P. M. Pardalos, and M. G. Resende, Handbook of Massive Data Sets, ser. Massive Computing. Springer, 2002, vol. 4.

[2] C. C. Aggarwal and P. S. Yu, Eds., Privacy-Preserving Data Mining: Models and Algorithms. Springer, 2008.

[3] C. C. Aggarwal, Data Streams: Models and Algorithms. Springer, 2007, vol. 31.

[4] Australian Department of Immigration, “Fact sheet 70 - managing the border,” Internet, 2013. [Online]. Available: http://www.immi.gov.au/media/fact-sheets/70border.htm

[5] J. Bader and E. Zitzler, “HypE: An algorithm for fast hypervolume-based many-objective optimization,” Evolutionary Computation, vol. 19, no. 1, pp. 45–76, 2011.

[6] P. Baldi and P. J. Sadowski, “Understanding dropout,” in Advances in Neural Information Processing Systems 26. Cambridge, MA: MIT Press, 2013, pp. 2814–2822.

[7] M. Banko and E. Brill, “Scaling to very very large corpora for natural language disambiguation,” in Proceedings of the 39th Annual Meeting of the Association for Computational Linguistics, Toulouse, France, 2001, pp. 26–33.

[8] A.-L. Barabasi, Linked: The New Science of Networks. Basic Books, 2002.

[9] M. Basu and T. K. Ho, Data Complexity in Pattern Recognition. London, UK: Springer, 2006.

[10] Y. Bengio, “Learning deep architectures for AI,” Foundations and Trends in Machine Learning, vol. 2, no. 1, pp. 1–127, 2009.

[11] E. Brynjolfsson, L. Hitt, and H. Kim, “Strength in numbers: How does data-driven decision making affect firm performance?” Available at SSRN 1819486, 2011.

[12] T. Chai, Y. Jin, and S. Bernhard, “Evolutionary complex engineering optimization: Opportunities and challenges,” IEEE Computational Intelligence Magazine, vol. 8, no. 3, pp. 12–15, 2013.

[13] N. V. Chawla, “Data mining for imbalanced datasets: An overview,” in Data Mining and Knowledge Discovery Handbook, O. Maimon and L. Rokach, Eds. US: Springer, 2005, pp. 853–867.

[14] N. V. Chawla et al., Learning on Extremes - Size and Imbalance - of Data. Florida, US: University of South Florida, 2002.

[15] Q. Da, Y. Yu, and Z.-H. Zhou, “Learning with augmented class by exploiting unlabeled data,” in Proceedings of the 28th AAAI Conference on Artificial Intelligence, Quebec City, Canada, 2014.

[16] K. Deb, A. Pratap, S. Agarwal, and T. Meyarivan, “A fast and elitist multiobjective genetic algorithm: NSGA-II,” IEEE Transactions on Evolutionary Computation, vol. 6, no. 2, pp. 182–197, 2002.

[17] T. G. Dietterich, “Approximate statistical tests for comparing supervised classification learning algorithms,” Neural Computation, vol. 10, no. 7, pp. 1895–1923, 1998.

[18] D. Easley and J. Kleinberg, Networks, Crowds, and Markets. Cambridge University Press, 2010.

[19] A. Farhangfar, L. Kurgan, and J. Dy, “Impact of imputation of missing values on classification error for discrete data,” Pattern Recognition, vol. 41, no. 12, pp. 3692–3705, 2008.

[20] L. Feng, Y.-S. Ong, I. Tsang, and A.-H. Tan, “An evolutionary search paradigm that learns with past experiences,” in IEEE Congress on Evolutionary Computation, Brisbane, QLD, Australia, 2012, pp. 1–8.

[21] M. M. Gaber, A. Zaslavsky, and S. Krishnaswamy, “Mining data streams: A review,” ACM SIGMOD Record, vol. 34, no. 2, pp. 18–26, 2005.

[22] W. Gao, R. Jin, S. Zhu, and Z.-H. Zhou, “One-pass AUC optimization,” in Proceedings of the 30th International Conference on Machine Learning, Atlanta, GA, 2013, pp. 906–914.

[23] W. Gao and Z.-H. Zhou, “Dropout Rademacher complexity of deep neural networks,” CoRR abs/1402.3811, 2014.

[24] J. A. Hanley and B. J. McNeil, “A method of comparing the areas under receiver operating characteristic curves derived from the same cases,” Radiology, vol. 148, no. 3, pp. 839–843, 1983.

[25] G. E. Hinton, P. Dayan, B. J. Frey, and R. M. Neal, “The “wake-sleep” algorithm for unsupervised neural networks,” Science, vol. 268, no. 5214, pp. 1158–1161, 1995.

[26] G. E. Hinton and R. R. Salakhutdinov, “Reducing the dimensionality of data with neural networks,” Science, vol. 313, pp. 504–507, 2006.

[27] H. Ishibuchi, M. Yamane, N. Akedo, and Y. Nojima, “Many-objective and many-variable test problems for visual examination of multiobjective search,” in IEEE Congress on Evolutionary Computation, Cancun, Mexico, 2013, pp. 1491–1498.

[28] H. Ishibuchi and T. Murata, “Multi-objective genetic local search algorithm,” in Proceedings of the IEEE International Conference on Evolutionary Computation, Nagoya, Japan, 1996, pp. 119–124.

[29] Y. Jin and J. Branke, “Evolutionary optimization in uncertain environments - A survey,” IEEE Transactions on Evolutionary Computation, vol. 9, no. 3, pp. 303–317, 2005.

[30] Y. Jin, K. Tang, X. Yu, B. Sendhoff, and X. Yao, “A framework for finding robust optimal solutions over time,” Memetic Computing, vol. 5, no. 1, pp. 3–18, 2013.

[31] T. Joachims, “Making large-scale SVM training practical,” in Advances in Kernel Methods, B. Scholkopf, C. J. C. Burges, and A. J. Smola, Eds. Cambridge, MA: MIT Press, 1999, pp. 169–184.

[32] Y. LeCun and Y. Bengio, “Convolutional networks for images, speech, and time series,” in The Handbook of Brain Theory and Neural Networks, M. A. Arbib, Ed. Cambridge, MA: MIT Press, 1995.

[33] J. Leskovec, A. Krause, C. Guestrin, C. Faloutsos, J. VanBriesen, and N. Glance, “Cost-effective outbreak detection in networks,” in Proceedings of the 13th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. ACM, 2007, pp. 420–429.

[34] M. Li, W. Wang, and Z.-H. Zhou, “Exploiting remote learners in internet environment with agents,” Science China: Information Sciences, vol. 53, no. 1, pp. 47–76, 2010.

[35] N. Li, I. W. Tsang, and Z.-H. Zhou, “Efficient optimization of performance measures by classifier adaptation,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 35, no. 6, pp. 1370–1382, 2013.

[36] R. N. Lichtenwalter, J. T. Lussier, and N. V. Chawla, “New perspectives and methods in link prediction,” in Proceedings of the 16th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. ACM, 2010, pp. 243–252.

[37] V. Mayer-Schonberger and K. Cukier, Big Data: A Revolution That Transforms How we Work, Live, and Think. Houghton Mifflin Harcourt, 2012. (Chinese translated version by Y. Sheng and T. Zhou, Zhejiang Renmin Press).

[38] M. Mohri and A. Rostamizadeh, “Stability bounds for stationary φ-mixing and β-mixing processes,” Journal of Machine Learning Research, vol. 11, pp. 789–814, 2010.

[39] J. G. Moreno-Torres, T. Raeder, R. Alaiz-Rodríguez, N. V. Chawla, and F. Herrera, “A unifying view on dataset shift in classification,” Pattern Recognition, vol. 45, no. 1, pp. 521–530, 2012.

[40] National Research Council, Frontiers in Massive Data Analysis. Washington, D.C.: The National Academies Press, 2013.

[41] M. E. Newman, “The structure and function of complex networks,” SIAM Review, vol. 45, no. 2, pp. 167–256, 2003.

[42] T. T. Nguyen, S. Yang, and J. Branke, “Evolutionary dynamic optimization: A survey of the state of the art,” Swarm and Evolutionary Computation, vol. 6, pp. 1–24, 2012.

[43] B.-H. Park and H. Kargupta, “Distributed data mining: Algorithms, systems, and applications,” in The Handbook of Data Mining, N. Ye, Ed. New Jersey, US: Lawrence Erlbaum Associates, 2002, pp. 341–358.

[44] J. Pearl, Causality: Models, Reasoning and Inference. Cambridge University Press, 2000.

[45] G. Piatetsky-Shapiro and W. Frawley, Eds., Proceedings of the IJCAI89 Workshop on Knowledge Discovery in Databases. Detroit, MI: AAAI, 1989.

[46] J. C. Schlimmer and D. Fisher, “A case study of incremental concept induction,” in Proceedings of the 5th National Conference on Artificial Intelligence, Philadelphia, PA, 1986, pp. 496–501.

[47] B. Settles, “Active learning literature survey,” Department of Computer Sciences, University of Wisconsin at Madison, Madison, WI, Tech. Rep. 1648, 2009, http://pages.cs.wisc.edu/~bsettles/pub/settles.activelearning.pdf.

[48] J. A. Silva, E. R. Faria, R. C. Barros, E. R. Hruschka, A. C. de Carvalho, and J. Gama, “Data stream clustering: A survey,” ACM Computing Surveys (CSUR), vol. 46, no. 1, p. 13, 2013.

[49] P. Smyth, “Recent advances in knowledge discovery and data mining (invited talk),” in Proceedings of the 14th National Conference on Artificial Intelligence, Providence, RI, USA, 1997.

[50] A. Srivastava, E.-H. Han, V. Kumar, and V. Singh, Parallel Formulations of Decision-Tree Classification Algorithms. US: Springer, 2002.

[51] S. A. Thomas and Y. Jin, “Reconstructing biological gene regulatory networks: Where optimization meets big data,” Evolutionary Intelligence, vol. 7, no. 1, pp. 29–47, 2013.

[52] S. Wager, S. Wang, and P. Liang, “Dropout training as adaptive regularization,” in Advances in Neural Information Processing Systems 26. Cambridge, MA: MIT Press, 2013, pp. 351–359.

[53] D. Walker, R. Everson, and J. E. Fieldsend, “Visualising mutually non-dominating solution sets in many-objective optimisation,” IEEE Transactions on Evolutionary Computation, vol. 17, no. 2, pp. 165–184, 2013.

[54] D. J. Watts, Six Degrees: The Science of a Connected Age. WW Norton, 2004.

[55] G. J. Williams, “Combining decision trees: Initial results from the MIL algorithm,” in Artificial Intelligence Developments and Applications, J. S. Gero and R. B. Stanton, Eds. Elsevier Science Publishers B.V. (North-Holland), 1988, pp. 273–289.


[56] G. J. Williams and Z. Huang, “A case study in knowledge acquisition for insurance risk assessment using a KDD methodology,” in Proceedings of the Pacific Rim Knowledge Acquisition Workshop, P. Compton, R. Mizoguchi, H. Motoda, and T. Menzies, Eds., Sydney, Australia, 1996, pp. 117–129.

[57] ——, “Mining the knowledge mine: The Hot Spots methodology for mining large, real world databases,” in Advanced Topics in Artificial Intelligence, A. Sattar, Ed. Springer-Verlag, 1997, pp. 340–348.

[58] X. Wu and X. Zhu, “Mining with noise knowledge: Error-aware data mining,” IEEE Transactions on Systems, Man and Cybernetics, Part A: Systems and Humans, vol. 38, no. 4, pp. 917–932, 2008.

[59] M. Xiao, L. Zhang, B. He, J. Xie, and W. Zhang, “A parallel algorithm of constructing gene regulatory networks,” in Proceedings of the 3rd International Symposium on Optimization and Systems Biology, D.-Z. Du and X.-S. Zhang, Eds., Zhangjiajie, China, 2009, pp. 184–188.

[60] M. J. Zaki and C.-T. Ho, Eds., Large-Scale Parallel Data Mining. Berlin, Germany: Springer, 2000.

[61] D. Zhou, J. C. Platt, S. Basu, and Y. Mao, “Learning from the wisdom of crowds by minimax entropy,” in Advances in Neural Information Processing Systems 25, P. Bartlett, F. C. N. Pereira, C. J. C. Burges, L. Bottou, and K. Q. Weinberger, Eds. Cambridge, MA: MIT Press, 2012.

[62] Z.-H. Zhou, “Rule extraction: Using neural networks or for neural networks?” Journal of Computer Science and Technology, vol. 19, no. 2, pp. 249–253, 2004.

[63] X. Zhu, “Semi-supervised learning literature survey,” Department of Computer Sciences, University of Wisconsin at Madison, Madison, WI, Tech. Rep. 1530, 2006, http://www.cs.wisc.edu/~jerryzhu/pub/sslsurvey.pdf.