
Noname manuscript No.(will be inserted by the editor)

Comparing DBMSs and Alternative Big Data Systems:Scale Up versus Scale Out

Carlos Ordonez, David Sergio Matusevich,Sastry Isukapalli

the date of receipt and acceptance should be inserted later

Abstract CARLOS: Relational DBMSs remain the main data management technology, despite the big data analytics and no-SQL waves. On the other hand, for data analytics in a broad sense there are plenty of non-DBMS tools, including statistical languages, matrix packages, generic data mining programs and large-scale parallel systems, the last being the main technology for big data analytics. Such large-scale systems are mostly based on the Hadoop distributed file system and MapReduce. Thus it would seem a DBMS is not a good technology to analyze big data beyond SQL queries, acting just as a reliable and fast data repository. In this survey, we argue that this is not the case, explaining important research that has enabled analytics on large databases inside a DBMS. However, we also argue that DBMSs cannot compete with parallel systems like MapReduce to analyze web-scale text data. Therefore, each technology will keep influencing the other. We conclude with a proposal of long-term research issues, considering the “big data analytics” trend.

1 DAVID: Introduction

DBMSs are capable of efficiently updating, storing and querying large tables. Relational DBMSs have been the dominant technology for managing and querying structured data. In the recent past, there has been an explosion of data on the Internet, commonly referred to as “big data”, from collections of web pages, documents, transaction logs and records, etc., which are largely unstructured. In the DBMS and big data worlds the main challenge continues to be to efficiently manage, process, and analyze large data sets. Research in this area is extensive, and has resulted in many efficient algorithms, data structures, and optimizations for analyzing large data sets. This analysis process spans a range of activities, from querying and reporting, through exploratory data analysis, to predictive modeling via machine learning. The process of knowledge discovery, represented by models and patterns, is referred to here simply as data mining. There has been substantial progress in DBMS and non-DBMS technologies in the domain of analytics, but there remain many cases that call for a hybrid approach, one that uses these tools in a complementary manner and also builds on advances in integrating statistical methods and matrix computations with database systems.

University of Houston, ATT

We present a survey of system infrastructure, programming mechanisms, algorithms and optimizations that enable analytics on large data sets. The focus of this survey is primarily on data ingestion, processing, querying, analytics, and data mining, mainly in relation to non-textual, array-like data. Aspects of complex data elements such as graphs, time-series, etc., transactional data, data security, and data privacy are not surveyed here.

In the big data world, there have been major advances on efficient data mining algorithms [?,?,?], where the data are managed outside of a dedicated DBMS (e.g. sets of flat files on a distributed file system). However, when large data sets already reside within a DBMS, the cost of moving data to another system becomes a central concern. Previous research has shown it is slow to move large volumes of data due to double I/O, change of file format and network speed [45]. Exporting data sets is, therefore, a major bottleneck to analyzing large data sets outside a DBMS [45]. Moreover, the data set storage layout has a close relationship to algorithm performance. Reducing the number of passes over a large data set resident on secondary storage, developing incremental algorithms with faster convergence, sampling for approximation, and creating efficient data structures for indexing or summarization remain fundamental algorithmic optimizations.
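
To make the summarization optimization concrete, the following is a minimal illustrative sketch (not the implementation of any specific system surveyed here): the count n, linear sum L and quadratic sum Q computed in a single pass summarize a numeric column well enough to derive its mean and variance without rescanning the data.

```python
def sufficient_stats(stream):
    """One pass over a numeric stream: count n, linear sum L, quadratic sum Q."""
    n, L, Q = 0, 0.0, 0.0
    for x in stream:
        n += 1
        L += x
        Q += x * x
    return n, L, Q

def mean_variance(n, L, Q):
    """Derive mean and (population) variance from the summaries alone."""
    mean = L / n
    var = Q / n - mean * mean
    return mean, var

n, L, Q = sufficient_stats([2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0])
print(mean_variance(n, L, Q))  # (5.0, 4.0)
```

A DBMS can accumulate these same summaries with a single aggregate query, so the large table on secondary storage is scanned exactly once.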

DBMSs [32] and Hadoop-type systems (based on MapReduce or variants such as Spark, and distributed streaming engines) [62]^1 are currently the two main competing technologies for data mining on large data sets, both based on automatic data parallelism on a shared-nothing architecture. DBMSs can efficiently process transactions and SQL queries on relational tables. There exist parallel DBMSs for data warehousing and large-scale analytic applications, but they are difficult to expand and they are mostly proprietary and expensive. On the other hand, distributed Hadoop-type systems have been primarily open source and are designed from the ground up to scale to large, heterogeneous clusters. These systems provide means to automatically distribute analytic operations across a cluster of computers and often work on top of a distributed file system (on disk or in memory across the cluster), providing great flexibility and expandability at low cost.^2
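
The MapReduce programming model behind such systems can be illustrated with a small sequential sketch (the function names and data are hypothetical): records are mapped to key–value pairs, grouped by key (the shuffle), and each group is reduced independently, which is what makes the computation data-parallel.

```python
from collections import defaultdict

def map_reduce(records, mapper, reducer):
    """Toy sequential model of MapReduce: map each record to (key, value)
    pairs, group values by key (the shuffle), then reduce each group."""
    groups = defaultdict(list)
    for rec in records:
        for key, value in mapper(rec):
            groups[key].append(value)
    return {key: reducer(key, values) for key, values in groups.items()}

# Word count, the canonical MapReduce example.
lines = ["scale up", "scale out", "scale up and out"]
counts = map_reduce(
    lines,
    mapper=lambda line: [(w, 1) for w in line.split()],
    reducer=lambda key, values: sum(values),
)
print(counts["scale"])  # 3
```

In a real cluster, the map and reduce calls run on different machines and the shuffle moves data over the network; the programmer supplies only the two functions.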

Despite the advantages of the Hadoop-type systems, there are still many cases where there are benefits in using a centralized DBMS, especially when the end-user can perform data mining inside a DBMS and reduce the risk of data

^1 SASTRY: Need to remove or update citation.

^2 SASTRY: Efficiency may not be true – for simple data-intensive tasks it is reasonable, but for CPU-intensive tasks, scaling up via GPUs or more cores/more memory may be a reasonable alternative.


Fig. 1 Data mining process: phases and tasks.

inconsistency, security, etc., and also mostly avoiding the export bottleneck if data originates from an OLTP database or a data warehouse.

The problem of integrating statistical and machine learning methods and models with a DBMS has received moderate attention [32,8,46,57,58,45]. In general, this problem is considered difficult due to the relational DBMS architecture, the matrix-oriented nature of statistical techniques, lack of access to the DBMS source code, and the comprehensive set of statistical and numerical libraries already available in mathematical packages such as Matlab, statistical languages such as R, and additional high-performance numerical libraries like LAPACK. As a consequence, in modern database environments, users generally export data sets to a statistical or parallel system, iteratively build several models outside the DBMS, and finally deploy the best model. Even though many DBMSs offer access to data mining algorithms via procedural language interfaces, data mining in the DBMS has mainly been limited to exploratory analysis using OLAP cubes. Most of the computation of statistical models (e.g. regression, PCA, decision trees, Bayesian classifiers, neural nets) and pattern search (e.g. association rules) continues to be performed outside the DBMS.

The importance of pushing statistical and data mining computations into a DBMS is recognized in [14]; this work emphasizes that exporting large tables outside the DBMS is a bottleneck, and it identifies SQL queries and MapReduce [53] as two complementary mechanisms to analyze large data sets.


Fig. 2 Big data analytics.

1.1 Big Data Analytics Processing

We begin by giving an overview of the entire analytic process, going from data preparation to actually deploying and managing models. Figure 1 shows the main phases and tasks in any data mining project.^3

Given the integration of many diverse data sources, the first major phase is data quality assessment and repair, which itself involves exploratory data analysis, explained below. Data quality involves detecting and solving issues with incomplete and inconsistent data.

In general, the most time-consuming phase is to process, transform and clean data in order to build a data set. Basic database data transformations are either denormalization or aggregation. Denormalization includes joins, pivoting [16], and horizontal aggregations [48,42]. Common data transformations include coding variables (e.g. into binary variables), binning (discretizing numeric variables to compute histograms), sampling [12,11] (to tackle data set size or class imbalance), rescaling and null value imputation, among other transformations.
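
As an illustration of two of these transformations, the following sketch (with hypothetical helper names) shows equal-width binning of a numeric variable and binary coding of a categorical one:

```python
def bin_value(x, lo, hi, k):
    """Discretize x into one of k equal-width bins over [lo, hi);
    values outside the range are clamped to the first or last bin."""
    if x <= lo:
        return 0
    if x >= hi:
        return k - 1
    return int((x - lo) / (hi - lo) * k)

def one_hot(value, categories):
    """Code a categorical value as a list of binary indicator variables."""
    return [1 if value == c else 0 for c in categories]

print(bin_value(35.0, 0.0, 100.0, 10))           # 3
print(one_hot("red", ["red", "green", "blue"]))  # [1, 0, 0]
```

Inside a DBMS, both transformations are typically expressed as CASE expressions or derived columns in a query, producing the flat data set a model expects.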

Figure 2 illustrates the most common^4 techniques applied for data analysis. In this survey we study techniques and methods that can be used with DBMS technology, in which data sets are structured or semi-structured (e.g. tables [homogeneous or heterogeneous] and matrices [sparse and dense]), putting less emphasis on data with weaker, dynamic and more variable structure (e.g., web pages, documents with varying degrees of completeness).

^3 SASTRY: We may need to change this figure a little bit because the arrows towards the right indicate a sequence, but they are independent steps (e.g. coding, imputation, etc., do not necessarily follow aggregation/denormalization). Change the arrows to dotted lines without arrowheads?

^4 SASTRY: These are general data analytic techniques and not necessarily big data; suggest removing the figure completely.


After a data set has been created and refined, the next task is to explore its statistical properties, in the form of counts, sums, averages and variances. Exploratory analysis includes ad-hoc queries^5, data quality assessment [18,47] (i.e. to detect and solve data inconsistency and missing information), and OLAP cubes.^6 In general, at this stage no models are computed. Thus exploratory analysis reveals the “what”, “where” and “when”, and sometimes hints towards possible correlations, but not necessarily “how” or “why”.
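
The kind of summary an OLAP-style exploratory query produces can be sketched as a group-by over chosen dimensions, accumulating a count and a sum per cell (the data and names below are hypothetical):

```python
from collections import defaultdict

# Hypothetical sales records: (region, year, amount).
rows = [
    ("west", 2013, 10.0), ("west", 2013, 20.0),
    ("west", 2014, 5.0),  ("east", 2013, 7.0),
]

def group_stats(rows, dims):
    """GROUP BY the chosen dimension columns, accumulating [count, sum]
    per cell -- the building blocks of an OLAP cube."""
    cells = defaultdict(lambda: [0, 0.0])
    for row in rows:
        key = tuple(row[d] for d in dims)
        cells[key][0] += 1
        cells[key][1] += row[-1]
    return dict(cells)

print(group_stats(rows, dims=(0,)))  # {('west',): [3, 35.0], ('east',): [1, 7.0]}
```

Varying `dims` yields the different cuboids of the cube; the same aggregation is a one-line GROUP BY query inside a DBMS.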

The next phase is computing statistical models or discovering patterns (i.e. data mining) hidden in the data set. Statistical and machine learning models include both descriptive and predictive models such as regression (linear, logistic), clustering (K-means, EM, hierarchical, spatial), classification, dimensionality reduction (PCA, factor analysis), statistical tests, time series trend analysis and, in some cases, estimation of parameters for a model structure defined a priori. Additionally, pattern recognition and search can be performed via cubes [26], association rules [3], and their generalizations, like sequences.^7
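
As one concrete example of model computation, a single K-means iteration over one-dimensional points can be sketched as follows (a toy version for illustration; the in-DBMS and parallel variants discussed in this survey are far more elaborate):

```python
def kmeans_step(points, centroids):
    """One K-means iteration: assign each point to its nearest centroid,
    then recompute each centroid as the mean of its assigned points."""
    clusters = [[] for _ in centroids]
    for p in points:
        nearest = min(range(len(centroids)), key=lambda j: (p - centroids[j]) ** 2)
        clusters[nearest].append(p)
    # Empty clusters keep their old centroid.
    return [sum(c) / len(c) if c else centroids[j] for j, c in enumerate(clusters)]

print(kmeans_step([1.0, 2.0, 9.0, 10.0], [0.0, 5.0]))  # [1.5, 9.5]
```

Iterating this step until the centroids stop moving yields the clustering; each iteration is one full pass over the data set, which is why reducing the number of iterations and passes matters so much at scale.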

Last but not least, we need to consider model management, a subtask which follows model computation, and includes scoring (model deployment, inference [31]), model exchange (with other tools, using model description languages like PMML [30]) and versioning (i.e., maintaining a history of data sets and models).

2 DAVID: Classification of Big Data Analytic Systems

We begin our discussion of database systems by attempting to classify the many ways data can be stored and analyzed for data mining purposes. The first main division is related to system functionality. To that effect we divide the universe of data systems into two categories: Database Management Systems (DBMSs) and alternative systems, which we will term non-relational (also known as No-SQL [62]). Figure 3 provides our full classification, explained below.

We start our discussion with DBMSs, whose main characteristic is the ability to handle large amounts of structured data, seamlessly and efficiently moving data between primary and secondary storage, while dealing with all other issues, such as indexing, concurrent access, data security and recovery.

Given the current state of the art, database systems can be subdivided into two main categories: Relational and Non-Relational.^8 Commonly, the first sub-

^5 SASTRY: (i.e., each query leading to a further query) – this is probably inaccurate; interactive exploration of data via different types of queries is better – it is more like a loop (query, visualize, query, loop).

^6 SASTRY: and spreadsheets [68] (which provide an interactive environment to define formulas among data elements). – we should stay away from the really small scales such as spreadsheets for now.

^7 SASTRY: Graph analysis has evolved to be a unique set of algorithms with a foundation on theoretical computer science. – This statement seems out of place.

^8 SASTRY: We probably need to mention that for Non-Relational, we are briefly covering Hierarchical, but not Network, or Object Oriented DBs.


�������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������

$���

���$���

��������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������
�������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������

���$���

Fig. 3 Classification of Big Data Analytics Systems

group comprises those systems that have a database model and data strictly organized as relations (tables). According to physical storage, relational DBMSs can be further subdivided into row-based (e.g., Postgres, IBM DB2, Oracle, Teradata, Greenplum) and column-based (e.g., C-Store, MonetDB, Hana). Older DBMS technology relied on row storage to accelerate transaction processing [65] and to optimize simple SPJ (select-project-join) queries [22]. Column stores [63] are more efficient when SPJA (select-project-join-aggregate) queries involve mostly small subsets of columns (i.e., projections).
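To make the projection argument concrete, the following back-of-envelope sketch (with hypothetical table dimensions, not figures from any benchmark) estimates the bytes scanned by a row store versus a column store for a narrow projection:

```python
# Back-of-envelope I/O model (illustrative only): a row store must
# read every row in full, while a column store reads only the
# projected columns.

def bytes_scanned_row_store(n_rows, n_cols, bytes_per_cell):
    # Row store: whole rows are fetched even for narrow projections.
    return n_rows * n_cols * bytes_per_cell

def bytes_scanned_column_store(n_rows, projected_cols, bytes_per_cell):
    # Column store: only the projected columns are fetched.
    return n_rows * projected_cols * bytes_per_cell

n, d, cell = 10_000_000, 100, 8          # hypothetical table: 10M rows, 100 columns
row_io = bytes_scanned_row_store(n, d, cell)
col_io = bytes_scanned_column_store(n, 3, cell)   # projection of 3 columns
print(row_io // col_io)   # the column store reads ~33x less here
```

The ratio grows linearly with d, which is why wide analytical tables favor columnar layouts.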

From a practical standpoint we could also divide them into two categories: those that were designed from the ground up to work as part of a parallel system with multiple storage and computing nodes, and those that use a single processing unit (usually older systems). Given the wide diversity of file systems used by DBMSs, we do not elaborate on them, but the common characteristic is the ability to perform random reads and random writes on logical blocks [22]. In almost all RDBMSs, the database engine runs as a server and parts of the data are cached in memory. Exceptions include engines such as SQLite, which can be created on demand, but such engines handle relatively smaller amounts of data.

Non-relational systems can be divided into those that work with file-based structures, such as packages in R [50], MATLAB, SAS, or custom-developed code (in C++, Java, or an equivalent high-level language). Some of these utilize local file systems (local disk or network mounted), and some use distributed file systems such as the open-source Hadoop Distributed File System (Apache HDFS), inspired by the Google file system [34]. A significant fraction of Hadoop analytic implementations also use the MapReduce programming model, originally introduced by Google [19] as well. In the MapReduce model, there is no underlying database process that manages the data; instead, all the relevant processes are created at run time.

Data present in flat files can be simple text data, complex disk-based data structures such as B-trees, arrays, etc., or complex hierarchical data sets based on HDF (Hierarchical Data Format). Hadoop/HDFS has recently become a system of choice for the distributed storage and analysis of semi-structured information, initially for documents and web pages, but gradually for various types of data, sometimes even structured data. The primary advantage of Hadoop and HDFS is that, for problems where operations on a large data set can be split into parallel, independent operations on small chunks of the data, one can utilize a distributed cluster of machines to process the data. Additionally, the data-local computation property of MapReduce (where the execution code is deployed to the nodes where the data reside) allows for the aggregation of heterogeneous computers on a wide-area network with a mix of different network speeds and processing powers. Although originally supporting only the MapReduce processing paradigm, Hadoop gained traction as the main platform in most large-scale data analytics systems outside the DBMS realm. Cheaper and faster hardware has stimulated HDFS adoption. More recently Hadoop has gained traction in processing structured data, competing with RDBMSs [17,54,62], initially by providing SQL-like interfaces for underlying MapReduce processing, and more recently via modules for data formats (columnar formats with indexes, custom data readers, etc.). Recently, different columnar storage systems have been developed, including HBase and Accumulo (both modeled after Google's BigTable), which take advantage of the underlying HDFS file system and a distributed DBMS-like interface. Furthermore, Hadoop currently supports multiple programming paradigms via the YARN scheduler, thus providing the ability to run iterative programs, long-running containers, etc.
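The MapReduce model described above can be sketched in a few lines; this toy, single-process version (word count, the canonical example) only illustrates the map/shuffle/reduce contract, not Hadoop's distributed execution:

```python
# Minimal single-process sketch of the MapReduce model: map emits
# key-value pairs, a shuffle groups them by key, reduce aggregates.
from collections import defaultdict

def map_phase(documents):
    for doc in documents:
        for word in doc.split():
            yield (word, 1)

def shuffle(pairs):
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(groups):
    return {key: sum(values) for key, values in groups.items()}

docs = ["big data systems", "big data analytics"]
counts = reduce_phase(shuffle(map_phase(docs)))
print(counts["big"], counts["data"], counts["systems"])   # 2 2 1
```

In a real deployment the map and reduce calls run on different nodes and the shuffle moves data over the network, which is where most of the framework's machinery lives.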

Non-relational databases arose from the need to analyze data that is very difficult (or even impractical) to fit in the relational model. Non-relational systems, while less refined and with less functionality than their relational counterparts, have the advantage that they require less preparation of data before loading, and they also support dynamic reconfiguration of the database. This has caused a major shift in the database usage paradigm, where users prefer to trade the computational power of a DBMS for a simplified and faster setup, especially because of the scale-out support of these systems on commodity hardware. According to their "storage model", we propose to classify them into: array, stream and graph based. Array databases are primarily oriented towards the analysis of multidimensional data, such as scientific, financial and bioinformatics data. Well-known systems include SciDB [64], Rasdaman [5] and ArrayStore [61]. Stream database systems such as Gigascope [15] are used to analyze data streams (non-static data) such as network data or video, without even storing the particular data records on secondary storage. Stream database systems support "continuous" queries, such as finding the number of users of a certain service on a sliding time window, returning results in near real time from data ingestion to the reporting stage. Graph database systems are characterized as systems that provide index-free adjacency list access: each element of the database contains pointers to its neighbors without any index lookup. Graph databases make relationships between elements explicit and intuitive, and are crucial when analyzing social networks and the Internet (as a collection of interconnected nodes). Prominent examples include Pregel [38], Giraph [4], Neo4j [37], GraphX [?], Titan [?] and Urika [33].

It is worth mentioning here that these divisions are not absolute. There has been, in recent years, a push by developers to integrate the best qualities of one system into another, such as providing native support for R and Python packages in a DBMS (for example, parallel R in the Vertica DBMS, and PL/Python in Postgres).

3 System Infrastructure

While the differences between DBMS and non-DBMS tools and approaches are not absolute, they can be compared with respect to their underlying resource demands, system configuration and infrastructure. These are affected by the approaches used for storage, indexing, parallel processing, and distributed processing.

3.1 Storage and Indexing

Storage in traditional DBMSs is built around blocks of rows or columns of mainly simple data types (e.g., numbers and strings), with similar sizes for the elements across columns. All cells within a column are of the same type and usually of similar size. Additionally, DBMSs provide mechanisms to store large objects in a character or binary format. Irrespective of the underlying data types, these columns need to be fully specified either at the time of creating the table, or via alterations to the table prior to the insertion of data.

On the other hand, non-DBMS solutions range widely in terms of how they organize the underlying data in rows and columns. The major types of storage layouts used by NoSQL databases are (a) key-value stores, which are in essence distributed dictionaries; (b) document-oriented databases, where the columns correspond to the different attributes of each document; (c) columnar stores, where each column is stored separately, either completely or at the level of a chunk; (d) hybrid columnar stores, which are row-oriented but support a large number of dynamic columns; and (e) chunked storage, which is appropriate for structured data with a regular structure. Different analytical tasks require different storage layouts for I/O efficiency, as each approach has different strengths.

– Row storage: This is the most common storage layout, in which X is stored in a tabular format with n rows, each having d columns. Several rows are generally stored together in a single block. When d is relatively small, the structure is close to that of a traditional DBMS.

– Columnar and document storage: Columnar storage is a layout tailored to analytical query processing [65]. In this case the data set is stored by column across several blocks, with each column stored separately, which enables efficient projection of subsets of columns. When each column is stored separately, keyed by the primary key (or subscript) of each record together with the corresponding column value, this type of storage is close to a document-oriented store. In document-oriented databases, the number of columns corresponding to the different attributes of the documents may sometimes be higher than the total number of documents in the database. This may manifest as tables with millions of rows and millions of columns, resembling a matrix of medium density. In some cases, columns are stored separately at the level of individual chunks of a table (e.g., for every 10,000 rows); this approach is employed in the ORC format used by Hive.

– Hybrid columnar storage: In this case the data set is stored by rows, but the columns can be arbitrarily numerous (typically millions of columns), with each row containing a small subset of the columns. Contents of a single row are stored together (either for the entire row or, at a minimum, for a "column family"). This storage can support queries across row keys, and subsequent fetching of cells based on the column names.

– Key-value storage: (Note: this is not the default storage in HDFS.) This is in essence a distributed dictionary, where records are accessed via a key lookup and subsequent processing of the values. The sizes of these databases are usually billions of rows and two columns. This structure supports queries with a small set of "get" and "put" semantics for reading and writing the values corresponding to a key. In the relational world this layout is the so-called transaction data set, stored on a normalized table with one attribute value per row. In the transaction layout, for efficiency purposes, only a subset of dimensions for each record is actually stored (e.g., excluding binary dimensions equal to zero).

– Chunk storage: This is a novel mechanism to store multidimensional arrays in large blocks, pioneered by SciDB [64]. Chunk storage is ideal for dense and sparse matrices, as well as spatial data.
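As a minimal illustration of the key-value layout in the list above, the following toy store (node count and keys are hypothetical) reduces data access to hash-partitioned get/put operations:

```python
# Toy key-value store with hash partitioning across N "nodes"
# (each node is just a local dict here), mirroring the get/put
# semantics of the key-value layout.

class ToyKeyValueStore:
    def __init__(self, n_nodes=4):
        self.nodes = [dict() for _ in range(n_nodes)]

    def _node_for(self, key):
        # Hash partitioning decides which node owns the key.
        return self.nodes[hash(key) % len(self.nodes)]

    def put(self, key, value):
        self._node_for(key)[key] = value

    def get(self, key, default=None):
        return self._node_for(key).get(key, default)

store = ToyKeyValueStore()
store.put("user:42", {"clicks": 7})
print(store.get("user:42"))   # {'clicks': 7}
```

Because every operation touches exactly one partition, this layout scales out trivially, which is also why it supports only point lookups rather than rich queries.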

We must note two other major approaches to large-scale data storage. The first relates to the use of formatted files on a distributed file system (e.g., the Hive ORC or Parquet formats) that provide block- or chunk-level indexing without requiring any data management service. The second relates to the use of hierarchical formats (e.g., NetCDF and HDF) that provide random access to distributed data through the file system interface. The second approach is primarily used for scientific data sets, which are typically dense.

The choice of a specific layout mechanism impacts the indexing scheme that can be used in querying the data, and also impacts the storage and processing costs associated with maintaining the index when the database is updated with heavy writes. In traditional DBMSs, indexes are maintained for important simple columns, while indexes over individual elements of a complex binary structure field are impractical to maintain. In the case of NoSQL databases, the underlying row keys or document IDs are sorted and/or indexed (often at the granularity of a chunk). In hybrid columnar databases, the columns themselves are sorted based on the column name, which allows quickly searching for a cell based on a row key and a column name. In contrast, RDBMSs typically maintain the order of the columns in a table. Recent work in this area suggests dynamic hybrid solutions matched to the data schema and the analysis needs. For example, the Dual Table approach recommends using HBase for current data (which can change substantially at high speed; heavy writes), and Hive ORC tables for slightly aged data (e.g., data older than one week, which may have infrequent changes).
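Chunk-granularity indexing of the kind used by ORC-style formats can be sketched as follows; the chunk size and data are synthetic, and real formats keep richer statistics than the min/max pairs shown here:

```python
# Sketch of chunk-level (block-level) min/max indexing: a range
# predicate can skip whole chunks whose [min, max] interval cannot
# possibly match, so only a fraction of the file is scanned.

def build_chunk_index(values, chunk_size):
    index = []
    for start in range(0, len(values), chunk_size):
        chunk = values[start:start + chunk_size]
        index.append((start, min(chunk), max(chunk)))
    return index

def chunks_to_scan(index, lo, hi):
    # Keep only chunks whose value range overlaps [lo, hi].
    return [start for start, cmin, cmax in index if cmax >= lo and cmin <= hi]

data = list(range(1000))            # sorted synthetic column
idx = build_chunk_index(data, 100)  # 10 chunks of 100 values each
print(chunks_to_scan(idx, 250, 260))  # only the chunk starting at 200 qualifies
```

The effectiveness of this pruning depends on the data being clustered on the filtered column; on randomly ordered data most chunks overlap most predicates.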

3.2 Data Retrieval and Mathematical Processing

In the context of large-scale data retrieval and data analysis, matrix manipulation can be considered a reference problem, because a large number of mathematical modeling problems and machine learning models require some form of matrix manipulation. Given a matrix, three row storage layouts stand out: a standard horizontal layout having one row per input record, a long pivoted table with one attribute value per row, and a transposed table having records as columns. Each layout provides tradeoffs among I/O speed, space and programming flexibility, depending on the required mathematical computation, the data set dimensionality and the sparsity of the matrices.
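The tradeoff between the horizontal layout and the long pivoted layout can be illustrated with a small round-trip conversion; the column names and values are hypothetical:

```python
# Converting between the horizontal layout (one row per record) and
# the long/pivoted layout (one attribute value per row).

def to_pivoted(rows):
    # Emit (record_id, attribute, value) triples; for sparse data,
    # zero entries could be dropped here, which is the main appeal
    # of the pivoted layout.
    return [(i, attr, val)
            for i, row in enumerate(rows)
            for attr, val in row.items()]

def to_horizontal(triples):
    records = {}
    for i, attr, val in triples:
        records.setdefault(i, {})[attr] = val
    return [records[i] for i in sorted(records)]

X = [{"x1": 1.0, "x2": 0.0}, {"x1": 2.0, "x2": 5.0}]
long_form = to_pivoted(X)
print(len(long_form))                 # 4 (record, attribute, value) triples
print(to_horizontal(long_form) == X)  # True: the round trip preserves X
```

The pivoted form triples the per-cell storage overhead (it repeats the record id and attribute name), but for very sparse matrices dropping the zeros more than pays for it.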

Column stores provide limited flexibility in layouts to store matrices. On the other hand, a vertical row layout is the most common format for key-value stores.

There has been a surge in array database systems, like SciDB [9] and ArrayStore [61], but with storage and query evaluation mechanisms different from SQL. Such systems are making some impact on the academic world, complementing R and MATLAB, but their future impact on corporate databases is uncertain.

The main differences between the various storage and retrieval strategies result in differences in performance under different system configuration scenarios. For instance, a traditional DBMS may scale well for multiple cores on a single machine, or on a small set of multicore servers, each with large memory. In this case, the DBMS can mostly exploit shared-everything properties and optimize memory usage for faster query processing. We refer to this scenario as "scale-up" processing. On the other hand, when distributed processing is performed using a cluster with a large number of relatively lower-scale commodity servers, there is limited flexibility in optimizing all available CPU and memory towards one specific query, because these clusters are usually tuned to work in a "shared-nothing" configuration. We refer to this scenario as "scale-out", and it is most common in Hadoop MapReduce and its variants.

In the "scale-up" scenario, the underlying system can allocate resources towards executing a given query in full and can mostly provide tight time ranges for completing a task. However, in the "scale-out" scenario with a shared-nothing architecture, each worker processes its data independently, and the underlying system cannot guarantee tight bounds on the completion time of all the workers. In fact, it is not unusual for jobs on Hadoop to run with unusually long tails in the execution time: a small set of mappers/reducers (or, in general, YARN containers) may take a long time to complete, and the entire job gets delayed. In a multicore setup ("scale-up"), the underlying system can provide additional resources (CPU, memory, etc.) to expedite job completion. In the shared-nothing environment, most clusters can at most run multiple copies of the last set of processes, without substantial changes to the amount of memory or CPU available, so there are no guaranteed tight ranges for job completion.
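The long-tail effect can be sketched with synthetic worker times: under a shared-nothing barrier, the job finishes with its slowest worker, not with the average one:

```python
# Sketch of the "long tail" in a shared-nothing job: the stage
# barrier waits for all workers, so one straggler dominates the
# completion time. Worker times below are synthetic.

def job_completion_time(worker_times):
    # Workers run independently; the barrier at the end of the
    # stage waits for the maximum.
    return max(worker_times)

times = [10.0] * 15 + [42.0]      # 15 healthy workers, one straggler
print(job_completion_time(times)) # 42.0, not ~10
print(sum(times) / len(times))    # the mean is only 12.0
```

This is why frameworks resort to speculative execution, i.e., launching duplicate copies of slow tasks and taking whichever finishes first.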

3.3 Parallel Processing

Currently, the best parallel architecture for DBMSs and non-relational systems (prominently represented by Hadoop) is shared-nothing, having N processing units. These processing units can be N threads on a single CPU server (including multicore CPUs sharing RAM, but with different partitions on disk) or N physical nodes (each one with its own RAM and secondary storage). A cluster of N nodes is strictly shared-nothing. On the other hand, given modern multicore CPU architectures, there is some level of shared memory among threads in the L1 or L2 cache; in general, however, the L1 cache is automatically managed by the CPU via hyper-threading [36]. In row stores, tables are horizontally partitioned into sets of rows for parallel processing. Similarly, in a column store each projection is horizontally partitioned as well, using an implicit row id, and columns are partitioned into blocks, which are in turn distributed among the processing units. Thus there exist sequential and parallel versions of the database physical operators: table scan, external merge sort and joins. MapReduce [19,62] trails behind DBMSs in evaluating such parallel physical operators on equivalent hardware, although the gap has been gradually shrinking with RAM-based caching and processing [70]. Table 1 compares the architecture of single-node and multiple-node systems, including the now obsolete shared-everything architecture.
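A horizontally partitioned parallel table scan can be sketched with a thread pool; this illustrates the operator pattern described above, not any particular engine's implementation:

```python
# Sketch of a parallel table scan over horizontal partitions: each
# worker scans one partition and the partial results are combined.
from concurrent.futures import ThreadPoolExecutor

def scan_partition(rows, predicate):
    # Sequential scan of one horizontal partition.
    return sum(1 for row in rows if predicate(row))

def parallel_count(table, predicate, n_partitions=4):
    size = (len(table) + n_partitions - 1) // n_partitions
    parts = [table[i:i + size] for i in range(0, len(table), size)]
    with ThreadPoolExecutor(max_workers=n_partitions) as pool:
        partials = pool.map(lambda p: scan_partition(p, predicate), parts)
    return sum(partials)   # merge partial counts, as a parallel operator would

table = [{"amount": a} for a in range(1000)]
print(parallel_count(table, lambda r: r["amount"] >= 900))   # 100
```

The same partition-then-merge shape underlies parallel sorts and joins, with the merge step becoming correspondingly more elaborate.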



Table 1 Comparing parallel architectures.

                         Single node   N-node shared-nothing   N-node shared-everything
  Cache                  Shared        Distributed             Local
  RAM                    Shared        Distributed             Shared
  Disk                   Shared        Distributed             Shared
  File system            Unique        Distributed             Shared
  Non-local data access  NA            Message passing         Locking

3.4 Data Manipulation and Query Languages

Query languages are computer languages used to retrieve information from databases and other information systems. SQL (Structured Query Language) is the standard language used in relational databases, since it adheres quite closely (although not completely) to the relational model [13]. Although SQL was standardized by the American National Standards Institute in 1986, it is not completely portable from one DBMS to another, due to ambiguities in the standard that lead to a measure of lock-in. While SQL is widely used in industry, several alternatives have been proposed, such as OQL (Object Query Language), JPQL (Java Persistence Query Language) and HTSQL (Hyper Text SQL).

The particular needs of non-relational database systems gave rise to a family of query languages commonly referred to as NoSQL (Not Only SQL). NoSQL systems sometimes sacrifice consistency in favor of availability, and may also sacrifice ACID properties, though these systems are evolving towards supporting ACID. Major NoSQL systems provide BASE (Basically Available, Soft state, Eventually consistent) as an alternative to ACID, trading strong consistency for availability and partition tolerance; the responsibility for handling stale data and for guaranteeing consistency then often lies with the application.
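A minimal sketch of eventual consistency, assuming a last-write-wins policy with logical timestamps (one of several reconciliation strategies that BASE systems use):

```python
# Sketch of eventual consistency with last-write-wins reconciliation:
# two replicas accept writes independently and converge when they
# exchange (timestamp, value) pairs. Timestamps here are logical.

def merge(replica_a, replica_b):
    # Keep, per key, the value with the highest timestamp.
    merged = dict(replica_a)
    for key, (ts, val) in replica_b.items():
        if key not in merged or ts > merged[key][0]:
            merged[key] = (ts, val)
    return merged

a = {"cart": (1, ["book"])}            # write at logical time 1
b = {"cart": (2, ["book", "pen"])}     # later write on another replica
converged = merge(a, b)
print(converged["cart"][1])            # ['book', 'pen'] wins
```

Last-write-wins silently discards concurrent updates, which is exactly the kind of stale-data handling that BASE pushes up to the application.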

The rise of distributed file systems such as HDFS and the node-local computing paradigm of MapReduce and its variants resulted in the development of SQL-like interfaces and languages on top of MapReduce. These languages are not SQL compliant, but provide some of the functionality of SQL (typically select, insert/overwrite, join, filter, etc.). Apache Hive, developed originally by Facebook and currently an Apache project, provides HiveQL, an SQL-like interface. Pig Latin, developed originally by Yahoo and currently an Apache project, provides some SQL functionality via a flow description language. Both Pig and Hive transform a given query or set of data processing commands into MapReduce tasks (and, more recently, into jobs for in-memory distributed systems such as Spark). There are various query languages/interfaces for accessing different types of structured and unstructured data sets on a distributed cluster. Notable among them are Impala and Presto (Hive-like engines with dedicated processes and services), Phoenix (an SQL layer for HBase with substantial compatibility with SQL ACID properties), Hive/Stinger (approximating in-memory MapReduce via "warm" containers), and Spark SQL, which runs on Spark.

Finally, we must mention Google's Pregel, a system oriented towards the evaluation of queries on graphs. Pregel is a model for parallel processing of graphs with added fault tolerance. It uses a Master/Slave model where each Slave is assigned a subset of the graph's vertices. The Master's job is to coordinate and partition the graph between the Slaves, as well as to assign sections of the input. While Pregel is proprietary, there are open-source alternatives, such as GPS (Graph Processing System), Apache Giraph, and TitanDB.
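The vertex-centric superstep loop behind Pregel can be sketched as follows, here propagating the maximum vertex value through a toy graph (a simplified, single-process rendition of the model, without fault tolerance or the Master/Slave split):

```python
# Pregel-style "think like a vertex" sketch: in each superstep a
# vertex reads its incoming messages, updates its value (here, the
# maximum seen so far), and messages its neighbors. The computation
# halts when no messages remain, i.e., all vertices vote to halt.

def pregel_max(graph, values):
    # graph: vertex -> list of neighbor vertices
    # Superstep 0: every vertex messages its value to its neighbors.
    inbox = {v: [] for v in graph}
    for v in graph:
        for w in graph[v]:
            inbox[w].append(values[v])
    while any(inbox[v] for v in graph):
        outbox = {v: [] for v in graph}
        for v in graph:
            incoming = max(inbox[v], default=values[v])
            if incoming > values[v]:        # value improved: update and
                values[v] = incoming        # message the neighbors
                for w in graph[v]:
                    outbox[w].append(incoming)
        inbox = outbox
    return values

g = {1: [2], 2: [1, 3], 3: [2]}             # a small path graph
print(pregel_max(g, {1: 5, 2: 1, 3: 9}))    # every vertex converges to 9
```

In a real Pregel deployment each superstep is separated by a global barrier and the messages cross the network between workers; the per-vertex logic is what the user supplies.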

3.5 Programming Alternatives

We distinguish three major types of approaches for programming data mining algorithms to analyze large data sets: (1) inside the RDBMS, with SQL queries and UDFs, without modifying the RDBMS source code; (2) extending the capabilities of the RDBMS through a systems language like C++, by modifying the RDBMS source code; (3) outside the RDBMS, with programming languages in external systems, including parallel systems.

The first alternative allows processing data "in situ", avoiding exports outside the DBMS. The second one is feasible for open-source DBMSs and for developers of industrial DBMSs. The last alternative is the most flexible, allowing many customized optimizations as well as easily adding ad-hoc code or scripts into the pipeline. It also has the shortest development time and therefore has been the most popular in large-scale data analysis.

Performing data mining with SQL queries and UDFs has received moderate attention. A major SQL limitation is its inability to manipulate non-tabular data structures requiring pointer manipulation. However, for specific domains such as spatial data analysis, stored procedures have been used to provide significant functionality for "in situ" analysis of spatial and temporal trajectories.

When data mining computations are evaluated with SQL queries, a program automatically creates queries combining joins, aggregations and mathematical expressions. The methods to generate SQL code are based on the input data set, the algorithm parameters and tuning query optimizations. SQL code generation exploits relational query optimizations, including forcing hash joins [46], clustered storage for aggregation [46,45], denormalization to avoid joins and reduce I/O cost [43], pushing aggregation before joins to compress tables, and balancing row distribution among processing units.
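A minimal sketch of such SQL code generation, assuming a hypothetical table X and emitting one aggregation query for summary statistics (counts, linear sums and pairwise products) of the kind many statistical models consume:

```python
# Sketch of SQL code generation for a data mining computation: given
# the column names of a data set, emit one aggregation query that
# computes n, the per-column sums, and the pairwise product sums.
# Table and column names are hypothetical.

def gen_sufficient_stats_sql(table, cols):
    exprs = ["COUNT(*) AS n"]
    exprs += [f"SUM({c}) AS s_{c}" for c in cols]
    exprs += [f"SUM({a}*{b}) AS q_{a}_{b}"
              for i, a in enumerate(cols)
              for b in cols[i:]]          # upper triangle: products are symmetric
    return "SELECT " + ", ".join(exprs) + f" FROM {table};"

sql = gen_sufficient_stats_sql("X", ["x1", "x2"])
print(sql)   # one SELECT with n, 2 linear terms and 3 quadratic terms
```

The generator, not the analyst, decides how terms are grouped into queries, which is exactly where the join- and aggregation-level optimizations mentioned above are applied.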

This alternative covers the scenario where most processing on large tables is done inside the DBMS, and the remaining partial processing is done on relatively smaller tables exported outside the DBMS [57] to an external mathematical library or statistical program. Given DBMS overhead and SQL's lack of flexibility to develop algorithmic optimizations, SQL is generally slower than a systems language like C++ for straightforward cases.

Overall, SQL has been enriched with object-relational extensions like User-Defined Functions (UDFs), stored procedures, table-valued functions, and User-Defined Types, with aggregate UDFs being the most powerful for data mining. UDFs represent a compromise between modifying the DBMS source code, extending SQL with new syntax and generating SQL code. They are an important extensibility feature, now widely available (although not standardized) in most DBMSs.

UDFs include scalar, aggregate and table functions, with aggregate UDFs providing a simplified parallel programming interface. UDFs can take full advantage of the DBMS processing power when running in the same address space as the query optimizer (i.e., in unfenced [57] or unprotected mode). UDFs do not increase the computational power of SQL, but they make programming easier and many optimizations feasible, since they are basically code fragments or modules natively supported by the system.
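The contract that makes aggregate UDFs parallelizable can be sketched as follows; the method names mirror common DBMS conventions (init/accumulate/merge/finalize) but are illustrative, not any vendor's exact API:

```python
# Sketch of an aggregate UDF contract: the engine calls accumulate
# on each data partition independently and merge across partitions,
# which is what makes aggregate UDFs naturally parallelizable.

class AvgUDF:
    def __init__(self):           # init: per-partition state
        self.count, self.total = 0, 0.0

    def accumulate(self, value):  # called once per row in a partition
        self.count += 1
        self.total += value

    def merge(self, other):       # combines states from two partitions
        self.count += other.count
        self.total += other.total

    def finalize(self):           # produces the final aggregate
        return self.total / self.count

# The "engine": two partitions processed independently, then merged.
left, right = AvgUDF(), AvgUDF()
for v in [1.0, 2.0, 3.0]:
    left.accumulate(v)
for v in [4.0, 5.0]:
    right.accumulate(v)
left.merge(right)
print(left.finalize())   # 3.0
```

Any aggregate whose state merges associatively fits this interface; aggregates that do not (e.g., exact medians) need larger intermediate state or approximation.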

Data structures, incremental computations and matrix operations are certainly difficult to program with SQL and UDFs, due to the relational DBMS architecture, the limited flexibility of the SQL language and the overhead of the DBMS subsystems, but they can be optimized and they can achieve a tight integration with the DBMS. SQL can achieve linear scalability in data set size and dimensionality when computing several models. It has been shown that SQL is somewhat slower (not an order of magnitude slower) than C++ (or a similar language) [43], with relative performance improving as the data set size increases (because overhead decreases). There is research on adapting and optimizing numerical methods to work with large disk-resident matrices [52,69], but only a few works are in the DBMS realm [45,32].

Developing fast algorithms in C++ or Java was the most popular alternative in the past. This was no coincidence because a high-level programming language provides freedom to manipulate RAM and minimize I/O. Modifying and extending the DBMS source code (open source or proprietary) yields algorithms tightly coupled to the DBMS.

This alternative has been the most common for commercial DBMSs and includes extending SQL with new syntax, as well as low-level APIs to directly access tables through the DBMS file system (cursors [57], blocks of records with OLEDB [41]). It provides fast processing, but also has disadvantages. Given the complex DBMS architecture and the lack of access to DBMS source code, it is difficult for end users and researchers to incorporate new algorithms or change existing ones. Nowadays, this alternative is not popular for parallel processing.

11 SASTRY: If a regular user tries to write custom code for optimized joins, it would not work well.
12 But as hardware accelerates and memory grows, such a gap in speed becomes less important. – this is a very general statement, and can be removed.
13 SASTRY: Looks like a lot of repetition – we have moved away from DBMS vs non-DBMS to speeds of implementation based on language, which is a big topic in itself.


DBMSs and Alternative Systems: Scale up vs scale out 15

The Hadoop ecosystem provides many means to process large data sets in parallel outside a DBMS, the most common of which is currently MapReduce [62], either as direct MapReduce code or via higher-level frameworks such as Hive [70] and Pig. For the most part a cloud computing model is the norm, where the cloud offers hardware and software as a service.14 Even though some DBMSs have achieved some level of integration between SQL and MapReduce (e.g., Greenplum [32], HadoopDB [2], Teradata AsterData [24]), in most cases data must be moved from the DBMS storage to the MapReduce file system. Some recent systems that have enabled SQL capabilities on top of the distributed Hadoop file system, integrating some information retrieval capabilities into SQL, include ODYS [67] and AsterixDB [6].15

In statistics and numerical analysis16 the norm is to develop programs in non-systems languages like R, Matlab, SAS and Fortran, which do not scale to large data sets.17 Thus analysts generally split computations between such packages and MapReduce [17], pushing demanding computations with heavy I/O on large data sets into MapReduce. In general, the direct integration of mathematical non-systems languages with Hadoop remains limited; however, Hadoop MapReduce provides a “streaming” interface, where custom scripts or programs can be called within the MapReduce program. The RIOT [71] system introduces a middleware solution to provide transparent I/O in the R statistical package with large matrices (i.e. the user does not need to modify R programs to analyze large matrices). RIOT does address the scalability problem, but outside the DBMS. Unfortunately, RIOT cannot efficiently analyze data inside a DBMS.
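The “streaming” contract mentioned above lets any script act as a mapper or reducer by consuming lines and emitting key/value pairs; a minimal word-count sketch of that contract (run here over an in-memory list rather than stdin, with invented input data):

```python
from itertools import groupby

def mapper(lines):
    # Emit (word, 1) pairs, as a streaming mapper would print "word\t1"
    for line in lines:
        for word in line.split():
            yield (word, 1)

def reducer(pairs):
    # Hadoop delivers mapper output grouped by key; sorted+groupby mimics
    # the shuffle/sort phase between map and reduce
    for key, group in groupby(sorted(pairs), key=lambda kv: kv[0]):
        yield (key, sum(v for _, v in group))

counts = dict(reducer(mapper(["big data", "big tables"])))
print(counts)  # {'big': 2, 'data': 1, 'tables': 1}
```

In an actual Hadoop streaming job the same two functions would read stdin and write stdout in separate processes; the framework supplies the sort between them.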

3.6 CARLOS: Algorithm Design and Optimizations

Efficient algorithms are central to database systems. Their efficiency depends on the data set layout, iterative behavior, numerical method convergence and how much primary storage (RAM) is available. There are two major goals: decreasing I/O and accelerating convergence. Most statistical algorithms have iterative behavior, requiring one or more passes over the data set per iteration [45]. The most common approach to accelerate an algorithm is to reduce the number of passes scanning the large data set, without decreasing (or even increasing) model quality (e.g. log-likelihood, squared error and node purity). The rationale is to decrease I/O cost, but also to reduce CPU time. Many algorithms attempt to reduce the number of passes to one,

14 SASTRY: Most Hadoop installations are local, in-house.
15 SASTRY: Hive does not belong in this place because it is a simple conversion layer that mimics SQL and supports only a small part of SQL.
16 SASTRY: Numerical analysis is a big beast; we have MPI-based programming as the main norm (e.g. for computational fluid dynamics), with data stored in HDF, NetCDF, etc. (e.g. data for weather or climate modeling).
17 SASTRY: We should remove Fortran from here – it is mostly on how we write the codes; climate models run for months or years on large clusters.


processing the data set in sufficiently large blocks. Nevertheless, additional iterations on the entire data set are generally needed to guarantee convergence to a stable solution. Reducing the number of passes is highly related to incremental algorithms, which can update the model as more records (points) are processed. Incremental algorithms with matrices on secondary storage are an important solution. Algorithms updating the model online based on gradient-descent methods have been shown to benefit several common models [32] and they can be integrated as a UDF into the DBMS architecture. Stream processing [28,25,66] represents an important variation of incremental algorithms, taken to the extreme that each point is analyzed once, or even skipped when the stream arrives at a high rate. Building data set summaries that preserve its statistical properties is an essential optimization to accelerate convergence. Most algorithms build summaries based on sufficient statistics, most commonly for the Gaussian density and the Bernoulli distribution. Sufficient statistics have been extended and generalized considering independence and multiple subsets from the data set, benefiting a big family of linear models. Data set squashing keeps high-order moments (sufficient statistics beyond power 2) on binned attributes to compress a data set in a lossy manner; statistical algorithms on squashed data need to be modified to consider weighted points.
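The incremental, block-at-a-time style described above can be sketched in a few lines: maintain the Gaussian sufficient statistics (count n, linear sum L, quadratic sum Q) per block, and recover the model parameters at any point without rescanning the data (the blocks and values below are invented):

```python
def update(stats, block):
    """Fold one block of points into the sufficient statistics (n, L, Q)."""
    n, L, Q = stats
    for x in block:
        n += 1
        L += x
        Q += x * x
    return (n, L, Q)

def model(stats):
    """Recover mean and variance from the summaries alone."""
    n, L, Q = stats
    mean = L / n
    return mean, Q / n - mean * mean

stats = (0, 0.0, 0.0)
for block in [[1.0, 2.0], [3.0, 4.0]]:  # blocks would stream from disk
    stats = update(stats, block)
print(model(stats))  # (2.5, 1.25)
```

The summaries are tiny and additive, which is why the same pattern parallelizes cleanly as an aggregate UDF or a MapReduce combiner.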

Sampling (including randomized algorithms and Monte Carlo methods) is another mechanism that can accelerate processing by computing approximate solutions (only when an efficient sampling mechanism exists, with time complexity linear in sample size), but it always requires additional samples (e.g. bootstrapping) or extra passes over the entire data set to reduce approximation error: skewed probabilistic distributions and extreme values impact sampling quality. Data structures tailored to minimize I/O cost by using RAM efficiently are another important mechanism. The most common data structures are balanced trees and tries, followed by hash tables. Trees are used to index transactions (FP-tree), itemsets (with a hash-tree combination) or to summarize points.
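One-pass sampling with cost linear in sample size can be sketched with the classic reservoir algorithm; this is a generic textbook technique, not one proposed by the surveyed papers:

```python
import random

def reservoir_sample(stream, k, rng):
    """Keep a uniform random sample of k items in a single pass."""
    sample = []
    for i, x in enumerate(stream):
        if i < k:
            sample.append(x)
        else:
            # Item i survives with probability k/(i+1)
            j = rng.randrange(i + 1)
            if j < k:
                sample[j] = x
    return sample

rng = random.Random(42)
print(reservoir_sample(range(1_000_000), 5, rng))
```

Note the limitation the text points out: reducing the approximation error still requires either a larger reservoir or extra passes, and skewed distributions degrade sample quality.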

4 Examples of Important Systems

Table 2 summarizes important systems, including storage, query language and parallelism. Table 4, complementing Table 2, outlines extensibility programming mechanisms, operating system and finest level of storage.18

18 SASTRY: New advances are in the table


Table 2 Important systems: storage type, query language, parallelism.

Name | Storage type | Query language | Shared-nothing, N nodes
DB2 | Row Store | SQL:2011 | Y
MySQL | Row Store | SQL:1999 | Y, sharding
PostgreSQL | Row Store | SQL:2008 | N
SQLServer | Row Store | Transact SQL | N, Y in PDW
Oracle | Row Store | SQL:2008 | Y
Greenplum | Row Store | SQL:2008 | Y
Teradata | Row Store | SQL:1999 and T-SQL | Y
Hana | Column Store | SQL:92 | Y
InfiniDB | Column Store | SQL:92 | Y
MonetDB | Column Store | SQL:2008, SciSQL | N
Vertica | Column Store | SQL:1999 | Y
Accumulo | Column-oriented storage on top of HDFS | Java, also supports Thrift interface | Y
Hbase | Column-oriented storage on top of HDFS | (J)Ruby's IRB, Java; also supports Thrift and REST interfaces | Y
Impala | Multiple formats on top of HDFS | SQL:92 | Y
AllegroGraph | Graph DB | SPARQL 1.1, Prolog, RDFS++ | N
InfiniteGraph | Graph DB | Gremlin | N, multithreaded
Neo4j | Graph DB | Cypher, Gremlin, SparQL | N
Oracle Spatial/Graph | Graph DB | Apache Jena, PL/SQL | N, shared-everything
GeoRaster | Array DB | PL/SQL | N, shared-everything
PostGIS | Array DB | SQL:2008 | N
Rasdaman | Array DB | RaSQL | N
SciDB | Array DB | AQL | Y
InfoSphere Streams | Stream DB | SPL | N
Samza | Stream DB | Stream processing API | N, task-parallel
Storm | Stream DB | Stream processing API | Y

Table 3 More recent systems

Name | Storage type and querying interface | Notes
Presto | Multiple formats on top of HDFS; SQL-like interface | Similar to Impala
TitanDB | Graph DB; HDFS (via HBase or Cassandra); query via Gremlin language | Leverages HDFS for graph problems
Google Dremel | Proprietary | An interface for rapid queries
Google Spanner | Proprietary | Globally distributed database
Google F1 | Proprietary | Hybrid database with NoSQL + SQL features
Apache Drill | | Open source alternative to Dremel
Pivotal HAWK | HDFS and uses an SQL variant for queries | RDBMS on top of HDFS
Hadapt | | RDBMS on top of HDFS
Apache Phoenix | SQL | SQL layer over HBase
Elastic search | Indexing of documents and data on HDFS |


Data management approach | Main structure | Typical data dimensions | Storage format | Prevalent storage structure | Scaling | Optimal for | Poor for | Not applicable for | Examples
RDBMS | Structured data; normalized tables | Many rows, medium-size columns | Custom binary format | Local disk with in-memory cache | Scale up; limited scale out | Transactions, querying based on indexed columns | Queries that involve full table scans (non-indexed columns) | Unstructured data | Oracle, DB2, Teradata, Postgres, MySQL
Key-value stores | Tuples | Many rows, two columns | Custom dictionary data structures | Distributed storage with local, in-memory cache | Scale up and scale out | Queries based on keys/key ranges | Searches over values or ranges of keys | Multiple columns | Memcached, Redis
Document-oriented columnar stores | Loosely structured data; chunked storage with columns stored in stripes | Many rows, many columns (columns can be more than rows) | Custom binary format | Distributed storage with replication | Scale up and scale out | Queries for selecting subsets of columns | Queries that return all columns, or when predicates use multiple columns | Documents | MongoDB
Hybrid tables | Loosely structured data; chunked storage with rows stored together | Many rows, many columns | Custom binary format | Chunked storage across different servers with in-memory cache for writes | Scale up and scale out | Range queries over keys | Queries over column values | | HBase, Accumulo, BigTable
Distributed file structure | Loosely structured data; directory hierarchies as partitions | Many rows, medium-size columns | User-defined format | Folders distributed across different servers with no at-rest CPU resources used | Scale up and scale out | Large scans across partitions | Interactive queries or small-scale queries | Selecting a single cell at a time | Hive


4.1 DAVID: DBMSs

Relational DBMSs

Relational DBMSs supporting SQL, ACID properties and parallel processing remain the dominant technology for transaction processing and ad-hoc querying. By far, row DBMSs still dominate the analytic market, but column DBMSs are gaining traction. Since column DBMSs have proven to be much faster than row stores for exploratory analysis, most row DBMSs have incorporated some level of support for column storage. However, their performance cannot match a column DBMS designed and developed from scratch [1]. In the future it is expected that column DBMSs will gain wider adoption.

Non-Relational Systems: Stream Based Databases

As previously mentioned, stream-based databases perform continuous analytics on non-static data. Stream systems are designed to enable aggressive analysis of information and knowledge, processing very large volumes of data from diverse sources with low latency in order to extract information. They are designed to respond in real time to events and changing requirements, analyze data continuously, adapt to changing data types, and deliver real-time computing functions in order to learn patterns and predict outcomes, while providing information security and confidentiality.

Streams are used in the field of communications, where they are commonly used to deal with the enormous amount of data generated by cell phone communications [59]; in financial services [10,20], where they are used for the fast analysis of large volumes of data to deliver near real-time business decisions; and in transportation, for intelligent traffic management [7], to name a few.

Notable examples of stream databases are IBM's InfoSphere [55], AT&T Labs' Gigascope [15] and Borealis from MIT [65].
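The continuous, low-latency aggregation these systems perform can be sketched as a sliding window over an unbounded stream; the window size and values here are invented, and real engines add out-of-order handling and load shedding on top of this core idea:

```python
from collections import deque

def sliding_avg(stream, window):
    """Emit the running average over the last `window` values of a stream."""
    buf = deque(maxlen=window)  # old values fall out automatically
    for x in stream:
        buf.append(x)
        yield sum(buf) / len(buf)

print(list(sliding_avg([10.0, 20.0, 30.0, 40.0], window=2)))
# [10.0, 15.0, 25.0, 35.0]
```

Each input produces an output immediately, rather than after a batch scan, which is the defining contrast with the one-pass algorithms discussed earlier.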

Non-Relational DBMSs: Graph Analytics

Graphs are an important field of research in discrete mathematics and theoretical computer science. They make it possible to represent complex problems in an intuitive way. Most systems to analyze graphs are purpose-built, using an adjacency matrix or an adjacency list and managing the whole graph in memory. However, large graphs, such as those representing social networks or the world wide web, with node counts ranging from hundreds of millions to billions, require data management mechanisms for storage and querying. Graph database systems have been proposed to deal with this problem [60].

Pregel [38] was pioneered by Google to mine graphs, inspired by the Bulk Synchronous Parallel model [23]. Apache Giraph, another iterative system, was developed as an open source counterpart to Pregel [4]. Neo4J was developed by Neo Industries as an open source project to analyze databases, incorporating full ACID transaction processing. It stores data in nodes connected by directed


Table 4 Important systems: extensibility programming mechanisms, operating systems and physical storage.

Name | Extensibility | OS | Physical storage
DB2 | Support for scalar and table UDFs in C++, Java and Basic | Linux, Unix, MS Windows | Data is arranged in pages of 4, 8, 16 or 32 kB in size; mixed page sizes are allowed
MySQL | Support for aggregate UDFs written in C and C++ | FreeBSD, Linux, Solaris, Mac OSX, Windows | Hard page size of 16 kB, but can be changed to 4, 8, 32 or 64 kB
PostgreSQL | Support for user-defined operators and UDFs (aggregate and window) in PL/pgSQL and C | Linux, Unix, MS Windows | Data page size of 8 kB
SQLServer | Full support for scalar and aggregate UDFs and TVFs written in C# and Visual Basic | Windows | Data page size of 8 kB
Teradata | Supports TVFs and UDFs written in C | Windows | Variable block size up to 128 kB
Hana | Supports scalar and table UDFs in SQLScript | SUSE Linux Enterprise Server | Block size from 4 kB to 16 MB
InfiniDB | No support for aggregate UDFs yet | Linux, MS Windows | Data is arranged in extents, continuous allocations of space; typical sizes are between 8 and 64 MB
MonetDB | Limited support for scalar and column UDFs in C | Linux, Mac OSX, MS Windows | Block size of 512 kB
Vertica | Limited UDF support | Linux: Ubuntu and RHEL | Block size of 1 MB
Accumulo | MapReduce support | Linux | Block size of 64 MB or 128 MB (HDFS standard)
Hbase | MapReduce support | Linux | Block size of 64 MB or 128 MB (HDFS standard)
Impala | Scalar and aggregate UDFs in C++ or Java | Linux | Block size of 64 MB or 128 MB (HDFS standard)
AllegroGraph | Supports stored procedures in LISP and Java | FreeBSD, Linux, OSX, Windows | N/A
InfiniteGraph | Supports Java procedures | OSX, SUSE Linux, Ubuntu, Windows | Page sizes range from 1 to 16 kB
Neo4j | Supports unmanaged extensions in Java | Linux, HP-UX, OSX, Windows | Default page size is 1 MB
Oracle Spatial and Graph | Support for UDFs in PL/SQL, Java and C++ | Linux, HP-UX, Solaris, Windows | Block size of 2 to 8 kB
GeoRaster | Support for UDFs in PL/SQL, Java and C++ | Linux, HP-UX, Solaris, Windows | Variable block size; max is 4 GB
PostGIS | Support for user-defined operators and UDFs (aggregate and window) in PL/pgSQL and C | Linux, OSX, Windows | 8 kB page size
Rasdaman | Support for user-defined functions | Linux, HP-UX, Solaris, Windows | Tile size ranges from 1 to 4 MB
SciDB | Supports user-defined types and UDFs in C++ | Ubuntu, RHEL, CentOS | Chunk size is 4 to 20 MB for optimal performance
InfoSphere Streams | Streams Studio applications as well as Java and C++ extensions | Red Hat Linux | Adjustable
Samza | Stream processing API | Linux, Solaris, Unix | Storage is built on top of LevelDB: block sizes range from 64 to 256 kB
Storm | Stream processing API | Linux, Solaris, Unix | No storage option available


Table 5 Mathematical packages.

Name | RAM limit | Parallelism | Programming language | Compiled / interpreted
Julia | Not limited | Supports parallel and cloud computing, but no single-node parallelism | Julia language | JIT compilation
LAPACK | Matrices up to 2^32 × 2^32 elements | Single-node superscalar architectures and multiple-node parallelism via ScaLAPACK | Library of functions called from FORTRAN and C++ | Compiled
MatLab | Up to 8 TB of memory in 64-bit systems | Supports single-node and N-node parallelism via the Parallel Computing Toolbox | MatLab | Interpreted
Octave | Limited by physical memory | Parallelism is available via toolkits | Matlab (reduced) | Interpreted
pbdR | Not limited | Implicit and explicit parallelism in single nodes and HPC | R language | Interpreted
R | Limited to 2 GB | Single-node parallelism | R language | Interpreted
Revolution | Not limited | Single and multiple node cluster | R language | Interpreted
SAS | Limited by physical memory size and paging space available at the time SAS is started | Single-node parallelism | SAS language | Both

relationships. Finally, we can mention Urika, a complete hardware/software solution developed by Cray, using a scalable shared memory architecture that aims to solve the problem of partitioning graphs to take advantage of parallel computations [33]. Most of these databases are built on top of a Hadoop storage system, and share most of its characteristics and limitations.

4.2 SASTRY: Generic HDFS systems, not based on SQL

As previously mentioned, in many cases it is not convenient to store data in a DBMS. Reasons for this could be the need for quick prototyping of data mining or machine learning code, which could be more efficiently handled using a flat file, or the nature of the data could be too “unstructured” to easily fit in any of the categories mentioned in Section 4. Table 5 summarizes important mathematical packages used today.19

DAVID: The R package and Matlab

The R language was developed in 1993 by Ihaka and Gentleman [35] as an interactive tool for their introductory data-analysis course. Since then R has evolved into a full-fledged programming language for processing large and complex data sets. As it is an open-source project, it is maintained and developed by the statistics community. This has resulted in a package that is loaded with features especially oriented to the quick generation of charts and graphs for

19 SASTRY: Table 1 has too short a list, so we can delete it


easy interpretation of data. Furthermore, most machine learning and statistics algorithms have been incorporated into the package.

It is clear that these features make it the tool of choice for quick data analysis. Its main disadvantage is that it is not optimized for large data sets such as those commonly found in today's problems. This drawback, however, is being addressed by tools like RRE from Revolution Analytics and pbdR [50,51]. The developers of pbdR (Programming with Big Data in R) aim to provide distributed data parallelism in R in order to utilize large HPC platforms. This is achieved using high-performance libraries (MPI, ScaLAPACK and NetCDF4), while preserving R's syntax with new scalable code underneath.

Matlab combines a high-level language with an interactive environment oriented towards visualization and model computation. Matlab has a large built-in library of mathematical functions to solve engineering and scientific problems, as well as to explore and model data. However, like R, Matlab is not designed to be used on data sets that exceed the available memory of the computer.

SASTRY: Hadoop Distributed File System (HDFS)

HDFS and MapReduce were the two major components of the original Hadoop (version 1.x).20 At its core the Hadoop Distributed File System (HDFS) is a distributed file system, written in Java, that stores large files across multiple machines, with added reliability achieved by replicating the data at multiple nodes. Hence there is no need for RAID storage in the host machines. Data nodes can communicate with each other in order to balance the load. In addition to the data hosts, HDFS includes a primary NameNode, the main metadata server, and a secondary NameNode where snapshots of the primary NameNode are stored. The NameNode can become a bottleneck if many small files are being accessed or when the cluster size increases substantially. This is being addressed by allowing multiple namespaces to be handled by different NameNodes using HDFS Federation.

However, as the Hadoop ecosystem evolved, many tools have been developed to support various programming paradigms and to provide different interfaces.21

The major drawbacks of HDFS are that it is not suitable for systems requiring concurrent access to data, and that accessing data outside of the file system can be difficult, since existing operating systems cannot mount it directly. All file accesses must go through the native Java API. MapR [27]

20 SASTRY: HDFS continues to be the major player in distributed systems – and is a complementary thing to MapReduce, so we need to maintain that notion in the discussion.
21 SASTRY: Originally a part of the Map-Reduce programming model, handling data storage and large-scale processing in nodes, it has in recent years taken a life of its own, with applications like Cloudera, that deliver a complete distribution of Apache Hadoop, coupled with features like increased security and user interfaces. – it is a bit confusing; HDFS and MapReduce are two different pieces, and Cloudera is one of the three major distributions – HortonWorks and MapR are two other.


addresses these problems by replacing HDFS with a full random-access file system.

5 CUT Patterns, CUT Cubes?, Graphs and Statistical Models

We now discuss the most difficult part of big data analytics, after data (tables, files, text) have been cleaned, integrated, transformed and explored with queries or simple programs. Fundamental research problems on analyzing big data can be broadly classified into the following categories:

1. CUT Patterns, which involve exploratory search on a discrete combinatorial space (association rules, itemsets, sequences). Such patterns are commonly discovered by search algorithms exploring a lattice.

2. CUT Cubes, decision support queries
3. Graphs, a formal mathematical model of objects and their relationships
4. Statistical models, which involve matrices, vectors and histograms. Such models are computed by a combination of numerical and statistical methods.

5.1 CUT Patterns: Cubes and Itemsets

In the case of cubes, the data set is a fact table F with d discrete attributes (called dimensions) to aggregate by group and m numeric attributes called measures.

We can also introduce k, a common additional size specified in models (e.g. k clusters, k components).

Pattern search is a different problem from statistical models because algorithms work in a combinatorial manner on a discrete space. Patterns include cubes, itemsets (association rules) [3], as well as their generalization to sequences (considering an order on items) and graphs. Given their generality and close relationship to social networks, mining large graphs is currently the most important topic.

Cubes represent a mature research topic, but one widely used in practice in BI tools. Most approaches avoid exploring the entire cube by pruning the search space. Methods increase the performance of multidimensional aggregations by combining data structures for sparse data and cell precomputation at different aggregation levels of the lattice. The authors of [29] put forward the plan of creating smaller, indexed summary tables from the original large input table to speed up aggregation execution. In [56] the authors explore tools to guide the user to interesting regions in order to highlight anomalous behavior while exploring large data cubes. Large lattices are explored with greedy algorithms.
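The summary-table idea attributed to [29] fits in a few lines of SQL: aggregate the fact table once at a coarser lattice level, then answer further roll-ups from the much smaller summary (schema and data below are invented; SQLite via Python is used as a stand-in DBMS):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE F(store TEXT, product TEXT, sales REAL);  -- fact table
    INSERT INTO F VALUES ('s1','p1',10),('s1','p2',20),('s2','p1',30);

    -- Precomputed summary at the (store) level of the lattice
    CREATE TABLE summary AS
        SELECT store, SUM(sales) AS sales FROM F GROUP BY store;
""")
# A roll-up to the grand total scans the summary, not the fact table
total = conn.execute("SELECT SUM(sales) FROM summary").fetchone()[0]
print(total)  # 60.0
```

Any group-by that is an ancestor of (store) in the dimension lattice can be answered from the summary, which is the pruning opportunity the text describes.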

Association rules were by far the most popular data mining topic in the past. Frequent itemset search is generally regarded as the fundamental problem in association rule mining, solved by the well-known level-wise A-priori algorithm (exploiting the downward closure property). The common approach to mine


association rules is to view frequent itemset search as optimizing multiple GROUP-BYs. From a DBMS perspective, exploiting SQL queries, UDFs and local caching is explored in [57]. This work studies how to write efficient SQL queries with k-way joins and nested sub-queries to get frequent itemsets, but did not explore UDFs deeply enough as they were more limited back then. SQL is shown to be competitive, but the most efficient solution (local caching) is based on creating a temporary file inside the DBMS, yet apart from relational tables. We believe the slightly slower solution in SQL is more valuable in the long term. Recursive queries can search for frequent itemsets, yielding a compact piece of SQL code, but unfortunately this solution is slower than k-way joins. Relational division and set containment can solve frequent itemset search, providing a complementary solution to joins. The key issue is that relational division requires different query optimization techniques, compared to SPJ operators. There has been an attempt to adapt data structures, like the FP-tree, to work in SQL, but experimental evidence over previous approaches is inconclusive. Mining frequent itemsets over sliding windows has been achieved with aggregate UDFs, interfacing with external programs. Table UDFs can update an entire cube in RAM in one pass for low-dimensional data sets. Thus the dimension (or itemset) lattice can be maintained in main memory. The drawback is that TVFs require a cursor interface that disables parallel processing. The lattice is the main hindrance to processing with SQL queries because it cannot be easily represented as a tabular structure and because SQL does not have pointers and requires joins instead. There exists work on mining graphs (which represent a more general problem) with SQL, but solving narrow issues. Instead, most work has focused on studying algorithms with MapReduce. For the most part, mining graphs with queries or UDFs remains unexplored.
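A minimal instance of the k-way-join formulation discussed above, for k = 2: frequent pairs fall out of a self-join on the transaction table followed by a GROUP BY. The toy data is invented; [57] studies the general k-way case with further optimizations:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE T(tid INT, item TEXT);   -- one row per (transaction, item)
    INSERT INTO T VALUES (1,'a'),(1,'b'),(2,'a'),(2,'b'),(3,'a');
""")
# Frequent 2-itemsets with support >= 2, via a self-join;
# t1.item < t2.item avoids duplicate and reversed pairs.
rows = conn.execute("""
    SELECT t1.item, t2.item, COUNT(*) AS support
    FROM T t1 JOIN T t2 ON t1.tid = t2.tid AND t1.item < t2.item
    GROUP BY t1.item, t2.item
    HAVING COUNT(*) >= 2
""").fetchall()
print(rows)  # [('a', 'b', 2)]
```

For itemsets of size k the pattern extends to a k-way self-join, which is exactly where the query optimizer's join strategies start to matter.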

5.2 DAVID: Graphs

The input is one large directed graph G = (V,E), where V are the vertices (nodes), E the edges, n = |V| and m = |E|. The goal is to find interesting patterns in the graph, which include paths and subgraphs. Even though the problems and algorithms explained below have been studied for a long time in graph theory, operations research and theoretical computer science, big data analytics has made it necessary to revisit them because G cannot fit in RAM and G must be processed in parallel. The shape of the graph and the existence of cycles make the problem harder [44]. The simplest graphs to analyze include trees, lists and graphs with small (disconnected) components. Harder graphs have cliques and cycles. The hardest graphs are similar to the complete graph Kn, but they are not common in social networks or the Internet.

The most important graph algorithms can be broadly classified into non-recursive and recursive (given their most common definition). Non-recursive algorithms with wide applicability include: shortest path (Dijkstra), node centrality, edge density per group, clustering coefficient, and discovering frequent subgraphs up to subgraph isomorphism.

Recursive algorithms can be broadly subclassified into those that involve a linear recursion and those with a (harder) non-linear recursion. Linear recursion problems include: transitive closure, power laws (fractals), iterative spectral clustering (on the Laplacian of G), and discovering frequent subgraphs. Non-linear recursion is prominently represented by clique enumeration (i.e. finding all cliques) and detecting isomorphic subgraphs.
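Linear recursion such as transitive closure maps directly onto a recursive SQL query; SQLite's WITH RECURSIVE (used here through Python, over an invented toy edge table) is enough to sketch reachability:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE E(src INT, dst INT);          -- directed edges
    INSERT INTO E VALUES (1,2),(2,3),(3,4);
""")
# All vertices reachable from vertex 1: the recursive term joins the
# edge table with the partial result, a linear recursion.
rows = conn.execute("""
    WITH RECURSIVE reach(v) AS (
        SELECT 1
        UNION
        SELECT E.dst FROM E JOIN reach ON E.src = reach.v
    )
    SELECT v FROM reach ORDER BY v
""").fetchall()
print([v for (v,) in rows])  # [1, 2, 3, 4]
```

The UNION (rather than UNION ALL) deduplicates, which is what guarantees termination on cyclic graphs; non-linear recursions like clique enumeration have no equally direct SQL form.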

5.3 SASTRY: Statistical Models

Here we present models in two major groups: unsupervised models (clustering, dimensionality reduction) and supervised models (classification, regression, feature selection, variable selection).

Definitions: Let X = {x1, . . . , xn} be a data set having n records, where each point has d attributes. When all d attributes are numeric then X can be considered a matrix and each record is a point in d-dimensional space. For predictive models, X has an additional “target” attribute. This attribute can be a numeric dimension Y (used for linear regression) or a categorical attribute G (used for classification) with m distinct values.

Clustering: This is one of the best-known unsupervised data mining techniques. In this approach the data set is partitioned into subsets, such that points in the same subset are similar to each other. Clustering is commonly used as a building block for more complex models. Techniques that have been incorporated internally in a DBMS include K-means clustering and hierarchical clustering. Scalable K-means can work incrementally in one pass, performing iterations on weighted points in main memory. This algorithm is also based on sufficient statistics. The algorithm relies on low-level direct access to the file system, which allows processing the data set in blocks. SQL has been successfully used to program clustering algorithms, such as K-means [43] and the more sophisticated EM algorithm [46]. Distance is the most demanding computation in these algorithms. Distance computation in K-means [43] is evaluated with joins between a table with centroids and the data set, computing aggregations on squared dimension differences. Sufficient statistics, denormalization (avoiding joins) and parallel query evaluation (with carefully balanced row distribution) accelerate K-means computation.
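The join-based distance computation described above can be sketched as follows: every point is joined with every centroid, the squared Euclidean distance is an expression over dimension differences, and the nearest centroid is selected with a MIN subquery. Tables, values and d = 2 are invented; [43] uses a pivoted layout and aggregations for general d:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE X(i INT, x1 REAL, x2 REAL);   -- data points
    CREATE TABLE C(j INT, x1 REAL, x2 REAL);   -- centroids
    INSERT INTO X VALUES (1, 0.0, 0.0), (2, 9.0, 9.0);
    INSERT INTO C VALUES (1, 1.0, 1.0), (2, 8.0, 8.0);
""")
# Assign each point to its nearest centroid: the join enumerates all
# point/centroid pairs, the correlated subquery finds the minimum distance.
rows = conn.execute("""
    SELECT X.i, C.j
    FROM X JOIN C
    WHERE (X.x1-C.x1)*(X.x1-C.x1) + (X.x2-C.x2)*(X.x2-C.x2) =
          (SELECT MIN((X.x1-c2.x1)*(X.x1-c2.x1) + (X.x2-c2.x2)*(X.x2-c2.x2))
           FROM C c2)
    ORDER BY X.i
""").fetchall()
print(rows)  # [(1, 1), (2, 2)]
```

An iteration of K-means then recomputes centroids with a GROUP BY over these assignments, which is where the sufficient-statistics and denormalization optimizations apply.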

It is feasible to develop an incremental version [43] with SQL queries that requires only a few passes (even two) on the data set; this approach is somewhat similar to iterated sampling (bootstrapping), with the basic difference that each record is visited at least once. UDFs further accelerate distance computation [45] by avoiding joins altogether. The EM algorithm represents an important maximum likelihood estimation method. Reference [46] represents a first attempt to program EM clustering (solving a Gaussian mixture problem) in SQL, proving it was feasible to program a fairly complex algorithm entirely with SQL code generation. In [46] several alternatives for distance computation are explored, pivoting the data set and distances to reduce I/O cost. Mixture parameters are updated with queries combining aggregations and joins. This SQL solution requires multiple iterations and two passes per iteration. Deriving incremental algorithms, exploiting sufficient statistics and accelerating EM processing with UDFs are open problems. Subspace clustering is a grid-based approach, which incorporates several optimizations similar to frequent itemset mining (discussed below). The data set is binned with user-defined parameters, which is easily and efficiently accomplished with SQL. SQL queries traverse the dimension lattice in a level-wise manner, making one pass for each size. Acceleration is gained with aggressive pruning of dimension combinations.

Principal Components Analysis: This is an unsupervised technique and is the most popular dimensionality reduction method. PCA computation can be pushed into a DBMS by computing the sufficient statistics for the correlation matrix with SQL or UDFs in one table scan. It is possible to solve PCA with a complex numerical method like SVD entirely in SQL. Such an approach enables fast analysis of very high dimensional data sets (e.g. microarray data). Moreover, this solution is competitive with a state-of-the-art statistical system (the R package). PCA has also been solved by summarizing the data with an aggregate UDF and computing SVD with the LAPACK library [49], giving a two-orders-of-magnitude performance improvement.
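The one-scan idea can be illustrated for a single correlation entry: the counts, sums and cross-products gathered by one aggregate query are all that is needed to form the correlation matrix that PCA diagonalizes. This is a minimal sketch with d = 2 and an illustrative table:

```python
import math
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE X (x1 REAL, x2 REAL)")
conn.executemany("INSERT INTO X VALUES (?,?)",
                 [(1.0, 2.0), (2.0, 4.1), (3.0, 5.9), (4.0, 8.0)])

# One table scan gathers the sufficient statistics: n, the linear sums L,
# and the quadratic cross-products Q.
n, s1, s2, q11, q22, q12 = conn.execute("""
    SELECT COUNT(*), SUM(x1), SUM(x2),
           SUM(x1*x1), SUM(x2*x2), SUM(x1*x2)
    FROM X
""").fetchone()

# The correlation is derived outside the DBMS from these tiny exported
# statistics; no second pass over the data is needed.
rho = (n*q12 - s1*s2) / math.sqrt((n*q11 - s1*s1) * (n*q22 - s2*s2))
print(rho)   # close to 1 for this nearly collinear toy data
```

For general d the same scan yields a d×d correlation matrix, whose eigendecomposition (or SVD) is then computed by a math library on a matrix whose size does not depend on n.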

We now turn our attention to supervised techniques.

Decision Trees: This is a fundamental technique that recursively partitions a data set to induce predictive rules. Sufficient statistics for decision trees are important to reduce the number of passes: numeric attributes are binned and row counts per bin or categorical value are computed. The splitting phase is generally accelerated using a condensed representation of the data set. Table-Valued Functions have been exploited to implement primitive decision tree operations. UDFs implement primitives to perform attribute splits, reducing the number of passes over the data set; this work also proposes a prediction join, with new SQL syntax, to score new records. Reducing the number of passes on the data set to two or one using SQL mechanisms is an open problem. Exploiting MapReduce [21] to learn an ensemble of decision trees on large data sets using a cluster of computers is studied in [53]. This work introduces optimizations to reduce the number of passes on the data set.
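The binned sufficient statistics described above amount to one GROUP BY per candidate attribute. A toy sketch (the schema, with numeric attribute `age` and class label `g`, is hypothetical) shows how split evaluation reduces to small count tables:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE X (age REAL, g TEXT)")  # g: class label
conn.executemany("INSERT INTO X VALUES (?,?)",
                 [(23, 'no'), (31, 'no'), (47, 'yes'), (52, 'yes'), (28, 'no')])

# One scan: bin the numeric attribute (decade-wide bins here) and count
# rows per (bin, class). Split impurity (gini, entropy) is then computed
# from these counts without touching the data set again.
counts = conn.execute("""
    SELECT CAST(age/10 AS INTEGER) AS bin, g, COUNT(*) AS cnt
    FROM X
    GROUP BY bin, g
    ORDER BY bin, g
""").fetchall()
print(counts)
```

The resulting count table is tiny (bins × classes rows), which is what makes the condensed representation effective: each tree node's best split is chosen from counts, not from raw rows.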

Regression Models: We now discuss regression models, which are popular in statistics. Sufficient statistics for linear models (including linear regression) are generalized and implemented as a primitive function using aggregate UDFs, producing a tightly coupled solution. Basically, the sufficient statistics for clustering (as used by K-means) are extended with dimension cross-products to estimate correlations and covariances. Matrix equations are mathematically transformed into equivalent equations based on sufficient statistics as factors. The key idea is to compute those sufficient statistics in one pass, with SQL queries or even faster with aggregate UDFs. The equations solving each model can then be efficiently evaluated with a math library, exporting small matrices whose size is independent of the data set size. Linear regression has been integrated with the DBMS with UDFs, exploiting parallel execution of the UDF in unfenced mode. Moreover, it is feasible to perform variable selection exploiting sufficient statistics [45]. Bayesian statistics are particularly challenging because MCMC methods require thousands of iterations. Recent research [40] shows it is feasible to accelerate a Gibbs sampler to select variables in linear regression under a Bayesian approach, exploiting data summarization. Database systems have been extended with model-based views in order to manage and analyze databases with incomplete or inconsistent content and to incrementally build regression and interpolation models, extending SQL with syntax to define model views.
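For the simplest case (one predictor), the transformation from matrix equations to sufficient-statistics factors can be shown end to end. This is a sketch, not the paper's UDF implementation: one SQL scan gathers n, the sums and the cross-products, and the normal equations for y = a + b·x are then solved on constants whose size is independent of n:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE X (x REAL, y REAL)")
conn.executemany("INSERT INTO X VALUES (?,?)",
                 [(0.0, 1.0), (1.0, 3.0), (2.0, 5.0), (3.0, 7.0)])

# One pass: the sufficient statistics n, sum(x), sum(y), sum(x^2), sum(xy).
n, sx, sy, sxx, sxy = conn.execute(
    "SELECT COUNT(*), SUM(x), SUM(y), SUM(x*x), SUM(x*y) FROM X"
).fetchone()

# Normal equations reduced to the scalar case; for d predictors the same
# statistics form the d x d matrix X'X and vector X'Y handed to a math library.
b = (n*sxy - sx*sy) / (n*sxx - sx*sx)
a = (sy - b*sx) / n
print(a, b)   # the toy data lie exactly on y = 1 + 2x
```

The same statistics support variable selection: dropping a variable only removes rows and columns from the already-summarized cross-product matrix, with no further data scans.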

Support Vector Machines: We close our discussion with Support Vector Machines (SVMs), a mathematically sophisticated classifier. SVMs can achieve a perfect separation of classes in a higher dimensional space, but have quadratic time complexity. Two fundamental SVM aspects need to be considered: kernel functions (mapping points to a higher dimensional space) and solving quadratic programming, a mathematically difficult problem. For instance, kernel functions have been integrated with a DBMS [39]. SVMs have been accelerated with hierarchical clustering and incremental matrix factorization.

6 The Future: Research Issues

DAVID: Analyzing rapidly growing volumes of relational data and diverse files together (i.e., variety in big data analytics) adds a new level of difficulty and thus nowadays represents a major research direction. The default storage mechanism in most DBMSs will likely remain row-based, but column stores [65] show promise to accelerate statistical processing, especially for feature selection, dimensionality reduction and cubes. Solid state memory may be an alternative for small volumes of data, but it will likely remain behind disks on price and capacity. Sharing RAM among multiple nodes over a fast network can help load large data sets in RAM. Shared disk is not a promising alternative due to the I/O bottleneck.

DAVID: In parallel processing the major goal will remain the efficient evaluation of mathematical equations independently on each table partition, balancing the workload and minimizing row redistribution. With the advances in fast network communication, larger RAM, multicore CPUs and larger disks, the future default system will probably remain a shared-nothing parallel architecture with N processing units, each one with its own main memory and secondary storage (disk or solid state memory). Matrices are used extensively in mathematical models, computed with parallel algorithms. Thus we believe it is necessary to extend a DBMS with matrix data types, matrix operators and linear algebra numerical methods. In particular, it is necessary to study efficient mechanisms to integrate SQL with mathematical libraries like LAPACK. Array databases go in that direction, but they are incompatible with SQL and traditional relational storage mechanisms, resulting in multiple copies of the same data. Most likely, SQL will incorporate array storage mechanisms with a new data type and primitive operators.

SASTRY: Some common simple data mining algorithms may make their way into DBMS source code, whereas others may keep growing as SQL queries or UDFs. But most new data mining and statistical algorithms will be developed in languages like C++ or R. Given the similarity between MapReduce jobs and aggregate UDFs, it is likely DBMSs will support automated translation from one into the other. Also, there will be faster interfaces to move data from one platform to the other. Extending SQL with new clauses is not a promising alternative, because SQL syntax is already overwhelming and DBMS source code is required (many proposals have been simply forgotten). Combining SQL code generation and UDFs needs further study to program and optimize complex statistical and numerical methods. State-of-the-art data mining algorithms generally have, or are forced to achieve, linear time complexity in data set size. Can SQL queries calling UDFs match such time complexity exploiting relational physical operators (scan, join, sort)? There is evidence they can.
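The MapReduce/aggregate-UDF similarity can be made concrete with SQLite's Python UDF interface, here standing in for a DBMS aggregate UDF (the `var_pop` aggregate and table are illustrative): `step()` plays the role of map, called once per row, and `finalize()` the role of reduce.

```python
import sqlite3

class VarAgg:
    """Aggregate UDF accumulating sufficient statistics (n, sum, sum of
    squares) for a population-variance computation in one scan."""
    def __init__(self):
        self.n, self.s, self.q = 0, 0.0, 0.0
    def step(self, x):          # per-row, like a mapper
        self.n += 1
        self.s += x
        self.q += x * x
    def finalize(self):         # once at the end, like a reducer
        return (self.q - self.s * self.s / self.n) / self.n

conn = sqlite3.connect(":memory:")
conn.create_aggregate("var_pop", 1, VarAgg)
conn.execute("CREATE TABLE X (x REAL)")
conn.executemany("INSERT INTO X VALUES (?)", [(2.0,), (4.0,), (4.0,), (6.0,)])
(v,) = conn.execute("SELECT var_pop(x) FROM X").fetchone()
print(v)   # population variance of 2, 4, 4, 6
```

Because the per-row state is a small commutative summary, the same aggregate parallelizes naturally: partitions can run `step()` independently and merge their (n, s, q) triples, exactly the combiner pattern of a MapReduce job.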

SASTRY: Future research will keep studying incremental model computation, block-based processing, data summarization, sampling and data structures for secondary storage under sequential and parallel processing. Developing incremental algorithms, beyond clustering and decision trees, is definitely challenging due to the need to reconcile parallel processing and incremental learning. Incorporating sampling as a primitive acceleration mechanism is a promising alternative to analyze large data sets, but approximation error must be controlled. Model selection is about selecting the best model out of several competing models; this task requires building several models with slightly different parameters or exploring several models concurrently. From the mathematical side the list is extensive. Overall, models involving time remain difficult to compute with existing DBMS technology (they are better suited to mathematical packages). Hidden Markov Models (HMMs) are a prominent example. Non-linear models (e.g. non-linear regression, neural networks) also deserve attention, but their impact is more limited. We anticipate recursive queries and aggregate UDFs will play a central role to analyze large graphs in a DBMS, whereas the state of the art for large graphs in social networks will be MapReduce.

References

1. D.J. Abadi, S. Madden, and N. Hachem. Column-stores vs. row-stores: how different are they really? In Proc. ACM SIGMOD Conference, pages 967–980, 2008.

2. A. Abouzied, K. Bajda-Pawlikowski, J. Huang, D.J. Abadi, and A. Silberschatz. HadoopDB in action: building real world applications. In Proc. ACM SIGMOD Conference, pages 1111–1114, 2010.

3. R. Agrawal, T. Imielinski, and A. Swami. Mining association rules between sets of items in large databases. In Proc. ACM SIGMOD Conference, pages 207–216, 1993.

4. C. Avery. Giraph: large-scale graph processing infrastructure on Hadoop. In Proc. Hadoop Summit, Santa Clara, USA, 2011.

5. P. Baumann, A.M. Dumitru, and V. Merticariu. The array database that is not a database: file based array query answering in rasdaman. In Advances in Spatial and Temporal Databases, pages 478–483. Springer, 2013.

6. A. Behm, V.R. Borkar, M.J. Carey, R. Grover, C. Li, N. Onose, R. Vernica, A. Deutsch, Y. Papakonstantinou, and V.J. Tsotras. ASTERIX: towards a scalable, semistructured data platform for evolving-world models. Distributed and Parallel Databases (DAPD), 29(3):185–216, 2011.

7. A. Biem, E. Bouillet, H. Feng, A. Ranganathan, A. Riabov, O. Verscheure, H. Koutsopoulos, and C. Moran. IBM InfoSphere Streams for scalable, real-time, intelligent transportation services. In Proc. ACM SIGMOD Conference, pages 1093–1104, 2010.

8. P. Bradley, U. Fayyad, and C. Reina. Scaling clustering algorithms to large databases. In Proc. ACM KDD Conference, pages 9–15, 1998.

9. P.G. Brown. Overview of SciDB: large scale array storage, processing and analysis. In Proc. ACM SIGMOD Conference, pages 963–968, 2010.

10. B. Chandramouli, M. Ali, J. Goldstein, B. Sezgin, and B.S. Raman. Data stream management systems for computational finance. Computer, 43(12):45–52, 2010.

11. S. Chaudhuri, G. Das, and U. Srivastava. Effective use of block-level sampling in statistics estimation. In Proc. ACM SIGMOD Conference, pages 287–298, 2004.

12. J. Clear, D. Dunn, B. Harvey, M.L. Heytens, and P. Lohman. Non-stop SQL/MX primitives for knowledge discovery. In Proc. ACM KDD Conference, pages 425–429, 1999.

13. E.F. Codd. A relational model of data for large shared data banks. Commun. ACM, 13(6):377–387, 1970.

14. J. Cohen, B. Dolan, M. Dunlap, J. Hellerstein, and C. Welton. MAD skills: New analysis practices for big data. In Proc. VLDB Conference, pages 1481–1492, 2009.

15. C. Cranor, T. Johnson, O. Spataschek, and V. Shkapenyuk. Gigascope: a stream database for network applications. In Proc. ACM SIGMOD Conference, pages 647–651, 2003.

16. C. Cunningham, G. Graefe, and C.A. Galindo-Legaria. PIVOT and UNPIVOT: Optimization and execution strategies in an RDBMS. In Proc. VLDB Conference, pages 998–1009, 2004.

17. S. Das, Y. Sismanis, K.S. Beyer, R. Gemulla, P.J. Haas, and J. McPherson. RICARDO: integrating R and Hadoop. In Proc. ACM SIGMOD Conference, pages 987–998, 2010.

18. T. Dasu, T. Johnson, S. Muthukrishnan, and V. Shkapenyuk. Mining database structure; or, how to build a data quality browser. In Proc. ACM SIGMOD Conference, pages 240–251, 2002.

19. J. Dean and S. Ghemawat. MapReduce: simplified data processing on large clusters. Commun. ACM, 51(1):107–113, 2008.

20. N. Dindar, B. Guc, P. Lau, A. Ozal, M. Soner, and N. Tatbul. DejaVu: declarative pattern matching over live and archived streams of events. In Proc. ACM SIGMOD Conference, pages 1023–1026, 2009.

21. E. Friedman, P. Pawlowski, and J. Cieslewicz. SQL/MapReduce: a practical approach to self-describing, polymorphic, and parallelizable user-defined functions. Proc. VLDB Endow., 2(2):1402–1413, 2009.

22. H. Garcia-Molina, J.D. Ullman, and J. Widom. Database Systems: The Complete Book. Prentice Hall, 2nd edition, 2008.

23. L. Gesbert, Z. Hu, F. Loulergue, K. Matsuzaki, and J. Tesson. Systematic development of correct bulk synchronous parallel programs. In Proc. PDCAT Conference, pages 334–340. IEEE, 2010.

24. A. Ghazal, T. Rabl, M. Hu, F. Raab, M. Poess, A. Crolotte, and H. Jacobsen. BigBench: towards an industry standard benchmark for big data analytics. In Proc. ACM SIGMOD Conference, pages 1197–1208, 2013.

25. L. Golab and M.T. Ozsu. Issues in data stream management. ACM SIGMOD Record, 32(2):5–14, 2003.

26. J. Gray, A. Bosworth, A. Layman, and H. Pirahesh. Data cube: A relational aggregation operator generalizing group-by, cross-tab and sub-total. In Proc. IEEE ICDE Conference, pages 152–159, 1996.

27. M. Gualtieri, N. Yuhanna, H. Kisker, and D. Murphy. The Forrester Wave: Enterprise Hadoop solutions, Q1 2012, 2012.

28. S. Guha, N. Mishra, R. Motwani, and L. O'Callaghan. Clustering data streams. In Proc. FOCS Conference, page 359, 2000.

29. H. Gupta, V. Harinarayan, A. Rajaraman, and J.D. Ullman. Index selection for OLAP. In Proc. IEEE ICDE Conference, 1997.

30. J. Han and M. Kamber. Data Mining: Concepts and Techniques. Morgan Kaufmann, San Francisco, 2nd edition, 2006.

31. T. Hastie, R. Tibshirani, and J.H. Friedman. The Elements of Statistical Learning. Springer, New York, 1st edition, 2001.

32. J. Hellerstein, C. Re, F. Schoppmann, D.Z. Wang, et al. The MADlib analytics library or MAD skills, the SQL. Proc. of VLDB, 5(12):1700–1711, 2012.

33. P. Howard. Urika: An InDetail paper, 2012. [Online; accessed 1-April-2014].

34. W. Hsieh, J. Madhavan, and R. Pike. Data management projects at Google. In Proc. ACM SIGMOD Conference, pages 725–726, 2006.

35. R. Ihaka and R. Gentleman. R: a language for data analysis and graphics. Journal of Computational and Graphical Statistics, 5:299–314, 1996.

36. Intel. Intel math kernel library documentation. http://software.intel.com/en-us/articles/intel-math-kernel-library-documentation/, 2012.

37. P. Larsson. Analyzing and adapting graph algorithms for large persistent graphs. Technical report, Linkopings Universitet, Sweden, 2008.

38. G. Malewicz, M.H. Austern, A.J.C. Bik, J.C. Dehnert, I. Horn, N. Leiser, and G. Czajkowski. Pregel: a system for large-scale graph processing. In Proc. ACM SIGMOD Conference, pages 135–146, 2010.

39. B.L. Milenova, J. Yarmus, and M.M. Campos. SVM in Oracle database 10g: Removing the barriers to widespread adoption of support vector machines. In Proc. VLDB Conference, 2005.

40. M. Navas, C. Ordonez, and V. Baladandayuthapani. On the computation of stochastic search variable selection in linear regression with UDFs. In Proc. IEEE ICDM Conference, pages 941–946, 2010.

41. A. Netz, S. Chaudhuri, U. Fayyad, and J. Berhardt. Integrating data mining with SQL databases: OLE DB for data mining. In Proc. IEEE ICDE Conference, pages 379–387, 2001.

42. C. Ordonez. Vertical and horizontal percentage aggregations. In Proc. ACM SIGMOD Conference, pages 866–871, 2004.

43. C. Ordonez. Integrating K-means clustering with a relational DBMS using SQL. IEEE Transactions on Knowledge and Data Engineering (TKDE), 18(2):188–201, 2006.

44. C. Ordonez. Optimization of linear recursive queries in SQL. IEEE Transactions on Knowledge and Data Engineering (TKDE), 22(2):264–277, 2010.

45. C. Ordonez. Statistical model computation with UDFs. IEEE Transactions on Knowledge and Data Engineering (TKDE), 22(12):1752–1765, 2010.

46. C. Ordonez and P. Cereghini. SQLEM: Fast clustering in SQL using the EM algorithm. In Proc. ACM SIGMOD Conference, pages 559–570, 2000.

47. C. Ordonez and J. García-García. Referential integrity quality metrics. Decision Support Systems Journal, 44(2):495–508, 2008.

48. C. Ordonez, J. García-García, and Z. Chen. Dynamic optimization of generalized SQL queries with horizontal aggregations. In Proc. ACM SIGMOD Conference, pages 637–640, 2012.

49. C. Ordonez, N. Mohanam, C. Garcia-Alvarado, P.T. Tosic, and E. Martinez. Fast PCA computation in a DBMS with aggregate UDFs and LAPACK. In Proc. ACM CIKM Conference, 2012.

50. G. Ostrouchov, W.-C. Chen, D. Schmidt, and P. Patel. Programming with big data in R, 2012.

51. G. Ostrouchov, D. Schmidt, W.-C. Chen, and P. Patel. Combining R with scalable libraries to get the best of both for big data. Technical report, Oak Ridge National Laboratory (ORNL), Center for Computational Sciences, 2013.

52. F. Pan, X. Zhang, and W. Wang. CRD: fast co-clustering on large datasets utilizing sampling-based matrix decomposition. In Proc. ACM SIGMOD Conference, pages 173–184, 2008.

53. B. Panda, J. Herbach, S. Basu, and R.J. Bayardo. PLANET: Massively parallel learning of tree ensembles with MapReduce. In Proc. VLDB Conference, pages 1426–1437, 2009.

54. S. Pitchaimalai, C. Ordonez, and C. Garcia-Alvarado. Comparing SQL and MapReduce to compute Naive Bayes in a single table scan. In Proc. ACM CloudDB, pages 9–16, 2010.

55. R. Rea and K. Mamidipaka. IBM InfoSphere Streams: redefining real time analytics. IBM Software Group, 2010.

56. S. Sarawagi, R. Agrawal, and N. Megiddo. Discovery-driven exploration of OLAP data cubes. In Proc. EDBT Conference, pages 168–182. Springer-Verlag, 1998.

57. S. Sarawagi, S. Thomas, and R. Agrawal. Integrating association rule mining with relational database systems: alternatives and implications. In Proc. ACM SIGMOD Conference, pages 343–354, 1998.

58. K. Sattler and O. Dunemann. SQL database primitives for decision tree classifiers. In Proc. ACM CIKM Conference, pages 379–386, 2001.

59. V. Sekar, M.K. Reiter, and H. Zhang. Revisiting the case for a minimalist approach for network flow monitoring. In Proc. ACM SIGCOMM Internet Measurement Conference, pages 328–341, 2010.

60. B. Shao, H. Wang, and Y. Xiao. Managing and mining large graphs: systems and implementations. In Proc. ACM SIGMOD Conference, pages 589–592, 2012.

61. E. Soroush, M. Balazinska, and D. Wang. ArrayStore: A storage manager for complex parallel array processing. In Proc. ACM SIGMOD Conference, pages 253–264, 2011.

62. M. Stonebraker, D. Abadi, D.J. DeWitt, S. Madden, E. Paulson, A. Pavlo, and A. Rasin. MapReduce and parallel DBMSs: friends or foes? Commun. ACM, 53(1):64–71, 2010.

63. M. Stonebraker, D.J. Abadi, A. Batkin, X. Chen, M. Cherniack, M. Ferreira, E. Lau, A. Lin, S. Madden, E.J. O'Neil, P.E. O'Neil, A. Rasin, N. Tran, and S.B. Zdonik. C-Store: A column-oriented DBMS. In Proc. VLDB Conference, pages 553–564, 2005.

64. M. Stonebraker, J. Becla, D.J. DeWitt, K.T. Lim, D. Maier, O. Ratzesberger, and S.B. Zdonik. Requirements for science data bases and SciDB. In Proc. CIDR Conference, 2009.

65. M. Stonebraker, S. Madden, D.J. Abadi, S. Harizopoulos, N. Hachem, and P. Helland. The end of an architectural era: (it's time for a complete rewrite). In Proc. VLDB Conference, pages 1150–1160, 2007.

66. H. Wang, C. Zaniolo, and C.R. Luo. ATLaS: A small but complete SQL extension for data mining and data streams. In Proc. VLDB Conference, pages 1113–1116, 2003.

67. K.Y. Whang, T.S. Yun, Y.M. Yeo, I.Y. Song, H.Y. Kwon, and I.J. Kim. ODYS: an approach to building a massively-parallel search engine using a DB-IR tightly-integrated parallel DBMS for higher-level functionality. In Proc. ACM SIGMOD Conference, pages 313–324, 2013.

68. A. Witkowski, S. Bellamkonda, T. Bozkaya, G. Dorman, N. Folkert, A. Gupta, L. Sheng, and S. Subramanian. Spreadsheets in RDBMS for OLAP. In Proc. ACM SIGMOD Conference, pages 52–63, 2003.

69. G. Wu, E. Chang, Y.K. Chen, and C. Hughes. Incremental approximate matrix factorization for speeding up support vector machines. In Proc. ACM KDD Conference, pages 760–766, 2006.

70. R.S. Xin, J. Rosen, M. Zaharia, M.J. Franklin, S. Shenker, and I. Stoica. Shark: SQL and rich analytics at scale. In Proc. ACM SIGMOD Conference, pages 13–24, 2013.

71. Y. Zhang, W. Zhang, and J. Yang. I/O-efficient statistical computing with RIOT. In Proc. IEEE ICDE Conference, 2010.