
Learning Linear Regression Models over Factorized Joins

Maximilian Schleich, Dan Olteanu, Radu Ciucanu

Department of Computer Science, University of Oxford
{max.schleich,dan.olteanu,radu.ciucanu}@cs.ox.ac.uk

Cultural Learnings of Factorized Joins for Make Benefit Glorious Family of Regression Tasks¹

ABSTRACT

We investigate the problem of building least squares regression models over training datasets defined by arbitrary join queries on database tables. Our key observation is that joins entail a high degree of redundancy in both computation and data representation, which is not required for the end-to-end solution to learning over joins.

We propose a new paradigm for computing batch gradient descent that exploits the factorized computation and representation of the training datasets, a rewriting of the regression objective function that decouples the computation of cofactors of model parameters from their convergence, and the commutativity of cofactor computation with relational union and projection. We introduce three flavors of this approach: F/FDB computes the cofactors in one pass over the materialized factorized join; F avoids this materialization and intermixes cofactor and join computation; F/SQL expresses this mixture as one SQL query.

Our approach has the complexity of join factorization, which can be exponentially lower than that of standard joins. Experiments with commercial, public, and synthetic datasets show that it outperforms MADlib, Python StatsModels, and R by up to three orders of magnitude.

1. INTRODUCTION

There is increasing interest in academia and industry in building systems that integrate databases and machine learning [17, 34, 1, 18]. This is driven by web companies, e.g., Google Brain, Twitter, Facebook's DeepFace, or Microsoft's platform, though it is also prominent in other industry segments such as retail-planning and forecasting applications [2]. State-of-the-art commercial analytics systems support descriptive, or backward-looking, analytics; predictive, or forward-looking, analytics such as classification and regression; and prescriptive analytics, which are also forward-looking and usually take the output of a predictive model as input. The typical data sources of interest are weekly sales data, promotions, and product descriptions. In these settings, the input to analytics is a relation representing the natural join of those data sources stored in a database. A typical prediction a retailer would like to compute is the additional demand generated for a given product due to a promotion.

¹ With apologies to Sacha Baron Cohen and Borat.

SIGMOD'16, June 26–July 01, 2016, San Francisco, CA, USA. © 2016 ACM. ISBN 978-1-4503-3531-7/16/06. DOI: http://dx.doi.org/10.1145/2882903.2882939

In our exploration with such systems on a real dataset from a large US retailer and following discussions with industry experts [23], we realized three major shortcomings of these systems: (1) Poor integration of analytics and databases, which are traditionally confined to distinct specialized systems in the ever-growing technology stack [2]; (2) poor efficiency already for few data sources; and (3) insufficient accuracy due to omission of further relevant data sources.

The use of typical data sources already stretches the scalability of existing systems, and the current processing regime is to manually partition the data, e.g., by market and category, and run analytics independently on each partition [23]. Domain expertise focuses on how to best partition the data, to the point where it becomes a black art. Data partitioning misses relevant correlations and prohibits common forecast patterns across different categories and markets. By processing separately by market/category, today's forecasting cannot leverage similar demand behavior across geography or time. For this reason, data partitioning is seen as a strong limitation of the state of the art and preference is given to running analytics on the entire dataset.

Additional data sources, such as customer reviews, basket data transactions, competitive promotions and prices, flu trends, loyalty program history, customer transaction history, social media text related to the retailer's products, and store attributes such as demographics, weather, or nearby competition, would enable forecasting with higher accuracy and at a more granular level. Incorporating more data sources would allow customer-specific promotions and separate out the impact of actions taken by the retailer (e.g., changing discounts and prices) from weather and environment [23]. However, it can increase the load by several orders of magnitude as the size of the input to analytics explodes.

A feasible approach to address these shortcomings would at least need to support efficient joining of many relations and running in-database analytics on large join results.

The main contribution of this paper is a batch gradient descent approach that can build linear regression models on training datasets defined by arbitrary join queries on tables.


Our key observation is that the intermediate join step represents the main bottleneck and it is unnecessarily expensive. It entails a high degree of redundancy in both computation and data representation, yet this is not required for the end-to-end solution, whose result is a list of real-valued parameters of the learned model. By computing a factorized join [5] instead of the standard flat join, we reduce data redundancy and improve performance for both the join and the learning steps. The theoretical and practical gains in both required memory space and time performance for factorized joins are well understood and can be asymptotically exponential in the size of the join [28], which translates to orders of magnitude for various datasets reported in experiments [5, 4].

Our learning approach is based on several contributions:

• It uses an algebraic rewriting of the regression's objective function that decouples the computation of cofactors of model parameters from their convergence.

• Cofactor computation commutes with relational union and projection. The commutativity with union enables the computation of the parameter cofactors for the entire dataset as the sum of corresponding cofactors for disjoint partitions of the input dataset. This property is essential for efficiency in a centralized setting and also desirable for concurrent learning, where cofactors of partitions are computed on different cores or machines. The commutativity with projection allows us to compute all cofactors once and then explore the space of possible models by only running convergence for a subset of parameters. This property is desirable for model selection, whose goal is to find a subset of features that best predict a test dataset.

• We introduce three flavors of our approach: F/FDB uses FDB [5] to materialize the factorized join and computes the cofactors in one pass over it; our baseline F avoids this materialization and blends cofactor and join computation; and F/SQL expresses this mixture of joins and cofactors as one optimized SQL query.

• Given a database D and a join query Q, our approach needs O(|D|^{fhtw(Q)}) time (modulo log factors), which is worst-case optimal for computing the factorized join whose nesting structure is defined by a hypertree decomposition of the query Q; in contrast, any relational engine can achieve at best O(|D|^{ρ*(Q)}) [3, 26, 41]. The gap between the fractional hypertree width fhtw(Q) and the fractional edge cover number ρ*(Q) can be as large as the number of relations in Q. For instance, fhtw = 1 for acyclic queries and our approach takes linear time. Learning over traditional flat joins is thus bound to be suboptimal.

• We can also learn linear functions over arbitrary basis functions, which include feature interactions.

• We benchmarked F and its variants against R, which uses QR decomposition [15], Python StatsModels [40], which uses the Moore-Penrose pseudoinverse to compute closed-form solutions (ols) [29], and MADlib [17], which also supports ols and Newton optimization for generalized linear models. In our experiments with public, commercial, and artificial datasets, F outperforms these competitors by up to three orders of magnitude while preserving the same accuracy.

2. FACTORIZED DATABASES: A PRIMER

In this paper, we rely on factorized databases to compute and represent join results that are input to learning regression models. We next introduce such databases by example and refer to the literature [5, 4, 28] for a rigorous treatment.

Factorized databases form a representation system for relational data that exploits laws of relational algebra, such as the distributivity of the Cartesian product over union, to reduce data and computation redundancy.

Example 2.1. Figure 1(a) depicts a database consisting of three relations along with their natural join: The relation House records house prices and living areas (in square meters) within locations given by zipcodes; TaxBand relates city/state tax bands with house living areas; Shops lists shops with zipcode and opening hours (in our experiments, we consider an extended dataset from a large US retailer; the join condition may also use a user-defined distance function).

The join result exhibits a high degree of redundancy. The value z1 occurs in 24 tuples, each value h1 to h3 occurs in eight tuples, and they are paired with the same combinations of values for the other attributes. Since z1 is paired in House with p1 to p3 and in Shops with h1 to h3, all combinations (indeed, the Cartesian product) of the former and the latter values occur in the join result. We can represent this local product symbolically instead of eagerly materializing it. If we systematically apply this observation, we obtain an equivalent factorized representation of the entire join result that is much more compact than the flat, tabular representation of the join result, cf. Figure 1(c) (the attribute names are clear from context). Each tuple in the result is represented once in the factorized join and can be constructed by following one branch of every union and all branches of a product. Whereas the flat join has 130 values (26 tuples of 5 values each), the factorized join only has 18 values.

The factorized join in Figure 1(c) has the nesting structure depicted in Figure 1(b): It is a union of Z-values occurring in the join of Shops and House on Z. For each Z-value z, we represent separately the union of H-values paired with z in Shops and the union of S-values paired with z in House and with T-values in TaxBand. That is, given z, the H-values are (conditionally) independent of the S-values and can be stored separately. This is where the factorization saves computation and space, as it avoids an explicit enumeration of all combinations of H-values and S-values for a given Z-value. Also, under each S-value, there is a union of T-values and a union of P-values. The factorization can be compacted further by caching subexpressions [28]: The S-value s2 occurs with its union of T-values t4 ∪ t5 from TaxBand, regardless of which Z-values s2 is paired with in House. This T-union can be stored once and reused for every occurrence of s2. □
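To make this representation concrete, the following Python sketch (ours, not part of the paper or of the FDB system) encodes the factorization of Figure 1(c) as nested union and product nodes and enumerates the flat tuples it represents by following one branch of every union and all branches of every product; the encoding and helper names are illustrative assumptions.

from itertools import product

# A factorization is a value node ("val", attribute, value, children), whose
# children form a product of sub-factorizations, or a union node ("union",
# children). Values are kept symbolic (strings) for readability.
def val(attr, v, *children):
    return ("val", attr, v, list(children))

def union(*children):
    return ("union", list(children))

# Factorization of Q(D) over the d-tree of Figure 1(b), cf. Figure 1(c).
F = union(
    val("Z", "z1",
        union(val("H", "h1"), val("H", "h2"), val("H", "h3")),
        union(val("S", "s1",
                  union(val("T", "t1"), val("T", "t2"), val("T", "t3")),
                  union(val("P", "p1"), val("P", "p2"))),
              val("S", "s2",
                  union(val("T", "t4"), val("T", "t5")),
                  union(val("P", "p3"))))),
    val("Z", "z2",
        union(val("H", "h4")),
        union(val("S", "s2",
                  union(val("T", "t4"), val("T", "t5")),
                  union(val("P", "p4"))))))

def tuples(node):
    """Enumerate the tuples (as dicts) represented by a factorization node."""
    if node[0] == "union":                       # one branch of every union
        for child in node[1]:
            yield from tuples(child)
    else:                                        # all branches of a product
        _, attr, v, children = node
        for combo in product(*[list(tuples(c)) for c in children]):
            t = {attr: v}
            for part in combo:
                t.update(part)
            yield t

print(len(list(tuples(F))))   # 26, the number of tuples in the flat join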

The nesting structures of factorized joins are called d-trees [28]. They are rooted forests representing partial orders of join variables. Each variable A in a d-tree ∆ has a subset key(A) of the set anc(A) of its ancestor variables on which A and its descendant variables may depend. A d-tree satisfies the following constraints. (1) The variables of each relation symbol in Q lie along the same root-to-leaf path (since they depend on each other). (2) For any child B of a variable A, key(B) ⊆ key(A) ∪ {A}. Caching of factorizations rooted at unions of A-values is useful when key(A) ⊂ anc(A), since A and its descendant variables do not depend on variables in anc(A) \ key(A) and thus factorizations rooted at A-values repeat for every tuple of values for these variables.


Shops          House            TaxBand
Z  H           Z  S  P          S  T
z1 h1          z1 s1 p1         s1 t1
z1 h2          z1 s1 p2         s1 t2
z1 h3          z1 s2 p3         s1 t3
z2 h4          z2 s2 p4         s2 t4
                                s2 t5

Shops ⋈ House ⋈ TaxBand
Z  H  S  P  T
z1 h1 s1 p1 t1
z1 h1 s1 p1 t2
z1 h1 s1 p1 t3
z1 h1 s1 p2 t1
z1 h1 s1 p2 t2
z1 h1 s1 p2 t3
z1 h1 s2 p3 t4
z1 h1 s2 p3 t5
· · · (the above for h2 and h3) · · ·
z2 h4 s2 p4 t4
z2 h4 s2 p4 t5

(a) The three relations of database D and the natural join Q(D).

(b) D-tree ∆: variable Z is the root with children H and S, and T and P are children of S; the relation Shops is covered by the path Z–H, House by the path Z–S–P, and TaxBand by the path S–T.

(c) Factorization of Q(D) over ∆:

z1 × (h1 ∪ h2 ∪ h3) × (s1 × (t1 ∪ t2 ∪ t3) × (p1 ∪ p2)  ∪  s2 × (t4 ∪ t5) × p3)
∪
z2 × h4 × (s2 × (t4 ∪ t5) × p4)

Figure 1: (a) Database D with relations House(Zipcode, Sqm, Price), TaxBand(Sqm, Tax), Shops(Zipcode, Hours), where the attribute names are abbreviated and the values are not necessarily distinct; (b) Nesting structure (d-tree) ∆ for the natural join of the relations; (c) Factorization ∆(D) of the natural join over ∆.

For instance, key(T) = {S} ⊂ anc(T) = {S, Z}.

The construction of d-trees is guided by the joins, cardinalities, and join selectivities. They can lead to factorizations of greatly varying sizes, where the size of a representation (flat or factorized) is defined as the number of its values. Within the class of factorizations over d-trees, we can find the worst-case optimal ones and also compute them in worst-case optimal time:

Proposition 2.2. Given a join query Q, for every database D, the join result Q(D) admits

• a flat representation of size O(|D|^{ρ*(Q)}) [3];

• a factorization over d-trees of size O(|D|^{fhtw(Q)}) [28].

There are classes of databases D for which the above size bounds are tight, and there are worst-case optimal join algorithms that compute the join result in these representations [26, 28].

The measures ρ*(Q) and fhtw(Q) are the fractional edge cover number and the fractional hypertree width, respectively. We know that 1 ≤ fhtw(Q) ≤ ρ*(Q) ≤ |Q|, where |Q| is the number of relations in query Q [28]. The gap between fhtw(Q) and ρ*(Q) can be as large as |Q|. The fractional hypertree width is fundamental to problem tractability, with applications spanning constraint satisfaction, databases, matrix operations, logic, and probabilistic graphical models [19].

A key observation is that aggregates defined by arithmetic expressions over data values with operations summation and multiplication, e.g., count and sum, can be computed in one pass over factorized joins [4]; we only need to change the union-Cartesian product semiring to the sum-multiplication semiring. For instance, to count the number of tuples of a relation represented by a subtree of the factorized join, we interpret each value as 1 and turn unions and Cartesian products into sums and multiplications, respectively. The count at the topmost union node is then the number of tuples in the query result. To sum over all values of an attribute A, the only difference to the previous count algorithm is that we now preserve the A-values. The construction of linear regression models concerns a family of such aggregates.
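As a small illustration of this semiring switch, the Python sketch below (our own, over a hypothetical nested-tuple encoding of a factorization rather than FDB's actual data structures) computes the tuple count and the sum over one attribute in a single bottom-up pass, turning unions into additions and Cartesian products into multiplications; the numeric values are made up.

# A node is ("union", [children]) or ("val", attribute, number, [children]),
# where a value node's children form a product of sub-factorizations.

def count(node):
    if node[0] == "union":
        return sum(count(c) for c in node[1])          # union -> sum
    c = 1
    for child in node[3]:
        c *= count(child)                              # product -> multiplication
    return c

def count_and_sum(node, attr):
    """One pass returning (tuple count, sum of attr over the represented tuples)."""
    if node[0] == "union":
        parts = [count_and_sum(c, attr) for c in node[1]]
        return sum(c for c, _ in parts), sum(s for _, s in parts)
    _, a, v, children = node
    c, s = 1, (v if a == attr else 0.0)
    for child in children:
        cc, cs = count_and_sum(child, attr)
        s = s * cc + cs * c   # the child's sum is weighted by the other factors' counts
        c = c * cc
    return c, s

# The z1 branch of Figure 1(c), with made-up numbers standing in for the values.
f = ("val", "Z", 8001, [
      ("union", [("val", "H", 9, []), ("val", "H", 10, []), ("val", "H", 11, [])]),
      ("union", [
          ("val", "S", 120, [
              ("union", [("val", "T", 1, []), ("val", "T", 2, []), ("val", "T", 3, [])]),
              ("union", [("val", "P", 300000, []), ("val", "P", 320000, [])])]),
          ("val", "S", 150, [
              ("union", [("val", "T", 4, []), ("val", "T", 5, [])]),
              ("union", [("val", "P", 400000, [])])])])])

print(count(f))                 # 24 tuples below z1
print(count_and_sum(f, "P"))    # (24, 7980000.0): each P-value weighted by its occurrence count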

3. LEARNING REGRESSION MODELS OVER FACTORIZED JOINS

We are given a training dataset of size m that is computed as a join of database tables:

{(y^(1), x_1^(1), . . . , x_n^(1)), . . . , (y^(m), x_1^(m), . . . , x_n^(m))}.

The values y^(i) are called labels and the other values are called features. They are all real numbers. For our training dataset, a natural label would be P, to predict the price given the other features. Our goal is to learn the parameters θ = (θ_0, . . . , θ_n) of the linear function

h_θ(x) = θ_0 + θ_1 x_1 + . . . + θ_n x_n

that approximates the label y of unseen tuples (x_1, . . . , x_n). For uniformity, we add x_0 = 1 so that h_θ(x) = Σ_{k=0}^n θ_k x_k.

The error of our model is given by the so-called least squares regression objective function:

J(θ) = (1/2) Σ_{i=1}^m (h_θ(x^(i)) − y^(i))².    (1)

This is a much studied objective function, since minimizing sum-of-squared errors is equivalent to finding a maximum likelihood solution under a Gaussian noise model [6].

We consider batch gradient descent (BGD) [6], which repeatedly updates the parameters of h_θ in the direction of the gradient to decrease the error given by J(θ):

∀ 0 ≤ j ≤ n : θ_j := θ_j − α (∂/∂θ_j) J(θ)
            := θ_j − α Σ_{i=1}^m (Σ_{k=0}^n θ_k x_k^(i) − y^(i)) x_j^(i).

The value α is the learning rate and can adapt with the iterations. A naïve implementation of the above expression would start with some initial values for the parameters θ_k and perform one pass over the dataset to compute the value of the sum aggregate, followed by one approximation step for the parameters, and repeat this process until convergence.
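For reference, such a naive loop over a flat, in-memory dataset can be sketched in a few lines of NumPy (our illustration of textbook batch gradient descent, not code from any of the systems discussed in this paper; the synthetic data, fixed learning rate, and scaling by m are illustrative choices).

import numpy as np

rng = np.random.default_rng(0)
m, n = 1000, 3                        # m training tuples, n features
X = np.hstack([np.ones((m, 1)), rng.normal(size=(m, n))])   # x0 = 1 for the intercept
y = X @ np.array([2.0, -1.0, 0.5, 3.0]) + rng.normal(scale=0.1, size=m)

theta = np.zeros(n + 1)
alpha = 0.1
for _ in range(2000):                 # one full pass over the data per iteration
    gradient = X.T @ (X @ theta - y)  # sum_i (h_theta(x_i) - y_i) * x_i
    theta -= alpha * gradient / m     # dividing by m keeps the step size stable
print(theta)                          # approximately [2, -1, 0.5, 3]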


This is not practical, since it is inefficient to go over the entire dataset for each iteration. BGD can however become very competitive when adapted to work on factorized joins.

BGD has two logically independent tasks that are intertwined in the above expression: The computation of the sum aggregate and the convergence of the parameters. The sum aggregate can be rewritten so that the label y becomes part of the features and has a predefined parameter θ = −1:

∀ 0 ≤ j ≤ n : S_j = Σ_{i=1}^m (Σ_{k=0}^n θ_k x_k^(i)) x_j^(i)    (2)

Our approach is guided by two main insights.

Our first insight is that the sum aggregates can be rewritten so that we explicate the cofactor of each parameter θ_k in each sum aggregate S_j:

∀ 0 ≤ j ≤ n : S_j = Σ_{k=0}^n θ_k × Cofactor[k, j],    (3)

where Cofactor[k, j] = Σ_{i=1}^m x_k^(i) x_j^(i).    (4)

This reformulation allows us to decouple cofactor computation from parameter convergence. This is crucial for performance, as we do not need to scan the dataset for each approximation step. Our system F computes the cofactors once and performs parameter convergence directly on the matrix of cofactors, whose size is independent of the data size m. For convergence, F uses an adaptation of AdaGrad [13].
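The decoupling can be illustrated on flat data with the following NumPy sketch (ours; F itself is a C++ engine and uses an AdaGrad variant, whereas this sketch uses a plain fixed-rate update). The label is appended as an extra feature whose parameter is fixed to −1, the cofactor matrix of Equation (4) is computed in one pass, and all subsequent gradient steps touch only that matrix, never the data.

import numpy as np

rng = np.random.default_rng(1)
m, n = 5000, 3
X = np.hstack([np.ones((m, 1)), rng.normal(size=(m, n))])        # x0 = 1
y = X @ np.array([1.0, 2.0, -3.0, 0.5]) + rng.normal(scale=0.1, size=m)

# One pass over the data: cofactors over the features extended with the label.
E = np.hstack([X, y[:, None]])
cofactor = E.T @ E                      # Cofactor[k, j] = sum_i x_k^(i) x_j^(i)

# Convergence uses only the cofactor matrix: with the label's parameter fixed
# to -1, S_j = sum_k theta_k * Cofactor[k, j] is exactly the gradient (Eq. (3)).
theta = np.append(np.zeros(n + 1), -1.0)
alpha = 0.1 / m
for _ in range(5000):
    S = cofactor @ theta
    theta[:n + 1] -= alpha * S[:n + 1]  # the label entry stays fixed at -1
print(theta[:n + 1])                    # approximately [1, 2, -3, 0.5]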

Furthermore, the cofactor matrix has desirable properties:

Proposition 3.1. Let (Q, D) be a pair of a join query Q and a database D defining the training dataset Q(D) with schema/features σ = (A_0, . . . , A_n). Let Cofactor be the cofactor matrix for learning the parameters θ_{A_0}, . . . , θ_{A_n} of the function f_θ = Σ_{k=0}^n θ_{A_k} x_{A_k} using batch gradient descent.

The cofactor matrix has the following properties:

1. Cofactor is symmetric:

∀ 0 ≤ k, j ≤ n : Cofactor[A_k, A_j] = Cofactor[A_j, A_k].

2. Cofactor computation commutes with union: Given training datasets Q(D_1), . . . , Q(D_p) with cofactor matrices Cofactor_1, . . . , Cofactor_p, where D = ⋃_{l=1}^p D_l, then

∀ 0 ≤ k, j ≤ n : Cofactor[A_k, A_j] = Σ_{l=1}^p Cofactor_l[A_k, A_j].

3. Cofactor computation commutes with projection: Given a feature set L ⊆ σ and the cofactor matrix Cofactor_L for the training dataset π_L(Q(D)), then

∀ 0 ≤ k, j ≤ n such that A_k, A_j ∈ L : Cofactor_L[A_k, A_j] = Cofactor[A_k, A_j].

The symmetry property implies that we only need to compute the upper half of the cofactor matrix.

The commutativity with union means that the cofactor matrix for the union of several training datasets is the entrywise sum of the cofactor matrices of these training datasets. This property is key to the efficiency of our approach, since we can locally compute partial cofactors over different partitions of the training dataset and then add them up. It is also desirable for concurrent computation, where partial cofactors can be computed on different cores or machines.

The commutativity with projection allows us to use the cofactor matrix to compute any subset of the parameters: All it takes is to ignore from the matrix the columns and rows for the irrelevant parameters. This is beneficial if some features are necessary for constructing the dataset but irrelevant for learning, e.g., relation keys supporting the join such as zipcode in our training dataset in Figure 1(a). It is also beneficial for model selection, a key challenge in machine learning centered around finding the subset of features that best predict a test dataset. Model selection is a laborious and time-intensive process, since it requires independently learning parameters corresponding to subsets of the available features. With our reformulation, we first compute the cofactor matrix for all features and then perform convergence on top of the cofactor matrix for the entire lattice of parameters, independently of the data. Besides choosing the features after cofactor computation, we may also choose the label and fix its model parameter to −1.
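Both commutativity properties are easy to check numerically; the NumPy sketch below (an illustration of Proposition 3.1 on a flat, randomly generated dataset, not code from F) sums per-partition cofactor matrices and extracts the submatrix for a feature subset.

import numpy as np

rng = np.random.default_rng(2)
X = rng.normal(size=(1000, 4))                  # features A0..A3 of a training dataset
cofactor = X.T @ X                              # cofactor matrix of the full dataset

# Commutativity with union: cofactors of disjoint partitions add up entrywise.
partitions = np.array_split(X, 4)
assert np.allclose(cofactor, sum(Xp.T @ Xp for Xp in partitions))

# Commutativity with projection: the cofactor matrix of a feature subset is the
# corresponding submatrix of the full cofactor matrix.
keep = [0, 2]                                   # keep features A0 and A2 only
assert np.allclose(cofactor[np.ix_(keep, keep)], X[:, keep].T @ X[:, keep])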

The two commutativity properties in Proposition 3.1 hold under bag (SQL) semantics, in the sense that the relational projection and union operators do not remove duplicates. This is important, since learning is sensitive to duplicates.

Our second insight is that we can compute the cofactors in one pass over any factorized join representing the training dataset, which has the flat join as a special case:

Proposition 3.2. Let (Q, D, ∆) be a triple of a join query Q, a database D, and a d-tree ∆ of Q. Let the training dataset be the factorized join ∆(D) with features (A_0, . . . , A_n), and let Cofactor be the cofactor matrix for learning the parameters θ_{A_0}, . . . , θ_{A_n} of the function Σ_{k=0}^n θ_{A_k} x_{A_k} using batch gradient descent. Then, Cofactor can be computed in one pass over the factorized join ∆(D).

Section 4 gives an algorithm to compute the cofactor matrix over the materialized factorized join. Section 5 then shows how to avoid this materialization by intertwining cofactor and join computation. Finally, Section 6 shows how to encode the previous algorithm in one SQL query. An immediate implication is that the redundancy in the flat join result is not necessary for learning:

Theorem 3.3. The parameters of any linear function over features from a training dataset defined by a database D and a join query Q can be learned in time O(|D|^{fhtw(Q)}).

Theorem 3.3 is a direct corollary of Propositions 2.2 and 3.2. We recall our discussion in Section 2 that within the class of factorized representations over d-trees, this time complexity is essentially worst-case optimal, in the sense that there is no join algorithm that can achieve a lower worst-case time complexity. To put this result into a broader context, any worst-case optimal join algorithm that produces flat join results, such as NPRR [26] or LogicBlox's LeapFrog TrieJoin [41], would need time at least O(|D|^{ρ*(Q)}) to create the training dataset, yet the gap between ρ*(Q) and fhtw(Q) can be as large as the number of relations in the join query.

Our approach extends to linear functions Σ_{k=0}^n θ_k φ_k(x) with arbitrary basis functions φ_k over a tuple x of features. We previously discussed the identity basis functions φ_k(x_k) = x_k. Further examples are polynomials and Gaussian Radial Basis Functions [6]. While basis functions over a single feature x_k are trivially supported, feature interactions are challenging, as they may restrict the structure of the factorized join.


For instance, the basis function φ_k(x_i, x_j) = x_i · x_j can only be supported efficiently by d-trees where the attributes x_i and x_j are along the same root-to-leaf path, as if they were attributes of the same relation, since we need to compute all possible combinations of values for x_i and x_j. We can enforce this path constraint by enriching the database with one (not materialized) relation over the schema R_k(x_i, x_j) and the query Q with a natural join with R_k. The d-trees for the enriched query will necessarily satisfy the new path constraint and the factorization will have a new value for every combination of values for x_i and x_j along a same path. We can thus add to the factorization a value ⟨φ_k : φ_k(x_i, x_j)⟩ under each pair of values for x_i and x_j.
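On flat data, the effect of basis functions is simply a transformation of the design matrix; the short NumPy sketch below (ours, and unrelated to the factorized handling just described) adds the interaction φ(x1, x2) = x1 · x2 as an extra column and learns its parameter like any other, here with the normal equations for brevity.

import numpy as np

rng = np.random.default_rng(3)
m = 2000
x1, x2 = rng.normal(size=m), rng.normal(size=m)
y = 1.0 + 2.0 * x1 - 0.5 * x2 + 3.0 * x1 * x2 + rng.normal(scale=0.1, size=m)

# Transformed features: intercept, identity basis functions, and the interaction.
Phi = np.column_stack([np.ones(m), x1, x2, x1 * x2])
theta = np.linalg.solve(Phi.T @ Phi, Phi.T @ y)
print(theta)    # approximately [1, 2, -0.5, 3]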

We can rephrase Theorem 3.3 for linear functions with basis functions as follows. We say that the basis functions φ_0, . . . , φ_b over the sets of features S_0, . . . , S_b induce a relational schema σ = (R_0(S_0), . . . , R_b(S_b)). Given a join query Q and the above schema σ, an extension of Q with respect to σ is a join query Q_σ = Q ⋈ R_0 ⋈ · · · ⋈ R_b.

Theorem 3.4. Let Q be a join query and D a database that define the training dataset Q(D), and f_θ a linear function with basis functions that induce a relational schema σ. Let Q_σ be the extension of Q with respect to σ. Then, the parameters of f_θ can be learned in time O(|D|^{fhtw(Q_σ)}).

The two insights discussed above complement each other; in particular, Proposition 3.1 still holds in the presence of factorized joins. The commutativity with projection is especially useful in conjunction with factorization, since it does not restrict our choice of possible d-trees for the factorized join depending on the input features used for learning. We may choose the best possible factorization of the join result and, at learning time, skip over irrelevant attributes (e.g., join attributes). Explicitly removing join attributes from the factorized join may in fact lead to larger representations (this contrasts with the flat case). For instance, if we eliminated from the factorized join in Figure 1(c) all Z- and S-values, then the remaining attributes H, T, and P would become dependent on each other (they were independent conditioned on values for Z and S). The only permissible d-trees would be paths, and the factorized join may be asymptotically as large as the flat join.

For simplicity of exposition, we assume the following (not required in practice). Each relation has one extra attribute I with value 1 to accommodate the intercept θ_0. The d-trees for the join of the relations have the variable I as root. The factorized join has the I-value 1 as root.

4. F/FDB: COFACTOR COMPUTATION ON MATERIALIZED FACTORIZED JOINS

The cofactors in Equation (4) can be rewritten to mirror the factorization in the factorized join. In this section, we introduce F/FDB, an approach that relies on FDB [5] to compute the factorized join and that computes the rewritten cofactors in one pass over the factorized join.

4.1 Cofactor Computation By Factorization

We explain cofactor factorization by examples.

Example 4.1. For the training dataset TD in Figure 1(a), there is one sum aggregate in Equation (2) per feature (i.e., attribute) in the dataset. For feature Z, we obtain:

SZ = (θZ z1 + θH h1 + θS s1 + θP p1 + θT t1) z1 +
     (θZ z1 + θH h1 + θS s1 + θP p1 + θT t2) z1 +
     (θZ z1 + θH h1 + θS s1 + θP p1 + θT t3) z1 +
     . . . (the above block repeated for p2)
     (θZ z1 + θH h1 + θS s2 + θP p3 + θT t4) z1 +
     (θZ z1 + θH h1 + θS s2 + θP p3 + θT t5) z1 +
     . . . (all of the above repeated for h2 and h3)
     (θZ z2 + θH h4 + θS s2 + θP p4 + θT t4) z2 +
     (θZ z2 + θH h4 + θS s2 + θP p4 + θT t5) z2.

We can reformulate the aggregate SZ using the rewritings

Σ_{i=1}^n x → x · n    and    Σ_{i=1}^n x · a_i → x · Σ_{i=1}^n a_i :

SZ = θZ [ z1 · (z1 + · · · + z1) + z2 · (z2 + · · · + z2) ] +
     θH [ z1 · Σ_{i=1}^3 (hi + · · · + hi) + z2 · (h4 + · · · + h4) ] +
     θS [ z1 · Σ_{i=1}^2 (si + · · · + si) + z2 · (s2 + · · · + s2) ] +
     θP [ z1 · Σ_{i=1}^3 (pi + · · · + pi) + z2 · (p4 + · · · + p4) ] +
     θT [ z1 · Σ_{i=1}^5 (ti + · · · + ti) + z2 · Σ_{i=4}^5 (ti + · · · + ti) ],

where each repeated sum x + · · · + x has as many summands as there are tuples in TD with the corresponding values, e.g., |σ_{Z=z1}(TD)| summands for z1 + · · · + z1 under θZ and |σ_{Z=z1,H=hi}(TD)| summands for hi + · · · + hi under θH.

We then obtain the following cofactors in the sum SZ:

Q[Z, Z] = z1 · 24 z1 + z2 · 2 z2
Q[H, Z] = z1 · 8 (h1 + h2 + h3) + z2 · 2 h4
Q[S, Z] = z1 · 3 (6 s1 + 2 s2) + z2 · 2 s2
Q[P, Z] = z1 · 3 [3 (p1 + p2) + 2 p3] + z2 · 2 p4
Q[T, Z] = z1 · 3 [2 (t1 + t2 + t3) + t4 + t5] + z2 · (t4 + t5). □

This arithmetic factorization is not arbitrary. It considers the arithmetic expressions grouped by the join Z-values, as done by the d-tree in Figure 1(b). Each cofactor in SZ is expressed as a sum of terms, with one term per join Z-value z1 and z2. The numerical values occurring in the cofactors represent occurrence counts, e.g., 24 in L[Z] = 24 z1 states that z1 occurs in 24 tuples in the training dataset, while 8 in L[H] = 8 (h1 + h2 + h3) states that each of h1, h2, and h3 occurs in 8 tuples with z1. The expressions L[Z] and L[H] represent sums of Z-values and respectively H-values that occur in the same tuples with z1 and that are weighted by their occurrence counts.

The above rewritings do not capture the full spectrum of possible computational savings: They are sufficient for cofactors of parameters θX in the sum SY, where X and Y are from the same input relation. Moreover, they do not bring asymptotic savings. In case X and Y are from different input relations, we can potentially save more computation.

Example 4.2. Consider now a rewriting of the cofactor of parameter θP in the sum ST:

Q[P, T] = 3 [(p1 + p2) · (t1 + t2 + t3)] + 3 [p3 · (t4 + t5)] + p4 · (t4 + t5).

The three terms in the outermost sum correspond to different pairs of join values for Z and S, namely (z1, s1), (z1, s2), and (z2, s2).


f-fdb (Factorization E)
  if (visited) return;
  switch E:
    case ⟨A : a⟩ × (E1, . . . , Ek):
      CE = 1;
      foreach j ∈ [k] do { f-fdb(Ej); CE = CE · CEj; }
      LE[A] = a · CE;  QE[A, A] = a · LE[A];
      foreach j ∈ [k] do {
        foreach B ∈ Schema[Ej] do { LE[B] = LEj[B] · CE/CEj;  QE[A, B] = a · LEj[B] · CE/CEj; }
        foreach B, D ∈ Schema[Ej] s.t. B < D do QE[B, D] = QEj[B, D] · CE/CEj;
        foreach j < l ∈ [k], B ∈ Schema[El], D ∈ Schema[Ej] do QE[B, D] = LEl[B] · LEj[D] · CE/(CEl · CEj);
      }
    case ∪ (E1, . . . , Ek):
      CE = 0; foreach j ∈ [k], B, D ∈ Schema[Ej] do { LE[B] = 0; QE[B, D] = 0; }
      foreach j ∈ [k] do { f-fdb(Ej); CE += CEj; }
      foreach j ∈ [k], B ∈ Schema[Ej] do {
        LE[B] += LEj[B];
        foreach D ∈ Schema[Ej] do QE[B, D] += QEj[B, D];
      }
  visited = true;

Figure 2: F/FDB: Algorithm for computing regression aggregates (constant CE, linear LE, quadratic QE) in one pass over a factorized join E.

This rewriting thus follows the same join order as the d-tree in Figure 1(b). These terms read as follows: Each of the P-values p1 and p2 occurs in three tuples with each of the T-values t1 to t3; the P-value p3 occurs in three tuples with each of the T-values t4 and t5; the P-value p4 occurs in one tuple with each of the T-values t4 and t5. □

The rewritten expression for Q[P, T ] factors out sums, e.g.,p1 + p2, using a rewriting more powerful than those for thesum SZ given in Example 4.1 that can transform expressionsto exponentially smaller equivalent ones:

r∑i=1

s∑j=1

(xi · yj)→ (

r∑i=1

xi) · (s∑j=1

yj).

The above rewritings are already implemented by the factorized join from Figure 1(c). For instance, the sums of values in the cofactors mentioned in Examples 4.1 and 4.2 can be recovered via unions of their corresponding values in the factorization. Since Z-values are above the values for the other attributes, the former are in one-to-many relationships with the latter. This explains the rewritings in Example 4.1 for the cofactors in the sum SZ: Each of z1 and z2 is paired with the weighted sum of all H-values underneath, namely 8 (h1 + h2 + h3) and respectively 2 h4. Similar pairings are with the weighted sums of values for each of the other attributes. Since P and T are on different branches in the d-tree, a Cartesian product of a union of P-values and a union of T-values becomes a product of the sums of the corresponding P-values and T-values.
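The factored form of Q[P, T] from Example 4.2 can be checked against the flat join directly; the Python sketch below (ours, with made-up numbers standing in for the symbolic P- and T-values) enumerates the 26 tuples of the join in Figure 1(a) and compares the sum of P · T with the three-term expression.

# Relations of Figure 1(a), with numeric stand-ins for the P- and T-values.
shops   = [("z1", "h1"), ("z1", "h2"), ("z1", "h3"), ("z2", "h4")]
house   = [("z1", "s1", "p1"), ("z1", "s1", "p2"), ("z1", "s2", "p3"), ("z2", "s2", "p4")]
taxband = [("s1", "t1"), ("s1", "t2"), ("s1", "t3"), ("s2", "t4"), ("s2", "t5")]
num = {"p1": 10.0, "p2": 20.0, "p3": 30.0, "p4": 40.0,
       "t1": 1.0, "t2": 2.0, "t3": 3.0, "t4": 4.0, "t5": 5.0}

# Flat natural join and the cofactor Q[P, T] = sum over its tuples of P * T.
join = [(z, h, s, p, t)
        for (z, h) in shops
        for (hz, s, p) in house if hz == z
        for (ts, t) in taxband if ts == s]
flat_q_pt = sum(num[p] * num[t] for (_, _, _, p, t) in join)

# Factored form from Example 4.2.
p1, p2, p3, p4 = (num[k] for k in ("p1", "p2", "p3", "p4"))
t1, t2, t3, t4, t5 = (num[k] for k in ("t1", "t2", "t3", "t4", "t5"))
factored_q_pt = 3 * (p1 + p2) * (t1 + t2 + t3) + 3 * p3 * (t4 + t5) + p4 * (t4 + t5)

assert len(join) == 26 and abs(flat_q_pt - factored_q_pt) < 1e-9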

Examples 4.1 and 4.2 show that cofactor computation requires three types of aggregates: Constant aggregates that are occurrence counts; linear aggregates that are weighted sums, i.e., sums over features or products of linear and constant aggregates; and quadratic aggregates that are cofactors, i.e., products of features and/or linear aggregates, or of quadratic and constant aggregates. Constant aggregates are real numbers and are denoted by C. Linear aggregates are arrays of reals, one per feature A, denoted by L[A]. Quadratic aggregates are matrices of reals, one per pair of features (A, B), denoted by Q[A, B].

Definition 4.3. A regression aggregate is a tuple (C, L, Q) of constant, linear, and quadratic aggregates. We use indices E and ∆ to refer to the regression aggregates (CE, LE, QE) for a factorization E and (C∆, L∆, Q∆) for a d-tree ∆.

4.2 Computing Regression Aggregates

The algorithm f-fdb in Figure 2 computes the regression aggregates, and in particular the cofactors, at each node in the input factorization E in one pass over E.

Let us denote by ⟦E⟧ the relation represented by the factorization E. The schema Schema[E] of ⟦E⟧ is the set of features (or query variables) from the d-tree of E. The Cartesian product and union operators in the factorization translate to multiplication and summation for regression aggregates. Since they commute with union, we compute them for each child of a union and then add them entrywise.

The constant aggregate CE is the number of tuples in ⟦E⟧. The constant aggregates are used to compute occurrence counts as follows. If E is the child of a product E×, its occurrence count in E× (i.e., the occurrence count of each tuple in ⟦E⟧) is the product of the constant aggregates of its siblings. This is correct, since each of the tuples represented by E is extended in the relation ⟦E×⟧ by each tuple represented by each of E's siblings.

The linear aggregate LE[A] for a feature A ∈ Schema[E] is the sum of the A-values in E, each weighted by its occurrence count given by the constant aggregates.

The quadratic aggregate QE[B, D] for features B, D ∈ Schema[E] is the sum of the quadratic aggregates QEj[B, D] for the children Ej of E if both B and D occur in Ej, or the product of the linear aggregates for each of B and D, weighted by their occurrence counts, if B and D are from different children of E. We use the symmetry of the cofactor matrix (cf. Proposition 3.1) to only compute the quadratic aggregates for B and D in case B occurs before D in the depth-first left-to-right preorder traversal of the d-tree (and factorization). The cofactor matrix is formed by the quadratic aggregates at the root of the factorization.
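To make the one-pass computation concrete, here is a compact Python sketch in the spirit of f-fdb (our own simplification over a hypothetical nested-tuple encoding of the factorized join; it omits the visited flag and FDB's actual data structures). It returns, for each subtree, the constant aggregate, the linear aggregates per attribute, and the quadratic aggregates per attribute pair, combining children multiplicatively at product nodes and additively at union nodes.

from collections import defaultdict

def key(b, d):
    return (b, d) if b <= d else (d, b)       # the cofactor matrix is symmetric

def aggregates(node):
    """Return (C, L, Q) for a node ("union", children) or ("val", attr, value, children).
    C: tuple count; L[attr]: weighted sum of attr-values; Q[(B, D)]: cofactor of B and D."""
    if node[0] == "union":                    # unions: add the children's aggregates
        C, L, Q = 0, defaultdict(float), defaultdict(float)
        for child in node[1]:
            c, l, q = aggregates(child)
            C += c
            for b, v in l.items():
                L[b] += v
            for bd, v in q.items():
                Q[bd] += v
        return C, L, Q
    _, A, a, children = node                  # value node <A:a> over a product of children
    child_aggs = [aggregates(child) for child in children]
    C = 1
    for c, _, _ in child_aggs:
        C *= c
    L, Q = defaultdict(float), defaultdict(float)
    L[A] = a * C
    Q[key(A, A)] = a * a * C
    for j, (cj, lj, qj) in enumerate(child_aggs):
        scale = C / cj                        # occurrence count contributed by the siblings
        for b, v in lj.items():
            L[b] += v * scale
            Q[key(A, b)] += a * v * scale
        for bd, v in qj.items():
            Q[bd] += v * scale
        for cl, ll, _ in child_aggs[j + 1:]:  # cross terms between different children
            for b, vb in ll.items():
                for d, vd in lj.items():
                    Q[key(b, d)] += vb * vd * C / (cl * cj)
    return C, L, Q

# Tiny factorization z * (h1 ∪ h2) * (s1 * (t1 ∪ t2)); the numbers are made up.
f = ("val", "Z", 7.0, [
      ("union", [("val", "H", 1.0, []), ("val", "H", 2.0, [])]),
      ("union", [("val", "S", 5.0, [
          ("union", [("val", "T", 3.0, []), ("val", "T", 4.0, [])])])])])
C, L, Q = aggregates(f)
print(C, Q[("H", "T")])    # 4 tuples; cofactor (1+2)*(3+4) = 21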


[Figure 3 shows the factorized join of Figure 1(c) with an extra root value 1 for the intercept, annotated with the constant and linear aggregates computed in one bottom-up pass at its product and union nodes, and, at the bottom, the cofactor matrix formed by the quadratic aggregates at the root. For instance, the constant aggregate at the root is 26, and the first row of the matrix (the cofactors in the sum SI) is: 26, 24 z1 + 2 z2, 3 (6 s1 + 2 s2) + 2 s2, 9 (p1 + p2) + 6 p3 + 2 p4, 6 (t1 + t2 + t3) + 3 (t4 + t5), 8 (h1 + h2 + h3) + 2 h4.]

Figure 3: The factorized join in Figure 1(c) with an extra root value 1 for the intercept. Constant (encircled) and linear (boxed) aggregates are shown for product and union nodes. The cofactor matrix (bottom) is formed by the quadratic aggregates for the root node. We may first compute the matrix and then explore possible regression models by choosing different sets of features and the label to predict.

To support caching, we use a Boolean flag visited for each node in the factorization. Initially, this flag is false. After the node is visited, the flag is toggled so that we recognize visited nodes and skip them should we arrive again at them. The parent of an already visited node ν can just use the previously computed regression aggregates for ν.

Example 4.4. Figure 3 shows the factorization in Figure 1(c) annotated with constant and linear aggregates as computed in one bottom-up pass. The cofactor matrix is computed in the same pass. The top left entry in the matrix is the cofactor of the intercept θI in the sum SI and represents the number of tuples in the flat join result. Since the matrix is symmetric, we only compute the entries above and including the diagonal. □

Our algorithm inherits the data complexity of computing the factorized join, as it is linear in its size. At each node ν, the algorithm f-fdb recurses once for each child of ν and needs time linear in its schema to compute the constant and linear aggregates, and quadratic in its schema to compute the quadratic aggregates. The time complexity is thus linear in the size of the input factorized data and quadratic in the size of the schema (number of features). We require space to store the factorized data. For each node in the factorization, we also require space quadratic in the schema size to store the regression aggregates. In the next section, we show how to avoid storing the factorized data and how to reduce the number of regression aggregates to one per variable in the d-tree.

5. F: MIXING COFACTOR AND FACTORIZED JOIN COMPUTATION

This section introduces two improvements to F/FDB. (1) We avoid the materialization of the factorized join. (2) We reduce the number of regression aggregates from one per node in the factorized join to one per node in its d-tree. This new algorithm is called F. It is an in-memory monolithic engine optimized for regression aggregates over joins.

Figure 4 gives F's core procedure. It takes as input the database relations R1, . . . , Rd and a d-tree ∆ for the given join query. It computes a tuple (C∆, L∆, Q∆) of regression aggregates for the factorized join over ∆. Instead of materializing the factorized join, we iterate over its paths of values that are mapped to paths of variables in ∆. Such a mapping is kept in varMap. This join approach is reminiscent of LeapFrog TrieJoin [41], with the difference that instead of using a total variable (or join) order, we use a partial variable order defined by a d-tree. More branches in a d-tree mean a higher factorization degree and performance improvement.

The relations are assumed sorted on their attributes following a depth-first pre-order traversal of ∆. Each call takes a range defined by start and end indices in each relation. Initially, these ranges span the entire relations. Let us assume the variable at the root of ∆ is A. Once A is mapped to a value a in the intersection of possible A-values from the relations with attribute A, then these ranges are narrowed down to those tuples with value a for A. We may further narrow down these ranges using mappings for variables below A in ∆ at higher recursion depths. Each A-value a in this intersection is the root of a factorization fragment over ∆.


f (d-tree ∆, varMap, ranges [(start1, end1), . . . , (startd, endd)])
  A = var(∆); keyMap = π_{key(A)}(varMap); reset(C∆, L∆, Q∆);
  if (key(A) ≠ anc(A)) { (C∆, L∆, Q∆) = cacheA[keyMap]; if (C∆ ≠ 0) return; }
  foreach a ∈ ⋂_{i ∈ [d], A ∈ Schema[Ri]} πA(Ri[starti, endi]) do {
    foreach i ∈ [d] do find ranges Ri[start′i, end′i] ⊆ Ri[starti, endi] such that πA(Ri[start′i, end′i]) = {(A : a)};
    switch (∆):
      leaf node A:
        C∆ += 1; L∆[A] += a; Q∆[A, A] += a · a;
      inner node A(∆1, . . . , ∆k):
        foreach j ∈ [k] do f (∆j, varMap × {(A : a)}, ranges [(start′1, end′1), . . . , (start′d, end′d)]);
        C = C∆1 · . . . · C∆k;
        if (C ≠ 0) {
          C∆ += C; L∆[A] += a · C; Q∆[A, A] += a · a · C;
          foreach j ∈ [k] do {
            foreach B ∈ Schema[∆j] do { L∆[B] += L∆j[B] · C/C∆j; Q∆[A, B] += a · L∆j[B] · C/C∆j; }
            foreach B, D ∈ Schema[∆j] s.t. B ≤ D do Q∆[B, D] += Q∆j[B, D] · C/C∆j;
            foreach j < l ∈ [k], B ∈ Schema[∆l], D ∈ Schema[∆j] do Q∆[B, D] += L∆l[B] · L∆j[D] · C/(C∆l · C∆j);
          }
        }
  }
  if (key(A) ≠ anc(A)) cacheA[keyMap] = (C∆, L∆, Q∆);

Figure 4: F: Algorithm for computing regression aggregates (constant C∆, linear L∆, quadratic Q∆) for a given d-tree ∆ of the join query and a database (R1, . . . , Rd). The parameters of the initial call are the d-tree ∆, an empty variable map, and the full range of tuples for each relation.

Instead of using one regression aggregate for each A-value, we may use one running regression aggregate for all of them, since it commutes with their union. We thus only need as many regression aggregates as variables in the d-tree.

The regression aggregate for A is computed using the A-values and the aggregates for its children, if any. We first reset it, since it might have been used for different factorization fragments with the same nesting structure ∆. We next check whether we previously computed this aggregate for the same factorization fragment and cached it. Whereas for F/FDB caching is done by FDB, F needs to manage the cache itself. As explained in Section 2, the key of each variable tells us when caching may be useful: This happens when key(A) is strictly contained in anc(A), since this means that the factorization fragments over the d-tree rooted at A are repeated for every distinct combination of values for variables in anc(A) \ key(A). In this case, we probe the cache using as key the mappings from varMap of the variables in key(A). If we have already visited that factorization fragment, then we can just use its previously computed regression aggregate and avoid recomputation. If this is the first visit, we insert the aggregate in the cache after we compute it. If the aggregate is not already in the cache, we compute it precisely as for F/FDB. The case when ∆ is an inner node is a combination of the two cases in Figure 2: We have a union of A-values, each on top of a product of subexpressions. The case when ∆ is a leaf node corresponds to the first case in Figure 2 without subexpressions Ej.

F has the same time complexity as F/FDB (including FDB's computation of the factorized join). The key operation needed for joins is the intersection of the arrays of ordered values defined by the relation ranges. This is done efficiently using the unary leapfrog join [41]. Given d ordered arrays L1, . . . , Ld, where Nmin = min{|L1|, . . . , |Ld|} and Nmax = max{|L1|, . . . , |Ld|} are the sizes of the smallest and respectively largest array, their intersection takes time O(Nmin log(Nmax/Nmin)), which is the size of the smallest array if we ignore log factors. The time to compute the regression aggregates is the same as for F/FDB. The amount of extra memory necessary for computing the join is however limited: For each relation, there is one index range that is a pair of numbers, varMap has at most one mapping per variable at any one time, and the number of regression aggregates is bounded by the number of variables in the d-tree.
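The intersection step can be sketched as follows (a simplified Python rendering of multi-way intersection over sorted arrays in the spirit of the unary leapfrog join; F's actual C++ implementation is not claimed here). Each array is advanced with a galloping search to the largest value currently seen, and a value is emitted only when all arrays agree on it.

from bisect import bisect_left

def seek(arr, lo, target):
    """Smallest index i >= lo with arr[i] >= target, using a galloping search."""
    if lo >= len(arr) or arr[lo] >= target:
        return lo
    bound = 1
    while lo + bound < len(arr) and arr[lo + bound] < target:
        bound *= 2
    return bisect_left(arr, target, lo + bound // 2 + 1, min(lo + bound + 1, len(arr)))

def intersect(arrays):
    """Intersect sorted, duplicate-free arrays of values."""
    if any(not a for a in arrays):
        return []
    pos = [0] * len(arrays)
    out = []
    k, target = 0, arrays[0][0]       # k: array holding the current largest value
    while True:
        for i in range(len(arrays)):
            if i == k:
                continue
            pos[i] = seek(arrays[i], pos[i], target)
            if pos[i] == len(arrays[i]):
                return out            # some array is exhausted
            if arrays[i][pos[i]] > target:
                k, target = i, arrays[i][pos[i]]
                break                 # restart the round with the new, larger target
        else:
            out.append(target)        # all arrays agree on target
            pos[k] += 1
            if pos[k] == len(arrays[k]):
                return out
            target = arrays[k][pos[k]]

print(intersect([[1, 3, 4, 7, 9, 12], [3, 4, 5, 7, 12, 15], [0, 3, 7, 8, 12]]))  # [3, 7, 12]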

6. F/SQL: F IN SQL

F's cofactor computation can be expressed using one SQL query whose result is a table with one row that contains all cofactors. This SQL encoding of F, called F/SQL, has desirable properties. (1) It can be computed by any relational database system and is thus readily deployable in practice with a small implementation overhead. (2) It works for datasets that do not fit in memory. (3) Like F, it pushes aggregates past joins for efficiency.

We generate the F/SQL query in two steps: We first rewrite the query so that cyclic joins are isolated and the rewritten query becomes (α-)acyclic. To retain the complexity of F, these isolated cyclic joins have to be computed using a worst-case optimal algorithm [41]. We then traverse bottom-up an optimal d-tree ∆ for the rewritten query and at each node with variable g we create a query that joins on g and then computes the regression aggregates for g. This evaluation strategy thus intertwines the joins and the aggregates like in F, which is a prerequisite for its good complexity. In contrast to F, F/SQL requires as input an extended d-tree where each input table R becomes a leaf under the lowest variable that corresponds to an attribute of R.


rewrite (d-tree ∆)
  QS = ∅; A = var(∆);
  if (∆ = A(∆1, . . . , ∆k)) QS = ⋃_{j ∈ [k]} rewrite(∆j);
  QA = relations(key(A) ∪ {A});
  if (∄ Q ∈ QS s.t. QA ⊆ Q) return QS ∪ {QA}
  else return QS

Figure 5: F/SQL: Query rewriting procedure.

6.1 Rewriting Arbitrary Joins to Acyclic Joins

The rewriting procedure is given in Figure 5 and works on a d-tree ∆ of the input join query Qin. The set QS is a disjoint partitioning of the set of relations in Qin into sets of relations, or partitions. Each partition QA corresponds to a join query that is materialized to a relation with the same name QA. The set key(A) ∪ {A} consists of those variables on which the variables in ∆ depend. D-trees are a different syntax for hypertree decompositions, and the set key(A) ∪ {A} represents the bag at a node in such a decomposition [28]. By materializing the joins defined by such bags, we simplify the hypertree decomposition or d-tree of Qin to that of an acyclic query Qout equivalent to Qin. In case Qin is already acyclic, each partition has one relation and hence Qout is syntactically equal to Qin.
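A direct transcription of Figure 5 into Python might look as follows (our sketch over a hypothetical d-tree node class; relations(·) is assumed to return the query relations whose schemas are covered by the given bag of variables, and the relation-to-edge assignment in the usage example is likewise an assumption for illustration).

from dataclasses import dataclass, field

@dataclass
class DTreeNode:
    var: str
    key: frozenset                 # key(A): the ancestors the subtree depends on
    children: list = field(default_factory=list)

def make_relations(schemas):
    """schemas: dict mapping each relation name to the set of its variables."""
    def relations(bag):
        return frozenset(r for r, vs in schemas.items() if vs <= set(bag))
    return relations

def rewrite(node, relations):
    """Partition the input relations as in Figure 5."""
    qs = set()
    for child in node.children:
        qs |= rewrite(child, relations)
    qa = relations(node.key | {node.var})
    if qa and not any(qa <= q for q in qs):    # the empty bag is never added
        qs.add(qa)
    return qs

# Bowtie query of Example 6.1: two triangles sharing C (edge naming assumed).
schemas = {"R1": {"A", "B"}, "R2": {"B", "C"}, "R3": {"A", "C"},
           "R4": {"C", "D"}, "R5": {"D", "E"}, "R6": {"C", "E"}}
dtree = DTreeNode("C", frozenset(), [
    DTreeNode("A", frozenset({"C"}), [DTreeNode("B", frozenset({"A", "C"}))]),
    DTreeNode("E", frozenset({"C"}), [DTreeNode("D", frozenset({"C", "E"}))]),
])
print(rewrite(dtree, make_relations(schemas)))
# {frozenset({'R1', 'R2', 'R3'}), frozenset({'R4', 'R5', 'R6'})}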

Example 6.1. Consider the following bowtie join query Q over relations R1, . . . , R6 together with a d-tree for it:

[The bowtie query joins two triangles that share the variable C: R1, R2, R3 over the variables A, B, C and R4, R5, R6 over the variables C, D, E. The d-tree has root C with children A and E, where B is a child of A and D is a child of E; the keys are key(A) = {C}, key(B) = {A, C}, key(E) = {C}, and key(D) = {C, E}.]

We state the keys next to each variable in the d-tree. We apply the rewriting algorithm. When we reach leaf B in the left branch, we create the join query QB over the relations {R1, R2, R3} and add it to QS. When we return from recursion to variable A, we create the query QA over the same relations, so we do not add it to QS. We proceed similarly in the right branch, create the join query QD over relations {R4, R5, R6}, and add it to QS. The queries at E and C are not added to QS. Whereas the original query and the two subqueries QB and QD are cyclic, the rewritten query Qout is the join of QB and QD on C and is acyclic. The triangle queries QB and QD cannot be computed worst-case optimally with traditional relational query plans [3], which calls for specialized engines to compute them [41].

The query in Figure 1(a) is acyclic. Using its d-tree from Figure 1(b), we obtain one identity query per relation. □

6.2 SQL for Learning over Acyclic Joins

The SQL construction algorithm is recursive on the structure of the extended d-tree. As we proceed bottom-up, we aggregate over the input attributes A and create sets of constant C, linear L, and quadratic Q aggregates.

At a leaf for a table R, we create a trivial query R:

R = SELECT A(R).*, C(R).*, L(R).*, Q(R).* FROM R;

where A is the schema of R, there is no linear and quadratic aggregate, and we add one constant aggregate with value 1.

At an inner node with variable g, where Gi is the query created at the i-th child, we create the aggregate query G over the natural join G⋈ of the queries G1, . . . , Gk on g:

G⋈ = SELECT Q(G⋈).*, L(G⋈).*, C(G⋈).*, A(G⋈).*
     FROM G1 NATURAL JOIN . . . Gk;

Q(G⋈) = {q · #−i | q ∈ Q(Gi), i ∈ [k]} ∪ {g · 1 · #}
        ∪ {li · lj | li ∈ L(Gi), lj ∈ L(Gj), 1 ≤ i < j ≤ k}
        ∪ {g · li · #−i | li ∈ L(Gi), i ∈ [k]} ∪ {g² · #}
L(G⋈) = {li · #−i | li ∈ L(Gi), i ∈ [k]} ∪ {g · #}
C(G⋈) = {#}
A(G⋈) = ⋃_{i=1}^k A(Gi) − {g}

#   = Π_{j=1}^k Π_{c ∈ C(Gj)} c        #−i = # / (Π_{c ∈ C(Gi)} c)

Two constant aggregates are used in several regression aggregates: # is the number of tuples in the result of G⋈ and #−i is # where we ignore child i from the join. The regression aggregates are computed as for F. We next group by the remaining attributes A(G⋈) and sum over each aggregate:

G = SELECT A(G).*, C(G).*, L(G).*, Q(G).*
    FROM G⋈ GROUP BY A(G).*;

Q(G) = {Σ q | q ∈ Q(G⋈)}      L(G) = {Σ l | l ∈ L(G⋈)}
C(G) = {Σ c | c ∈ C(G⋈)}      A(G) = A(G⋈)

Example 6.2. Consider the d-tree in Figure 1(b). We create the queries QP at node P and then QT at node T:

SELECT Z, S, sum(House_c) as P_c, sum(P) as P_l1,
       sum(P*P*House_c) as P_q1
FROM (SELECT Z, S, P, 1 as House_c FROM House)
GROUP BY Z, S;

SELECT S, sum(TaxBand_c) as T_c, sum(T) as T_l1,
       sum(T*T*TaxBand_c) as T_q1
FROM (SELECT S, T, 1 as TaxBand_c FROM TaxBand)
GROUP BY S;

At node S, we create the query QS using QP and QT :

SELECT Z, sum(S_c) as S_c, sum(S_l1) as S_l1,
       sum(S_l2) as S_l2, sum(S_l3) as S_l3,
       sum(S_q1) as S_q1, sum(S_q2) as S_q2,
       sum(S_q3) as S_q3, sum(S_q4) as S_q4,
       sum(S_q5) as S_q5, sum(S_q6) as S_q6
FROM (SELECT Z, P_c*T_c as S_c, P_l1*T_c as S_l1,
             T_l1*P_c as S_l2, S*S_c as S_l3,
             P_q1*T_c as S_q1, T_q1*P_c as S_q2,
             P_l1*T_l1 as S_q3, S*P_l1*T_c as S_q4,
             S*T_l1*P_c as S_q5, S*S_l3 as S_q6
      FROM Q_P NATURAL JOIN Q_T)
GROUP BY Z;

The nodes H and Z are treated similarly. The constructed query exploits caching: Indeed, for each distinct S-value s, QT computes the aggregates over T in TaxBand once. Without caching, we would repeat the computation of QT for every occurrence of the S-value s in House. □

F/SQL has the time complexity of F under the assumption that cyclic joins are computed using a worst-case optimal join algorithm like LFTJ. For (input and rewritten) acyclic queries, we may use standard query plans.


7. EXPERIMENTS

We report on the performance of an end-to-end solution for learning regression models over joins, which includes: (a) Constructing the training dataset via joins; (b) importing the dataset in the learning module; and (c) learning the parameters of regression functions. We show experimentally that the intermediate join step is unnecessarily expensive. It entails a high degree of redundancy in both computation and data representation, yet this is not required for the end-to-end solution, whose result is a list of real-valued parameters. By factorizing the join we reduce data redundancy while improving performance for both the join and the learning steps. A tight integration of learning and join processing also eliminates the need for the data import step.

We benchmark F and its flavors against three open-source systems: M (MADlib [17]), P (Python StatsModels [40]), and R [33]. We show that for the used datasets and learning tasks, F outperforms these systems by up to three orders of magnitude while maintaining their accuracy; we verified that for all systems the learned parameters coincide with high precision. This performance boost is due to three orthogonal optimizations: factorization and caching of data and computation; decoupling parameter convergence from cofactor computation; and shared cofactor computation.

Regression Learners. We implemented F and its flavors in C++. They all use AdaGrad to adapt the learning rate (α) based on history in the convergence component [13]. F/FDB uses FDB [5] to compute factorized joins via an in-memory multiway sort-merge join. F/SQL is F's encoding in SQL. For R we used lm (linear model) based on QR-decomposition [15]. For P we used ols (ordinary least squares) based on the Moore-Penrose pseudoinverse [29]. For M we used ols to compute the closed-form solution and glm based on Newton optimization for generalized linear models. M (glm), P, and R use PostgreSQL 9.4.4 for join computation (done by query plans with sort-merge joins for the Housing dataset and with in-memory hash joins for the other datasets). We tuned PostgreSQL for in-memory processing by setting its working memory to 14GB and shared buffers to 128MB, and by turning off parameters that affect performance (fsync, synchronous_commit, full_page_writes, bgwriter_lru_maxpages). We verified that it runs in memory by monitoring IO. Of all the systems, M (ols), F, and F/SQL do not require the materialization of the join query. P and R need to export data from PostgreSQL and load it into their memory space, which requires one pass over the flat join and construction of its internal representation.

Experimental Setup. All experiments were performed on an Intel(R) Core(TM) i7-4770 3.40GHz/64bit/32GB machine with Linux 3.13.0/g++ 4.8.4 (no compiler optimization flags were used; ulimit was set to unlimited). We report wall-clock times by running each system once and then reporting the average of four subsequent runs with warm cache. For P, R, M (glm), and F/FDB we break down the times for computing the join in memory, importing the data, and learning the parameters; we do not report the times to load the database into memory as they can differ substantially between PostgreSQL and FDB and are orthogonal to this work. All tables are given sorted by their join attributes.

Datasets and Learning Tasks. We experimented with a real-world dataset, which is used by a large US retailer for forecasting user demands and sales, with two public datasets LastFM [10] and MovieLens [16], and a synthetic dataset modeling the textbook example on the house price market [25]. We next briefly introduce the learning tasks for each dataset; a detailed description is given in the Appendix.

US retailer: We considered three linear regression tasks that predict the amount of inventory units based on all other features: L considers all features; N1 and N2 also consider two interactions of features from the same tables (no factorization restructuring necessary), and respectively from different tables (factorization restructuring necessary).

LastFM: We consider two training datasets L1 and L2 to relate how often friends listen to the same artists.

MovieLens: The regression task is to predict the rating given by a user to a movie.

Housing: The regression task is to predict the housing price based on all the features in the dataset.

1. Flat vs. Factorized Joins. The compression ratio is a direct indicator of how well F fares against approaches that rely on a flat representation of the training dataset. As shown in Table 1, the compression factor brought by factorizing the joins varies from 4.43 for MovieLens to 26.61 for US retailer and over 100 for LastFM. Caching can improve the compression factor even further: 3x for MovieLens and 8x for LastFM. The effect of caching for US retailer is minimal.

While the speedup of FDB over PostgreSQL is significant, it is less than the data compression ratio, possibly due to the less optimized data structures used to encode factorizations in FDB. As shown in Figure 6, for scale 10 the factorized join is computed in 1.25 seconds vs. 64 seconds for the flat join. All join queries used in the experiments are acyclic, which means that our F flavors can compute them in linear time (modulo log factors).

Table 1 reports the join time for the real-world datasets. For the US retailer, the flat join is too large to be handled by P and R. In practice, such large datasets are partitioned and learning happens independently for each partition. This entails a loss of accuracy as we miss correlations across features. We partitioned the largest table Inventory (84M tuples) into four (ten) disjoint partitions for R (respectively, P) of roughly equal sizes by hashing on the join values for location. By joining each partition with the other tables we obtain a partitioning of the join result. Table 1 reports the overall (starred) time to compute the join for all partitions.

2. Importing Datasets. P and R are the only systems that need one pass over the join result to load it into their internal data structures. This is typical for the existing solutions based on software enterprise stacks consisting of dozens of specialized systems (e.g., for OLAP, OLTP, and BI), where non-trivial integration effort is usually spent at the interface between these systems. Table 1 and Figure 6 report the times for importing the training datasets. For Housing, P and R failed to import the data starting with scale 10 and respectively 11. In contrast, F can even finish for scale 100.

3. Learning. Table 1 shows that the speedup factor of learning with F/FDB versus competitors is larger than the compression factor, which is partially due to sharing computation across cofactors and, when compared against gradient descent approximation methods like M(glm), also due to our decoupling of cofactor computation from convergence. For R and P on US retailer, we partitioned the dataset as explained in Experiment 2. Table 1 reports the sum of learning times for all partitions. However, the learned parameters for each partition are arbitrarily far from the true ones (although not supported by P and R, they could have been made correct if the output parameters for one partition served as the initial parameter values for the next partition).

                                 US retailer L   US retailer N1   US retailer N2     LastFM L1     LastFM L2   MovieLens L
# parameters                                31               33               33             6            10            27
Join Size   Factorized              97,134,867       97,134,867       97,134,867     2,577,556     2,379,264     6,092,286
            Factorized, cached      97,134,675       97,134,675       97,134,675       376,402       315,818     2,115,610
            Flat                 2,585,046,352    2,585,046,352    2,585,046,352   369,986,292   590,793,800    27,005,643
            Compression                 26.61×           26.61×           26.61×       143.54×       248.31×         4.43×
            Compression, cached         26.61×           26.61×           26.61×       982.86×      1870.68×        12.76×
Join Time   Factorized (FDB)             36.03            36.03            36.03          4.79          9.94         12.28
            Flat (PSQL)                 249.41           249.41           249.41         54.25         61.33          1.30
Import Time R                          1189.12*         1189.12*         1189.12*       155.91        276.77         10.86
            P                          1164.40*         1164.40*         1164.40*       179.16        328.97         11.33
Learn Time  F/FDB                         9.69             9.82             9.79          0.53          0.89          3.87
            M (glm)                    2671.88          2937.49          2910.23        572.88        746.50         31.91
            R                           810.66*          873.14*          884.47*       268.04        466.52          6.96
            P                          1199.50*         1277.10*         1275.08*        35.74        148.84         10.92
Total Time  F                            16.29            16.56            16.50          0.11          0.25          2.12
            F/FDB                        45.72            45.85            45.82          5.32         10.83         16.15
            F/SQL                       108.81           109.02           109.07          0.58          2.00         14.26
            M (ols)                     680.60           737.02           738.08        152.37        196.60          7.08
            M (glm)                    2921.29          3186.90          3159.64        627.13        807.83         33.21
            R                          2249.19*         2311.67*         2323.00*       478.20        804.62         19.12
            P                          2613.31*         2690.91*         2688.89*       269.15        539.14         23.55
Speedup     F vs. M (ols)               41.78×           44.51×           44.73×      1385.18×       786.40×         3.34×
            F vs. M (glm)              179.33×          192.45×          191.49×      5701.18×      3231.32×        15.67×
            F vs. R                    138.07×          139.59×          140.79×      4347.27×      3218.48×         9.02×
            F vs. P                    160.42×          162.49×          162.96×      2446.82×      2156.56×        11.11×

Table 1: Performance comparison for learning over joins (size in number of values, time in seconds). F, F/SQL, and M(ols) intertwine the join and learning computation and we only show their total end-to-end times. P and R crashed for US retailer due to memory limitations; the starred numbers are for running them on disjoint partitions of the location values for the join (four for R and ten for P) and adding up the times.

Nevertheless, even under these simplifying assumptions for P and R, F learns the correct parameters at least two orders of magnitude faster than them.

The tasks N1 and N2 in Table 1 for US retailer consider feature interactions. The time to perform the regression task slightly increases, which is due to the fact that additional parameters are learned.

4. End-to-End Solution. Table 1 compares the performance of our end-to-end flavors of F and of our competitors, where we summed up their time components for the join, import, and learning. F performs the best for all datasets, which can be directly linked to the intermixing of the factorized join and cofactor computation. It is from 3x to 50x faster than F/FDB and from 5x to 8x faster than F/SQL. For LastFM, F exceeds three orders of magnitude improvement over its competitors. For US retailer, the performance speedup is around two orders of magnitude in comparison to R, P, and M(glm), and 40 times in comparison to the closest competitor M(ols). The performance gains for MovieLens are more limited, which is due to the comparatively smaller size of this dataset.

Figure 6 reports the performance of the end-to-end solutions for Housing against M (ols) and R, which are the best competitors for this dataset. We stress-tested F for scale factors beyond 20 for Housing: the total time without data loading is 4.1 seconds for scale 50 (compression ratio 14K) and 5.9 seconds for scale 100 (compression ratio 69K).

[Figure 6 plot omitted: stacked bars of the time in seconds for F, M, and R at scale factors 1 to 10, broken down into load data, join, import, and learn, with a zoomed-in view of F at s = 10.]

Figure 6: Total time for the end-to-end solution (loading data, join, import, learn) for F and the best competitors M(ols) and R on the Housing dataset. R runs out of memory starting with scale 11.

5. More features. We considered the Housing dataset with scale 15 and an increasing number of attributes per relation: 11, 18, 50, 100, and 168. This results in training datasets with around 60, 100, 300, 600, and 1000 features. Figure 7 shows the performance of F and M(ols) on these datasets (P and R already fail to load the original dataset with 27 features).

[Figure 7 plot omitted: total time in seconds (log scale) versus the number of features, for F and M(ols).]

Figure 7: Total time for the end-to-end solution for F and M(ols) on the Housing dataset with scale factor 15 and increasingly many features. M(ols) times out after 300 features.

M(ols) times out (i.e., it takes over four hours) for the dataset with 300 features. F's performance is consistently two orders of magnitude better than that of M(ols).

6. Model selection. In the previous experiments, we learned over all features of the training dataset. We further considered settings with fewer features, as used for model selection. The experiments validated that F only computes the parameter cofactors once and that the convergence time is consistently the smallest amongst all components of F. This contrasts with M, P, and R, which need to independently learn over the entire dataset for each set of features.

8. RELATED WORK

Our contribution lies at the intersection of databases and machine learning, as we look at regression learning, which is a fundamental machine learning problem, through database glasses. The crossover between these two communities has gained increasing interest over the past years, as also acknowledged in a SIGMOD 2015 panel [34].

Machine learning in databases. Our work follows a very recent line of research on marrying databases and machine learning [17, 14, 11, 36, 7, 20, 1, 24, 18, 9, 38, 34, 31, 32].

While F's learning approach is novel in this landscape, two of these works are closest in spirit to it, since they investigate the impact of data factorization for the purpose of boosting learning tasks. Rendle [36] considers a limited form of factorization of the design matrix for regression. His approach performs ad-hoc discovery of repeating patterns on the flat join to save time in the subsequent regression task, though with two important limitations. This discovery does not seek to capture join dependencies, i.e., to recover the knowledge that the data has been produced via joins, since this is NP-hard. It also does not save time in join computation, but instead needs additional time to perform the discovery on the flat join. Our approach is different since (i) we avoid the computation of the flat join as it is too expensive and entails redundancy; and (ii) we exploit the join structure from the query to identify the repeating patterns due to join dependencies. Kumar et al. [20] propose a framework for learning generalized linear models over key-foreign key joins in a distributed environment and consider, amongst other techniques, factorized computation over the non-materialized join. F works for arbitrary join queries and factorizations with caching, supports linear regression with feature interactions, and has theoretical performance guarantees. We show that F's cofactor computation commutes with union, and can thus be trivially distributed. Prior work on factorized databases discusses efficient support for joins [28] and simple aggregates [4]. F shows that factorization can benefit more complex user-defined aggregate functions, such as the cofactor matrix used in linear regression.

Most efforts in the database community are on designing systems to support scalable, large-scale machine learning on top of possibly distributed architectures, with the goal of providing a unified architecture for machine learning and databases [14], e.g., MLlib [1] and DeepDist [24] on Spark [42], distributed gradient descent on GLADE [32], MADlib [17] on PostgreSQL, SystemML [18, 7], system benchmarking [9], and a sample generator for cross-validation learning [38]. Our approach is more specialized as it proposes an efficient batch gradient descent variant for linear regression over joins. It achieves efficiency by factorizing data and computation, which enables more data to be kept in the memory of one machine and to be processed fast without the need for distribution. As pointed out recently [22], when benchmarked against one-machine systems, distributed systems can have a non-trivial upfront cost that can be offset by more expensive hardware or large problem settings.

The tight integration of the FDB query engine for factorized databases [5] with our regression learner F has been inspired by LogicBlox [2], which provides a unified runtime for the enterprise technology stack, and by MADlib [17].

Gradient descent optimization. The gradient descent family of optimization algorithms is fundamental to machine learning [6] and very popular, cf. an ICML 2015 tutorial [37]. One of the applications of gradient descent is regression, which is the focus of this paper. A popular variant is stochastic gradient descent [8], with several recent improvements such as a faster convergence rate via an adaptive learning rate [13] (which is also used by our system F) and parallel or distributed versions [43, 30, 35, 12, 27]. Some of these optimizations have made their way into systems such as DeepDive [39, 21] and DeepDist [24, 12]. Our contribution is orthogonal since it focuses on avoiding data and computation redundancy when learning over joins.

9. CONCLUSIONS AND FUTURE WORK

This paper puts forward F, a fast learner of least squares regression models over factorized joins that enjoys both theoretical and practical benefits. The principles behind F are applicable beyond least squares regression models, as long as the derivatives of the objective function are expressible in a semiring with multiplication and summation operations (which is the case for gradient descent and Newton approximation methods), much in the spirit of the FAQ framework [19]. This is necessary since factorization relies on commutativity and distributivity laws. For this reason, least squares models are supported, while logistic regression, which uses exponential functions, is not. A non-exhaustive list of analytics that can benefit from our approach includes factorization machines, boosted trees, and k-nearest neighbors.

Acknowledgements

This work benefitted from interactions with Molham Aref (on F/SQL), Tim Furche (on experiments), Arun Kumar (on related work), Ron Menich (on problem setting and datasets), the MADlib team, and the anonymous reviewers. It was supported by a Google Research Faculty award, the ERC grant 682588, and the EPSRC platform grant DBOnto.

10. REFERENCES

[1] Apache. MLlib: Machine learning in Spark, https://spark.apache.org/mllib, 2015.

[2] M. Aref, B. ten Cate, T. J. Green, B. Kimelfeld, D. Olteanu, E. Pasalic, T. L. Veldhuizen, and G. Washburn. Design and implementation of the LogicBlox system. In SIGMOD, pages 1371–1382, 2015.

[3] A. Atserias, M. Grohe, and D. Marx. Size bounds and query plans for relational joins. In FOCS, pages 739–748, 2008.

[4] N. Bakibayev, T. Kocisky, D. Olteanu, and J. Zavodny. Aggregation and ordering in factorised databases. PVLDB, 6(14):1990–2001, 2013.

[5] N. Bakibayev, D. Olteanu, and J. Zavodny. FDB: A query engine for factorised relational databases. PVLDB, 5(11):1232–1243, 2012.

[6] C. M. Bishop. Pattern Recognition and Machine Learning (Information Science and Statistics), 2006.

[7] M. Boehm, S. Tatikonda, B. Reinwald, P. Sen, Y. Tian, D. Burdick, and S. Vaithyanathan. Hybrid parallelization strategies for large-scale machine learning in SystemML. PVLDB, 7(7):553–564, 2014.

[8] L. Bottou. Stochastic gradient descent tricks. In Neural Networks: Tricks of the Trade (2nd ed), pages 421–436, 2012.

[9] Z. Cai, Z. J. Gao, S. Luo, L. L. Perez, Z. Vagena, and C. M. Jermaine. A comparison of platforms for implementing and running very large scale machine learning algorithms. In SIGMOD, pages 1371–1382, 2014.

[10] I. Cantador, P. Brusilovsky, and T. Kuflik. 2nd workshop on information heterogeneity and fusion in recommender systems. In RecSys, pages 387–388, 2011, http://grouplens.org/datasets/hetrec-2011.

[11] T. Condie, P. Mineiro, N. Polyzotis, and M. Weimer. Machine learning for big data. In SIGMOD, pages 939–942, 2013.

[12] J. Dean, G. Corrado, R. Monga, K. Chen, M. Devin, Q. V. Le, M. Z. Mao, M. Ranzato, A. W. Senior, P. A. Tucker, K. Yang, and A. Y. Ng. Large scale distributed deep networks. In NIPS, pages 1232–1240, 2012.

[13] J. Duchi, E. Hazan, and Y. Singer. Adaptive subgradient methods for online learning and stochastic optimization. JMLR, 12:2121–2159, 2011.

[14] X. Feng, A. Kumar, B. Recht, and C. Re. Towards a unified architecture for in-RDBMS analytics. In SIGMOD, pages 325–336, 2012.

[15] J. G. F. Francis. The QR transformation: A unitary analogue to the LR transformation – Part 1. The Computer Journal, 4(3):265–271, 1961.

[16] GroupLens Research. MovieLens, http://grouplens.org/datasets/movielens, 2003.

[17] J. M. Hellerstein, C. Re, F. Schoppmann, D. Z. Wang, E. Fratkin, A. Gorajek, K. S. Ng, C. Welton, X. Feng, K. Li, and A. Kumar. The MADlib analytics library or MAD skills, the SQL. PVLDB, 5(12):1700–1711, 2012.

[18] B. Huang, M. Boehm, Y. Tian, B. Reinwald, S. Tatikonda, and F. R. Reiss. Resource elasticity for large-scale machine learning. In SIGMOD, pages 137–152, 2015.

[19] M. A. Khamis, H. Q. Ngo, and A. Rudra. FAQ: Questions Asked Frequently, CoRR:1504.04044, 2015.

[20] A. Kumar, J. F. Naughton, and J. M. Patel. Learning generalized linear models over normalized data. In SIGMOD, pages 1969–1984, 2015.

[21] J. Liu, S. Wright, C. Re, V. Bittorf, and S. Sridhar. An asynchronous parallel stochastic coordinate descent algorithm. In ICML, pages 469–477, 2014.

[22] F. McSherry, M. Isard, and D. G. Murray. Scalability! But at what COST? In HotOS, 2015.

[23] R. Menich and N. Vasiloglou. The future of LogicBlox machine learning. LogicBlox User Days, 2013.

[24] D. Neumann. Lightning-fast deep learning on Spark via parallel stochastic gradient updates, www.deepdist.com, 2015.

[25] A. Ng. CS229 Lecture Notes. Stanford & Coursera, http://cs229.stanford.edu/, 2014.

[26] H. Q. Ngo, E. Porat, C. Re, and A. Rudra. Worst-case optimal join algorithms. In PODS, pages 37–48, 2012.

[27] F. Niu, B. Recht, C. Re, and S. J. Wright. Hogwild: A lock-free approach to parallelizing stochastic gradient descent. In NIPS, pages 693–701, 2011.

[28] D. Olteanu and J. Zavodny. Size bounds for factorised representations of query results. TODS, 40(1):2, 2015.

[29] R. Penrose. A generalized inverse for matrices. Math. Proc. Cambridge Phil. Soc., 51(03):406–413, 1955.

[30] F. Petroni and L. Querzoni. GASGD: Stochastic gradient descent for distributed asynchronous matrix completion via graph partitioning. In RecSys, pages 241–248, 2014.

[31] C. Qin and F. Rusu. Scalable I/O-bound parallel incremental gradient descent for big data analytics in GLADE. In DanaC, pages 16–20, 2013.

[32] C. Qin and F. Rusu. Speculative approximations for terascale distributed gradient descent optimization. In DanaC, pages 1:1–1:10, 2015.

[33] R Core Team. R: A Language and Environment for Statistical Computing. R Foundation for Statistical Computing, www.r-project.org, 2013.

[34] C. Re, D. Agrawal, M. Balazinska, M. I. Cafarella, M. I. Jordan, T. Kraska, and R. Ramakrishnan. Machine learning and databases: The sound of things to come or a cacophony of hype? In SIGMOD, pages 283–284, 2015.

[35] B. Recht and C. Re. Parallel stochastic gradient algorithms for large-scale matrix completion. Math. Program. Comput., 5(2):201–226, 2013.

[36] S. Rendle. Scaling factorization machines to relational data. PVLDB, 6(5):337–348, 2013.

[37] P. Richtarik and M. Schmidt. Modern convex optimization methods for large-scale empirical risk minimization. In ICML, 2015. Invited Tutorial.

[38] S. Schelter, J. Soto, V. Markl, D. Burdick, B. Reinwald, and A. V. Evfimievski. Efficient sample generation for scalable meta learning. In ICDE, pages 1191–1202, 2015.

[39] J. Shin, S. Wu, F. Wang, C. D. Sa, C. Zhang, and C. Re. Incremental knowledge base construction using DeepDive. PVLDB, 8(11):1310–1321, 2015.

[40] The StatsModels development team. StatsModels: Statistics in Python, http://statsmodels.sourceforge.net, 2012.

[41] T. L. Veldhuizen. Triejoin: A simple, worst-case optimal join algorithm. In ICDT, pages 96–106, 2014.

[42] M. Zaharia, M. Chowdhury, T. Das, A. Dave, J. Ma, M. McCauly, M. J. Franklin, S. Shenker, and I. Stoica. Resilient distributed datasets: A fault-tolerant abstraction for in-memory cluster computing. In NSDI, pages 15–28, 2012.

[43] M. Zinkevich, M. Weimer, A. J. Smola, and L. Li. Parallelized stochastic gradient descent. In NIPS, pages 2595–2603, 2010.

APPENDIX

A. PROOFS

Proof of Proposition 3.1. 1. (Symmetry). Assume that Q(D) has m tuples. We have:

S_k = Σ_{i=1}^{m} (θ_0 x_0^(i) + ... + θ_j x_j^(i) + ... + θ_n x_n^(i)) x_k^(i),

S_j = Σ_{i=1}^{m} (θ_0 x_0^(i) + ... + θ_k x_k^(i) + ... + θ_n x_n^(i)) x_j^(i).

This implies that:

Cofactor[A_j, A_k] = Σ_{i=1}^{m} x_j^(i) x_k^(i) = Σ_{i=1}^{m} x_k^(i) x_j^(i) = Cofactor[A_k, A_j].

2. (Commutativity with union). For 1 ≤ l ≤ p, assume that the training dataset Q(D_l) has m_l tuples, which we denote by

Q(D_l) = {(x_{l,0}^(1), ..., x_{l,n}^(1)), ..., (x_{l,0}^(m_l), ..., x_{l,n}^(m_l))}.

Then Q(D) has Σ_{l=1}^{p} m_l tuples. Take 0 ≤ k, j ≤ n. It holds that:

Σ_{l=1}^{p} Cofactor_l[A_k, A_j] = Σ_{l=1}^{p} ( Σ_{i_l=1}^{m_l} x_{l,k}^(i_l) x_{l,j}^(i_l) )
                                 = Σ_{i=1}^{Σ_{l=1}^{p} m_l} x_k^(i) x_j^(i) = Cofactor[A_k, A_j].

3. (Commutativity with projection). Assume that Q(D) has m tuples. Since we are under bag semantics, π_L(Q(D)) also has m tuples. We assume that L has n_L features, which we denote x_{L,0}, ..., x_{L,n_L}. Take 1 ≤ k, j ≤ n_L. Then Cofactor_L[A_k, A_j] = Σ_{i=1}^{m} x_j^(i) x_k^(i), which moreover is equal to Cofactor[A_k, A_j].
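The commutativity with union can also be checked numerically; the following is a small sanity check of our own on random toy data (NumPy, hypothetical partition sizes), not part of the paper's experiments:

import numpy as np

# Proposition 3.1(2): summing the cofactor matrices of the partitions gives
# the cofactor matrix of their union.
rng = np.random.default_rng(0)
parts = [rng.normal(size=(m, 4)) for m in (5, 7, 3)]   # three partitions, four features
union = np.vstack(parts)
assert np.allclose(sum(X.T @ X for X in parts), union.T @ union)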

Proof of Theorem 3.4. Take the basis functions φ_0, ..., φ_b over the sets of features S_0, ..., S_b, the induced relational schema σ = (R_0(S_0), ..., R_b(S_b)), and Q_σ = Q ⋈ R_0 ⋈ ··· ⋈ R_b as the extension of Q w.r.t. σ.

First, we show that the extension D_σ of D with relations over the induced relational schema σ leads to a factorization of the join Q_σ(D_σ) of size O(|D|^fhtw(Q_σ)). The extension D_σ has a relation instance R_i (for 1 ≤ i ≤ b) for each schema R_i(S_i) that is the projection of Q(D) on S_i, i.e., π_{S_i}(Q(D)). It then holds that Q(D) = Q(D_σ) = Q_σ(D_σ). By Proposition 2.2, there exists a factorization of D_σ of size O(|D_σ|^fhtw(Q_σ)). Under data complexity, the schema σ has constant size and thus O(|D_σ|^fhtw(Q_σ)) = O(|D|^fhtw(Q_σ)).

Let us denote by E the factorization of Q_σ(D_σ) and let ∆ be a d-tree under which we obtain the size bound O(|D|^fhtw(Q_σ)).

We next show that E can be augmented with values corresponding to the interaction terms φ_k while keeping the same asymptotic bound on its size. The query variables of any relation atom in the query, and in particular of relations over schemas R_0(S_0), ..., R_b(S_b), are along the same root-to-leaf path in ∆. This means that E materializes all possible combinations of values for variables (corresponding to attributes) in each schema S_k. We can thus compute the result r_k of the basis function φ_k on the values for variables in S_k in one pass over E. For each tuple of values for variables in S_k, we add a product with the value r_k immediately under the lowest value in the tuple in E. We also add a variable φ_k in ∆ immediately under the lowest variable in S_k. These additions do not modify the asymptotic size bounds of E since we add one value for each S_k-tuple of values in E. Let us denote by E_σ the factorization E extended with values for interaction terms. The factorization E_σ has size O(|D|^fhtw(Q_σ)) and values for each basis function φ_k. Learning over E_σ has thus reduced to learning with identity basis functions and, by Proposition 3.2, this can be done in one pass over E_σ. The claim follows.

B. DESCRIPTION OF USED DATASETS

• Housing is a synthetic dataset emulating the textbook example for the house price market [25]. It consists of six tables: House (postcode, price, number of bedrooms/bathrooms/garages/parking lots, living room/kitchen area, etc.), Shop (postcode, opening hours, price range, brand, e.g., Costco, Tesco, Sainsbury's), Institution (postcode, type of educational institution, e.g., university or school, and number of students), Restaurant (postcode, opening hours, and price range), Demographics (postcode, average salary, rate of unemployment, criminality, and number of hospitals), and Transport (postcode, the number of bus lines, train stations, and distance to the city center for the postcode). The training dataset is the natural join of all relations (on postcode) and has 27 features. There are 25K postcodes, which appear in all relations. The scale factor s determines the number of generated distinct tuples per postcode in each relation: we generate s tuples in House and Shop, log2 s tuples in Institution, s/2 in Restaurant, and one in each of Demographics and Transport. The numerical values for each attribute are randomly generated. The domains are intervals simulating the real-world semantics of attributes, e.g., large intervals for attributes such as price, smaller intervals for attributes such as nbtrainstations, and Boolean values for attributes such as house, flat, or bungalow indicating the type of housing.

We considered one linear regression task that predicts the price of a house based on all other features. Since all relations have a common attribute (postcode), the above query is acyclic and in particular hierarchical. An optimal d-tree would have each root-to-leaf path consisting of query variables for one relation. Caching in F is not useful here. The time complexity for F is linear (modulo log factors). We present a snapshot of the d-tree in Figure 8(a).

[Figure 8 d-tree drawings omitted. Panels: (a) Housing L; (b) MovieLens L; (c) LastFM L1; (d) LastFM L2; (e) US retailer L and N1; (f) US retailer N2.]

Figure 8: Snapshots of the d-trees considered in the experiments (root variable for intercept omitted).

• US retailer. This dataset consists of three relations: Inventory (storing information about the inventory units for products in a location, at a given date), Census (storing demographics information per zipcode, such as population, median age, repartition per ethnicities, house units and how many are occupied, number of children per household, number of males, females, and families), and Location (storing for each zipcode the distances d1 to d10 to several other stores).

The training dataset is the natural join of the three relations and has 31 features. We considered three linear regression tasks that predict the amount of inventory units based on all other features. All join queries for these tasks are acyclic and, under data-dependent functional dependencies (no store location can be in two zipcodes), they become hierarchical. Caching is not useful in their case.

Task L considers the plain features. We consider an asymptotically optimal d-tree for the natural join query, a snapshot of which is presented in Figure 8(e). The relation Inventory is encoded along the path locn/ksn/..., the relation Location is encoded along the path locn/zip/d1/..., while Census is encoded along the path zip/population/...

Task N1 considers two interactions of features from the same relation, for which no restructuring is necessary: (i) between median age and number of families (both from Census) and (ii) between different distances to other stores (both from Location). The first interaction looks at the effect on inventory units while correlating the number of families and the median age, since the two features are strongly related. The second interaction looks at the effect of having competitors close to the store and how the interaction of the two competitors changes the effect on the inventory units in the given store. Since both features occurring in an interaction are from the same table, we can use the d-tree from the linear case, which is asymptotically optimal.

Task N2 considers two interactions: (i) between population and number of house units (both from Census), and (ii) between median age and distance to another store (features of Census and Location, respectively). The first interaction uses the insight that if we have a large population but few houses, then many people live in the same house. This can give an indication of what the general population is like and, therefore, can have a meaningful impact on our prediction. The second interaction looks at how the age of a person influences her tendency to look for satisfying stores at a further distance.

The d-tree used for N2 needs to additionally satisfy the constraint that the features from its second interaction (median age and the distance to another store) are on the same root-to-leaf path; a snapshot of it is shown in Figure 8(f). For both tasks N1 and N2, each new interaction term can be seen as a newly-derived feature for learning in addition to the initial 31 features, hence for each task we have 31 + 2 = 33 features.

• LastFM [10] has three relations: Userfriends (pairing friends in the social network from the LastFM online music system), Userartists (how often a user listens to a certain artist), and Usertaggedartiststimestamps (the user classification of artists and the time when a user rated artists). Our regression task is to predict how often a user would listen to an artist based on similar information for its friends.

[Figure 9 plots omitted: (a) size and join time for flat vs. factorized joins, (b) learning time, and (c) compression ratio and speedup, each plotted against the scale factor s on a log scale.]

Figure 9: Housing dataset: (a) Size and join time for factorized vs. flat joins; (b) time for learning (excluding join time) with the four systems; (c) the learning speedup is for F/FDB relative to R, which is the fastest competitor in terms of learning time alone. P and R run out of memory starting at s = 11 for learning. M (glm) took longer than one hour after s = 13, and is not reported.

We consider two training datasets: L1 joins two copies of Userartists with Userfriends to relate how often friends listen to the same artists; L2 is L1 where we also join in the Usertaggedartiststimestamps copies of both friends:

L1: UF ⋈_{UF.user=UA1.user} UA1 ⋈_{UF.friend=UA2.user} UA2

L2: UF ⋈_{UF.user=UA1.user} UA1 ⋈_{UF.friend=UA2.user} UA2 ⋈_{UF.user=UTA1.user} UTA1 ⋈_{UF.friend=UTA2.user} UTA2

Both queries are hierarchical since the user is common to all joined relations, and we use optimal d-trees for both of them; we illustrate them in Figures 8(c) and 8(d).

• MovieLens [16] has three relations: Users (age, gender, occupation, zipcode of users), Movies (movie year and its type, e.g., action, adventure, animation, children, and so on), and Ratings that users gave to movies on certain dates. The training dataset is the acyclic natural join of these tables and has 27 features. The regression task is to predict the rating given by a user to a movie. We depict a snapshot of the considered optimal d-tree in Figure 8(b). There are four versions of the MovieLens dataset and we only reported experimental findings for the largest available version (1M records) that has complete information for all three tables; there are two larger versions (10M and 20M) but without Users. We also experimented with the other versions (100K and the larger ones where we synthetically generated the Users relation) and found that they exhibit the same compression ratio and performance gain.

C. DATASET PREPARATION

The learning task requires preparing the datasets. Firstly, we only kept features that represent quantities or Boolean flags over which we can learn and discarded string features (except if necessary for joins). Secondly, we normalized all numerical values of a feature A by mapping them to the [0, 1] range of reals as follows. Let minA and maxA be the minimum and respectively maximum value in the active domain of A. Then, a value v for A is normalized to (v − minA)/maxA. Normalization is essential so that all features have the same relative weight, e.g., avoiding that large date values represented in seconds since Jan 1, 1970 are more important than, say, small integer values representing the number of house bedrooms. It also preserves the cardinality of the join results from the original datasets.
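As a rough illustration of this preparation step, here is a minimal sketch of the described rescaling in Python (our own helper; the actual preparation was performed over the database tables):

import numpy as np

def normalize(column):
    # Rescale a numerical feature using the minimum and maximum of its
    # active domain, as described above: v -> (v - minA) / maxA.
    a = np.asarray(column, dtype=float)
    return (a - a.min()) / a.max()

# e.g. timestamps in seconds since Jan 1, 1970 and small integer counts end
# up with comparable relative weights after rescaling.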

D. FURTHER EXPERIMENTS

Figure 9 outlines the relative performance of F/FDB over its competitors on the Housing dataset along three dimensions: compression ratio, performance speedup, and learning speedup. For the first two dimensions we essentially compare FDB against PostgreSQL, since F/FDB uses the factorized join computed by FDB, while the competitors use the flat join computed by PostgreSQL.

Figure 9(c) shows that Housing can be compressed by a factor of over 10³ for scale s = 16. For a scale factor s, the flat join of all tables on postcode has 25K × s³/2 × log₂ s tuples, each of 27 values, whereas the factorized join stays linear in the size of the input tables. The gap between the sizes of the flat and factorized joins thus follows a quadratic function in s.
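For a back-of-the-envelope check of this count, the flat join size implied by the generation scheme of Appendix B can be computed directly (our own helper, not from the paper's code):

from math import log2

def flat_join_tuples(s, postcodes=25_000):
    # Per postcode: s tuples in House and in Shop, log2(s) in Institution,
    # s/2 in Restaurant, and one each in Demographics and Transport.
    return postcodes * s * s * log2(s) * (s / 2)

for s in (4, 8, 16):
    t = flat_join_tuples(s)
    print(f"s={s:2d}: about {t:,.0f} tuples, i.e. {27 * t:,.0f} values in the flat join")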

Figures 9(a) and 9(c) confirm the finding from the real datasets for join computation: there is a significant speedup for factorized over flat joins, which is upper bounded by the data compression ratio.

Figure 9(b) reports the performance of learning for up to scale 16 for F/FDB, up to 13 for M(glm), and up to 10 for P and R. At scale factor 10, the compression ratio is 240 and F/FDB takes 2.2 seconds for learning. M(glm) was stopped at scale factor 13, since it exceeded the timeout of one hour. Learning with R and P already fails for scale factor 11 due to memory limitations. We further tried with scale factors up to 20, where the compression ratio is 1.9K and F/FDB takes 3.4 seconds for learning.

E. UPDATES TO CONFERENCE PAPER

This is an updated version of the conference paper with the same title that appeared in SIGMOD 2016. We corrected Figure 4 as follows: we update the regression aggregates for A, i.e., L_∆[A] and Q_∆[A, A], at each inner node A; we index the cache by the variable, as in cache_A, to clarify that there is a cache for each variable A; and the for-loop over B, D ∈ Schema[∆_j] now includes the case where B = D. We also fixed references to this figure. We corrected the worst-case optimality statements in Proposition 2.2.