
Big Signal Processing for Multi-Aspect Data Mining
Evangelos E. Papalexakis, Carnegie Mellon University

http://www.cs.cmu.edu/~epapalex

What does a social graph between people who call each other look like? How does it differ from one where people instant-message or e-mail each other? Social interactions, along with many other real-world processes and phenomena, have different aspects, such as the means of communication. In the above example, the activity of people calling each other will likely differ from the activity of people instant-messaging each other. Nevertheless, each aspect of the interaction is a signature of the same underlying social phenomenon: the formation of social ties and communities. Taking into account all aspects of social interaction results in more accurate social models (e.g., communities). The main thesis of my work is that many real-world problems, such as the aforementioned, benefit from jointly modeling and analyzing the multi-aspect data associated with the underlying phenomenon we seek to uncover. My research bridges Signal Processing and Data Science by designing and developing scalable and interpretable algorithms for mining big multi-aspect data, to address high-impact real-world applications.

THESIS WORK
My thesis work is broken down into algorithms, with contributions mostly in tensor analysis as well as in other fields such as control system identification [10], and multi-aspect data mining applications.

A) Algorithms


X"

Figure 1: Canonical or PARAFAC decom-position into sum of R rank-one compo-nents. Each component is a latent conceptor a co-cluster.

The primary computational tool in my work is tensor decomposition. Tensors are multi-dimensional matrices, where each aspect of the data is mapped to one of the dimensions or modes. In order to analyze a tensor, we compute a decomposition or factorization (henceforth we use the terms interchangeably), which gives a low-dimensional embedding of all the aspects. I focused on the Canonical or PARAFAC decomposition, which decomposes the tensor into a sum of outer products of latent factors (see also Figure 1):

X ≈ ∑_{r=1}^{R} a_r ◦ b_r ◦ c_r

where ◦ denotes the outer product, i.e., [a ◦ b ◦ c](i, j, k) = a(i)b(j)c(k). Informally, each latent factor is a co-cluster of the aspects of the tensor. The advantages of this decomposition over other existing ones are interpretability (each factor corresponds to a co-cluster) and strong uniqueness guarantees for the latent factors. Tensor decompositions are very powerful tools, with numerous applications (see details below). There is increasing interest in their application to Big Data problems, both from academia and industry. However, algorithmically, there exist challenges which limit their applicability to real-world, big data problems, pertaining to scalability and quality assessment of the results. Below I outline how my work addresses those challenges, towards a broad adoption of tensor decompositions in big data science.
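To make the model concrete, here is a minimal NumPy sketch (the factor matrices are random placeholders, not data from my work) that reconstructs a tensor from its R rank-one components exactly as in the equation above:

    import numpy as np

    I, J, K, R = 10, 8, 6, 3           # arbitrary mode sizes and rank
    A = np.random.rand(I, R)           # columns a_1, ..., a_R
    B = np.random.rand(J, R)           # columns b_1, ..., b_R
    C = np.random.rand(K, R)           # columns c_1, ..., c_R

    # X(i,j,k) = sum_r a_r(i) b_r(j) c_r(k): a sum of R outer products
    X = np.einsum('ir,jr,kr->ijk', A, B, C)

    # equivalently, accumulate one rank-one "concept" at a time
    X_alt = np.zeros((I, J, K))
    for r in range(R):
        X_alt += np.einsum('i,j,k->ijk', A[:, r], B[:, r], C[:, r])
    assert np.allclose(X, X_alt)

Fitting the factors to data (rather than reconstructing a tensor from them) is typically done with alternating least squares, as sketched further below.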

A1) Parallel and Scalable Tensor Decompositions

Consider a multi-aspect tensor dataset that is too big to fit in the main memory of a single machine. The data may have a large "ambient" dimension (e.g., a social network can have billions of users); however, the observed interactions are very sparse, resulting in extremely sparse data. This data sparsity can be exploited for efficiency. In [9] we formalize the above statement by proposing the concept of a triple-sparse algorithm, where 1) the input data are sparse, 2) the intermediate data that the algorithm manipulates or creates are sparse, and 3) the output is sparse. Sparsity in the intermediate data is crucial for scalability. In [15] we show that the intermediate data created by a tensor decomposition algorithm designed for dense data can be many orders of magnitude larger than the original data, rendering the analysis prohibitive. Sparsity in the results is a great advantage, both in terms of storage and, most importantly, in terms of interpretability, since sparse models are easier for humans to inspect. Before [9], none of the existing state-of-the-art algorithms fulfilled all three requirements for sparsity.
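A back-of-the-envelope illustration of the intermediate-data blowup (the sizes below are hypothetical; the classic culprit is the dense Khatri-Rao product that a naive alternating least squares implementation materializes):

    # a hypothetical sparse 1M x 1M x 1M tensor with 10M nonzeros, rank 10
    I = J = K = 10**6
    nnz, R = 10**7, 10

    input_cells = nnz                    # sparse storage: ~one entry per nonzero
    intermediate_cells = (J * K) * R     # dense Khatri-Rao intermediate in naive ALS
    print(intermediate_cells / input_cells)   # 10**6: a million-fold blowup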

In [9] we propose PARCUBE, the first triple-sparse, parallel algorithm for tensor decomposition. Figure 2(a) depicts a high-level overview of PARCUBE. Suppose that we computed a weight of how important every row, column, and "fiber" (the third-mode index) of the tensor is. Given that weight, PARCUBE takes biased samples of rows, columns, and fibers, extracting a small tensor from the full data. This is done repeatedly, with each sub-tensor effectively exploring different parts of the data. Subsequently, PARCUBE decomposes all those sub-tensors in parallel, generating partial results. Finally, PARCUBE merges the partial results, ensuring that partial results corresponding to the same latent component are merged together. The power behind PARCUBE is that, even though the tensor itself might not fit in memory, we can choose the sub-tensors appropriately so that they fit in memory, and we can compensate by extracting many independent sub-tensors. PARCUBE converges to the same level of sparsity as [13] (the first tensor decomposition with latent sparsity), and furthermore PARCUBE's approximation error converges to that of the full decomposition (in cases where we are able to run the full decomposition). This demonstrates that PARCUBE's sparsity maintains the useful information in the data. In [11] we extend the idea of [9], introducing TURBO-SMT for the case of Coupled Matrix-Tensor Factorization (CMTF), where a tensor and a matrix share one of the aspects of the data, achieving up to 200 times faster execution with comparable accuracy to the baseline, on a single machine. Subsequently, in [17] we propose PARACOMP, a novel parallel architecture for tensor decomposition in a similar spirit to PARCUBE. Instead of sampling, PARACOMP uses random projections to compress the original tensor into multiple smaller tensors.
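A minimal sketch of PARCUBE's sampling step (NumPy; marginal absolute sums are one simple choice of importance weight — see [9] for the exact weighting scheme and the merging step):

    import numpy as np

    def biased_sample(X, fractions, rng):
        """Extract a small sub-tensor by sampling each mode's indices with
        probability proportional to that slice's total mass."""
        idx = []
        for mode, frac in enumerate(fractions):
            other = tuple(m for m in range(X.ndim) if m != mode)
            w = np.abs(X).sum(axis=other)        # importance of each index
            n = max(1, int(frac * X.shape[mode]))
            idx.append(rng.choice(X.shape[mode], size=n, replace=False, p=w / w.sum()))
        return X[np.ix_(*idx)], idx              # small enough to fit in memory

    rng = np.random.default_rng(0)
    X = rng.random((100, 100, 100))
    sub_tensors = [biased_sample(X, (0.1, 0.1, 0.1), rng) for _ in range(5)]
    # each sub-tensor is decomposed independently (in parallel), and the
    # partial factors are merged by aligning the shared sampled indices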


!"

!#"

!"#

#

#

$#

(a) The main idea behind PARCUBE: Using biased sampling, extract small repre-sentative sub-sampled tensors, decompose them in parallel, and carefully mergethe final results into a set of sparse latent factors.

102 103 104 10510−4

10−2

100

102

104

I = J = K

Tim

e (s

ec)

Baseline−1Baseline−2ICASSP 15

100x larger data!

Quality Assessment Scalability!

(b) Computing the decomposition quality for tensorsfor two orders of magnitude larger tensor than the stateof the art (I, J,K are the tensor dimensions).

Figure 2

Thanks to compression, in [17] we prove that PARACOMP can guarantee the uniqueness of the results (cf. [17] for exact bounds and conditions). This is a very strong guarantee on the correctness and quality of the result.
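The compression step can be sketched in a few lines; the replica sizes below are illustrative, and [17] gives the precise conditions on the number and size of replicas under which the factors remain identifiable:

    import numpy as np

    rng = np.random.default_rng(1)
    I, J, K = 60, 50, 40                # original dimensions (toy sizes)
    L, M, N = 12, 10, 8                 # compressed dimensions, L*M*N << I*J*K
    X = rng.random((I, J, K))

    # one compressed replica: hit each mode with a random Gaussian projection
    U = rng.standard_normal((L, I))
    V = rng.standard_normal((M, J))
    W = rng.standard_normal((N, K))
    Y = np.einsum('li,ijk->ljk', U, X)
    Y = np.einsum('mj,ljk->lmk', V, Y)
    Y = np.einsum('nk,lmk->lmn', W, Y)  # Y is L x M x N

    # PARACOMP draws many such replicas with independent (U, V, W),
    # decomposes each small Y in parallel, and then merges the factors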

In addition to [9, 11, 17], which introduce a novel paradigm for parallelizing and scaling up tensor decomposition, in [15] we developed the first scalable algorithm for tensor decompositions on Hadoop, which was able to decompose problems at least two orders of magnitude larger than the state of the art. Subsequently, in [4] we developed a Distributed Stochastic Gradient Descent method for Hadoop that scales to billions of parameters.

A2) Unsupervised Quality Assessment of Tensor Decompositions

Real-world exploratory analysis of multi-aspect data is, to a great extent, unsupervised. Obtaining ground truth is a very expensive and slow process, or in the worst case impossible; for instance, in Neurosemantics, where we research how language is represented in the brain, most of the subject matter is uncharted territory and our analysis drives the exploration. Nevertheless, we would like to assess the quality of our results in the absence of ground truth. There is a very effective heuristic in the Signal Processing and Chemometrics literature by the name of "Core Consistency Diagnostic" (cf. Bro and Kiers, Journal of Chemometrics, 2003), which assigns a "quality" number to a given tensor decomposition and gives information about the data being inappropriate for such analysis, or the number of latent factors being incorrect. However, this diagnostic has been specifically designed for fully dense and small datasets, and is not able to scale to large and sparse data. In [8], exploiting sparsity, we introduce a provably exact algorithm that operates on data at least two orders of magnitude larger than the state of the art (as shown in Figure 2(b)), which enables quality assessment on large real datasets for the first time.
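For reference, the dense diagnostic itself fits in a few lines (a sketch of the Bro & Kiers formula; the contribution of [8] is computing the same quantity on big sparse tensors without forming dense intermediates):

    import numpy as np

    def core_consistency(X, A, B, C):
        """Core Consistency Diagnostic: fit an unconstrained Tucker core G
        given CP factors A, B, C, and measure how far G is from the ideal
        superdiagonal core (100 = perfect trilinear structure)."""
        R = A.shape[1]
        # least-squares core: G = X  x1 pinv(A)  x2 pinv(B)  x3 pinv(C)
        G = np.einsum('pi,qj,rk,ijk->pqr', np.linalg.pinv(A),
                      np.linalg.pinv(B), np.linalg.pinv(C), X)
        T = np.zeros((R, R, R))
        T[np.diag_indices(R, ndim=3)] = 1.0
        return 100.0 * (1.0 - ((G - T) ** 2).sum() / R)

Values near 100 indicate an appropriate trilinear model; values near zero or negative flag a wrong number of components or data unsuited to the model.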

Impact - Algorithms
• PARCUBE [9] is the most cited paper of ECML-PKDD 2012, with 46 citations at the time of writing, where the median number of citations for ECML-PKDD 2012 is 5. Additionally, PARCUBE has already been downloaded more than 80 times by universities and organizations from 23 countries.
• TURBO-SMT [11] was selected as one of the best papers of SDM 2014, and will appear in a special issue of the Statistical Analysis and Data Mining journal.
• PARACOMP [17] has appeared in the prestigious IEEE Signal Processing Magazine.

B) Applications
My work has focused on: 1) Multi-Aspect Graph Mining, 2) Neurosemantics, and 3) Knowledge on the Web.

B1) Multi-Aspect Graph Mining

Figure 3: Results on the four views of the REALITYMINING multi-graph: (a) calls, (b) proximity, (c) sms, (d) friends. Red dashed lines outline the clustering found by GRAPHFUSE.

In [5] we introduce GRAPHFUSE, a tensor-based community detection algorithm which uses different aspects of social interaction and outperforms the state of the art in community extraction accuracy. Figure 3 shows GRAPHFUSE at work, identifying communities in REALITYMINING, a real dataset of multi-aspect interactions between students and faculty at MIT. The communities are consistent across aspects and agree with ground truth. Another aspect of a social network is time. In [3] we introduce COM2, a tensor-based temporal community detection algorithm, which identifies social communities and their behavior over time. In [7] we consider language as another aspect, where we identify topical and temporal communities in a discussion forum of Turkish immigrants in the Netherlands, and in [12] we consider location (which has become extremely pervasive recently), analyzing data from Foursquare and identifying spatial and


temporal patterns of users’ activity in Foursquare. Not necessarily restricted to social networks, in my work I have also analyzedmulti-aspect computer network graphs, detecting anomalies and network attacks[6, 16].Impact - Multi-Aspect Graph Mining

Impact - Multi-Aspect Graph Mining
• COM2 [3] won the best student paper award at PAKDD 2014.
• GRAPHFUSE [5] has been downloaded more than 80 times from 21 countries.
• Our work in [16] is deployed by the Institute for Information Industry in Taiwan, detecting real network intrusion attempts.
• Our work in [6] was selected, as one of the best papers of ASONAM 2012, to appear in Springer's Encyclopedia of Social Network Analysis and Mining.

B2) Neurosemantics

How is knowledge represented in the human brain? Which regions have high activity and information flow when a concept such as "food" is shown to a human subject? Do all human subjects' brains behave similarly in this context? Consider the following experimental setting, where multiple human subjects are shown a set of concrete English nouns (e.g., "dog", "tomato"), and we measure each person's brain activity using various techniques (e.g., fMRI or MEG). In this experiment, human subjects, semantic stimuli (i.e., the nouns), and measurement methods are all different aspects of the same underlying phenomenon: the mechanisms that the brain uses to process language.

In [11], we seek to identify coherent regions of the brain that are activated for a semantically coherent set of stimuli. To that end, we combine fMRI measurements with semantic features (in the form of simple questions, such as "Can you pick it up?") for the same set of nouns; these features provide useful information to the decomposition which might be missing from the fMRI data, and also constitute a human-readable description of the semantic context of each latent group. A very exciting example of our results can be seen in Figure 4(a), where all the nouns in the "cluster" are small objects, the corresponding questions reflect holding or picking such objects up, and, most importantly, the brain region that was highly active for this set of nouns and questions was the premotor cortex, which is associated with holding or picking small items up. This result is entirely unsupervised and agrees with Neuroscience. This gives us confidence that the same technique can be used in more complex tasks (cf. future research) and drive neuroscientific discovery.
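A bare-bones sketch of the coupled model (the vanilla CMTF objective, not TURBO-SMT itself, which adds the sampling machinery of [9]; names and sizes are illustrative): the tensor X (nouns x voxels x subjects) and the matrix Y (nouns x questions) share the noun factor A, so the update for A fits both datasets jointly.

    import numpy as np

    def khatri_rao(P, Q):
        return (P[:, None, :] * Q[None, :, :]).reshape(-1, P.shape[1])

    def cmtf_als(X, Y, R, iters=30, seed=0):
        """ALS for min ||X - [A,B,C]||^2 + ||Y - A D'||^2 with shared A."""
        rng = np.random.default_rng(seed)
        I, J, K = X.shape
        A, B, C, D = (rng.random((d, R)) for d in (I, J, K, Y.shape[1]))
        for _ in range(iters):
            # shared mode: stack the two least-squares problems, solve once
            W = np.vstack([khatri_rao(B, C), D])      # (J*K + #questions) x R
            A = np.hstack([X.reshape(I, -1), Y]) @ W @ np.linalg.pinv(W.T @ W)
            # remaining modes: ordinary least-squares updates
            B = np.moveaxis(X, 1, 0).reshape(J, -1) @ khatri_rao(A, C) \
                @ np.linalg.pinv((A.T @ A) * (C.T @ C))
            C = np.moveaxis(X, 2, 0).reshape(K, -1) @ khatri_rao(A, B) \
                @ np.linalg.pinv((A.T @ A) * (B.T @ B))
            D = Y.T @ A @ np.linalg.pinv(A.T @ A)
        return A, B, C, D

The rows of D then read as the questions characterizing each latent group, which is what makes the groups human-readable.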

Figure 4: Overview of results of the Neurosemantics application. (a) The premotor cortex, having high activation here, is associated with motions such as holding and picking small items up. (b) Given MEG measurements (time series of the magnetic activity of a particular region of the brain), GEBM learns a graph between physical brain regions. Furthermore, GEBM simulates the real MEG activity very accurately.

In a similar experimental setting, where the human subjects are also asked to answer a simple yes/no question about the noun they are reading, in [10] we seek to discover the functional connectivity of the brain for the particular task. Functional connectivity is an information-flow graph between different regions of the brain, indicating a high degree of interaction between (not necessarily physically connected) regions while the person is reading the noun and answering the question. In [10] we propose GEBM, a novel model for functional connectivity which views the brain as a control system, and we propose a sparse system identification algorithm which solves the model. Figure 4(b) shows an overview of GEBM: given MEG measurements (time series of the magnetic activity of a particular region of the brain), we learn a model that describes a graph between physical brain regions and simulates real MEG activity very accurately.
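The generative half of GEBM is a linear dynamical system; the following sketch simulates it with random placeholder matrices (the actual contribution of [10] is the sparse system identification that recovers A, B, C from the measurements; the sensor and stimulus counts follow the MEG setup of [10]):

    import numpy as np

    def simulate(A, B, C, s, x0):
        """GEBM-style control system: hidden brain regions evolve as
        x(t+1) = A x(t) + B s(t); the sensors observe y(t) = C x(t)."""
        x, ys = x0, []
        for t in range(s.shape[1]):
            ys.append(C @ x)
            x = A @ x + B @ s[:, t]
        return np.stack(ys, axis=1)             # sensors x time

    n, m, q, T = 20, 306, 40, 340               # hidden regions, sensors, stimuli, ticks
    rng = np.random.default_rng(0)
    A = np.where(rng.random((n, n)) < 0.1,      # sparse hidden connectivity:
                 0.1 * rng.standard_normal((n, n)), 0.0)  # most regions don't interact
    B = 0.01 * rng.standard_normal((n, q))      # stimuli -> hidden regions
    C = rng.standard_normal((m, n))             # hidden regions -> sensors
    y = simulate(A, B, C, rng.standard_normal((q, T)), np.zeros(n))

Sparsity in A matters because an exact zero states that two regions do not interact directly, whereas a merely small entry of a dense matrix is ambiguous.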

Impact - Neurosemantics
• GEBM [10] is taught in class CptS 595 at Washington State University.
• Our work in [11] was selected as one of the best papers of SDM 2014.

B3) Knowledge on the Web

Knowledge on the Web has multiple aspects: real-world entities such as Barack Obama and USA are usually linked in multiple ways, e.g., is president of, was born in, and lives in. Modeling those multi-aspect relations as a tensor and computing a low-rank decomposition of the data results in embeddings of those entities in a lower dimension, which can help discover semantically and


contextually similar entities, as well as discover missing links. In [9] we discover semantically similar noun-phrases in a Knowledge Base coming from the Read the Web project at CMU: http://rtw.ml.cmu.edu/rtw/. Language is another aspect: many web pages have parallel content in different languages; however, some languages have higher representation than others. How can we learn a high-quality joint latent representation of entities and words, where we combine information from all languages? In [14] we introduce the notion of translation-invariant word embeddings, where we compute multi-lingual embeddings, forcing translations to be "close" in the embedding space. Our approach outperforms the state of the art.
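As a toy illustration of the entity-embedding idea (the triples and the rank are made up, and a plain SVD of an unfolding stands in for the CP factors a decomposition such as the ALS sketch above would produce):

    import numpy as np

    triples = [("obama", "is_president_of", "usa"),
               ("obama", "was_born_in", "usa"),
               ("merkel", "is_chancellor_of", "germany"),
               ("merkel", "lives_in", "germany")]
    entities = sorted({t[0] for t in triples} | {t[2] for t in triples})
    relations = sorted({t[1] for t in triples})
    e_idx = {e: i for i, e in enumerate(entities)}
    r_idx = {r: i for i, r in enumerate(relations)}

    # subject x relation x object tensor of the knowledge base
    X = np.zeros((len(entities), len(relations), len(entities)))
    for s, r, o in triples:
        X[e_idx[s], r_idx[r], e_idx[o]] = 1.0

    # low-rank factors of X embed entities: rows that are close correspond to
    # semantically similar noun-phrases, and large reconstructed values at
    # zero entries suggest missing links
    U, _, _ = np.linalg.svd(X.reshape(len(entities), -1), full_matrices=False)
    emb = U[:, :2]
    sim = emb @ emb.T                     # entity-entity similarity scores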

Yet another aspect of a real-world entity on the web is the set of search engine results for that entity, which is the biased view of each search engine, as a result of its crawling, indexing, ranking, and potential personalization algorithms for that query. In [1] we introduce TENSORCOMPARE, a tool which measures the overlap in the results of different search engines. We conduct a case study on Google and Bing, finding high overlap. Given this high overlap, how can we use different signals, potentially coming from social media, in order to provide diverse and useful results? In [2] we follow up, designing a Twitter-based web search engine, using tweet popularity as a ranking function, which does exactly that. This result has huge potential for the future of web search, paving the way for the use of social signals in the determination and ranking of results.

Impact - Knowledge on the Web
• In [14] we are the first to propose the concept of translation-invariant word embeddings.
• Our work in [1] was selected to appear in a special issue of the Journal of Web Science, as one of the best papers in the Web Science Track of WWW 2015.

FUTURE RESEARCH DIRECTIONS
As more field sciences are incorporating information technologies (with Computational Neuroscience being a prime example from a long list of disciplines), the need for scalable, efficient, and interpretable multi-aspect data mining algorithms will only increase.

1) Long-Term Vision: Big Signal Processing for Data Science
The process of extracting useful and novel knowledge from big data in order to drive scientific discovery is the holy grail of data science. Consider the case where we view the knowledge extraction process through a signal processing lens: suppose that a transmitter (a physical or social phenomenon) is generating signals which describe aspects of the underlying phenomenon. An example signal is a time-evolving graph (a time-series of graphs), which can be represented as a tensor. The receiver (the data scientist in our case) combines all those signals with the ultimate goal of reconstructing (and generally understanding) their generative process. The "communication channel" wherein the data are transmitted can play the role of the measurement process where loss of data occurs, and thus we have to account for the channel estimation in our analysis, in order to reverse its effect. We may consider that the best way to transmit all the data to the receiver is to find the best compression or dimensionality reduction of those signals (e.g., Compressed Sensing). There may be more than one transmitter (if we consider a setting where the data are distributed across data centers) and multiple receivers, in which case privacy considerations come into play. In my work so far I have established two connections between Signal Processing and Data Science (Tensor Decompositions & System Identification), contributing new results to both communities; this work has already had significant impact, demonstrated, for instance, by the number of researchers using and extending it. These two aforementioned connections are but instances of a vast landscape of opportunities for cross-pollination which will advance the state of the art of both fields and drive scientific discovery. I envision my future work as a three-way bridge between Signal Processing, Data Science, and high-impact real-world applications.

2) Mid-Term Research Plan
With respect to my shorter-term plan (first 3-5 years), I provide a more detailed outline of the thrusts and challenges that I am planning to tackle, both regarding algorithms and multi-aspect data applications.

Algorithms: Triple-Sparse Multi-Aspect Data Exploration & Analysis with Quality Assessment

In addition to improving the state of the art for tensor decompositions, I plan to explore alternatives for multi-aspect data modeling and representation learning, which can be used with tensor decompositions. Some of the challenges are:
Algorithms, models, and problem formulation: What is the appropriate (factorization) model? Are there any noisy aspects that may hurt performance? As the ambient dimensions of the data grow and the number of aspects increases, data become much sparser. Sparsity, as we saw, is a blessing when it comes to scalability; however, it can also be a curse when it comes to detecting meaningful, relatively dense patterns in subspaces within the data.
Scalability & Efficiency: Data size will continue to grow, and we need faster and more scalable algorithms to handle the size and complexity of the data. This will involve adjusting existing algorithms or proposing new algorithms that adhere to the triple-sparse paradigm. Furthermore, the particular choice of a distributed system can make a huge difference depending on the application. For instance, Map/Reduce works well in batch tasks, whereas it has well-known weaknesses in iterative algorithms. Other systems, e.g., Spark, could be more appropriate for such tasks, and in the future I plan to research the capabilities of current high-end distributed systems in relation to the algorithms I develop.
(Unsupervised) Quality Assessment: In addition to purely algorithmic approaches, it is instrumental to work in collaboration with field scientists and experts, and incorporate into the quality assessment elements that experts indicate as important. Finally, there are a lot of exciting routes of collaboration with Human-Computer Interaction and Crowdsourcing experts, harnessing the "wisdom


of the crowds" in assessing the quality of our analysis, especially in applications such as knowledge on the web and multi-aspect social networks, where non-experts may be able to provide high-quality judgements.

Application: Neurosemantics
In the search for understanding how semantic information is processed by the brain, I am planning to broaden the scope of the Neurosemantics applications, considering aspects such as language: are the same concepts in different languages mapped in the same way in the brain? Are cultural idiosyncrasies reflected in the way that speakers of different languages represent information? Furthermore, I will consider more complex forms of stimuli (such as phrases, images, and video) and richer sources of semantic information, e.g., from Knowledge on the Web. There is profound scientific interest in answering the above research questions, a fact also reflected in how well funded this research area is (e.g., see the BRAIN Initiative, http://braininitiative.nih.gov/).

Application: Urban & Social Computing
Social and physical interactions of people in an urban environment are an inherently multi-aspect process that ties together the physical and on-line domains of human interaction. I plan to investigate human mobility patterns, e.g., through check-ins in on-line social networks, in combination with people's social interactions and the content they create on-line, with specific emphasis on multi-lingual content, which is becoming very prevalent due to population mobility. I also intend to develop anomaly detection techniques which can point to fraud (e.g., people buying fake followers for their account). Improving user experience through identifying normal and anomalous patterns in human geo-social activity is an extremely important problem, both in terms of funding and research interest, as well as in its implications for revolutionizing modern societies.

Application: Knowledge on the Web
Knowledge bases are ubiquitous, providing taxonomies of human knowledge and facilitating web search. Many knowledge bases are extracted automatically, and as such they are noisy and incomplete. Furthermore, web content exists in multiple languages, which is inherently imbalanced and may result in imbalanced knowledge bases, consequently leading to skewed views of web knowledge per language. I plan to continue my work on web knowledge, devising and developing techniques that combine multiple, multilingual, structured (e.g., knowledge bases) and unstructured (e.g., plain text) sources of information on the web, aiming for high-quality knowledge representation, as well as enrichment and curation of knowledge bases.

3) Funding, Collaborations, and Parting Thoughts
During my studies, I have partaken in grant proposal writing, contributing significantly to a successful NSF/NIH BIGDATA grant proposal (NSF IIS-1247489 & NIH 1R01GM108339-1) which resulted in $894,892 to CMU ($1.6M total) and has funded most of my PhD studies. In particular, I worked on outlining and describing the important algorithmic challenges as well as the applications we proposed to tackle. Being a major contributor to the proposal was a unique opportunity for me, giving me freedom to shape my research agenda. I have also been fortunate to have collaborated with a number of stellar researchers both in academia and industry, many of whom have been my mentors in research. A research agenda is very often shaped in wonderful ways through such collaborations, as well as through mentoring and advising graduate students. To that end, I intend to keep nurturing and strengthening my on-going collaborations, and to pursue collaboration with scholars within and outside my field, always following my overarching theme: bridging Signal Processing and Big Data Science for high-impact, real-world applications.

Selected References
[1] Rakesh Agrawal, Behzad Golshan, and Evangelos E. Papalexakis. A study of distinctiveness in web results of two search engines. In WWW'15 Web Science Track (author order is alphabetical).
[2] Rakesh Agrawal, Behzad Golshan, and Evangelos E. Papalexakis. Whither social networks for web search? In ACM KDD'15 (author order is alphabetical).
[3] Miguel Araujo, Spiros Papadimitriou, Stephan Gunnemann, Christos Faloutsos, Prithwish Basu, Ananthram Swami, Evangelos E. Papalexakis, and Danai Koutra. Com2: Fast automatic discovery of temporal ('comet') communities. In Advances in Knowledge Discovery and Data Mining. Springer, 2014.
[4] Alex Beutel, Abhimanu Kumar, Evangelos E. Papalexakis, Partha Pratim Talukdar, Christos Faloutsos, and Eric P. Xing. FlexiFaCT: Scalable flexible factorization of coupled tensors on Hadoop. In SIAM SDM'14, 2014.
[5] Evangelos E. Papalexakis, Leman Akoglu, and Dino Ienco. Do more views of a graph help? Community detection and clustering in multi-graphs. In IEEE FUSION'13.
[6] Evangelos E. Papalexakis, Alex Beutel, and Peter Steenkiste. Network anomaly detection using co-clustering. In Encyclopedia of Social Network Analysis and Mining. Springer, 2014.
[7] Evangelos E. Papalexakis and A. Seza Dogruoz. Understanding multilingual social networks in online immigrant communities. In WWW'15 Companion.
[8] Evangelos E. Papalexakis and C. Faloutsos. Fast efficient and scalable core consistency diagnostic for the PARAFAC decomposition for big sparse tensors. In IEEE ICASSP'15.
[9] Evangelos E. Papalexakis, Christos Faloutsos, and Nicholas D. Sidiropoulos. ParCube: Sparse parallelizable tensor decompositions. In ECML-PKDD'12.
[10] Evangelos E. Papalexakis, Alona Fyshe, Nicholas D. Sidiropoulos, Partha Pratim Talukdar, Tom M. Mitchell, and Christos Faloutsos. Good-enough brain model: Challenges, algorithms and discoveries in multi-subject experiments. In ACM KDD'14.
[11] Evangelos E. Papalexakis, Tom M. Mitchell, Nicholas D. Sidiropoulos, Christos Faloutsos, Partha Pratim Talukdar, and Brian Murphy. Turbo-SMT: Accelerating coupled sparse matrix-tensor factorizations by 200x. In SIAM SDM'14.
[12] Evangelos E. Papalexakis, Konstantinos Pelechrinis, and Christos Faloutsos. Location based social network analysis using tensors and signal processing tools. In IEEE CAMSAP'15.
[13] Evangelos E. Papalexakis, Nicholas D. Sidiropoulos, and Rasmus Bro. From k-means to higher-way co-clustering: multilinear decomposition with sparse latent factors. IEEE Transactions on Signal Processing, 2013.
[14] Matt Gardner, Kejun Huang, Evangelos E. Papalexakis, Xiao Fu, Partha Talukdar, Christos Faloutsos, Nicholas Sidiropoulos, and Tom Mitchell. Translation invariant word embeddings. In EMNLP'15.
[15] U Kang, Evangelos E. Papalexakis, Abhay Harpale, and Christos Faloutsos. GigaTensor: scaling tensor analysis up by 100 times - algorithms and discoveries. In ACM KDD'12.
[16] Ching-Hao Mao, Chung-Jung Wu, Evangelos E. Papalexakis, Christos Faloutsos, and Tien-Cheu Kao. MalSpot: Multi2 malicious network behavior patterns analysis. In PAKDD'14.
[17] N. Sidiropoulos, Evangelos E. Papalexakis, and C. Faloutsos. Parallel randomly compressed cubes: A scalable distributed architecture for big tensor decomposition. IEEE Signal Processing Magazine, 2014.
