
INSTITUTE FOR DEFENSE ANALYSES

A Framework for Comparing Groups of Documents

Arun S. Maiya

September 21, 2015

Approved for public release; distribution is unlimited.

IDA Non-Standard NS D-5543

Log: H 15-000660


INSTITUTE FOR DEFENSE ANALYSES

4850 Mark Center Drive Alexandria, Virginia 22311-1882


About This Publication

This work was conducted by the Institute for Defense Analyses (IDA) under contract HQ0034-14-D-0001, Task AK-5-3153, “DAMO Implementation of Phase 4 Operational Capabilities,” for the Office of the Deputy Assistant Secretary of Defense for Systems Engineering. The views, opinions, and findings should not be construed as representing the official position of either the Department of Defense or the sponsoring organization.

Acknowledgments

Robert M. Rolfe

Copyright Notice

© 2015 Institute for Defense Analyses, 4850 Mark Center Drive, Alexandria, Virginia 22311-1882 • (703) 845-2000.

This material may be reproduced by or for the U.S. Government pursuant to the copyright license under the clause at DFARS 252.227-7013 (a)(16) [Jun 2013].


A Framework for Comparing Groups of Documents

Arun S. Maiya
Institute for Defense Analyses, Alexandria, VA, USA

[email protected]

Abstract

We present a general framework for comparing multiple groups of documents. A bipartite graph model is proposed where document groups are represented as one node set and the comparison criteria are represented as the other node set. Using this model, we present basic algorithms to extract insights into similarities and differences among the document groups. Finally, we demonstrate the versatility of our framework through an analysis of NSF funding programs for basic research.

1 Introduction and Motivation

Given multiple sets (or groups) of documents, it is often necessary to compare the groups to identify similarities and differences along different dimensions. In this work, we present a general framework to perform such comparisons for extraction of important insights. Indeed, many real-world tasks can be framed as a problem of comparing two or more groups of documents. Here, we provide two motivating examples.

1. Program Reviews. To better direct research efforts, funding organizations such as the National Science Foundation (NSF), the National Institutes of Health (NIH), and the Department of Defense (DoD) are often in the position of reviewing research programs via their artifacts (e.g., grant abstracts, published papers, and other research descriptions). Such reviews might involve identifying overlaps across different programs, which may indicate a duplication of effort. They may also involve the identification of unique, emerging, or diminishing topics. A “document group” here could be defined as a particular research program that funds many organizations, the totality of funded research conducted by a specific organization, or all research associated with a particular time period (e.g., fiscal year). In all cases, the objective is to draw comparisons between groups by comparing the document sets associated with them.

2. Intelligence. In the areas of defense and intelligence, document sets are sometimes obtained from different sources or entities. For instance, the U.S. Armed Forces sometimes seize documents during raids of terrorist strongholds.¹ Similarities between two document sets (each captured from a different source) can potentially be used to infer a non-obvious association between the sources.

Of course, there are numerous additional examples across many domains (e.g., comparing different news sources, comparing the reviews for several products, etc.). Given the abundance of real-world applications as illustrated above, it is surprising, then, that there are no existing general-purpose approaches for drawing such comparisons. While there is some previous work on the comparison of document sets (referred to as comparative text mining), these existing approaches lack the generality to be widely applicable across different use case scenarios with different comparison criteria. Moreover, much of the work in the area focuses largely on the summarization of shared or unshared topics among document groups (e.g., Wan et al. (2011), Huang et al. (2011), Campr and Jezek (2013), Wang et al. (2012), Zhai et al. (2004)). That is, the problem of drawing multi-faceted comparisons among the groups themselves is not typically addressed. This, then, motivates our development of a general-purpose model for comparisons of document sets along arbitrary dimensions. We use this model for the identification of similarities, differences, trends, and anomalies among large groups of documents. We begin by formally describing our model.

2 Our Formal Model for Comparing Document Groups

As input, we are given several groups of documents, and our task is to compare them. We now formally define these document groups and the criteria used to compare them. Let $D = \{d_1, d_2, \ldots, d_N\}$ be a document collection comprising the totality of documents under consideration, where $N$ is the size. Let $D^P$ be a partition of $D$ representing the document groups.

¹ http://en.wikipedia.org/wiki/Document_Exploitation_(DOCEX)



Definition 1 A document group is a subset $D^P_i \in D^P$ (where index $i \in \{1 \ldots |D^P|\}$).

Each document group in $D^P$, for instance, might represent articles associated with either a particular organization (e.g., university), a research funding source (e.g., NSF or DARPA program), or a time period (e.g., a fiscal year). Document groups are compared using comparison criteria, $D^C$, a family of subsets of $D$.

Definition 2 A comparison criterion is a subset $D^C_i \in D^C$ (where index $i \in \{1 \ldots |D^C|\}$).

Intuitively, each subset of $D^C$ represents a set of documents sharing some attribute. Our model allows great flexibility in how $D^C$ is defined. For instance, $D^C$ might be defined by the named entities mentioned within documents (e.g., each subset contains documents that mention a particular person or organization of interest). For the present work, we define $D^C$ by topics discovered using latent Dirichlet allocation, or LDA (Blei et al., 2003).

LDA Topics as Comparison Criteria. Probabilistic topic modeling algorithms like LDA discover latent themes (i.e., topics) in document collections. By using these discovered topics as the comparison criteria, we can compare arbitrary groups of documents by the themes and subject areas comprising them. Let $K$ be the number of topics or themes in $D$. Each document in $D$ is composed of a sequence of words: $d_i = \langle s_{i1}, s_{i2}, \ldots, s_{iN_i} \rangle$, where $N_i$ is the number of words in $d_i$ and $i \in \{1 \ldots N\}$. $V = \bigcup_{i=1}^{N} f(d_i)$ is the vocabulary of $D$, where $f(\cdot)$ takes a sequence of elements and returns a set. LDA takes $K$ and $D$ (including its components such as $V$) as input and produces two matrices as output, one of which is $\theta$. The matrix $\theta \in \mathbb{R}^{N \times K}$ is the document-topic distribution matrix and shows the distribution of topics within each document. Each row of the matrix represents a probability distribution. $D^C$ is constructed using $K$ subsets of documents, each of which represents a set of documents pertaining largely to the same topic. That is, for $t \in \{1 \ldots K\}$ and $i \in \{1 \ldots N\}$, each subset $D^C_t \in D^C$ is comprised of all documents $d_i$ where $t = \arg\max_x \theta_{ix}$.² Having defined the document groups $D^P$ and the comparison criteria $D^C$, we now construct a bipartite graph model used to perform comparisons.

² $D^C$ is also a partition of $D$, when defined in this way.
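To make this construction concrete, the following is a minimal Python sketch (ours, not the paper's implementation) of deriving $D^C$ from the document-topic matrix $\theta$; the variable names and the randomly generated stand-in for $\theta$ are illustrative assumptions:

```python
import numpy as np

# Stand-in for the (N x K) document-topic matrix produced by LDA;
# each row is a probability distribution over K topics.
N, K = 1000, 50  # hypothetical corpus and topic-count sizes
theta = np.random.dirichlet(alpha=[0.1] * K, size=N)

# Assign each document to its dominant topic: t = argmax_x theta[i, x].
dominant_topic = theta.argmax(axis=1)

# D^C: for each topic t, the subset of documents whose dominant topic is t.
D_C = {t: set(np.where(dominant_topic == t)[0]) for t in range(K)}
```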

A Bipartite Graph Model. Our objective is to compare the document groups in $D^P$ based on $D^C$. We do so by representing $D^P$ and $D^C$ as a weighted bipartite graph, $G = (P, C, E, w)$, where $P$ and $C$ are disjoint sets of nodes, $E$ is the edge set, and $w : E \to \mathbb{Z}^+$ are the edge weights. Each subset of $D^P$ is represented as a node in $P$, and each subset of $D^C$ is represented as a node in $C$. Let $\alpha : P \to D^P$ and $\beta : C \to D^C$ be functions that map nodes to the document subsets that they represent. Then, the edge set $E$ is $\{(u, v) \mid u \in P, v \in C, \alpha(u) \cap \beta(v) \neq \emptyset\}$, and the edge weight for any two nodes $u \in P$ and $v \in C$ is $w((u, v)) = |\alpha(u) \cap \beta(v)|$. Concisely, each weighted edge in $G$ between a document group (in $P$) and a topic (in $C$) represents the number of documents shared between the two sets. Figure 1 shows a toy illustration of the model. Each node in $P$ is shown in black and represents a subset of $D^P$ (i.e., a document group). Each node in $C$ is shown in gray and represents a subset of $D^C$ (i.e., a document cluster pertaining primarily to the same topic). Each edge represents the intersection of the two subsets it connects. In the next section, we will describe basic algorithms on such bipartite graphs capable of yielding important insights into the similarities and differences among document groups.

Figure 1: [Toy Illustration of Bipartite Graph Model.] Each black node (i.e., node $\in P$) represents a document group. Each gray node (i.e., node $\in C$) represents a cluster of documents pertaining primarily to the same topic.
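As a rough illustration of this construction (our own sketch, not the authors' code; it assumes D_P and D_C are dictionaries mapping group and topic labels to sets of document IDs, as in the earlier sketch):

```python
import networkx as nx

def build_bipartite_graph(D_P, D_C):
    """Build G = (P, C, E, w): an edge connects a group node to a topic
    node iff their document subsets intersect; the edge weight is the
    size of the intersection."""
    G = nx.Graph()
    G.add_nodes_from((("P", p) for p in D_P), bipartite=0)  # group nodes
    G.add_nodes_from((("C", c) for c in D_C), bipartite=1)  # topic nodes
    for p, group_docs in D_P.items():
        for c, topic_docs in D_C.items():
            overlap = len(group_docs & topic_docs)
            if overlap > 0:  # edge only when the subsets intersect
                G.add_edge(("P", p), ("C", c), weight=overlap)
    return G
```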

3 Basic Algorithms Using the Model

We focus on three basic operations in this work.

Node Entropy. Let $\vec{w}$ be a vector of weights for all edges incident to some node $v \in P \cup C$. The entropy $H$ of $v$ is: $H(v) = -\sum_i p_i \log_{|\vec{w}|}(p_i)$, where $p_i = \frac{w_i}{\sum_j w_j}$ and $i, j \in \{1 \ldots |\vec{w}|\}$. A similar formulation was employed in Eagle et al. (2010). Intuitively, if $v \in P$, $H(v)$ measures the extent to which the document group is concentrated around a small number of topics (lower values of $H(v)$ mean more concentrated). Similarly, if $v \in C$, it is the extent to which a topic is concentrated around a small number of document groups.
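A direct transcription of this entropy into Python (our sketch; it assumes the weighted graph built in the previous example) might look like:

```python
import math

def node_entropy(G, v):
    """Entropy of v's edge-weight distribution, with logs taken base |w|
    (the number of incident edges) so that H(v) lies in [0, 1]."""
    weights = [d["weight"] for _, _, d in G.edges(v, data=True)]
    if len(weights) <= 1:
        return 0.0  # zero or one edge: maximally concentrated by convention
    total = sum(weights)
    return -sum((w / total) * math.log(w / total, len(weights))
                for w in weights if w > 0)
```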

Node Similarity. Given a graph $G$, there are many ways to measure the similarity of two nodes based on their connections. Such measures can be used to infer similarity (and dissimilarity) among document groups. However, existing methods are not well-suited for the task of document group comparison. The well-known SimRank algorithm (Jeh and Widom, 2002) ignores edge weights, and neither SimRank nor its extension, SimRank++ (Antonellis et al., 2008), scales to larger graphs. SimRank++ and ASCOS (Chen and Giles, 2013) do incorporate edge weights but in ways that are not appropriate for document group comparisons. For instance, both SimRank++ and ASCOS incorporate magnitude in the similarity computation. Consider the case where document groups are defined as research labs. ASCOS and SimRank++ will measure large research labs and small research labs as less similar when in fact they may publish nearly identical lines of research. Finally, under these existing methods, document groups sharing zero topics in common could still be considered similar, which is undesirable here. For these reasons, we formulate similarity as follows. Let $N_G(\cdot)$ be a function that returns the neighbors of a given node in $G$. Given two nodes $u, v \in P$, let $L_{u,v} = N_G(u) \cup N_G(v)$ and let $x : I \to L_{u,v}$ be the indexing function for $L_{u,v}$.³ We construct two vectors, $\vec{a}$ and $\vec{b}$, where $a_k = w(u, x(k))$, $b_k = w(v, x(k))$, and $k \in I$. Each vector is essentially a sequence of weights for edges between $u, v \in P$ and each node in $L_{u,v}$. Similarity of two nodes is measured using the cosine similarity of their corresponding sequences, $\frac{\vec{a} \cdot \vec{b}}{\|\vec{a}\| \|\vec{b}\|}$, which we compute using a function $sim(\cdot, \cdot)$. Thus, document groups are considered more similar when they have similar sets of topics in similar proportions. As we will show later, this simple solution, based on item-based collaborative filtering (Sarwar et al., 2001), is surprisingly effective at inferring similarity among document groups in $G$.

³ $I$ is the index set of $L_{u,v}$.
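A minimal implementation of this $sim(\cdot, \cdot)$ function (again our own sketch, over the graph from the earlier example) could be:

```python
import math

def sim(G, u, v):
    """Cosine similarity of the edge-weight vectors of group nodes u and v,
    indexed over L_{u,v}, the union of their neighbor sets."""
    L = set(G[u]) | set(G[v])  # L_{u,v} = N_G(u) ∪ N_G(v)
    a = [G[u][c]["weight"] if c in G[u] else 0.0 for c in L]
    b = [G[v][c]["weight"] if c in G[v] else 0.0 for c in L]
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0
```

Note that two groups sharing no topics have orthogonal weight vectors and thus a similarity of exactly zero, matching the requirement stated above.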

Node Clusters. Identifying clusters of related nodes in the bipartite graph $G$ can show how document groups form larger classes. However, we find that $G$ is typically fairly dense. For this reason, partitioning of the one-mode projection of $G$ and other standard bipartite graph clustering techniques (e.g., Dhillon (2001) and Sun et al. (2009)) are rendered less effective. We instead employ a different tack and exploit the node similarities computed earlier. We transform $G$ into a new weighted graph $G^P = (P, E^P, w_{sim})$, where $E^P = \{(u, v) \mid u, v \in P, sim(u, v) > \xi\}$, $\xi$ is a predefined threshold, and $w_{sim}$ is the edge weight function (i.e., $w_{sim} = sim$). Thus, $G^P$ is the similarity graph of document groups. $\xi = 0.5$ was used as the threshold for our analyses. To find clusters in $G^P$, we employ the Louvain algorithm, a heuristic method based on modularity optimization (Blondel et al., 2008). Modularity measures the fraction of edges falling within clusters as compared to the expected fraction if edges were distributed evenly in the graph (Newman, 2006). The algorithm initially assigns each node to its own cluster. At each iteration, in a local and greedy fashion, nodes are re-assigned to the clusters with which they achieve the highest modularity.
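Under the same assumptions as the previous sketches, the construction of $G^P$ and its Louvain clustering might be written as follows (louvain_communities is available in NetworkX 2.8 and later; older setups typically use the separate python-louvain package instead):

```python
import itertools
import networkx as nx
from networkx.algorithms.community import louvain_communities

def cluster_groups(G, group_nodes, xi=0.5):
    """Build the similarity graph G^P (keeping edges with sim > xi) and
    partition it with the Louvain modularity heuristic."""
    G_P = nx.Graph()
    G_P.add_nodes_from(group_nodes)
    for u, v in itertools.combinations(group_nodes, 2):
        s = sim(G, u, v)  # sim(.,.) from the previous sketch
        if s > xi:
            G_P.add_edge(u, v, weight=s)
    return louvain_communities(G_P, weight="weight")
```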

4 Example Analysis: NSF Grants

As a realistic and informative case study, we utilize our model to characterize funding programs of the National Science Foundation (NSF). This corpus consists of 132,372 grant abstracts describing awards for basic research and other support funded by the NSF between the years 1990 and 2002 (Bache and Lichman, 2013).⁴ Each award is associated with both a program element (i.e., funding source) and a date. We define document groups in two ways: by program element and by calendar year. For comparison criteria, we used topics discovered with the MALLET implementation of LDA (McCallum, 2002) using $K = 400$ as the number of topics and 200 as the number of iterations. All other parameters were left as defaults. The NSF corpus possesses unique properties that lend themselves to experimental evaluation. For instance, program elements are not only associated with specific sets of research topics but are named based on the content of the program. This provides a measure of ground truth against which we can validate our model. We structure our analyses around specific questions, which now follow.

⁴ Data for years 1989 and 2003 in this publicly available corpus were partially missing and omitted in some analyses.

Which NSF programs are focused on specific areas and which are not? When defining document groups as program elements (i.e., each NSF program is a node in $P$), node entropy can be used to answer this question. Table 1 shows examples of program elements most and least associated with specific topics, as measured by entropy. For example, the program 1311 Linguistics (low entropy) is largely focused on a single linguistics topic (labeled by LDA with words such as “language,” “languages,” and “linguistic”). By contrast, the Australia program (high entropy) was designed to support US-Australia cooperative research across many fields, as correctly inferred by our model.

Low Entropy Program Elements
Program                                      Primary LDA Topic
1311 Linguistics                             language languages linguistic
4091 Network Infrastructure                  network connection internet

High Entropy Program Elements
Program                                      Primary LDA Topic
5912 Australia                               (many topics & disciplines)
9130 Res. Improvements in Minority Instit.   (many topics & disciplines)

Table 1: [Examples of High/Low Entropy Programs.]

Which research areas are growing/emerging? When defining document groups as calendar years (instead of program elements), low entropy nodes in $C$ are topics concentrated around certain years. Concentrations in later years indicate growth. The LDA-discovered topic nanotechnology is among the lowest entropy topics (i.e., an outlier topic with respect to entropy). As shown in Figure 2, the number of nanotechnology grants drastically increased in proportion through 2002. This result is consistent with history, as the National Nanotechnology Initiative was proposed in the late 1990s to promote nanotechnology R&D.⁵ One could also measure such trends using budget allocations by incorporating the award amounts into the edge weights of $G$.

⁵ http://en.wikipedia.org/wiki/National_Nanotechnology_Initiative

[Figure 2: plot of “Percentage of Total Grants” (y-axis, 0.0 to 1.0) versus year (x-axis, 1990 to 2002) for the topic Nanotechnology.]

Figure 2: [Uptrend in Nanotechnology.] Our model correctly identifies the surge in nanotechnology R&D beginning in the late 1990s.

Given an NSF program, to which other programs is it most similar? As described in Section 3, when each node in $P$ represents an NSF program, our model can easily identify the programs most similar to a given program. For instance, Table 2 shows the top three most similar programs to both the Theoretical Physics and Ecology programs. Results agree with intuition. For each NSF program, we identified the top $n$ most similar programs ranked by our $sim(\cdot, \cdot)$ function, where $n \in \{3, 6, 9\}$. These programs were manually judged for relatedness, and the Mean Average Precision (MAP), a standard performance metric for ranking tasks in information retrieval, was computed. We were unsuccessful in evaluating alternative weighted similarity measures mentioned in Section 3 due to their aforementioned issues with scalability and the size of the NSF dataset. (For instance, the implementations of ASCOS (Chen and Giles, 2013) and SimRank (Jeh and Widom, 2002) that we considered are available here.⁶) Recall that our $sim(\cdot, \cdot)$ function is based on measuring the cosine similarity between two weight vectors, $\vec{a}$ and $\vec{b}$, generated from our bipartite graph model. As a baseline for comparison, we evaluated two additional similarity implementations using these weight vectors. The first measures the similarity between weight vectors using weighted Jaccard similarity, $\frac{\sum_k \min(a_k, b_k)}{\sum_k \max(a_k, b_k)}$ (denoted as Wtd. Jaccard). The second measure is implemented by taking the Spearman's rank correlation coefficient of $\vec{a}$ and $\vec{b}$ (denoted as Rank). Figure 3 shows the Mean Average Precision (MAP) for each method and each value of $n$. With the exception of the difference between Cosine and Wtd. Jaccard for MAP@3, all other performance differentials were statistically significant, based on a one-way ANOVA and post-hoc Tukey HSD at a 5% significance level. This, then, provides some validation for our choice.

⁶ https://github.com/hhchen1105/networkx_addon
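For reference, the two baseline measures can be sketched as follows (our illustration, operating on the same weight vectors $\vec{a}$ and $\vec{b}$ used by the cosine measure):

```python
from scipy.stats import spearmanr

def weighted_jaccard(a, b):
    """Weighted Jaccard: sum_k min(a_k, b_k) / sum_k max(a_k, b_k)."""
    denom = sum(max(x, y) for x, y in zip(a, b))
    return sum(min(x, y) for x, y in zip(a, b)) / denom if denom else 0.0

def rank_similarity(a, b):
    """Spearman's rank correlation coefficient of the two weight vectors."""
    return spearmanr(a, b).correlation
```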

1245 Theoretical Physics            1182 Ecology
1286 Elementary Particle Theory     1128 Ecological Studies
1287 Mathematical Physics           1196 Environmental Biology
1284 Atomic Theory                  1195 Ecological Research

Table 2: [Similarity Queries.] Three most similar programs to the Theoretical Physics and Ecology programs.

Figure 3: [Mean Average Precision (MAP).] Cosine similarity outperforms alternative approaches.

How do NSF programs join together to form larger program categories? As mentioned, by using the similarity graph $G^P$ constructed from $G$, clusters of related NSF programs can be discovered. Figure 4, for instance, shows a discovered cluster of NSF programs all related to the field of neuroscience. Each NSF program (i.e., node) is composed of many documents.

Figure 4: [Neuroscience Programs.] A discovered cluster of program elements all related to neuroscience.

Which pairs of grants are the most similar in the research they describe? Although the focus of this paper is on drawing comparisons among groups of documents, it is often necessary to draw comparisons among individual documents, as well. For instance, in the case of this NSF corpus, one may wish to identify pairs of grants from different programs describing highly similar lines of research. One common approach to this is to exploit the low-dimensional representations of documents returned by LDA (Blei et al., 2003). Any given document $d_i \in D$ (where $i \in \{1 \ldots N\}$) can be represented by a $K$-dimensional probability vector of topic proportions given by $\theta_{i*}$, the $i$th row of the document-topic matrix $\theta$. The similarity between any two documents, then, can be measured using the distance between their corresponding probability vectors (i.e., probability distributions). We quantify the similarity between probability vectors using the complement of the Hellinger distance: $HS(d_x, d_y) = 1 - \frac{1}{\sqrt{2}} \sqrt{\sum_{i=1}^{K} \left(\sqrt{\theta_{xi}} - \sqrt{\theta_{yi}}\right)^2}$, where $x, y \in \{1 \ldots N\}$. Unfortunately, identifying the set of most similar document pairs in this way can be computationally expensive, as the number of pairwise comparisons scales quadratically with the size of the corpus. For the moderately-sized NSF corpus, this amounts to well over 8 billion comparisons. To address this issue, our bipartite graph model can be exploited as a blocking heuristic using either the document groups or the comparison criteria. In the latter case, one can limit the pairwise comparisons to only those documents that reside in the same subset of $D^C$. For the former case, node similarity can be used. Instead of comparing each document with every other document, we can limit the comparisons to only those document groups of interest that are deemed similar by our model. As an illustrative example, out of the 665 different NSF programs covering these 132,372 grant abstracts, the program 1271 Computational Mathematics and the program 2865 Numeric, Symbolic, and Geometric Computation are inferred as being highly similar by our model. Thus, we can limit the pairwise comparisons to only such document groups that are similar and likely to contain similar documents. In the case of these two programs, the following two grants are easily identified as being the most similar with a Hellinger similarity (HS) score of 0.73 (only text snippets are shown due to space constraints):

Grant #1
Program: 1271 Computational Mathematics
Title: Analyses of Structured Computational Problems and Parallel Iterative Algorithms.
Abstract: The main objectives of the research planned is the analysis of large scale structured computational problems and of the convergence of parallel iterative methods for solving linear systems and applications of these techniques to the solution of large sparse and dense structured systems of linear equations ...

Grant #2
Program: 2865 Numeric, Symbolic, and Geometric Computation
Title: Sparse Matrix Algorithms on Distributed Memory Multiprocessors.
Abstract: The design, analysis, and implementation of algorithms for the solution of sparse matrix problems on distributed memory multiprocessors will be investigated. The development of these parallel sparse matrix algorithms should have an impact of challenging large-scale computational problems in several scientific, econometric, and engineering disciplines.

Some key terms in each grant are manually highlighted in bold. As can be seen, despite some differences in terminology, the two lines of research are related, as matrices (studied in Grant #2) are used to compactly represent and work with systems of linear equations (studied in Grant #1). That is, despite such differences in terminology (e.g., “matrix” vs. “linear systems”, “parallel” vs. “distributed”), document similarity can still be accurately inferred by taking the Hellinger similarity of the LDA-derived low-dimensional representations for the two documents. In this way, by exploiting the group-level similarities inferred by our model in combination with such document-level similarities, we can more effectively “zero in” on such highly related document pairs.
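A sketch of the Hellinger similarity and the group-based blocking heuristic described above (ours; theta and the document-ID sets follow the earlier sketches) is:

```python
import itertools
import numpy as np

def hellinger_similarity(p, q):
    """Complement of the Hellinger distance between two topic distributions."""
    return 1.0 - np.sqrt(np.sum((np.sqrt(p) - np.sqrt(q)) ** 2)) / np.sqrt(2)

def most_similar_pairs(theta, docs_a, docs_b, top_n=10):
    """Blocking heuristic: rather than all O(N^2) corpus pairs, compare
    only documents drawn from two groups already deemed similar at the
    group level by the bipartite model."""
    scored = [(hellinger_similarity(theta[i], theta[j]), i, j)
              for i, j in itertools.product(docs_a, docs_b)]
    return sorted(scored, reverse=True)[:top_n]
```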

5 Conclusion

We have presented a bipartite graph model for drawing comparisons among large groups of documents. We showed how basic algorithms using the model can identify trends and anomalies among the document groups. As an example analysis, we demonstrated how our model can be used to better characterize and evaluate NSF research programs. For future work, we plan on employing alternative comparison criteria in our model, such as those derived from named entity recognition and paraphrase detection.



References

Ioannis Antonellis, Hector Garcia-Molina, and Chi-Chao Chang. 2008. Simrank++: Query Rewriting Through Link Analysis of the Click Graph. Proc. VLDB Endow., 1(1):408–421, August.

K. Bache and M. Lichman. 2013. UCI Machine Learning Repository.

David M. Blei, Andrew Y. Ng, and Michael I. Jordan. 2003. Latent Dirichlet Allocation. J. Mach. Learn. Res., 3:993–1022, March.

Vincent D. Blondel, Jean-Loup Guillaume, Renaud Lambiotte, and Etienne Lefebvre. 2008. Fast unfolding of communities in large networks. Journal of Statistical Mechanics: Theory and Experiment, 2008(10):P10008+, July.

Michal Campr and Karel Jezek. 2013. Topic Models for Comparative Summarization. In Ivan Habernal and Vaclav Matousek, editors, Text, Speech, and Dialogue, volume 8082 of Lecture Notes in Computer Science, pages 568–574. Springer Berlin Heidelberg.

Hung H. Chen and C. Lee Giles. 2013. ASCOS: An Asymmetric Network Structure COntext Similarity Measure. In Proceedings of the 2013 IEEE/ACM International Conference on Advances in Social Networks Analysis and Mining, ASONAM '13, pages 442–449, New York, NY, USA. ACM.

Inderjit S. Dhillon. 2001. Co-clustering Documents and Words Using Bipartite Spectral Graph Partitioning. Technical report, Austin, TX, USA.

Nathan Eagle, Michael Macy, and Rob Claxton. 2010. Network diversity and economic development. Science, 328(5981):1029–1031, May.

Xiaojiang Huang, Xiaojun Wan, and Jianguo Xiao. 2011. Comparative News Summarization Using Linear Programming. In Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies: Short Papers - Volume 2, HLT '11, pages 648–653, Stroudsburg, PA, USA. Association for Computational Linguistics.

Glen Jeh and Jennifer Widom. 2002. SimRank: A Measure of Structural-Context Similarity. In Proceedings of the Eighth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, KDD '02, pages 538–543, New York, NY, USA. ACM.

Andrew K. McCallum. 2002. MALLET: A Machine Learning for Language Toolkit.

M. E. J. Newman. 2006. Modularity and community structure in networks. Proceedings of the National Academy of Sciences, 103(23):8577–8582, June.

Badrul Sarwar, George Karypis, Joseph Konstan, and John Riedl. 2001. Item-based Collaborative Filtering Recommendation Algorithms. In Proceedings of the 10th International Conference on World Wide Web, WWW '01, pages 285–295, New York, NY, USA. ACM.

Yizhou Sun, Yintao Yu, and Jiawei Han. 2009. Ranking-based Clustering of Heterogeneous Information Networks with Star Network Schema. In Proceedings of the 15th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, KDD '09, pages 797–806, New York, NY, USA. ACM.

Xiaojun Wan, Houping Jia, Shanshan Huang, and Jianguo Xiao. 2011. Summarizing the Differences in Multilingual News. In Proceedings of the 34th International ACM SIGIR Conference on Research and Development in Information Retrieval, SIGIR '11, pages 735–744, New York, NY, USA. ACM.

Dingding Wang, Shenghuo Zhu, Tao Li, and Yihong Gong. 2012. Comparative Document Summarization via Discriminative Sentence Selection. ACM Trans. Knowl. Discov. Data, 6(3), October.

ChengXiang Zhai, Atulya Velivelli, and Bei Yu. 2004. A Cross-collection Mixture Model for Comparative Text Mining. In Proceedings of the Tenth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, KDD '04, pages 743–748, New York, NY, USA. ACM.
