Empirical Patterns in Google Scholar Citation Counts (CyberPatterns 2014)

Empirical Patterns in Google Scholar Citation Counts

Peter T. Breuer, Jonathan P. Bowen

University of Birmingham, Birmingham City University

http://bham.academia.edu/PeterBreuer/Talks

Peter Breuer's Google Scholar Page

Alan Turing's Google Scholar Page

Although these pages are very different ...

● ...they both share an underlying pattern

(and so does everybody else's we've examined)

To see the pattern ...

● … graph the citation numbers that appear going down the page against their rank

– A 108

– B 73

– C 65

– D 55

– E 44

– …

Alan Turing's ...

● … has the same shape, but sharper

– A 8093

– B 8093

– C 6902

– D 6684

– E 803

– …

– Z 1

Graphs look scale invariant.

– if we see part of a cites graph in isolation

– and don't know what the numbers are

• Can we tell which part of the graph we are looking at?

The answer is “no”

● Citation graphs are scale-invariant

– Count the number of citation counts that begin with the digit “1”, “2”, “3”, …, “9”

It's “Benford's Law”

● For scale-invariant distributions

– C(1) > C(2) > ... > C(9)
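The digit count described above can be sketched in a few lines. This is a minimal illustration, not the authors' code: the function names are hypothetical, and the expected tallies use the standard statement of Benford's Law, P(d) = log10(1 + 1/d), which the slides do not spell out. The sample is just the five counts listed earlier, far too few to test the law; it only shows the mechanics.

```python
import math
from collections import Counter

def leading_digit_counts(cite_counts):
    """Return {digit: number of counts starting with that digit} for digits 1..9."""
    tally = Counter(int(str(c)[0]) for c in cite_counts if c > 0)
    return {d: tally.get(d, 0) for d in range(1, 10)}

def benford_expected(total):
    """Expected tallies under Benford's Law, P(d) = log10(1 + 1/d)."""
    return {d: total * math.log10(1 + 1 / d) for d in range(1, 10)}

# Illustrative input: the first five citation counts shown on the earlier slide.
sample = [108, 73, 65, 55, 44]
observed = leading_digit_counts(sample)
expected = benford_expected(len(sample))
for d in range(1, 10):
    print(d, observed[d], round(expected[d], 2))
```

With a full citation list in place of `sample`, the observed column can be compared against the expected one to see whether C(1) > C(2) > ... > C(9) holds.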

What kind of scale-invariance?

● We believe cite counts against ranking are

$x_n \sim x_0 e^{-P\sqrt{n}}$

$\log(x_n/x_0) \sim -P\sqrt{n}$

$\log(-\log(x_n/x_0)) \sim \log(P) + 0.5 \log(n)$

Log-log cites graph should be a straight line!

● It is! (my data)

But Alan Turing's log-log data ...

● … follows a different straight line

slope 0.4, not 0.5
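The straight-line check can be sketched as an ordinary least-squares fit of log(log(x0/xn)) against log(n). This is a sketch under stated assumptions rather than the authors' procedure: it takes x0 to be the largest citation count, ranks the remaining papers n = 1, 2, ... in descending order, and skips points where the double logarithm is undefined. The function name `fit_cites_curve` and the synthetic data are hypothetical.

```python
import math

def fit_cites_curve(cite_counts):
    """Least-squares fit of log(log(x0/x_n)) = log(P) + A*log(n).

    Assumption: x0 is the largest count, and the remaining papers are
    ranked n = 1, 2, ... in descending order of citations.
    Returns (A, P), i.e. the slope and the exponential of the intercept.
    """
    ranked = sorted(cite_counts, reverse=True)
    x0 = ranked[0]
    pts = []
    for n, x in enumerate(ranked[1:], start=1):
        # log(log(x0/x)) is undefined for zero counts and for ties with x0.
        if 0 < x < x0:
            pts.append((math.log(n), math.log(math.log(x0 / x))))
    if len(pts) < 2:
        raise ValueError("need at least two usable points")
    m = len(pts)
    mean_u = sum(u for u, _ in pts) / m
    mean_v = sum(v for _, v in pts) / m
    sxx = sum((u - mean_u) ** 2 for u, _ in pts)
    sxy = sum((u - mean_u) * (v - mean_v) for u, v in pts)
    slope = sxy / sxx
    intercept = mean_v - slope * mean_u
    return slope, math.exp(intercept)

# Synthetic data generated from the model itself (hypothetical parameters).
x0, P, A = 100.0, 0.3, 0.5
synthetic = [x0] + [x0 * math.exp(-P * n ** A) for n in range(1, 51)]
A_fit, P_fit = fit_cites_curve(synthetic)
print(A_fit, P_fit)
```

On data generated from the model, the fit recovers the slope and P that were put in; on a real citation list, the returned slope is the quantity the slides compare (0.5 versus 0.4).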

The generic cites curve

$\log(\log(x_0/x_n)) \sim \log(P) + A \log(n)$, with the slope $A$ in place of 0.5

But why?

We need a statistical model

Statistical Model

● log(X/X0) is normally distributed

– σ = µ = λ ∼ 0.2

● Trials X ranked in order look like cite counts!

● Log-log slope A depends on number of trials N

Some Related Work!

● Others report that the citation count distribution is

– Log-normal

– But investigate across whole fields

• Normalization WRT average across field

– Radicchi et al., 2008

• Universality of Citation Distributions

● Single researcher's work defines own “field”

– We normalize WRT max cite, not average

– But same underlying statistical model fits well

Understand

● Individual papers have worth distributed as

– Poisson/normal σ = µ ∼ 0.2 cites/paper

● Cite count depends exponentially on worth

– Being cited earns more citations

Predictions

● $x_n \sim x_0 e^{-P\sqrt{n}}$

– good estimate for P: $\sqrt{2x_0/S}$, where S = total number of citations

– at ~150 cited papers, $\sqrt{n} \to n^{0.4}$ (i.e. A ≈ 0.4)

● $\log(10/x_0)/\log(20/x_0) \sim (i_{10}/i_{20})^A$

– $i_k$ is the number of papers cited at least k times

● I have $i_{10}/i_{20}$ = 1.98, predicted 2.08

● Alan Turing has 1.44, predicted 1.32
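The two predictions on this slide can be written out as small functions. This is a sketch, with hypothetical function names. The estimate for P follows from approximating the total S by the integral of $x_0 e^{-P\sqrt{n}}$ over n from 0 to infinity, which equals $2x_0/P^2$. The ratio formula is solved for $i_{10}/i_{20}$ by raising the log ratio to the power 1/A. As an assumption, x0 is taken here to be the top raw count from the earlier slides (8093 and 108), with A = 0.4 and A = 0.5 respectively; the slides' own predictions presumably use fitted parameters, so the numbers need not agree exactly.

```python
import math

def estimate_P(x0, total_cites):
    """P ~ sqrt(2*x0/S), from S ~ integral of x0*exp(-P*sqrt(n)) dn = 2*x0/P**2."""
    return math.sqrt(2 * x0 / total_cites)

def predicted_ratio(x0, A, k1=10, k2=20):
    """Predicted i_k1 / i_k2 from log(x0/k1)/log(x0/k2) = (i_k1/i_k2)**A."""
    return (math.log(x0 / k1) / math.log(x0 / k2)) ** (1 / A)

# Assumption: x0 is the top citation count listed on the earlier slides.
turing = predicted_ratio(8093, 0.4)
breuer = predicted_ratio(108, 0.5)
print(round(turing, 2), round(breuer, 2))
```

Under these assumptions the Turing ratio comes out at about 1.31 and the Breuer ratio at about 1.99, close to the predicted values quoted on the slide.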

More Predictions

● h-index $h \sim i_h$

– Number h of papers cited at least h times

● Predict $(i_{10}/h)^A = \log(x_0/10)/\log(x_0/h)$

– For me, solves for h=16.67, have h=16
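The h-index prediction has no closed form in h, so it has to be solved numerically. One simple option is bisection on the logarithm of both sides, sketched below. The function name `solve_h` and the input i10 = 33 are hypothetical (the slides do not give the i10 value used), so the printed result is not expected to reproduce h = 16.67.

```python
import math

def solve_h(x0, i10, A, iterations=200):
    """Solve (i10/h)**A = log(x0/10)/log(x0/h) for h by bisection.

    Requires x0 > 10, i10 > 0 and A > 0. On 0 < h < x0 the function g
    below is strictly decreasing in h, so it has exactly one root.
    """
    if x0 <= 10 or i10 <= 0 or A <= 0:
        raise ValueError("need x0 > 10, i10 > 0 and A > 0")
    target = math.log(math.log(x0 / 10))

    def g(h):
        # Log of the left-hand side minus log of the right-hand side.
        return A * (math.log(i10) - math.log(h)) + math.log(math.log(x0 / h)) - target

    lo, hi = 1e-9, x0 * (1 - 1e-9)
    for _ in range(iterations):
        mid = (lo + hi) / 2
        if g(mid) > 0:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2

# Hypothetical inputs: x0 = 108 (top count on the earlier slide), i10 = 33, A = 0.5.
h = solve_h(108, 33, 0.5)
print(round(h, 2))
```

A quick sanity check on the formula: when i10 = 10 the equation is satisfied by h = 10, and the solver returns that value.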

Conclusion

● Three parameters X0, A, P

– Determine graph of cites against rank

– For most individuals, A=0.5

• Drops with greater number N of publications

● Predictions based on the statistical curve …

– Relate i10, h, etc. measures

● Independently of the parameters (except A)

● Everybody's #cite/rank graph is “the same”
