Presenter: Hsini Huang
Co-authors: Li Tang and John P. Walsh
Georgia Institute of Technology
ESF-APE-INV 2nd Name Game workshop, Dec 9, 2010, Madrid, Spain
Slide 2
Authorship identification has been the Achilles' heel of
bibliometric analyses at the individual level, e.g. citation impact
analysis (Tang and Walsh, 2010). Raffo and Lhuillery (2009) also
warned that the reliability of statistical results on patenting
inventors depends heavily on the accuracy rate achieved by the
matching heuristic.
Slide 3
Why solve the John Smith problem differently?
Several reasons why name-matching is probably not a good idea:
- Cleaning typos in names (inventor, assignee, etc.) is a difficult task
- The matching criteria are often used as dependent variables too, e.g. co-authorship, knowledge flows and geographical spillover (Singh, 2004)
- Name plus affiliation plus address could be effective if inventors are not mobile
Slide 4
ASE Method: Key Concepts
- Cognitive map: a series of psychological transformations by which an individual acquires, codes, stores, recalls, and decodes information in a spatial/information environment
- Structural equivalence: in a single-relation network, two actors are structurally equivalent if they have identical ties to and from all the other actors
- Approximate Structural Equivalence (A.S.E.): actors within a structurally equivalent cluster are more similar to each other than to those outside the cluster
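To make the structural-equivalence definition concrete, here is a minimal Python sketch that checks whether two actors in a small directed network have identical ties to and from all the other actors. The adjacency matrix and the function name `structurally_equivalent` are hypothetical, invented for this illustration. Ties between the two actors themselves are ignored, which is one common convention and an assumption here.

```python
def structurally_equivalent(adj, a, b):
    """True if actors a and b have identical ties to and from every other actor."""
    others = [k for k in range(len(adj)) if k not in (a, b)]
    same_out = all(adj[a][k] == adj[b][k] for k in others)  # ties sent
    same_in = all(adj[k][a] == adj[k][b] for k in others)   # ties received
    return same_out and same_in

# Hypothetical 4-actor directed network: adj[i][j] == 1 means i has a tie to j.
adj = [
    [0, 0, 1, 0],
    [0, 0, 1, 0],
    [0, 0, 0, 1],
    [1, 1, 0, 0],
]
equiv_01 = structurally_equivalent(adj, 0, 1)  # actors 0 and 1 have the same ties
equiv_02 = structurally_equivalent(adj, 0, 2)  # actor 2 has a tie to 3, actor 0 does not
print(equiv_01, equiv_02)
```

Exact equivalence of this kind is rare in real citation data, which is why the approximate version (A.S.E.) clusters actors by similarity instead.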
Slide 5
ASE Method: Intuition
- The references in a publication or patent should reflect the cognitive map of the author or inventor
- If two documents share one or more references, they are more likely to be by the same creator
  --> This is especially true if they share a rare reference
- Therefore, ASE of reference networks should partition documents by creator, especially if we weight the matrix by how rare the references are and by how many references are in the documents
- Validated on publication data (70-80% accuracy) (Tang and Walsh, 2010)
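The weighting intuition above can be sketched in Python. This is not the scoring formula of Tang and Walsh (2010). It is an assumed stand-in in which each shared reference contributes the inverse of its popularity (so rare references count more), and the sum is divided by the geometric mean of the two reference-list lengths (so long lists count less). The function name `weighted_similarity` and the toy data are hypothetical.

```python
import math
from collections import Counter

def weighted_similarity(refs):
    """refs: dict mapping document id -> set of cited references.
    Returns {(doc_i, doc_j): score} for pairs sharing at least one reference."""
    # Popularity of a reference = number of documents that cite it.
    popularity = Counter(r for cited in refs.values() for r in cited)
    docs = sorted(refs)
    scores = {}
    for i, a in enumerate(docs):
        for b in docs[i + 1:]:
            shared = refs[a] & refs[b]
            if not shared:
                continue
            # Assumed weighting: rare references count more ...
            rarity = sum(1.0 / popularity[r] for r in shared)
            # ... and documents with long reference lists count less.
            scores[(a, b)] = rarity / math.sqrt(len(refs[a]) * len(refs[b]))
    return scores

# Hypothetical documents: "common" is cited by all four, "rare" by only two.
refs = {
    "P1": {"common", "rare"},
    "P2": {"common", "rare"},
    "P3": {"common", "x"},
    "P4": {"common", "y"},
}
scores = weighted_similarity(refs)
print(scores[("P1", "P2")], scores[("P1", "P3")])
```

In this toy data P1 and P2 share the rare reference and therefore score higher than P1 and P3, which share only the popular one.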
Slide 6
Source: Tang and Walsh (2010)
Slide 7
Mathematically, the score of similarity between authors is calculated as:
- w1 and w2 are two weights
  w1 = popularity of the cited references
  w2 = number of references in a patent document
- D[i, j] is the patent-reference matrix, defined as [inventor IDs x cited references]
Slide 8
- In the EPO, patent references are added by patent examiners. The aim of citation is to indicate the most technically relevant information with a minimum of references
- In the USPTO, inventors or applicants should provide a complete list of all prior art they are aware of
- Thus, USPTO data should more accurately reflect the cognitive maps of inventors
- H: The A.S.E. algorithm performs better on US patents than on EPO patents
  - In fact, it should perform poorly in the EPO case
Slide 9
- The gold-standard dataset: the French Benchmark Dataset (APE-INV project, Lissoni et al., 2009)
- Exp 1 & 2: EPO citations vs. USPTO citations
  - We retrieved reference data from PATSTAT
- Exp 3: A.S.E. vs. multi-stage matching method
  - Thanks to the open-access dataset provided by Lai and his colleagues (2009), "The careers and co-authorship networks of U.S. patent-holders, since 1975"
Slide 10
Slide 11
Inventors    Correct group  Predicted group  False group
John Smith   1              1                0
John Smith   1              2                False negative (misclassified as a singleton)
John Smith   1              1                0
Joan Smith   2              3                0
Joan Smith   2              1                False positive (over-clustering)
Joan Smith   2              3                0
Joe Smith    3              4                0

Accuracy rate: ((7 - 2) / 7) * 100 = 71%
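The accuracy rate on this slide can be reproduced with a short Python sketch. The error definition is an assumption made for illustration: a record counts as misclassified when its predicted group differs from the majority predicted group of its correct group. The function name `accuracy_rate` is hypothetical, and the input lists encode the seven records of the example.

```python
from collections import Counter, defaultdict

def accuracy_rate(correct, predicted):
    """Percentage of records whose predicted label equals the majority
    predicted label of their correct group (assumed error definition)."""
    by_group = defaultdict(list)
    for c, p in zip(correct, predicted):
        by_group[c].append(p)
    majority = {c: Counter(ps).most_common(1)[0][0] for c, ps in by_group.items()}
    errors = sum(1 for c, p in zip(correct, predicted) if p != majority[c])
    return 100.0 * (len(correct) - errors) / len(correct)

# The seven records of the example: correct group vs. predicted group.
correct = [1, 1, 1, 2, 2, 2, 3]
predicted = [1, 2, 1, 3, 1, 3, 4]
rate = accuracy_rate(correct, predicted)
print(round(rate))  # 71
```

This definition flags the one false negative and the one false positive in the example, giving (7 - 2) / 7. It would not detect two entire correct groups merged under a single predicted label, so a fuller evaluation would need an extra check for that case.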
Slide 12
Among all 1,850 patents in the French Benchmark dataset (incl. patents with no cites):
- Using EPO reference data, the A.S.E. method can reach 77% accuracy
- Using USPTO reference data, the A.S.E. method can reach 78% accuracy
Slide 13
Among all the patents with at least one patent reference:
- Using EPO reference data, the A.S.E. method can reach 79% accuracy (N=1051)
- Using USPTO reference data, the A.S.E. method can reach 82% accuracy (N=361)
Slide 14
Among the 361 US patents, 299 records were found in Lai, D'Amour and Fleming's inventor dataset:
- The A.S.E. method can reach 80% accuracy (on either EPO or USPTO data)
- The multistage name-matching method reaches 61% accuracy
Slide 15
Slide 16
- The findings do not completely support our hypothesis: the A.S.E. method performs only slightly better for the US patents than for the European patents
  - The French Benchmark dataset has many singletons
  - Did the EPO examiners do a very good job reviewing each inventor's prior works?
- The A.S.E. method reaches a higher accuracy rate than the more elaborate multi-stage method
- Thus, our method works, but perhaps not for the reasons we think. Company benchmark data should be applied to double-check this method in the future
Slide 17
Advantages:
1. Researchers using the A.S.E. method have less to worry about on the mobility issue, because the algorithm is insensitive to changes of address and/or affiliation. The only thing A.S.E. captures is the trajectory of the knowledge footprint
2. Less time-consuming and fewer computational resources. The A.S.E. method requires only a few pieces of information, i.e. patent no., patent references and the popularity of the cited references
3. A.S.E. does not use affiliation or co-inventors in the disambiguation, so these can be used to track mobility or collaboration
Slide 18
Negatives:
1. The A.S.E. method can only be applied if the inventor's patent has at least one linkage with the rest of his patents. Patents with no references will be treated as singletons automatically
2. EPO examiners cite fewer references. Around 50% of the EPO patents in this study are singletons (vs. 5% in the USPTO)
   - In this experiment, even including these, the result is still nearly 80% accuracy, since many are in fact singletons (using the French scientist data)
Slide 19
Limitations:
1. The A.S.E. method may not be able to relate inventors if someone radically changes project from one technical field to another (although if they shift over time, the method will capture this with a transitivity rule)
2. Although the A.S.E. method requires fewer parameters in the algorithm, it might be hard to apply to an X million by X million matrix. Some level of simple classification could help
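The transitivity rule in point 1 can be sketched with a union-find structure: if patent A shares a reference with B, and B shares one with C, all three land in one cluster even when A and C share nothing directly. This is a minimal sketch, not the authors' implementation. The linking criterion (any shared reference), the function name `cluster_by_shared_references` and the toy data are assumptions.

```python
def cluster_by_shared_references(refs):
    """refs: dict mapping patent id -> set of cited references.
    Returns a sorted list of clusters, each a sorted list of patent ids."""
    parent = {p: p for p in refs}

    def find(p):
        # Follow parent links to the cluster root, shortening the path as we go.
        while parent[p] != p:
            parent[p] = parent[parent[p]]
            p = parent[p]
        return p

    first_citer = {}  # reference -> first patent seen citing it
    for patent, cited in refs.items():
        for r in cited:
            if r in first_citer:
                # Assumed rule: any shared reference links two patents.
                parent[find(patent)] = find(first_citer[r])
            else:
                first_citer[r] = patent

    clusters = {}
    for p in refs:
        clusters.setdefault(find(p), []).append(p)
    return sorted(sorted(c) for c in clusters.values())

# Hypothetical patents: A-B share r2, B-C share r3, D is isolated, E cites nothing.
refs = {
    "A": {"r1", "r2"},
    "B": {"r2", "r3"},
    "C": {"r3", "r4"},
    "D": {"r9"},
    "E": set(),
}
clusters = cluster_by_shared_references(refs)
print(clusters)  # [['A', 'B', 'C'], ['D'], ['E']]
```

Indexing by reference avoids building the full patent-by-patent matrix, which is relevant to point 2. Patents without references stay singletons, as in the method described here.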
Slide 20
Thanks for your attention. Comments or suggestions?