
Presenter: Hsini Huang
Co-authors: Li Tang and John P. Walsh
Georgia Institute of Technology
ESF-APE-INV 2nd "Name Game" workshop, Dec 9, 2010, Madrid, Spain

Authorship identification has been the Achilles' heel of bibliometric analyses at the individual level, e.g. citation impact analysis (Tang and Walsh, 2010). Raffo and Lhuillery (2009) also warned that the reliability of statistical results regarding patenting inventors depends heavily on the accuracy rate achieved by the matching heuristic.

Why solve the John Smith problem differently?
Several reasons why name-matching is probably not a good idea:
- Cleaning typos in names (inventor, assignee, etc.) is a difficult task
- The matching criteria are often also used as dependent variables, e.g. co-authorship, knowledge flows and geographical spillover (Singh, 2004)
- Name plus affiliation plus address could be effective, but only if inventors are not mobile

ASE Method: Key Concepts
- Cognitive map: a series of psychological transformations by which an individual acquires, codes, stores, recalls, and decodes information in a spatial/information environment
- Structural equivalence: in a single-relation network, two actors are structurally equivalent if they have identical ties to and from all the other actors
- Approximate Structural Equivalence (A.S.E.): actors within a structurally equivalent cluster are more similar to each other than to those outside the cluster
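The definition of structural equivalence can be checked directly on a small network. The sketch below is only an illustration: it assumes a directed, single-relation network stored as a set of (source, target) edges, and the function name and the toy patent/reference labels are hypothetical. Ties between the two actors being compared are ignored, which is one common convention.

```python
def structurally_equivalent(edges, a, b):
    """True if a and b have identical ties to and from all other actors."""
    # Every actor in the network except the two being compared.
    others = {node for edge in edges for node in edge} - {a, b}
    out_a = {t for s, t in edges if s == a and t in others}
    out_b = {t for s, t in edges if s == b and t in others}
    in_a = {s for s, t in edges if t == a and s in others}
    in_b = {s for s, t in edges if t == b and s in others}
    return out_a == out_b and in_a == in_b


# Hypothetical toy network: patents p1..p3 citing references r1, r2.
edges = {("p1", "r1"), ("p1", "r2"), ("p2", "r1"), ("p2", "r2"), ("p3", "r2")}

print(structurally_equivalent(edges, "p1", "p2"))  # True: same cited references
print(structurally_equivalent(edges, "p1", "p3"))  # False: p3 does not cite r1
```

Exact equivalence is rare in real citation data, which is why the approximate version groups actors by similarity instead of requiring identical ties.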

ASE Method: Intuition
- The references in a publication or patent should reflect the cognitive map of the author or inventor
- If two documents share one or more references, they are more likely to be by the same creator
  --> This is especially true if they share a rare reference
- Therefore, ASE of reference networks should partition documents by creator, especially if we weight the matrix by how rare the references are and by how many references are in the documents
- Validated on publication data (70-80% accuracy) (Tang and Walsh, 2010)

Source: Tang and Walsh (2010)

Mathematically, the score of similarity between authors is calculated as:
[equation image]
- w1 and w2 are two weights
  w1 = Popularity of the cited references
  w2 = Number of references in a patent document
- D[i, j] is the patent-reference matrix defined as [inventor IDs x cited references]
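The exact formula is not reproduced in this text, so the following is only a hedged sketch of how a rarity-weighted similarity between two patents could be computed from the quantities named above. Both weight definitions are assumptions made for illustration, not the authors' formula: w1 is taken as the inverse of the number of documents citing a reference (rare references count more), and w2 as a normalization by the lengths of the two reference lists. The function name and the toy data are hypothetical.

```python
import math


def similarity(refs_a, refs_b, citing_counts):
    """Rarity-weighted overlap of two reference sets, normalized by list length."""
    shared = refs_a & refs_b
    if not shared:
        return 0.0
    # Assumed w1: a reference cited by few documents contributes more.
    w1_sum = sum(1.0 / citing_counts[r] for r in shared)
    # Assumed w2: damp the score for documents with long reference lists.
    w2 = 1.0 / math.sqrt(len(refs_a) * len(refs_b))
    return w1_sum * w2


# Hypothetical patent -> cited-references data (rows of the matrix D).
docs = {
    "P1": {"R1", "R2"},
    "P2": {"R1", "R3"},
    "P3": {"R3", "R4"},
}

# Popularity of each reference: how many documents cite it.
citing_counts = {}
for refs in docs.values():
    for r in refs:
        citing_counts[r] = citing_counts.get(r, 0) + 1

print(similarity(docs["P1"], docs["P2"], citing_counts))  # 0.25
print(similarity(docs["P1"], docs["P3"], citing_counts))  # 0.0
```

Under these assumed weights, two patents that share a reference cited by nobody else score higher than two patents that share a widely cited one, which matches the intuition about rare references.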

- In the EPO, patent references are added by patent examiners. The purpose of citation is to indicate the most technically relevant information with a minimum of references
- In the USPTO, inventors or applicants should provide a complete list of all prior art they are aware of
- Thus, USPTO data should more accurately reflect the cognitive maps of inventors
H: The A.S.E. algorithm performs better on US patents than on EPO patents
- In fact, it should perform poorly in the EPO case

- The gold-standard dataset: the French Benchmark Dataset (APE-INV project, Lissoni et al., 2009)
- Exp1&2: EPO citation vs. USPTO citation
  We retrieved reference data from PATSTAT
- Exp3: A.S.E. vs. multi-stage matching method
  Thanks to the open-access dataset provided by Lai and his colleagues (2009), "The careers and co-authorship networks of U.S. patent-holders, since 1975"

Inventors     Correct group   Predicted group   False group
John Smith    1               1                 0
              1               2                 False negative (misclassified as a singleton)
John Smith    1               1                 0
Joan Smith    2               3                 0
              2               1                 False positive (over-clustering)
Joan Smith    2               3                 0
Joe Smith     3               4                 0

Accuracy rate: ((7 - 2) / 7) * 100 = 71%
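The accuracy rate on this slide is the share of records that carry no error label. As a small sketch, the rows below are transcribed from the table as (correct group, predicted group, error label); the function name is hypothetical.

```python
# Rows from the table: (correct group, predicted group, error label or None).
records = [
    (1, 1, None),
    (1, 2, "false negative"),
    (1, 1, None),
    (2, 3, None),
    (2, 1, "false positive"),
    (2, 3, None),
    (3, 4, None),
]


def accuracy_rate(rows):
    """Percentage of records without an error label."""
    errors = sum(1 for _, _, label in rows if label is not None)
    return (len(rows) - errors) / len(rows) * 100


print(f"{accuracy_rate(records):.0f}%")  # 71%
```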

Among all the 1850 patents in the French Benchmark dataset (incl. patents with no cites):
- Using EPO reference data, the A.S.E. method can reach 77% accuracy
- Using USPTO reference data, the A.S.E. method can reach 78% accuracy

Among all the patents with at least one patent reference:
- Using EPO reference data, the A.S.E. method can reach 79% accuracy (N=1051)
- Using USPTO reference data, the A.S.E. method can reach 82% accuracy (N=361)

Among the 361 US patents, 299 records were found in Lai, D'Amour and Fleming's inventor dataset:
- The A.S.E. method can reach 80% accuracy (on either EPO or USPTO data)
- The multi-stage name-matching method reaches 61% accuracy

- The findings do not completely support our hypothesis: the A.S.E. method performs only slightly better for the US patents than for the European patents
  - The French Benchmark dataset has many singletons
  - Did the EPO examiners do a very good job reviewing each inventor's prior work?
- The A.S.E. method reaches a higher accuracy rate than the more elaborate multi-stage method
- Thus, our method works, but perhaps not for the reasons we think. Company benchmark data should be used to double-check this method in the future

Advantages:
1. Researchers using the A.S.E. method need to worry less about the mobility issue, because the algorithm is insensitive to changes of address and/or affiliation. The only thing A.S.E. captures is the trajectory of the knowledge footprint
2. Less time-consuming and fewer computational resources. The A.S.E. method requires only a few pieces of information, i.e. patent no., patent references and the popularity of the cited references
3. A.S.E. does not use affiliation or co-inventors in the disambiguation, so these can be used to track mobility or collaboration

Negatives:
1. The A.S.E. method can only be applied if the inventor's patent has at least one linkage with the inventor's other patents. Patents with no references will be treated as singletons automatically
2. EPO examiners cite fewer references. Around 50% of the EPO patents in this study are singletons (vs. 5% in the USPTO)
   - In this experiment, even including these, the result still yields nearly 80% accuracy, since many are in fact singletons in the French scientist data

Limitations:
1. The A.S.E. method may not be able to relate an inventor's patents if he or she radically changes projects from one technical field to another (although if the shift is gradual over time, the method will capture this with a transitivity rule)
2. Although the A.S.E. method requires fewer parameters in the algorithm, it might be hard to apply to an X million by X million matrix. Some level of simple classification could help.
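The transitivity rule in limitation 1 can be illustrated with a union-find sketch: if patent A links to B and B links to C, all three end up in one cluster even when A and C share no references. This is an assumed illustration rather than the authors' implementation; the function name, the similarity scores and the 0.2 threshold are hypothetical.

```python
def cluster(items, scored_pairs, threshold):
    """Group items transitively: link any pair whose score meets the threshold."""
    parent = {x: x for x in items}

    def find(x):
        # Follow parent pointers to the cluster root, compressing the path.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for a, b, score in scored_pairs:
        if score >= threshold:
            parent[find(a)] = find(b)  # merge the two clusters

    groups = {}
    for x in items:
        groups.setdefault(find(x), []).append(x)
    return sorted(sorted(g) for g in groups.values())


# Hypothetical scores: P1 and P3 share nothing, but both overlap with P2.
patents = ["P1", "P2", "P3", "P4"]
pairs = [
    ("P1", "P2", 0.4),
    ("P2", "P3", 0.3),
    ("P1", "P3", 0.0),
    ("P3", "P4", 0.05),
]

print(cluster(patents, pairs, threshold=0.2))  # [['P1', 'P2', 'P3'], ['P4']]
```

A patent with no score above the threshold, such as P4 here, stays a singleton, which mirrors how patents without shared references are treated.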

Thanks for your attention. Comments or suggestions?