Author paper midterm
![Page 1: Author paper midterm](https://reader036.fdocuments.in/reader036/viewer/2022081603/558fcd2e1a28ab0e398b47e6/html5/thumbnails/1.jpg)
Author-Paper Identification Problem
Team:
Karthik Reddy Vakati
Nachammai C
Pooja Mishra
Guided by Prof. Duc Tran
![Page 2: Author paper midterm](https://reader036.fdocuments.in/reader036/viewer/2022081603/558fcd2e1a28ab0e398b47e6/html5/thumbnails/2.jpg)
Problem Statement
• Determine the correct author, from the author dataset, for a particular paper.
• Ambiguity in author names can cause a paper to be assigned to the wrong author, which leads to noisy author profiles.
• This KDD Cup task challenges participants to determine which papers in an author profile were truly written by a given author.
![Page 3: Author paper midterm](https://reader036.fdocuments.in/reader036/viewer/2022081603/558fcd2e1a28ab0e398b47e6/html5/thumbnails/3.jpg)
Type of data
Data provided by the KDD challenge is in CSV format.
• Paper - (Id, Title, Year, ConferenceId, JournalId, Keywords)
• Author - (Id, Name, Affiliation)
• Paper-Author - (PaperId, AuthorId, Name, Affiliation)
• Conference - (Id, ShortName, FullName, HomePage)
• Journal - (Id, ShortName, FullName, HomePage)
• Train - (AuthorId, ConfirmedPaperIds, DeletedPaperIds)
• Test - (AuthorId, PaperIds)
• Validation - (AuthorId, PaperIds, Usage)
![Page 4: Author paper midterm](https://reader036.fdocuments.in/reader036/viewer/2022081603/558fcd2e1a28ab0e398b47e6/html5/thumbnails/4.jpg)
Data Points
• The data points include all papers written by an author and the author's affiliation (university, technical society, groups).
  Paper-Author - (PaperId, AuthorId, Name, Affiliation)
• The metadata includes the journals an author has published in and the conferences the author has attended.
  Paper - (Id, Title, Year, ConferenceId, JournalId, Keywords)
  Author - (Id, Name, Affiliation)
  Conference - (Id, ShortName, FullName, HomePage)
  Journal - (Id, ShortName, FullName, HomePage)
![Page 5: Author paper midterm](https://reader036.fdocuments.in/reader036/viewer/2022081603/558fcd2e1a28ab0e398b47e6/html5/thumbnails/5.jpg)
Issues with data
• The CSV files needed cleaning.
• A few records had attributes spilled over 3 rows.
• Some rows had more attributes than the required number.
• Special characters caused issues.
• We wrote a Perl script to clean and format the data.
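The cleaning steps listed on this slide can be sketched roughly as follows. This is not the team's Perl script; it is a minimal R illustration that assumes fields contain no embedded commas or quotes. The helper names `count_fields` and `merge_spilled_rows` and the sample rows are hypothetical.

```r
# Count comma-separated fields in each string (commas + 1).
count_fields <- function(s) {
  lengths(regmatches(s, gregexpr(",", s, fixed = TRUE))) + 1L
}

# Hypothetical helper: replace special characters, then join physical lines
# until a record has at least the expected number of fields.
merge_spilled_rows <- function(lines, n_expected) {
  # Anything outside printable ASCII becomes a space.
  lines <- gsub("[^\\x20-\\x7E]", " ", lines, perl = TRUE)
  records <- character(0)
  buffer <- ""
  for (line in lines) {
    buffer <- if (nzchar(buffer)) paste(buffer, line) else line
    if (count_fields(buffer) >= n_expected) {
      records <- c(records, buffer)
      buffer <- ""
    }
  }
  # Keep a trailing incomplete record so it can be inspected.
  if (nzchar(buffer)) records <- c(records, buffer)
  data.frame(record = records,
             n_found = count_fields(records),
             stringsAsFactors = FALSE)
}

# Made-up sample: the second record is spilled over 3 lines,
# and the last record has one attribute too many.
raw_lines <- c("1,Title A,2001,5,0,kw",
               "2,Title", "B,2002", ",7,0,kw2",
               "3,T,2003,1,2,k,extra")
cleaned <- merge_spilled_rows(raw_lines, n_expected = 6)
too_many <- cleaned$n_found > 6
```

Rows flagged in `too_many` would still need manual or rule-based repair; the sketch only detects them.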
![Page 6: Author paper midterm](https://reader036.fdocuments.in/reader036/viewer/2022081603/558fcd2e1a28ab0e398b47e6/html5/thumbnails/6.jpg)
Issues with data-I
![Page 7: Author paper midterm](https://reader036.fdocuments.in/reader036/viewer/2022081603/558fcd2e1a28ab0e398b47e6/html5/thumbnails/7.jpg)
Issues with data-II
![Page 8: Author paper midterm](https://reader036.fdocuments.in/reader036/viewer/2022081603/558fcd2e1a28ab0e398b47e6/html5/thumbnails/8.jpg)
Predictions & Intuitions
Prediction: Given a paper and an author, one should be able to identify whether the given paper was written by the author.
Intuition: We initially identified this as a clustering problem. We chose clustering because the set of papers written by one author can be grouped together; then, for a given paper and author, we can check whether the paper falls in the author's cluster.
The features PaperId, AuthorId, PaperTitle and AuthorName play a significant role in the prediction.
![Page 9: Author paper midterm](https://reader036.fdocuments.in/reader036/viewer/2022081603/558fcd2e1a28ab0e398b47e6/html5/thumbnails/9.jpg)
Feature selection
We used the following features from the Train dataset while building the model:
• ConfirmedPaperIds
• DeletedPaperIds
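One way to use these two columns is to expand each Train row into labeled (author, paper) pairs. The sketch below assumes the ID lists are whitespace-separated strings, which the slides do not state; the function names, the `Written` column and the sample values are hypothetical.

```r
# Made-up sample in the Train layout (AuthorId, ConfirmedPaperIds, DeletedPaperIds).
train <- data.frame(
  AuthorId = c(10L, 20L),
  ConfirmedPaperIds = c("101 102", "201"),
  DeletedPaperIds = c("103", "202 203"),
  stringsAsFactors = FALSE
)

# Split a whitespace-separated ID string into an integer vector.
split_ids <- function(s) {
  ids <- strsplit(trimws(s), "[[:space:]]+")[[1]]
  as.integer(ids[nzchar(ids)])
}

# One output row per (author, paper) pair; Written is TRUE for confirmed papers.
expand_train <- function(train) {
  rows <- lapply(seq_len(nrow(train)), function(i) {
    confirmed <- split_ids(train$ConfirmedPaperIds[i])
    deleted <- split_ids(train$DeletedPaperIds[i])
    data.frame(
      AuthorId = rep(train$AuthorId[i], length(confirmed) + length(deleted)),
      PaperId = c(confirmed, deleted),
      Written = c(rep(TRUE, length(confirmed)), rep(FALSE, length(deleted)))
    )
  })
  do.call(rbind, rows)
}

labeled <- expand_train(train)
```

Each row of `labeled` can then be joined with the Paper and Author tables to attach further attributes.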
![Page 10: Author paper midterm](https://reader036.fdocuments.in/reader036/viewer/2022081603/558fcd2e1a28ab0e398b47e6/html5/thumbnails/10.jpg)
Tools Used & Model Trained
Tools used:
• Weka
• R
• Apache Mahout
Models trained:
• Simple K-Means
• J-48
• ZeroR
![Page 11: Author paper midterm](https://reader036.fdocuments.in/reader036/viewer/2022081603/558fcd2e1a28ab0e398b47e6/html5/thumbnails/11.jpg)
K-means clustering using Weka
Training the data
![Page 12: Author paper midterm](https://reader036.fdocuments.in/reader036/viewer/2022081603/558fcd2e1a28ab0e398b47e6/html5/thumbnails/12.jpg)
Visualization of k-means clustering result
![Page 13: Author paper midterm](https://reader036.fdocuments.in/reader036/viewer/2022081603/558fcd2e1a28ab0e398b47e6/html5/thumbnails/13.jpg)
Simple K-means clustering using R
![Page 14: Author paper midterm](https://reader036.fdocuments.in/reader036/viewer/2022081603/558fcd2e1a28ab0e398b47e6/html5/thumbnails/14.jpg)
Error in R for Clustering
> y=read.table("Paper_fixed.csv",header=TRUE,sep=',')
> y[1:10,]
> km3 <- kmeans(x,3)
Error in do_one(nmeth) : NA/NaN/Inf in foreign function call (arg 1)
In addition: Warning message:
In kmeans(x, 3) : NAs introduced by coercion
![Page 15: Author paper midterm](https://reader036.fdocuments.in/reader036/viewer/2022081603/558fcd2e1a28ab0e398b47e6/html5/thumbnails/15.jpg)
Conclusion
Why does clustering not work for this problem?
• Handling a mixed set of attributes is an issue in R.
• Simple K-means clustering works by calculating distances from centroids, and thus needs numeric attributes. Hence clustering is not the best approach for our problem.
• To overcome this, we are trying to convert the data into numeric integer values so that numeric distance measures can be applied.
• However, this problem looks more like a classification problem: classifying whether a paper was written by an author.
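As an illustration of the numeric-attribute requirement, the sketch below keeps only numeric columns and drops incomplete rows before calling `kmeans`, which avoids the NA/NaN/Inf error. The data frame is made up, and dropping the `Id` column and scaling the features are assumptions, not steps taken from the slides.

```r
set.seed(42)

# Made-up sample with a text column and a missing value.
papers <- data.frame(
  Id = 1:6,
  Title = c("a", "b", "c", "d", "e", "f"),
  Year = c(1999, 2001, NA, 2010, 2011, 2012),
  ConferenceId = c(3, 3, 4, 40, 41, 40),
  stringsAsFactors = FALSE
)

# Keep numeric columns only; text columns would be coerced to NA by kmeans.
numeric_cols <- sapply(papers, is.numeric)
x <- papers[, numeric_cols]

# Drop rows with missing values, then drop the identifier (not a feature).
x <- x[complete.cases(x), ]
x$Id <- NULL

# Scale so that no single column dominates the Euclidean distance.
km <- kmeans(scale(x), centers = 2)
```

Note that treating identifiers such as ConferenceId as numbers gives distances with no real meaning, which is consistent with the slide's point that classification is the better fit.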
![Page 16: Author paper midterm](https://reader036.fdocuments.in/reader036/viewer/2022081603/558fcd2e1a28ab0e398b47e6/html5/thumbnails/16.jpg)
Moving on to classification algorithms
ZeroR
Tree J-48
Naïve Bayes
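ZeroR serves as a baseline: it ignores every attribute and always predicts the most frequent class in the training data. A minimal R sketch of that idea follows; it is not Weka's implementation, and the function names and labels are hypothetical.

```r
# Return the most frequent class label in the training labels.
zero_r_fit <- function(labels) {
  counts <- table(labels)
  names(counts)[which.max(counts)]
}

# Predict the same majority label for every test instance.
zero_r_predict <- function(majority, n) {
  rep(majority, n)
}

# Made-up labels.
train_labels <- c("written", "written", "not_written", "written")
test_labels <- c("written", "not_written", "not_written")

majority <- zero_r_fit(train_labels)
predictions <- zero_r_predict(majority, length(test_labels))
accuracy <- mean(predictions == test_labels)
```

Any classifier worth keeping, such as J-48 or Naïve Bayes, should beat this baseline accuracy.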
![Page 17: Author paper midterm](https://reader036.fdocuments.in/reader036/viewer/2022081603/558fcd2e1a28ab0e398b47e6/html5/thumbnails/17.jpg)
Results using Tree-J48 algorithm
![Page 18: Author paper midterm](https://reader036.fdocuments.in/reader036/viewer/2022081603/558fcd2e1a28ab0e398b47e6/html5/thumbnails/18.jpg)
Results using ZeroR algorithm
![Page 19: Author paper midterm](https://reader036.fdocuments.in/reader036/viewer/2022081603/558fcd2e1a28ab0e398b47e6/html5/thumbnails/19.jpg)
Visualization of ZeroR results for Precision
![Page 20: Author paper midterm](https://reader036.fdocuments.in/reader036/viewer/2022081603/558fcd2e1a28ab0e398b47e6/html5/thumbnails/20.jpg)
Next Steps
We are working on feature engineering (feature transformation): taking the Author name attribute and transforming it into a common format for all author names.
Once the feature engineering is done, we will work principally on Naïve Bayes and other classification algorithms that we think will suit our problem,
and then fine-tune the model.
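The slides do not say which common format is planned for author names. One possible sketch, given here only as an assumption, lowercases the name, strips punctuation and sorts the name tokens so that "Last, First" and "First Last" map to the same string. The function name and sample names are hypothetical.

```r
# Hypothetical normalizer: lowercase, remove punctuation, sort name tokens.
normalize_author_name <- function(names) {
  cleaned <- tolower(names)
  cleaned <- gsub("[[:punct:]]", " ", cleaned)
  tokens <- strsplit(trimws(cleaned), "[[:space:]]+")
  vapply(tokens,
         function(tok) paste(sort(tok[nzchar(tok)]), collapse = " "),
         character(1))
}

normalized <- normalize_author_name(c("Smith, Jane Q.", "JANE Q SMITH"))
```

Sorting tokens discards name order, so it can merge distinct authors whose names share the same tokens; whether that trade-off is acceptable depends on the data.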
![Page 21: Author paper midterm](https://reader036.fdocuments.in/reader036/viewer/2022081603/558fcd2e1a28ab0e398b47e6/html5/thumbnails/21.jpg)
Thank you!!