CS315A : Project Report Slide Selector and Merger Guide ...aayushmudgal.github.io/cs315Report.pdfIt...

7
CS315A : Project Report Slide Selector and Merger Guide: Prof. T.V. Prabhakar Aayush Mudgal [12008], Anusha Chowdhury [12148], Ayush Shekari [12185] Indian Institute of Technology, Kanpur 2014-15, II nd Semester Abstract. The project is aimed at implementing a web-based personal presentation man- ager. The web-based interface allows the user to create custom presentations from existing presentations, stored by him or others in the database. Instead of browsing through huge presentations, the users can directly search for slides related to a particular topic by putting in the keywords in the search box. 1 1 Introduction Often, slides for classroom lectures, require the need to merge already prepared slides from previous work of self or others. It is generally required to mix slides over multiple presentations, to create a good presentation. Slide selector is a web-based utility service that tries to aid this manual process. It allows searching, mixing and merging of desired slides from different presentations, and PDF documents. The user can upload any number of presentations into the database. Once the presentations are validated for insertion into the database, the user can search for these slides by specifying the choice of words he needs. Fig. 1. Overview of the Model 2 Document Insertion The user has to upload the documents (presentation / PDF documents) over which he desires to create new presentations. It might happen that the user uploads a file that is already been uploaded to the database, such type of duplicities in the 1 The Web Service can be accessed at : http://172.27.22.218/doodle/

Transcript of CS315A : Project Report Slide Selector and Merger Guide ...aayushmudgal.github.io/cs315Report.pdfIt...

CS315A : Project Report

Slide Selector and Merger

Guide: Prof. T.V. Prabhakar

Aayush Mudgal [12008], Anusha Chowdhury [12148], Ayush Shekari [12185]

Indian Institute of Technology, Kanpur2014-15, IInd Semester

Abstract. The project is aimed at implementing a web-based personal presentation man-ager. The web-based interface allows the user to create custom presentations from existingpresentations, stored by him or others in the database. Instead of browsing through hugepresentations, the users can directly search for slides related to a particular topic by puttingin the keywords in the search box.1

1 Introduction

Often, slides for classroom lectures, require the need to merge already preparedslides from previous work of self or others. It is generally required to mix slides overmultiple presentations, to create a good presentation. Slide selector is a web-basedutility service that tries to aid this manual process. It allows searching, mixing andmerging of desired slides from different presentations, and PDF documents. The usercan upload any number of presentations into the database. Once the presentationsare validated for insertion into the database, the user can search for these slides byspecifying the choice of words he needs.

Fig. 1. Overview of the Model

2 Document Insertion

The user has to upload the documents (presentation / PDF documents) over whichhe desires to create new presentations. It might happen that the user uploads afile that is already been uploaded to the database, such type of duplicities in the

1 The Web Service can be accessed at : http://172.27.22.218/doodle/

documents is resolved by comparing the SHA1 hash of the to be inserted documentwith respect to the ones already in the stored in the database. This information isstored as a table in the form of a 1-1 mapping of the document and its SHA1 hash.SHA1 hash is expected to work very well, since there is a very very low probabilityof 2−60 for a collision. (There has not been a single SHA1 clash observed till now) Itis expected that different files will give different SHA1 hash values, we use this ideato check for duplicates in the database. The following is the procedure that happenswhen a slide is uploaded to be inserted in the database.

Fig. 2. Document Insertion

The users uploads one or more files (presentation or PDF in a batch) throughhis account on the web server. If the uploaded file matches with some file in thedatabase, then such a file doesn’t any other further processing, as this file presentsnothing new. If the uploaded file is a new file, then it is split into individual slides asPDF documents. Duplicates for individual files is checked at this stage. A duplicateslide is not stored with the database. For each individual file, important features areextracted. Heading in a slide is generally the most important piece of text, in a doc-ument. It is mostly present at the top of the slide with a larger text-size. Headingfrom the slide are extracted by comparing text sizes of different text blocks. Ex-tracted information from each slide is stored in the database, for use in the future.The fact that users can upload either Microsoft Powerpoint presentations(.ppt or.pptx) and also beamer presentations(.pdf) makes our Slide Selector applicationavailable for use to a huge no. of customers (both Windows and Linux users). Mor-eiver, in one go he can upload multiple files (batch upload) which indeed reduceseffort.

3 Document Search

On his account page the user has an option to search for slides related to a topic.Currently we have kept a very eay-to-use interface. He only needs to enter thekeywords he is searching for. As soon as he clicks the search button, the slides whichcontain those keywords, appear as single page pdfs, in a scroll down view (no. ofcolumns is fixed but no. of rows is dynamic).

2

3.1 Deciding Priority

Suppose the user is searching with n keywords. Then there are some slides whichmay contain all n keywords, there are some which contain only k of those n keywords(k less than n). The higher the value of k, the more important that slide will be (thisseemed to be quite logical). So, we display the slides as per their priority order. Theslides with the highest no. of matches appear at the top, followed by the slides withlower number of matches. Also, the slides that have those keywords in its headingsare more important than those which have these keywords in their body. We use alot of other heuristics to get a satisfactory search over oyou are slide space.

3.2 Implementation Details

In our opinion the heading of a slide is the most important feature when one hasto do selection by content-based information retrieval. The next important featurewould be the text body of the slide, followed by images and meta-information.In our SQL database we stored the slides for each document with a unique SHA-1key which can act as a primary key in our relational database schema.Then we have three tables: slide vs heading, slide vs body and slide vs meta. Wesplit it up into three to avoid redundancy while storing information. Here one heuris-tic that works quite well is the font size. In general the heading has the largest fontsize in a slide followed by the text body which is slightly smaller. For the meta-information we did an experiment with a no. of slides and found that it was gen-erally around 0.7 × minsize+0.3 × maxsize. Also the meta-information generallyappears in all slides in a presentation. For example, in a presentation on databasethe name of the book (Elmasri,Navathe) appeared in each slide at the bottom, andin each slide the heading had distinctly bigger fontsize than the rest of the contents.Whenever the user gives a search string, we traverse the tables in the SQL databaseand store the no. of matches for the keywords in the content of the slides. Highestpriority is given to the heading so slide vs heading table is searched first followedby the other tables. Then the no. of matches for each slide is stored in a list, whichis then sorted in descending order to get the most related slide at the top.

Optimisation A single presentation can contain huge no. of slides in real life(maybe hundreds or thousands). So, if we store the whole content of each slide therewould be tremendous load on the database. Moreover the database is currently ona server, and in presence of too much load there are chances that the applicationwould become slow. So, we decided to do some optimizations before storing the data:

– Using the nltk package of python, we removed the stopwords (like a, and, the) sothat the useless part of the content is not stored. These words are of no interestto us during search as they cannot be the keywords of importance.

– Even after doing this, there would be still several words in a slide. There is atool called Rake in python which can be used to get only the important words(phrases) given a huge paragraph. When we apply it the size of the content perslide gets reasonably reduced and now it can be easily stored on the database.

3

4 Features in our Slide Selector Application

After the slides which were being searched for, are displayed one after another assingle page pdfs, the user gets the option to mark a page for download, or see thenext and previous slides.

– Mark - The users can mark the pdfs which he would like to be present in his cus-tomized presentation. As soon as he clicks Mark, the slide appears in thumbnailview at the top right corner.

– Previous and Next- Whenever a slide matches a certain no. of keywords duringsearch, it is possible that the user also wants to see the slide previous or next tothis slide, even though those did not contain the keywords of search, they meybe context-wise related. So, we provide the option to view the previous and nextslide for any displayed slide, and then he can mark either of those slides as wellif he wants.

– Download- After all the required slides have been marked, they all appear in athumbnail view. The user can still delete some slide if he thinks it is not necessary.Once he clicks Download, all the slides which he had chosen, will get downloadedas a merged pdf.

Searching more than one context Let us consider a real-life example. Supposethe user initially searches for the keyword ”Database”. Then some n slides will getdisplayed which contain this keyword (in heading or body,etc.) Now the user willselect some m slides out of the n matches in the thumbnail view for download. Afterthis suppose the user wants to search for the slides containing keyword ”Locking”and he finally wants all slides after both these searches together in a single pdf.So, our application supports this feature of keeping the thumbnail as it is, acrossmultiple searches. When the user has marked all the various slides he wants aftersome k searches, he can download all of them together as a single pdf.So, our application Slide Selector allows the user to get results of any no. of searcheshe wants together. The main idea was to support union and intersection of individualsearches.

Tools used

The Slide Selector application is hosted on an Apache server.•For creation of user account (login/register) we used HTML, PHP and styles weredone in CSS.• We found that the xml for the slides in a ppt or pptx presentation can be obtainedby just making it a .zip in apple macbook and then extracting. In other platformsthe same can be done by manually saving the zip and then executing unzip commandfrom terminal. However, instead of parsing xml, we explored and found out that thepdf format would provide us with easier ways to parse.• For conversion of uploaded presentation to pdf, we used unoconv command inLinux.•The data was stored on an SQL database (we used phpmyadmin for executing thequeries)• For extracting text from pdfs, we used the standard pdftotext command.

4

•For coding the implementations of search, we used Python. We used the standardnltk libraries for pruning the corpus and the Rake tool for further optimisation.•We used the object tag for the dynamic display of the single page pdfs after search.•We also explored the word2vec model which can be used to find synonyms for aword. The idea is to search with the keywords given by the user and also theirsynonyms so that all the slides related to that topic can be found.

5 Future Work

– Currently the application is hosted on a server and is provided as a web service.We are thinking of converting into a product so that users can just download itas an app and then use it as required. This would give faster results.

– Also the wordnet or Word2Vec model can be used for finding all similar slides,given the context. This would improve our search results. We would also tryto give a lot of advanced search features like search by Author, Date Added,University/Industrial Slide. We would improve our data-extraction methods sothat data patterns across slide in a presentation are also considered.

– We are also thinking to add a Tag Cloud correspoding to our search results,this would contribute in enriching the user experience. The user interface can beimproved further to drag and drop instead of click.

– We will also incorporate a feature where the users can change the order of theslides after they have marked a certain no. of slides for download.

We hope to continue our work during the next semester under the guidance ofProf. T.V. Prabhakar.

Acknowledgement

I take this opportunity to express my profound gratitude and deep regards to Prof.T.V. Prabhakar for his exemplary guidance, monitoring and constant encouragementthroughout the course of this project. The blessing, help and guidance given by himtime to time shall carry us a long way in the journey of life on which I am about toembark. I am also thankful to the course TA’s, specially Sumit Kalra and SaurabhSrivastava for their endless support and guidance in all times.

References

– H. Lee, K. Choi, J. Kim: A Design of Retrieval System for Presentation Docu-ments Using Content-Based Image Retrieval , 2011.

– Information Retrieval at http://en.wikipedia.org/wiki/Information_retrieval– An Introduction to Information Retrieval at http://nlp.stanford.edu/IR-book/

pdf/00front.pdf

– Command pdftotext at http://en.wikipedia.org/wiki/Pdftotext

5

Images

Fig. 3. Login Page of our Web Service

Fig. 4. User’s Person page

6

Fig. 5. Users File Handler

Fig. 6. Searching over the space of slides

7