Internship Report


CONTENTS

1. Acknowledgement

2. (a) Introduction (b) Internship Objectives (c) Personal Development Targets

3. Project Details

(a) Searching Algorithm for Memes (b) Prediction Model using SVM Algorithm

4. Other Tasks and Activities

5. Reflection

6. Conclusion


ACKNOWLEDGEMENT

The internship opportunity I had with Culture Machine was a great chance for learning and professional development. I therefore consider myself a very lucky individual to have been given the opportunity to be a part of it. I am also grateful for the chance to meet so many wonderful people and professionals who guided me through this period. For this opportunity, I would like to thank:

Amit Garde, Head of Engineering at Culture Machine, Pune, and my intern mentor. He always emphasised going into the depth of the field and guided me through my internship with advice and feedback despite his busy schedule.

Meghana Negi, research engineer and my internship coach. I want to thank her for giving me the opportunity to do my internship at the company. She helped me during my internship with feedback and tips on how to handle and approach situations, and she always had time to answer all my questions concerning my internship.

Furthermore, I would like to thank Arvind Hulgeri, Abhishek Kolipey and Jaju, who were really helpful and created a good environment to work in.

Besides my internship, I really enjoyed my stay in Pune. It was a great experience and I want to thank everybody for it.


RECOGNISING VISUAL MEMES

Introduction

This report is a short description of my one-and-a-half-month internship carried out in the data science team at Culture Machine.

Internship Objectives:

● To see what it is like to work in a professional environment
● To see if this kind of work is a possibility for my future career

Personal Development Targets:

I set personal development targets to practise, improve and develop during my internship.

● To enhance my communication skills
● To improve my knowledge in the field of Data Science

This report describes the activities that contributed to achieving a number of my stated goals, along with descriptions of the projects I undertook. Finally, I conclude on the internship experience with respect to my learning goals.


A. Searching Algorithm for Memes

1. Project Description:

This project's aim is to search for the memes of a particular instance in a dataset of memes and show all the images corresponding to it. It has the potential to determine and classify incoming streams of images from Facebook, YouTube, etc. into various categories of memes, and is hence an asset to Culture Machine as a data-oriented media company. As a data science intern, my role was to design a searching algorithm for finding memes of a particular instance, given a dataset of memes belonging to different categories/instances. Subsequently, I came up with solutions that address the problems assigned.

2. Steps

● Downloading a dataset containing meme IDs mapped to their URLs and names, along with their ratings, dates and data IDs.

● Defining a function that converts a URL to its corresponding image using urllib.

● Defining a matching function which takes the name of a meme as input and returns the set of matching memes in the following way (see the sketch after this list):

1. Converting all the memes to grayscale and resizing them to the same size, i.e. 100 x 110, keeping the average aspect ratio in mind to avoid distortion.


2. Detecting keypoints (features) and their descriptors using ORB, a binary descriptor which uses the Oriented FAST algorithm to detect keypoints and Rotated BRIEF to build their descriptors.

3. Taking a meme of a particular instance and comparing its descriptors with the descriptors of the other memes in the dataset using a brute-force matcher, which finds the Hamming distance between each pair of matched features.

4. Setting a threshold in order to classify a meme as a good match to the input meme (e.g. the Hamming distance between two matched features should be less than 5 units).
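Below is a minimal sketch of these steps in Python with OpenCV. The helper names (url_to_image, preprocess, is_good_match) and the minimum-match count are illustrative assumptions; only the 100 x 110 resize, the ORB plus brute-force Hamming matching and the distance threshold of 5 come from the steps above.

    # Minimal sketch of the matching pipeline (helper names are assumptions).
    import urllib.request
    import numpy as np
    import cv2

    def url_to_image(url):
        # Download an image from a URL and decode it with OpenCV.
        data = urllib.request.urlopen(url).read()
        return cv2.imdecode(np.frombuffer(data, np.uint8), cv2.IMREAD_COLOR)

    def preprocess(img):
        # Step 1: grayscale and resize every meme to the same 100 x 110 size.
        return cv2.resize(cv2.cvtColor(img, cv2.COLOR_BGR2GRAY), (100, 110))

    orb = cv2.ORB_create()                                 # step 2: ORB keypoints/descriptors
    bf = cv2.BFMatcher(cv2.NORM_HAMMING, crossCheck=True)  # step 3: brute-force Hamming matcher

    def is_good_match(primary_img, candidate_img, dist_threshold=5, min_matches=10):
        # Steps 3-4: match descriptors and keep only low-distance matches.
        # min_matches is an assumed cut-off, not a value from the report.
        _, des1 = orb.detectAndCompute(preprocess(primary_img), None)
        _, des2 = orb.detectAndCompute(preprocess(candidate_img), None)
        if des1 is None or des2 is None:
            return False
        matches = bf.match(des1, des2)
        good = [m for m in matches if m.distance < dist_threshold]
        return len(good) >= min_matches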

Set of 5 input memes -


When the matching function is called with 'Redditors Wife' (a meme name) as its parameter, it takes a single meme image (the primary image) of this instance from the primary dataset, knowing that this image belongs to Redditors Wife, and compares it with all the images in the input dataset.

Primary Image -
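In code, using the hypothetical helpers from the sketch in the step list above (url_to_image, is_good_match), this call might look roughly as follows; primary_urls and input_images are assumed placeholder names for the two datasets:

    # Hypothetical usage of the earlier sketch; dataset variables are placeholders.
    primary = url_to_image(primary_urls["Redditors Wife"])   # primary image for the instance
    matches = [img for img in input_images
               if is_good_match(primary, img)]               # all memes judged a good match
    print(len(matches), "memes of this instance found")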

3. Final Outcome:


● Total number of memes of that particular instance present in the dataset.

● Slideshow of memes of that instance present in the dataset.

The matching function returns the following memes as output:


Matching of keypoints is done by calculating the Hamming distance between the descriptors (features). The smaller the distance, the more similar the features; and the more similar features two images share, the more similar the images are.
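As a small illustration (with made-up descriptor values), the Hamming distance between two ORB descriptors, each stored by OpenCV as 32 bytes, can be computed by XORing the bytes and counting the differing bits:

    # Illustrative only: two made-up 32-byte (256-bit) binary descriptors.
    import numpy as np

    d1 = np.random.randint(0, 256, 32, dtype=np.uint8)
    d2 = np.random.randint(0, 256, 32, dtype=np.uint8)

    hamming = int(np.unpackbits(np.bitwise_xor(d1, d2)).sum())
    print(hamming)   # smaller value => more similar descriptors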


B. Prediction Model using SVM Algorithm

1. Project Description:

The project's aim is to predict the names of the memes from a dataset of images containing memes as well as non-memes. It has the capability to differentiate between memes and non-memes and also to further classify a meme into its category. As a data science intern, my role was to design a prediction model which takes as input a set of images containing memes as well as non-memes, applies an algorithm and returns as output the images with their categories. Subsequently, I came up with solutions that address the problems assigned.

2. Steps:

● Taking a dataset of images having an ImageID which leads to their URLs and their names.

● Assigning indexes to the different categories of memes and to non-memes. In my case, 0 is given to all non-memes, while starting from 1, a particular index is assigned to each particular category of meme.

● The input dataset is divided into train and test datasets using the ratio that gives the best result.

● For the train dataset, extracting descriptors (features) of each image using ORB.

Now, a basic question arises: how exactly is feature extraction done for images?


The following is my understanding of this question:

(a) ORB is basically a fusion of the FAST keypoint detector and the BRIEF descriptor, with many modifications to enhance performance. First it uses FAST to find keypoints, then applies the Harris corner measure to find the top N points among them.

(b) FAST algorithm - Keypoints are identified by looking at pixels which have a significant intensity variation with respect to their neighbouring pixels.

1. Select a pixel in the image which is to be classified as an interest point or not. Let its intensity be Ip.

2. Select an appropriate threshold, t.

3. Consider a circle of 16 pixels around the pixel under test (see the image below).

4. Now the pixel is a corner if there exists a set of N pixels in the circle (of 16 pixels) which are all brighter than Ip + t, or all darker than Ip - t (shown as white dashed lines in the above image). N was chosen to be 12.

5. A high-speed test was proposed to exclude a large number of non-corners. This test examines only the four pixels at positions 1, 9, 5 and 13 (first, pixels 1 and 9 are tested to see whether they are too bright or too dark; if so, pixels 5 and 13 are checked). If the candidate is a corner, then at least three of these four must all be brighter than Ip + t or darker than Ip - t.

ORB also improves the rotation invariance of the keypoints computed by the FAST algorithm.
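As a rough sketch, OpenCV exposes this detector directly; the threshold value and image path below are illustrative assumptions, not values from the project.

    # Minimal sketch of FAST corner detection ('meme.jpg' and threshold are placeholders).
    import cv2

    img = cv2.imread("meme.jpg", cv2.IMREAD_GRAYSCALE)

    # threshold plays the role of t above: the required intensity difference
    # between the test pixel and the pixels on the surrounding 16-pixel circle.
    fast = cv2.FastFeatureDetector_create(threshold=20, nonmaxSuppression=True)
    keypoints = fast.detect(img, None)
    print(len(keypoints), "corners detected")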

(c) BRIEF Descriptors:

1. It is an example of a binary descriptor. Binary descriptors are preferred over SIFT or SURF as they do not involve computing gradients for each pixel in a patch and are hence comparatively faster.

2. In general, binary descriptors are composed of three parts: a sampling pattern, orientation compensation and sampling pairs.

3. Consider a small patch centered around a keypoint. We'd like to describe it as a binary string. First, take a sampling pattern around the keypoint, for example points spread on a set of concentric circles.


4. Next, choose 256 pairs of points on this sampling pattern. Now go over all the pairs and compare the intensity value of the first point in each pair with the intensity value of the second point: if the first value is larger than the second, write '1' in the string, otherwise write '0'. After going over all 256 pairs, we'll have a 256-character string composed of '1's and '0's that encodes the local information around the keypoint. (OpenCV represents it in bytes, so it is stored as 32 bytes.)

In the case of ORB, it doesn't have an elaborate sampling pattern; it uses moments for orientation calculation, and learned pairs are taken as the sampling pairs.
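Putting FAST and BRIEF together, a minimal sketch of extracting ORB descriptors for one image follows; the image path and nfeatures value are placeholders. The resulting matrix has one 32-byte row per keypoint, which is the N x 32 shape mentioned in the next step.

    # Sketch of ORB descriptor extraction; path and nfeatures are placeholders.
    import cv2

    img = cv2.imread("meme.jpg", cv2.IMREAD_GRAYSCALE)
    orb = cv2.ORB_create(nfeatures=500)
    keypoints, descriptors = orb.detectAndCompute(img, None)
    print(descriptors.shape)   # (N, 32): N keypoints, 32 bytes (256 bits) each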

● Extracting the features of each image in the form of a matrix of shape N x 32 (N is the number of keypoints detected).

● Vector-quantising these features into histograms (bag of visual words) in order to train an SVM classifier on them:

1. Conversion of each image into a vector in n-dimensional space.

2. Creating a bag of visual words/features by using the k-means clustering algorithm, which determines the center of each cluster (the number of clusters is taken as the square root of M/2, where M is the total number of features of all the images).

3. Using the approximate nearest neighbour algorithm to construct a feature histogram for each image. The function increments histogram bins based on the proximity of each descriptor to a particular cluster center.


● Training the SVM classifier using the features of the train dataset and their corresponding indexes (a linear SVM kernel is used, with a "one vs rest" approach for multi-class classification).

● Creating a bag of features for the images in the test dataset as well.

● Predicting the index of the images in the test dataset and calculating the accuracy of the model (see the sketch below).
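The following is a minimal sketch of the bag-of-visual-words and SVM stage. Variable names, the helper structure and the 80/20 train/test split ratio are assumptions, and exact nearest-centroid assignment (kmeans.predict) stands in for the approximate nearest neighbour step; the k = sqrt(M/2) cluster count and the linear one-vs-rest SVM follow the description above.

    # Sketch of the bag-of-visual-words + SVM stage (names and split ratio assumed).
    import numpy as np
    import cv2
    from sklearn.cluster import KMeans
    from sklearn.svm import LinearSVC
    from sklearn.model_selection import train_test_split
    from sklearn.metrics import accuracy_score

    orb = cv2.ORB_create()

    def orb_descriptors(img):
        # N x 32 matrix of binary descriptors for one grayscale image.
        _, des = orb.detectAndCompute(img, None)
        return des if des is not None else np.empty((0, 32), np.uint8)

    # images: list of grayscale images; labels: 0 for non-memes, 1..K for meme categories.
    def build_model(images, labels):
        X_train, X_test, y_train, y_test = train_test_split(
            images, labels, test_size=0.2, random_state=0)   # assumed split ratio

        train_des = [orb_descriptors(im) for im in X_train]
        all_des = np.vstack([d for d in train_des if len(d)])
        k = int(np.sqrt(len(all_des) / 2))                    # k = sqrt(M/2) clusters
        kmeans = KMeans(n_clusters=k, random_state=0).fit(all_des)

        def histogram(des):
            # Feature histogram: count descriptors assigned to each cluster center.
            hist = np.zeros(k)
            if len(des):
                for c in kmeans.predict(des):
                    hist[c] += 1
            return hist

        train_hist = np.array([histogram(d) for d in train_des])
        test_hist = np.array([histogram(orb_descriptors(im)) for im in X_test])

        clf = LinearSVC()                                     # linear kernel, one vs rest
        clf.fit(train_hist, y_train)
        preds = clf.predict(test_hist)
        print("accuracy:", accuracy_score(y_test, preds))
        return kmeans, clf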

3. Output:

Saving the output dataset as a CSV file containing the predicted names of the memes (e.g. "Redditors Wife") and the non-memes (the title "Non-memes" is given).
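A minimal sketch of this step, assuming predicted indexes preds from the previous sketch, a hypothetical index_to_name mapping and a test_image_ids list:

    # Sketch of saving the predictions; column names and file name are placeholders.
    import pandas as pd

    # index_to_name maps a predicted class index back to a meme name; 0 means non-meme.
    output = pd.DataFrame({
        "ImageID": test_image_ids,
        "PredictedName": [index_to_name.get(i, "Non-memes") for i in preds],
    })
    output.to_csv("predictions.csv", index=False)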


OTHER TASKS AND ACTIVITIES

● Learning the use of shell scripts for installing applications, reading files and controlling their parameters.

● Learning the mechanisms and complexity of algorithms through the book "Algorithms" by Sanjoy Dasgupta.

● Learning the basics of Spark and its use when the size of the data is big.


REFLECTION

The internship has been a fulfilling experience. I have been able to accomplish all my stated learning goals, and my expectations have been exceeded. The months spent with Culture Machine have given me a great insight into the startup world. I was genuinely impressed with their work culture, being both flexible and open. I have found that the professionals at Culture Machine are all highly qualified and hardworking individuals, and it was an honor to work under their guidance. My mentors and the data science team as a whole have been very warm and supportive throughout, and I am glad to have built a strong bond with them.


CONCLUSION

How data science and machine learning are carried out in a professional environment had always been a matter of curiosity for me, and the internship with Culture Machine has helped me experience it. I hope to pursue this as a career later in life, and therefore the experience and skill set gained here are invaluable. Some of the skills I gained are listed below:

● Image processing
● Feature detection and extraction of images
● Using different Machine Learning algorithms
● Knowledge of Python libraries such as pandas, numpy, cv2 and sklearn
● Tools such as shell script and Jupyter Notebook for Python

It was a wonderful experience. Thank you Culture Machine for this opportunity. :)