
MRP Abstracts

2017

Abu-Ata, Muad Mustafa Husein - Optimization of Decision Model Microsimulation in Health Care

Microsimulation is used in health care to evaluate the cost-effectiveness of different diagnosis and treatment procedures. Producing valid and statistically significant simulation results requires a large input size. The aim of this project is to speed up an existing decision model simulation for an Obstructive Sleep Apnea (OSA) study and to generalize the simulation to other diagnostic/treatment methods. Additionally, as mortality prediction is an important feature in such models, we aim to accurately incorporate mortality prediction into the simulation model. Parallelization and code refactoring are utilized to scale up the microsimulation model. The scaled-up model simulates four million patients in 21.8 minutes, reducing computational time by a factor of 14. Moreover, we applied the Lee-Carter model to predict future mortality rates, and the fitted model produced small residual errors.
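
The Lee-Carter step models log mortality rates as log m(x, t) = a_x + b_x k_t and is classically fit with an SVD. A minimal sketch of that fit, assuming a small synthetic age-by-year matrix of rates (the study's actual data and tooling are not shown, and the usual identifiability constraints on b_x and k_t are omitted for brevity):

```python
import numpy as np

rng = np.random.default_rng(1)
rates = rng.uniform(0.001, 0.1, size=(10, 20))  # hypothetical m(x, t)
log_m = np.log(rates)

a_x = log_m.mean(axis=1)                      # average age pattern
U, S, Vt = np.linalg.svd(log_m - a_x[:, None], full_matrices=False)
b_x, k_t = U[:, 0], S[0] * Vt[0]              # age sensitivities, time index

# Forecast k_t as a random walk with drift, the standard Lee-Carter step.
drift = np.diff(k_t).mean()
k_future = k_t[-1] + drift * np.arange(1, 6)  # five periods ahead
forecast_rates = np.exp(a_x[:, None] + np.outer(b_x, k_future))
```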

Chen, Yilin – New York City Green Taxi Trip Optimization

Many people think that driving a taxi is all about driving skill, and that there are no special rules or tricks a driver can follow to earn an outstanding income: how much a taxi driver earns depends only on luck and long working hours. But what if there were hidden patterns that could help a driver increase daily revenue? This paper aims to reveal those patterns by digging into big data. I use the 2016 New York Green Taxi trip data from the NYC open data source to build an algorithm that takes an expected starting location, the time of day, and the date of the year as inputs, and recommends to the driver whether the chosen starting location or one of its adjacent locations is expected to earn the highest revenue. A machine learning technique, random forest, is used to model the factors that affect total revenue. The final simulation results indicate that by following the algorithm's recommendations, a taxi driver is likely to increase their revenue.

Durrani, Afsah – Filtering of Tweets to Identify and Remove Un-Informative Concepts

Due to recent technological advancements, there has been a large increase in the number of online users and in the social media content they generate. This abundance of online social media data is used by multiple stakeholders to identify public opinion, trending


topics, and user segments. The large amount of data requires high computational power, which is traditionally dealt with by removing uninformative words using preprocessing techniques, such as stopword removal, before analysis. We present approaches using two correlation algorithms to identify uninformative concepts. The effectiveness of each approach is evaluated by measuring the performance of LDA models applied to the new datasets derived from the experiments. Correlation with the sum of all concepts performs better than correlation with a noise signal. Varying correlation threshold values are experimented with, of which higher thresholds provide better LDA performance.

Fatima, Hira - Analysis of Reddit Groups (Subreddits) Using Classification of Subreddit Posts

In this paper, I applied machine learning techniques to automatically label posts from a subreddit called "askhistorians". Using descriptive analytics, I first conducted an exploratory analysis to look for patterns, correlations, or relationships that could generalize the posting behaviour of Reddit users. The second part of my analysis comes from training and evaluating eight binary classifiers, one for each of the eight category codes listed in Appendix A, that label posts as positive or negative for that code. I used three different algorithms and compared their performance using accuracy, precision, and recall. This research is a continuation of an existing study that started in the Ryerson Social Media Lab (RSML) [1]. The dataset used to train and evaluate the classifiers was coded manually by the RSML. The predicted classification results were used to provide more insight into the subreddit group.
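
A minimal sketch of one such per-category binary classifier, using a TF-IDF + logistic regression pipeline as a plausible stand-in for the paper's three algorithms; the posts and labels are hypothetical, not the RSML-coded data:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

posts = [
    "Primary sources on the Meiji Restoration?",    # 1 = code applies
    "Great thread, following!",                     # 0 = code does not apply
    "How did trade routes shape medieval Europe?",  # 1
    "This subreddit is amazing.",                   # 0
]
labels = [1, 0, 1, 0]

clf = make_pipeline(TfidfVectorizer(), LogisticRegression())
clf.fit(posts, labels)
print(clf.predict(["What do we know about Viking navigation?"]))
```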

Ghaderi, Amir - Credit Card Fraud Detection Using Parallelized Bayesian Network Inferencing

The number of credit card transactions is growing, taking an ever-larger share of the world's payment system. Improved credit card fraud detection techniques are required to maintain the viability of the world's payment system. The aim of this Major Research Project is (1) to develop a Bayesian network model that is able to predict fraudulent credit card transactions with minimal false positive predictions, and (2) to reduce the processing time through parallelization of the inferencing process. The Bayesian network was trained on credit card transaction data obtained from European cardholders for the month of September 2013. The results determined that Bayesian networks can be trained to predict fraudulent credit card transactions with zero false positive predictions. In addition, Bayesian network inferencing can be efficiently parallelized to reduce the overall processing time.
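
A sketch of the parallelization idea: shard transactions across a process pool and score each shard independently. Here score_transaction() is a hypothetical placeholder for the trained network's P(fraud | evidence) query; the actual model and inference library are not shown.

```python
from multiprocessing import Pool

def score_transaction(txn):
    # Placeholder: the real call would run inference on the trained network.
    return {"id": txn["id"], "p_fraud": 0.0}

def score_batch(batch):
    return [score_transaction(t) for t in batch]

if __name__ == "__main__":
    transactions = [{"id": i, "amount": 10.0 * i} for i in range(1_000)]
    shards = [transactions[i::4] for i in range(4)]  # one shard per worker
    with Pool(processes=4) as pool:
        scored = [r for batch in pool.map(score_batch, shards) for r in batch]
    print(len(scored))  # 1000
```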

Ghaly, John - A Defect Prediction Model Using Delta Static Metrics

Dependence on software to automate, optimize and manage our daily tasks is growing every day. As the demand for more software functionality increases, software size and complexity also increase. Maintaining software and finding defects is a hard and time-consuming job. We propose a machine learning model to identify and predict defect-prone modules. We use an industrial dataset to build 8 classifiers from 5 different categories based on Static, Churn, and Delta metrics. We found that the addition of delta metrics significantly reduced the probability of false alarm while improving the probability of detection. Our results validate our hypothesis on the added value of delta metrics for improving results. We found that most algorithms achieved reasonable performance given a suitable technique.
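
Delta metrics are the release-to-release changes in static metrics. A minimal sketch of computing them, with hypothetical module names, columns, and values rather than the paper's industrial dataset:

```python
import pandas as pd

prev = pd.DataFrame({"module": ["a.c", "b.c"], "loc": [120, 300], "cyclomatic": [8, 21]})
curr = pd.DataFrame({"module": ["a.c", "b.c"], "loc": [150, 290], "cyclomatic": [11, 20]})

merged = curr.merge(prev, on="module", suffixes=("_curr", "_prev"))
for m in ("loc", "cyclomatic"):
    merged[f"delta_{m}"] = merged[f"{m}_curr"] - merged[f"{m}_prev"]

# The delta_* columns would join the static and churn features fed to the classifiers.
print(merged[["module", "delta_loc", "delta_cyclomatic"]])
```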

Hon, Marcia - Alzheimer's Diagnosis with Convnet

Alzheimer's is a serious disease characterized by a progressive degeneration of the brain, accounting for 60 to 80 percent of dementia cases. The ability to automate its diagnosis is very important to accelerate treatment. In this project, convolutional neural networks (convnets) are used to automate the classification of Alzheimer's disease from MRI images. 6,400 MRI images were taken from http://www.oasis-brains.org and evaluated with 5-fold cross-validation and an 80%/20% train/validation split. VGG16 won the ImageNet competition and is thus used in this project. Its classification layer is retrained, borrowing code from https://keras.io/applications/ and https://blog.keras.io/building-powerful-image-classification-models-using-very-little-data.html, using Keras, TensorFlow, and Python. A very high accuracy of 92% was achieved. This success shows that machine learning can successfully and readily be applied to medicine. Future projects could involve classifying skin cancer and other diseases with a visual component using different convnets. Additionally, given sufficient longitudinal data, MRI could be used to predict Alzheimer's rather than merely classify it.
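
A minimal transfer-learning sketch in the spirit of the approach described: freeze VGG16's convolutional base and retrain only a new classification head. The input size, head layers, and optimizer settings are assumptions, not the project's exact configuration.

```python
from tensorflow.keras.applications import VGG16
from tensorflow.keras import layers, models

base = VGG16(weights="imagenet", include_top=False, input_shape=(224, 224, 3))
base.trainable = False  # keep the ImageNet features fixed

model = models.Sequential([
    base,
    layers.Flatten(),
    layers.Dense(256, activation="relu"),
    layers.Dense(1, activation="sigmoid"),  # Alzheimer's vs. non-demented
])
model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])
# model.fit(train_images, train_labels, validation_split=0.2, epochs=10)
```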

Islam, Md Shariful - Opinion Mining Classification of Twitter Data for Major Telecom Operators in Canada

Fierce competition is visible among telecom operators seeking to acquire more subscribers through advertisements and campaigns, especially on social media. The question that arises is how to measure the performance of operators based on customer response. The goal of this project is to measure the competitive performance of mobile phone operators by analyzing customer sentiment in Twitter data and to build classifier models using different machine learning algorithms. 9,000 tweets were collected for the top three mobile operators in Canada. After data cleaning and text processing, sentiment analysis was completed; the sentiments were then classified and compared using three different algorithms. Among the three operators, Telus has the highest positive sentiment. Among the algorithms, SVM and RF have better accuracy than the decision tree. This will help wireless operators learn about negative experiences and turn them into positive ones by improving the particular service.

Kundu, Somnath - Graph Theory Perspective of Stock Market Behaviour

It is often noticed in practice that the prices of different stocks and other financial investment instruments move together. This is not too surprising, since many different companies engage in similar types of business and one business depends on other businesses, so those companies can be assumed to be tied together by an invisible thread of relationship. Though it is difficult to find the actual relationships between the companies, we can always measure the strength of their relationship by the similarity of movement of their attributes. We may assume that if the correlation coefficient of one or more attributes of two stocks is larger than some chosen threshold value, then those two stocks are connected by an edge in the relationship graph. In this project, our objective is to explore these relationships between the stocks from the graph theory perspective and to investigate various properties of this stock relationship graph, including clustering.
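
A sketch of the thresholded correlation graph described above, built with numpy and networkx from a hypothetical daily-returns matrix; the tickers, data, and 0.6 threshold are illustrative assumptions:

```python
import numpy as np
import networkx as nx

tickers = ["AAA", "BBB", "CCC", "DDD"]
returns = np.random.randn(250, len(tickers))  # stand-in for daily returns

corr = np.corrcoef(returns.T)
threshold = 0.6

G = nx.Graph()
G.add_nodes_from(tickers)
for i in range(len(tickers)):
    for j in range(i + 1, len(tickers)):
        if corr[i, j] > threshold:
            G.add_edge(tickers[i], tickers[j], weight=corr[i, j])

# Graph-theoretic properties of the stock relationship graph
print(nx.number_connected_components(G), nx.clustering(G))
```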

Patel, Jitesh - Predicting Breast Cancer Survival Based on Gene Expression and Clinical Variables

Survival of breast cancer patients is highly variable. Several gene sets are directly or indirectly involved in breast cancer. I explored whether combining the mRNA expression of such sets may improve survival prediction for triple-negative breast cancer. I used TCGA gene expression data for this study and classified 19 genes into two sets based on the relationship between death risk and the expression of each gene, using a Cox model. Up-regulation of the first gene set combined with down-regulation of the second gene set is correlated with high risk in triple-negative breast cancer. Triple-negative breast cancer is classified based on the expression of the estrogen, progesterone, and HER2 receptors. The combined effect of gene sets 1 and 2 on survival was predicted on the overall data for the triple-negative and luminal classes of breast cancer using the Kaplan-Meier model. Combining the effect of multiple gene signatures improves prediction of triple-negative breast cancer survival. This methodology can be relevant for different cancer types and targeted therapies.

Rizvi, Syed Ali Mutahir - Prediction of the Directional Change or Strength of Forex Rates

Predicting the direction or strength of FOREX pairs is extremely hard, and predicting trends has been an area of interest for researchers for many years due to the market's complex and dynamic nature. There are hundreds of trend indicators for FOREX prediction, but their accuracy is not reliable. In this project, a combination of indicators (Day Close Strategy, Moving Average Crossover, Fractal Strategy, Renko charts, ATR & Breakout) and machine learning algorithms (Naïve Bayes, Support Vector Machine, Deep Neural Networks) is used for better prediction accuracy, and the results suggest that this approach is helpful in providing decision support.
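
A sketch of one of the indicators named above, a moving-average crossover, in pandas; the window lengths and prices are illustrative assumptions, and signals like this would become features for the classifiers:

```python
import pandas as pd

prices = pd.Series([1.10, 1.11, 1.12, 1.11, 1.13, 1.15, 1.14, 1.16])

fast = prices.rolling(window=3).mean()   # short-horizon average
slow = prices.rolling(window=5).mean()   # long-horizon average

# +1 when the fast average is above the slow one (bullish), -1 otherwise.
signal = (fast > slow).astype(int) * 2 - 1
print(signal.tail())
```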

Shi, Pengshuai - Population Counting with Convolutional Neural Networks

In this project, we explore the challenge of automatic population counting from single images. Most recent work applies neural networks to extract visual features and regress the population count either explicitly or implicitly. This type of model has been shown to perform better than traditional counting methods that require localizing each object and hand-crafting image feature representations. In this work, we compare two different types of CNN-based counting models. The first model is a fully convolutional neural network (CNN) that predicts a pixel-wise density map, where the entity counts are obtained by post-hoc summing over the density map. The second model is a neural network consisting of an initial set of convolutional layers, followed by fully connected layers that directly regress the entity count. Our empirical evaluation considers three diverse datasets: (i) cells captured under a microscope, (ii) aerial views of sea lions, and (iii) aerial views of crowds of people. We find that the direct count regression approach generally performs better than the indirect one. In addition, we explore a saliency map approach to visualize the locations of the counted entities.
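
A sketch contrasting the two counting heads: the indirect model's count is a post-hoc sum over the predicted density map, while the direct model regresses the count itself. The density map here is a random stand-in for an actual network output.

```python
import numpy as np

density_map = np.random.rand(64, 64) * 0.01  # hypothetical network output

# Model 1 (indirect): the count is a post-hoc sum over the density map.
count_from_density = density_map.sum()

# Model 2 (direct): a fully connected head would output the scalar count
# itself, so no summation step is needed at inference time.
print(f"density-map count: {count_from_density:.1f}")
```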

Trikha, Anil Kumar - Enhancing User Interest Representation in Social Media

User interest detection in social media is valuable for providing recommendations of goods and services, modeling users, and supporting online advertising. Only recently have models for inferring implicit interests and predicting future interests been proposed. We extend these models by specifying a technique that yields an improved representation of user interests for these purposes. We evaluate the solution on publicly available Twitter data. The research question we address is whether user interests derived from micro-blogging posts can be more accurately represented using a data mining approach that utilizes association rules.

Yadav, Shailendra Kadhka - Risk Prediction of Collisions in Toronto

Collision prediction models are used for a variety of purposes, most frequently to estimate expected accident frequencies from various roadway factors such as aggressive driving, traffic control, road class, and speeding, and also to identify factors associated with the occurrence of accidents. In this study, Decision Tree, Random Forest, and ARIMA time series models are implemented and analyzed on the Killed or Seriously Injured (KSI) traffic data to predict the severity of injury type and the number of collisions in Toronto over the next 12 months. The ARIMA model gives an accuracy of 85% for predicting the number of collisions. The Decision Tree using CART and the Random Forest models return accuracies of 57% and 67%, respectively, for the classification of injury types.

Yan, Bingsen - Automatic Sentiment Analysis Process: Amazon Online Shopping

Background: Sentiment analysis offers significant time savings and efficiency gains, especially given that customers nowadays rely more and more on product reviews when shopping online. Aim: Develop a sentiment analysis tool as a decision aid to improve the online shopping experience. Methodology: We use web scraping technology to collect real-time online data, sentiment analysis technology to compute a sentiment score for each review, and a machine learning model to predict star ratings. Results: We developed an automatic process that generates a product report including price, star rating, reviews, and sentiment scores. We also analyzed the relationship between star rating and the three sentiment scores. In addition, a prediction model was built to predict star rating from sentiment score. Conclusion: The tool enables customers to evaluate a product more efficiently, and it is a powerful tool for star rating prediction.
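
A sketch of the per-review scoring step, using NLTK's VADER analyzer as one plausible choice; the reviews are hypothetical, and the scraping and rating-prediction steps are omitted:

```python
import nltk
from nltk.sentiment import SentimentIntensityAnalyzer

nltk.download("vader_lexicon", quiet=True)
sia = SentimentIntensityAnalyzer()

reviews = [
    "Great battery life, totally worth the price.",
    "Stopped working after two days. Very disappointed.",
]
for review in reviews:
    scores = sia.polarity_scores(review)  # neg / neu / pos / compound
    print(scores["compound"], review)
```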

Yueh, Ming-hui - Determining Factors Influencing Prediction of Length of Stay

Having a predictive model helps doctors identify short-stay patients more objectively. To identify important features for predicting a patient's length of stay of 72 hours or less, three stages of data processing were performed, from obtaining initial variables to applying feature selection methods that determine a subset of features, which were then fed into several learning algorithms. AUC and precision-recall curves were used to measure model performance. Regardless of the selection method and imputation approach, the ALB (albumin) value, age, and HGB (hemoglobin) value were found to influence model performance the most.

2018

Amadou, Angelina – Geospatial Simulation and Modelling of Out-Of-Home Advertising Viewing Opportunity

Companies use out-of-home (OOH) advertising to promote their products. The purpose of this project is to build an integrated multi-source simulation model that allows Environics Analytics to establish optimal locations for marketing campaigns. The study area is located in the province of Manitoba in Canada and concerns mainly the Winnipeg Metropolitan Area, which includes the city of Winnipeg and its surrounding municipalities. Using Dijkstra's algorithm for finding shortest paths, a simulation algorithm is developed. The top ten busiest intersections are retrieved and used as recommended locations for OOH advertising. Additionally, a Wilcoxon signed rank test is used to validate the simulation output against empirical data. In general, there is no statistically significant difference between the simulated data and the empirical set. The study has shown that multi-agent-based models, although in their infancy, represent a viable approach to modelling population dynamics. Results from the simulation can be used to develop a new model, which may include demographic profiles of the population, for further studies.
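
A sketch of the shortest-path step, using networkx's Dijkstra implementation on a toy road graph; the nodes, edges, and travel times are hypothetical:

```python
import networkx as nx

G = nx.Graph()
G.add_weighted_edges_from([
    ("A", "B", 4.0), ("B", "C", 2.0), ("A", "C", 7.5),
    ("C", "D", 1.5), ("B", "D", 6.0),
])  # weights = travel time in minutes

path = nx.dijkstra_path(G, "A", "D", weight="weight")
cost = nx.dijkstra_path_length(G, "A", "D", weight="weight")
print(path, cost)  # ['A', 'B', 'C', 'D'] 7.5

# Routing many simulated agents this way and tallying traversals per node
# is one way to surface the busiest intersections.
```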

Arabi, Aliasghar – Text Classification Using Deep Learning in Reddit Reply/Comments

In this paper, I implemented several deep learning models to automatically classify posts from the subreddit "askhistorians" into defined classes using pre-trained word embedding vectors. The training data is taken from research done at the Ryerson University Social Media Lab. I used a one-vs-the-rest (OvR) classifier to train a separate model for each of the eight classes. The Keras library for Python is used to develop the deep learning frameworks, starting with individual models such as CNN and LSTM and finishing by combining the individual models into more complex versions such as CNN+LSTM and LSTM+CNN. When compared with previous work using traditional models and n-grams as features, improvement in all three of accuracy, recall, and precision is observed. The best model, considering all evaluation metrics, the stability/range of results across iterations/folds, and run time, was found to be the CNN for all categories.
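
A minimal sketch of the CNN+LSTM variant described above, with a hypothetical vocabulary size, sequence length, and embedding dimension; in the paper the embedding layer would load pre-trained vectors, which is omitted here:

```python
from tensorflow.keras import layers, models

vocab_size, seq_len, embed_dim = 20000, 200, 100

model = models.Sequential([
    layers.Input(shape=(seq_len,)),
    layers.Embedding(vocab_size, embed_dim),   # would load pre-trained vectors
    layers.Conv1D(128, 5, activation="relu"),  # local n-gram features
    layers.MaxPooling1D(2),
    layers.LSTM(64),                           # sequence-level context
    layers.Dense(1, activation="sigmoid"),     # OvR: this class vs. the rest
])
model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])
model.summary()
```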

Arjumand, Isra – Stock Market Prediction Using Machine Learning

This project focused on efficient prediction of Apple Inc. (AAPL) stock price movement to support effective investment decisions, by generating trading decisions, comparing the SVM, KNN, and RF machine learning algorithms, and comparing profits. Research shows that the use of machine learning algorithms together with technical analysis gives good results. Technical analysis was applied to the data to generate trading signals, and the algorithms were trained on them to predict future stock trends. By applying trading rules, decision points (buy, sell, and hold) are generated. SVM performed better in experiment 1 and RF was more efficient in experiment 2. Performance was evaluated using profit percentage, and adding more technical indicators improved the profit percentage. In conclusion, better profits are generated when technical indicators are used along with machine learning techniques, in contrast with technical indicators alone.

Beqaj, Inela – Diabetic Retinopathy Detection Using Convolutional Neural Networks

Machine learning techniques are becoming more and more helpful in many areas of our everyday life, such as education and healthcare. One of the main applications of these techniques in healthcare is computer-aided diagnosis: systems that assist doctors in the interpretation of medical images. This project focuses on medical image analysis of retinal images to identify Diabetic Retinopathy, an eye disorder that is the leading cause of blindness among people diagnosed with diabetes. The project uses both supervised and semi-supervised techniques to classify the images. The two convolutional neural network architectures applied in supervised learning are VGG16 and DenseNet121, while the architecture used in semi-supervised mode is an Adversarial Autoencoder. The semi-supervised techniques achieve the same accuracy as the supervised ones, but they are more efficient because they reach that accuracy using only 10 percent of the labeled data.

Chowdhury, Kakoli – Binary Classification on Clustered Data

Land-mobile radio systems support many vital communication functions for government and private operators, some related to public safety and mission-critical functions. The models produced will help in understanding usage patterns at different time periods to predict occupancy and demand by different channels across the spectrum. CRC (Canadian Research Corporation) is providing Layer 1 data sampled every three milliseconds. This data is further explored and processed under this MRP. Sub-setting of the data is conducted based on clustering and descriptive statistical analysis designed to differentiate between channels exhibiting different occupancy-percentage patterns. Applying algorithms to the clustered data is expected to reveal distinct behaviours that can be further utilized to find the best prediction model for spectrum availability.

Fadel, Fady – Organizing Web Search Results Using Best Clusters Separation

In this paper, we applied a novel idea: using machine learning techniques to automatically organize web search results from search engine queries. Using text mining analytics, we first conducted an analysis to identify features that can be used for clustering. The second part of the analysis evaluated the best cluster separation method and compared the performance of the selected features across different clustering algorithms.

Gupta, Vasudev – Predicting Gold Prices Using Neural Networks

The aim of the study in this paper is to predict gold futures prices using neural networks. Prices of gold change rapidly in real time across the globe, making price prediction interesting and challenging; it stresses machine learning algorithms and technology and is therefore a good test case. A North American perspective on gold price prediction was used within this study. Gold is used as an investment vehicle by a large number of investors across the world, and successful predictions can be very helpful. Five input variables were used to predict the price: the silver futures price, the copper futures price, the Dow Jones Industrial Average, the US Dollar Index, and the VIX volatility index. Two types of neural network models were used to predict gold prices: a Feedforward Neural Network (FNN) and a Recurrent Neural Network with Long Short-Term Memory (RNN-LSTM). Different variations of training data (weekly/daily, short/long term) were also tried, and experimentation was undertaken with USD/CNH (the US dollar to Chinese renminbi exchange rate) as an additional input variable.

He, Xin – Movie Recommender System: Using Ratings and Reviews

Because of information overload, it is becoming increasingly difficult for users to find the content that they are interested in. Usually, the actual ratings are used to implement a recommender model. Currently, many item evaluation systems have not only ratings but also reviews. In this report, we mainly describe how to use both ratings and reviews to implement a recommender model. Additionally, the project investigates the relationship between the ratings and reviews.

Hyder, Md Khaled – Sentiment Analysis of Twitter Data For Top Canadian Retailers

The competition among retail companies is visible in all communication channels. Most companies are now focusing on social media marketing to reach a vast consumer base. In parallel with aggressive communication, retailers also want to measure their own and their competitors' performance.

This project aims to measure the performance of retail companies in Canada by analyzing user sentiment from tweets, to build a machine learning model that can predict sentiment with high accuracy, and to conduct exploratory analysis to find user engagement and other hidden patterns.

286,668 tweets were collected for the top five retailers. After processing and cleaning the dataset, an exploratory analysis was conducted to find hidden patterns, and a sentiment classifier model was developed using five algorithms, each experimented with two vectorizers.

Among the five retailers, Sobeys has the highest positive score. Initially, Linear SVM with a count vectorizer produced the highest accuracy; random oversampling with a TF-IDF vectorizer then produced high and balanced precision and recall values. This solution will help retailers to compare their performance with competitors.
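
A sketch of the vectorizer/classifier pairing described above: a linear SVM trained once with a count vectorizer and once with a TF-IDF vectorizer. The tweets and labels are hypothetical stand-ins for the collected dataset.

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.svm import LinearSVC
from sklearn.pipeline import make_pipeline

tweets = [
    "Love the new store layout!",
    "Waited 40 minutes at checkout, never again.",
    "Great produce selection today.",
    "My order arrived damaged.",
]
labels = ["positive", "negative", "positive", "negative"]

for vec in (CountVectorizer(), TfidfVectorizer()):
    clf = make_pipeline(vec, LinearSVC())
    clf.fit(tweets, labels)
    print(type(vec).__name__, clf.predict(["Friendly staff and quick service"]))
```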

Jain, Sachin – Binary Classification Prediction on Time Aggregated Data

The main objective of this Major Research Project (MRP) is to find the effect of time resolution on the prediction of channel occupancy for Land Mobile Radio channels, to facilitate dynamic spectrum allocation that increases overall spectrum efficiency. This project is a collaboration between the Canadian Research Corporation and the Data Science Lab at Ryerson.

Layer 1 data measures the occupancy percentage of more than 7,000 channels approximately every three milliseconds. This MRP specifically looks at generating aggregated datasets from the Layer 1 data to predict channel occupancy. Predictive classification is then conducted using the Naïve Bayes and Logistic Regression algorithms on these datasets. The ultimate goal of this project is to establish the spectrum occupancy prediction model that works best under given conditions.

Jandu, Arshnoor – Neural Style Transfer With Image Super Resolution

In fine art, humans have mastered the skill of creating unique visual representations by combining the content and style of an image. However, rendering the semantic content of an image in different styles is a difficult image processing task. The recent success of deep learning in computer vision has demonstrated its power in creating imagery by separating and recombining image content and style, a technique called Neural Style Transfer (NST). Several online and offline optimization methods have been proposed that produce new images of high perceptual quality. However, these existing methods do not offer the flexibility of creating high-resolution upscaled images. In this project, I have implemented deep neural networks for Neural Style Transfer and Single Image Super Resolution, with which users can transform photos into desired paintings and further upscale them to high resolution. This project also demonstrates experimentation with several parameters of NST to create striking photo effects.


Kashyap, Askhat – Stock Price Movement Prediction Using Social Media (Reddit) Analysis

In this paper, we applied different machine learning techniques to predict stock price movement based on metrics derived from posts in a subreddit called "economy". As part of exploratory data analysis, I tried to identify patterns in stock price movement, performed data cleanup on the Reddit posts, and identified important topics discussed in them. We categorized the stock market data into three classes: positive, negative, and steady. Data points were marked positive or negative if the market direction was upward or downward by more than a certain threshold (above +/- 1%); otherwise they were marked 'steady'. Volume changes were considered when calculating this percentage.

Khan, Ghazala – 6-Month Infant Brain MRI Segmentation With Convolutional Neural Network

Brain MRI segmentation and analysis is one of the most important initial steps in measuring the brain's anatomical structure and visualizing changes and developments in the brain. The early stage of brain development is "critical in many neurodevelopmental and neuropsychiatric disorders, such as Schizophrenia and autism." These abnormalities and disorders are detectable at an early infant age, and early interventions are possible to protect a life at risk.

To investigate the problem, this paper proposes two models, a 2D Conv and a 3D FCNN, for segmenting the brain MRI tissues of 6-month-old infants into GM, WM, and CSF from multi-modality T1- and T2-weighted images, using the MICCAI grand challenge iSeg2017 dataset. The architecture of the 2D Conv was inspired by the VGG model, with modifications; the architecture of the 3D Fully Convolutional Neural Network was inspired by recent work on infant brain MRI segmentation.

The quantitative evaluation of the 3D FCNN exhibited substantial advantages of the proposed method in terms of tissue segmentation accuracy with efficient use of parameters. The 3D FCNN showed performance comparable to the 21 state-of-the-art international teams of the iSeg2017 challenge and acquired a DSC score of 93%.
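
The Dice similarity coefficient (DSC) used above scores the overlap between a predicted and a reference segmentation. A sketch of it on hypothetical binary masks:

```python
import numpy as np

def dice(pred: np.ndarray, truth: np.ndarray) -> float:
    """DSC = 2|A ∩ B| / (|A| + |B|) for binary masks."""
    intersection = np.logical_and(pred, truth).sum()
    return 2.0 * intersection / (pred.sum() + truth.sum())

pred = np.zeros((8, 8), dtype=bool); pred[2:6, 2:6] = True
truth = np.zeros((8, 8), dtype=bool); truth[3:7, 3:7] = True
print(round(dice(pred, truth), 3))  # overlap 9, masks 16 each -> 0.562
```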

Luo, Jiefan – Twitter Bots Detection Utilizing Multiple Machine Learning Algorithms

The purpose of this paper is to apply multiple machine learning algorithms to develop bot-detection models for Twitter. Using exploratory analysis, I explored the Twitter metadata and found useful behavioural features for distinguishing between normal users and bots. For the training models, I found optimal hyperparameters to tune the different models. I applied five algorithms, including Naive Bayes, Decision Tree, Random Forest, Linear Support Vector Machine (SVM), and Radial Basis Function SVM, to classify bots and humans. The results of the classification are the account identities, and I measured classification performance by accuracy, sensitivity, specificity, and area under the receiver operating characteristic curve (AUC). The results show that the Random Forest algorithm was most effective at detecting bots and identifying normal users.

Najlis, Bernardo – Applications of Deep Learning and Parallel Processing Frameworks in Data Matching

Most data science research work assumes a clean, deduplicated dataset as a precondition. In reality, 80% of the time spent in data science work is dedicated to data deduplication, cleanup, and wrangling. Not enough research papers focus on data preparation and quality, even though it is one of the major obstacles to applying data science. The research subject of this paper is improving data matching techniques on multiple datasets containing duplicate data, using parallel programming and deep neural networks. Parallel programming frameworks (like MapReduce, Apache Spark, and Apache Beam) can dramatically increase the performance of computing pair comparisons to find potential duplicate record matches, which is necessary due to the O(n²) complexity of the problem. Deep neural networks have shown great results in improving accuracy on many traditional machine learning applications. The problem and solution researched are of general applicability to multiple data domains (healthcare, business).

Ong, Liza Robee – Predicting Depression Using Personality and Social Network Data

Over 300 million people worldwide suffer from depression. With the advent of social networks, our goal is to apply a novel approach to identifying depression by investigating what relationships exist between an individual's social network information and speech features, their personality, and their depression levels. The study was conducted using a publicly available dataset called myPersonality, which contains more than 6,000,000 test results and over 4,000,000 individual Facebook profiles. From the dataset, we used depression risk and personality assessment scores, Facebook network measures, and linguistic measures. We created a classifier to extract a feature that indicates the speech act of a status update. We applied several machine learning methods and feature sets to predict depression risk based on personality type, speech acts, and network influence. Our results show that the best predictors included the personality dimension scores on neuroticism, conscientiousness, and extraversion, and the usage scores for the assertive and expressive speech acts.

Rafayal, Syed – Tucker 2 Tensor Decomposition Model Implementation on Visual Dataset Using Tensor Factorization Toolbox

The main goal of this paper is to recognize and classify images utilizing the Tucker2 decomposition technique. In the first part of the experiment, an exploratory analysis is conducted. The second part involves building training models and automatically labeling the test images correctly. In the training and validation phase, different folds and values of the indices (i, j) are used to find the best performance as measured by accuracy. In addition, two approaches are adopted for testing. The first approach randomly selects training samples from core tensors. In the second approach, a similarity score table is created and sorted in ascending order; a larger score means a noisier image, and core indices are collected from every noise level at a certain interval. Experimental results show that the training models for the indices (i=8, j=8) were the most successful and that the second testing approach is more consistent. All experiments have been conducted on a publicly available visual dataset called Fashion MNIST, using the MATLAB factorization software package known as Tensor Toolbox.

Rodrigue, Sami – Experiments With External Data and Non-Linearity for Channel Usage Prediction

Neural networks are among the most popular models for predicting channel usage in the telecommunication spectrum. They commonly use spectral, temporal, or spatial information from simulated or cellular data. However, these sources can fail to capture the full array of user behaviour. We use fully connected neural networks and perceptrons on LMR data collected in Ottawa to explore whether enriching the input space with external data, such as weather data, or applying non-linear transformations to the input space improves the predictive power of the models. Based on our initial analysis, we have failed to identify any improvement in prediction using weather data; the benefit of non-linear transformations, however, is dependent on channel behaviour. The latter point can be further explored via other models, such as recurrent neural networks, and different groupings of the channels.

Sharma, Suansh – Spectrum Occupancy Prediction in Land Mobile Radio Using Multiple Hidden Markov Models

In this paper, we seek to predict the occupancy status of Land Mobile Radio channels from real-life spectrum measurements using machine learning techniques. Cognitive radios are essential for implementing dynamic spectrum sharing, which has been gaining attention as a promising solution to alleviate the problem of spectrum scarcity. HMMs are widely studied in the literature for spectrum prediction; by design, HMMs learn from the sequential nature of the data, which is directly applicable to the case of temporal spectrum occupancy prediction. We implement a model made up of multiple HMMs to perform spectrum occupancy prediction. We use submodels to capture primary user activity; the submodels are then used to initialize a high-level HMM, which is trained on an LMR channel's occupancy over time. We validate the performance of the multiple hidden Markov models on LMR bands and show that the multiple-HMM model performs better than a single HMM at predicting occupancy status for the next hour. By training multiple HMMs, which capture not only channel occupancy patterns over time but also low-level user activity patterns, sizeable gains can be made in the performance of data-driven spectrum prediction techniques.


Sirwani, Naresh Kumar – Prediction of Query Hardness

Information retrieval (IR) has become an important part of today's data-driven world, yet most IR systems suffer from high variance in their retrieval performance and result quality for several reasons; even a system that performs well on average can still return poor results for some queries. Understanding such hard queries, and in fact predicting their difficulty before the search takes place, can bring many improvements to IR systems, including but not limited to direct user feedback on the expected quality of results, federation or metasearch, content enhancement, and query expansion. In this paper, we systematically study and implement various TF-IDF-based pre-retrieval methods to determine query difficulty on different TREC data collections. We then compare our experimental results with neural-embedding and SELM (Semantic Enabled Language Model) based models, for which results are already available from other similar studies, and determine which methods perform better and return more relevant and accurate results.
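
One simple TF-IDF-style pre-retrieval predictor is the average inverse document frequency of the query terms: a low average IDF suggests a vaguer, likely harder query. A sketch on a tiny hypothetical corpus (not one of the TREC collections):

```python
import math

corpus = [
    "jaguar speed in the wild",
    "jaguar car dealership prices",
    "wild cats of south america",
]

def idf(term: str) -> float:
    df = sum(term in doc.split() for doc in corpus)
    return math.log(len(corpus) / df) if df else 0.0

def avg_idf(query: str) -> float:
    terms = query.split()
    return sum(idf(t) for t in terms) / len(terms)

print(avg_idf("jaguar"))      # frequent term -> low IDF, likely harder query
print(avg_idf("dealership"))  # rare term -> high IDF, likely easier query
```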

Taylor, Kisha – Automated Stock Trading Based on Predicted Direction of Next-Day Closing Prices – S&P 500 Index

This paper develops a model that tries to mimic a trader based on the predicted direction of the next-day closing price of the S&P 500 ETF (Exchange Traded Fund); the approach can be applied to any single stock or index.

Three approaches are used:

(1) Technical analysis only

(2) Machine Learning (ML) using only closing prices as inputs (baseline models)

(3) ML models ("hybridized inputs") that use a combination of technical indicator(s) and raw closing prices as inputs

This classification problem uses accuracy (the main metric), precision and recall, and return metrics. The data (sourced from Yahoo Finance) covers 3½ years of trading data (open and closing prices) from 2 January 2015 to 6 June 2018.

The paper also explores the use of a buffer, examining its predictive impact. The buffer is essentially a threshold used to derive the signal generated by the technical indicator.

Tomini, Emmalie – Load Forecasting Using Recurrent Neural Networks in Ontario Energy Markets

Reliable electricity load forecasting is essential for industry to devise efficient energy management plans as well as to guide conservation efforts. Rising market demand and unpredictable behaviours have resulted in traditional methods of electricity prediction no longer being robust enough to accurately forecast market demand. The aim of this project was to use machine learning approaches to create a model for effective load forecasting. Implemented in Python, a recurrent neural network was trained on a variety of input features in order to determine what information is necessary to model Ontario load patterns. Calendar variables such as day, month, year, day of the week, and time, together with relative humidity and dew point temperature, were determined to produce the most accurate results; the RNN model trained on this input space yielded a MAPE of 5.19% on the test set. The results obtained from the models implemented in this study produce reasonably accurate day-ahead electricity forecasts. However, there is room for improvement in this field, and machine learning approaches are an excellent fit for this area of study.

Walia, Harneet – Customer Acquisition Through Direct Marketing Campaign Analysis

In this research, we analyze a direct marketing campaign dataset obtained from a Portuguese financial institution to predict whether a customer will subscribe to a fixed deposit (upsell), along with predicting the best month (time-aware) to reach out to the customer. To solve the time-aware upselling problem, we implemented a Time Aware Upsell Prediction Framework (TAUPF) using two different approaches, with the aim of finding the best approach and technique to build the prediction model. TAUPF is implemented using the Upsell Prediction Approach (UPA) and the Clustered Upsell Prediction Approach (CUPA). We have also tried to answer the data imbalance problem by examining and comparing different sampling methods (up-sampling and down-sampling). For decision tree, K-nearest neighbour, and random forest, it was observed that CUPA has a higher F-score than UPA. It is also observed that, with prediction of the month, the number of calls made to the customer before they subscribe to a fixed deposit can be reduced by a significant number.

Wan, Alexander – Learning About Tensor Decomposition to Determine Length of Stay

Tensor decomposition is a technique in data science that can be used to build prediction models. By using tensor decomposition on St. Michael's Hospital data, a model can be developed to predict a patient's length of stay. An application of tensor decomposition known as the generalized tensor product is applied to the St. Michael's Hospital dataset. The dataset is assembled by measuring each variable's performance and correlations during pre-processing and by keeping variables the hospital deems important. Model performance is measured by comparison against other machine learning algorithms. The average error from tensor decomposition was around 80 hours; however, in comparison to the other machine learning algorithms, tensor decomposition was more accurate. A major problem was that the computer used for this project was not powerful enough to test higher dimensions, which could mean the data needs to be revisited to build a better dataset for analysis.


Wu, Xinjie – Validation and Sensitivity Study of a LSTM Model for Stock Price Prediction

Time series data is everywhere in everyday life as well as in many business sectors, and the ability to predict the future behaviour of a process helps reduce uncertainty and risk and improve profit and performance in many industries. A stock price sequence is an easily accessed, ongoing time series dataset, and its "unpredictable" reputation makes it a good source for challenging emerging algorithms.

LSTM (Long Short-Term Memory) is an algorithm well designed for time series forecasting. In this project, a recently proposed LSTM model for predicting future stock prices/movements was studied and compared to other available models. The model was then applied to a couple of selected stocks for validation, and a sensitivity study of the model's parameters is also presented.

The study showed that the presented model has an advantage over other models but is still not universal; perfect prediction is not guaranteed.
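
A minimal sketch of an LSTM forecaster of the kind studied above: sliding windows of past closing prices predict the next close. The window length, layer sizes, and synthetic prices are assumptions, not the studied model's configuration.

```python
import numpy as np
from tensorflow.keras import layers, models

prices = np.cumsum(np.random.randn(500)) + 100.0  # synthetic closing prices
window = 30

X = np.array([prices[i:i + window] for i in range(len(prices) - window)])
y = prices[window:]
X = X[..., None]  # shape: (samples, timesteps, features)

model = models.Sequential([
    layers.Input(shape=(window, 1)),
    layers.LSTM(32),
    layers.Dense(1),  # next-day closing price
])
model.compile(optimizer="adam", loss="mse")
model.fit(X, y, epochs=5, batch_size=32, verbose=0)
```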

2019

Afsar, Tazin – Chest X-Ray Segmentation Utilizing Convolutional Neural Network (CNN)

This project analyzes the chest X-ray segmentation process using an improved Attention Gate (AG) U-Net architecture. The model suppresses irrelevant regions and highlights the useful features for the targeted relevant tasks. It also takes fewer computational resources and adapts automatically to the different sizes and shapes of targets in chest X-ray images. The AG with U-Net increased the model's sensitivity and accuracy. This experiment is a continuation of existing work on the "RSNA Pneumonia Detection Challenge" [27]. The proposed architecture is analyzed using two renowned chest X-ray datasets: Montgomery County and Shenzhen Hospital. Experimental results show improvements in Dice score and accuracy of 1.0% and 4.0%, respectively, compared with the existing standard U-Net architecture.

Ahmed, Sayed – Effect of Dietary Patterns on Chronic Kidney Disease (CKD) Measures (ACR), and on the Mortality of CKD Patients

Chronic Kidney Disease (CKD) leading to End-Stage Renal Disease (ESRD) is very prevalent today: over 37 million Americans have CKD, and CKD/ESRD and interrelated diseases cause a majority of early deaths. Many research studies have investigated the effects of drugs on CKD. However, less attention has been given to the effect of dietary patterns on CKD. This research study uncovered significant correlations between dietary patterns and CKD mortality, as well as with diagnostic markers for CKD such as the Albumin to Creatinine Ratio (ACR). In this project, dietary surveys from NHANES and the CKD mortality dataset from USRDS were utilized to study the correlation between dietary patterns and the morbidity of CKD patients. Principal Component Analysis and regression were utilized to find the effect, and machine learning approaches including regression and Bayesian methods were applied to predict ACR values. Grains and other vegetables showed positive correlations with mortality, whereas alcohol, sugar, and nuts showed negative correlations. ACR values were not found to be strongly correlated with dietary patterns. For ACR value prediction, 10-fold cross-validation with polynomial regression showed 95% accuracy.

Barolia, Imran – Synonym Detection with Knowledge Bases

This study presents distributed and pattern-based approaches to identify similar words in given tweets, using low-level vector embeddings in a vector space model. For the distributed approach, a bilinear scoring function is computed, score(u, v) = x_u · W · x_v^T, where x_u is the embedding of a potential source word and x_v is the embedding of a knowledge-base seed. Synonym seeds are taken from an existing knowledge base (WordNet), and additional synonyms are generated that are not present in the knowledge base but are potential synonyms in the given corpus. A term-relevance computation algorithm is also used to identify synonyms that are specific to the corpus. The second approach presented is pattern-based: a co-occurrence matrix is prepared, and the probability of x_u and x_v occurring within a window size of 10 is calculated; low-level embeddings are learned using the conditional probability of x_u and x_v. Results are presented for both approaches, and the best result is achieved by combining them. The approaches were evaluated by regenerating the same and additional synonyms from the dataset and checking them against the existing knowledge base. Using the distributed and pattern-based approaches with the bilinear scoring function and conditional probability, precision and recall were 74% and 55% respectively, which compares well with another study that found 60% precision and lower recall.
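
A sketch of the bilinear scoring function score(u, v) = x_u · W · x_v^T, with hypothetical embedding dimensions and random values standing in for the learned interaction matrix W:

```python
import numpy as np

dim = 50
rng = np.random.default_rng(0)

x_u = rng.normal(size=dim)        # candidate source word embedding
x_v = rng.normal(size=dim)        # knowledge-base seed embedding
W = rng.normal(size=(dim, dim))   # learned bilinear interaction matrix

score = x_u @ W @ x_v             # scalar synonym-compatibility score
print(score)
```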

Boland, Daniel – Battery Dispatching for Peak Shaving Using Reinforcement Learning Approaches

Economic dispatch of energy resources such as batteries is an important and current problem. We apply three reinforcement learning algorithms, the Monte Carlo on-policy and off-policy algorithms and the DynaQ planning algorithm, to a load-connected battery with time-of-use charges and a demand rate, to study the agent's ability to converge toward a least-cost policy that includes peak shaving. In two simple cases we use a fixed daily load profile, and in a third case we use 31 days of data to reflect uncertainty in the demand. In the simple cases, we observe that the Monte Carlo agents converge more quickly and achieve better savings than the DynaQ agent, but all agents typically yield savings of only 40-50% of what is demonstrated to be possible after a 10,000-episode training time. The DynaQ agent significantly outperforms the Monte Carlo agents in the case of 31 days of data, highlighting planning behaviour by reserving some charge and consistently achieving a higher degree of peak load reduction.

Cai, Yutian - Musculoskeletal Disorders Detection With Convolutional Neural Network

Musculoskeletal disorder is a common cause of chronic pain and movement impairment and is diagnosed with medical imaging technologies such as X-rays. Due to the limited supply of skilled radiologists, detection is expensive and time-consuming. In this project, we propose a model using machine learning and neural network techniques to perform the same task as radiologists in detecting abnormalities in musculoskeletal X-rays. Musculoskeletal radiographs (MURA) is a large open-source radiograph image dataset that is used to develop and test our model. It contains labeled images for training and validation, as well as a hidden test set to evaluate the model. Python is employed in the project, as it offers a variety of packages from statistical analysis to data visualization. We hope that the model can distinguish between normal and abnormal X-ray studies and lead to significant advances in medical imaging technologies.

Choi, Claudia – Using Deep Learning and Satellite Imagery to Predict Road Safety

This paper expands on previous work combining satellite imagery and deep learning to predict road safety. Studies have shown support for the hypothesis that features of the built environment have an impact on city-scale issues and can be observed through satellite imagery. In this paper, a labelled dataset of satellite imagery was generated for the City of Toronto. Class balancing techniques were then used to mitigate model bias, and the best technique was used for the experiments. A Convolutional Neural Network (CNN) was trained for overall road accidents, pedestrian accidents, and cyclist accidents. Each CNN model followed the ResNet50 architecture pre-trained on ImageNet. The resulting high accuracy scores and low macro F1 scores indicate model sensitivity toward the majority class. The models were able to use observable features of the built environment to predict 'highly safe' regions but show poor performance on regions labelled as 'highly risky'.

Chowdhury, Mushfique – Forecasting Sales and Return Products For Retail Corporation and Bridging Among Them

The purpose is to show how we can bridge sales and return forecast data for every product of a retail store by using the best model among several forecasting models, and how management can utilize this information to improve customer satisfaction and inventory management or to re-define after-sales support policy for specific products. A method for multi-product sales and return forecasting that chooses the best forecasting model for every product is shown. Several machine learning algorithms have been used: ARIMA, Holt-Winters, STLF, Bagged Model, timetk, and Prophet. For every product, the best forecasting model was chosen after comparing all of these models to generate sales and return forecast data, which was then used to classify every product as "Profitable", "Risky", or "Neutral". The experiment identified 3% of all products as "Risky" items in the future. Management can use this information to make crucial decisions. This paper showed how to compare different models to choose the best one for each product and dynamically generate sales and return forecast data, without focusing heavily on optimizing the models. This is a completely new approach to utilizing sales and return forecast data, giving management unique insight for making informed decisions on the crucial aspects identified above.

Ensafi, Yasaman – Neural Network Approach For Seasonal Items Forecasting of a Retail Store

In recent years, there has been growing interest in the field of neural networks. However, for the task of seasonal time-series forecasting, which has many real-world applications, different studies have shown varied results. In this paper, the performance of neural network methods in seasonal time-series forecasting is compared with other methods. First, classical time-series forecasting methods like Seasonal ARIMA and Triple Exponential Smoothing are used; then, more current methods such as the recently published Prophet model, Long Short-Term Memory (LSTM), and Convolutional Neural Network (CNN) are applied. The dataset is public and consists of the sales history of a retail store. The performance of the different models is compared using accuracy measures such as RMSE and MAPE. The results showed the superiority of the stacked LSTM over other methods and also indicated the good performance of the Prophet and CNN models.

Etwaroo, Rochelle - A Non-Factoid Question Answering System for Prior Art Search

A patent gives the owner of an invention the exclusive rights to make, use, and sell their invention. Before a new patent application is filed, patent lawyers are required to engage in prior art research to determine the likelihood that an invention is novel or valid, or to make sense of the domain. To perform this search, existing platforms utilize keywords and Boolean logic, which disregard the syntax and semantics of natural language, making the search extremely difficult. Studies addressing semantics using neural embeddings exist, but these only consider a narrow number of unidirectional words. As a solution, we present a framework that considers bidirectional semantics, syntax, and the thematic nature of natural language. The contribution of this paper is two-fold: a BERT pre-trained embedding is used to address the semantics and syntax of language, followed by a second component that uses topic modelling to return a diverse combination of answers covering all themes across domains.

Hosmani, Chaitra – User Interest Detection in Social Media Using Dynamic Link Prediction

Social media provides a platform for users to interact freely and share their opinions and ideas. Several studies have been conducted to predict user interests in social media. Because of the dynamic nature of social media, user interests change over time. In this paper, given a set of emerging topics and users' interest profiles over these emerging topics, we are interested in predicting users' interest profiles in the future. We conducted this experiment on Twitter data captured over two months, from 1 November 2011 to 1 January 2012. We use a temporal latent space to infer characteristics of users and then predict users' future interests over the given topics. We evaluate the results with different ranking metrics, such as MAP and nDCG, and compare our results with those of Zhu et al.'s temporal latent space, which uses the same methodology on a different dataset.

Islam, Samiul – Product Backorder Prediction Using Machine Learning Techniques to Minimize Revenue Loss With Efficient Inventory Control

Prediction of product backorders boosts companies' revenues in many ways. In this work, we predicted product backorders using two machine learning models, Distributed Random Forest (DRF) and Gradient Boosting Machine (GBM), on the H2O platform, and compared their performance. We observed that the GBM successfully identified approximately 94 out of every 100 products that go on backorder. We noticed that the current stock level and the lead time of products act as the deciding factors of backorder in approximately 45% of cases. We have shown how this model can be used to predict probable backorder products before the backorder actually happens, and to visualize the impact on inventory management. Moreover, we identified that a decision threshold below 0.3 for high-probability backorder products, and a threshold between 0.2 and 0.8 for low-probability backorder products, maximizes organizational profit.

House-Senapati, Kristie - The Use of Recommender Systems for Defect Rediscoveries

Software defects are a known issue in the world of technological advancement. They lead to the disruption of services for a customer, which in turn results in customer dissatisfaction. It is not feasible for all customers to install a fix for every known defect, as this requires extra resources. Our goal is to predict which future defects a customer may discover, so that a fix can be put in place before the customer discovers the defect. We use recommender systems to build a predictive model and evaluate our approach with publicly available datasets mined from Bugzilla (Apache, Eclipse, and KDE). The datasets contain information about approximately 914,000 defects over a period of 18 years. From our experiments, we find that the popular algorithm performs best, with an average Matthews Correlation Coefficient of 0.051. We also observe that the Funk SVD, apriori, eclat, and random algorithms perform poorly.

Husna, Asma - Demand Forecasting in Supply Chain Management Using Different Deep Learning Methods Supply Chain Management (SCM) is a fast-growing and widely studied field of research that is gaining in popularity and importance. Most organizations focus on cost optimization and maintaining optimal inventory levels for consumer satisfaction, and Machine Learning techniques can help these companies do so. The main goal of this paper is to forecast the unit sales of thousands of items sold at different chain stores located in Ecuador. Three deep learning approaches, Artificial Neural Network (ANN), Convolutional Neural Network (CNN) and Long Short-Term Memory (LSTM), are adopted here for better predictions on the Corporación Favorita Grocery Sales Forecasting dataset collected from the Kaggle website. Finally, the performances are evaluated and compared. The results show that the LSTM network tends to outperform the other two approaches. All experiments were conducted using Python's deep learning libraries Keras and TensorFlow.

Lee, Veson – Estimating Volatility Using A LSTM Hybrid Neural Network Volatility estimates of market-traded financial instruments are used in risk management models and portfolio selection. Hybrid neural networks combine a traditional parametric model such as GJR-GARCH with a neural network component and have been shown to improve volatility predictions. This paper examines hybrid neural networks incorporating two different neural architectures, one with an LSTM, without exogenous explanatory variables, and measures their performance using data from the Toronto Stock Exchange and S&P 500. We found that, in a neural network without exogenous explanatory variables, hybridizing the network by incorporating the parameters from a GJR-GARCH(1,1,1) model does seem to have some possible benefits.

Matta, Rafik - Deep Learning to Enhance Momentum Trading Strategies using LSTM on the Canadian Stock Market Applying machine learning techniques to historical stock market data has recently gained traction, mostly focused on the American stock market. We add to the literature by applying similar methods to the Canadian stock market, focusing on time series analysis for basic momentum as a starting point. We apply long short-term memory networks (LSTMs), a type of recurrent neural network, and compare the results of an LSTM to a logistic regression (LOG) approach as well as a basic momentum strategy for portfolio formation. Our results show that the LSTM financially outperforms both the LOG and basic momentum strategies; however, the area under the receiver operating characteristic curve shows the results do not outperform a random walk selection. We conclude that there might not be enough data in monthly returns for the LSTM in its current configuration.
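
The LSTM forecasters in the Husna, Lee and Matta abstracts above share a common skeleton: slice the series into fixed-length windows, feed them to an LSTM layer, and regress the next value. A minimal Keras sketch under those assumptions, with a synthetic series standing in for the real sales or market data:

import numpy as np
import tensorflow as tf

rng = np.random.default_rng(1)
series = np.sin(np.arange(500) / 10) + rng.normal(0, 0.1, 500)

# Window the series: each sample is `lookback` past values, target is the next one.
lookback = 30
X = np.stack([series[i:i + lookback] for i in range(len(series) - lookback)])
y = series[lookback:]
X = X[..., None]  # shape: (samples, lookback, 1)

model = tf.keras.Sequential([
    tf.keras.layers.LSTM(32, input_shape=(lookback, 1)),
    tf.keras.layers.Dense(1),
])
model.compile(optimizer="adam", loss="mse")
model.fit(X, y, epochs=5, batch_size=32, verbose=0)
print("one-step forecast:", float(model.predict(X[-1:], verbose=0)[0, 0]))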

Natarajan, Rajaram - Road Networks – Intersections and Traffic For this MRP, research will be conducted on traffic and congestion in a road network. The goal is to identify the most critical intersections and to look for ways to improve traffic, as well as what can make traffic worse. I will build a simulation model of the city road network and its simulated traffic; based on the simulation, I will show the areas with the highest and lowest congestion and the areas with the most efficient traffic flows. I will also evaluate the impact on traffic when a critical node is brought down, both on high-congestion areas and on the overall network, and I will discuss the connections and relationships between the most critical nodes. Based on this research, recommendations will be provided on changes to the road network, including new road or bridge construction, that reduce traffic, and on ways to reduce congestion. I will be using the Ottawa-Gatineau dataset and the Julia language.

Ozyegen, Ozan – Experimental Results on the Impact of Memory in Neural Networks for Spectrum Prediction in Land Mobile Radio Bands Land-mobile radio (LMR) systems support many vital communication functions for government and private operators, some related to public safety and mission-critical functions. The models produced will help in understanding usage patterns at different time periods to predict occupancy and demand by different channels across the spectrum. CRC (Canadian Research Corporation) is providing Layer 3 data sampled every hour; this data is further explored and processed in this MRP. A powerful learning algorithm called Long Short-Term Memory networks is used to predict the occupancy of LMR bands over multiple time horizons. The results are compared with a seasonal ARIMA model and a Time Delay Neural Network. Results show that LSTM prediction models, which remember long-term dependencies and are thus designed to work with time series data, provide a better alternative for accurately predicting spectrum occupancy in bands that exhibit characteristics similar to LMR channels, especially as the forecast horizon gets longer.

Patel, Eisha – Generating Stylistic Images Using Convolutional Neural Networks Fine arts have long been considered a reserved mastery of a talented minority in society. The ability to create paintings using unique visual components such as color, stroke and theme is currently beyond the reach of computer algorithms. However, there exist algorithms capable of imitating an artist's painting style and stamping it onto virtually any image to create a one-of-a-kind piece. This paper introduces the concept of using a convolutional neural network (ConvNet or CNN) to separate and recombine the style and content of arbitrary images to generate perceptually striking "art" [2]. Given a content and a style image as reference, a pre-trained VGG-16 ConvNet can extract feature maps from various layers. Feature maps hold semantic information about both reference images. Loss functions can be developed for content and style by minimizing the mean-square error between the feature maps used. These loss functions can be additively combined and optimized to render a stylistic image [6]. This technique is called Neural Style Transfer (NST), and it was originally developed by Leon Gatys in his 2015 research paper, "A Neural Algorithm of Artistic Style". My MRP research attempts to replicate and improve upon the work done by Gatys. The purpose of this research is to experiment with a variety of feature maps and tweaks to the loss function to identify visually appealing results. A total variation loss term is also included to minimize pixelation and sharpen feature formation. Generated images were assigned a Mean Opinion Score (MOS) by a group of unbiased individuals to affirm the attractiveness of the results.

Peachey Higdon, Ben – Time-Series-Based Classification of Financial Forecasting Discrepancies We aim to classify financial discrepancies between actual and forecasted performance into the categories of commentary that an analyst would write when describing the variation. We propose analyzing the financial time series leading up to a discrepancy in order to perform the classification. We investigate which models are best suited to this problem. Two simple time series classification algorithms, 1-nearest neighbour with dynamic time warping (1-NN DTW) and time series forest, and long short-term memory (LSTM) networks are compared to common machine learning algorithms. We perform our experiments for two cases: binary and multiclass classification. We examine the effect of including supporting datasets such as customer sales data and inventory. We also consider augmenting the data with noise as an alternative to random oversampling. The LSTM and 1-NN DTW models are found to be the strongest, suggesting that the time series approach is appropriate. Including the inventory dataset improves multiclass classification. Data augmentation grants a slight improvement over random oversampling for some models.

Postma, Cassandra - Netflix Movie Recommendation Using Hybrid Learning Algorithms and Link Prediction Netflix, a streaming service that allows customers to watch a wide variety of movies, is constantly updating and optimizing its search and recommendation algorithms to improve the user experience. The aim of this paper is to recommend movies to users using different link prediction methods and to predict a user's movie rating using a comparison of various learning algorithms. First, an exploratory analysis was conducted to find correlations between variables and users. Then several algorithms, such as KNN, SVM, and hybrid learning algorithms, were used to predict a user's movie rating. Finally, the data was represented as a graph and several link prediction algorithms were run to compare different recommender systems.

Ragbeer, Julien – Peak Tracker IESO is the Crown corporation responsible for operating the electricity market in the province of Ontario, Canada. IESO publishes ever-changing forecasts of what it expects Ontario electrical demand to be in the near future. In this paper, we focus on short-term time-series forecasting (within 24 hours). This solution aims to forecast better than IESO, so that large commercial customers can be more certain about the upcoming demand and about when to shave power if they are Class A customers. The solution combines and aggregates many data sources (weather forecast data, historical weather data and historical demand data). The project uses numerous regressors (both linear and non-linear) on the aggregated data to come up with a prediction, which is compared to IESO's forecast using three metrics: the coefficient of determination, the mean absolute error, and the number of times it correctly predicts the hour of the highest daily demand. The results of this paper (10%-40% more accurate than IESO in some cases) show that there is value in out-predicting IESO's free model: being more accurate can have a positive effect on the bottom line.

Raja, Abdur Rehman – Rating Prediction of Movielens Dataset In the modern world, convenience has become the biggest factor in our day-to-day lives. Due to the overwhelming choices each consumer has, there is a need to filter, prioritize and efficiently deliver recommendations to consumers. This project looks at one of the most famous datasets provided for research purposes by GroupLens, called MovieLens 20M. GroupLens is a research lab trying to advance the theory and practice of social computing. GroupLens has collected and made available rating datasets from the MovieLens website, a free movie recommendation service. The project looks at finding the best solution for predicting movie ratings to be used in a recommender system. One of the main algorithms we use, and discuss in detail, is BellKor's solution, the algorithm the winner used to predict movie ratings in the Netflix competition. BellKor's solution is compared with other algorithms to find the algorithm best suited to this dataset.

Roginsky, Sophie – Radio Coverage Prediction in Urban Environment Using Regression-Based Machine Learning Models Having a reliable predictive model of radio signal strength is an essential tool for planning and designing a radio network. The propagation model is often used to determine the optimal location of radio transmitters in order to optimize the power coverage in a geographic area of interest. This research proposes a Generalized Linear Model for radio signal strength prediction. Using feature engineering methods, the performance of the linear model was optimized to offer predictive accuracy comparable to the more complex regression models found in the existing literature, i.e. Multi-Layer Perceptrons and Support Vector Regressors. Beyond computational efficiency, the advantage of the GLM is that it is linear in its parameters, making it a viable option for coverage optimization applications.

Saeed, Usman - Digital Text Forensic: Bots and Gender Classification on Twitter Data This research work describes the contribution of the Data Science department of Ryerson University, Canada to the bots and gender profiling task at the CLEF PAN-19 evaluation lab. The goal of this paper is to detect (A) whether the author of a tweet is a bot or a human, and (B) if human, the gender of that particular author. The dataset was made available by the PAN lab. We participated in the English-language portion of the task only. In the proposed approach, before applying machine learning models, we used different word vectorization techniques after applying various preprocessing steps (stemming, stop-word removal, lowercasing, etc.) to the dataset. On the independent evaluation on the PAN lab test dataset, our best accuracies were 79.51% on task A (binary classification) using MultinomialNB and 56.55% on task B (multi-class classification) using a Decision Tree classifier.

Sokalska, Iwona - Boosting Bug Localization with Visual Input and Self-Attention Deep Learning (DL) methods have been shown to achieve higher Mean Reciprocal Rank (MRR) scores in bug localization than Information Retrieval (IR) methods alone. A combination of both can boost scores by 6%, to an MRR of 48%. The DL model consists of a Recurrent Neural Network (RNN). In natural language research, it has been demonstrated that RNNs with visual input and an 'attention mechanism' are more robust at tasks that require incorporating distant information. The objective is to examine whether an RNN with an attention mechanism using images of code snippets can achieve higher scores than an RNN alone, and whether the improvement is in a similar range to that between a standalone RNN and RNN + IR. Using data gathered from the open-source Spring-Boot project, covering 2013-2018, a baseline RNN model was compared to an enhanced RNN with a supporting convolutional neural network that analyses an image of the source code. A 5-fold experiment was conducted to compare the baseline model with two test models, which differ only in the use of self-attention in the convolutional branch. The test model with self-attention had the highest mean accuracy across the 5 folds, 61.98, compared to 60.70 for the base model. A two-tailed Welch t-test reveals that this difference between the means is not statistically significant. In contrast, the IR methods on average provided a 6% boost to the scores.

Tabassum, Anika – Developing a Confidence Measure Based Evaluation Metric for Breast Cancer Screening Using Bayesian Neural Networks Screening mammography is the gold standard for detecting breast cancer early. While a good amount of work has been done on mammography image classification, and many recent efforts have successfully used deep neural networks, there has not been much exploration into the confidence or uncertainty measurement of the classification, especially with Bayesian neural networks. In this paper, we propose a new evaluation criterion based on confidence measurement for breast cancer mammography image classification, so that in addition to classification accuracy, it provides a few numeric parameters that can be tuned to adjust the confidence level of the classification. We demonstrate the use of Bayesian neural networks and transfer learning in the process of achieving that. We also demonstrate the expected behaviour resulting from tuning the parameters and conclude that the approach is extendable to any domain in general and any number of classes.
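
The two-tailed Welch t-test the Sokalska abstract above uses to compare per-fold accuracies is a one-liner in SciPy; the fold scores below are illustrative placeholders, not the study's numbers:

from scipy import stats

base_model = [60.1, 61.0, 60.5, 60.9, 61.0]       # 5-fold accuracies, baseline RNN
attention_model = [61.5, 62.3, 61.8, 62.1, 62.2]  # 5-fold accuracies, self-attention model

# equal_var=False gives Welch's t-test, which does not assume equal variances.
t_stat, p_value = stats.ttest_ind(attention_model, base_model, equal_var=False)
print(f"t={t_stat:.3f}  p={p_value:.3f}")
# A p-value above the chosen alpha (e.g. 0.05) means the difference in mean
# accuracy is not statistically significant.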

Zhang, Shulin – Artificial Neural Networks in Modelling the Term Structure of Interest Rates In this paper, we applied Artificial Neural Networks (ANN) to model the term structure of interest rates. In the exploratory analysis we observed the trend of the yield curve since 1991 to understand the underlying pattern. Principal Component Analysis (PCA) is employed to construct the input dataset as well as to serve as a baseline model. We used different hyper-parameters, a customized loss function and regularization to tune the ANN model. The results section discusses the selection of the best model and the prediction differences between PCA and ANN. The ANN can match PCA results only in a very limited case of strong regularization. The ANN has the potential to replace PCA, but a careful design review is needed. This project is a continuation of an existing study that Dr. Alexey Rubtsov started for the Global Risk Institute. The predictive analysis is used to provide more insight into financial applications of ANNs.

Zhao, Xin – Station Based Bike Sharing Demand Prediction Bike sharing has been increasing in popularity in recent years due to its usage flexibility and its reduction of traffic congestion and carbon footprint. Being able to accurately predict each bike-sharing station's demand at any given hour is crucial for inventory management. This report first combined Bike Share Toronto ridership data with Toronto City Center weather data from 2016 Quarter 4 to 2017 Quarter 4, then implemented machine learning algorithms, in particular Regression Trees, Random Forest, and Gradient Boosting Machine (GBM), to forecast station-based hourly bike-sharing demand in the City of Toronto. The results, comparing the Root Mean Square Error (RMSE) across all bike-sharing stations, indicated that the Random Forest based prediction model was the most accurate.
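
A minimal sketch of the three-way comparison in the Zhao abstract above, ranking a regression tree, a random forest and a GBM by RMSE; synthetic data stands in for the joined Toronto ridership and weather dataset:

import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import GradientBoostingRegressor, RandomForestRegressor
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

X, y = make_regression(n_samples=2000, n_features=8, noise=10, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

models = {
    "Regression tree": DecisionTreeRegressor(random_state=0),
    "Random forest": RandomForestRegressor(n_estimators=200, random_state=0),
    "GBM": GradientBoostingRegressor(random_state=0),
}
for name, m in models.items():
    m.fit(X_tr, y_tr)
    rmse = mean_squared_error(y_te, m.predict(X_te)) ** 0.5  # root of MSE
    print(f"{name}: RMSE={rmse:.1f}")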

2020

Abdolali-Senejani, Ali – Investigating the Challenges of Building a Robust Network Intrusion Detection System Through Assessment of Features and Machine Learning Models Background: Existing literature on network intrusion detection systems has focused on testing transfer learning models on different portions of the same dataset. The efficacy of transfer learning needs to be assessed on different sources of data. Aim: Use a combination of feature selection and transfer learning strategies to evaluate the performance of machine learning models on distinct network datasets. Methodology: Select common features among the different datasets identified as important, and use them to perform transfer learning experiments: training models on one dataset and evaluating their performance on the others. Results: Feature importance tools illustrated that the models were using irrelevant features to make decisions. Transfer learning experiments yielded poor results when tested on two distinct datasets. Conclusion: Dropping irrelevant features improved the performance of the models. Poor transfer learning results could be associated with factors such as large variations in the datasets' creation dates, leading to significant differences in the workloads under study.

Ahmed, Sabbir – Identifying White Blood Cell Sub Type from Blood Cell Images Using Deep Learning Algorithms White blood cell (WBC) differential counting is an established clinical routine for assessing patient immune system status. Fluorescent markers and a flow cytometer are required by the current state-of-the-art method for determining WBC differential counts. However, this process requires several sample preparation steps and may adversely disturb the cells. We present a novel approach using deep learning algorithms to identify the subtype of a white blood cell from a blood cell image. Two deep learning classifiers were evaluated on stain-free imagery using stratified 5-fold cross-validation. On the white blood cell dataset, the best result obtained was 84% accuracy. We propose a model to identify white blood cell subtypes after evaluating a series of deep learning algorithms.

Anumanchineni, Harish – Cardiovascular Risk Detection Using Machine Learning and Artificial Neural Networks In this paper, I applied machine learning and artificial neural networks to predict an individual's risk of cardiovascular disease. An individual's day-to-day living habits were considered as the features of the dataset, and 70,000 records collected from kaggle.com were used for the research. Major classification machine learning algorithms were applied for risk prediction, and the dataset was also trained on an Artificial Neural Network for better training and more accurate risk prediction. I initially conducted exploratory data analysis to learn how the data is distributed, and a literature review to gather useful insights before proceeding with the research. I then applied machine learning and neural networks to the dataset, experimented with a reduced set of key features for cardiovascular risk, and used TensorBoard to monitor training at each iteration of the neural networks. The risk was successfully predicted with an accuracy of 73 percent.

Araujo, Gregory – Transport Mode Detection Using Deep Learning Networks With the advancement of technology, large amounts of global positioning system (GPS) trajectory data are being produced and recorded by many devices and products. This has made learning transportation modes a relevant area of research, given its applications to everyday life. Depending on the mode of transportation, factors such as location and weather can have an impact on the information collected from GPS trajectories. This project focuses on combining features extracted from GPS trajectories collected via the MTL Trajet project with geospatial and temporal attributes provided by the Government of Canada to detect transport mode (bike, car, public transportation, walking). This project examines the effectiveness of Convolutional Neural Networks (CNN), Long Short-Term Memory networks (LSTM) and Convolutional LSTM networks (ConvLSTM) in transport mode detection.
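
A minimal sketch of the stratified 5-fold protocol mentioned in the Ahmed abstract above, shown on synthetic tabular data with a linear classifier; the actual study evaluates deep image classifiers, so this only illustrates the cross-validation mechanics:

import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score

# Four classes as a stand-in for white blood cell subtypes.
X, y = make_classification(n_samples=1000, n_classes=4, n_informative=8,
                           random_state=0)

# Stratified folds preserve the class proportions in every fold.
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
scores = cross_val_score(LogisticRegression(max_iter=1000), X, y, cv=cv,
                         scoring="accuracy")
print(f"fold accuracies: {np.round(scores, 3)}  mean={scores.mean():.3f}")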

Bagheri, Moeen – Multi-Step Forecasting of Walmart Products Forecasting future sales is important to retailers for managing inventory and making marketing decisions. Product sales are affected by many external factors, which must be considered when forecasting future sales. In this paper, the effects of these factors were directly taken into account in the four models created: three singular models, consisting of LSTM, MLP, and LGBM, as well as a hybrid model. The hyperparameters of the singular models were optimized using Bayesian optimization. Furthermore, we aimed to provide 28-day-ahead sales forecasts by forecasting one day at a time. The LGBM model achieved the best performance, followed by the hybrid model. The outstanding results of the LGBM model show the potential of boosting methods for improving overall performance. Moreover, the LSTM model was able to outperform the MLP model, which demonstrates the ability of LSTM networks to learn from time-series data.

Beilis Banelis, Aleksander – Loan Outcome Prediction in P2P Lending This study utilizes the book of loans from a peer-to-peer lending platform called Lending Club to determine whether machine learning can be applied to predict the final outcome of an already approved loan, so that losses from bad loans can be minimized. Three different machine learning algorithms are applied: Logistic Regression, Random Forest, and Feedforward Neural Network. The dataset presents the challenges associated with imbalanced classes and high class overlap among the features; these are addressed through under-sampling for model training and cost-sensitivity for model optimization. Results indicate that the benefits of the prediction models are almost entirely eliminated by the costs of incorrect predictions for good loans. The same approach can yield a different result where error costs differ.

Chane, Gagandip – Smart Reply for Online Patient-Doctor Chats Telehealth is an evolving field that enables remote medical services through the use of technology. In this study, we propose a smart reply system for the medical domain to support online doctors through automated reply suggestions for patient messages, utilizing historical online chats between doctors and patients. We first showcase a data labeling exercise to transform the raw data into a usable format. We then design a triggering model, a binary classification task that detects patient messages that are good candidates for reply suggestion, and a reply suggestion model, a multinomial classification task that predicts the top responses for a given patient message. For the triggering model, a feedforward neural network outperformed the random forest classifier. For reply suggestion, LSTM-based models slightly outperformed feedforward neural networks on average but had a significantly greater run-time. The results demonstrate that it is possible to achieve a high-performing smart reply system in the medical domain.
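
A minimal sketch of the triggering step in the Chane abstract above: a binary text classifier deciding whether a patient message should trigger reply suggestions. The toy corpus is invented, and the TF-IDF plus logistic regression pipeline is a stand-in for the study's feedforward network and random forest:

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

messages = [
    "I have a headache and fever",        # good candidate for a suggested reply
    "thanks doctor",                      # not a candidate
    "my prescription ran out yesterday",  # good candidate
    "ok great",                           # not a candidate
]
triggers = [1, 0, 1, 0]

clf = make_pipeline(TfidfVectorizer(), LogisticRegression())
clf.fit(messages, triggers)
print(clf.predict(["I feel dizzy after taking the pill"]))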

Chow, Roger – IDC Prediction in Breast Cancer Histopathology Images Invasive ductal carcinoma (IDC) is the most common form of breast cancer, accounting for 80% of all breast cancer diagnoses. Manual diagnosis of IDC by a pathologist examining histopathology slides is a tedious, time-consuming process. With advances in whole-slide image scanner technology, there is growing interest in the automation of IDC detection. In this paper, experiments were conducted to develop a deep convolutional neural network model for IDC classification on breast cancer histopathology image patches. An open-source IDC dataset containing 277,524 labelled image patches (70% IDC negative and 30% IDC positive) was used for this study. The highest performing model built in this study achieved an accuracy of 91.05%, a balanced accuracy of 90.12% and a recall rate of 87.98%, using the DenseNet201 architecture with cyclic learning.

Chowdhury, Mohsena – Investigating Organizational Crisis Using Text Mining Technique In any organization, crisis handling is a crucial factor: a crisis can be critical and may even cause the business to collapse completely. Our study suggests that strong communication can help manage an organizational crisis sensibly. For this project we use the email corpus of Enron Corp. The aim of this research project is to find patterns in organizational email logs that identify a crisis before it appears, using text mining methods. Our approach utilizes Latent Dirichlet Allocation (LDA), a popular topic modeling technique, to determine the topical distributions of emails. We use two different packages (Gensim and Mallet) to compute the LDA models and compare the results with LSA/LSI. Finally, we evaluate these three models by measuring their coherence scores and select the optimal model based on the score. We also analyse the resulting topics to determine whether any pattern indicates a potential crisis.

Cohen, Rory – Predicting the Profitability of Canada's Big Five Banks: Predictive Analysis With Google Trends and Twitter Google Trends and Twitter sentiment analysis have been used by analytics researchers in a variety of studies to predict sales, market-specific data, and customer consumption. This study aims to estimate the quarterly profitability ratios of Canada's Big Five Banks using the aforementioned data. Differing from similar reports that attempt to use consumer behaviour in financial projections, we also use the Google Trends of each company's direct competitors to generate our predictions. A linear regression model was created as a baseline for metric prediction. Two subsequent neural networks were tuned and applied to the same datasets in an attempt to improve performance. Two cost-related metrics showed strong results using the trend and sentiment data. Ultimately, the findings suggest that factors outside of consumer behaviour have an outsized impact on most quarterly financial metrics: company resource allocation and management interests play a large role in the ratios within our scope.

Dhall, Ankit – Household Space Heating Demand Modelling Using Simplified Black-Box Models This research applies the novel idea of using an ANN (Artificial Neural Network) black-box model to predict the space heating demand of households in Toronto, Ontario, Canada. The data used was gathered as part of the Ecobee Donate Your Data program. First, an exploratory analysis is conducted using descriptive analytics and data visualization to find patterns or relationships that could give insight into the data. Multiple approaches and techniques, such as data aggregation and the inclusion of time-lag information, are then applied to model and predict the space-heating demand of any house using only basic, easy-to-record features. In addition, experiments are conducted to gauge the practical viability of the black-box model developed. This research was conducted as a continuation of an ongoing study at the Ryerson Centre for Sustainable Energy Systems (CSES). Despite a few issues with the data being modelled, space-heating demand was successfully predicted using black-box ANN models with simple, easy-to-observe features and time-lag information for the past half-hour. In addition, the model was able to demonstrate a practical learning capability as additional data was added. For future studies predicting space heating with the given data, it is recommended to apply data aggregation techniques and additional feature engineering, as well as to filter out only the relevant data using domain knowledge, in order to achieve better prediction results.

Emamidoost, Maryam – Application of Deep Learning in the Segmentation of the Brain Regions to Predict Alzheimer's Disease Among brain diseases, Alzheimer's Disease is ranked as the third leading cause of death in older adults, after heart disease and cancer, and the sixth overall in the United States. In this research, we aim to predict Alzheimer's Disease based on the structural components of the human brain. To this end, we create two Convolutional Neural Network models: the first for the segmentation of brain regions based on the Harvard-Oxford Atlas, and the second for the prediction of Alzheimer's Disease based on the segmented MRI images. The results of predicting AD from the segmentation model indicate that a link can be made between the structure of the brain and the appearance of Alzheimer's Disease.

Ewen, Nicolas – Self Supervision for Classification on Small Medical Imaging Datasets Traditionally, convolutional neural networks need large amounts of human-labelled data to train. Self supervision has been proposed as a method of dealing with small amounts of labelled data. The aim of this study is to determine whether self supervision can increase classification performance on small medical imaging datasets, and whether the proposed self supervision strategy is a viable option for them. A total of 8 experiments are run comparing the classification performance of the proposed method of self supervision with the performance of basic transfer learning. The experiments run with the proposed self supervision strategy perform significantly better than their non-self-supervised counterparts. The results suggest that self supervision can improve classification performance on small medical imaging datasets, and that the proposed self supervision strategy is a viable option for them.

Ghavifekr, Amin – Machine Learning Approach in Forex (Foreign Exchange) Market Forecasting In recent years, applying machine learning techniques to historical Foreign Exchange market data has gained a lot of attention. We contribute to the published literature by applying comparable methods to the four major currency pairs (EUR/USD, GBP/USD, USD/CHF, and USD/JPY), concentrating on time series analysis for trend and momentum predictions. We used Long Short-Term Memory networks (LSTMs), a form of recurrent neural network, to build our model and tested two methods of prediction: point-by-point prediction and multi-sequence prediction. Furthermore, we examined the use of more than one input dimension. Our results showed that the multi-sequence prediction method, together with multi-dimensional inputs, although not perfect, gives a clearer indication of future price trends.

Gupta, Vatsla – Automated Hate Speech Detection Using Deep Learning Models Social media is an interactive online platform where people express their opinions on various subjects, and it has become a prominent ground for toxic online hate behavior. Online hate speech detection has received significant attention due to the rise in cyberbullying across social media platforms, and it poses key challenges such as understanding the context in which words are used. To address this, we explore combinations of word embedding models (Keras word embeddings, Word2Vec and GloVe) with deep learning models (BiLSTM, CNN and CNN-BiLSTM) to capture the deeper semantics and syntactic construction of tweets, helping the models understand context and thereby aiding hate speech detection. First, the words in the tweets are converted into word vectors using the word embedding models. These word vectors are then fed into the deep learning models to effectively learn the context for hate speech classification. The experimental results showed that pairing each model with the different embedding matrices significantly improved accuracy and F1 score. Evaluated by 10-fold cross-validation, CNN-BiLSTM with the GloVe word embedding performed best. Also, to explain the predictions of the deep learning classifiers, a LIME analysis is performed to validate the models' credibility.

Hemel, Tahseen Amin – A Deep Learning Approach in Detecting Financial Fraud Being capable of detecting fraudulent transactions among all credit card transactions in real time is extremely important for financial institutions. According to McKinsey, worldwide losses from card fraud could be close to $44 billion by 2025. Thus it is challenging for financial institutions to quickly identify fraudulent transactions without hampering legitimate ones, in order to provide a superior customer experience to all stakeholders. In this project, I used a dataset of credit card transactions made in September 2013 by European cardholders (openly available on Kaggle) and conducted four different experiments with different classification approaches to identify fraudulent and legitimate transactions. To measure the performance of the classifiers in the different experiments, I used classification reports and confusion matrices from the 'sklearn' library as well as the RMSE score, and compared which experimental setup is more efficient at identifying fraud.

Houshmand, Bita – Facial Expression Recognition Under Partial Occlusion Facial expressions of emotion are a major channel in our daily communications, and they have been the subject of intense research in recent years. To automatically infer facial expressions, convolutional neural network based approaches have become widely adopted due to their proven applicability to the Facial Expression Recognition (FER) task. Meanwhile, Virtual Reality (VR) has gained popularity as an immersive multimedia platform where FER can provide enriched media experiences. However, recognizing facial expressions while the user wears a head-mounted VR headset is a challenging task, as the upper half of the face is completely occluded. In this project we attempt to overcome these issues and focus on facial expression recognition in the presence of the severe occlusion that occurs when the user is wearing a head-mounted display in a VR setting. We propose a geometric model to simulate the occlusion resulting from a Samsung Gear VR headset that can be applied to existing FER datasets. Then, we adopt a transfer learning approach, starting from two pretrained networks, namely VGG and ResNet, and further fine-tune the networks on the FER+, AffectNet and RAF-DB datasets. Experimental results show that our approach achieves comparable results to existing methods while training on three modified benchmark datasets that adhere to the realistic occlusion produced by wearing a commodity VR headset.

Ilic, Igor – Explainable Boosted Linear Regression Time series forecasting is a continuously growing research field. In recent times, there has been growth in the number of deep learning-based models. While these models are highly accurate, a trade-off is made in terms of model interpretability. Not only do deep learning methods present problems with interpretability, they are also more difficult to compare on single time-series datasets. This work alleviates these problems by presenting two novel ideas. First, a new approach to time series model comparison is introduced, allowing robust comparison in cases of lengthy model training times. Then, a new time series forecasting model, Explainable Boosted Linear Regression (EBLR), is presented. EBLR is compared to other ensemble methods and retains accuracy while reducing the complexity of its formulation.
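
A minimal sketch of the occlusion-simulation idea in the Houshmand abstract above: mask the upper part of a face image before training, as a crude rectangular stand-in for the paper's geometric Gear VR model; the mask fraction is an assumption, not the paper's geometry:

import numpy as np

def simulate_headset_occlusion(image: np.ndarray, fraction: float = 0.45) -> np.ndarray:
    """Zero out the top `fraction` of an (H, W, C) image, as if hidden by a headset."""
    occluded = image.copy()
    cut = int(image.shape[0] * fraction)
    occluded[:cut, :, :] = 0
    return occluded

face = np.random.rand(224, 224, 3)  # stand-in for a FER+/AffectNet/RAF-DB image
masked = simulate_headset_occlusion(face)
print("occluded fraction:", round(float((masked == 0).mean()), 3))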

Ioi, Kevin – Dirichlet Multinomial Mixture Models for the Automated Annotation of Financial Commentaries Supply chain managers require detailed reports to be written whenever performance deviates significantly from forecasts. These reports summarize and explain the key drivers of the variance, which are then used to inform future business decisions and forecasting. This paper proposes an automated system for the annotation of variance commentaries and applies machine learning models to classify time series instances of performance data with the topic-derived labels. Class labels manually annotated by an industry analyst are compared against the output of three topic modeling methodologies, namely LDA, GSDMM and GPUDMM. Various machine learning models are applied for the classification task, including LSTM, FCN, XGB and KNN-DTW. The numerical study shows that topic-derived labels achieve higher performance in the classification task than the baseline labels. The proposed system could save time and provide valuable insights for business management.

Ionno, Anthony – Benchmarking Machine Learning Prediction Methods on an Open Dataset of Hourly Electricity Data for a Set of Non-Residential Buildings Roughly 64.7% of Canada's annual electricity consumption was attributed to non-residential (commercial and industrial) buildings in 2016 [1]. Smart meters provide companies and building managers with an opportunity to track electricity consumption at an hourly or sub-hourly granularity. Efficient management of a non-residential building's electricity consumption is beneficial for the building manager in the form of bill savings; for the electricity system, since demand reduction in peak hours or demand shifting can defer or even prevent significant system infrastructure investment [2]; and for the environment in the form of reduced carbon emissions. The aim of this paper is to present a variety of supervised and unsupervised machine learning methods that might allow companies or building managers to better predict future electricity consumption and make more informed decisions about building operations. In this paper we train six machine learning models on one year's worth of hourly electricity data for each of the 828 non-residential buildings in our dataset. Randomised-search time-series cross-validation was used to determine optimal hyperparameters for each building and model combination. We also present a cluster analysis model as an exploratory technique for understanding how daily electricity load profiles can be grouped and compared in a variety of circumstances. Our test results show that, across our sample and each of the models tested, Mean Absolute Percentage Error (MAPE) varied considerably, likely due to significant differences in a building's electricity consumption patterns between the training and test sets. We also found that Gradient Boosting Decision Trees (GBDT) outperformed all the other machine learning models we tested by a significant margin.
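
A minimal sketch of the tuning protocol in the Ionno abstract above, pairing randomized hyperparameter search with time-series cross-validation for a gradient-boosted model; the synthetic hourly series and the parameter grid are placeholders for a building's smart-meter data and the paper's actual search space:

import numpy as np
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import RandomizedSearchCV, TimeSeriesSplit

rng = np.random.default_rng(0)
hours = np.arange(24 * 90)  # ninety days of hourly load
load = 50 + 10 * np.sin(2 * np.pi * hours / 24) + rng.normal(0, 2, hours.size)
X = np.column_stack([hours % 24, hours % (24 * 7)])  # hour-of-day, hour-of-week
y = load

search = RandomizedSearchCV(
    GradientBoostingRegressor(random_state=0),
    param_distributions={"n_estimators": [50, 100, 200],
                         "learning_rate": [0.01, 0.05, 0.1],
                         "max_depth": [2, 3, 4]},
    n_iter=5,
    cv=TimeSeriesSplit(n_splits=3),  # folds respect temporal order, no leakage
    scoring="neg_mean_absolute_percentage_error",
    random_state=0,
)
search.fit(X, y)
print(search.best_params_)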

Kabe, Devika – Text Highlighting to Improve Quality of Online Medical Services The medical domain is one that is often subject to information overload. With the constant updates to online medical repositories and the increasing availability of biomedical datasets, it is difficult to analyze the data in a structured way. This creates additional work for medical professionals, who are heavily dependent on medical data to complete their research. This paper applies different text highlighting techniques to capture relevant medical context. This would reduce doctors' cognitive load and response time to patients by helping them make faster decisions, thus improving the overall quality of online medical services. Two methods of highlighting text are examined. The first is Local Interpretable Model-Agnostic Explanations (LIME), applied to a number of classification models. The second is applying binary classification models to n-grams. These models are applied to different vector embeddings, including word2vec and BERT. The results of these experiments show that unigram classification models outperform LIME and can successfully be used to highlight medically relevant words. The results also show that performance drops as the models highlight bigrams and trigrams, so segment highlighting needs further analysis.

Kamei, Josephine – Predicting the Remaining Useful Life of the C-MAPSS Turbofan Engine Simulation Dataset FD001 Prognostics is employed in machinery maintenance where degradation patterns due to various mechanical problems are observed. It constantly monitors the current state of the machinery, helping predict the time remaining before a likely machinery or system failure, which is referred to as remaining useful life (RUL). This report focused on the C-MAPSS turbofan engine simulation dataset FD001, implementing Regression Decision Tree, Random Forest, and Gradient Boosting Regressor algorithms to predict RUL values. The results indicated that Random Forest produced the most accurate prediction model of the three algorithms.

Karami, Zahra – Cluster Analysis of Stock For Efficient Portfolio Management Stocks are a common kind of financial time series. In this project, I present a new similarity measure for time series clustering and then select a set of stocks to create an efficient portfolio, which is of crucial importance in the portfolio construction process. This method reduces the number of portfolio optimizations required by using clustering-based selection: each time, only a subset of stocks drawn from different groups is selected to create an efficient portfolio, making it easy to obtain the portfolio with the lowest risk at a given level of return. S&P index stocks were used for the current work; compared with other selection methods, the results show that this method can largely reduce the number of portfolio optimizations. Ward hierarchical clustering was used to cluster the stocks.
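
A minimal sketch of the clustering-based selection in the Karami abstract above: Ward hierarchical clustering over a distance derived from return correlations, then taking one stock per cluster as portfolio candidates. The returns are simulated, and the standard correlation distance here stands in for the paper's new similarity measure:

import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage

rng = np.random.default_rng(0)
n_stocks, n_days = 20, 250
# Factor-driven returns so that genuine clusters exist.
factors = rng.normal(0, 0.01, (4, n_days))
loadings = rng.integers(0, 4, n_stocks)
returns = factors[loadings] + rng.normal(0, 0.005, (n_stocks, n_days))

corr = np.corrcoef(returns)
dist = np.sqrt(0.5 * (1 - corr))  # correlation distance between return series
condensed = dist[np.triu_indices(n_stocks, k=1)]  # SciPy's condensed format

# Ward linkage formally assumes Euclidean distances; it is used here because
# the abstract names Ward clustering.
labels = fcluster(linkage(condensed, method="ward"), t=5, criterion="maxclust")
picks = [int(np.where(labels == c)[0][0]) for c in sorted(set(labels))]
print("one representative stock per cluster:", picks)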

Lalonde, Rebecca – Direct Marketing Modelling: Comparing Accuracy and True Positive Rates of Classification Models Many classification models are available: when predicting the result of a marketing campaign, which is the best to use? Metrics such as accuracy and true positive rate must be considered in order to maximize profit. This paper compares these metrics across various classification models. The dataset used is a Portuguese bank's campaign, found at UCI's Machine Learning Repository (Moro et al., 2014), which targeted customers through phone calls encouraging them to subscribe to a term deposit. The classification goal is to predict whether a customer will accept or decline. The analysis is coded in RStudio; the models examined include Naïve Bayes, logistic regression, decision trees, and SVM. Accuracy and true positive rates are compared with confusion matrices and ROC curves. The runtimes and interpretability of each model are also discussed.

Li, Vivian – Predicting Stock Market Volume Changes with News Article Topics In this project, news articles from Kaggle's "All the News" dataset are used to predict changes in the S&P 500 index trading volumes. The dataset was split into subsets to create 5 cross-validation sets and 3 test sets. To build the model, we started by extracting 100 main topics on peak days, where a peak day is defined as a day on which the change in trading volume falls outside the 95% confidence interval in the training set. From the 100 topics, a LASSO regression model was used to select the most relevant topics for predicting volume changes and to generate predictions. On the 3 test datasets, model performance was evaluated mainly on the number of peaks explained and the RMSE, and the results had varied success. Compared to a time series prediction, however, the LASSO regression was better able to predict the timing of the volume fluctuations. On days where peaks were explained, many of the top topics were related to finance, business, and politics.

Malik, Garima – Predicting Financial Commentaries Using Deep Neural Networks Companies generate financial reports to measure business performance and assess deviations from the forecasts. Analysts comment on these reports to explain the causes of the deviations. In this research paper, we propose a deep learning-based approach to predict the commentaries from the financial data generated by a company. We formulated the problem as a time series classification task where the variance drawn from the difference between forecast and actual numbers is presented as a monthly time series. The data is manually labeled into financial commentary classes by financial experts. We considered various deep learning models for the prediction task, including Long Short-Term Memory networks (LSTM) and Fully Convolutional Networks (FCN). To demonstrate the competencies of the neural network architectures, we also created synthetic time series data, and classification is performed on industry data as well as on rule-based data. We consider AI interpretability an additional component of the project, to better explain the predictions to business users. Our numerical study shows that FCN provides higher performance and a natural, better explainability with Class Activation Maps compared to the other methods. The proposed approach leverages management information systems to offer significant insights for managers and financial experts on key financial issues, including sales and demand forecasting.

Milacic, Dejan – Neural Style Transfer of Environmental Audio Spectrograms Neural Style Transfer is a technique which uses a Convolutional Neural Network to extract features from two input images and generates an output image which has the semantic content of one of the inputs and the "style" of the other. This project applies Neural Style Transfer to visual representations of audio called spectrograms in order to generate new audio signals. Audio inputs to the style transfer algorithm are sampled from the Dataset for Environmental Sound Classification (ESC-50). Generated audio is compared on the basis of input spectrogram type (STFT vs. CQT) and pooling type (max vs. average). The comparison uses Mean Opinion Scores (MOS) calculated from ratings of perceptual quality given by human subjects. The study finds that STFT spectrogram inputs achieve high MOS when subjects are given a description of the style audio. The audio generated using CQT spectrogram inputs raises concerns about using visual-domain techniques to generate audio.

Murad, Mohammad Wahidul Islam – Demand Forecasting For Wholesale Sales by Industry Considering Seasonality Demand forecasting is the basis for planning supply chain activities, and it is very important to choose an effective forecasting technique appropriate to a specific dataset. The appropriate forecasting technique helps management use this information to maintain the flow of materials, products and information in supply chain management. Research on different demand forecasting techniques has been active for several years. The aim of this research project is to study and implement effective forecasting techniques applied to a time-series dataset of different wholesale products by industry type under the North American Industry Classification System. The objective is to forecast demand for wholesale products by industry based on historical time-series data, and to evaluate and compare forecast accuracy using performance evaluation metrics. In this research project, three time-series forecasting models, ARIMA, SARIMAX and Seasonal Decomposition, were used to predict the demand for 23 different wholesale products. The evaluation metrics Root Mean Squared Error (RMSE) and Mean Absolute Percentage Error (MAPE) were used to quantify the accuracy of the predictions against the actual data. The outcome of this project is a comparison of the different forecasting models, identifying the most suitable forecasting technique for predicting wholesale products.

Oyetola, Oyindamola – Predicting Housing Prices Using Deep Neural Networks Housing price prediction is an important aspect of real estate, as it can improve the efficiency and stability of the real estate market for buyers, sellers and the government. Regression analysis is the traditional approach to hedonic price-prediction models. Using a detailed dataset of house sale transactions in Ames, Iowa, this research compares the predictive power of four analytical methods (deep neural networks, principal component regression, support vector regression and regression trees). Results show that the deep neural network approach is the most effective for predicting house prices. The deep neural network also performs better with large amounts of data compared to the other machine learning algorithms.

Parker, Megan – Predicting Stages of Dementia: An Exploration of Feature Selection and Ensemble Methods Dementia is a syndrome which affects 50 million patients worldwide, with symptoms ranging from forgetfulness to difficulty walking [1]. Diagnosis of dementia is a challenging and important problem, as no test exists today which can easily classify the type of dementia for a patient [3]. The objective of this research is to build a pipeline which uses imaging and non-imaging features to predict the stage of dementia for a given patient. The first aim of this research is to determine whether grouping features into subsets can improve model performance. The second aim is to determine whether the results of individual classifiers can be improved using ensemble methods. The ADNI dataset is used in this experiment [2]. The highest performing model was an ensemble which used a combination of deep learning and traditional classifiers trained separately on imaging and non-imaging data, with an accuracy of 89.12%.

Patel, Kshirabdhi – Insight Extraction from Regulatory Documents Using Text Summarization Techniques Legal documents are hard to understand and generally require special knowledge to interpret and gain information from. In such a situation it is hard to find and follow the acts and regulations relevant to our business, jobs or other work, and hiring a person who understands them can cost hundreds of dollars. There is therefore a need for technology that can help overcome these problems, recommend a list of applicable acts and regulations, and provide summaries that make these legal texts understandable. To address this situation, we develop an NLP framework to automatically extract relevant documents according to the user's requirements and produce a summary report of the regulations. The dataset used here is the Canadian Government Regulations and Acts, made public for the data science community by the Canadian Government in 2018.

Percival, Dougall – Experiments in Human-Interpretable Feature Extraction for Medical Narrative Classification Statistics Canada's Canadian Coroners and Medical Examiners Data is a database containing coroners' reports: unstructured text with the results of their findings. Statistics Canada is searching for improved methods of identifying relevant information and classifying reports. Due to COVID-19-imposed constraints, a Medical Transcriptions dataset was used to mimic this data. To solve this problem, seven experiments were conducted using rule-based and machine learning based techniques for information extraction and text classification. The results indicate that custom Named Entity Recognition, a subset of Natural Language Processing, is the most promising method for extracting key information that can further help classify unstructured text narratives. As a government agency, Statistics Canada requires transparency in its methods, and the best method offers not only a strong data classifier but also one that is transparent and easily interpretable.

Rezwan, Asif – Analysis of Daily Weather Data in Toronto to Predict Climate Change Using Bayesian Approach Daily weather data for the City of Toronto from 1840 to 2017 was used to assess whether there has been a change in the pattern of occurrence of rainfall and snowfall over the years in this region, using a Bayesian analysis procedure. Markov Chain Monte Carlo (MCMC) methods were used to find the posterior. The No-U-Turn Sampler, a recent MCMC method, generated approximate posterior distributions of the lengths of wet and dry spells for the rainfall and snowfall data over the 177-year period. By plotting the posterior as a time series, a comparison was made, and it was found that the probabilities of wet spells have seen significant changes over time for both the rainfall and snowfall data; the trend is upward for rainfall and downward for snowfall.

Saha, Milan – Consumer Opinion Classification for Major Canadian Telecom Operators The telecom market in Canada is highly competitive, with three main operators and several smaller telcos providing service. Social media is a good source of data for measuring how a company is performing, since customers post their opinions and reviews online themselves, and operators also interact with customers through social media. This project measures the competitive performance of mobile phone operators from Twitter data by creating a machine learning classification model to investigate consumer opinions and their ways of interacting via tweets. A total of 116,375 tweets were collected from the official accounts of the top Canadian telecom operators (Bell, Rogers, TELUS, Freedom Mobile). After processing and cleaning the dataset, an exploratory analysis was done to find hidden patterns, and a classifier model was developed to analyze sentiment scores using a few ML algorithms. Of the four Canadian telecoms, TELUS has the highest percentage of positive tweets; Rogers is in second position with a 70% score. For the text classifier model, Linear SVM with a count vectorizer initially had the highest accuracy; after fine-tuning with a random oversampling technique, the TF-IDF vectorizer produced the highest accuracy. This solution will help wireless telecom operators learn about negative customer experiences and improve the positive experience of their services.

Saleem, Muhammad Saeed – Speech Recognition on English and French Dataset

Emotions are a basic part of human nature and carry additional insight into human actions. In this paper I’ll attempt to create a model that’ll help classify basic human emotions. The initial model will be created on English Language dataset from Ryerson Audio-Visual Database of Emotional Speech and Song (RAVDESS) dataset. Based on recent studies, Mel-Spectogram helps extract important features from an audio data and those features will be used in 4 different models, SVM, Deep NN, CNN, CNN+LSTM, to test which will provide the best result. The best model will also be tested on a French dataset where the idea is to test if basic emotions differ depending on the type of language. Saleem, Waleed – Recognizing Pattern Based Maneuvers of Traffic Accidents in Toronto This paper discusses the utilization of machine learning techniques to detect patterns of traffic accidents in Toronto. The primary and most fundamental purpose of carrying out this research is to identify and analyze the driving patterns and behaviors in Canada Toronto, as the main sample. The aim of this project paper is to examine the factors that contribute to road accidents in the country; and to evaluate statistically the effect of certain driver’s personal characteristics on road accidents. This paper has proposed a model that trained multi class classification dataset through machine learning algorithms. This paper has used 4 classifiers that are used for supervised learning. Each classifier is implemented on the dataset in order to find the accuracy of the model. The classifiers are also compared to find out the best option for the given dataset. The accuracy of the algorithm showed more than 95% on the dataset which indicates the algorithm was a perfect fit for the given dataset. Seerala, Pranav Kumar – Classification of Chest X-Ray Images of Pneumonia Patients In this work, an attempt has been made to come up with a neural network with a limited number of parameters, with a goal of classifying chest X-rays of pneumonia from healthy patients. The intended applications would be edge devices like cellular phones, Raspberry Pi’s and other computing devices that could be used in developing countries which might be lacking in hardware to deploy and update the model. The dataset used is primarily on pediatric patients and demonstrates the usage of image segmentation, image de-noising and training data selection, to train on images with the most meaningful information, rather than the entire dataset. The results of the hyper-parameter tuned model show a dramatic improvement in overall accuracy of the test set when compared to other Kaggle kernels. Silina, Eugenia – Knowing the Targets When Innoculating Against an Infodemic: Classifying COVID-19 Related News Claims In this paper, news claims related to Covid-19 were classified into multiple mutually exclusive pre-defined categories, based on text content of each claim. The goal of the study was to automate this classification, previously performed manually. For this
Saleem, Waleed – Recognizing Pattern Based Maneuvers of Traffic Accidents in Toronto

This paper discusses the use of machine learning techniques to detect patterns in traffic accidents in Toronto. The primary purpose of this research is to identify and analyze driving patterns and behaviours, with Toronto, Canada as the main sample. The aim is to examine the factors that contribute to road accidents and to evaluate statistically the effect of certain personal characteristics of drivers on road accidents. The paper proposes a model trained on a multi-class classification dataset using machine learning algorithms: four supervised classifiers are each fitted to the dataset to measure the accuracy of the model, and the classifiers are compared to find the best option for the given dataset. The best algorithm achieved more than 95% accuracy, indicating a very good fit to the given dataset.
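
The comparison step can be sketched as below; the four classifiers and the stand-in dataset are assumptions, since the abstract does not name them.

```python
# Compare four common supervised classifiers with 5-fold cross-validation.
from sklearn.datasets import load_iris                 # stand-in for the accident data
from sklearn.model_selection import cross_val_score
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)
models = {
    "logistic regression": LogisticRegression(max_iter=1000),
    "decision tree": DecisionTreeClassifier(random_state=0),
    "random forest": RandomForestClassifier(random_state=0),
    "svm": SVC(),
}
for name, model in models.items():
    print(name, cross_val_score(model, X, y, cv=5).mean())
```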
Seerala, Pranav Kumar – Classification of Chest X-Ray Images of Pneumonia Patients

In this work, an attempt has been made to design a neural network with a limited number of parameters, with the goal of distinguishing chest X-rays of pneumonia patients from those of healthy patients. The intended applications are edge devices such as cellular phones, Raspberry Pis, and other computing devices used in developing countries that may lack the hardware to deploy and update larger models. The dataset is primarily of pediatric patients, and the work demonstrates the use of image segmentation, image de-noising, and training data selection to train on the images with the most meaningful information rather than the entire dataset. The results of the hyperparameter-tuned model show a dramatic improvement in overall test-set accuracy when compared to other Kaggle kernels.
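
A minimal sketch of what a parameter-light network of this kind might look like; the exact architecture is not given in the abstract, so every layer choice here is an assumption.

```python
# A deliberately small CNN (roughly 1,300 parameters) for pneumonia vs. normal X-rays.
from tensorflow import keras
from tensorflow.keras import layers

model = keras.Sequential([
    layers.Input(shape=(128, 128, 1)),        # grayscale X-ray, downsampled
    layers.Conv2D(8, 3, activation="relu"),
    layers.MaxPooling2D(),
    layers.Conv2D(16, 3, activation="relu"),
    layers.MaxPooling2D(),
    layers.GlobalAveragePooling2D(),          # avoids a large dense layer
    layers.Dense(1, activation="sigmoid"),    # pneumonia vs. normal
])
model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])
model.summary()
```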
Silina, Eugenia – Knowing the Targets When Inoculating Against an Infodemic: Classifying COVID-19 Related News Claims

In this paper, news claims related to COVID-19 were classified into multiple mutually exclusive pre-defined categories based on the text content of each claim. The goal of the study was to automate this classification, previously performed manually. For this purpose, Naïve Bayes (NB), “one-vs-one” (OVO), and “one-vs-all” (OVA, also known as “one-vs-the-rest”, OvR) approaches were used. While results for OVO were inconclusive, NB and OVA produced similar results, though the overall performance metrics for both were not very high. Due to the particulars of the dataset, including class imbalance, predictions for some claim types were more successful than for others, with performance metrics varying significantly.
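
A minimal sketch of the one-vs-all setup, assuming scikit-learn; the claims and category names are invented for illustration.

```python
# One-vs-rest multi-class text classification with multinomial Naive Bayes.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.multiclass import OneVsRestClassifier

claims = ["miracle cure kills the virus", "vaccine trial results announced",
          "virus engineered in a lab", "new lockdown rules start monday"]
labels = ["false_cure", "science", "conspiracy", "policy"]  # hypothetical categories

vec = CountVectorizer().fit(claims)
clf = OneVsRestClassifier(MultinomialNB()).fit(vec.transform(claims), labels)
print(clf.predict(vec.transform(["leaked memo reveals cure"])))
```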
Somisetty, Kusumanjali – Online Detection of User’s Anomalous Activities Using Logs

Securing an organization's confidential information is always a concern. This paper implements a machine learning approach to monitor user activities and identify anomalous data, that is, data that differ from what is expected or normally occurs. Detecting anomalies is important in most industries. For example, in network security, anomalous packets or requests can be flagged as errors or potential attacks; in customer security, anomalous online behavior can be used to identify fraud; and in manufacturing and the Internet of Things, anomaly detection is useful for identifying machine failures.
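
The abstract does not name a specific algorithm, so as one plausible sketch, here is unsupervised anomaly detection over simple per-session log features with Isolation Forest; the features and data are invented.

```python
# Flag anomalous user sessions from hypothetical log-derived features.
import numpy as np
from sklearn.ensemble import IsolationForest

# Features per session: [logins_per_hour, megabytes_transferred, failed_auths]
rng = np.random.default_rng(0)
normal = rng.normal([5, 50, 1], [1, 10, 1], size=(200, 3))
sessions = np.vstack([normal, [[40, 900, 12]]])           # one injected outlier

detector = IsolationForest(contamination=0.01, random_state=0).fit(sessions)
flags = detector.predict(sessions)                        # -1 marks anomalies
print(np.where(flags == -1)[0])                           # the injected outlier (row 200)
```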
Song, Tianci – Damaged Property Detection With Convolutional Neural Networks

Recent studies have shown good results using the VGG network architecture to detect damaged buildings automatically from satellite imagery after natural disasters, including hurricanes and tsunamis. The purpose of this project is to enhance the damaged property detection process using post-hurricane satellite images with different convolutional neural network architectures. The image dataset used in this study contained images of affected areas in the Greater Houston Area, Texas, before and after Hurricane Harvey in 2017. Two architectures, ResNet and Inception, were used in the project. For each architecture, three configurations were trained with 3-fold cross-validation, and the best configuration was chosen to develop the final model. The results showed that ResNet provided higher predictive power, with accuracy around 98%, while Inception had slightly lower accuracy, around 94%. In conclusion, ResNet outperformed Inception. The configuration with data augmentation and reduced adaptive learning rates yielded an improvement for both architectures.
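
A minimal sketch of this kind of setup, assuming Keras and an ImageNet-pretrained ResNet50; the input size and classification head are illustrative, and the project's actual configurations are not reproduced here.

```python
# Fine-tune a pretrained ResNet to classify satellite tiles as damaged vs. undamaged.
from tensorflow import keras
from tensorflow.keras import layers

base = keras.applications.ResNet50(include_top=False, weights="imagenet",
                                   input_shape=(128, 128, 3), pooling="avg")
base.trainable = False                        # freeze pretrained features first

model = keras.Sequential([
    base,
    layers.Dense(1, activation="sigmoid"),    # damaged vs. undamaged
])
model.compile(optimizer=keras.optimizers.Adam(1e-4),   # reduced learning rate
              loss="binary_crossentropy", metrics=["accuracy"])
```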
Thanabalasingam, Mathusan – Using ProtoPNet to Interpret Alzheimer’s Disease Classification over MRI images

Alzheimer's Disease is one of the most common diseases today with no cure. It is important to be able to identify the disease in an effective manner, and machine learning has made major strides in being able to do so. Deep neural networks, however, are difficult to interpret; the inputs and outputs can be understood by humans, but not the actual process. This paper considers a relatively new interpretable deep learning architecture called ProtoPNet, which allows a model to make its classifications based on prototypical parts of the input image. The models are trained on the OASIS MRI brain image dataset. The results of this experiment show that ProtoPNet is able to classify images with relatively high accuracy, while also providing a level of interpretability not present in most deep learning models.
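
The core of ProtoPNet is a prototype layer that scores how strongly each learned prototype appears anywhere in an image's feature map. The sketch below follows the similarity function from the ProtoPNet paper, log((d^2 + 1)/(d^2 + eps)), but is illustrative only; the shapes and the distance computation are assumptions, not the paper's code.

```python
# Score each image against learned prototypical parts (ProtoPNet-style).
import torch
import torch.nn.functional as F

def prototype_similarities(features, prototypes, eps=1e-4):
    """features: (B, C, H, W) conv output; prototypes: (P, C, 1, 1) learned parts."""
    # Squared L2 distance to every spatial patch, expanded as ||z||^2 - 2*z.p + ||p||^2.
    z2 = F.conv2d(features ** 2, torch.ones_like(prototypes))
    zp = F.conv2d(features, prototypes)
    p2 = (prototypes ** 2).sum(dim=(1, 2, 3)).view(1, -1, 1, 1)
    d2 = torch.clamp(z2 - 2 * zp + p2, min=0)
    sim = torch.log((d2 + 1) / (d2 + eps))   # high where a patch matches a prototype
    return sim.amax(dim=(2, 3))              # best match per prototype: (B, P)

sims = prototype_similarities(torch.randn(2, 64, 7, 7), torch.randn(10, 64, 1, 1))
print(sims.shape)  # (2, 10), fed to the final classification layer
```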

Tsang, Leo – Predicting NBA Draft Candidates Using College Statistics

In this paper, we applied machine learning algorithms to assess potential NBA draft candidates. We began by conducting descriptive analytics to identify trends among drafted NBA players, followed by an understanding of today's game and of patterns among recently drafted players. The second part of our analysis handles imbalanced data using Synthetic Minority Over-sampling and random under-sampling. The last part of our analysis creates strong attributes by feature-engineering the existing data and applies XGBoost, logistic regression, and a multi-layer perceptron to identify potential draft candidates. Lastly, we go over the importance of this type of model and how it can be used by the front office of NBA teams.
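
A minimal sketch of the imbalance-handling step paired with one of the named models; the synthetic player features are placeholders for real college statistics.

```python
# Balance a rare "drafted" class with SMOTE, then fit an XGBoost classifier.
import numpy as np
from imblearn.over_sampling import SMOTE
from xgboost import XGBClassifier

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 6))                 # hypothetical per-player college stats
y = (rng.random(500) < 0.05).astype(int)      # roughly 5% drafted: heavily imbalanced

X_bal, y_bal = SMOTE(random_state=0).fit_resample(X, y)
model = XGBClassifier(n_estimators=200, max_depth=3, eval_metric="logloss")
model.fit(X_bal, y_bal)
print(model.predict_proba(X[:3])[:, 1])       # draft probability for three players
```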
Uddin, Md Rokon – Demands and Sales Forecasting for Retailers by Analyzing Google Trends and Historical Data

The objective of this project is to create forecasting models for retailers using artificial neural networks (ANNs), so that they can make business decisions by visualizing future data. Two forecasting models are introduced here: a sales model that predicts future sales, and a demand model that predicts future demand. To achieve the objective, a CNN-LSTM model is used for both sales and demand prediction, because this hybrid model can learn from a very long range of historical data and predict the future efficiently.
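
One way such a hybrid can be wired up is sketched below; the window length, feature count, and layer sizes are assumptions, since the abstract gives no architecture details.

```python
# A CNN-LSTM for one-step-ahead forecasting from a sliding window of history.
from tensorflow import keras
from tensorflow.keras import layers

window, n_features = 30, 2                    # e.g. past sales plus a Google Trends index
model = keras.Sequential([
    layers.Input(shape=(window, n_features)),
    layers.Conv1D(32, 3, activation="relu"),  # local patterns (promotions, weekends)
    layers.MaxPooling1D(2),
    layers.LSTM(32),                          # longer-range temporal dependencies
    layers.Dense(1),                          # next-step sales or demand
])
model.compile(optimizer="adam", loss="mse")
model.summary()
```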
Xu, Shaofang – Credit Risk Rating Model Development via Machine Learning

Credit rating is a fundamental piece of credit risk management for financial institutions. Recently, researchers and practitioners at financial institutions have started applying machine learning methods to credit rating problems. Using historical issuer ratings as the target and issuer information and performance histories as predictive features, the credit rating problem can be solved as a binary or multi-class classification problem under supervised learning. This article adopts four approaches: logistic regression (LR), decision trees (DT), gradient boosting regression trees (GB), and random forests (RF). It also proposes a simple framework that utilizes the ordinal characteristics embedded in credit ratings, into which many popular binary classification algorithms can be incorporated. Empirical results on US listed companies indicate that the decision-tree-based ensemble algorithms, GB and RF in this article, outperformed the other two approaches as well as the traditional statistical model in all performance measurements, including discriminatory power and rating match-rate.
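
The article's exact framework is not spelled out in the abstract; a common way to exploit rating order, sketched here purely as an assumption, is to decompose K ordered grades into K-1 binary "is the rating above grade k?" problems and sum the resulting predictions.

```python
# Ordinal decomposition: K-1 binary classifiers, one per rating threshold.
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier

rng = np.random.default_rng(0)
X = rng.normal(size=(300, 5))                  # hypothetical issuer features
y = np.clip((X[:, 0] + rng.normal(0, 0.5, 300)).round().astype(int) + 2, 0, 4)

K = 5  # rating grades 0 (worst) .. 4 (best)
models = [GradientBoostingClassifier().fit(X, (y > k).astype(int)) for k in range(K - 1)]

# Predicted grade = number of thresholds the issuer clears.
pred = sum(m.predict(X) for m in models)
print(np.mean(pred == y))                      # in-sample match rate
```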

Yeasmin, Nilufa – A Prediction Model for Chest Radiology Reports and Capturing Uncertainties of Radiograph Using Convolutional Neural Network

Chest radiography is the most common imaging examination worldwide, critical for the screening, diagnosis, and management of many life-threatening diseases. The purpose of this project is to create a model that automatically predicts radiology report observations, capturing the uncertainties inherent in radiographs, using a Convolutional Neural Network (CNN). The idea is to investigate different approaches to using the uncertainty labels for training convolutional neural networks that output the probability of these observations given the available frontal and lateral radiographs. The trained models take a single-view chest radiograph as input and output the probability of each of the 14 observations. DenseNet121 and DenseNet169 were used for training on the dataset, and the performance of the different uncertainty approaches was compared on a validation set. The model performed as well as radiologists in detecting different pathologies in chest X-rays.

Zhang, Dongrui – Predicting Exchange Rate of Currency by LSTM Model

Using information technology to forecast international foreign exchange rates helps investors and policy-makers earn more profit and make better policies. Machine learning algorithms are widely used to predict financial time series, and the LSTM (Long Short-Term Memory) neural network, one of the classic models in machine learning, is well suited to mining long-term dependencies in sequential data. Based on an analysis of CAD/USD exchange rate prediction, this project discusses the feasibility of short-term direction forecasting of the exchange rate using the LSTM neural network, examining how accuracy is affected by different time steps and by using exchange rate data alone versus exchange rate data with macroeconomic features. The results show that the LSTM model performs best when predicting the direction of the exchange rate one week ahead, and that adding macroeconomic features yields no obvious improvement.
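
A minimal sketch of the direction-forecasting setup described above, with a synthetic rate series standing in for the real CAD/USD data; the window length and five-day (one trading week) horizon are assumptions.

```python
# Predict whether the CAD/USD rate will be higher one week ahead.
import numpy as np
from tensorflow import keras
from tensorflow.keras import layers

rng = np.random.default_rng(0)
rates = np.cumsum(rng.normal(0, 0.002, 500)) + 0.75     # placeholder for real rates

window, horizon = 20, 5
X = np.array([rates[i:i + window] for i in range(len(rates) - window - horizon)])[..., None]
y = (rates[window + horizon:] > rates[window:-horizon]).astype(int)   # 1 = up, 0 = down

model = keras.Sequential([
    layers.Input(shape=(window, 1)),
    layers.LSTM(32),
    layers.Dense(1, activation="sigmoid"),
])
model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])
model.fit(X, y, epochs=2, verbose=0)
```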