Lecture2notes: Summarizing Lecture Videos by Classifying Slides and Analyzing Text with Machine Learning

Hayden Housen

https://[email protected]: @HHousen
Mentor: Dhiraj Joshi @ IBM

All Images Without Citations Are My Own
Background: https://unsplash.com/photos/ieic5Tq8YMk

Natural Language Processing • Computer Vision • Machine Learning

Goal/Focus

[Figure labels: Phone, Presentation, AI Model, Lecture Video, Detailed Notes]

Gap/Hypothesis

[Figure labels: Slide Classifier, AI Summarization Models, Website, End-to-End Process, Presenter Slide, Slide, Lecture Video]

Brief Methodology

[Figure labels: Lecture Video, Video, Audio, Transcript, Structured Content, Combination Algorithm, Summarization Algorithm, Render, Detailed Notes]

Computer Vision

“Teaching Computers To See”

• Face Recognition
• Emotion Analysis
• Crowd Analytics
• Sports: draw lines on field & highlights
• Medical Imaging
• Movie Special Effects
• Self Driving Cars

Humans generate massive amounts of video: video data accounted for 75% of total internet traffic in 2017. [23]

Literature Review – Note Taking

Note taking is an almost universal activity among students. [27]

Students’ notes are generally incomplete and thus not adequate for reviewing the material. [28]

Students prefer guided notes [29], and final exam performance is higher with guided notes. [30]

Estimated Effects

Previewing: [31] suggests that summarized lecture slides reduce the amount of previewing time required without impacting quiz scores.

MOOCs: Summaries enable quick skimming of the main points.

Automated summaries will:

• Decrease time spent creating notes
• Increase quiz scores (content knowledge)
• Enable faster learning

Literature Review – Lecture Summarization

TalkMiner [22]: Web scraping and OCR are used to index online lecture videos. We implement the duplicate alignment algorithm suggested by TalkMiner.

Lecture Video Indexing [20]: Inspired our slide structure analysis algorithm. We use the same title identification and text classification thresholds.

[Figures: Summary of TalkMiner [22]; Summary of [20]]

Literature Review – Lecture Summarization

BERT Text Summarization Lectures [23] is most related since it applies deep learning to lecture summarization.

The approach is limited because it:

• Only tests the BERT model
• Does not consider the information on the slides
• Does not fine-tune BERT and merely uses it for embeddings
• Has no speech-to-text (STT) component

Main Components

[Figure labels: Lecture Video Dataset, Slide Classifier, AI Summarization Models, End-to-End Process, Website; example frames: Presenter Slide, Slide]

AI Slide Classifier Model

1. Slide Classification Model

• Classifies frames of the input video into 9 classes.
• Allows the system to identify images containing slides for additional processing.
• Lets the system ignore useless frames (those containing only the presenter or audience).
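The filtering role of the classifier can be illustrated with a minimal sketch. It assumes the classifier's output is already available as a mapping from frame file name to predicted class; the function name, the file names, and the class names other than “slide” and “presenter_slide” are hypothetical.

```python
from typing import Dict, List

# Classes the later pipeline stages care about (named in these notes).
KEEP = {"slide", "presenter_slide"}


def select_slide_frames(predictions: Dict[str, str]) -> List[str]:
    """Return the frame names whose predicted class contains a slide."""
    return sorted(name for name, label in predictions.items() if label in KEEP)


# Toy predictions; "presenter" and "audience" are assumed class names.
predictions = {
    "frame_0001.jpg": "presenter",
    "frame_0002.jpg": "slide",
    "frame_0003.jpg": "audience",
    "frame_0004.jpg": "presenter_slide",
}
kept = select_slide_frames(predictions)
print(kept)
```

Frames outside the two slide classes are simply dropped before any OCR or cropping work is done.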

1.1. Lecture Video Dataset – Diagram

[Diagram labels: Website Scraper, YouTube Scraper, Videos CSV, Video Downloader, Frame Extractor, Sorting (Auto Sort (AI Classification), Manual Sorting Repairs), Human Interaction, Compile Data, Train Slide Classifier, Bootstrapping. Source shown: MIT OpenCourseWare.]

The final dataset contains 15,599 images extracted from 78 videos and manually classified into 9 categories.

1.2. Slide Classifier – Dataset Modifications

Dataset variants: Train-test, Train-test-three, CV

Train-test

• Split dataset evenly (by category) into train (80%) & test (20%) sets.
• Average deviation = 0.0929.
• Random subset would cause class imbalance.

Train-test-three

• Only the “slide” and “presenter_slide” classes are used in the end-to-end process, so create a dataset with only those classes.
• “Other” class contains the remaining 7 classes.
• Average deviation = 0.001104.
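A per-category 80/20 split and the three-category relabeling can be sketched as follows. This is an illustration under assumptions, not the project's own code: the function names, the fixed seed, and the “audience” class name are hypothetical.

```python
import random
from collections import defaultdict
from typing import Dict, List, Tuple


def stratified_split(
    labels: Dict[str, str], test_frac: float = 0.2, seed: int = 0
) -> Tuple[List[str], List[str]]:
    """Split items into train/test so each class keeps the same proportion."""
    by_class = defaultdict(list)
    for item, label in sorted(labels.items()):
        by_class[label].append(item)
    rng = random.Random(seed)
    train, test = [], []
    for items in by_class.values():
        rng.shuffle(items)
        n_test = round(len(items) * test_frac)
        test.extend(items[:n_test])
        train.extend(items[n_test:])
    return train, test


def to_three_categories(label: str) -> str:
    """Collapse every class except the two slide classes into "other"."""
    return label if label in ("slide", "presenter_slide") else "other"


# Toy dataset: 50 "slide", 20 "presenter_slide", 30 "audience" images.
labels = {
    f"img_{i:03d}.jpg": "slide" if i < 50 else "presenter_slide" if i < 70 else "audience"
    for i in range(100)
}
train, test = stratified_split(labels)
print(len(train), len(test))
```

Splitting inside each class, rather than sampling the whole dataset at random, is what keeps the class proportions of the train and test sets aligned.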

1.3. Slide Classifier – Squished Images

After the hyperparameters were found (by training 262 models in CV using an optimization framework), train the final-general and three-category models.

Frames from videos have a 16:9 aspect ratio, but ImageNet CNNs accept images with a 1:1 aspect ratio.

Center cropping and scaling can remove important data.

Train a model by simply rescaling.

[Figure caption: Misclassified as “slide” because presenter cropped out.]
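The cost of center cropping can be worked out with simple arithmetic. The sketch below assumes a 1920x1080 frame as one example of a 16:9 frame; the function names are hypothetical.

```python
from typing import Tuple


def center_crop_box(width: int, height: int) -> Tuple[int, int, int, int]:
    """Return (left, top, right, bottom) of the largest centered square."""
    side = min(width, height)
    left = (width - side) // 2
    top = (height - side) // 2
    return left, top, left + side, top + side


def fraction_discarded(width: int, height: int) -> float:
    """Fraction of the frame's pixels that a centered square crop throws away."""
    side = min(width, height)
    return 1 - (side * side) / (width * height)


box = center_crop_box(1920, 1080)
lost = fraction_discarded(1920, 1080)
print(box, lost)
```

For a 16:9 frame the crop drops 43.75% of the pixels, all from the left and right edges, which is where a presenter standing beside the slide would appear. Rescaling ("squishing") keeps every pixel at the cost of distorting the aspect ratio.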

1.4. Slide Classifier – Results

Model | Dataset | Accuracy | Accuracy (train) | F-score | Precision | Recall
Final-general | train-test | 82.62 | 98.58 | 87.44 | 97.73 | 82.62
Three-category | train-test-three | 89.97 | 99.72 | 93.82 | 99.95 | 89.97
Squished-image | train-test | 83.86 | 97.16 | 88.16 | 97.72 | 83.86
Squished-image | train-test-three | 87.21 | 100 | 91.57 | 99.8 | 87.21

For each model configuration, we trained 11 models and report the average metrics.

In the final pipeline, the three-category model is used.

The slide classifier can be used in other research to easily identify slides.

1.4 Slide Classifier – Results

[Figures: Final-general, median model (by accuracy); Three-category, median model (by accuracy)]

1.5 Slide Classifier – Discussion

Incorrectly classifying “slide” frames as “presenter_slide” will have minimal impact on the final summary (will be ignored during processing).

Incorrectly classifying “presenter_slide” frames as “slide” will impact the final summary (will not be cropped).

2.1 End-to-End Approach – Overview

1. Extract Frames: ffmpeg
2. Classify Frames: slide classifier that uses the ML model from the previous slide, trained on the dataset
3. Perspective Crop: crop slides captured by video camera
4. Cluster Slides: Segment or Normal
5. Slide Structure Analysis & OCR: Tesseract OCR, Stroke Width, Title Identification, Bold Identification
6. Figure Extraction: Text Detection w/ EAST, Color Detection, Shannon Entropy
7. Transcribe Audio: DeepSpeech, CMU Sphinx, Vosk, Google Cloud Speech API
8. Summarization: Combination, Modifications, Extractive, Abstractive
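Shannon entropy is listed among the figure extraction tools. The notes do not say how it is applied, so the sketch below only shows the standard formula, H = sum(p * log2(1/p)), computed over the pixel values of a grayscale region; the function name and the toy pixel lists are hypothetical.

```python
import math
from collections import Counter
from typing import Iterable


def shannon_entropy(pixels: Iterable[int]) -> float:
    """Entropy in bits of the distribution of pixel values."""
    counts = Counter(pixels)
    total = sum(counts.values())
    return sum((c / total) * math.log2(total / c) for c in counts.values())


flat = [255] * 64                 # a blank region: one value only
mixed = [0, 64, 128, 255] * 16    # four equally likely values

print(shannon_entropy(flat), shannon_entropy(mixed))
```

A uniform region scores 0 bits while a region with varied content scores higher, which is why entropy is a common cheap signal for separating blank areas from areas with visual detail.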

2.2. Perspective Cropping Algorithm

[Flowchart labels: Frames, Classifier, Slides (Screen Capture, Video Camera), Cluster Together, Camera Motion, Feature Matching, decision branches No / Yes, Match Features, RANSAC Transform, Corner Crop Transform, Remove Duplicates, Unique Slides]

2.3. Perspective Cropping – CCT

Corner Crop Transform (CCT):

• Contours
• Hough Lines

2.3. Perspective Cropping – Feature Match

• Iterates through the “screen capture” and “video camera” slides in chronological order.
• When the category switches, matching begins using ORB & FLANN and continues until the number of matched features falls below a threshold.
• Goal: find “video camera” within “screen capture”.
• If the number of matches is above a threshold, the slides are considered identical.
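The threshold decision can be illustrated with a dependency-free sketch. It is a stand-in rather than the method above: it uses brute-force Hamming matching instead of FLANN and toy 8-bit integers instead of real ORB descriptors. The ratio test, the ratio value, and the minimum match count are assumptions.

```python
from typing import List


def hamming(a: int, b: int) -> int:
    """Number of differing bits between two binary descriptors."""
    return bin(a ^ b).count("1")


def count_good_matches(desc_a: List[int], desc_b: List[int], ratio: float = 0.7) -> int:
    """Count descriptors in desc_a whose best match in desc_b is clearly
    better than the second best (a ratio test)."""
    good = 0
    for d in desc_a:
        dists = sorted(hamming(d, other) for other in desc_b)
        if len(dists) >= 2 and dists[0] < ratio * dists[1]:
            good += 1
    return good


def same_slide(desc_a: List[int], desc_b: List[int], min_matches: int = 3) -> bool:
    """Treat two slides as identical when enough features match."""
    return count_good_matches(desc_a, desc_b) >= min_matches


slide = [0b10110010, 0b01001101, 0b11110000, 0b00001111]
near_copy = [0b10110010, 0b01001101, 0b11110001, 0b00001111]  # one bit changed
different = [0b10101010, 0b01010101, 0b11001100, 0b00110011]

print(same_slide(slide, near_copy), same_slide(slide, different))
```

With real images the descriptors would come from a feature detector and the matcher would be approximate, but the final step is the same comparison of a match count against a threshold.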

2.4. Perspective Cropping – More Content?

[Figure labels: 124240 content units added; Algorithm]

The slide with the most content is kept.

2.4. Slide Structure Analysis (SSA)

Title: 2.4. Perspective Cropping – Feature Match
Bullet 1: **Iterates** through the “screen capture” and “video camera” slides in **chronological** order.
Bullet 2: When this **category switches**, begins matching using ORB & FLANN until number of matched features below threshold.
Bullet 3: **Goal:** find “video camera” within “screen capture”
No footer text

Goal: Copy and paste formatted text from slides while not having access to the text, only an image of the text.

2.4. Slide Structure Analysis – 4 Algorithms

Stroke Width

1. Distance Transform
2. Peak Local Max – Find peaks in color

Identify Title [20]

Criteria for a line to be a title:

1. In upper third of image
2. In first block and first paragraph
3. Stroke width > 1 standard deviation from average stroke width
4. Line height > average
5. x position of top-left corner < image_width*0.77
6. > 3 characters

Categorize Text [20]

Based on stroke width and height of text bounding boxes.

• Bold text
• Small text
• Normal text

Figure Extraction

1. Detects objects that are close together & meet a size and aspect ratio threshold
2. Checks that each potential figure contains a minimal amount of text and is in color
3. Removes overlapping figures
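The six title criteria translate directly into a boolean check. The sketch below assumes each OCR line is already described by a small record; the `TextLine` fields and the function name are hypothetical, and criterion 3 is read as "more than one standard deviation above the average stroke width", which is one interpretation of the wording.

```python
from dataclasses import dataclass


@dataclass
class TextLine:
    text: str
    x: int               # x of the top-left corner, in pixels
    y: int               # y of the top-left corner, in pixels
    height: float        # line height, in pixels
    stroke_width: float  # estimated stroke width of the line
    block: int           # OCR block index (0 = first block)
    paragraph: int       # paragraph index inside the block (0 = first)


def is_title(line: TextLine, image_width: int, image_height: int,
             mean_stroke: float, std_stroke: float, mean_height: float) -> bool:
    """Apply the six title criteria to one OCR line."""
    return (
        line.y < image_height / 3                          # 1. upper third
        and line.block == 0 and line.paragraph == 0        # 2. first block/paragraph
        and line.stroke_width > mean_stroke + std_stroke   # 3. thick strokes (assumed reading)
        and line.height > mean_height                      # 4. taller than average
        and line.x < image_width * 0.77                    # 5. starts left of 77% width
        and len(line.text.strip()) > 3                     # 6. more than 3 characters
    )


heading = TextLine("2.4. Slide Structure Analysis", 80, 40, 48.0, 5.0, 0, 0)
body = TextLine("Goal: copy formatted text", 80, 400, 24.0, 2.0, 1, 0)

print(is_title(heading, 1280, 720, 2.5, 1.0, 30.0),
      is_title(body, 1280, 720, 2.5, 1.0, 30.0))
```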

2.5. Figure Extraction

1. Detects objects that are close together & meet a size and aspect ratio threshold
2. Checks that each potential figure contains a minimal amount of text and is in color
3. Removes overlapping figures

2.6. Transcribe Audio – Speech-To-Text

Tested performance of STT models on 43 lecture videos from slide classifier dataset.

Chunking by Voice Activity: Uses WebRTC Voice Activity Detector (VAD) – increases the speed of STT by reducing the amount of audio without speech.

STT Models

• DeepSpeech
• Sphinx
• Vosk small-0.3
• Vosk daanzu-20200905
• Vosk aspire-0.2
• Google’s paid STT API
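STT models are commonly compared by word error rate (WER): the word-level edit distance between a reference transcript and the model's output, divided by the number of reference words. The following is a minimal sketch of that standard definition, not the evaluation code used for this project; the function names and sample sentences are hypothetical.

```python
from typing import List


def edit_distance(ref: List[str], hyp: List[str]) -> int:
    """Word-level Levenshtein distance (substitutions, insertions, deletions)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def wer(reference: str, hypothesis: str) -> float:
    """Word error rate as a percentage of the reference length."""
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    return 100.0 * edit_distance(ref, hyp) / len(ref)


score = wer("the cat sat on the mat", "the cat sit on mat")
print(f"{score:.2f}")
```

In this example one word is substituted and one is dropped, so two errors over six reference words give a WER of about 33.33.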

2.6. Transcribe Audio – Results

Model | WER | MER | WIL | LS TC WER | Processing Time
DeepSpeech (chunking) | 43.01 | 41.82 | 59.04 | 5.97 | ≈4 hours
DeepSpeech | 44.44 | 42.98 | 59.99 | 5.97 | ≈20 hours
Google STT ("default" model) | 34.43 | 33.14 | 49.05 | 12.23 | ≈20 minutes
Vosk small-0.3 | 35.43 | 33.64 | 50.84 | 15.34 | ≈8.5 hours
Vosk daanzu-20200905 | 33.67 | 31.87 | 48.28 | 7.08 | ≈5.5 hours
Vosk aspire-0.2 | 41.38 | 38.44 | 56.35 | 13.64 | ≈19 hours
Vosk aspire-0.2 (chunking) | 41.45 | 38.65 | 56.56 | 13.64 | ≈19 hours

"LS TC WER" stands for LibriSpeech test-clean WER (Word Error Rate).

Sphinx was not tested because it is between 6x and 18x slower than the other models.

3.1. End-to-End Approach – Slides/Videos

[Diagram: pipeline stages Extract Frames, Classify Frames, Perspective Crop, Cluster Slides, Slide Structure Analysis & OCR, Figure Extraction, Transcribe Audio, Summarization. Legend: Novel ML Models, Novel Approaches.]

3.1. Text Summarization

Extractive Summarization: Identifies important sections of the text and reproduces them verbatim, producing a subset of the sentences from the original text.

Abstractive Summarization: Reproduces important material in a new way after interpretation and examination of the text using advanced natural language techniques.
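Extractive summarization can be illustrated with a small frequency-based heuristic. This is a generic sketch, not one of the models evaluated here: sentences are scored by how often their longer words occur in the whole text, and the top-scoring sentences are returned verbatim in their original order. The function name and sample text are hypothetical.

```python
import re
from collections import Counter
from typing import List


def extractive_summary(text: str, n_sentences: int = 2) -> List[str]:
    """Return the n highest-scoring sentences, unchanged and in document order."""
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]
    words = re.findall(r"[a-z']+", text.lower())
    freq = Counter(w for w in words if len(w) > 3)  # skip short function words

    def score(sentence: str) -> int:
        return sum(freq[t] for t in re.findall(r"[a-z']+", sentence.lower()))

    top = sorted(range(len(sentences)), key=lambda i: score(sentences[i]),
                 reverse=True)[:n_sentences]
    return [sentences[i] for i in sorted(top)]


text = ("Gradient descent updates the weights. The weather was nice. "
        "Gradient descent uses the gradient of the loss. Weights change slowly.")
summary = extractive_summary(text)
print(summary)
```

Every returned sentence is a literal substring of the input, which is the defining property of the extractive approach; an abstractive model would instead generate new wording.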

[Diagram labels: Slides Transcript, Audio Transcript, Combination Algorithm, Summarization Model]

Combination Algorithm

• only_asr – only uses the audio transcript (deletes the slide transcript)
• only_slides – reverse of only_asr
• concat – appends audio transcript to slide transcript
• full_sents – audio transcript appended to only the complete sentences of the slide transcript
• keyword_based (most advanced) – selects a certain percentage of sentences from the audio transcript based on keywords found in the slides transcript

Summarization Model

1. Extractive Summarization
• Cluster – groups the lecture transcript into categories by topic and summarizes each topic using “generic”
• Generic (non-neural) – standard summarization heuristics

2. Abstractive Summarization
• PreSumm or BART or TransformerSum or PEGASUS
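The five combination modes can be sketched as one function. The details are assumptions made for illustration: a “complete sentence” is taken to be a slide line ending in terminal punctuation, keywords are slide words longer than three letters, and `keep_fraction` is a hypothetical parameter standing in for the “certain percentage”.

```python
import re
from typing import List


def split_sentences(text: str) -> List[str]:
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]


def combine(slides: str, audio: str, mode: str, keep_fraction: float = 0.5) -> str:
    """Merge the slide transcript and audio transcript into one text."""
    if mode == "only_asr":
        return audio
    if mode == "only_slides":
        return slides
    if mode == "concat":
        return f"{slides} {audio}"
    if mode == "full_sents":
        # Assumption: a complete sentence is a slide line ending in . ! or ?
        complete = [ln.strip() for ln in slides.splitlines()
                    if ln.strip() and ln.strip()[-1] in ".!?"]
        return " ".join(complete + [audio])
    if mode == "keyword_based":
        keywords = {w for w in re.findall(r"[a-z]+", slides.lower()) if len(w) > 3}
        sentences = split_sentences(audio)

        def hits(sentence: str) -> int:
            return sum(w in keywords for w in re.findall(r"[a-z]+", sentence.lower()))

        n_keep = max(1, round(len(sentences) * keep_fraction))
        ranked = sorted(range(len(sentences)), key=lambda i: hits(sentences[i]),
                        reverse=True)[:n_keep]
        return " ".join(sentences[i] for i in sorted(ranked))
    raise ValueError(f"unknown mode: {mode}")


slides = "Gradient Descent\nThe learning rate controls the step size."
audio = ("Welcome back everyone. Today we cover gradient descent. "
         "The learning rate controls how far each step moves. "
         "Please silence your phones.")
print(combine(slides, audio, "keyword_based"))
```

In the keyword_based example the two off-topic audio sentences share no keywords with the slide and are dropped, while the two on-topic sentences are kept in their spoken order.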

3.1. Text Summarization – SOTA Models

PreSumm: Applied BERT to text summarization [24]

BART: standard seq2seq architecture with a bidirectional encoder (like BERT) and a left-to-right decoder (like GPT)

PEGASUS: pretraining task is intentionally similar to summarization

3.2. TransformerSum – Overview

Summarization training and inference library created by Hayden Housen

Improves the architecture of BertSum (extractive component of PreSumm)

TransformerSum was used by a company that serves industry-leading enterprises, which engage with millions of customers daily.

3.2. TransformerSum – Diagram

Legend: S = Sentence, T = Token, E = Embedding, R = Sentence Ranking

[Diagram: Input Document (S1, S2, …) → Tokenizer (T1, T2, T3, T4, …) → Word Embedding Model (TE1, TE2, TE3, TE4, …) → Pooling (SE1, SE2, …) → Classifier (R1, R2, …)]

3.2. TransformerSum – Pooling Modes

Tested all three pooling modes (mean_tokens, sentence_rep_tokens, and max_tokens) using DistilBERT and DistilRoBERTa.

Across all datasets and models, pooling mode has no significant impact on the final ROUGE scores.

Model | Pooling Method | CNN/DM | WikiHow | ArXiv/PubMed
distilbert-base-uncased | sent_rep | 42.71/19.91/39.18 | 30.69/08.65/28.58 | 34.93/12.21/31.00
distilbert-base-uncased | mean | 42.70/19.88/39.16 | 30.48/08.56/28.42 | 34.48/11.89/30.61
distilbert-base-uncased | max | 42.74/19.90/39.17 | 30.51/08.62/28.43 | 34.50/11.91/30.62
distilroberta-base | sent_rep | 42.87/20.02/39.31 | 31.07/08.96/28.95 | 34.70/12.16/30.82
distilroberta-base | mean | 43.00/20.08/39.42 | 30.96/08.93/28.86 | 34.24/11.82/30.42
distilroberta-base | max | 42.91/20.04/39.33 | 30.93/08.92/28.82 | 34.28/11.82/30.44
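The three pooling modes each reduce a sentence's token embeddings to a single sentence embedding. The sketch below uses plain Python lists in place of tensors; the function names are hypothetical, and it assumes the sentence representation token is the first token of each sentence (as with a [CLS]-style marker).

```python
from typing import List

Vector = List[float]


def mean_tokens(token_vecs: List[Vector]) -> Vector:
    """Average each embedding dimension over the sentence's tokens."""
    n = len(token_vecs)
    return [sum(col) / n for col in zip(*token_vecs)]


def max_tokens(token_vecs: List[Vector]) -> Vector:
    """Take the maximum of each embedding dimension over the tokens."""
    return [max(col) for col in zip(*token_vecs)]


def sent_rep_token(token_vecs: List[Vector], rep_index: int = 0) -> Vector:
    """Use one designated token's embedding as the sentence embedding.
    Assumption: the representative token sits at position rep_index."""
    return list(token_vecs[rep_index])


# One sentence with three tokens and 2-dimensional embeddings (toy values).
sentence = [[1.0, 0.0], [3.0, 2.0], [2.0, 4.0]]

print(mean_tokens(sentence), max_tokens(sentence), sent_rep_token(sentence))
```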

3.2. TransformerSum – Classifiers

Tested all four classification layers.

No significant difference in ROUGE scores between classifiers.

Model | Classifier | CNN/DM | WikiHow | ArXiv/PubMed
distilbert-base-uncased | simple_linear | 42.71/19.91/39.18 | 30.69/08.65/28.58 | 34.93/12.21/31.00
distilbert-base-uncased | linear | 42.70/19.84/39.14 | 30.67/08.62/28.56 | 34.87/12.20/30.96
distilbert-base-uncased | transformer | 42.78/19.93/39.22 | 30.66/08.69/28.57 | 34.94/12.22/31.03
distilbert-base-uncased | transformer_linear | 42.78/19.93/39.22 | 30.64/08.64/28.54 | 34.97/12.22/31.02
distilroberta-base | simple_linear | 42.87/20.02/39.31 | 31.07/08.96/28.95 | 34.70/12.16/30.82
distilroberta-base | linear | 43.18/20.26/39.61 | 31.08/08.98/28.96 | 34.77/12.17/30.88
distilroberta-base | transformer | 42.94/20.03/39.37 | 31.05/08.97/28.93 | 34.77/12.17/30.87
distilroberta-base | transformer_linear | 42.90/20.00/39.34 | 31.13/09.01/29.02 | 34.77/12.18/30.88

3.2. TransformerSum – Results

Model (R1/R2/RL-Sum) | CNN/DM | WikiHow | ArXiv/PubMed
distilbert-base-uncased | 42.71/19.91/39.18 | 30.69/08.65/28.58 | 34.93/12.21/31.00
distilroberta-base | 42.87/20.02/39.31 | 31.07/08.96/28.95 | 34.70/12.16/30.82
bert-base-uncased | 42.78/19.83/39.18 | 30.68/08.67/28.59 | 34.80/12.26/30.92
roberta-base | 43.24/20.36/39.65 | 31.26/09.09/29.14 | 34.81/12.26/30.91
mobilebert-uncased | 42.01/19.31/38.53 | 30.72/08.78/28.59 | 33.97/11.74/30.19
BertSumExt | 43.25/20.24/39.63 | None | None
BertSumExt-large | 43.85/20.34/39.90 | None | None
MatchSum bert-base | 44.22/20.62/40.38 | 31.85/08.98/29.58 | None
MatchSum roberta-base | 44.41/20.86/40.55 | None | None
PEGASUS-base | 41.79/18.81/38.93 | 36.58/15.64/30.01 | 37.39/12.66/23.87
PEGASUS-large HN | 44.17/21.47/41.11 | 41.35/18.51/33.42 | 44.88/18.37/26.58
PEGASUS-large C4 | 43.90/21.20/40.76 | 43.06/19.71/34.80 | 45.10/18.59/26.75

My results: the first five rows (distilbert-base-uncased through mobilebert-uncased).

Smaller: The mobilebert-uncased-ext-sum model achieves 96.59% of the performance of BertSum while containing 4.45x fewer parameters.

Faster: The distilroberta-base-ext-sum model takes 74x less time to train than MatchSum.
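One way to arrive at the 96.59% figure is to average the per-metric ratios of the mobilebert-uncased and BertSumExt scores on CNN/DM. That this is how the number was originally derived is an assumption; the sketch only shows that the arithmetic is consistent with the table.

```python
# ROUGE-1 / ROUGE-2 / ROUGE-L-Sum on CNN/DM, copied from the results table.
mobilebert = (42.01, 19.31, 38.53)   # mobilebert-uncased
bertsum_ext = (43.25, 20.24, 39.63)  # BertSumExt

# Ratio of each metric, then the mean of the three ratios as a percentage.
ratios = [m / b for m, b in zip(mobilebert, bertsum_ext)]
relative = 100 * sum(ratios) / len(ratios)

print(f"{relative:.2f}%")
```

Averaging ratios weights the three ROUGE variants equally, even though ROUGE-2 scores are numerically much smaller than ROUGE-1 scores.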

4. Lecture2Notes – Website

Summary – Five-Fold Contribution

Lecture Video Dataset

• Appropriate categories
• Large variety

Slide Classifier

• Identify important frames in slide presentations

Summarization Models

• Improve state-of-the-art in text summarization
• Novel approaches

End-To-End

• Multimodal approach
• Convert lecture videos to notes

Online Web Service

• Allows people to easily use the created lecture summarizer to create notes based on their own lectures

Future Research

Assess Impact on Education: Summaries of lecture slide sets enhanced the likelihood that students would preview material & sometimes led to better quiz scores. [21]

Machine Learning: Slide classifier (makes future slide analysis easier), dataset of lecture videos, TransformerSum.

Image credits: https://unsplash.com/photos/OyCl7Y4y0Bk, https://www.advectas.com/en/blog/what-is-machine-learning/

Acknowledgements

Dr. Dhiraj Joshi for giving me advice on methodology

Gillian Rinaldo for guiding me

Classmates/Peers for supporting me

Parents for helpful suggestions

References

[1] Szeliski, R. (2010). Computer Vision: Algorithms and Applications. Springer.

[2] Boden, M. (2008). Mind as machine (p. 781). Oxford: Clarendon Press.

[3] Dragonette, L. (2018, April 19). Artificial Intelligence - AI. Retrieved December 6, 2018, from https://www.investopedia.com/terms/a/artificial-intelligence-ai.asp

[4] Machine learning. (2018, December 04). Retrieved from https://en.wikipedia.org/wiki/Machine_learning

[5] Feature (computer vision). (2019). Retrieved from https://en.wikipedia.org/wiki/Feature_(computer_vision)

[6] Ripley, B. (2009). Pattern recognition and neural networks (p. 354). Cambridge: Cambridge University Press.

[7] What is the difference between labeled and unlabeled data?. (2019). Retrieved from https://stackoverflow.com/questions/19170603/what-is-the-difference-between-labeled-and-unlabeled-data/19172720#19172720

[8] Cortes, C., & Vapnik, V. (1995). Support-vector networks. Machine Learning, 20(3), 273-297. doi: 10.1007/bf00994018

[9] A Beginner's Guide to Neural Networks and Deep Learning. (2019). Retrieved from https://skymind.ai/wiki/neural-network

[10] Le, J. (2018). A Tour of The Top 10 Algorithms for Machine Learning Newbies. Retrieved from https://towardsdatascience.com/a-tour-of-the-top-10-algorithms-for-machine-learning-newbies-dde4edffae11

[11] Brownlee, J. (2017). What is the Difference Between Test and Validation Datasets?. Retrieved from https://machinelearningmastery.com/difference-test-validation-datasets/

[12] Krizhevsky, A., Sutskever, I., & Hinton, G. E. (2012). ImageNet Classification with Deep Convolutional Neural Networks. Advances in Neural Information Processing Systems, 1097-1105. Retrieved February 5, 2019.

[13] Image Classification. (n.d.). Retrieved February 5, 2019, from https://cs231n.github.io/classification/

[14] Seif, G. (2018, December 13). How to do everything in Computer Vision. Retrieved February 3, 2019, from https://towardsdatascience.com/how-to-do-everything-in-computer-vision-2b442c469928

[15] Redmon, J., Divvala, S., Girshick, R., & Farhadi, A. (2016). You Only Look Once: Unified, Real-Time Object Detection. 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR). doi:10.1109/cvpr.2016.91

[16] Chablani, M. (2017, August 21). YOLO - You only look once, real time object detection explained. Retrieved February 5, 2019, from https://towardsdatascience.com/yolo-you-only-look-once-real-time-object-detection-explained-492dc9230006

[17] Gandhi, R. (2018, July 09). R-CNN, Fast R-CNN, Faster R-CNN, YOLO - Object Detection Algorithms. Retrieved February 5, 2019, from https://towardsdatascience.com/r-cnn-fast-r-cnn-faster-r-cnn-yolo-object-detection-algorithms-36d53571365e


[18] Krizhevsky, A. (n.d.). Google Scholar. Retrieved February 5, 2019, from https://scholar.google.com/citations?user=xegzhJcAAAAJ

[19] Mahasseni, B., Lam, M., & Todorovic, S. (2017). Unsupervised Video Summarization with Adversarial LSTM Networks. 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR). doi:10.1109/cvpr.2017.318

[20] Yang, H., Siebert, M., Luhne, P., Sack, H., & Meinel, C. (2012). Lecture Video Indexing and Analysis Using Video OCR Technology. Journal of Multimedia Processing and Technologies, 2(4), 176-196. doi:10.1109/sitis.2011.20

[21] Shimada, A., Okubo, F., Yin, C., & Ogata, H. (2017). Automatic Summarization of Lecture Slides for Enhanced Student Preview Technical Report and User Study. IEEE Transactions on Learning Technologies, 11(2), 165-178. doi:10.1109/tlt.2017.2682086

[22] Adcock, J., Cooper, M., Denoue, L., Pirsiavash, H., & Rowe, L. A. (2010). TalkMiner. MM'10 - Proceedings of the ACM Multimedia 2010 International Conference, 241-250. doi:10.1145/1873951.1873986

[23] “Cisco Visual Networking Index: Forecast and Trends, 2017–2022 White Paper.” VNI Global Fixed and Mobile Internet Traffic Forecasts. Cisco, 27 February 2019, https://www.cisco.com/c/en/us/solutions/collateral/service-provider/visual-networking-index-vni/white-paper-c11-741490.html

[24] Liu, Y., & Lapata, M. (2019). Text Summarization with Pretrained Encoders. Retrieved from https://arxiv.org/abs/1908.08345

[25] Sciforce. (2019, January 23). Towards Automatic Text Summarization: Extractive Methods. Retrieved December 27, 2019, from https://medium.com/sciforce/towards-automatic-text-summarization-extractive-methods-e8439cd54715.

[26] He, K., Zhang, X., Ren, S., & Sun, J. (2016). Deep Residual Learning for Image Recognition. 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR). doi: 10.1109/cvpr.2016.90

[27] Pauline A. Nye, Terence J. Crooks, Melanie Powley, and Gail Tripp. Student note-taking related to university examination performance. Higher Education, 13(1):85–97, February 1984. ISSN 0018-1560, 1573-174X. doi: 10.1007/BF00136532. URL http://link.springer.com/10.1007/BF00136532.

[28] Kenneth A. Kiewra. Providing the Instructor’s Notes: An Effective Addition to Student Notetaking. Educational Psychologist, 20(1):33–39, January 1985. ISSN 0046-1520, 1532-6985. doi: 10.1207/s15326985ep2001_5. URL https://www.tandfonline.com/doi/full/10.1207/s15326985ep2001_5.

[29] Jennifer L Austin, Matthew D Thibeault, and Jon S Bailey. Effects of Guided Notes on University Students’ Responding and Recall of Information. Journal of Behavioral Education, page 12, 2002.

[30] Anneh Mohammad Gharravi. Impact of instructor-provided notes on the learning and exam performance of medical students in an organ system-based medical curriculum. Advances in Medical Education and Practice, Volume 9:665–672, September 2018. ISSN 1179-7258. doi: 10.2147/AMEP.S172345. URL https://www.dovepress.com/impact-of-instructor-provided-notes-on-the-learning-and-exam-performan-peer-reviewed-article-AMEP.

[31] Atsushi Shimada, Fumiya Okubo, Chengjiu Yin, and Hiroaki Ogata. Automatic Summarization of Lecture Slides for Enhanced Student Preview Technical Report and User Study. IEEE Transactions on Learning Technologies, 11(2):165–178, April 2018. ISSN 1939-1382, 2372-0050. doi: 10.1109/TLT.2017.2682086. URL https://ieeexplore.ieee.org/document/7879355/.
