Alisha Rege ([email protected]) | cs229.stanford.edu › proj2016 › poster › ...
Viewpoint Invariant Person Classification in RGB-D Data
Alisha Rege ([email protected])
Purpose
Artificial intelligence can play a key role in healthcare; however, patient confidentiality requirements (HIPAA) mean such data cannot be processed without safeguards. Here the safeguard is RGB-D data, which prevents a face or any other distinguishing personal characteristic from being visible in the videos. This project attempts to detect a person from any viewpoint in Stanford Health's RGB-D data, so that nurses and doctors can sense problems such as a person suddenly falling or not moving for days. A 6-layer CNN classifier is used to classify the object.
Dataset
• Dataset from Stanford’s Lucile Packard Children’s Hospital
• RGB-D data from 3 different viewpoints
• Hand-labelled data
• Previous project created bounding boxes for objects
• Large variations in viewpoints, object appearance and pose, and object scale
Acknowledgement
Thank you to Stanford Health Center and the Stanford Artificial Intelligence Vision Lab for the dataset.

Convolutional Neural Network Implementation
6-Layer CNN w/ Dropout: [1][2]
• Input (56 x 56 x 1)
• Convolutional layer (5x5 convolution, 32 filters w/ bias) & ReLU
• Pooling layer (2x down-sampling)
• Convolutional layer (5x5 convolution, 64 filters w/ bias) & ReLU
• Pooling layer (2x down-sampling)
• Fully-connected layer (14 x 14 x 64 inputs -> 1024 outputs)
• Output (1024 inputs -> 2 classes)
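The layer stack above can be sanity-checked by propagating the input shape through each stage. This is a minimal sketch; the 'same' padding for the 5x5 convolutions is an assumption, inferred from the 14 x 14 x 64 fully-connected input, which implies only the pooling layers shrink the spatial dimensions:

```python
# Propagate a 56x56x1 input through the layer stack to confirm
# the 14*14*64 fully-connected input size quoted above.
# Assumes 'same'-padded 5x5 convolutions, so only pooling shrinks H and W.

def conv_same(h, w, c_out):
    return h, w, c_out          # 'same' padding keeps spatial size

def pool2(h, w, c):
    return h // 2, w // 2, c    # 2x down-sampling

h, w, c = 56, 56, 1
h, w, c = conv_same(h, w, 32)   # conv 5x5, 32 filters + ReLU
h, w, c = pool2(h, w, c)        # -> 28 x 28 x 32
h, w, c = conv_same(h, w, 64)   # conv 5x5, 64 filters + ReLU
h, w, c = pool2(h, w, c)        # -> 14 x 14 x 64
fc_in = h * w * c
print(fc_in)                    # 12544 = 14*14*64, feeding the 1024-unit layer
```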
Dataset Preprocessing
• Translating video into frames
• Cropping annotations to feed into the network
• Resizing all images to 56 x 56 x 1
• Batch size: 50
• Cleaning data
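The cropping and resizing steps can be sketched with NumPy. This is a minimal sketch under assumptions: the (x, y, width, height) bounding-box format and nearest-neighbor resizing are illustrative choices, not details from the poster:

```python
import numpy as np

def crop_and_resize(frame, box, size=56):
    """Crop a bounding box from a single-channel (depth) frame and
    nearest-neighbor resize it to size x size x 1 (here 56x56x1)."""
    x, y, w, h = box                      # assumed (x, y, width, height) format
    patch = frame[y:y + h, x:x + w]
    rows = np.arange(size) * h // size    # nearest-neighbor row indices
    cols = np.arange(size) * w // size    # nearest-neighbor column indices
    resized = patch[np.ix_(rows, cols)]
    return resized[..., np.newaxis]       # add channel axis -> 56 x 56 x 1

frame = np.arange(480 * 640, dtype=np.float32).reshape(480, 640)
out = crop_and_resize(frame, (100, 50, 120, 200))
print(out.shape)  # (56, 56, 1)
```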
Mean Average Precision (mAP)
Discussion
References
[1] Yann LeCun et al.: LeNet-5, convolutional neural networks. http://yann.lecun.com/exdb/lenet/. Retrieved: April 22, 2015.
[2] Martín Abadi, Ashish Agarwal, et al. TensorFlow: Large-scale machine learning on heterogeneous systems, 2015. Software available from tensorflow.org.
Baseline
• SVM w/ HOG descriptors
• HOG descriptors: slide a window across the image and count gradient orientations in each direction
• Uses NMS, so only one major object is kept within a given pixel range
• Skimage implementation
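The non-maximum suppression step (keeping only one major detection per pixel neighbourhood) can be sketched in NumPy. This is an illustrative sketch, not the poster's exact skimage pipeline; the 0.5 IoU threshold and the (x1, y1, x2, y2) box format are assumptions:

```python
import numpy as np

def nms(boxes, scores, iou_thresh=0.5):
    """Greedy non-maximum suppression.
    boxes: (N, 4) array of (x1, y1, x2, y2); scores: (N,) confidences.
    Returns indices of kept boxes, highest score first."""
    order = np.argsort(scores)[::-1]
    keep = []
    while order.size > 0:
        i = order[0]
        keep.append(i)
        rest = order[1:]
        # intersection of box i with each remaining box
        xx1 = np.maximum(boxes[i, 0], boxes[rest, 0])
        yy1 = np.maximum(boxes[i, 1], boxes[rest, 1])
        xx2 = np.minimum(boxes[i, 2], boxes[rest, 2])
        yy2 = np.minimum(boxes[i, 3], boxes[rest, 3])
        inter = np.maximum(0, xx2 - xx1) * np.maximum(0, yy2 - yy1)
        area_i = (boxes[i, 2] - boxes[i, 0]) * (boxes[i, 3] - boxes[i, 1])
        area_r = (boxes[rest, 2] - boxes[rest, 0]) * (boxes[rest, 3] - boxes[rest, 1])
        iou = inter / (area_i + area_r - inter)
        order = rest[iou <= iou_thresh]   # drop detections overlapping box i
    return keep

boxes = np.array([[0, 0, 10, 10], [1, 1, 11, 11], [50, 50, 60, 60]], dtype=float)
scores = np.array([0.9, 0.8, 0.7])
print(nms(boxes, scores))  # the two heavily overlapping boxes collapse to one
```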
Convolutional layer:

$$x_{ij}^{\ell} = \sum_{a=0}^{m-1} \sum_{b=0}^{m-1} \omega_{ab}\, y_{(i+a)(j+b)}^{\ell-1}$$

$$y_{ij}^{\ell} = \sigma\left(x_{ij}^{\ell}\right) \quad \text{(nonlinearity)}$$

• $m \times m$ convolution size
• $x_{ij}^{\ell}$ is computed for every selected pixel

SVM objective:

$$\min_{w}\ \frac{1}{2}\|w\|^2 + \lambda \sum_{i} \max\left(0,\ 1 - y_i\, w^{\top} x_i\right)$$

Results
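The convolutional-layer formula above (a sum of kernel weights times the previous layer's activations, followed by a nonlinearity) can be checked with a small NumPy sketch. The values are illustrative: a 3x3 kernel ($m = 3$) and 'valid' (no-padding) convolution are assumptions here, rather than the network's 5x5 layers:

```python
import numpy as np

def conv_valid(y_prev, w):
    """x[i, j] = sum_{a=0}^{m-1} sum_{b=0}^{m-1} w[a, b] * y_prev[i+a, j+b]
    for an m x m kernel (valid convolution, no padding)."""
    m = w.shape[0]
    H, W = y_prev.shape
    x = np.zeros((H - m + 1, W - m + 1))
    for i in range(H - m + 1):
        for j in range(W - m + 1):
            x[i, j] = np.sum(w * y_prev[i:i + m, j:j + m])
    return x

def relu(x):
    # the sigma(.) nonlinearity; the poster's layers use ReLU
    return np.maximum(0, x)

rng = np.random.default_rng(0)
y_prev = rng.standard_normal((6, 6))   # previous layer's activations
w = rng.standard_normal((3, 3))        # 3x3 kernel (m = 3)
y = relu(conv_valid(y_prev, w))
print(y.shape)  # (4, 4)
```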
Future Work
• Try different camera types to distinguish doctor/nurse/etc.
• Use the information to detect anomalies
$$mAP = \frac{1}{num\_entries} \sum_{c=1}^{num\_classes} AP(c)$$

$$AP(c) = \frac{\sum_{k} P(k)\, rel(k)}{\sum_{k} rel(k)}$$

$$\%accuracy = \left(1 - \frac{Label_{correct} - Label_{predicted}}{Label_{correct}}\right) \times 100$$
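The AP and mAP definitions can be implemented directly. A minimal sketch, with assumptions: $rel(k)$ is 1 when the k-th ranked prediction is relevant, $P(k)$ is precision at cutoff k, and the outer average is taken over the per-class AP values:

```python
def average_precision(rel):
    """AP(c) = sum_k P(k) * rel(k) / sum_k rel(k), where rel is the binary
    relevance of each ranked prediction for class c."""
    hits, score = 0, 0.0
    for k, r in enumerate(rel, start=1):
        if r:
            hits += 1
            score += hits / k        # P(k): precision at cutoff k
    return score / sum(rel)

def mean_average_precision(rel_per_class):
    """mAP: average of AP(c) over the classes."""
    return sum(average_precision(r) for r in rel_per_class) / len(rel_per_class)

# Ranked relevance lists for two classes: 1 = correct detection at that rank.
ap = average_precision([1, 0, 1])            # (1/1 + 2/3) / 2
m = mean_average_precision([[1, 0, 1], [0, 1]])
print(ap, m)
```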
Classifier                        Training mAP   Training Accuracy   Testing mAP   Testing Accuracy
SVM w/ HOG descriptors            .702           .609                .542          .532
CNN over cropped images           .977           .971                .693          .590
CNN over entire image             .723           .624                .560          .540
CNN with double representation    .985           .981                .705          .650
CNN with unequal representation   .985           .904                .693          .608
CNN with Histogram Equalization   .988           .982                .715          .667
CNN on precise cropped images     .999           .999                .912          .890
[Figure: "Epoch vs Mean Average Precision". Training and testing accuracy plotted against epoch (0 to 800), with mAP on the y-axis from 0.6 to 1.0.]
Epoch 410 is where the maximum value is reached.

Histogram Equalization:

Effect of Training Size on Output:

Viewpoint      Number of Training Images   Testing mAP
Top-Down       544Y/377N                   .71
Mid-top Wall   767Y/383N                   .81
Hallway        374Y/284N                   .64
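The histogram equalization used as preprocessing (one of the better-performing variants in the results table) can be sketched in NumPy. This 8-bit, CDF-based version is an assumption, since the poster does not specify the implementation:

```python
import numpy as np

def hist_equalize(img):
    """Spread an 8-bit image's intensity histogram over the full 0-255 range
    via the cumulative distribution function (CDF)."""
    hist = np.bincount(img.ravel(), minlength=256)
    cdf = hist.cumsum()
    cdf_min = cdf[cdf > 0][0]                 # first non-zero CDF value
    # classic equalization mapping: scale the CDF to [0, 255]
    # (lookup-table entries below img.min() are never indexed)
    lut = np.round((cdf - cdf_min) / (img.size - cdf_min) * 255).astype(np.uint8)
    return lut[img]

# a low-contrast image concentrated in [100, 120] stretches toward [0, 255]
img = np.clip(np.random.default_rng(1).normal(110, 5, (56, 56)), 100, 120).astype(np.uint8)
out = hist_equalize(img)
print(img.min(), img.max(), out.min(), out.max())
```

Equalization helps here because depth images of a dim hospital room tend to concentrate intensities in a narrow band; stretching the histogram makes person/background contrast more consistent across viewpoints.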