CS229 Final Poster, Cong and David (cs229.stanford.edu/proj2017/final-posters/5148536.pdf)


Created Date: 12/12/2017 7:46:12 AM

Model and Algorithms

The entire system includes three parts: encoder, decoder, and objective function. We focused on optimizing the first two.

Encoder Optimization

Gated recurrent neural networks (GRUs) have shown success in applications involving sequential or temporal data, but they are heavily parameterized and expensive to train. We experiment with variations of the GRU to reduce the number of parameters in the network without compromising performance.

The encoder definition:

With the two gates of the variant presented as the update gate and the reset gate:

This reduces the number of parameters in the GRU RNN from 3 × (n² + nm + n) by 2 × mn.
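The parameter reduction can be checked with a quick sketch. Assuming hidden size n and input size m (the values below are purely illustrative), each of the three GRU blocks has an n × n recurrent matrix, an n × m input matrix, and a bias; the variant drops the input matrix from both gates:

```python
# Parameter counts for a standard GRU vs. the reduced-gate variant.
# n = hidden state size, m = input embedding size (illustrative values).
n, m = 600, 620

# Standard GRU: three blocks (candidate, update gate, reset gate),
# each with a U (n x n), a W (n x m), and a bias (n).
full = 3 * (n * n + n * m + n)

# Variant: the update and reset gates drop their W x_t input term,
# removing two n x m matrices.
saved = 2 * m * n
reduced = full - saved

print(full, saved, reduced)
```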

Decoder Optimization

We defined a "Smart Vector" algorithm to better extract each sentence's semantics with unsupervised learning.

There are three decoders: one for the first sentence of each paragraph, one for the next sentence, and one for the previous sentence. A number of linguistics studies have shown that the first sentence of a paragraph has significant semantic relatedness with the rest of the sentences in the same paragraph.

Objective: given a tuple (s_0, s_{i−1}, s_i, s_{i+1}), where s_0 is the first sentence of the paragraph.

The objective optimized is the sum of the log-probabilities for the first sentence of each paragraph and for the forward and backward sentences, conditioned on the encoder representation:
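A minimal sketch of how this objective is accumulated: the per-token probabilities below are toy stand-ins for decoder softmax outputs (not real model values), and `sentence_log_prob` is a hypothetical helper, but the structure — three decoders, each summing token log-probabilities conditioned on the encoder representation h_i — matches the objective above:

```python
import math

def sentence_log_prob(token_probs):
    """Sum of log P(w_t | w_<t, h_i) over the tokens of one sentence."""
    return sum(math.log(p) for p in token_probs)

# Hypothetical decoder outputs P(w_t | w_<t, h_i) for each target sentence.
p_first    = [0.4, 0.7, 0.5]   # decoder for s_0 (first sentence of paragraph)
p_forward  = [0.6, 0.3]        # decoder for s_{i+1} (next sentence)
p_backward = [0.5, 0.5, 0.2]   # decoder for s_{i-1} (previous sentence)

# The training objective is the sum of the three log-probabilities,
# maximized over the corpus.
objective = (sentence_log_prob(p_first)
             + sentence_log_prob(p_forward)
             + sentence_log_prob(p_backward))
print(objective)
```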

Optimized Neural Network Story Generator. Cong Ye and David Wang, CS229 2017, Stanford University.

Introduction

The goal of this project is to combine visual and language processing by building an intelligent storyteller based on input images. We focus on entertainment purposes first; the same model, with appropriate dataset and feature adjustments, could also be used for early childhood education, medical science, and geography research.

To achieve our goal, we first employ unsupervised learning of a generic, distributed sentence encoder. Then, we leverage the continuity of text from novels and movie scripts as our training dataset. The only source of supervision in our models comes from Microsoft COCO image-caption pairs. That is, we did not collect any new training data to directly predict stories given images.

Data Collection

• For training the visual-semantic embedding model, we use the Microsoft COCO dataset.
• For training the sentence semantic "vectors", we use the BookCorpus dataset from the University of Toronto.
• To evaluate our smart vector performance, we use the SICK dataset.

Research

We divided the potential areas that could help our project into two parts: image captioning and sentence semantics.

For image captioning, we found some early-stage work. For example, some researchers used a CRF labeling method to build a system that automatically generates natural language descriptions from images. A system with better performance and accuracy was later built with a CNN, and similar goals were also achieved with an m-RNN.

The paper that inspired us most describes the skip-thought vector algorithm for tracking sentence semantics: sentences that share semantic and syntactic properties are mapped to similar vector representations.
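The mapping described above makes retrieval simple: once every sentence is a vector, "similar" means high cosine similarity. A minimal sketch, with toy vectors standing in for real encoder outputs:

```python
import numpy as np

def cosine(a, b):
    """Cosine similarity between two sentence vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Toy stand-ins for encoder outputs; in practice these come from the
# trained sentence encoder.
sentence_vecs = {
    "query":     np.array([0.9, 0.1, 0.3]),
    "related":   np.array([0.8, 0.2, 0.35]),
    "unrelated": np.array([-0.5, 0.9, 0.0]),
}

# Nearest neighbor of the query under cosine similarity.
q = sentence_vecs["query"]
nearest = max(
    (k for k in sentence_vecs if k != "query"),
    key=lambda k: cosine(q, sentence_vecs[k]),
)
print(nearest)  # -> "related"
```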

Future Work

• Further improve training efficiency and reduce the computational complexity of the network parameterization.
• In the future, instead of producing a simple description of a picture, we want to optimize our model to understand the meaning of the picture.

Evaluation Results

Encoder equations (GRU):

h_t = (1 − z_t) ⊙ h_{t−1} + z_t ⊙ h̃_t
h̃_t = g(W_h x_t + U_h (r_t ⊙ h_{t−1}) + b_h)
z_t = σ(W_z x_t + U_z h_{t−1} + b_z)
r_t = σ(W_r x_t + U_r h_{t−1} + b_r)

Parameters in the standard GRU: 3 × (n² + nm + n)

Variant gates, with the input term dropped:

z_t = σ(U_z h_{t−1} + b_z)
r_t = σ(U_r h_{t−1} + b_r)

Parameters saved: 2 × mn
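A sketch of the variant's forward step with toy dimensions. The nonlinearity g is taken to be tanh here (an assumption; the poster leaves g unspecified), and the weights are random placeholders rather than trained values:

```python
import numpy as np

# Reduced-gate GRU step: the update (z) and reset (r) gates depend only on
# the previous hidden state, not on the current input x_t.
rng = np.random.default_rng(0)
n, m = 4, 3  # toy hidden size and input size

# Candidate-state weights keep the full parameterization.
Wh = rng.standard_normal((n, m))
Uh = rng.standard_normal((n, n))
bh = np.zeros(n)
# Gate weights have no W x_t term in this variant.
Uz = rng.standard_normal((n, n)); bz = np.zeros(n)
Ur = rng.standard_normal((n, n)); br = np.zeros(n)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def gru_variant_step(x_t, h_prev):
    z = sigmoid(Uz @ h_prev + bz)                       # update gate (no input term)
    r = sigmoid(Ur @ h_prev + br)                       # reset gate  (no input term)
    h_cand = np.tanh(Wh @ x_t + Uh @ (r * h_prev) + bh)  # candidate state
    return (1 - z) * h_prev + z * h_cand

h = np.zeros(n)
h = gru_variant_step(rng.standard_normal(m), h)
print(h.shape)  # (4,)
```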

Query and nearest sentence:

Query:   if he had a weapon, he could maybe take out their last imp, and then beat up errol and vanessa.
Nearest: if he could ram them from behind, send them sailing over the far side of the levee, he had a chance of stopping them.

Query:   then, with a stroke of luck, they saw the pair head together towards the portaloos.
Nearest: then, from out back of the house, they heard a horse scream probably in answer to a pair of sharp spurs digging deep into its flanks.

Query:   "I'll take care of it," goodman said, taking the phonebook.
Nearest: "I'll do that," julia said, coming in.

The table shows nearest neighbors of sentences from a smart vector model trained on the BookCorpus dataset. These results show that smart vectors learn to accurately capture semantics and syntax of the sentences they encode.

[Bar chart: Semantic Relatedness (y-axis, 0 to 6), comparing GT against PD for 19 predictions from the SICK test set.]
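A common way to summarize such GT-vs-PD comparisons is Pearson correlation. The sketch below uses illustrative score pairs, not our actual results:

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length score lists."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

# Illustrative ground-truth and predicted relatedness scores (1-5 scale).
gt_scores = [4.3, 3.4, 4.6, 3.2, 5.0, 4.5, 1.9]
pd_scores = [4.2, 3.5, 4.4, 3.3, 4.8, 4.4, 2.1]

print(pearson(gt_scores, pd_scores))
```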

Predictions from the SICK test set. GT is the ground-truth relatedness marked in the dataset, scored between 1 and 5, and PD is our model's judgment.

Story Generated:

I woman was on the beach , holding her breath . She gave me a quick hug , and she had no idea what to do . In fact , it seemed as if I had never seen her come out of the surf . In fact , I was going to be the only woman in the world for the past twenty-four hours . She shook her head and bowed her head over my shoulder . In fact , it was so much easier for him to go on a tropical beach at The Shade . She felt as if I were the only woman in the world , standing naked on a sandy beach .

Nearest Captions:
A woman on the beach has a pink hat and umbrella.
A woman is standing on the beach with a red umbrella.
A women on a beach holding a pink umbrella.
A woman walking across a beach in her panties and a blue shirt.
The woman in a dress stands by the shoreline and waves streamers all around.

Decoder equations:

h_t = (1 − z_t) ⊙ h_{t−1} + z_t ⊙ h̃_t
h̃_t = g(W^d x_t + U^d (r_t ⊙ h_{t−1}) + b_h)
z_t = σ(W_z^d x_{t−1} + U_z^d h_{t−1} + b_z)
r_t = σ(W_r^d x_{t−1} + U_r^d h_{t−1} + b_r)

Training tuple: (s_0, s_{i−1}, s_i, s_{i+1})

Objective:

Σ_t log P(w_t^0 | w_{<t}^0, h_i) + Σ_t log P(w_t^{i+1} | w_{<t}^{i+1}, h_i) + Σ_t log P(w_t^{i−1} | w_{<t}^{i−1}, h_i)

Test with an online picture.