Data-Driven Student Feedback for Programming-Intensive MOOCs
Towards global-scale CS education
Jonathan Huang, Stanford University
with Leonidas Guibas, Chris Piech, and Andy Nguyen
"All of my working-class parents' savings were being spent on my college tuition… …the minute I dropped out [of college] I could stop taking the required classes that didn't interest me, and begin dropping in on the ones that looked interesting."
Steve Jobs, Stanford, 2005
Course selection is better online
MOOC = Massive Open Online Course
Untapped potential
CS does not count towards high school math/science requirements in 36 of 50 states.
[Two charts, y-axes in thousands: one spanning academic years 1973-74 through 2013-14, one projecting 2011 through 2021. Source: www.code.org, www.collegeboard.org]
Stanford ML-class: 400 students on campus vs. 100,000 students online.
10 TAs on campus, so 2,500 TAs (???) at the same student-to-TA ratio online.
Ease of global-scale feedback lies on a spectrum, from short response to long response: multiple choice is easy to grade at scale; essay questions and proofs are hard.
Today: programming assignments.
function [theta, J_history] = gradientDescent(X, y, theta, alpha, num_iters)
%GRADIENTDESCENT Performs gradient descent to learn theta
%   theta = GRADIENTDESCENT(X, y, theta, alpha, num_iters) updates theta by
%   taking num_iters gradient steps with learning rate alpha
m = length(y); % number of training examples
J_history = zeros(num_iters, 1);
for iter = 1:num_iters
    theta = theta - alpha*1/m*(X'*(X*theta-y));
    J_history(iter) = computeCost(X, y, theta);
end
Feedback for Coding Assignments: Easy?
Run each submission on test inputs, compare its test outputs against the expected ones, and mark it correct or incorrect (a minimal grading sketch follows).
Example: a linear regression submission (Homework 1) for Coursera's ML class.
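In sketch form, output-based grading is simple. A hedged illustration: run_submission and the test list are hypothetical stand-ins, not part of the original deck.

def grade(code, tests, run_submission):
    # tests: list of (input, expected_output) pairs from the instructor.
    # run_submission(code, x): hypothetical helper that executes the
    # student program on one test input and returns its output.
    for x, expected in tests:
        if run_submission(code, x) != expected:
            return "Incorrect"  # any mismatched test output fails the submission
    return "Correct"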
The "but it works!!" solution:

function [theta, J_history] = gradientDescent(X, y, theta, alpha, num_iters)
m = length(y);
J_history = zeros(num_iters, 1);
for iter = 1:num_iters
    hypo = X*theta;
    newMat = hypo - y;
    trans1 = (X(:,1));
    trans1 = trans1';
    newMat1 = trans1 * newMat;
    temp1 = sum(newMat1);
    temp1 = (temp1*alpha)/m;
    A = [temp1];
    theta(1) = theta(1) - A;
    trans2 = (X(:,2))';
    newMat2 = trans2*newMat;
    temp2 = sum(newMat2);
    temp2 = (temp2*alpha)/m;
    B = [temp2];
    theta(2) = theta(2) - B;
    J_history(iter) = computeCost(X, y, theta);
end
theta(1) = theta(1);
theta(2) = theta(2);

Why??
Correctness: Good. Efficiency: Good. Style: Poor. Elegance: Poor.
Better: theta = theta - (alpha/m)*X'*(X*theta-y)
Let's do this for a class of 100,000, and in real time.
But… it can't require too much instructor effort to make this work for: new programming problems, new courses, new programming languages.
We now have massive datasets…
Visualization of 40,000 implementations of linear regression submitted to Coursera’s ML course
[Moocshop, 2013]
[Chart: # students, with marks at 1K, 10K, and 20M; label: Intro CS]
Codewebs Engine
• Efficient index for "code phrases" of a MOOC dataset
• Shared structure discovery amongst many student submissions
• Applications such as bug finding (without execution) and MOOC-scale feedback
• Results on a real MOOC with > 1 million submissions
First, an example application. Two gradientDescent submissions are identical except for the update line inside the loop:

Correct:
theta = theta - alpha*1/m*(X'*(X*theta-y));

Incorrect:
theta = theta - alpha*1/m*sum(X'*(X*theta-y));
Dear Lisa Simpson, consider the dimension of the expression X'*(X*theta-y) and what happens after you call sum on it…
Syntax-based approach: attach this message to every submission containing that exact expression (covers 99 submissions).
The extraneous sum bug takes many forms…
(Easier) Output-based approach: attach the message to every submission that matched the extraneous-sum bug in unit-test output (covers 1,091 submissions; a grouping sketch follows the variant list below).
theta = theta-alpha*1/m*sum(X'*(X*theta-y));
theta = theta-alpha*1/m*sum(((theta'*X')'-y)'*X);
theta = theta-alpha*1/m*sum(transpose(X*theta-y)*X);
…
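A hedged sketch of the output-based approach (run_tests is an assumed helper, not part of the deck): group submissions by their unit-test output signature, then attach the instructor's hint to every submission whose signature matches the known bug.

from collections import defaultdict

def group_by_test_output(submissions, run_tests):
    # submissions: {submission_id: source code}
    # run_tests(code): assumed to return a tuple of outputs on the unit tests
    groups = defaultdict(list)
    for sid, code in submissions.items():
        groups[run_tests(code)].append(sid)
    return groups

# Every submission whose signature matches the extraneous-sum bug gets the
# same "consider the dimension…" message, covering all 1,091 matches.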
Codewebs approach to feedback

theta = theta - alpha*1/m*sum(X'*(X*theta-y));

Step 1: Find equivalent ways of writing the buggy expression using the Codewebs engine.
Step 2: Write a thoughtful/meaningful hint or explanation.
Step 3: Propagate the feedback message to any submission containing an equivalent expression.

Number of submissions covered by a single message: output based 1,091; Codewebs 1,208; combined 1,604.
That is a ~47% improvement over just using an output-based feedback system!!
Abstract syntax tree representations

ASTs ignore: whitespace, comments, …

function A = warmUpExercise()
A = [];
A = eye(5);
endfunction

AST fragment for A = eye(5):
ASSIGN
  IDENT (A)
  INDEX_EXP
    IDENT (eye)
    ARGUMENT_LIST
      CONST (5)
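As a concrete illustration (a hypothetical structure, not the Codewebs internals), an AST is just a small labeled tree; the instance below mirrors the A = eye(5) fragment above.

from dataclasses import dataclass, field

@dataclass
class Node:
    kind: str                 # e.g. "ASSIGN", "IDENT", "CONST"
    label: str = ""           # identifier name or constant value, if any
    children: list = field(default_factory=list)

# AST for the statement A = eye(5); whitespace and comments never appear.
ast = Node("ASSIGN", children=[
    Node("IDENT", "A"),
    Node("INDEX_EXP", children=[
        Node("IDENT", "eye"),
        Node("ARGUMENT_LIST", children=[Node("CONST", "5")]),
    ]),
])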
Indexing documents by phrases

term/phrase    document list
best           {1, 3}
blue           {2, 4, 6}
bright         {7, 8, 10, 11, 12}
heat           {1, 5, 13}
kernel         {2, 5, 6, 9, 56}
sky            {1, 2}
submarine      {2, 3, 4}
woes           {10, 19, 38}
yellow         {2, 4}

Example documents: "The bright and blue butterfly hangs on the breeze…", "We all something something yellow submarine…"
Example queries: "blue sky", "yellow submarine"
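A minimal sketch of this classic inverted index (standard information retrieval, shown here only to set up the AST analogue):

from collections import defaultdict

def build_inverted_index(docs):
    # docs: {doc_id: text}; returns {term: set of doc_ids containing it}
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in text.lower().split():
            index[term].add(doc_id)
    return index

def query(index, phrase):
    # Documents containing every term of the phrase: candidates for a match.
    postings = [index[term] for term in phrase.lower().split()]
    return set.intersection(*postings) if postings else set()

# e.g. query(index, "yellow submarine") intersects the postings
# {2, 4} and {2, 3, 4} from the table above, returning {2, 4}.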
What basic queries should an AST search engine support?

Code phrases: subtrees and subforests of an AST. For example, the AST for X'*(X*theta-y):
BINARY_EXP (*)
  POSTFIX (')
    IDENT (X)
  BINARY_EXP (-)
    BINARY_EXP (*)
      IDENT (X)
      IDENT (theta)
    IDENT (y)
Code phrases also include the context within a larger subtree: the same AST with one subtree cut out and replaced by a replacement site:
BINARY_EXP (*)
  POSTFIX (')
    IDENT (X)
  BINARY_EXP (-)
    [replacement site]
    IDENT (y)
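A hedged sketch of code-phrase extraction, with ASTs as nested tuples (kind, child, child, …): a subtree rooted at any node is one kind of code phrase, and cutting a subtree out leaves a context with a replacement site. Subforests (sets of sibling subtrees) are omitted for brevity.

def subtrees(ast):
    # Yield every subtree of the AST (each is itself a code phrase).
    yield ast
    for child in ast[1:]:
        if isinstance(child, tuple):
            yield from subtrees(child)

def contexts(ast):
    # Yield copies of the AST with exactly one subtree cut out and
    # replaced by a placeholder marking the replacement site.
    for i, child in enumerate(ast[1:], start=1):
        if isinstance(child, tuple):
            yield ast[:i] + ("<replacement site>",) + ast[i + 1:]
            for inner in contexts(child):
                yield ast[:i] + (inner,) + ast[i + 1:]

# Example: the minus subtree of X'*(X*theta-y)
expr = ("BINARY_EXP(-)",
        ("BINARY_EXP(*)", ("IDENT(X)",), ("IDENT(theta)",)),
        ("IDENT(y)",))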
The Codewebs index: the same structure, keyed by hashes of code phrases, with student programs as the "documents" (e.g. small scripts like 10 print "hello" / 20 goto 10, or 10 for i=1:10 / 20 x = x+1 / 30 end).

code phrase hash                            AST list
2ccf02adb1cbabfb347d3b5d0a05b249855a7583    {1, 3}
b3bc37a318c2b895b3e644a12cfc6ebcfa5a06bd    {2, 4, 6}
b3353c96e2cee8ee6c3ba260e037a93ca0ba3a5e    {7, 8, 10, 11, 12}
2c01571626bf01338c8cdb15cf9d844d65f04645    {1, 5, 13}
313f48d3f5888afc5d5aa28ab1393d94661edd31    {2, 5, 6, 9, 56}
61d4bfccaa97cca2004102a297cc5acd281ea3a9    {1, 2}
467b4d400aab42d3bf96a119c4620e74d6fe57b3    {2, 3, 4}
1ae95f6fa24bc25871cdc55cb472abdd68db93de    {10, 19, 38}
Naive indexing is very expensive, especially for large ASTs:
def buildIndex():
    for A in ASTs:
        for every code phrase x contained in A:  # subtrees, subforests, contexts
            compute hashcode h[x]
            insert A at h[x]
Hashing code phrases

Example AST for X*theta - y:
BINARY_EXP (-)
  BINARY_EXP (*)
    IDENT (X)
    IDENT (theta)
  IDENT (y)

Step 1. Create a postorder listing of the nodes: IDENT (X), IDENT (theta), BINARY_EXP (*), IDENT (y), BINARY_EXP (-).
Step 2. Hash the postorder list, e.g. with a polynomial (Rabin-Karp style) hash, which is what the prefix-hash trick below relies on: h(x_1, …, x_n) = (x_1*p^(n-1) + x_2*p^(n-2) + … + x_n) mod q, for a prime base p and modulus q.
Recycling hash computations

Observation: hashing a sublist of the postorder listing gives the hash of a code phrase!

Idea (dynamic programming): store prefix hashes and prime powers for all prefixes of the postorder listing, which takes O(n) time and space. After this precomputation, any other hash can be computed in constant time!
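A hedged sketch of this dynamic program in Python (the base P and modulus Q are assumed values, and Python's built-in hash stands in for per-node label hashing):

P, Q = 1_000_003, (1 << 61) - 1  # prime base and modulus (assumptions)

def precompute(postorder):
    # postorder: list of node labels. prefix[i] is the hash of the first i
    # labels; powers[i] is P**i mod Q. Both take O(n) time and space.
    prefix, powers = [0], [1]
    for label in postorder:
        prefix.append((prefix[-1] * P + hash(label)) % Q)
        powers.append((powers[-1] * P) % Q)
    return prefix, powers

def sublist_hash(prefix, powers, i, j):
    # Hash of postorder[i:j] in O(1): strip the contribution of the
    # first i labels from the prefix hash of the first j labels.
    return (prefix[j] - prefix[i] * powers[j - i]) % Q

Since every subtree occupies a contiguous range of the postorder listing, each subtree hash is a single sublist_hash call after the one-time O(n) precomputation.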
Indexing is fast in practice
[Chart: time for indexing 1,000 ASTs; runtime (0-3 seconds) vs. average AST size (0-500 nodes)]
Application: statistical bug finding

Is sum(X'*(X*theta-y)) likely to be a bug, vs. the solution X'*(X*theta-y)?

Query the index for the phrase and look up the unit-test results of every AST containing it:
Fail, Fail, Fail, Pass, Fail, Fail, …
83% of ASTs containing this code phrase were buggy!
Compute a bug probability for all subforests and return the smallest bugs found.
There are many ways to formulate probabilistic bug localization (we compute probabilities on local contexts).
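In sketch form (the data structures are assumptions, reusing the index shape from the sketches above):

def bug_probabilities(index, failing):
    # index: {phrase_hash: set of submission ids containing the phrase}
    # failing: set of submission ids that fail the unit tests
    # Returns the empirical P(fail | phrase) for every indexed code phrase.
    return {h: len(ids & failing) / len(ids)
            for h, ids in index.items() if ids}

A phrase scoring 0.83 here corresponds to the "83% buggy" statistic above; localization then reports the smallest subforests whose estimated probability is high.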
Bug detection accuracy
[Scatter plot: bug detection F-score of Codewebs (x-axis, 0-1) vs. baseline (y-axis, 0-1); higher is better on each axis. Each point represents a single coding problem, e.g. neural net training with backpropagation, logistic regression objective, linear regression with gradient descent. Bubble size = average # nodes per submitted AST.]
More than one way to skin a cat…

Canonicalization: apply semantics-preserving transformation rules to ASTs to increase the probability of a match. For example, X*(Y+Z) can also be written as:
X*(Z+Y), (Y+Z)*X, X*Z+X*Y, Z*X+X*Y, Z*X+Y*X,
X*1*(Y+Z), X*[1;1]'*[Y;Z], X*ones(1,2)*[Y;Z], X*repmat(1,2,1)'*[Y;Z],
X*ones(1,length([Y;Z]))*[Y;Z], X*transpose([1;1])*[Y;Z], X*repmat(1,size([Y;Z],1),1)'*[Y;Z], …

Difficulties with canonicalization:
1. Impossible to predict all ways of writing the same thing.
2. Canonicalization rules are not typically generalizable across languages.

Codewebs approach:
• Use data to determine canonicalization rules.
• Customize rules to each assignment.
• The rules don't need to be perfect; we're not building a compiler!
Here's the idea: use the data to decide when two expressions agree.

def residual(X, theta, y):
    hypothesis = X * theta
    solution = hypothesis - y
    return solution

def residual(X, theta, y):
    hypothesis = (theta' * X')'
    solution = hypothesis - y
    return solution

Counter-example: agreement can be context dependent.

def foo():
    solution = solveProblem()
    print(solution)
    return solution

def foo():
    solution = solveProblem()
    print(!solution)
    return solution

= ?
Join on context: query the index for each candidate subtree, restricted to occurrences in the same surrounding context, and compare unit-test outcomes.
Query 1 outcomes: Fail, Fail, Pass, Pass, Fail, Pass
Query 2 outcomes: Fail, Fail, Pass, Pass, Fail, Pass
Identical outcome patterns: 100% probability of equivalence!
** fine print: need to account for sample size in general
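In sketch form (the data structures are assumptions, and per the fine print a real estimator must also account for sample size):

def probably_equivalent(outcomes_a, outcomes_b):
    # outcomes_a/b: {context_hash: frozenset of unit-test outcomes observed
    # when that subtree appears in that context}, built from index queries.
    # Two subtrees look semantically equivalent if their outcomes agree in
    # every context they share.
    shared = outcomes_a.keys() & outcomes_b.keys()
    return bool(shared) and all(outcomes_a[c] == outcomes_b[c] for c in shared)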
Workflow

theta = theta - alpha*1/m*(X'*(X*theta-y));

Human provides semantic labels for a few key subexpressions ("alphaOverM", "prediction", "residual"); the equivalence classes {m}, {alphaOverM}, {hypothesis}, {residual} are then discovered from the data:

{m}: length(y), size(X, 1), size(y, 1), rows(X), m, rows(y), length(X), length(x(:, 1)), size(X)(1)

{hypothesis}: (theta' * X')', (X * theta), theta(1) + theta(2) * X(:, 2), (X * theta(:)), [X] * theta, sum(X .* repmat(theta', {m}, 1), 2), …

{residual}: (theta' * X' - y')', (X * theta - y), ({hypothesis} - y), ({hypothesis}' - y')', [{hypothesis} - y], sum({hypothesis} - y, 2), …

{alphaOverM}: alpha * (1 ./ {m}), alpha * (1 / {m}), alpha * 1 ./ {m}, .01 / {m}, alpha * {m} ^ -1, alpha .* (1 ./ {m}), 1 / {m} * alpha, alpha ./ {m}, alpha .* (1 / {m}), alpha * inv({m}), 1 .* alpha ./ {m}, alpha * pinv({m}), alpha / {m}, alpha .* 1 / {m}, 1 * alpha / {m}
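A hedged sketch of how learned classes could be applied (string-level rewriting for brevity, with abbreviated class lists; Codewebs operates on ASTs):

EQUIV = {  # {m} comes first so later members that mention {m} can match
    "{m}": ["length(y)", "size(X, 1)", "size(y, 1)", "rows(X)", "rows(y)"],
    "{hypothesis}": ["(theta' * X')'", "(X * theta)"],
    "{alphaOverM}": ["alpha * (1 / {m})", "alpha ./ {m}", "alpha / {m}"],
}

def canonicalize(code):
    # Replace every known member of each class with the class label, so
    # equivalent submissions collapse to one canonical form.
    for label, members in EQUIV.items():
        for member in sorted(members, key=len, reverse=True):  # longest first
            code = code.replace(member, label)
    return code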
Canonicalization improves bug detection accuracy
[Chart: F-score (0.65-0.95, higher is better) vs. # unique ASTs considered; two series: with canonicalization and without canonicalization]
How many submissions can we give feedback to with fixed effort?
[Chart: # submissions covered (out of 40,000) vs. # equivalence classes (0-19); series: with 25 ASTs marked (canonicalization) and with 200 ASTs marked (no canonicalization)]
If we can find shared structure, we can facilitate feedback.
In education: ASTs, proofs, essays, architecture, poems…
Shared phrases appear everywhere: "gradient", "residual", "learning rate" in machine learning; "base case", "inductive hypothesis" in proofs; "apartheid", "Afrikaner Calvinism", "elections in South Africa" in essays.
Data is revolutionizing many fields
And it can revolutionize education too!
Thank you!! jhuang11@stanford.edu http://www.stanford.edu/~jhuang11 @jonathanhuang11