Transcript of: The Symbiosis of Program Analysis and Machine Learning (MLSecurity/talks/prateek.pdf)
The Symbiosis of Program Analysis and Machine Learning

Prateek Saxena, Associate Professor
National University of Singapore
Program Analysis, Classically

Program + Property π + Rules → Deductive Verification → True / False
But, In Practice…

Program + Property π + Rules → ???

• Too Complex to Analyze / Model
• Probabilistic System
• Probabilistic / Stochastic Properties
• Ambiguous Spec. (e.g., is this a good patch?)
• Not Re-Targetable
• Intractable Analysis
Can Machine Learning Help?
This Talk…

At the intersection of ML, Verification, and Security: 3 mainstream security analysis tasks.

Machine Learning → Program Analysis: New Representations & Inference Tools
Machine Learning → Program Analysis: Program Representations + Property π + Rules → Induction (ML)

Program Analysis → Machine Learning (Deductive Reasoning): ML System + Property π + Rules → Deductive Verification
ML → PA: New Representations & Inference Tools
For Symbolic Execution

Joint work with: Shiqi Shen, Shweta Shinde, Soundarya Ramesh, Abhik Roychoudhury (NDSS 2019)
Symbolic Execution

def f(x, y):
    if (x > y)
        x = x + y
    if (x - y > 0)
        assert false
    return (x, y)

Symbolic execution tree:
    x ↦ A, y ↦ B
    ├─ A > B:  x ↦ A + B, y ↦ B
    │   ├─ A > 0:  assert false reached (since x − y = A)
    │   └─ A ≤ 0:  x ↦ A + B, y ↦ B
    └─ A ≤ B:  x ↦ A, y ↦ B

x and y are symbolic variables; A and B are symbolic values.
Dynamic Symbolic Execution (DSE): a widely used variation of SE.
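As an editor's illustration of the tree above: the failing path has path condition A > B ∧ A > 0, and even a brute-force model search recovers a triggering input. This Python sketch is a toy stand-in (real symbolic executors hand the path condition to an SMT solver instead), enumerating small integer models:

```python
from itertools import product

def path_to_assert_false(a, b):
    """Concrete check: does input (a, b) drive f down the failing path?"""
    x, y = a, b
    if x > y:            # branch 1: must take the true side (A > B)
        x = x + y        # x becomes A + B
    else:
        return False
    return (x - y) > 0   # branch 2: x - y = A + B - B = A, so need A > 0

def find_model(lo=-3, hi=3):
    """Brute-force a model of the path condition A > B and A > 0."""
    for a, b in product(range(lo, hi + 1), repeat=2):
        if path_to_assert_false(a, b):
            return a, b
    return None
```

Running find_model() returns an input with a > b and a > 0 that drives f into the assert false branch.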
Symbolic Execution for Finding Security Bugs

Tools: Kite, SAGE, jCUTE, Manticore, S²E, Angr
The Path Explosion Problem

void copy_data(..., int *file, ...) {
    static double data[4096], value;
    read_double_value(file, ...);
    value = fabs(data[0]);
    for (i = 0; i < 4096; i++)
        if (file[i] == 0.0) count++;
    data[1] /= (value + count - 3);
    …
}

Forking on the branch at i = 0, 1, …, 4095 yields 2^4096 paths.
Prior Approaches: Learn a better representation, symbolically solve!
• Express in the SMT theory of floating-point
• Infer that 'count' = # of 0s in input bytes
• Assert: value + count − 3 = 0

But, why choose this specific constraint representation?
(Candidate theories: FPA, linear arithmetic, vectors, BV, String, Real, Bool, UF)
A Universal Approximate Representation?

Key Insights
Desired representation: a constraint relating the input file to the variables at the CVP, e.g.
    count == Σ_{i ∈ [0, 4095]} 1[file[i] == 0]
A neural network is an approximate representation of the desired constraint.

Remarks:
- Neural networks are universal approximators
- Increasing practical success
Key Insights

Learn an approximation, with a small number of I/O examples, from the values of the symbolic variables to the values of the variables at the CVP.

int main (…) {
    if (strlen(filename) > 1 && filename[0] == '-')
        exit(1);
    copy_data(…);
    …
}
void copy_data(..., int *file, ...) {
    static double data[4096], value;
    read_double_value(file, ...);
    value = fabs(data[0]);
    for (i = 0; i < 4096; i++)
        if (file[i] == 0.0) count++;
    data[1] /= (value + count - 3);   /* CVP: divide-by-zero */
    …
}

Approximate constraint (as a neural net): file → (count, value)
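The "small number of I/O examples" step can be sketched concretely. In this hypothetical Python sketch, count_zeros stands in for the counting loop of copy_data, and a one-weight least-squares fit stands in for the neural network NeuEx actually trains:

```python
import random

def count_zeros(file_bytes):
    """Python stand-in for the counting loop in copy_data."""
    return sum(1 for v in file_bytes if v == 0)

# 1. Collect I/O examples by random fuzzing.
random.seed(0)
inputs = [[random.randint(0, 3) for _ in range(64)] for _ in range(200)]
outputs = [count_zeros(inp) for inp in inputs]

# 2. Fit the approximation count ~ w * (#zero entries) by least squares.
#    The indicator feature is hand-picked here; NeuEx's neural net would
#    have to discover an equivalent feature on its own.
feats = [sum(1 for v in inp if v == 0) for inp in inputs]
w = sum(f * y for f, y in zip(feats, outputs)) / sum(f * f for f in feats)
```

The fitted weight comes out as 1.0 because the hand-picked feature matches the target exactly; the point of the neural representation is that no such feature needs to be hand-picked in advance.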
A New Approach: Neuro-Symbolic Execution

Program + Property π → Symbolic (SMT) Constraints, plus a Neural Network standing in for the path-explosion region → Symbolic Exploit Condition → SATISFIABLE?
Constraint Solving: Satisfiability Checking
Neuro-Symbolic Execution – Shen et al. [NDSS’19]

N: file → (value, count)
1. Reachability constraints: strlen(filename) ≤ 1 ∨ filename[0] ≠ '-'
2. Vulnerability condition: value + count − 3 == 0

Purely symbolic constraints (no variable shared with neural constraints) go to the SMT solver.
Mixed constraints (neural constraints ∧ symbolic constraints with shared variables) need a different solving procedure.
Solving SMT + Neural Constraints: Encode SMT constraints as the loss function

Criterion for crafting the loss function: the minimum point of the loss function satisfies the symbolic constraints.

N: file → (value, count)  ∧  value + count − 3 == 0
    ⇒  L = abs(value + count − 3)
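Textbook loss encodings make the criterion concrete (a sketch; the exact encodings used by NeuEx may differ): an equality a == b maps to |a − b|, an inequality a ≤ b to max(0, a − b), and a conjunction to the sum of its conjuncts' losses, so the loss is zero exactly when every constraint is satisfied:

```python
def loss_eq(a, b):
    # a == b  ->  |a - b| : zero iff the equality holds
    return abs(a - b)

def loss_le(a, b):
    # a <= b  ->  max(0, a - b) : zero iff the inequality holds
    return max(0.0, a - b)

def loss_and(*losses):
    # conjunction -> sum of conjunct losses: zero iff every conjunct is zero
    return sum(losses)

# The slide's vulnerability condition value + count - 3 == 0, at a
# satisfying assignment:
value, count = 2.0, 1.0
L = loss_eq(value + count - 3, 0)   # evaluates to 0.0: constraint satisfied
```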
Optimize using Gradient Descent

Feed file through N, read off (count, value), compute the loss, and step against the gradient ∇_file L.
(Table of intermediate count / value / loss values omitted; the iterations drive the loss to 0.)

file: 000… → concretely validate the exploit
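A minimal end-to-end version of this gradient-descent step, with a toy differentiable function standing in for the trained network N (the smooth "count" below is hypothetical, chosen only so that the loss has a reachable zero):

```python
import math

def N(file_vec):
    """Toy differentiable stand-in for the learned model: 'count' is a
    smooth count of near-zero entries; 'value' is fixed at 0 here."""
    count = sum(math.exp(-2.0 * v * v) for v in file_vec)
    return 0.0, count

def loss(file_vec):
    value, count = N(file_vec)
    return abs(value + count - 3)     # the slide's L = abs(value+count-3)

def grad(file_vec, eps=1e-4):
    """Numeric gradient of the loss w.r.t. the input vector."""
    base = loss(file_vec)
    g = []
    for i in range(len(file_vec)):
        bumped = list(file_vec)
        bumped[i] += eps
        g.append((loss(bumped) - base) / eps)
    return g

x = [1.0] * 8                         # start with no near-zero entries
best_loss = loss(x)
for t in range(500):
    lr = 0.1 / (1 + 0.05 * t)         # decaying step size
    x = [xi - lr * gi for xi, gi in zip(x, grad(x))]
    best_loss = min(best_loss, loss(x))
```

In NeuEx the gradient comes from backpropagation through the real network; the finite-difference grad here only keeps the sketch dependency-free.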
NeuEx = Neuro-Symbolic Execution + KLEE

Starting point → Reaching CVPs (via DSE, until a bottleneck) → Collecting I/O Examples (random fuzz) → Neural Net Training → Constraint Solving
NeuEx Tool Overview

Inputs: Source Code, CVPs, Symbolic Variables, Input Grammar (optional)
The DSE Engine (KLEE) + SMT Solver (Z3) fork paths as usual; on a bottleneck, NeuEx forks into Neural Mode.

Bottlenecks:
1. Unmodeled APIs
2. Loop unrolled count > 10K
3. Z3 timeout (> 10 mins)
4. Memory cap > 3 GB

Output: Validated Exploits
Evaluation

● Recall: Neural mode is only triggered when DSE encounters bottlenecks
● Benchmarks: 7 programs known to be difficult for classic DSE
  ○ 4 real programs
    ■ cURL: data transfer
    ■ SQLite: database
    ■ libTIFF: image processing
    ■ libsndfile: audio processing
  ○ LESE benchmarks
    ■ BIND, Sendmail, and WuFTP
These include: 1. complex loops, 2. floating-point variables, 3. unmodeled APIs
NeuEx vs KLEE

The number of CVPs reached or covered by NeuEx is 25% higher than vanilla KLEE (KLEE gets stuck, e.g., on complex loops).
NeuEx finds 94% and 89% more bugs than vanilla KLEE in BFS and RAND modes, respectively, in 12 hours.
ML → PA: New Representations & Inference
For Taint Analysis

One Engine To Serve ’Em All – Leong et al. [NDSS 2019]
Taint Analysis

• Taint analysis tracks the information flow within a program:
  • E.g., T[v] is the taint bit for operand "v"
• Taint analysis is the basis for many security applications:
  • Information leakage detection
  • Enforcing program integrity
  • Vulnerability detection
  • …
/* tainted input from network socket */
int parse_buffer(char buffer[100], struct pkt_info *info) {
    char check_flag;

    check_flag = buffer[5] & 0x16;

    err = init_pkt_info(info);
    if (!err)
        return err;
    info->flag = check_flag;
    /* … */
    strncpy(info->data, buffer + 6, 50);
    info->seq = get_current_seq();
    return OK;
}

Is the return address tainted?
Taint Analysis on Binaries

movsx eax, byte ptr [rsi + 5]
and eax, 16
mov cl, al
mov byte ptr [rbp - 25], cl

Write binary taint rules based on instruction semantics. The taint map T[] records, e.g., T[check_flag] = T[buffer + 5].
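A tiny taint engine in this style might look as follows (illustrative rules only, not those of any production engine; register and location names follow the slide):

```python
def taint_mov(T, dst, src):
    """mov/movsx-style rule: destination taint copies the source's taint."""
    T[dst] = T.get(src, False)

def taint_and_imm(T, dst, imm):
    """and dst, imm: keep dst's own taint, except the precise special case
    imm == 0 (cf. 'Taint Engine 3' on the next slide), where the result is
    the constant 0 and the taint is cleared."""
    if imm == 0:
        T[dst] = False

# Propagate through the compiled sequence for: check_flag = buffer[5] & 0x16
T = {"buffer[5]": True}               # byte 5 of the network buffer is tainted
taint_mov(T, "eax", "buffer[5]")      # movsx eax, byte ptr [rsi + 5]
taint_and_imm(T, "eax", 0x16)         # and eax, 0x16  (imm != 0: taint kept)
taint_mov(T, "cl", "eax")             # mov cl, al
taint_mov(T, "check_flag", "cl")      # mov byte ptr [rbp - 25], cl
```

After the sequence, T["check_flag"] is True, matching T[check_flag] = T[buffer + 5] above.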
Taint Rule Representations in Existing Systems

• What is the taint rule for "and eax, 16" on the x86 architecture?

Taint Engine 1:
    T[eax] = T[eax]

Taint Engine 2:
    T[eax] = T[eax]
    T[pf] = T[sf] = T[zf] = T[eax];  T[of] = T[cf] = 0

Taint Engine 3:
    T[eax] = T[eax]
    T[pf] = T[sf] = T[zf] = T[eax];  T[of] = T[cf] = 0
    if imm == 0 { T[eax] = 0 }
One Engine To Serve ’Em All – Leong et al. [NDSS’19]
Complexity of “Real” Taint Rules

• Input-dependent propagation
• Size-dependent propagation
• Architectural quirks for backwards compatibility
• There can be 1000s of opcodes per instruction set!
if (size == 64 || size == 32 || size == 16) {
    for (x = 0; x < size / 8; x++) {
        if (t1[x] && t2[x]) t1[x] = 1;
        else if (t1[x] && !t2[x]) t1[x] = t1[x] & op2[x];
        else if (!t1[x] && t2[x]) t1[x] = t2[x] & op1[x];
        else t1[x] = 0;
    }
} else if (size == 8) {
    // 0 if it's the lower 8 bits, 1 if it's the upper 8 bits
    pos1 = isUpper(op1); pos2 = isUpper(op2);
    if (t1[pos1] && t2[pos2]) t1[pos1] = 1;
    else if (t1[pos1] && !t2[pos2]) t1[pos1] = t1[pos1] & op2[pos2];
    else if (!t1[pos1] && t2[pos2]) t1[pos1] = t2[pos2] & op1[pos1];
    else t1[pos1] = 0;
}
if (mode64bit == 1 && size == 64)
    for (x = 32; x < size; x++) t1[x] = 0;
Learning Taint Rules Automatically

• Generate observations (input-output pairs)
• Infer a sound rule from the observations (exact mode)
• Generalize the rule with a simple change in the inference algorithm (generalization mode)

Instruction (e.g., cmovb eax, ebx) → Observation Engine → Observations, e.g. (10110…, 11100) … (10111…, 11000) → Inference Engine → Taint Computation Rules (A→B, X→A, Y→Z)
TaintInduce: Sample and Learn

• Flip a bit and observe the output for changes, e.g. for mov eax, ebx:
  • ΔEBX0 → ΔEAX0
  • ΔEBX0 → ΔEBX0
• An influence (Inf) is only valid for the observed state, e.g.:
  • EAX = 11100011, EBX = 00101000
• Form a truth table with all of the collected observations:
  • True if there is a change, False otherwise
• Unseen values are conservatively set to "Don't-Cares"

  EAX0  EAX1  …  EBX0  EBX1  …  Inf
  1     1     …  0     0     …  1
  1     1     …  1     0     …  1
  0     0     …  1     1     …  1
  0     0     …  0     0     …  1
  …     …     …  …     …     …  0
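The sample-and-learn loop can be sketched directly: run a concrete stand-in semantics, flip each input bit, and record which output bits change. The mov_eax_ebx function below is a hypothetical 8-bit model of the instruction, not TaintInduce's actual observation engine:

```python
def mov_eax_ebx(state):
    """Concrete stand-in semantics for mov eax, ebx (8-bit toy registers)."""
    out = dict(state)
    out["eax"] = state["ebx"]
    return out

def observe_flips(semantics, state, regs, width=8):
    """Flip each input bit once; map (in_reg, bit) -> changed output bits."""
    base = semantics(state)
    deps = {}
    for r in regs:
        for b in range(width):
            flipped = dict(state)
            flipped[r] = state[r] ^ (1 << b)
            out = semantics(flipped)
            changed = {(r2, b2) for r2 in regs for b2 in range(width)
                       if (out[r2] ^ base[r2]) >> b2 & 1}
            if changed:
                deps[(r, b)] = changed
    return deps

# One round of observations for the state on the slide:
# EAX = 11100011, EBX = 00101000
deps = observe_flips(mov_eax_ebx,
                     {"eax": 0b11100011, "ebx": 0b00101000}, ["eax", "ebx"])
```

For this state the recovered dependencies match the slide: flipping EBX bit i changes EAX bit i (and EBX bit i itself), while the input EAX bits influence nothing, since EAX is overwritten.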
Captures Conditional & Indirect Dependencies

cmovb eax, ebx

When CF = 1, the observed influence is ebx → eax:
  CF=1, EAX=542, EBX=19, ECX=7, …
  CF=1, EAX=32, EBX=3, ECX=0, …
  CF=1, EAX=873, EBX=32, ECX=1, …
When CF = 0, the observed influence is eax → eax:
  CF=0, EAX=12, EBX=4, ECX=1023, …
  CF=0, EAX=42, EBX=11, ECX=13, …
  CF=0, EAX=2, EBX=3, ECX=33, …

(State-before / state-after diagrams over the memory slots and EAX, EBX, ECX, CF omitted.)
TaintInduce: Output a succinct rule
• Use an inference technique to learn a succinct rule for the observed function.

Observations (abridged):
  CF=0, EAX=12, …    False
  CF=1, EAX=333, …   True
  CF=0, EAX=42, …    False
  CF=0, EAX=44, …    False
  CF=1, EAX=873, …   True
  CF=0, EAX=1023, …  False
  CF=0, EAX=33, …    False
  CF=1, EAX=32, …    True
  CF=0, EAX=2, …     False
  …                  DC

Inference: the observations split exactly on CF = 1, giving the rule
  IF CF = 1 THEN (EBX0 → EAX0) ELSE (EAX0 → EAX0)
Results: Comparison with State-of-the-Art

• Compared with TEMU, Triton, and libdft
• Benchmarks: LAVA-M, libtiff, binutils, etc.
• Checks taint propagation for each individual instruction, between TaintInduce and each of the tools
• Only 0.28% of the discrepancies are errors in TaintInduce
• All of the errors made by TaintInduce are due to ZF

Matches: 93.27%–99.5% with existing hand-written tools. Only 0.28% of the discrepancies are errors in TaintInduce.

x86 instructions supported:
              Arith  Comp  Jump  Move  Cond  FPU  SIMD  Misc  Total
  TaintInduce   43     9    33    33    60    85   259    28    550
  libdft        15     5     1    30    32     X     X     8     91
  Triton        38     9    19    33    32     X   144    13    288
  TEMU           7     1     2     3     X     X     X     X     13
Results: Coverage and Correctness

Auto-generated taint rules for 4 architectures (x86, x64, AArch64, MIPS-I), with no mistakes for ~71% of the instructions.

            Arith  Comp  Jump  Move  Cond  FPU  SIMD  Misc
  x86        √      √     √     √     √     √    √     √
  x64        √      √     √     √     √     √    √     √
  AArch64    √      √     √     √     √     √    √     √
  MIPS-I     √      √     √     √     -     -    -     -

Methodology: train on 100 seeds, test on 1000 random inputs for each instruction.

Room for Future Work: learn precise rules for all the instructions…
PA → ML: Deductive Reasoning

Joint work with: Teodora Baluta, Shiqi Shen, Shweta Shinde, Kuldeep S. Meel (CCS 2019)
Concerns with ML Systems: Robustness, Fairness, Memorization

[Trojaning Attacks – Liu et al.]  [Adversarial Examples – Goodfellow et al.]
Qualitative Verification

VERIFIER: given N, decide the Boolean query ∃x. π(N, x)

But, the network or property is often stochastic…
Quantitative Verification

QUANTITATIVE VERIFIER: given N, answer how many x satisfy π(N, x)?

Quantitative Verification of Neural Networks and Its Security Applications – Baluta et al. [CCS’19]
NPAQ: A Quantitative Verifier for Neural Nets

N → COUNT-PRESERVING ENCODERS → CNF formula φ → APPROX. MODEL COUNTER → count |R(φ)|

PAC-style soundness guarantees:
  Pr[(1 + ε)^(-1) |R(φ)| ≤ r ≤ (1 + ε) |R(φ)|] ≥ 1 − δ
where ε is the error tolerance, |R(φ)| the true count, r the approximate count, and 1 − δ the confidence.
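For intuition about what |R(φ)| means, here is a brute-force exact model counter over a tiny DIMACS-style CNF (illustration only; NPAQ uses an approximate counter precisely because such enumeration is infeasible at scale):

```python
from itertools import product

def count_models(cnf, n_vars):
    """Exact |R(phi)| by enumeration. cnf is a list of clauses; each clause
    is a list of ints, DIMACS-style: +v means variable v, -v its negation."""
    count = 0
    for bits in product([False, True], repeat=n_vars):
        # a clause is satisfied if any of its literals is true
        if all(any(bits[abs(lit) - 1] == (lit > 0) for lit in clause)
               for clause in cnf):
            count += 1
    return count

# phi = (x1 or x2) and (not x1 or x3), over 3 variables
phi = [[1, 2], [-1, 3]]
models = count_models(phi, 3)   # |R(phi)| = 4
```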
NPAQ Results
• 84 models of up to 51,410 parameters
• 1,056 encoded CNF formulae
• 3 applications on the MNIST and UCI Adult datasets:
  • Robustness: How many adversarial samples are there within some perturbation distance?
  • Fairness: How often will a prediction change (favorably) if the gender of the applicant is changed, keeping all else constant?
  • Trojan attack efficiency: How often will an image with a trigger result in the desired misclassification?

97.1% of the encoded formulas were solved within a 24-hour timeout each.
Key Takeaways

● Machine Learning Helps Program Analysis
  Ø When the program, property, or analysis rules are uncertain
  Ø Provides powerful approximate representations and solving tools
  Ø Specific applications:
    ● Neuro-Symbolic Execution
    ● Automatically Learning Taint Rules
● Program Analysis Helps Machine Learning
  Ø By verifying properties
  Ø SAT/SMT-based quantitative reasoning is a powerful tool
  Ø Specific applications: Fairness, Robustness, Memorization

Thank you!