Page 1: Profile-driven Inlining for Erlang Thomas Lindgren thomasl_erlang@yahoo.com.

Profile-driven Inlining for Erlang

Thomas Lindgren

thomasl_erlang@yahoo.com

Page 2:

Inlining

Replace function call f(X1,…,Xn) with body of f/n

Optimization enabler
– Simplify code
– Specialize code
– Remove "optimization fence"

Standard tool in modern compiler toolbox

Page 3:

Inlining

Main problem: which calls to inline?
– Code growth reduces performance
– Estimate code size growth
– Select the best estimated sites subject to cost

Some static estimations:
– f/n is small? (= inline cost is small)
– Inlining the call to f/n enables optimization

Are we optimizing the important code?
– Or just the convenient code?

Page 4:

Inlining

Dynamic estimation
– Profile the program
– Select the best hot call sites for inlining

Optimize the important code

Page 5:

Our approach

Inlining driven by profiling

Permit cross-module inlining
– Computations often span several modules
– Code growth measured for whole program

Cross-module optimization enabled by (i) module aggregation and (ii) guarded conversion of remote to local calls
– (will not describe this further here) [Lindgren 98]

Page 6:

The rest of this talk

Overview of method

Performance measurements

Page 7:

Inline forest

Inlinings to be done represented by forest

Nodes are inlined call sites

Leaves are call sites to be checked

(Example shows nested inlining)

Some sites are not inlined

[Figure: inline forest over the functions f, g and h]

Page 8:

Priority-based inlining

All call sites (leaves in inline forest) are placed in priority queue
– Priority = estimated number of calls

When a call site f is inlined, the call sites in f are added to the queue
– Priority scaled appropriately
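
The priority queue described above can be sketched as follows. This is only an illustration, not the talk's implementation: the module name prio_sketch, the function names and the choice of gb_sets as the queue are all assumptions. Entries are {Priority, Site} tuples, so the largest element of the set is the hottest call site.

```erlang
-module(prio_sketch).
-export([new/1, take_hottest/1, add_callees/3]).

%% Hypothetical helper module, not from the talk.
%% Sites is a list of {Site, EstimatedCalls}. Entries are stored as
%% {Priority, Site} so that gb_sets orders them by priority first.
new(Sites) ->
    gb_sets:from_list([{Prio, Site} || {Site, Prio} <- Sites]).

%% Remove and return the call site with the highest estimated call count.
take_hottest(Queue) ->
    case gb_sets:is_empty(Queue) of
        true ->
            empty;
        false ->
            {{Prio, Site}, Rest} = gb_sets:take_largest(Queue),
            {Site, Prio, Rest}
    end.

%% After inlining a site visited K times, queue the call sites found in
%% the inlined body, scaling each one by its ratio.
add_callees(K, Callees, Queue) ->
    lists:foldl(fun({Site, Ratio}, Q) ->
                        gb_sets:add({K * Ratio, Site}, Q)
                end, Queue, Callees).
```

Because a set drops duplicate elements, this sketch assumes every Site term identifies one call site uniquely.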

Page 9:

Inlining algorithm

Preprocess code
– call_site and size maps
– Initialize priority queue
– Initialize inline forest

While prio queue not empty
– Take call site (k, f)
– Try to inline it

Page 10:

Preprocessing

For each function visited k times
– for each call site visited k’ times, set ratio(call_site) = (k’/k)

Adjust ratio so that it is < 1.0

Self-recursive call sites := 0.0
– (improves code quality)

Result: maps (function -> [{call_site, ratio}])
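
A minimal sketch of this preprocessing step, under stated assumptions: the module name ratio_sketch and its input format are hypothetical, and the cap of 0.99 is taken from the worked example later in the talk rather than from a stated rule.

```erlang
-module(ratio_sketch).
-export([ratios/3]).

%% Hypothetical helper module, not from the talk.
%% Fun is the function being preprocessed, K its visit count, and Sites
%% a list of {Callee, Visits} for the call sites in its body.
%% Returns [{Callee, Ratio}], one entry of the call_site map.
ratios(Fun, K, Sites) when K > 0 ->
    [{Callee, ratio(Fun, Callee, Visits / K)} || {Callee, Visits} <- Sites].

%% Self-recursive call sites get ratio 0.0.
ratio(Fun, Fun, _R) -> 0.0;
%% Keep every other ratio strictly below 1.0 (0.99 is an assumed cap).
ratio(_Fun, _Callee, R) when R >= 1.0 -> 0.99;
ratio(_Fun, _Callee, R) -> R.
```

Keeping every ratio below 1.0 means the estimated call count shrinks at each nesting level, which is what lets a minimum-frequency cutoff end nested inlining.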

Page 11:

dec_bearer_capability(__X12,__X13) ->
    {visits,200000},
    case {__X12,__X13} of
        {BbcRec,[Octet5|Rest]} ->
            {200000},
            NewBbcRec =
                case Octet5 band 31 of
                    1 -> {200000}(erlang:setelement(3,BbcRec,1));
                    3 -> "...";
                    16 -> "...";
                    24 -> "..."
                end,
            case if Octet5 band 128 == 128 -> {200000}, false;
                    true -> "..."
                 end of
                true -> "...";
                false -> {200000}(dec_bearer_capability_6(NewBbcRec,Rest))
            end
    end.

Original code marked with number of visits

Page 12:

(Same annotated code as on page 11)

Special attention to function calls

Page 13:

(Same annotated code as on page 11)

dec_bearer_capability/2 runs 200,000 times
dec_bearer_capability_6 visited 200,000 times
ratio is (200k/200k) = 1.0
adjust ratio to 0.99

Page 14:

Inlining a call site

Bookkeeping phase (code gen later)

Call to f(X1,…,Xn), visited k times

k < minimum frequency? stop

tot_size + size(f) > max_size? skip

Otherwise,
– tot_size += size(f)
– for each call site g of f
    add (k * ratio, g) to priority queue
    extend node f by call sites g1,…,gn

Iterate until no call sites remain
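
The bookkeeping loop can be sketched as below. This is an illustration under assumptions, not the author's code: the module name inline_sketch, the use of maps for the call_site and size maps, and the representation of the inline forest as a list of paths are all hypothetical. A path such as [g, f] stands for "g inlined into f".

```erlang
-module(inline_sketch).
-export([run/5]).

%% Hypothetical helper module, not from the talk.
%% Roots:   [{Callee, K}] initial call sites with estimated call counts.
%% Sites:   #{Callee => [{Callee2, Ratio}]}, the call_site map.
%% Sizes:   #{Callee => Size}, the size map.
%% MinFreq: stop once the hottest remaining site is colder than this.
%% MaxSize: upper bound on the total inlined size.
%% Returns {Inlined, TotSize}; Inlined lists the accepted sites as paths
%% (innermost callee first) in the order they were accepted.
run(Roots, Sites, Sizes, MinFreq, MaxSize) when MinFreq > 0 ->
    Queue = gb_sets:from_list([{K, [F]} || {F, K} <- Roots]),
    loop(Queue, Sites, Sizes, MinFreq, MaxSize, 0, []).

loop(Queue, Sites, Sizes, MinFreq, MaxSize, TotSize, Acc) ->
    case gb_sets:is_empty(Queue) of
        true ->
            {lists:reverse(Acc), TotSize};
        false ->
            {{K, [F | _] = Path}, Rest} = gb_sets:take_largest(Queue),
            Size = maps:get(F, Sizes),
            if
                K < MinFreq ->
                    %% Everything left in the queue is colder still: stop.
                    {lists:reverse(Acc), TotSize};
                TotSize + Size > MaxSize ->
                    %% Too much code growth: skip this site only.
                    loop(Rest, Sites, Sizes, MinFreq, MaxSize, TotSize, Acc);
                true ->
                    %% Accept the site and queue the call sites in its body.
                    Callees = maps:get(F, Sites, []),
                    Queue1 = lists:foldl(
                               fun({G, Ratio}, Q) ->
                                       gb_sets:add({K * Ratio, [G | Path]}, Q)
                               end, Rest, Callees),
                    loop(Queue1, Sites, Sizes, MinFreq, MaxSize,
                         TotSize + Size, [Path | Acc])
            end
    end.
```

For example, with Sites = #{a => [{b, 0.5}, {c, 0.1}], b => [], c => []} and Sizes = #{a => 10, b => 20, c => 5}, the call inline_sketch:run([{a, 1000}], Sites, Sizes, 200, 100) accepts a and then b, and stops at c because its estimated count falls below the minimum frequency.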

Page 15:

Example

Inlining applied to decode1
– Protocol decoding
– Single module

Page 16:

decode1

Prio queue:
decode_ie_coding_1/3 [800k]
decode_action/1 [800k]
dec_bearer_capability/2 [200k]
dec_bearer_capability_6/2 [198k]
decode_ie_heads_setup/5 [198k]
…

[Figure: inline forest]

Call_site mapping (selected parts):
dec_bearer_capability/2 -> [(dec_bearer_capability_6, 1.00)]
    (adjust to 0.99)
decode_ie_heads_setup/5 -> [(decode_action/1, 0.8), (decode_ie_coding/1, 0.8), (dec_bearer_capability, 0.2), (decode_ie_heads_setup/5, 0.2), (decode_ie_heads_setup/5, 0.6)]
    (self-recursive, so set to 0.0)
…

Page 17:

decode1

Prio queue:
decode_ie_coding_1/3 [800k]
decode_action/1 [800k]
dec_bearer_capability/2 [200k]
dec_bearer_capability_6/2 [198k]
decode_ie_heads_setup/5 [198k]
…

[Figure: inline forest]

Call_site mapping:
dec_bearer_capability/2 -> [(dec_bearer_capability_6, 0.99)]
decode_ie_heads_setup/5 -> [(decode_action/1, 0.8), (decode_ie_coding/1, 0.8), (dec_bearer_capability, 0.2), (decode_ie_heads_setup/5, 0.0), (decode_ie_heads_setup/5, 0.0)]
…

Try to inline

Page 18:

decode1

Prio queue:
-
decode_action/1 [800k]
dec_bearer_capability/2 [200k]
dec_bearer_capability_6/2 [198k]
decode_ie_heads_setup/5 [198k]
…

[Figure: inline forest]

Page 19:

decode1

Prio queue:
-
-
dec_bearer_capability/2 [200k]
dec_bearer_capability_6/2 [198k]
decode_ie_heads_setup/5 [198k]
…

[Figure: inline forest]

Page 20:

decode1

Prio queue

[Figure: inline forest]

Final result:
– Inline dec_bearer_cap_6/2 into dec_bearer_cap/2 yielding (*)
– Inline dec_ie_coding/1, decode_action/1 and (*) into decode_ie_heads_setup/5
– During inlining, one inline was rejected for too much code growth (not shown)

Now time for code generation

Page 21:

Code generation

Walk each inline tree from leaf to root
– Replace inlined calls f(E1,…,En) with (fun(X1,…,Xn) -> E end)(E1,…,En)
– General case: nested inlines

Simplify the resulting function
– Apply fun to arguments (above)
– Case-of-case
– Case-of-if
– …
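
The first two steps can be illustrated on a toy function. The module codegen_sketch and the function double/1 are made up for this example; the renaming of the parameter to _0_X imitates the prefixed names in the talk's backup slides.

```erlang
-module(codegen_sketch).
-export([original/1, inlined/1, simplified/1]).

%% Hypothetical example module, not from the talk.
double(X) -> X + X.

%% Original: an ordinary local call.
original(N) -> double(N + 1).

%% Step 1: the call is replaced by an application of a fun wrapping the
%% callee's body; the parameter is renamed to avoid name capture.
inlined(N) -> (fun(_0_X) -> _0_X + _0_X end)(N + 1).

%% Step 2: applying the fun to its argument turns the parameter into an
%% ordinary match, and the fun disappears.
simplified(N) ->
    _0_X = N + 1,
    _0_X + _0_X.
```

All three versions compute the same value; the argument N + 1 is still evaluated exactly once after simplification because it is bound to a variable rather than substituted into the body.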

Page 22:

Measurements

Used five applications
– decode1 (small protocol decoder)
– ldapv2 (ASN.1 encode/decode)
– gen_tcp (send/rcv over socket)
– beam (compiler)
– mnesia (simulate HLR)

Page 23:

Benchmarks

App      Mods  Funcs  Calls  Local  Visited
gen_tcp    13    658   1546    989      202
ldapv2      5    321   1038    616      140
beam       51   2347   9669   7594     2653
mnesia     63   4207  13390   8435      984

Page 26:

Performance

Very preliminary
– Code generation problems for beam and mnesia => unable to measure
– (Probably due to name capture bug)

Did not use outlining, higher-order specialization, apply open-coding [EUC’01]

Tried only emulated code
– Native code compilation failed

Page 27:

Speedup vs baseline

decode1 1.05

gen_tcp 1.04

ldapv2 1.10

Native compilation of inlined decode1 provided a net slowdown

Page 28:

Future work

Integrate with other optimizations

Plenty of opportunities for further source-level simplifications

Suggests new approach to module aggregation
– (do it after inlining instead of before)

Tuning, measurements
– Bugfixing …

Page 29:

Conclusion

Profile-guided inlining speeds up real code

Whole-program, cross-module inlining probably necessary

Page 30:

Backup slides

Page 31:

%% inlined, before simplify
dec_bearer_capability(BbcRec,[Octet5|Rest]) ->
    ...
    case if Octet5 band 128 == 128 -> false;
            true -> true
         end of
        true ->
            dec_bearer_capability_5a(NewBbcRec,Rest);
        false ->
            _0_BbcRec = NewBbcRec,
            [_0_Octet6] = Rest,
            _0_STC = case (_0_Octet6 bsr 5) band 3 of 0 -> 0; 1 -> 1 end,
            _0_UPCC = case _0_Octet6 band 3 of 0 -> 0; 1 -> 1 end,
            _0_NewBbcRec = erlang:setelement(6,erlang:setelement(5,_0_BbcRec,_0_UPCC),_0_STC)
    end.

Case-of-if

Page 32:

%% after simplify:
dec_bearer_capability(BbcRec,[Octet5|Rest]) ->
    ...
    if Octet5 band 128 == 128 ->
           _0_BbcRec = NewBbcRec,
           [_0_Octet6] = Rest,
           _0_STC = case (_0_Octet6 bsr 5) band 3 of 0 -> 0; 1 -> 1 end,
           _0_UPCC = case _0_Octet6 band 3 of 0 -> 0; 1 -> 1 end,
           _0_NewBbcRec = erlang:setelement(6,erlang:setelement(5,_0_BbcRec,_0_UPCC),_0_STC);
       true ->
           dec_bearer_capability_5a(NewBbcRec,Rest)
    end.
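
The case-of-if simplification can be seen in isolation on a smaller function. The module case_of_if_sketch and the atoms last_octet and more_octets are invented for this illustration; only the shape of the rewrite follows the slides.

```erlang
-module(case_of_if_sketch).
-export([unsimplified/1, simplified/1]).

%% Hypothetical example module, not from the talk.
%% Before: the 'if' computes a boolean that the enclosing 'case'
%% immediately inspects.
unsimplified(Octet) ->
    case if Octet band 128 == 128 -> false;
            true -> true
         end of
        true -> more_octets;
        false -> last_octet
    end.

%% After: each 'if' branch is replaced by the 'case' clause that its
%% constant result selects, so the intermediate boolean is gone.
simplified(Octet) ->
    if Octet band 128 == 128 -> last_octet;
       true -> more_octets
    end.
```

The rewrite is valid here because every branch of the if ends in a known constant, so the matching case clause can be chosen at compile time.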

Page 33:

Module merging

We want to optimize over several modules at a time

What to do about hot code loading?
– Merge modules to aggregates
– Convert suitable remote calls into local calls
– Guard such calls to preserve code loading semantics
– Annotate code regions with "origin module" to enable precise process purging

Or … extend Erlang appropriately

Page 34:

decode_ie_heads_setup(Bin,TypeOfCall,EprFlag,IEList,BrepFlag)
  when erlang:is_binary(Bin), erlang:size(Bin) >= 4 ->
    {Bin1,Bin2} = erlang:split_binary(Bin,4),
    [Id,F,L1,L0] = erlang:binary_to_list(Bin1),
    _4_Flag = F,
    Action =
        if _4_Flag band 16 == 16 ->
               case _4_Flag band 3 of
                   0 -> clear_call;
                   1 -> discard_proceed;
                   2 -> discard_proceed_status;
                   _ -> undefined
               end;
           true ->
               false,
               ignore
        end,
    _3_F = F,
    Coding =
        case _3_F band 96 of
            0 -> itu_t_standard;
            96 -> atm_forum_specific;
            _ -> undefined
        end,
    case 256 * L1 + L0 of
        Len when Len > 0 ->
            case catch erlang:split_binary(Bin2,Len) of
                {'EXIT',_} ->
                    decode_ie_heads_setup(not_a_binary,TypeOfCall,EprFlag,IEList,BrepFlag);
                {Bin3,Bin4} ->
                    IE = {ie,Id,Coding,Action,Len,Bin3},
                    case Id of
                        94 ->
                            BbcRec = {scct_bbc,undefined,undefined,undefined,undefined,undefined},
                            case catch begin
                                _2_BbcRec = BbcRec,
                                [_2_Octet5|_2_Rest] = erlang:binary_to_list(Bin3),
                                _2_NewBbcRec =
                                    case _2_Octet5 band 31 of
                                        1 -> erlang:setelement(3,_2_BbcRec,1);
                                        3 -> erlang:setelement(3,_2_BbcRec,3);
                                        16 -> erlang:setelement(3,_2_BbcRec,16);
                                        24 -> erlang:setelement(3,_2_BbcRec,24)
                                    end,
                                if _2_Octet5 band 128 == 128 ->
                                       _2__0_BbcRec = _2_NewBbcRec,
                                       [_2__0_Octet6] = _2_Rest,
                                       _2__0_STC = case (_2__0_Octet6 bsr 5) band 3 of 0 -> 0; 1 -> 1 end,
                                       _2__0_UPCC = case _2__0_Octet6 band 3 of 0 -> 0; 1 -> 1 end,
                                       _2__0_NewBbcRec = erlang:setelement(6,erlang:setelement(5,_2__0_BbcRec,_2__0_UPCC),_2__0_STC);
                                   true ->
                                       true,
                                       dec_bearer_capability_5a(_2_NewBbcRec,_2_Rest)
                                end
                            end of
                                {'EXIT',_} ->
                                    CauseRec = {scct_cause,undefined,2,100,[94]},
                                    RelCompUniMsg = {release_complete_uni,[CauseRec],[]},
                                    {error_throw_relcomp,RelCompUniMsg};
                                NewBbcRec ->
                                    case erlang:element(5,NewBbcRec) of
                                        0 -> decode_ie_heads_setup(Bin4,0,EprFlag,[IE|IEList],BrepFlag);
                                        1 -> decode_ie_heads_setup(Bin4,1,EprFlag,[IE|IEList],BrepFlag)
                                    end
                            end;
                        84 ->
                            decode_ie_heads_setup(Bin4,TypeOfCall,yes_epr,[IE|IEList],BrepFlag);
                        99 ->
                            decode_ie_heads_setup(Bin4,TypeOfCall,EprFlag,[IE|IEList],yes_brep);
                        _ ->
                            decode_ie_heads_setup(Bin4,TypeOfCall,EprFlag,[IE|IEList],BrepFlag)
                    end
            end;
        Len when Len == 0 ->
            decode_ie_heads_setup(Bin2,TypeOfCall,EprFlag,IEList,BrepFlag)
    end;
decode_ie_heads_setup(_,1,yes_epr,IEList,no_brep) ->
    {1,IEList};
decode_ie_heads_setup(_,1,yes_epr,IEList,yes_brep) ->
    {1,lists:reverse(IEList)};
decode_ie_heads_setup(_,1,no_epr,_,no_brep) ->
    CauseRec = {scct_cause,undefined,2,96,[84]},
    RelCompUniMsg = {release_complete_uni,[CauseRec],[]},
    {error_throw_relcomp,RelCompUniMsg};
decode_ie_heads_setup(_,0,_,IEList,no_brep) ->
    {0,IEList};
decode_ie_heads_setup(_,0,_,IEList,yes_brep) ->
    {0,lists:reverse(IEList)};
decode_ie_heads_setup(_,no_bbc_ie,_,_,_) ->
    CauseRec = {scct_cause,undefined,2,96,[94]},
    RelCompUniMsg = {release_complete_uni,[CauseRec],[]},
    {error_throw_relcomp,RelCompUniMsg}.