Telefónica Digital – Video Products Definition & Strategy
using std::cpp 2015
Joaquín M López Muñoz <[email protected]> Madrid, November 2015
Mind the cache
Our mental model of the CPU is 70 years old
Von Neumann in front of EDVAC
Current reality looks quite different
Intel Core i7-3770K (Ivy Bridge)
Processor-memory gap
[Chart: performance vs. year — processor performance pulls away from memory performance; the industry's response: more cores, more cache]
Cache hierarchy
[Diagram: four cores, each with private L1 (32 KB) and L2 (256 KB), sharing an 8 MB L3]
Cache line
[Diagram: data moves between memory, L2 and L1 in fixed-size 64 B cache lines]
Access times
L1: 1 ns / 4 cycles
L2: 3.1 ns / 12 cycles
L3: 7.7 ns / 30 cycles
DRAM: 60 ns / 237 cycles
Measure measure measure
Our little measuring tool
#include <algorithm>
#include <array>
#include <chrono>
#include <numeric>

template<typename F>
double measure(F f)
{
  using namespace std::chrono;
  static const int          num_trials=10;
  static const milliseconds min_time_per_trial(200);
  std::array<double,num_trials> trials;
  volatile decltype(f())        res; // to avoid optimizing f() away

  for(int i=0;i<num_trials;++i){
    int runs=0;
    high_resolution_clock::time_point t2;
    auto measure_start=high_resolution_clock::now();
    do{
      res=f();
      ++runs;
      t2=high_resolution_clock::now();
    }while(t2-measure_start<min_time_per_trial);
    trials[i]=duration_cast<duration<double>>(t2-measure_start).count()/runs;
  }
  (void)res; // suppress unused-variable warning

  // discard the two fastest and two slowest trials, average the rest, in µs
  std::sort(trials.begin(),trials.end());
  return std::accumulate(trials.begin()+2,trials.end()-2,0.0)
         /(trials.size()-4)*1E6;
}
Profiling environment
Intel Core i5-2520M @ 2.5 GHz (Sandy Bridge)
L1: 2×32 KB 8-way 3 cycles
L2: 2×256 KB 8-way 9 cycles
L3: shared 3 MB 12-way 22 cycles
4.0 GB DRAM
Windows 7 Enterprise
Visual Studio Express 2015
x86 mode (32 bit), default release settings
Container traversal
Linear traversal
// vector
std::vector<int> v=...;
return std::accumulate(v.begin(),v.end(),0);

// list
std::list<int> l=...;
return std::accumulate(l.begin(),l.end(),0);

// shuffled list
std::list<int> l=...; // nodes shuffled in memory
return std::accumulate(l.begin(),l.end(),0);
Linear traversal: results
20.2×
6.6×
Linear traversal: results
111×
179×
225×
Linear traversal: analysis
Two cache aspects to watch out for
Locality
Prefetching
std::vector excels at both
std::list: sequentially allocated nodes provide some locality, but the standard does not guarantee it
Shuffled nodes are the worst-case scenario
[Diagram: current cache line vs. prefetched cache line]
Matrix sum
boost::multi_array<int,2> a(boost::extents[m][m]);

// row_col
long int res=0;
for(std::size_t i=0;i<m;++i){
  for(std::size_t j=0;j<m;++j){
    res+=a[i][j];
  }
}
return res;

// col_row
long int res=0;
for(std::size_t j=0;j<m;++j){
  for(std::size_t i=0;i<m;++i){
    res+=a[i][j];
  }
}
return res;
Matrix sum: results
1.7× 1.9×
Matrix sum: analysis
Max # of cache lines held in L1: 32 KB / 64 B = 512
L1 saturation @ 513² = 263,169 elements
L2 saturation @ 4097² = 16,785,409 elements
Partial cache associativity makes things actually worse
[Diagram: m×m matrix with m = 513; column-wise traversal touches a different cache line on every access]
Layout tricks
AOS vs SOA
// aos
struct particle {
  int x,y,z;
  int dx,dy,dz;
};
using particle_aos=std::vector<particle>;

auto ps=create_particle_aos(n);
long int res=0;
for(std::size_t i=0;i<n;++i)res+=ps[i].x+ps[i].y+ps[i].z;
return res;

// soa
struct particle_soa {
  std::vector<int> x,y,z;
  std::vector<int> dx,dy,dz;
};

auto ps=create_particle_soa(n);
long int res=0;
for(std::size_t i=0;i<n;++i)res+=ps.x[i]+ps.y[i]+ps.z[i];
return res;
AOS vs SOA: results
3×
2.2×
AOS vs SOA: analysis
AOS cache load throughput: 8 useful values per fetch (average)
SOA throughput: 16 useful values per fetch
16/8 = 2 times faster
Yet results are even better than this!
[Diagram: SOA packs x, y and z each into contiguous 64 B lines, so every fetched int is used; AOS interleaves x y z dx dy dz, so only half of each fetched line is useful]
Compact AOS vs SOA
// aos
struct particle {
  int x,y,z;
};
using particle_aos=std::vector<particle>;
// ...same as before

// soa
struct particle_soa {
  std::vector<int> x,y,z;
};
// ...same as before
Compact AOS vs SOA: results
2.9× 2×
1.1×
Compact AOS vs SOA: analysis
Both AOS and SOA fetch only useful values
Yet SOA is still faster (though its advantage shrinks as n grows)
The CPU is prefetching x, y, z in parallel
Feeding L1 from L2 beats feeding it from L3, which in turn beats feeding it from DRAM
Random access in AOS vs SOA
// aos
auto ps=create_particle_aos(n);
long int res=0;
for(std::size_t i=0;i<n;++i){
  auto idx=rnd(gen);
  res+=ps[idx].x+ps[idx].y+ps[idx].z;
}
return res;

// soa
auto ps=create_particle_soa(n);
long int res=0;
for(std::size_t i=0;i<n;++i){
  auto idx=rnd(gen);
  res+=ps.x[idx]+ps.y[idx]+ps.z[idx];
}
return res;
Random access in AOS vs SOA: results
1.3×
Random access in AOS vs SOA: analysis
You’re now in a position to make an educated guess
Moral: don’t trust your intuition
Parallel count
std::vector<int> v=...;

auto f=[&](int* px,int* py,int* pz,int* pw){
  auto th=[](int* p,int* first,int* last){
    *p=0;
    while(first!=last){
      int x=*first++;
      *p+=x%2;
    }
  };
  std::thread t1(th,px,v.data(),v.data()+n/4);
  std::thread t2(th,py,v.data()+n/4,v.data()+n/2);
  std::thread t3(th,pz,v.data()+n/2,v.data()+n*3/4);
  std::thread t4(th,pw,v.data()+n*3/4,v.data()+n);
  t1.join();t2.join();t3.join();t4.join();
  return *px+*py+*pz+*pw;
};

int res[49];

// near result addresses (all four in the same cache line)
return f(&res[0],&res[1],&res[2],&res[3]);

// far result addresses (64 B apart: one cache line each)
return f(&res[0],&res[16],&res[32],&res[48]);
Parallel count: results
2.2×
False sharing
[Diagram: T1 and T2 run on different cores, each with private L1 and L2, sharing L3]
Writing to res[0] invalidates the cache line holding res[1] in the other core's cache
Coherence management requires a full write to DRAM
Rule of thumb: write to shared memory only to communicate final thread results
Branch prediction
Filtered sum
std::vector<int> v=...;

// unsorted
long int res=0;
for(int x:v)if(x>128)res+=x;
return res;

// sorted
std::sort(v.begin(),v.end());
long int res=0;
for(int x:v)if(x>128)res+=x;
return res;
Filtered sum: results
6.1× 5.9×
Instruction pipeline
Increases execution throughput by up to n× (per-instruction latency stays the same)
Sandy Bridge: 14-17 stages
Ivy Bridge: 14-19 stages
A branch prediction failure flushes the partially executed pipeline
[Diagram: instructions 1 through n advance through pipeline stages 1 to n one cycle apart, overlapping execution]
A particularly obnoxious case: bool-based processing
struct particle{
  bool active;
  int  x,y,z;
  ...
};
std::vector<particle> particles;

void render_particles(){
  for(const particle& p:particles)if(p.active)render(p);
}
Remove the active flag by laying out objects smartly:

struct particle{
  int x,y,z;
  ...
};
std::vector<particle> particles;
std::size_t num_active; // active particles occupy the first num_active slots

void render_particles(){
  for(std::size_t i=0;i<num_active;++i)render(particles[i]);
}
Better cache density, no branching, fewer elements processed
Polymorphic containers
struct base{
  virtual int f()const=0;
  virtual ~base(){}
};
struct derived1:base{virtual int f()const{return 1;}};
struct derived2:base{virtual int f()const{return 2;}};
struct derived3:base{virtual int f()const{return 3;}};
using pointer=std::shared_ptr<base>;

std::vector<pointer> v=...;

// unsorted
long int res=0;
for(const auto& p:v)res+=p->f();
return res;

// sorted
std::sort(v.begin(),v.end(),[](const pointer& p,const pointer& q){
  return std::type_index(typeid(*p))<std::type_index(typeid(*q));
});
long int res=0;
for(const auto& p:v)res+=p->f();
return res;

// poly_collection: more on this later
poly_collection<base> v=...;
v.for_each([&](const base& x){res+=x.f();});
Polymorphic containers: results
4.3×
1.2×
10.1×
1.6× 18.6×
Polymorphic containers: analysis
Sorting improves branch prediction even without data locality
What is this poly_collection thing?
Branch prediction-friendly + optimum data locality
More info on http://tinyurl.com/poly-collection
Cutting-edge compilers: look for devirtualization
[Diagram: vector<shared_ptr<base>> scatters d1/d2/d3 objects through memory and visits them in arbitrary type order; poly_collection<base> stores the objects of each derived type contiguously]
Concluding remarks
Concluding remarks
You can’t just ignore reality if you’re serious about performance
Think about the planet: don’t waste energy
Measure measure measure
(Almost) nothing beats traversing a vector in its natural order
One indirection ≈ one cache miss
If multithreading, beware of false sharing
SOA can get you huge speedups (but not always)
Can you remove flags?
Look for data-oriented design if you want to know more
Be regular: sort your objects
Mind the cache
Thank you
github.com/joaquintides/usingstdcpp2015
using std::cpp 2015
Joaquín M López Muñoz <[email protected]> Madrid, November 2015