Introduction to Memory Optimization
An Introduction to Memory Optimization Techniques
Koray Hagen
My background
1. Software engineer in the games industry
   1. League of Legends
   2. Hearthstone
   3. PlayStation 4
   4. Xbox 360
2. Worked on many optimization problems for:
   1. Game scalability
   2. Content pipelines
   3. Client run-time
   4. Server run-time
   5. Data formats
3. One rule. Performance is king.
Prerequisite Knowledge
1. Exposure to C or C++
2. Exposure to computer architecture
Thank you to SCEA Santa Monica Studio
1. Christer Ericson
   1. VP, Central Technology, Activision
   2. Previously, Director of Technology, SCEA
   3. Author of “Real-time Collision Detection”
   4. Author of the original optimization presentation
   5. Authority on optimization and game physics
The agenda
1. “The Black Box”
2. Memory hierarchies and cache
3. Optimization techniques for instructions and data
4. Aliasing and restriction
5. Closing thoughts and further reading
What won’t be covered
1. Data-Oriented Design
   1. Modern object-oriented programming is polluting programmers’ minds
   2. A refocus on creating better representations and computation around data rather than abstractions
2. SIMD or other approaches to vectorized code generation and usage
   1. Instruction-level parallelism
   2. A deep dive into the losing war between processor speed and memory speed
The Black Box
Challenges in the modern era of computing
The downward spiral of performance
1. There is now an accelerating gap between CPU and memory performance
   1. CPU speed increasing annually by ~60%
   2. Memory speed increasing annually by ~10%
2. The gap has been closed by use of cache memory
   1. Recent renewed interest by the C++ community
   2. Unfortunately, cache is still largely unexploited
   3. Diminishing returns for large caches (physical locality)
3. Advances in instruction parallelism are overshadowing data performance
   1. Data consumption at run-time is astronomically high
4. Inefficient cache use equals lower performance
   1. Most obvious question: how do I increase cache utilization?
   2. Answer: cache-aware programming/programmers (you, after today’s slides)
Memory hierarchies and cache
A look at current architectures
An overview of cache
1. Memory hierarchy
   1. Discrete instruction cache
   2. Discrete data cache
2. Cache lines
   1. Cache is physically divided into cache lines of N bytes (typically 32 or 64 bytes) each
   2. The discrete unit for counting memory accesses
3. Example architecture – Direct mapping
   1. For an n-byte cache, bytes at positions k, k+n, k+2n, … map to the same cache line
4. Example architecture – N-way associative
   1. Each logical cache line corresponds to N physical lines
   2. Minimizes cache thrashing
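The two mappings above are just address arithmetic. A minimal sketch (the 32 KB size, 64-byte lines, and 4-way figures are assumed for illustration, not taken from the slides):

```cpp
#include <cassert>
#include <cstddef>

// Assumed geometry: 32 KB cache, 64-byte lines.
constexpr std::size_t kLineSize  = 64;
constexpr std::size_t kCacheSize = 32 * 1024;
constexpr std::size_t kNumLines  = kCacheSize / kLineSize; // 512 lines

// Direct mapped: addresses k, k + kCacheSize, k + 2*kCacheSize, ...
// all land on the same line and evict each other (thrashing).
constexpr std::size_t direct_mapped_line(std::size_t address) {
    return (address / kLineSize) % kNumLines;
}

// 4-way set associative: the same index now selects a set of 4 physical
// lines, so up to 4 conflicting blocks can live in cache at once.
constexpr std::size_t kWays = 4;
constexpr std::size_t set_index(std::size_t address) {
    return (address / kLineSize) % (kNumLines / kWays);
}
```

Two addresses exactly one cache size apart collide in the direct-mapped scheme; associativity exists to soften exactly that case.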
Theoretical memory hierarchy
CPU: 1 cycle
L1 cache: ~2-5 cycles
L2 cache: ~5-20 cycles
Main memory: ~40-100 cycles
Example cache specifications
1. Emergence of L3 cache for high-end processors
2. Nothing magical about speed: strict physical locality requirements to the co-processors and main memory

Platform        L1 cache (I & D)   L2 cache
PlayStation 4   256 KB             4 MB
Wii U           64 KB              3 MB
Xbox One        256 KB             4 MB
PC              512 KB             6 MB
Beware the three C’s of cache misses
1. Compulsory misses
   1. Unavoidable misses when reading data for the first time
2. Capacity misses
   1. Not enough cache space to hold all active data
   2. Too much data accessed in between successive uses
3. Conflict misses
   1. Two blocks of memory map to the same location and there is not enough room to hold both, ultimately causing thrashing
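Since the cache line is the unit of counting, compulsory misses can be estimated by counting distinct lines touched. A sketch (the function and its name are mine, not from the slides) showing why a sparse, strided walk is as expensive as a dense one once the stride reaches the line size:

```cpp
#include <cassert>
#include <cstddef>
#include <set>

// Count how many distinct 64-byte cache lines a strided walk over `bytes`
// of memory touches. On a cold cache, every distinct line costs at least
// one compulsory miss, regardless of how few bytes of it you actually use.
std::size_t lines_touched(std::size_t bytes, std::size_t stride,
                          std::size_t line_size = 64) {
    std::set<std::size_t> lines;
    for (std::size_t offset = 0; offset < bytes; offset += stride)
        lines.insert(offset / line_size);
    return lines.size();
}
```

Reading every 4th byte of a 4 KB buffer and reading every 64th byte both touch all 64 lines; only at strides beyond the line size does the miss count start to drop.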
Introduce the three R’s into your program
1. Rearrange (code and data)
   1. Change layouts to increase spatial locality
2. Reduce (size and number of cache lines read)
   1. Create smaller and smarter formats
   2. Compression
3. Reuse (cache lines)
   1. Increase temporal and spatial locality
Instruction and data cache optimization
Strategies for performance and cache-awareness
Instruction optimization strategy
1. Locality
   1. Reorder functions
      1. Manually within the file
      2. Reorder object files during the linker stage
      3. Visual Studio intrinsic: #pragma section("section-name" [attributes])
   2. Adapt coding style
      1. Balance between monolithic functions and separation of logic
      2. Encapsulation and OOP are less cache friendly – usually, though not always
   3. Implicit code generation
      1. Example (casting: cvttss2si)
      2. Study the code that your compiler generates
      3. Build intuition regarding how the compiler optimizes
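The cvttss2si example above refers to implicit code generation from an ordinary-looking cast. A sketch of what to look for in the generated code:

```cpp
#include <cassert>

// On x86-64 with SSE, this cast compiles to a single cvttss2si instruction
// ("convert with truncation scalar single to signed integer"). Studying the
// generated code tells you something the source does not shout about: the
// conversion truncates toward zero rather than rounding.
int to_int(float f) {
    return static_cast<int>(f); // cvttss2si on x86-64
}
```

Compare the disassembly of this against an older x87 build, where the same cast required fiddling with the FPU control word, and the value of reading compiler output becomes obvious.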
Instruction optimization strategy … continued
1. Size
   1. Beware inlining, unrolling, and large macros!
      1. Always understand the cost-value tradeoffs of programming decisions
      2. Avoid unnecessary features and code paths
      3. Loop splitting and loop compounding
   2. Again, always study the generated code.
Data optimization strategy
1. Compress data
   1. Does not necessarily mean compression algorithms
   2. Can you store more in less?
2. Cache-conscious data layouts
   1. Padding to align to cache lines
   2. Reordering to align to cache lines
   3. Ordering variables by personal preference has no value
3. Linearizing data
   1. Array-based data structures
Structure field data reordering
Data that is likely to be accessed together should be stored together
Be aware of compiler padding
1. What are the values of size_one and size_two?
2. Hint: not the same, so how is member data aligned?
3. Ordering member data by self-enlightened organization is the result of bad programmer habits.
“Hot and cold” data division
1. Achieve much better cache coherence by striving for temporal locality among data members.
2. How often is your data in cache? Data access must scale towards the most common case, not the worst case.
“Hot and cold” data division … continued
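A sketch of the division (the entity type and its fields are invented for illustration): keep the per-frame "hot" fields inline so they share cache lines, and push the rarely touched "cold" fields behind a pointer.

```cpp
#include <cassert>
#include <cstddef>

// Cold data: touched rarely (tools, debugging), so it should not ride
// along in the cache lines the frame loop streams through.
struct EntityCold {
    char name[64];
    int  debug_flags;
};

// Hot data: touched every frame. One pointer to the cold block replaces
// 68+ cold bytes inline, so more hot entities fit per cache line.
struct Entity {
    float x, y, z;     // position
    float vx, vy, vz;  // velocity
    EntityCold* cold;  // indirection only on the rare cold path
};
```

The tradeoff is explicit: the cold path pays one extra indirection so that the common case, iterating hot data, stays dense.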
Linearization of data
1. Nothing better than linear data
   1. Overall best spatial locality; values are right next to each other
   2. Easy to pre-fetch, and will result in better cache line hit probability
2. What if my data can’t be linearized easily?
   1. Linearize at run-time; there is no excuse
   2. Fetch and store into a custom cache
   3. Great candidates for linearization:
      1. Hierarchy traversal
      2. Indexed data
      3. Random-access data
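Hierarchy traversal is the classic case. A sketch (the layout choice is mine, in the spirit of the slide): a binary tree linearized into a flat array with the implicit heap layout, so child lookup is arithmetic instead of pointer chasing, and a whole-tree pass is a linear walk.

```cpp
#include <cassert>
#include <vector>

// Implicit heap layout: node i's children live at 2i+1 and 2i+2.
// No child pointers to fetch, so no dependent loads on the traversal path.
inline std::size_t left_child(std::size_t i)  { return 2 * i + 1; }
inline std::size_t right_child(std::size_t i) { return 2 * i + 2; }

// Visiting every node degenerates into a linear pass over contiguous
// memory: best-case spatial locality and trivially prefetchable.
int sum_tree(const std::vector<int>& nodes) {
    int total = 0;
    for (int v : nodes) total += v;
    return total;
}
```

A pointer-based tree would touch one (probably cold) cache line per node; here, sixteen int nodes share a single 64-byte line.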
Matrix multiplication example
1. The result of bad programmer habits, and programming towards an assumed general case
   1. How can it be better?
   2. What options do we have?
   3. How can we save ourselves?
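The slide's own listing is missing from this transcript; a sketch in the same spirit, a naive 2x2 multiply through pointers, shows the problem the next slide picks apart:

```cpp
#include <cassert>

// Naive 2x2 matrix multiply. Because the compiler must assume `result`
// may alias `lhs` or `rhs`, every store into result[i][j] forces the
// operands to be re-fetched from memory on the next iteration instead
// of staying in registers.
void multiply(float result[2][2],
              const float lhs[2][2], const float rhs[2][2]) {
    for (int i = 0; i < 2; ++i)
        for (int j = 0; j < 2; ++j)
            result[i][j] = lhs[i][0] * rhs[0][j] + lhs[i][1] * rhs[1][j];
}
```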
Matrix multiplication example … continued
1. But wait! There is a hidden assumption: that result is not lhs and is not rhs
2. The compiler does not and cannot know this (more on this later)
   1. Line 3: lhs[0][0] and lhs[0][1] must be re-fetched
   2. Line 4: rhs[0][0] and rhs[1][0] must be re-fetched
   3. Line 5: lhs[1][0], lhs[1][1], rhs[0][1] and rhs[1][1] must be re-fetched
3. We can do even better
   1. Let’s try unrolling the multiplication
Matrix multiplication example … continued
1. Cache all inputs; leave no room for unneeded indirection
2. Write to needed memory locations once
3. Result:
   1. No branches
   2. No conditionals
   3. No aliasing
   4. No side effects
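A sketch of the fully unrolled version the slide describes (the slide's own listing is not in this transcript): load every input into locals first, then write each output element exactly once.

```cpp
#include <cassert>

// All reads happen before any write, so no operand can be clobbered
// through `result` -- the function is correct even when result == lhs
// or result == rhs. Straight-line code: no branches, no loops, and the
// compiler can keep everything in registers.
void multiply_unrolled(float result[2][2],
                       const float lhs[2][2], const float rhs[2][2]) {
    const float a = lhs[0][0], b = lhs[0][1], c = lhs[1][0], d = lhs[1][1];
    const float e = rhs[0][0], f = rhs[0][1], g = rhs[1][0], h = rhs[1][1];
    result[0][0] = a * e + b * g;
    result[0][1] = a * f + b * h;
    result[1][0] = c * e + d * g;
    result[1][1] = c * f + d * h;
}
```

Note that caching the inputs by hand did double duty: it removed the re-fetches and made in-place multiplication safe.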
Real Example
Aliasing and restricted pointers
Run-time costs the compiler will never tell you
What is aliasing?
Aliasing is multiple references to the same storage location
What value is returned? Is it 1 or 2? Nobody knows.
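The slide's listing is not in this transcript; the classic example it alludes to looks like this:

```cpp
#include <cassert>

// If a and b point at the same int, the write through b clobbers the
// write through a and the function returns 2; if they don't alias, it
// returns 1. The compiler cannot know which case holds, so it must
// reload *a from memory instead of keeping 1 in a register.
int set_and_read(int* a, int* b) {
    *a = 1;
    *b = 2;
    return *a; // 1 or 2? It depends entirely on whether a == b.
}
```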
Penalties for introducing abstractions
1. Higher levels of abstraction have a negative effect on optimization
   1. Object-oriented code naturally inclines programmers toward cache obliviousness
   2. “Information hiding” is a key principle, potentially hiding insights into achieving optimal performance
2. Inevitably, lots of temporary objects
3. Objects live on the heap and stack
   1. Subject to aliasing problems
   2. Constant indirection to access and transform any meaningful data
4. Implicit aliasing through the this pointer
   1. Member variables are just as bad as globals
Penalties for introducing abstractions … continued
1. m_count is a member, not a local variable, and is therefore accessed through the implicit this pointer
2. m_count may be aliased by m_ptr
3. Every iteration is likely to re-fetch m_count from main memory
Penalties for introducing abstractions … continued
Are you sure the compiler does this optimization for you? Don’t leave it up to chance.
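A sketch of the scenario (m_count and m_ptr follow the transcript; the class around them is invented): the fix is to copy the member into a local, because locals cannot alias anything.

```cpp
#include <cassert>
#include <cstddef>

struct Buffer {
    int*        m_ptr;
    std::size_t m_count;

    // m_count is reached through `this` and may alias *m_ptr, so a
    // conservative compiler re-fetches it on every loop iteration.
    void clear_slow() {
        for (std::size_t i = 0; i < m_count; ++i)
            m_ptr[i] = 0;
    }

    // Hoist the member into a local once, by hand. The loop bound now
    // lives in a register; nothing the loop writes can change it.
    void clear_fast() {
        const std::size_t count = m_count;
        for (std::size_t i = 0; i < count; ++i)
            m_ptr[i] = 0;
    }
};
```

One line of explicitness, and the optimization no longer depends on how clever the compiler feels today.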
Restricted pointers
1. The restrict keyword
   1. Supported by many C++ compilers (MSVC, GCC)
   2. Controversial among standards committees
2. Restrict is a promise
   1. Tells the compiler that, for the scope of the pointer, the target location of the pointer will only be accessed through that pointer alone. It is a promise not to alias.
3. Important in C++
   1. Helps combat abstraction penalty problems
   2. Tricky semantics, easy to get wrong
   3. The compiler will never inform you about incorrect usage
   4. Incorrect usage results in agonizing pain
Restricted pointers … continued
What you really want is the compiler to generate this
… But because of aliasing, the compiler cannot do it
Restricted pointers … continued
The fix? Restrict the pointers
1. Prefer an explicit coding style; leave nothing to chance
2. Be careful and pragmatic; understand what code paths can be taken with functions
3. Remember, a restrict-qualified pointer can grant access to a non-restrict pointer
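A sketch using the compiler-specific spelling __restrict, which both MSVC and GCC accept (the keyword is non-standard in C++, as noted above; the function itself is my example, not the slide's):

```cpp
#include <cassert>

// __restrict promises the compiler that dst and src never overlap for
// the lifetime of these pointers. Freed from the aliasing check, it can
// keep loaded values in registers and vectorize the loop instead of
// re-loading src[i] after every store to dst[i].
void add_arrays(float* __restrict dst, const float* __restrict src, int n) {
    for (int i = 0; i < n; ++i)
        dst[i] += src[i];
}
```

The promise is the caller's burden: pass overlapping arrays to this function and the behavior is undefined, and no compiler will warn you.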
Restricted pointers … continued
Remember, despite intuition “const” doesn’t help
1. “Wait, since *rhs is const, lhs[i] cannot write to it, right?” … WRONG
2. const promises that *rhs is const through rhs, NOT that *rhs is const in general
3. const is for detecting programming errors, not fixing aliasing
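The point can be shown in four lines (my example, not the slide's listing): a pointer-to-const does not stop the pointee from changing through another path.

```cpp
#include <cassert>

// src is a pointer-to-const, yet the write through dst still changes
// *src when the two pointers alias. const only restricts what may be
// done *through src*; it says nothing about other paths to the same
// storage, so the compiler still cannot assume *src is unchanged.
int copy_and_read(int* dst, const int* src) {
    *dst = 42;
    return *src; // may observe 42 if dst == src, despite the const
}
```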
Tips for avoiding aliasing
1. Minimize use of globals, pointers, and references
   1. Recall the semantics of the Matrix example
   2. Pass small variables by value
   3. Use local variables as much as possible
2. Restrict pointers and references when appropriate
3. Declare variables close to the point of use, and no further
4. Aim to write “pure” functions, and strive for const-correctness
5. Study generated code!
Optimization isn’t magic
1. Strive for explicitness in programming
   1. Leave no room for unintended side effects
2. Understand the hardware architecture that is being targeted
   1. Constant factors in programming matter
   2. Relevant for all platforms:
      1. Game consoles
      2. Mobile
      3. Servers
      4. … Even normal desktops/laptops
3. Not over, many more topics to explore
   1. Branch prediction
   2. SIMD and vectorized code
   3. Cache-aware data structures
Further reading and references
1. Abrash, Michael. Zen of Code Optimization. Scottsdale, AZ: Coriolis Group, 1994. Print.
2. Ericson, Christer. Real-time Collision Detection. Amsterdam: Elsevier, 2005. Print.
3. Fabian, Richard. "Data-Oriented Design." Data-Oriented Design. N.p., 25 June 2013. Web. 03 Apr. 2014.
4. Hennessy, John L., and David A. Patterson. Computer Architecture: A Quantitative Approach. San Francisco, CA: Morgan Kaufmann, 2003. Print.
Thank you, any questions?