GUSTO: General architecture design Utility and Synthesis Tool for Optimization
Qualifying Exam for Ali Irturk, University of California, San Diego


Transcript

  • Slide 1
  • GUSTO: General architecture design Utility and Synthesis Tool for Optimization. Qualifying Exam for Ali Irturk, University of California, San Diego.
  • Slide 2
  • Thesis Objective: design of a novel tool, GUSTO, for automatic generation and optimization of application-specific matrix computation architectures from a given MATLAB algorithm, and demonstration of the tool's effectiveness through rapid architectural production of various signal processing, computer vision and financial computation algorithms.
  • Slide 3
  • Motivation: matrix computations lie at the heart of most scientific computational tasks (wireless communication, financial computation, computer vision). Matrix inversion is required in equalization algorithms to remove the effect of the channel on the signal, in the mean variance framework to solve a constrained maximization problem, and in the optical flow computation algorithm for motion estimation. (QRD, A^-1)
  • Slide 4
  • Motivation: a number of tools translate MATLAB algorithms to a hardware description language; however, we believe the majority of these tools take the wrong approach. We take a more focused approach, developing a tool that specifically targets matrix computation algorithms.
  • Slide 5
  • Computing Platforms. ASICs offer exceptional performance at the cost of long time to market and substantial costs; DSPs, GPUs and the CELL BE offer ease of development and fast time to market but lower performance; FPGAs combine ease of development and fast time to market with ASIC-like performance.
  • Slide 6
  • Field Programmable Gate Arrays. FPGAs are ideal platforms: high processing power, flexibility, and low non-recurring engineering (NRE) cost. Used properly, these features enhance performance and throughput significantly. BUT: few tools exist that can aid the designer with the many system, architectural and logic design choices.
  • Slide 7
  • GUSTO (General architecture design Utility and Synthesis Tool for Optimization): an easy-to-use tool for more efficient design space exploration and development. Inputs: the algorithm, matrix dimensions, bit width, resource allocation and mode; output: the required HDL files. GUSTO: An Automatic Generation and Optimization Tool for Matrix Inversion Architectures, Ali Irturk, Bridget Benson, Shahnam Mirzaei and Ryan Kastner, under review, Transactions on Embedded Computing Systems.
  • Slide 8
  • Outline Motivation GUSTO: Design Tool and Methodology Applications Matrix Decomposition Methods Matrix Inversion Methods Mean Variance Framework for Optimal Asset Allocation Future Work Publications 8
  • Slide 9
  • GUSTO Design Flow. Inputs: the algorithm, matrix dimensions, type and number of arithmetic resources, and data representation (drawn from a design library of adders, subtractors, multipliers and dividers). Steps: algorithm analysis, instruction generation, resource allocation and error analysis, followed by architecture generation. Mode 1 produces a general purpose (dynamic) architecture; Mode 2 applies resource trimming and scheduling to produce an application specific (static) architecture. Both are synthesized and simulated with Xilinx and Mentor Graphics tools to obtain area, latency, throughput and simulation results.
  • Slide 10
  • GUSTO Modes. Mode 1 generates a general purpose architecture and its datapath (instruction controller, memory controller, and arithmetic units such as adders and multipliers); it can be used to explore other algorithms, but it does not lead to high-performance results. Mode 2 creates a scheduled, static, application specific architecture: it simulates the Mode 1 architecture to collect scheduling information and to determine the usage of resources.
  • Slide 11
  • Matrix Multiplication Core Design. [GUSTO design flow diagram, as on Slide 9; the following slides walk through it step by step.]
  • Slide 12
  • Matrix Multiplication Core Design. [GUSTO design flow diagram, repeated; next step: Algorithm Analysis.]
  • Slide 13
  • Matrix Multiplication Core Design, Algorithm Analysis. The built-in function C = A * B is expanded into explicit loops:
        for i = 1:n
          for j = 1:n
            for k = 1:n
              Temp = A(i,k) * B(k,j);
              C(i,j) = C(i,j) + Temp;
            end
          end
        end
  • Slide 14
  • Matrix Multiplication Core Design. [GUSTO design flow diagram, repeated; next step: Instruction Generation.]
  • Slide 15
  • Matrix Multiplication Core Design, Instruction Generation. Each instruction has the form [operation, destination, operand 1, operand 2], for example:
        C(1,1) = A(1,1) * B(1,1)   ->   [mul, C(1,1), A(1,1), B(1,1)]
        Temp   = A(1,2) * B(2,1)   ->   [mul, temp,   A(1,2), B(2,1)]
        C(1,1) = C(1,1) + Temp     ->   [add, C(1,1), C(1,1), temp ]
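    As an illustration of this step, the sketch below (not GUSTO's actual code; all variable names are hypothetical) unrolls the triple loop of an n x n matrix multiplication into a flat list of {operation, destination, operand 1, operand 2} instructions of the kind shown above.

        % Sketch: unroll C = A*B (n x n) into [op, dest, src1, src2] instructions.
        n = 2;
        instr = {};                          % each row: {op, dest, src1, src2}
        for i = 1:n
          for j = 1:n
            for k = 1:n
              src1 = sprintf('A(%d,%d)', i, k);
              src2 = sprintf('B(%d,%d)', k, j);
              if k == 1
                % first product writes C(i,j) directly
                instr(end+1,:) = {'mul', sprintf('C(%d,%d)', i, j), src1, src2};
              else
                % later products go to a temporary, then accumulate
                instr(end+1,:) = {'mul', 'temp', src1, src2};
                instr(end+1,:) = {'add', sprintf('C(%d,%d)', i, j), ...
                                  sprintf('C(%d,%d)', i, j), 'temp'};
              end
            end
          end
        end
        disp(instr)                          % e.g. 'mul' 'C(1,1)' 'A(1,1)' 'B(1,1)'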
  • Slide 16
  • Matrix Multiplication Core Design. [Block diagram: the generated instructions drive an instruction controller, a memory controller, and arithmetic units (adders and multipliers).]
  • Slide 17
  • Matrix Multiplication Core Design. [GUSTO design flow diagram, repeated; next step: Resource Allocation.]
  • Slide 18
  • Matrix Multiplication Core Design. [Block diagram, repeated: the resource allocation step sets the number of arithmetic units (adders and multipliers).]
  • Slide 19
  • Matrix Multiplication Core Design. [GUSTO design flow diagram, repeated; next step: Error Analysis.]
  • Slide 20
  • Matrix Multiplication Core Design, Error Analysis. GUSTO's fixed-point arithmetic results (with a user-defined, variable bit width) are compared against MATLAB's floating-point results (single/double precision) on user-defined input data. Error analysis metrics: 1) mean error, 2) peak error, 3) standard deviation of the error, 4) mean percentage error.
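    A minimal sketch of how these four metrics could be computed, assuming the fixed-point results are emulated in software by rounding to a hypothetical fractional bit width; GUSTO's own error-analysis code is not shown in the slides.

        % Sketch: compare emulated fixed-point results against double-precision results.
        n = 4;  frac_bits = 12;                        % hypothetical bit width
        A = rand(n); B = rand(n);
        C_ref = A * B;                                 % double-precision reference
        q = @(x) round(x * 2^frac_bits) / 2^frac_bits; % crude fixed-point emulation
        C_fix = q(q(A) * q(B));
        err = C_fix(:) - C_ref(:);
        mean_err = mean(err);
        peak_err = max(abs(err));
        std_err  = std(err);
        mean_pct = mean(abs(err ./ C_ref(:))) * 100;   % mean percentage error
        fprintf('mean %.3g  peak %.3g  std %.3g  pct %.3g%%\n', ...
                mean_err, peak_err, std_err, mean_pct);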
  • Slide 21
  • Matrix Multiplication Core Design Error Analysis 21
  • Slide 22
  • Matrix Multiplication Core Design. [GUSTO design flow diagram, repeated; next step: Architecture Generation (Mode 1).]
  • Slide 23
  • Matrix Multiplication Core Design, Architecture Generation (Mode 1): a general purpose architecture with dynamic scheduling, dynamic memory assignments and full connectivity between the instruction controller, memory controller and arithmetic units.
  • Slide 24
  • Matrix Multiplication Core Design. [GUSTO design flow diagram, repeated; next step: Architecture Generation (Mode 2).]
  • Slide 25
  • Matrix Multiplication Core Design, Architecture Generation (Mode 2): an application specific architecture with static scheduling, static memory assignments and only the required connectivity.
  • Slide 26
  • GUSTO Trimming Feature. [Diagram: arithmetic unit A (inputs In_A1, In_A2; output Out_A), arithmetic unit B (inputs In_B1, In_B2; output Out_B) and a memory (input In_mem1; outputs Out_mem1, Out_mem2). Simulation runs record which of the possible sources (Out_A, Out_B, Out_mem1, Out_mem2) actually drive In_A1 and In_A2, and the unused connections to unit A are trimmed.]
  • Slide 27
  • GUSTO Trimming Feature. [Same diagram: the simulation runs likewise determine which sources drive In_B1 and In_B2, and the unused connections to unit B are trimmed.]
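    The sketch below illustrates the idea behind trimming under our own simplification (not the tool's implementation): connectivity is modelled as a source-to-destination matrix, the schedule is "simulated" as a trace of exercised connections, and anything never exercised can be removed from the generated architecture.

        % Sketch: mark which source->destination connections a schedule actually uses.
        srcs = {'Out_A','Out_B','Out_mem1','Out_mem2'};
        dsts = {'In_A1','In_A2','In_B1','In_B2','In_mem1'};
        used = false(numel(srcs), numel(dsts));        % start from full connectivity
        % hypothetical trace of (source, destination) pairs seen during simulation
        trace = {'Out_mem1','In_A1'; 'Out_mem2','In_A2'; 'Out_A','In_mem1'};
        for t = 1:size(trace,1)
          s = strcmp(srcs, trace{t,1});
          d = strcmp(dsts, trace{t,2});
          used(s, d) = true;                           % this connection is exercised
        end
        % connections never exercised can be trimmed from the generated architecture
        fprintf('%d of %d connections can be trimmed\n', nnz(~used), numel(used));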
  • Slide 28
  • Matrix Multiplication Core Results. [Block diagrams and an area (slices) versus throughput chart for three single-core designs, each with an instruction controller and a memory controller: Design 1 with 4 adders and 4 multipliers, Design 2 with 2 adders and 4 multipliers, Design 3 with 2 adders and 2 multipliers.] Hardware Implementation Trade-offs of Matrix Computation Architectures using Hierarchical Datapaths, Ali Irturk, Nikolay Laptev and Ryan Kastner, under review, Design Automation Conference (DAC 2009).
  • Slide 29
  • Hierarchical Datapaths. Unfortunately, this flat organization does not expose the complete design space to the user for exploring better design alternatives, and it does not scale well with the complexity of the algorithms: the number of instructions, the number of functional units, and the internal storage and communication all grow, hurting optimization and performance. To overcome these issues, we incorporate hierarchical datapaths and heterogeneous architecture generation options into GUSTO.
  • Slide 30
  • Matrix Multiplication Core Results. [Core A_1 is a single core with 4 adders, 4 multipliers, an instruction controller and a memory controller. Design 4 uses one Core A_1; Design 5 uses 16 Core A_1 cores in a hierarchical datapath.] Hardware Implementation Trade-offs of Matrix Computation Architectures using Hierarchical Datapaths, Ali Irturk, Nikolay Laptev and Ryan Kastner, under review, Design Automation Conference (DAC 2009).
  • Slide 31
  • Matrix Multiplication Core Results. [As above, plus Core A_2 designs: Design 6 uses one Core A_2 and Design 7 uses 8 Core A_2 cores.] Hardware Implementation Trade-offs of Matrix Computation Architectures using Hierarchical Datapaths, Ali Irturk, Nikolay Laptev and Ryan Kastner, under review, Design Automation Conference (DAC 2009).
  • Slide 32
  • Matrix Multiplication Core Results. [As above, plus Core A_4 designs: Design 8 uses one Core A_4 and Design 9 uses 4 Core A_4 cores.] Hardware Implementation Trade-offs of Matrix Computation Architectures using Hierarchical Datapaths, Ali Irturk, Nikolay Laptev and Ryan Kastner, under review, Design Automation Conference (DAC 2009).
  • Slide 33
  • Matrix Multiplication Core Results. [Area (slices) versus throughput chart comparing Designs 1-9: the single-core designs and the homogeneous hierarchical architectures built from 16 Core A_1, 8 Core A_2 or 4 Core A_4 cores.] Hardware Implementation Trade-offs of Matrix Computation Architectures using Hierarchical Datapaths, Ali Irturk, Nikolay Laptev and Ryan Kastner, under review, Design Automation Conference (DAC 2009).
  • Slide 34
  • Matrix Multiplication Core Results. [Designs 10-12 are heterogeneous architectures that mix Core A_1, Core A_2 and Core A_4 cores; an area (slices) versus throughput chart compares them with Designs 1-9.] Hardware Implementation Trade-offs of Matrix Computation Architectures using Hierarchical Datapaths, Ali Irturk, Nikolay Laptev and Ryan Kastner, under review, Design Automation Conference (DAC 2009).
  • Slide 35
  • Outline Motivation GUSTO: Design Tool and Methodology Applications Matrix Decomposition Methods Matrix Inversion Methods Mean Variance Framework for Optimal Asset Allocation Future Work Publications 35
  • Slide 36
  • Outline Motivation GUSTO: Design Tool and Methodology Applications Matrix Decomposition Methods Matrix Inversion Methods Mean Variance Framework for Optimal Asset Allocation Future Work Publications 36
  • Slide 37
  • Matrix Decompositions: QR, LU and Cholesky. QR: the given matrix A is factored as A = Q*R, with Q an orthogonal matrix and R an upper triangular matrix. LU: A = L*U, with L a lower triangular and U an upper triangular matrix. Cholesky: A = G*G', with G a unique lower triangular matrix (the Cholesky triangle) and G' its transpose.
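    For reference, the three factorizations can be checked in a few lines of MATLAB with the built-in functions; this is only a numerical sanity check of the definitions above, not the hardware algorithm GUSTO generates.

        n = 4;
        A = rand(n);                     % general matrix for QR and LU
        S = A*A' + n*eye(n);             % symmetric positive definite matrix for Cholesky
        [Q, R]    = qr(A);               % A = Q*R,   Q orthogonal, R upper triangular
        [L, U, P] = lu(A);               % P*A = L*U, L lower / U upper triangular
        G         = chol(S, 'lower');    % S = G*G',  G lower triangular (Cholesky triangle)
        norm(A - Q*R), norm(P*A - L*U), norm(S - G*G')   % all close to zero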
  • Slide 38
  • Matrix Inversion. For a given matrix A, the inverse A^-1 satisfies A * A^-1 = I (the identity matrix). Full matrix inversion is costly!
  • Slide 39
  • Results, Inflection Point Analysis. [Chart: results as a function of matrix size.] Automatic Generation of Decomposition based Matrix Inversion Architectures, Ali Irturk, Bridget Benson and Ryan Kastner, In Proceedings of the IEEE International Conference on Field-Programmable Technology (ICFPT), December 2008.
  • Slide 40
  • Results, Inflection Point Analysis. Implementations: serial and parallel; bit widths: 16, 32 and 64 bits; matrix sizes: 2x2, 3x3, ..., 8x8.
  • Slide 41
  • Results Inflection Point Analysis: Decomposition Methods 41
  • Slide 42
  • Results Inflection Point Analysis: Matrix Inversion 42 An FPGA Design Space Exploration Tool for Matrix Inversion Architectures, Ali Irturk, Bridget Benson, Shahnam Mirzaei and Ryan Kastner, In Proceedings of the IEEE Symposium on Application Specific Processors (SASP), June 2008.
  • Slide 43
  • Results, Finding the Optimal Hardware: Decomposition Methods. Decrease in area from the general purpose architecture (Mode 1) to the application specific architecture (Mode 2): QR 94%, LU 83%, Cholesky 86%. Architectural Optimization of Decomposition Algorithms for Wireless Communication Systems, Ali Irturk, Bridget Benson, Nikolay Laptev and Ryan Kastner, In Proceedings of the IEEE Wireless Communications and Networking Conference (WCNC 2009), April 2009.
  • Slide 44
  • Results, Finding the Optimal Hardware: Decomposition Methods. Increase in throughput from the general purpose architecture (Mode 1) to the application specific architecture (Mode 2): QR 68%, LU 16%, Cholesky 14%.
  • Slide 45
  • Results, Finding the Optimal Hardware: Matrix Inversion (using QR): an average 59% decrease in area and a 3X increase in throughput.
  • Slide 46
  • Results Architectural Design Alternatives: Matrix Inversion 46
  • Slide 47
  • Results Architectural Design Alternatives: Matrix Inversion 47
  • Slide 48
  • Results, Comparison with Previously Published Work: Matrix Inversion. [Table comparing analytic-method designs (Eilert et al. and our Implementations A, B and C) with QR-based designs (Edman et al., Karkooti et al. and our method), reporting bit width (16 or 20 bits for the analytic designs, 12 or 20 bits for the QR designs), data type (fixed versus floating point), device (Virtex 2 or Virtex 4), slices, DSP48s, BRAMs and throughput (10^6 inversions/s). For the QR designs, Edman et al. use 4400 slices (0.28 x 10^6 inversions/s), Karkooti et al. 9117 slices (0.12), and our method 3584 slices (0.26).] J. Eilert, D. Wu, D. Liu, Efficient Complex Matrix Inversion for MIMO Software Defined Radio, IEEE International Symposium on Circuits and Systems (2007). F. Edman, V. Öwall, A Scalable Pipelined Complex Valued Matrix Inversion Architecture, IEEE International Symposium on Circuits and Systems (2005). M. Karkooti, J.R. Cavallaro, C. Dick, FPGA Implementation of Matrix Inversion Using QRD-RLS Algorithm, Asilomar Conference on Signals, Systems and Computers (2005).
  • Slide 49
  • Results, Comparison with Previously Published Work: Matrix Inversion. Edman et al.: 12-bit fixed point on a Virtex 2, 4400 slices, throughput 0.28 x 10^6 inversions/s. Karkooti et al.: 20-bit floating point on a Virtex 4, 9117 slices, 22 DSP48s, throughput 0.12 x 10^6 inversions/s. Our method (fixed point on a Virtex 4, 1 BRAM per design): QR 3584 slices, 12 DSP48s, 0.26 x 10^6 inversions/s; LU 2719 slices, 0.33 x 10^6; Cholesky 3682 slices, 0.25 x 10^6. F. Edman, V. Öwall, A Scalable Pipelined Complex Valued Matrix Inversion Architecture, IEEE International Symposium on Circuits and Systems (2005). M. Karkooti, J.R. Cavallaro, C. Dick, FPGA Implementation of Matrix Inversion Using QRD-RLS Algorithm, Asilomar Conference on Signals, Systems and Computers (2005).
  • Slide 50
  • Outline Motivation GUSTO: Design Tool and Methodology Applications Matrix Decomposition Methods Matrix Inversion Methods Mean Variance Framework for Optimal Asset Allocation Future Work Publications 50
  • Slide 51
  • Asset Allocation. Asset allocation is the core part of portfolio management: an investor can minimize the risk of loss and maximize the return of a portfolio by diversifying its assets. Determining the best allocation requires solving a constrained optimization problem: Markowitz's mean variance framework.
  • Slide 52
  • Asset Allocation. Increasing the number of assets provides significantly more efficient allocations.
  • Slide 53
  • High Performance Computing. A higher number of assets and more complex diversification require significant computation. Adding FPGAs to existing high performance computers can boost application performance and design flexibility. Related FPGA work: Zhang et al. and Morris et al. (single option pricing); Kaganov et al. (credit derivative pricing); Thomas et al. (interest rate and value-at-risk simulations). We are the first to propose hardware acceleration of the mean variance framework using FPGAs. FPGA Acceleration of Mean Variance Framework for Optimum Asset Allocation, Ali Irturk, Bridget Benson, Nikolay Laptev and Ryan Kastner, In Proceedings of the Workshop on High Performance Computational Finance at SC08 International Conference for High Performance Computing, Networking, Storage and Analysis, November 2008.
  • Slide 54
  • The Mean Variance Framework. [Overview diagram: (1) computation of the required inputs, a 5-phase stage producing the expected prices E{M} and expected covariance Cov{M}; (2) computation of the efficient frontier (expected return versus standard deviation / risk) over the candidate allocations; (3) computation of the optimal allocation, the highest-utility portfolio on the frontier.]
  • Slide 57
  • Hardware Architecture for MVF Step 2. [A random number generator feeds a Monte Carlo block with scenario vectors; for each scenario, the objective value is the product of the allocation vector [a_1, a_2, ..., a_Ns] and the market vector M, which requires N_s multiplications. The resulting expected return and standard deviation (risk) are used to ask: is this allocation the best?]
  • Slide 58
  • Hardware Architecture for MVF Step 2 58
  • Slide 59
  • Hardware Architecture for MVF Step 2 59
  • Slide 60
  • Hardware Architecture for MVF Step 2. Parallelism: N_s multipliers in parallel, N_m Monte Carlo blocks in parallel, N_m utility calculation blocks in parallel, and N_p satisfaction function calculation blocks in parallel.
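    A software model of what one satisfaction block computes might look like the sketch below. This is our simplification, not the exact hardware: it assumes an exponential utility, a certainty-equivalent satisfaction index, and a stand-in Gaussian market model; all names and sizes are illustrative.

        % Sketch: score one candidate allocation by Monte Carlo, as in Step 2.
        Ns = 50;  Nscen = 100000;  zeta = 10;           % securities, scenarios, risk parameter
        alpha = rand(Ns,1); alpha = alpha / sum(alpha); % candidate allocation (sums to 1)
        mu = 0.05*ones(Ns,1); Sigma = 0.1*eye(Ns);      % stand-in market model
        M = repmat(mu,1,Nscen) + chol(Sigma,'lower')*randn(Ns,Nscen);  % market vectors
        psi = alpha' * M;                               % objective value per scenario (Ns mults each)
        u = -exp(-psi/zeta);                            % exponential utility
        satisfaction = -zeta*log(-mean(u));             % certainty-equivalent satisfaction index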
  • Slide 61
  • Results, Mean Variance Framework Step 2 (1,000 runs, 100,000 scenarios, 50 portfolios). 10 satisfaction blocks (1 Monte Carlo block with 10 multipliers and 10 utility function calculator blocks): 151 - 221. 10 satisfaction blocks (1 Monte Carlo block with 20 multipliers and 20 utility function calculator blocks): 302 - 442. FPGA Acceleration of Mean Variance Framework for Optimum Asset Allocation, Ali Irturk, Bridget Benson, Nikolay Laptev and Ryan Kastner, In Proceedings of the Workshop on High Performance Computational Finance at SC08 International Conference for High Performance Computing, Networking, Storage and Analysis, November 2008.
  • Slide 62
  • Outline Motivation GUSTO: Design Tool and Methodology Applications Matrix Decomposition Methods Matrix Inversion Methods Mean Variance Framework for Optimal Asset Allocation Future Work Publications 62
  • Slide 63
  • Thesis Outline and Future Work. 1. Introduction. 2. Comparison of FPGAs, GPUs and CELLs: possible journal paper; GPU implementation of face recognition for a journal paper. 3. GUSTO Fundamentals. 4. Super GUSTO: journal paper on hierarchical design and heterogeneous core design; employing different instruction scheduling algorithms and analyzing their effects on the implemented architectures. 5. Small-code applications of GUSTO: matrix decomposition cores (QR, LU, Cholesky) with different architectural choices; matrix inversion cores (analytic, QR, LU, Cholesky) with different architectural choices; design of an adaptive weight calculation core. 6. Large-code applications using GUSTO: Mean Variance Framework Step 2 implementation; short preamble processing unit implementation; optical flow computation algorithm implementation. 7. Conclusions. 8. Future Work. 9. References.
  • Slide 64
  • Outline Motivation GUSTO: Design Tool and Methodology Applications Matrix Decomposition Methods Matrix Inversion Methods Mean Variance Framework for Optimal Asset Allocation Future Work Publications 64
  • Slide 65
  • Publications [15] An Optimized Algorithm for Leakage Power Reduction of Embedded Memories on FPGAs Through Location Assignments, Shahnam Mirzaei, Yan Meng, Arash Arfaee, Ali Irturk, Timothy Sherwood, Ryan Kastner, working paper for IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems. [14] Xquasher: A Tool for Efficient Computation of Multiple Linear Expressions, Arash Arfaee, Ali Irturk, Ryan Kastner, Farzan Fallah, under review, Design Automation Conference (DAC 2009), July 2009. [13] Hardware Implementation Trade-offs of Matrix Computation Architectures using Hierarchical Datapaths, Ali Irturk, Nikolay Laptev and Ryan Kastner, under review, Design Automation Conference (DAC 2009), July 2009. [12] Energy Benefits of Reconfigurable Hardware for use in Underwater Sensor Nets, Bridget Benson, Ali Irturk, Junguk Cho, Ryan Kastner, under review, 16th Reconfigurable Architectures Workshop (RAW 2009), May 2009. [11] Architectural Optimization of Decomposition Algorithms for Wireless Communication Systems, Ali Irturk, Bridget Benson, Nikolay Laptev and Ryan Kastner, In Proceedings of the IEEE Wireless Communications and Networking Conference (WCNC 2009), April 2009. [10] FPGA Acceleration of Mean Variance Framework for Optimum Asset Allocation, Ali Irturk, Bridget Benson, Nikolay Laptev and Ryan Kastner, In Proceedings of the Workshop on High Performance Computational Finance at SC08 International Conference for High Performance Computing, Networking, Storage and Analysis, November 2008. [9] GUSTO: An Automatic Generation and Optimization Tool for Matrix Inversion Architectures, Ali Irturk, Bridget Benson, Shahnam Mirzaei and Ryan Kastner, under review (2nd round of reviews), Transactions on Embedded Computing Systems. [8] Automatic Generation of Decomposition based Matrix Inversion Architectures, Ali Irturk, Bridget Benson and Ryan Kastner, In Proceedings of the IEEE International Conference on Field-Programmable Technology (ICFPT), December 2008.
  • Slide 66
  • Publications [7] Survey of Hardware Platforms for an Energy Efficient Implementation of Matching Pursuits Algorithm for Shallow Water Networks, Bridget Benson, Ali Irturk, Junguk Cho, and Ryan Kastner, In Proceedings of the Third ACM International Workshop on UnderWater Networks (WUWNet), in conjunction with ACM MobiCom 2008, September 2008. [6] Design Space Exploration of a Cooperative MIMO Receiver for Reconfigurable Architectures, Shahnam Mirzaei, Ali Irturk, Ryan Kastner, Brad T. Weals and Richard E. Cagley, In Proceedings of the IEEE International Conference on Application-specific Systems, Architectures and Processors (ASAP), July 2008. [5] An FPGA Design Space Exploration Tool for Matrix Inversion Architectures, Ali Irturk, Bridget Benson, Shahnam Mirzaei and Ryan Kastner, In Proceedings of the IEEE Symposium on Application Specific Processors (SASP), June 2008. [4] An Optimization Methodology for Matrix Computation Architectures, Ali Irturk, Bridget Benson, and Ryan Kastner, Unsubmitted Manuscript. [3] FPGA Implementation of Adaptive Weight Calculation Core Using QRD-RLS Algorithm, Ali Irturk, Shahnam Mirzaei and Ryan Kastner, Unsubmitted Manuscript. [2] An Efficient FPGA Implementation of Scalable Matrix Inversion Core using QR Decomposition, Ali Irturk, Shahnam Mirzaei and Ryan Kastner, Unsubmitted Manuscript. [1] Implementation of QR Decomposition Algorithms using FPGAs, Ali Irturk, MS Thesis, Department of Electrical and Computer Engineering, University of California, Santa Barbara, June 2007. Advisor: Ryan Kastner.
  • Slide 67
  • Thank You 67 [email protected]
  • Slide 68
  • Matrix Inversion. Use decomposition methods for analytic simplicity and computational convenience. Decomposition methods: QR, LU, Cholesky, etc.; the alternative is the analytic method.
  • Slide 69
  • Matrix Inversion using QR Decomposition. The given matrix A is factored as A = Q*R, with Q an orthogonal matrix and R an upper triangular matrix; the inverse then follows as A^-1 = R^-1 * Q'.
  • Slide 70
  • Matrix Inversion using QR Decomposition. Three different QR decomposition methods: Gram-Schmidt orthogonalization, Givens rotations, and Householder reflections. [Diagram: the columns of the matrix, the entry at the intersection of the i-th row and the j-th column, the memory layout, and the Euclidean norm computation.]
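    A compact classical Gram-Schmidt sketch is shown below for illustration only; the hardware implementation in the thesis may use a different variant or ordering of these operations.

        % Sketch: classical Gram-Schmidt QR decomposition, A = Q*R.
        A = rand(4);
        [n, ~] = size(A);
        Q = zeros(n); R = zeros(n);
        for j = 1:n
          v = A(:,j);
          for i = 1:j-1
            R(i,j) = Q(:,i)' * A(:,j);     % projection coefficient
            v = v - R(i,j) * Q(:,i);       % remove the component along q_i
          end
          R(j,j) = norm(v);                % Euclidean norm (as on the slide)
          Q(:,j) = v / R(j,j);
        end
        norm(A - Q*R)                      % should be close to zero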
  • Slide 71
  • Matrix Inversion using the Analytic Method. The analytic method uses the adjoint matrix Adj(A) and the determinant of the given matrix: A^-1 = Adj(A) / det(A). [The slide also shows the determinant of a 2x2 matrix: det([a b; c d]) = a*d - b*c.]
  • Slide 72
  • Adjoint Matrix Calculation, Cofactor Calculation Core. [Diagram of the multiply/subtract tree that computes one cofactor of a 4x4 matrix, e.g. C11 = A22*(A33*A44 - A34*A43) - A23*(A32*A44 - A34*A42) + A24*(A32*A43 - A33*A42).]
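    As a software reference for the analytic method, the sketch below computes the full cofactor matrix of a 4x4 matrix from 3x3 minors and forms the inverse as Adj(A)/det(A); it is a check of the math above, not the cofactor-core hardware.

        % Sketch: analytic inversion of a 4x4 matrix via cofactors and the adjoint.
        A = rand(4);
        C = zeros(4);
        for i = 1:4
          for j = 1:4
            minor = A; minor(i,:) = []; minor(:,j) = [];   % delete row i and column j
            C(i,j) = (-1)^(i+j) * det(minor);              % cofactor C_ij
          end
        end
        adjA = C';                  % adjoint (adjugate) = transpose of the cofactor matrix
        Ainv = adjA / det(A);
        norm(Ainv - inv(A))         % should be close to zero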
  • Slide 73
  • Different Implementations of the Analytic Approach. [Diagrams of Implementations A, B and C of the analytic approach, built from the cofactor calculation core.]
  • Slide 74
  • Matrix Inversion using LU Decomposition. The given matrix A is factored as A = L*U, with L a lower triangular and U an upper triangular matrix.
  • Slide 75
  • Matrix Inversion using LU Decomposition. [Dataflow diagram of the computation, steps 1-9.]
  • Slide 76
  • Matrix Inversion using LU Decomposition. [Dataflow diagram, continued.]
  • Slide 77
  • Matrix Inversion using Cholesky Decomposition. The given matrix A is factored as A = G*G', with G a unique lower triangular matrix (the Cholesky triangle) and G' its transpose.
  • Slide 78
  • Matrix Inversion using Cholesky Decomposition. [Dataflow diagram of the computation, steps 1-7.]
  • Slide 79
  • Matrix Inversion using Cholesky Decomposition. [Dataflow diagram, continued.]
  • Slide 80
  • Matrix Inversion using Cholesky Decomposition. [Dataflow diagram, continued.]
  • Slide 81
  • The Mean Variance Framework. [Overview diagram repeated from Slide 54: computation of the required inputs (5 phases, producing E{M} and Cov{M}), computation of the efficient frontier, and computation of the optimal allocation (the highest-utility portfolio).]
  • Slide 82
  • The Mean Variance Framework. [Overview diagram with the three stages numbered: (1) computation of the required inputs (5 phases), (2) computation of the efficient frontier from E{M}, Cov{M} and the candidate allocations, (3) computation of the optimal allocation.]
  • Slide 83
  • Computation of Required Inputs. Known data: publicly available prices and covariance, the number of securities, the reference allocation, the horizon, and the investor objective; also the time the investment is made, the investment horizon, and the estimation interval. The five phases: 1) detect the invariants; 2) determine the distribution of the invariants; 3) project the invariants to the investment horizon; 4) compute the expected return and the covariance matrix; 5) compute the expected return and the covariance matrix of the market vector, giving E{M} and Cov{M}.
  • Slide 84
  • Computation of Required Inputs. Investor objectives: absolute wealth, relative wealth, or net profits. The objective value is the product of the allocation vector [a_1, a_2, ..., a_Ns] and the market vector M.
  • Slide 85
  • Computation of Required Inputs, Phase 5. The market vector M is a transformation of the market prices at the investment horizon: M = a + B*P_(T+tau). Standard investor objectives, in generalized form: absolute wealth (a = 0, B = I_N, so M = P_(T+tau)); relative wealth (a = 0, B = K, so M = K*P_(T+tau)); net profits (a = -p_T, B = I_N, so M = P_(T+tau) - p_T). The specific forms express each objective through the horizon wealth W_(T+tau) of the allocation.
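    The three objectives differ only in the constants a and B of M = a + B*P_(T+tau); the small sketch below builds all three for one price scenario. Symbols follow the reconstruction above, and the matrix K is a stand-in value, since its exact definition is not given on the slide.

        % Sketch: build the market vector M = a + B*P_horizon for the three objectives.
        Ns = 5;
        p_T       = rand(Ns,1) * 100;        % current prices
        P_horizon = rand(Ns,1) * 100;        % prices at the investment horizon (one scenario)
        K         = eye(Ns) / sum(p_T);      % stand-in for the relative-wealth matrix K
        M_absolute = P_horizon;              % a = 0,    B = I_N
        M_relative = K * P_horizon;          % a = 0,    B = K
        M_net      = P_horizon - p_T;        % a = -p_T, B = I_N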
  • Slide 86
  • Computation of Required Inputs. Each step requires making assumptions: the invariants, the distribution of the invariants, and the estimation interval. Our assumptions: compounded returns of stocks as the market invariants, 3 years of known data, a 1-week estimation interval, and a 1-year horizon. Phase 5 is a good candidate for hardware implementation.
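    Under these assumptions, phases 1-4 amount to estimating weekly compounded returns and projecting them to the one-year horizon. A rough sketch follows, assuming (our simplification) i.i.d. weekly invariants so that their mean and covariance scale linearly with the number of weeks; the price data are synthetic.

        % Sketch: estimate weekly invariants and project them to a 1-year horizon.
        prices = cumprod(1 + 0.002 + 0.03*randn(156, 5)); % 3 years of weekly prices, 5 stocks
        X      = diff(log(prices));                       % weekly compounded returns (invariants)
        mu_w   = mean(X)';   Sigma_w = cov(X);            % weekly mean and covariance
        weeks  = 52;                                      % 1-year investment horizon
        mu_h    = weeks * mu_w;                           % projected mean of the invariants
        Sigma_h = weeks * Sigma_w;                        % projected covariance of the invariants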
  • Slide 87
  • The Mean Variance Framework. [Overview diagram repeated; the computation of the efficient frontier is considered next.]
  • Slide 88
  • MVF Step 1: Computation of the Efficient Frontier. Inputs: E{M}, Cov{M}, current prices, the number of portfolios, the number of securities, and the budget. For each target variance v >= 0, the frontier allocation maximizes the expected objective subject to the constraints: a(v) = argmax of a'*E{M} over allocations a that satisfy the constraints and a'*Cov{M}*a = v. The result is the efficient frontier in the expected return versus standard deviation (risk) plane.
  • Slide 89
  • MVF Step 1: Computation of the Efficient Frontier. a(v) = argmax of a'*E{M} subject to the constraints and a'*Cov{M}*a = v, for v >= 0. [Plot: expected return versus standard deviation (risk); the region above the efficient frontier is unachievable, and an investor does NOT want to be below the frontier!]
  • Slide 90
  • The Mean Variance Framework. [Overview diagram repeated; the computation of the optimal allocation is considered next.]
  • Slide 91
  • MVF Step 2: Computing the Optimal Allocation. Determination of the highest-utility portfolio. Inputs: current prices, the number of securities, the number of portfolios, the number of scenarios, and the satisfaction index; output: the optimal allocation. For each candidate allocation on the frontier (expected return versus standard deviation / risk), the question is: is this allocation the best?
  • Slide 92
  • MVF Step 2: Computing the Optimal Allocation. Satisfaction indices represent all the features of a given allocation with one single number and quantify the investor's satisfaction. Satisfaction indices: certainty-equivalent, quantile, and coherent indices. Certainty-equivalent satisfaction indices are represented by the investor's utility function and objective, u(.); we use the Hyperbolic Absolute Risk Aversion (HARA) class of utility functions. Utility functions: exponential, quadratic, power, logarithmic, linear.
  • Slide 93
  • MVF Step 2: Computing the Optimal Allocation. The Hyperbolic Absolute Risk Aversion (HARA) class of utility functions consists of specific forms of the Arrow-Pratt risk aversion model, A(psi) = psi / (gamma*psi^2 + beta*psi + alpha), here with alpha = 0. Utility functions in this class: exponential utility u(psi) = -exp(-psi/zeta); quadratic utility u(psi) = psi - psi^2/(2*zeta); power utility u(psi) = psi^(1 - 1/gamma); logarithmic utility u(psi) = ln(psi) (the limit gamma -> 1); linear utility u(psi) = psi (the limit gamma -> infinity).
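    The utility functions above can be written as anonymous functions; the forms follow the standard HARA parameterization assumed in the reconstruction above, and the parameter names (zeta, gamma) are illustrative.

        % Sketch: HARA-class utility functions of the objective value psi.
        zeta = 10;  gamma = 2;
        u_exp  = @(psi) -exp(-psi / zeta);          % exponential utility
        u_quad = @(psi) psi - psi.^2 / (2*zeta);    % quadratic utility
        u_pow  = @(psi) psi.^(1 - 1/gamma);         % power utility (psi > 0)
        u_log  = @(psi) log(psi);                   % logarithmic utility (limit of power)
        u_lin  = @(psi) psi;                        % linear utility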
  • Slide 94
  • Identification of Bottlenecks. In terms of computational time, the most important variables are the number of securities, the number of portfolios, and the number of scenarios. [Overview diagram of the three framework stages, as on Slide 54.]
  • Slide 95
  • Identification of Bottlenecks. The number of securities dominates computation time over the number of portfolios (number of scenarios = 100,000).
  • Slide 96
  • Identification of Bottlenecks. The number of portfolios dominates computation time over the number of scenarios (number of securities = 100).
  • Slide 97
  • Identification of Bottlenecks. [Runtime chart with number of portfolios = 100 and number of scenarios = 100,000.]
  • Slide 98
  • Identification of Bottlenecks. [Runtime chart with number of scenarios = 100,000 and number of securities = 100.]
  • Slide 99
  • Identification of Bottlenecks. [Runtime chart with number of portfolios = 100 and number of securities = 100.]
  • Slide 100
  • The Mean Variance Framework. [Overview diagram repeated; the computation of the required inputs (Phase 5 hardware) is considered next.]
  • Slide 101
  • Generation of Required Inputs, Phase 5. [Market Vector Calculator IP core: built from divider, subtractor and multiplexer building blocks plus a K building block, it computes M = P_(T+tau), K*P_(T+tau), or P_(T+tau) - p_T according to the selected objective. Control inputs: cntrl_a = 0, cntrl_b = 0 for absolute wealth; cntrl_a = 1, cntrl_b = 0 for relative wealth; cntrl_a = 0, cntrl_b = 1 for net profits.]
  • Slide 102
  • Generation of Required Inputs, Phase 5. [Market vector calculator datapath, built up incrementally over the next slides.]
  • Slide 103
  • Generation of Required Inputs, Phase 5. [Datapath, continued.]
  • Slide 104
  • Generation of Required Inputs, Phase 5. [Datapath, continued: the cntrl_a multiplexer is added.]
  • Slide 105
  • Generation of Required Inputs, Phase 5. [Datapath, continued: the cntrl_b multiplexer and the subtraction of p_T are added.]
  • Slide 106
  • Generation of Required Inputs Phase 5 106
  • Slide 107
  • The Mean Variance Framework. [Overview diagram repeated; the computation of the efficient frontier (Step 1 hardware) is considered next.]
  • Slide 108
  • Hardware Architecture for MVF Step 1. a(v) = argmax of a'*E{M} subject to the constraints and a'*Cov{M}*a = v, for v >= 0. A popular approach to solving such constrained maximization problems is the Lagrangian multiplier method.
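    With only a budget constraint (sum of the weights equal to 1), the Lagrangian solution has a well-known closed form; the sketch below traces a frontier by sweeping a risk-aversion weight. This is a textbook simplification of the constrained problem on the slide, with synthetic inputs standing in for E{M} and Cov{M}.

        % Sketch: mean-variance efficient frontier via Lagrange multipliers
        % (budget constraint only: sum(alpha) = 1).
        Ns = 5;
        mu    = 0.02 + 0.08*rand(Ns,1);             % expected returns, stand-in for E{M}
        Sigma = cov(randn(200, Ns)) + 0.01*eye(Ns); % covariance, stand-in for Cov{M}
        one   = ones(Ns,1);
        frontier = zeros(50, 2);                    % [risk, expected return] per portfolio
        for k = 1:50
          lambda = (k-1)/10;                        % trade-off weight between return and risk
          % maximize lambda*alpha'*mu - 0.5*alpha'*Sigma*alpha  s.t.  one'*alpha = 1
          w  = Sigma \ (lambda*mu);
          w0 = Sigma \ one;
          alpha = w + w0 * (1 - one'*w) / (one'*w0);    % enforce the budget constraint
          frontier(k,:) = [sqrt(alpha'*Sigma*alpha), alpha'*mu];
        end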
  • Slide 109
  • Hardware Architecture for MVF Step 1. For each given risk value, a number of functions equal to the number of securities must be computed to determine the efficient allocation.
  • Slide 110
  • Hardware Architecture for MVF Step 1. [Diagram: N_p parallel cores, one per portfolio; each core computes the N_s allocation weights a_1 ... a_Ns for one frontier point from its risk value v, E{M} and Cov{M}.]
  • Slide 111
  • The Mean Variance Framework. [Overview diagram repeated; the computation of the optimal allocation (Step 2 hardware) is considered next.]
  • Slide 112
  • Hardware Architecture for MVF Step 2. [A random number generator feeds the Monte Carlo block with scenario vectors; computing the objective value for one scenario requires N_s multiplications. The utility calculation uses, for example, the exponential utility u(psi) = -exp(-psi/zeta) or the quadratic utility u(psi) = psi - psi^2/(2*zeta).]
  • Slide 113
  • Hardware Architecture for MVF Step 2 113
  • Slide 114
  • Hardware Architecture for MVF Step 2 114
  • Slide 115
  • Hardware Architecture for MVF Step 2. Parallelism: N_s multipliers in parallel, N_m Monte Carlo blocks in parallel, N_m utility calculation blocks in parallel, and N_p satisfaction function calculation blocks in parallel.
  • Slide 116
  • Results, Generation of Required Inputs, Phase 5 (1,000 runs). With N_s arithmetic resources in parallel: 6 - 9.6; 629 (for 50 securities).
  • Slide 117
  • Results, Mean Variance Framework Step 2 (1,000 runs, 100,000 scenarios, 50 portfolios). 10 satisfaction blocks (1 Monte Carlo block with 10 multipliers and 10 utility function calculator blocks): 151 - 221. 10 satisfaction blocks (1 Monte Carlo block with 20 multipliers and 20 utility function calculator blocks): 302 - 442.
  • Slide 118
  • Conclusion. The Mean Variance Framework's inherent parallelism makes it an ideal candidate for an FPGA implementation; we are bound by hardware resources rather than by the parallelism the framework offers. However, there are many different architectural choices for implementing the framework's steps.