GUSTO: General architecture design Utility and Synthesis Tool for Optimization
Qualifying Exam for Ali Irturk, University of California, San Diego


Transcript

  • Slide 1
  • GUSTO: General architecture design Utility and Synthesis Tool for Optimization. Qualifying Exam for Ali Irturk, University of California, San Diego.
  • Slide 2
  • Thesis Objective: design of a novel tool, GUSTO, for automatic generation and optimization of application-specific matrix computation architectures from a given MATLAB algorithm, and demonstration of the tool's effectiveness through rapid architectural production of various signal processing, computer vision and financial computation algorithms.
  • Slide 3
  • Motivation: matrix computations lie at the heart of most scientific computational tasks (wireless communication, financial computation, computer vision). Matrix inversion is required in equalization algorithms to remove the effect of the channel on the signal, in the mean variance framework to solve a constrained maximization problem, and in the optical flow computation algorithm for motion estimation. (QRD, A^-1)
  • Slide 4
  • Motivation: a number of tools translate MATLAB algorithms to a hardware description language; however, we believe the majority of these tools take the wrong approach. We take a more focused approach, developing a tool that specifically targets matrix computation algorithms.
  • Slide 5
  • Computing Platforms. ASICs offer exceptional performance at the cost of long time to market and substantial costs; DSPs, GPUs and the CELL BE offer ease of development and fast time to market but lower performance; FPGAs combine ease of development and fast time to market with ASIC-like performance.
  • Slide 6
  • Field Programmable Gate Arrays. FPGAs are ideal platforms: high processing power, flexibility, and low non-recurring engineering (NRE) cost. Used properly, these features enhance performance and throughput significantly. BUT: few tools exist that can aid the designer with the many system, architectural and logic design choices.
  • Slide 7
  • GUSTO (General architecture design Utility and Synthesis Tool for Optimization): an easy-to-use tool for more efficient design space exploration and development. Inputs: the algorithm, matrix dimensions, bit width, resource allocation and mode; output: the required HDL files. GUSTO: An Automatic Generation and Optimization Tool for Matrix Inversion Architectures, Ali Irturk, Bridget Benson, Shahnam Mirzaei and Ryan Kastner, under review, Transactions on Embedded Computing Systems.
  • Slide 8
  • Outline Motivation GUSTO: Design Tool and Methodology Applications Matrix Decomposition Methods Matrix Inversion Methods Mean Variance Framework for Optimal Asset Allocation Future Work Publications 8
  • Slide 9
  • GUSTO Design Flow. Inputs: the algorithm, matrix dimensions, type and number of arithmetic resources, and data representation (drawn from a design library of adders, subtractors, multipliers and dividers). Steps: algorithm analysis, instruction generation, resource allocation and error analysis, followed by architecture generation. Mode 1 produces a general purpose (dynamic) architecture; Mode 2 applies resource trimming and scheduling to produce an application specific (static) architecture. Both are synthesized and simulated with Xilinx and Mentor Graphics tools to obtain area, latency, throughput and simulation results.
  • Slide 10
  • GUSTO Modes. Mode 1 generates a general purpose architecture and its datapath (instruction controller, memory controller, and arithmetic units such as adders and multipliers); it can be used to explore other algorithms, but it does not lead to high-performance results. Mode 2 creates a scheduled, static, application specific architecture: it simulates the Mode 1 architecture to collect scheduling information and to determine the usage of resources.
  • Slide 11
  • Matrix Multiplication Core Design. [GUSTO design flow diagram, as on Slide 9; the following slides walk through it step by step.]
  • Slide 12
  • Matrix Multiplication Core Design. [GUSTO design flow diagram, repeated; next step: Algorithm Analysis.]
  • Slide 13
  • Matrix Multiplication Core Design, Algorithm Analysis. The built-in function C = A * B is expanded into explicit loops:
        for i = 1:n
          for j = 1:n
            for k = 1:n
              Temp = A(i,k) * B(k,j);
              C(i,j) = C(i,j) + Temp;
            end
          end
        end
  • Slide 14
  • Matrix Multiplication Core Design. [GUSTO design flow diagram, repeated; next step: Instruction Generation.]
  • Slide 15
  • Matrix Multiplication Core Design, Instruction Generation. Each instruction has the form [operation, destination, operand 1, operand 2], for example:
        C(1,1) = A(1,1) * B(1,1)   ->   [mul, C(1,1), A(1,1), B(1,1)]
        Temp   = A(1,2) * B(2,1)   ->   [mul, temp,   A(1,2), B(2,1)]
        C(1,1) = C(1,1) + Temp     ->   [add, C(1,1), C(1,1), temp ]
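    As an illustration of this step, the sketch below (not GUSTO's actual code; all variable names are hypothetical) unrolls the triple loop of an n x n matrix multiplication into a flat list of {operation, destination, operand 1, operand 2} instructions of the kind shown above.

        % Sketch: unroll C = A*B (n x n) into [op, dest, src1, src2] instructions.
        n = 2;
        instr = {};                          % each row: {op, dest, src1, src2}
        for i = 1:n
          for j = 1:n
            for k = 1:n
              src1 = sprintf('A(%d,%d)', i, k);
              src2 = sprintf('B(%d,%d)', k, j);
              if k == 1
                % first product writes C(i,j) directly
                instr(end+1,:) = {'mul', sprintf('C(%d,%d)', i, j), src1, src2};
              else
                % later products go to a temporary, then accumulate
                instr(end+1,:) = {'mul', 'temp', src1, src2};
                instr(end+1,:) = {'add', sprintf('C(%d,%d)', i, j), ...
                                  sprintf('C(%d,%d)', i, j), 'temp'};
              end
            end
          end
        end
        disp(instr)                          % e.g. 'mul' 'C(1,1)' 'A(1,1)' 'B(1,1)'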
  • Slide 16
  • Matrix Multiplication Core Design. [Block diagram: the generated instructions drive an instruction controller, a memory controller, and arithmetic units (adders and multipliers).]
  • Slide 17
  • Matrix Multiplication Core Design. [GUSTO design flow diagram, repeated; next step: Resource Allocation.]
  • Slide 18
  • Matrix Multiplication Core Design. [Block diagram, repeated: the resource allocation step sets the number of arithmetic units (adders and multipliers).]
  • Slide 19
  • Matrix Multiplication Core Design. [GUSTO design flow diagram, repeated; next step: Error Analysis.]
  • Slide 20
  • Matrix Multiplication Core Design, Error Analysis. GUSTO's fixed-point arithmetic results (with a user-defined, variable bit width) are compared against MATLAB's floating-point results (single/double precision) on user-defined input data. Error analysis metrics: 1) mean error, 2) peak error, 3) standard deviation of the error, 4) mean percentage error.
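    A minimal sketch of how these four metrics could be computed, assuming the fixed-point results are emulated in software by rounding to a hypothetical fractional bit width; GUSTO's own error-analysis code is not shown in the slides.

        % Sketch: compare emulated fixed-point results against double-precision results.
        n = 4;  frac_bits = 12;                        % hypothetical bit width
        A = rand(n); B = rand(n);
        C_ref = A * B;                                 % double-precision reference
        q = @(x) round(x * 2^frac_bits) / 2^frac_bits; % crude fixed-point emulation
        C_fix = q(q(A) * q(B));
        err = C_fix(:) - C_ref(:);
        mean_err = mean(err);
        peak_err = max(abs(err));
        std_err  = std(err);
        mean_pct = mean(abs(err ./ C_ref(:))) * 100;   % mean percentage error
        fprintf('mean %.3g  peak %.3g  std %.3g  pct %.3g%%\n', ...
                mean_err, peak_err, std_err, mean_pct);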
  • Slide 21
  • Matrix Multiplication Core Design Error Analysis 21
  • Slide 22
  • Matrix Multiplication Core Design. [GUSTO design flow diagram, repeated; next step: Architecture Generation (Mode 1).]
  • Slide 23
  • Matrix Multiplication Core Design, Architecture Generation (Mode 1): a general purpose architecture with dynamic scheduling, dynamic memory assignments and full connectivity between the instruction controller, memory controller and arithmetic units.
  • Slide 24
  • Matrix Multiplication Core Design. [GUSTO design flow diagram, repeated; next step: Architecture Generation (Mode 2).]
  • Slide 25
  • Matrix Multiplication Core Design, Architecture Generation (Mode 2): an application specific architecture with static scheduling, static memory assignments and only the required connectivity.
  • Slide 26
  • GUSTO Trimming Feature. [Diagram: arithmetic unit A (inputs In_A1, In_A2; output Out_A), arithmetic unit B (inputs In_B1, In_B2; output Out_B) and a memory (input In_mem1; outputs Out_mem1, Out_mem2). Simulation runs record which of the possible sources (Out_A, Out_B, Out_mem1, Out_mem2) actually drive In_A1 and In_A2, and the unused connections to unit A are trimmed.]
  • Slide 27
  • GUSTO Trimming Feature. [Same diagram: the simulation runs likewise determine which sources drive In_B1 and In_B2, and the unused connections to unit B are trimmed.]
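    The sketch below illustrates the idea behind trimming under our own simplification (not the tool's implementation): connectivity is modelled as a source-to-destination matrix, the schedule is "simulated" as a trace of exercised connections, and anything never exercised can be removed from the generated architecture.

        % Sketch: mark which source->destination connections a schedule actually uses.
        srcs = {'Out_A','Out_B','Out_mem1','Out_mem2'};
        dsts = {'In_A1','In_A2','In_B1','In_B2','In_mem1'};
        used = false(numel(srcs), numel(dsts));        % start from full connectivity
        % hypothetical trace of (source, destination) pairs seen during simulation
        trace = {'Out_mem1','In_A1'; 'Out_mem2','In_A2'; 'Out_A','In_mem1'};
        for t = 1:size(trace,1)
          s = strcmp(srcs, trace{t,1});
          d = strcmp(dsts, trace{t,2});
          used(s, d) = true;                           % this connection is exercised
        end
        % connections never exercised can be trimmed from the generated architecture
        fprintf('%d of %d connections can be trimmed\n', nnz(~used), numel(used));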
  • Slide 28
  • Matrix Multiplication Core Results. [Block diagrams and an area (slices) versus throughput chart for three single-core designs, each with an instruction controller and a memory controller: Design 1 with 4 adders and 4 multipliers, Design 2 with 2 adders and 4 multipliers, Design 3 with 2 adders and 2 multipliers.] Hardware Implementation Trade-offs of Matrix Computation Architectures using Hierarchical Datapaths, Ali Irturk, Nikolay Laptev and Ryan Kastner, under review, Design Automation Conference (DAC 2009).
  • Slide 29
  • Hierarchical Datapaths. Unfortunately, this flat organization does not expose the complete design space to the user for exploring better design alternatives, and it does not scale well with the complexity of the algorithms: the number of instructions, the number of functional units, and the internal storage and communication all grow, hurting optimization and performance. To overcome these issues, we incorporate hierarchical datapaths and heterogeneous architecture generation options into GUSTO.
  • Slide 30
  • Matrix Multiplication Core Results. [Core A_1 is a single core with 4 adders, 4 multipliers, an instruction controller and a memory controller. Design 4 uses one Core A_1; Design 5 uses 16 Core A_1 cores in a hierarchical datapath.] Hardware Implementation Trade-offs of Matrix Computation Architectures using Hierarchical Datapaths, Ali Irturk, Nikolay Laptev and Ryan Kastner, under review, Design Automation Conference (DAC 2009).
  • Slide 31
  • Matrix Multiplication Core Results. [As above, plus Core A_2 designs: Design 6 uses one Core A_2 and Design 7 uses 8 Core A_2 cores.] Hardware Implementation Trade-offs of Matrix Computation Architectures using Hierarchical Datapaths, Ali Irturk, Nikolay Laptev and Ryan Kastner, under review, Design Automation Conference (DAC 2009).
  • Slide 32
  • Matrix Multiplication Core Results. [As above, plus Core A_4 designs: Design 8 uses one Core A_4 and Design 9 uses 4 Core A_4 cores.] Hardware Implementation Trade-offs of Matrix Computation Architectures using Hierarchical Datapaths, Ali Irturk, Nikolay Laptev and Ryan Kastner, under review, Design Automation Conference (DAC 2009).
  • Slide 33
  • Matrix Multiplication Core Results. [Area (slices) versus throughput chart comparing Designs 1-9: the single-core designs and the homogeneous hierarchical architectures built from 16 Core A_1, 8 Core A_2 or 4 Core A_4 cores.] Hardware Implementation Trade-offs of Matrix Computation Architectures using Hierarchical Datapaths, Ali Irturk, Nikolay Laptev and Ryan Kastner, under review, Design Automation Conference (DAC 2009).
  • Slide 34
  • Matrix Multiplication Core Results. [Designs 10-12 are heterogeneous architectures that mix Core A_1, Core A_2 and Core A_4 cores; an area (slices) versus throughput chart compares them with Designs 1-9.] Hardware Implementation Trade-offs of Matrix Computation Architectures using Hierarchical Datapaths, Ali Irturk, Nikolay Laptev and Ryan Kastner, under review, Design Automation Conference (DAC 2009).
  • Slide 35
  • Outline Motivation GUSTO: Design Tool and Methodology Applications Matrix Decomposition Methods Matrix Inversion Methods Mean Variance Framework for Optimal Asset Allocation Future Work Publications 35
  • Slide 36
  • Outline Motivation GUSTO: Design Tool and Methodology Applications Matrix Decomposition Methods Matrix Inversion Methods Mean Variance Framework for Optimal Asset Allocation Future Work Publications 36
  • Slide 37
  • Matrix Decompositions: QR, LU and Cholesky. QR: the given matrix A is factored as A = Q*R, with Q an orthogonal matrix and R an upper triangular matrix. LU: A = L*U, with L a lower triangular and U an upper triangular matrix. Cholesky: A = G*G', with G a unique lower triangular matrix (the Cholesky triangle) and G' its transpose.
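    For reference, the three factorizations can be checked in a few lines of MATLAB with the built-in functions; this is only a numerical sanity check of the definitions above, not the hardware algorithm GUSTO generates.

        n = 4;
        A = rand(n);                     % general matrix for QR and LU
        S = A*A' + n*eye(n);             % symmetric positive definite matrix for Cholesky
        [Q, R]    = qr(A);               % A = Q*R,   Q orthogonal, R upper triangular
        [L, U, P] = lu(A);               % P*A = L*U, L lower / U upper triangular
        G         = chol(S, 'lower');    % S = G*G',  G lower triangular (Cholesky triangle)
        norm(A - Q*R), norm(P*A - L*U), norm(S - G*G')   % all close to zero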
  • Slide 38
  • Matrix Inversion. For a given matrix A, the inverse A^-1 satisfies A * A^-1 = I (the identity matrix). Full matrix inversion is costly!
  • Slide 39
  • Results, Inflection Point Analysis. [Chart: results as a function of matrix size.] Automatic Generation of Decomposition based Matrix Inversion Architectures, Ali Irturk, Bridget Benson and Ryan Kastner, In Proceedings of the IEEE International Conference on Field-Programmable Technology (ICFPT), December 2008.
  • Slide 40
  • Results, Inflection Point Analysis. Implementations: serial and parallel; bit widths: 16, 32 and 64 bits; matrix sizes: 2x2, 3x3, ..., 8x8.
  • Slide 41
  • Results Inflection Point Analysis: Decomposition Methods 41
  • Slide 42
  • Results Inflection Point Analysis: Matrix Inversion 42 An FPGA Design Space Exploration Tool for Matrix Inversion Architectures, Ali Irturk, Bridget Benson, Shahnam Mirzaei and Ryan Kastner, In Proceedings of the IEEE Symposium on Application Specific Processors (SASP), June 2008.
  • Slide 43
  • Results, Finding the Optimal Hardware: Decomposition Methods. Decrease in area from the general purpose architecture (Mode 1) to the application specific architecture (Mode 2): QR 94%, LU 83%, Cholesky 86%. Architectural Optimization of Decomposition Algorithms for Wireless Communication Systems, Ali Irturk, Bridget Benson, Nikolay Laptev and Ryan Kastner, In Proceedings of the IEEE Wireless Communications and Networking Conference (WCNC 2009), April 2009.
  • Slide 44
  • Results, Finding the Optimal Hardware: Decomposition Methods. Increase in throughput from the general purpose architecture (Mode 1) to the application specific architecture (Mode 2): QR 68%, LU 16%, Cholesky 14%.
  • Slide 45
  • Results, Finding the Optimal Hardware: Matrix Inversion (using QR): an average 59% decrease in area and a 3X increase in throughput.
  • Slide 46
  • Results Architectural Design Alternatives: Matrix Inversion 46
  • Slide 47
  • Results Architectural Design Alternatives: Matrix Inversion 47
  • Slide 48
  • Results, Comparison with Previously Published Work: Matrix Inversion. [Table comparing analytic-method designs (Eilert et al. and our Implementations A, B and C) with QR-based designs (Edman et al., Karkooti et al. and our method), reporting bit width (16 or 20 bits for the analytic designs, 12 or 20 bits for the QR designs), data type (fixed versus floating point), device (Virtex 2 or Virtex 4), slices, DSP48s, BRAMs and throughput (10^6 inversions/s). For the QR designs, Edman et al. use 4400 slices (0.28 x 10^6 inversions/s), Karkooti et al. 9117 slices (0.12), and our method 3584 slices (0.26).] J. Eilert, D. Wu, D. Liu, Efficient Complex Matrix Inversion for MIMO Software Defined Radio, IEEE International Symposium on Circuits and Systems (2007). F. Edman, V. Öwall, A Scalable Pipelined Complex Valued Matrix Inversion Architecture, IEEE International Symposium on Circuits and Systems (2005). M. Karkooti, J.R. Cavallaro, C. Dick, FPGA Implementation of Matrix Inversion Using QRD-RLS Algorithm, Asilomar Conference on Signals, Systems and Computers (2005).
  • Slide 49
  • Results, Comparison with Previously Published Work: Matrix Inversion. Edman et al.: 12-bit fixed point on a Virtex 2, 4400 slices, throughput 0.28 x 10^6 inversions/s. Karkooti et al.: 20-bit floating point on a Virtex 4, 9117 slices, 22 DSP48s, throughput 0.12 x 10^6 inversions/s. Our method (fixed point on a Virtex 4, 1 BRAM per design): QR 3584 slices, 12 DSP48s, 0.26 x 10^6 inversions/s; LU 2719 slices, 0.33 x 10^6; Cholesky 3682 slices, 0.25 x 10^6. F. Edman, V. Öwall, A Scalable Pipelined Complex Valued Matrix Inversion Architecture, IEEE International Symposium on Circuits and Systems (2005). M. Karkooti, J.R. Cavallaro, C. Dick, FPGA Implementation of Matrix Inversion Using QRD-RLS Algorithm, Asilomar Conference on Signals, Systems and Computers (2005).
  • Slide 50
  • Outline Motivation GUSTO: Design Tool and Methodology Applications Matrix Decomposition Methods Matrix Inversion Methods Mean Variance Framework for Optimal Asset Allocation Future Work Publications 50
  • Slide 51
  • Asset Allocation. Asset allocation is the core part of portfolio management: an investor can minimize the risk of loss and maximize the return of a portfolio by diversifying its assets. Determining the best allocation requires solving a constrained optimization problem: Markowitz's mean variance framework.
  • Slide 52
  • Asset Allocation. Increasing the number of assets provides significantly more efficient allocations.
  • Slide 53
  • High Performance Computing. A higher number of assets and more complex diversification require significant computation. Adding FPGAs to existing high performance computers can boost application performance and design flexibility. Related FPGA work: Zhang et al. and Morris et al. (single option pricing); Kaganov et al. (credit derivative pricing); Thomas et al. (interest rate and value-at-risk simulations). We are the first to propose hardware acceleration of the mean variance framework using FPGAs. FPGA Acceleration of Mean Variance Framework for Optimum Asset Allocation, Ali Irturk, Bridget Benson, Nikolay Laptev and Ryan Kastner, In Proceedings of the Workshop on High Performance Computational Finance at SC08 International Conference for High Performance Computing, Networking, Storage and Analysis, November 2008.
  • Slide 54
  • The Mean Variance Framework. [Overview diagram: (1) computation of the required inputs, a 5-phase stage producing the expected prices E{M} and expected covariance Cov{M}; (2) computation of the efficient frontier (expected return versus standard deviation / risk) over the candidate allocations; (3) computation of the optimal allocation, the highest-utility portfolio on the frontier.]
  • Slide 57
  • Hardware Architecture for MVF Step 2. [A random number generator feeds a Monte Carlo block with scenario vectors; for each scenario, the objective value is the product of the allocation vector [a_1, a_2, ..., a_Ns] and the market vector M, which requires N_s multiplications. The resulting expected return and standard deviation (risk) are used to ask: is this allocation the best?]
  • Slide 58
  • Hardware Architecture for MVF Step 2 58
  • Slide 59
  • Hardware Architecture for MVF Step 2 59
  • Slide 60
  • Hardware Architecture for MVF Step 2. Parallelism: N_s multipliers in parallel, N_m Monte Carlo blocks in parallel, N_m utility calculation blocks in parallel, and N_p satisfaction function calculation blocks in parallel.
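    A software model of what one satisfaction block computes might look like the sketch below. This is our simplification, not the exact hardware: it assumes an exponential utility, a certainty-equivalent satisfaction index, and a stand-in Gaussian market model; all names and sizes are illustrative.

        % Sketch: score one candidate allocation by Monte Carlo, as in Step 2.
        Ns = 50;  Nscen = 100000;  zeta = 10;           % securities, scenarios, risk parameter
        alpha = rand(Ns,1); alpha = alpha / sum(alpha); % candidate allocation (sums to 1)
        mu = 0.05*ones(Ns,1); Sigma = 0.1*eye(Ns);      % stand-in market model
        M = repmat(mu,1,Nscen) + chol(Sigma,'lower')*randn(Ns,Nscen);  % market vectors
        psi = alpha' * M;                               % objective value per scenario (Ns mults each)
        u = -exp(-psi/zeta);                            % exponential utility
        satisfaction = -zeta*log(-mean(u));             % certainty-equivalent satisfaction index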
  • Slide 61
  • Results, Mean Variance Framework Step 2 (1,000 runs, 100,000 scenarios, 50 portfolios). 10 satisfaction blocks (1 Monte Carlo block with 10 multipliers and 10 utility function calculator blocks): 151 - 221. 10 satisfaction blocks (1 Monte Carlo block with 20 multipliers and 20 utility function calculator blocks): 302 - 442. FPGA Acceleration of Mean Variance Framework for Optimum Asset Allocation, Ali Irturk, Bridget Benson, Nikolay Laptev and Ryan Kastner, In Proceedings of the Workshop on High Performance Computational Finance at SC08 International Conference for High Performance Computing, Networking, Storage and Analysis, November 2008.
  • Slide 62
  • Outline Motivation GUSTO: Design Tool and Methodology Applications Matrix Decomposition Methods Matrix Inversion Methods Mean Variance Framework for Optimal Asset Allocation Future Work Publications 62
  • Slide 63
  • Thesis Outline and Future Work. 1. Introduction. 2. Comparison of FPGAs, GPUs and CELLs: possible journal paper; GPU implementation of face recognition for a journal paper. 3. GUSTO Fundamentals. 4. Super GUSTO: journal paper on hierarchical design and heterogeneous core design; employing different instruction scheduling algorithms and analyzing their effects on the implemented architectures. 5. Small-code applications of GUSTO: matrix decomposition cores (QR, LU, Cholesky) with different architectural choices; matrix inversion cores (analytic, QR, LU, Cholesky) with different architectural choices; design of an adaptive weight calculation core. 6. Large-code applications using GUSTO: Mean Variance Framework Step 2 implementation; short preamble processing unit implementation; optical flow computation algorithm implementation. 7. Conclusions. 8. Future Work. 9. References.
  • Slide 64
  • Outline Motivation GUSTO: Design Tool and Methodology Applications Matrix Decomposition Methods Matrix Inversion Methods Mean Variance Framework for Optimal Asset Allocation Future Work Publications 64
  • Slide 65
  • Publications [15] An Optimized Algorithm for Leakage Power Reduction of Embedded Memories on FPGAs Through Location Assignments, Shahnam Mirzaei, Yan Meng, Arash Arfaee, Ali Irturk, Timothy Sherwood, Ryan Kastner, working paper for IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems. [14] Xquasher: A Tool for Efficient Computation of Multiple Linear Expressions, Arash Arfaee, Ali Irturk, Ryan Kastner, Farzan Fallah, under review, Design Automation Conference (DAC 2009), July 2009. [13] Hardware Implementation Trade-offs of Matrix Computation Architectures using Hierarchical Datapaths, Ali Irturk, Nikolay Laptev and Ryan Kastner, under review, Design Automation Conference (DAC 2009), July 2009. [12] Energy Benefits of Reconfigurable Hardware for use in Underwater Sensor Nets, Bridget Benson, Ali Irturk, Junguk Cho, Ryan Kastner, under review, 16th Reconfigurable Architectures Workshop (RAW 2009), May 2009. [11] Architectural Optimization of Decomposition Algorithms for Wireless Communication Systems, Ali Irturk, Bridget Benson, Nikolay Laptev and Ryan Kastner, In Proceedings of the IEEE Wireless Communications and Networking Conference (WCNC 2009), April 2009. [10] FPGA Acceleration of Mean Variance Framework for Optimum Asset Allocation, Ali Irturk, Bridget Benson, Nikolay Laptev and Ryan Kastner, In Proceedings of the Workshop on High Performance Computational Finance at SC08 International Conference for High Performance Computing, Networking, Storage and Analysis, November 2008. [9] GUSTO: An Automatic Generation and Optimization Tool for Matrix Inversion Architectures, Ali Irturk, Bridget Benson, Shahnam Mirzaei and Ryan Kastner, under review (2nd round of reviews), Transactions on Embedded Computing Systems. [8] Automatic Generation of Decomposition based Matrix Inversion Architectures, Ali Irturk, Bridget Benson and Ryan Kastner, In Proceedings of the IEEE International Conference on Field-Programmable Technology (ICFPT), December 2008.
  • Slide 66
  • Publications [7] Survey of Hardware Platforms for an Energy Efficient Implementation of Matching Pursuits Algorithm for Shallow Water Networks, Bridget Benson, Ali Irturk, Junguk Cho, and Ryan Kastner, In Proceedings of the Third ACM International Workshop on UnderWater Networks (WUWNet), in conjunction with ACM MobiCom 2008, September 2008. [6] Design Space Exploration of a Cooperative MIMO Receiver for Reconfigurable Architectures, Shahnam Mirzaei, Ali Irturk, Ryan Kastner, Brad T. Weals and Richard E. Cagley, In Proceedings of the IEEE International Conference on Application-specific Systems, Architectures and Processors (ASAP), July 2008. [5] An FPGA Design Space Exploration Tool for Matrix Inversion Architectures, Ali Irturk, Bridget Benson, Shahnam Mirzaei and Ryan Kastner, In Proceedings of the IEEE Symposium on Application Specific Processors (SASP), June 2008. [4] An Optimization Methodology for Matrix Computation Architectures, Ali Irturk, Bridget Benson, and Ryan Kastner, Unsubmitted Manuscript. [3] FPGA Implementation of Adaptive Weight Calculation Core Using QRD-RLS Algorithm, Ali Irturk, Shahnam Mirzaei and Ryan Kastner, Unsubmitted Manuscript. [2] An Efficient FPGA Implementation of Scalable Matrix Inversion Core using QR Decomposition, Ali Irturk, Shahnam Mirzaei and Ryan Kastner, Unsubmitted Manuscript. [1] Implementation of QR Decomposition Algorithms using FPGAs, Ali Irturk, MS Thesis, Department of Electrical and Computer Engineering, University of California, Santa Barbara, June 2007. Advisor: Ryan Kastner.
  • Slide 67
  • Thank You 67 [email protected]
  • Slide 68
  • Matrix Inversion. Use decomposition methods for analytic simplicity and computational convenience. Decomposition methods: QR, LU, Cholesky, etc.; the alternative is the analytic method.
  • Slide 69
  • Matrix Inversion using QR Decomposition. The given matrix A is factored as A = Q*R, with Q an orthogonal matrix and R an upper triangular matrix; the inverse then follows as A^-1 = R^-1 * Q'.
  • Slide 70
  • Matrix Inversion using QR Decomposition. Three different QR decomposition methods: Gram-Schmidt orthogonalization, Givens rotations, and Householder reflections. [Diagram: the columns of the matrix, the entry at the intersection of the i-th row and the j-th column, the memory layout, and the Euclidean norm computation.]
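    A compact classical Gram-Schmidt sketch is shown below for illustration only; the hardware implementation in the thesis may use a different variant or ordering of these operations.

        % Sketch: classical Gram-Schmidt QR decomposition, A = Q*R.
        A = rand(4);
        [n, ~] = size(A);
        Q = zeros(n); R = zeros(n);
        for j = 1:n
          v = A(:,j);
          for i = 1:j-1
            R(i,j) = Q(:,i)' * A(:,j);     % projection coefficient
            v = v - R(i,j) * Q(:,i);       % remove the component along q_i
          end
          R(j,j) = norm(v);                % Euclidean norm (as on the slide)
          Q(:,j) = v / R(j,j);
        end
        norm(A - Q*R)                      % should be close to zero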
  • Slide 71
  • Matrix Inversion using the Analytic Method. The analytic method uses the adjoint matrix Adj(A) and the determinant of the given matrix: A^-1 = Adj(A) / det(A). [The slide also shows the determinant of a 2x2 matrix: det([a b; c d]) = a*d - b*c.]
  • Slide 72
  • Adjoint Matrix Calculation, Cofactor Calculation Core. [Diagram of the multiply/subtract tree that computes one cofactor of a 4x4 matrix, e.g. C11 = A22*(A33*A44 - A34*A43) - A23*(A32*A44 - A34*A42) + A24*(A32*A43 - A33*A42).]
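    As a software reference for the analytic method, the sketch below computes the full cofactor matrix of a 4x4 matrix from 3x3 minors and forms the inverse as Adj(A)/det(A); it is a check of the math above, not the cofactor-core hardware.

        % Sketch: analytic inversion of a 4x4 matrix via cofactors and the adjoint.
        A = rand(4);
        C = zeros(4);
        for i = 1:4
          for j = 1:4
            minor = A; minor(i,:) = []; minor(:,j) = [];   % delete row i and column j
            C(i,j) = (-1)^(i+j) * det(minor);              % cofactor C_ij
          end
        end
        adjA = C';                  % adjoint (adjugate) = transpose of the cofactor matrix
        Ainv = adjA / det(A);
        norm(Ainv - inv(A))         % should be close to zero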
  • Slide 73
  • Different Implementations of the Analytic Approach. [Diagrams of Implementations A, B and C of the analytic approach, built from the cofactor calculation core.]
  • Slide 74
  • Matrix Inversion using LU Decomposition. The given matrix A is factored as A = L*U, with L a lower triangular and U an upper triangular matrix.
  • Slide 75
  • Matrix Inversion using LU Decomposition. [Dataflow diagram of the computation, steps 1-9.]
  • Slide 76
  • Matrix Inversion using LU Decomposition. [Dataflow diagram, continued.]
  • Slide 77
  • Matrix Inversion using Cholesky Decomposition. The given matrix A is factored as A = G*G', with G a unique lower triangular matrix (the Cholesky triangle) and G' its transpose.
  • Slide 78
  • Matrix Inversion using Cholesky Decomposition. [Dataflow diagram of the computation, steps 1-7.]
  • Slide 79
  • Matrix Inversion using Cholesky Decomposition. [Dataflow diagram, continued.]
  • Slide 80
  • Matrix Inversion using Cholesky Decomposition. [Dataflow diagram, continued.]
  • Slide 81
  • The Mean Variance Framework. [Overview diagram repeated from Slide 54: computation of the required inputs (5 phases, producing E{M} and Cov{M}), computation of the efficient frontier, and computation of the optimal allocation (the highest-utility portfolio).]
  • Slide 82
  • The Mean Variance Framework. [Overview diagram with the three stages numbered: (1) computation of the required inputs (5 phases), (2) computation of the efficient frontier from E{M}, Cov{M} and the candidate allocations, (3) computation of the optimal allocation.]
  • Slide 83
  • Computation of Required Inputs. Known data: publicly available prices and covariance, the number of securities, the reference allocation, the horizon, and the investor objective; also the time the investment is made, the investment horizon, and the estimation interval. The five phases: 1) detect the invariants; 2) determine the distribution of the invariants; 3) project the invariants to the investment horizon; 4) compute the expected return and the covariance matrix; 5) compute the expected return and the covariance matrix of the market vector, giving E{M} and Cov{M}.
  • Slide 84
  • Computation of Required Inputs. Investor objectives: absolute wealth, relative wealth, or net profits. The objective value is the product of the allocation vector [a_1, a_2, ..., a_Ns] and the market vector M.
  • Slide 85
  • Computation of Required Inputs, Phase 5. The market vector M is a transformation of the market prices at the investment horizon: M = a + B*P_(T+tau). Standard investor objectives, in generalized form: absolute wealth (a = 0, B = I_N, so M = P_(T+tau)); relative wealth (a = 0, B = K, so M = K*P_(T+tau)); net profits (a = -p_T, B = I_N, so M = P_(T+tau) - p_T). The specific forms express each objective through the horizon wealth W_(T+tau) of the allocation.
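    The three objectives differ only in the constants a and B of M = a + B*P_(T+tau); the small sketch below builds all three for one price scenario. Symbols follow the reconstruction above, and the matrix K is a stand-in value, since its exact definition is not given on the slide.

        % Sketch: build the market vector M = a + B*P_horizon for the three objectives.
        Ns = 5;
        p_T       = rand(Ns,1) * 100;        % current prices
        P_horizon = rand(Ns,1) * 100;        % prices at the investment horizon (one scenario)
        K         = eye(Ns) / sum(p_T);      % stand-in for the relative-wealth matrix K
        M_absolute = P_horizon;              % a = 0,    B = I_N
        M_relative = K * P_horizon;          % a = 0,    B = K
        M_net      = P_horizon - p_T;        % a = -p_T, B = I_N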
  • Slide 86
  • Computation of Required Inputs. Each step requires making assumptions: the invariants, the distribution of the invariants, and the estimation interval. Our assumptions: compounded returns of stocks as the market invariants, 3 years of known data, a 1-week estimation interval, and a 1-year horizon. Phase 5 is a good candidate for hardware implementation.
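    Under these assumptions, phases 1-4 amount to estimating weekly compounded returns and projecting them to the one-year horizon. A rough sketch follows, assuming (our simplification) i.i.d. weekly invariants so that their mean and covariance scale linearly with the number of weeks; the price data are synthetic.

        % Sketch: estimate weekly invariants and project them to a 1-year horizon.
        prices = cumprod(1 + 0.002 + 0.03*randn(156, 5)); % 3 years of weekly prices, 5 stocks
        X      = diff(log(prices));                       % weekly compounded returns (invariants)
        mu_w   = mean(X)';   Sigma_w = cov(X);            % weekly mean and covariance
        weeks  = 52;                                      % 1-year investment horizon
        mu_h    = weeks * mu_w;                           % projected mean of the invariants
        Sigma_h = weeks * Sigma_w;                        % projected covariance of the invariants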
  • Slide 87
  • The Mean Variance Framework. [Overview diagram repeated; the computation of the efficient frontier is considered next.]
  • Slide 88
  • MVF Step 1: Computation of the Efficient Frontier. Inputs: E{M}, Cov{M}, current prices, the number of portfolios, the number of securities, and the budget. For each target variance v >= 0, the frontier allocation maximizes the expected objective subject to the constraints: a(v) = argmax of a'*E{M} over allocations a that satisfy the constraints and a'*Cov{M}*a = v. The result is the efficient frontier in the expected return versus standard deviation (risk) plane.
  • Slide 89
  • MVF Step 1: Computation of the Efficient Frontier. a(v) = argmax of a'*E{M} subject to the constraints and a'*Cov{M}*a = v, for v >= 0. [Plot: expected return versus standard deviation (risk); the region above the efficient frontier is unachievable, and an investor does NOT want to be below the frontier!]
  • Slide 90
  • The Mean Variance Framework. [Overview diagram repeated; the computation of the optimal allocation is considered next.]
  • Slide 91
  • MVF Step 2: Computing the Optimal Allocation. Determination of the highest-utility portfolio. Inputs: current prices, the number of securities, the number of portfolios, the number of scenarios, and the satisfaction index; output: the optimal allocation. For each candidate allocation on the frontier (expected return versus standard deviation / risk), the question is: is this allocation the best?
  • Slide 92
  • MVF Step 2: Computing the Optimal Allocation. Satisfaction indices represent all the features of a given allocation with one single number and quantify the investor's satisfaction. Satisfaction indices: certainty-equivalent, quantile, and coherent indices. Certainty-equivalent satisfaction indices are represented by the investor's utility function and objective, u(.); we use the Hyperbolic Absolute Risk Aversion (HARA) class of utility functions. Utility functions: exponential, quadratic, power, logarithmic, linear.
  • Slide 93
  • MVF Step 2: Computing the Optimal Allocation. The Hyperbolic Absolute Risk Aversion (HARA) class of utility functions consists of specific forms of the Arrow-Pratt risk aversion model, A(psi) = psi / (gamma*psi^2 + beta*psi + alpha), here with alpha = 0. Utility functions in this class: exponential utility u(psi) = -exp(-psi/zeta); quadratic utility u(psi) = psi - psi^2/(2*zeta); power utility u(psi) = psi^(1 - 1/gamma); logarithmic utility u(psi) = ln(psi) (the limit gamma -> 1); linear utility u(psi) = psi (the limit gamma -> infinity).
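    The utility functions above can be written as anonymous functions; the forms follow the standard HARA parameterization assumed in the reconstruction above, and the parameter names (zeta, gamma) are illustrative.

        % Sketch: HARA-class utility functions of the objective value psi.
        zeta = 10;  gamma = 2;
        u_exp  = @(psi) -exp(-psi / zeta);          % exponential utility
        u_quad = @(psi) psi - psi.^2 / (2*zeta);    % quadratic utility
        u_pow  = @(psi) psi.^(1 - 1/gamma);         % power utility (psi > 0)
        u_log  = @(psi) log(psi);                   % logarithmic utility (limit of power)
        u_lin  = @(psi) psi;                        % linear utility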
  • Slide 94
  • Identification of Bottlenecks. In terms of computational time, the most important variables are the number of securities, the number of portfolios, and the number of scenarios. [Overview diagram of the three framework stages, as on Slide 54.]
  • Slide 95
  • Identification of Bottlenecks. The number of securities dominates computation time over the number of portfolios (number of scenarios = 100,000).
  • Slide 96
  • Identification of Bottlenecks. The number of portfolios dominates computation time over the number of scenarios (number of securities = 100).
  • Slide 97
  • Identification of Bottlenecks. [Runtime chart with number of portfolios = 100 and number of scenarios = 100,000.]
  • Slide 98
  • Identification of Bottlenecks. [Runtime chart with number of scenarios = 100,000 and number of securities = 100.]
  • Slide 99
  • Identification of Bottlenecks. [Runtime chart with number of portfolios = 100 and number of securities = 100.]
  • Slide 100
  • The Mean Variance Framework. [Overview diagram repeated; the computation of the required inputs (Phase 5 hardware) is considered next.]
  • Slide 101
  • Generation of Required Inputs, Phase 5. [Market Vector Calculator IP core: built from divider, subtractor and multiplexer building blocks plus a K building block, it computes M = P_(T+tau), K*P_(T+tau), or P_(T+tau) - p_T according to the selected objective. Control inputs: cntrl_a = 0, cntrl_b = 0 for absolute wealth; cntrl_a = 1, cntrl_b = 0 for relative wealth; cntrl_a = 0, cntrl_b = 1 for net profits.]
  • Slide 102
  • Generation of Required Inputs, Phase 5. [Market vector calculator datapath, built up incrementally over the next slides.]
  • Slide 103
  • Generation of Required Inputs, Phase 5. [Datapath, continued.]
  • Slide 104
  • Generation of Required Inputs, Phase 5. [Datapath, continued: the cntrl_a multiplexer is added.]
  • Slide 105
  • Generation of Required Inputs, Phase 5. [Datapath, continued: the cntrl_b multiplexer and the subtraction of p_T are added.]
  • Slide 106
  • Generation of Required Inputs Phase 5 106
  • Slide 107
  • The Mean Variance Framework. [Overview diagram repeated; the computation of the efficient frontier (Step 1 hardware) is considered next.]
  • Slide 108
  • Hardware Architecture for MVF Step 1. a(v) = argmax of a'*E{M} subject to the constraints and a'*Cov{M}*a = v, for v >= 0. A popular approach to solving such constrained maximization problems is the Lagrangian multiplier method.
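    With only a budget constraint (sum of the weights equal to 1), the Lagrangian solution has a well-known closed form; the sketch below traces a frontier by sweeping a risk-aversion weight. This is a textbook simplification of the constrained problem on the slide, with synthetic inputs standing in for E{M} and Cov{M}.

        % Sketch: mean-variance efficient frontier via Lagrange multipliers
        % (budget constraint only: sum(alpha) = 1).
        Ns = 5;
        mu    = 0.02 + 0.08*rand(Ns,1);             % expected returns, stand-in for E{M}
        Sigma = cov(randn(200, Ns)) + 0.01*eye(Ns); % covariance, stand-in for Cov{M}
        one   = ones(Ns,1);
        frontier = zeros(50, 2);                    % [risk, expected return] per portfolio
        for k = 1:50
          lambda = (k-1)/10;                        % trade-off weight between return and risk
          % maximize lambda*alpha'*mu - 0.5*alpha'*Sigma*alpha  s.t.  one'*alpha = 1
          w  = Sigma \ (lambda*mu);
          w0 = Sigma \ one;
          alpha = w + w0 * (1 - one'*w) / (one'*w0);    % enforce the budget constraint
          frontier(k,:) = [sqrt(alpha'*Sigma*alpha), alpha'*mu];
        end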
  • Slide 109
  • Hardware Architecture for MVF Step 1. For each given risk value, a number of functions equal to the number of securities must be computed to determine the efficient allocation.
  • Slide 110
  • Hardware Architecture for MVF Step 1. [Diagram: N_p parallel cores, one per portfolio; each core computes the N_s allocation weights a_1 ... a_Ns for one frontier point from its risk value v, E{M} and Cov{M}.]
  • Slide 111
  • The Mean Variance Framework. [Overview diagram repeated; the computation of the optimal allocation (Step 2 hardware) is considered next.]
  • Slide 112
  • Hardware Architecture for MVF Step 2. [A random number generator feeds the Monte Carlo block with scenario vectors; computing the objective value for one scenario requires N_s multiplications. The utility calculation uses, for example, the exponential utility u(psi) = -exp(-psi/zeta) or the quadratic utility u(psi) = psi - psi^2/(2*zeta).]
  • Slide 113
  • Hardware Architecture for MVF Step 2 113
  • Slide 114
  • Hardware Architecture for MVF Step 2 114
  • Slide 115
  • Hardware Architecture for MVF Step 2. Parallelism: N_s multipliers in parallel, N_m Monte Carlo blocks in parallel, N_m utility calculation blocks in parallel, and N_p satisfaction function calculation blocks in parallel.
  • Slide 116
  • Results, Generation of Required Inputs, Phase 5 (1,000 runs). With N_s arithmetic resources in parallel: 6 - 9.6; 629 (for 50 securities).
  • Slide 117
  • Results, Mean Variance Framework Step 2 (1,000 runs, 100,000 scenarios, 50 portfolios). 10 satisfaction blocks (1 Monte Carlo block with 10 multipliers and 10 utility function calculator blocks): 151 - 221. 10 satisfaction blocks (1 Monte Carlo block with 20 multipliers and 20 utility function calculator blocks): 302 - 442.
  • Slide 118
  • Conclusion. The Mean Variance Framework's inherent parallelism makes it an ideal candidate for an FPGA implementation; we are bound by hardware resources rather than by the parallelism the framework offers. However, there are many different architectural choices for implementing the framework's steps.