
DSPBA: Flow Control, Design Style and Floating Point

September 27, 2011

© Altera Corporation

Copyright © 2011 Altera Corporation. All rights reserved. Altera, The Programmable Solutions Company, the stylized Altera logo, specific device designations, and all other words and logos that are identified as trademarks and/or service marks are, unless noted otherwise, the trademarks and service marks of Altera Corporation in the U.S. and other countries. All other product or service names are the property of their respective holders. Altera products are protected under numerous U.S. and foreign patents and pending applications, maskwork rights, and copyrights. Altera warrants performance of its semiconductor products to current specifications in accordance with Altera's standard warranty, but reserves the right to make changes to any products and services at any time without notice. Altera assumes no responsibility or liability arising out of the application or use of any information, product, or service described herein except as expressly agreed to in writing by Altera Corporation. Altera customers are advised to obtain the latest version of device specifications before relying on any published information and before placing orders for products or services.


Table of Contents

1 Overview
2 DSPBA Basics
3 Essential Checklists
    3.1 Top Level Design
    3.2 Primitive Subsystems
    3.3 Verification
4 Recommendation Checklists
    4.1 Simulink Design settings
    4.2 Top Level Design
    4.3 Primitive Subsystems
    4.4 Design Style
    4.5 Verification
5 FAQ
Design Style
    5.1 Getting Started
        5.1.1 createPrimitiveFIR
        5.1.2 createDSPBADesign
    5.2 Using Vectors
        5.2.1 Use of vectors: with ModelPrim blocks
        5.2.2 Use of vectors: Additional Libraries > Vector Util Library
        5.2.3 Use of vectors: Other supported Vector blocks
    5.3 Implementing Flow Control
        5.3.1 Flow control using latches
        5.3.2 Flow control using simple LOOP
        5.3.3 Flow control using ForLoop blocks
        5.3.4 LOOP vs ForLoop
    5.4 Building System Components
        5.4.1 Avalon-ST Output
        5.4.2 Avalon-ST Input
        5.4.3 Avalon-ST Input FIFO
        5.4.4 Extending the interface definition
        5.4.5 Restrictions on use
    5.5 Using ModelPrim Subsystems
        5.5.1 Interfaces as subsystem boundaries
        5.5.2 Interfaces as scheduling boundaries
        5.5.3 ModelPrim subsystem design styles to avoid
        5.5.4 Common Problems
        5.5.5 ModelPrim Blocks outside primitive subsystems
        5.5.6 Convert blocks vs. specifying output types via dialog
6 Debugging Designs
7 Floating Point
    7.1 Support Outline
        7.1.1 Blocks
        7.1.2 Interaction with other features
    7.2 Floating Point Format
        7.2.1 Single Precision Word Formats
        7.2.2 Double Precision Word Formats
        7.2.3 Floating Point Type propagation
    7.3 Special Considerations When Using Floating Point
        7.3.1 Flow Control, latency hiding and avoiding data dependencies
Appendix: Generated Test-benches
Appendix: Overriding Test-benches in Matlab

Figures

Figure 1: Model file top level
Figure 2: Synthesizable system top level
Figure 3: Primitive subsystem top level
Figure 4: Using a matrix allows simple initialization of a vector of LUTs each with different contents, without having to specify each separately
Figure 5: Vector initialization of SampleDelay when used with vectors (and equivalent version with individual sample delays)
Figure 6: Expand Scalar block for parameterizable signal replication to vector
Figure 7: Vector Mux masked subsystem for dynamic vector signal selection, reparameterizable in vector width
Figure 8: Zero latency latch
Figure 9: Single-cycle latency latch
Figure 10: Set/reset boolean latch with reset priority
Figure 11: Set/reset boolean latch with set priority
Figure 12: Simple filter with enabled delay chain - full layout
Figure 13: Output of demo_forward_pressure, stalled when valid is low – but without the need for a high fanout enable net
Figure 14: Vectorized FIR stage
Figure 15: Vectorized FIR stage using TappedDelayLine from Vector Utils library
Figure 16: demo_back_pressure example design
Figure 17: Simple FIFO model, used to illustrate considerations in safe usage
Figure 18: Hardware simulation output for example
Figure 19: LOOP block and equivalent C code for two-dimensional count limit vector 'C'
Figure 20: demo_kronecker; using nested LOOP to generate datapaths that operate on regular data
Figure 21: Rectangular nested loop
Figure 22: Triangular nested loop
Figure 23: Avalon Stream library appears under Additional Libraries as the blocks in the library are in fact just masked primitive subsystems
Figure 24: The Avalon Stream Interface library
Figure 25: DSP design with Avalon-ST interfaces
Figure 26: Tag the Simulink port internal to the Avalon-ST masked subsystem to define the port role
Figure 27: Example Block Properties GUI for Avalon-ST masked subsystem showing description field
Figure 28: Packing and unpacking a vector into a single data connection
Figure 29: Simple primitive subsystem with line of identical registers across all input to output paths
Figure 30: Another simple primitive subsystem with line of identical registers across all input to output paths
Figure 31: Simple primitive subsystem with independently synchronized outputs
Figure 32: Simple primitive design with multiple input groups and multiple output groups
Figure 33: Schedule, after pipelining for above primitive subsystem
Figure 34: Example of a multiple I/O primitive subsystem where an input is scheduled after an output
Figure 35: Blocks not driven from clocked blocks
Figure 36: Synchronizing logic dependent on reset (bad)
Figure 37: Synchronizing logic dependent on valid (good)
Figure 38: Convert block changes data-type preserving real-world value (as far as possible), with options to round and saturate. It can grow the number of bits - sign extending or zero-padding where appropriate
Figure 39: Convert block changes data-type preserving real-world value (as far as possible), with options to round and saturate
Figure 40: Setting an output type explicitly via a primitive dialog for any other blocks changes type while preserving the bit pattern. The real world value will generally be scaled in such cases
Figure 41: Setting an output type via dialog and reducing the bit-width will discard the top, 'most significant' bits
Figure 42: Output data type selection UI - showing single and double as options
Figure 43: New floating point primitive blocks in 10.1 Advanced Blockset
Figure 44: Use of FIFOs (and loops) to control running of floating point calculations without explicitly waiting for the start-to-finish calculation latency. Result can feed into similar downstream processes
Figure 45: Flow control for Mandelbrot calculation
Figure 46: Insertion of sufficient lumped 'SampleDelay' to allow for pipelining
Figure 47: Generated Automatic TestBench files



1 Overview

This document describes some recommended design styles when using DSP Builder Advanced Blockset, as an incremental addition to the current documentation.

It mostly focuses on recent enhancements to the tool (10.0+).

In particular it covers use of vectors with ModelPrim blocks, how to implement efficient flow control, floating point and special design considerations when using it, and some design patterns to avoid. Examples are used to illustrate the general principles. The document is mostly restricted to designs using primitive subsystems.



2 DSPBA Basics

Top level: (See Figure 1) The top level of the model consists of:

Simulink test-bench - that is, blocks to provide inputs, and to analyse inputs and outputs

Required DSPBA top level parameterization blocks: Signals (bus clock specification, system clock specification) & Control (RTL output directory, top level threshold parameters)

Links to open post-generation tools: Run ModelSim (open ModelSim to run the generated RTL testbench and compare against Simulink at the synthesizable system level), Run Quartus (open the generated RTL project in Quartus to do a full Quartus compile)

Other blocks, such as: Run All Testbenches - a UI for the scripts that run the system-level, ModelIP and primitive subsystem level Automatic Test Benches (ATBs); and optional short-cuts to edit parameterization files that run on model start-up and/or pre- or post-simulation

Synthesizable System Top Level: (See Figure 2) The part of the design to be synthesized is separated hierarchically. The top level of the synthesizable part is indicated by a Device block, which sets the family, part, speed grade, etc. to target.

This level can contain further levels of hierarchy that include Primitive Subsystems - scheduled domains for ModelPrim blocks (the low-level blocks such as delays, mults, adds) - and ModelIP blocks - the standalone macro functions (NCO, FIR, CIC).

Optionally further LocalThreshold blocks can be included to override threshold settings defined higher up the hierarchy.

Primitive Subsystem Top Level: (See Figure 3)

Primitive Subsystems are scheduled domains for the ModelPrim blocks. A SynthesisInfo block is required. Blocks to delimit the Primitive subsystem are also required: ChannelIn (Channelized Input), ChannelOut (Channelized Output), GPIn (General Purpose Input) and GPOut (General Purpose Output). Within these boundary blocks the tool will optimize the implementation specified by the schematic - including the insertion of pipelining registers required to achieve the specified system clock rate. When inserting pipelining registers, equivalent latency has to be added to parallel signals that are required to be kept synchronous so that they are scheduled together. Signals that go through the same input boundary block (ChannelIn or GPIn) are scheduled to start at the same point in time; signals that go through the same output boundary block (ChannelOut or GPOut) are scheduled to finish at the same point in time. Any pipelining latency added to achieve Fmax is then added in balanced 'cuts' through the signals across the design. The correction to the simulation to account for this latency added in HDL generation is applied at the boundary blocks, such that the Primitive Subsystem as a whole will remain cycle accurate.

Note that further levels of hierarchy containing primitive blocks can be defined within primitive subsystems, but these must not contain primitive boundary blocks or ModelIP blocks.
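The balancing rule described above can be illustrated with a minimal Python sketch. It is not the tool's scheduler (which the checklists say uses integer linear programming); it only shows the arithmetic for signals that share one output boundary block. The function name `balance_paths` and the example latencies are hypothetical.

```python
def balance_paths(path_latency):
    """Return the extra delay each path needs so that all paths sharing an
    output boundary block finish on the same clock cycle.

    path_latency maps a signal name to the pipelining latency (in cycles)
    already present on that path. All names here are illustrative only.
    """
    target = max(path_latency.values())  # the slowest path sets the schedule
    return {name: target - latency for name, latency in path_latency.items()}


# Assumed example: the data path crosses a multiplier pipelined to 3 stages,
# while valid and channel pass straight through.
extra = balance_paths({"data": 3, "valid": 0, "channel": 0})
print(extra)
```

In this assumed example the valid and channel paths each receive 3 cycles of balancing delay, so the three signals still leave the subsystem aligned.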




Figure 2: Synthesizable System Top Level



Figure 3: Primitive Subsystem Top Level


3 Essential Checklists

3.1 Top Level Design

1. There must be a Control block and a Signals block at the top level

2. The synthesizable part of your design must be a subsystem or contained within a subsystem of the top level.

3. There must be a Device block at the hierarchical level of the synthesizable part of the design.

4. Test-bench stimulus data-types feeding into the synthesizable design will be propagated – so ensure they are correct. Switch on all of ‘Port Data Types’, ‘Signal Dimensions’ and ‘Wide Non-Scalar Lines’ from the Simulink model Format > Port/Signal Displays menu to have these annotated to your model so they are visible.

3.2 Primitive Subsystems

1. Don’t try to pipeline the system yourself – this is what the tool does for you using its internal timing models and integer linear programming. Only add Sample Delays where they are part of the algorithm; that is, where your algorithm explicitly requires you to think about combining data samples from different clock cycles. This can include feedback loops. If your design is not meeting timing, you may want to consider using the Clock Margin parameter on the top level Signals block, or on a LocalThreshold block.

2. The subsystem must contain a SynthesisInfo block with style set to Scheduled.

3. Primitive subsystems cannot contain ModelIP blocks.

4. All subsystem inputs with associated ‘Valid’ and ‘Channel’ signals that are to be scheduled together should be routed through the same ChannelIn blocks immediately following the subsystem inputs. Any other subsystem inputs should be routed through GPIn blocks.

5. All subsystem outputs with associated ‘Valid’ and ‘Channel’ signals that are to be scheduled together should be routed through the same ChannelOut blocks immediately before the subsystem outputs. Any other subsystem outputs should be routed through GPOut blocks.

6. Use Convert blocks to change data type preserving real-world value.

7. Use Set Via Dialog options to change data type preserving bit-pattern (with no bits added or removed), or to fix a data type.



8. Use Reinterpret Cast to change data type preserving bit-pattern (with no bits added or removed); for example, when converting a ‘uint32’ to ‘single’.

9. The valid signal is a scalar boolean or ufix(1).

10. The channel signal is a scalar uint(8).
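The difference between items 6 to 8 (preserving the real-world value versus preserving the bit pattern) can be shown with a small Python sketch. It models only the arithmetic, not the DSPBA blocks themselves; the helper names are hypothetical, and the fixed-point helper assumes signed two's complement with a given number of fractional bits.

```python
import struct


def convert_uint32_to_single(value):
    """Value-preserving conversion: the result is the single-precision number
    nearest to the integer's real-world value."""
    return struct.unpack("<f", struct.pack("<f", float(value)))[0]


def reinterpret_uint32_as_single(bits):
    """Bit-pattern-preserving cast: the same 32 bits are read as an IEEE 754
    single, so the real-world value generally changes."""
    return struct.unpack("<f", struct.pack("<I", bits))[0]


def fixed_point_value(raw, width, frac_bits):
    """Real-world value of a raw bit pattern read as signed fixed point
    (two's complement, frac_bits fractional bits)."""
    raw &= (1 << width) - 1
    if raw >= 1 << (width - 1):
        raw -= 1 << width
    return raw / (1 << frac_bits)


bits = 0x3F800000
print(convert_uint32_to_single(bits))      # real-world value kept
print(reinterpret_uint32_as_single(bits))  # bit pattern kept

# Same 16-bit pattern read with 15 and then 14 fractional bits: the bits are
# unchanged but the real-world value is scaled by two.
print(fixed_point_value(0x4000, 16, 15), fixed_point_value(0x4000, 16, 14))
```

The last line illustrates why changing a type while keeping the bit pattern generally scales the real-world value.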

3.3 Verification

1. Turn on ‘Create Automatic Testbenches’ and ‘Coverage in Testbenches’ on the Control block to use. Note that stimulus capture for test-benches is done on the inputs and outputs of ModelIP blocks and by ChannelIn, ChannelOut, GPIn and GPOut blocks.

2. Run all testbenches with

run_all_atbs(<model name>, <run simulation first ? 1 : 0>, <run Quartus afterwards ? 1 : 0>)

3. Run an individual subsystem / ModelIP block testbench with

run_modelsim_atb(<path to subsystem> or <’gcb’ if currently selected>)

4. Run the single device-level testbench from the Run ModelSim block.

5. Look at generated resource summaries. After simulation, right click and select ‘Help’ – on the ModelIP block for ModelIP blocks, on the SynthesisInfo block for primitive subsystems, or on the Control block for a top level design summary.

6. Use the Run All Testbenches block to control test-benches – and to access the override feature, where ModelSim results can be automatically imported back into Matlab and a custom Matlab function used to verify the results and provide the pass/fail criteria. (See appendix).

4 Recommendation Checklists

4.1 Simulink Design settings

Set Simulation > Configuration Parameters > Solver Options to “Fixed-step” / “discrete (no continuous states)”, unless folding, or you have multiple clocks in your test-bench in which case set to “Variable-step” / “discrete (no continuous states)”. This gives faster simulation than continuous solvers and also correct results round loops.

Tick all options on Format > Port/Signal Displays, except “Storage Class”. It is then clear which signals are complex, which are vectors, and the data types.

Hide the names of unimportant blocks to de-clutter your design using Format > hide name



If using From and Goto blocks, color linked blocks the same by selecting all linked Froms & Gotos (hold down shift while clicking on each block), then right click > background color. This makes tracing the connectivity easier to see.

Annotate your designs. Just double click anywhere on the background and start typing. Use Simulink Documentation blocks to link to external documentation.

Matlab Window > File & Folder comparisons is a great way to see what has changed between versions of your design.

4.2 Top Level Design

1. Use workspace variables to set parameters you may want to vary; including clock rates, sample rates, bit-widths, channels, etc.

2. Set workspace variables in initialization scripts. It is suggested that these are executed on the model’s PreLoadFcn and InitFcn callbacks, such that the design opens with parameters set, and any changes will be reflected in the next simulation, without having to explicitly run the script or open & close the model.

3. Call your main initialization script for the model ‘setup_<model name>’, and – as a shortcut to editing it – include the Edit Params block in the top level of your design. (This can be found by right-clicking on the Base Blocks library in the Simulink Library Browser and Selecting ‘Open Library’.)

4. Build a test-bench that is parameterizable – i.e. will vary correctly with system parameters such as Sample Rate, Clock Rate, and number of channels. The Channelizer block in Beta Utilities Library may be useful for this.

5. Use the model’s StopFcn callback to run any analysis scripts automatically.

6. Build systems that make use of the valid and channel signals for control and synchronization, not latency matching; for example, by capturing valid output in FIFOs to manage data-flow.

7. Build up and use your own libraries of reusable components. You can even use the “Configurable Subsystem block” in libraries to provide a single link from which you can select library implementations in place. (See “Configurable Subsystem block” in the Simulink help).

8. Keep block and subsystem names short, but descriptive. Avoid names with special characters, slashes or beginning with numbers.

9. Use LocalThreshold blocks, in conjunction with the top level thresholds, for localized trade-offs or pipelining effort tweaks if necessary.



4.3 Primitive Subsystems

1. Make use of vectors to build parameterizable designs – that don’t need redrawing when parameters such as the number of channels change.

2. Ensure there are sufficient Sample Delays around loops to allow for pipelining.

3. Data-type, complexity and vector width propagation is done by Simulink. Sometimes this is not successfully resolved round loops, particularly multiple nested loops. If it is unsuccessful, look for where data-types are not annotated; you may have to set data types explicitly. Alternatively, Simulink provides a library of blocks to help in such situations, which duplicate data types etc.: fixpt_dtprop (type ‘open fixpt_dtprop’ at the Matlab command prompt to open it). The ‘Data Type Prop Duplicate’ block is used in the Control library latches, for example. These blocks are guides to Simulink on data-type propagation, and do not produce hardware.

4. If routing within a Primitive Subsystem is getting complex, you might consider using Simulink From / Goto blocks to replace connections. Make sure that the Tag Visibility on the Goto blocks is Global if crossing subsystems within a primitive subsystem. You can color code blocks too (right click > background color) to make connections more obvious.

4.4 Design Style

Don’t try to pipeline the design yourself – this is what the tool does for you using mathematical linear programming techniques. If you need more pipelining, use a positive Clock Margin (see Signals block).

Don’t try to synchronize the output of different parallel subsystems using explicit delays. Use FIFOs, as this will give a more device-portable, fmax target independent design.

Break designs up hierarchically to make your design understandable. However, keep consecutive primitive subsystems together within single ChannelIn/Out blocks, as this gives greater scope for scheduling and pipelining optimizations.

If you think you need complex control with complex feedback or cycle-counting from the data path, think again. Look at the Mandelbrot design and understand what it is doing. It creates command instructions which are placed in a FIFO. The instructions are consumed by the data-path as fast as it can run. The result is a design that runs as fast as it can and is portable between device families, and which reduces the complexity of the control logic.
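As a rough behavioural illustration of this style (not the Mandelbrot design itself), the Python sketch below queues commands in a FIFO and lets a pipelined data-path consume them, collecting results only when they emerge as valid. The operation performed (squaring), the function name and the latencies are all assumptions made for the example. The point is that the results do not depend on the pipeline latency, so no control logic has to count cycles.

```python
from collections import deque


def run_datapath(commands, latency):
    """Cycle-by-cycle model of a data-path fed from a command FIFO.

    commands: values queued by the control logic (hypothetical instructions).
    latency:  pipeline depth of the data-path in cycles; on real hardware this
              would vary with device family and fmax target.
    Returns the list of results and the number of cycles taken.
    """
    fifo = deque(commands)
    pipeline = deque([None] * latency)  # None marks an invalid (empty) slot
    results = []
    cycles = 0
    while fifo or any(slot is not None for slot in pipeline):
        # Issue the next command if one is waiting, otherwise issue a bubble.
        pipeline.append(fifo.popleft() if fifo else None)
        done = pipeline.popleft()
        if done is not None:  # "valid" output: capture it
            results.append(done * done)  # stand-in for the real calculation
        cycles += 1
    return results, cycles


shallow, _ = run_datapath([1, 2, 3], latency=3)
deep, _ = run_datapath([1, 2, 3], latency=11)
print(shallow == deep)
```

Only the cycle count changes with latency; the captured results are identical, which is what makes this style portable across devices and fmax targets.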

4.5 Verification

1. Remember that output is only guaranteed to match hardware when valid is high.



2. run_modelsim_atb displays the command it executes. This command can be cut and pasted into a ModelSim UI open at the same directory and run manually. Using this you can analyze the behavior of particular subsystems in detail, and can force simulation to continue past errors if necessary.

3. If using FIFOs within multiple feedback loops, it is possible that while the data-throughput and frequency of invalid cycles is the same, their distribution over a frame of data may vary (due to the final distribution of delays around the loop). If a mismatch is found, it is therefore worth stepping past errors using the above process to check whether this is the case.

4. Floating point simulation is compared to within a tolerance. Differences are likely to be in the few least significant bits only, but could potentially be higher if the function you are implementing is ill-conditioned. Larger relative differences can also arise in complex multiplication with large complex numbers that lie close to the real or imaginary axis.

5. Use the Run All Testbenches block to control test-benches – and to access the override feature, where ModelSim results can be automatically imported back into Matlab and a custom Matlab function used to verify the results and provide the pass/fail criteria. (See appendix).
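The effect described in item 4 can be reproduced with a short Python sketch. It emulates single precision by rounding every operand and intermediate result, and compares a complex product lying close to the imaginary axis against a double precision reference. The operand values and helper names are assumptions chosen for illustration, and the rounding model is not claimed to match the generated hardware bit for bit.

```python
import struct


def to_single(x):
    """Round a double to the nearest IEEE 754 single-precision value."""
    return struct.unpack("<f", struct.pack("<f", x))[0]


def complex_mult_single(a, b):
    """Complex multiply with operands and every intermediate rounded to single."""
    ar, ai = to_single(a.real), to_single(a.imag)
    br, bi = to_single(b.real), to_single(b.imag)
    re = to_single(to_single(ar * br) - to_single(ai * bi))
    im = to_single(to_single(ar * bi) + to_single(ai * br))
    return complex(re, im)


def error_report(result, reference):
    """Error relative to the magnitude of the complex number, and the
    relative error of the real component on its own."""
    return {
        "relative_to_magnitude": abs(result - reference) / abs(reference),
        "relative_in_real": abs(result.real - reference.real) / abs(reference.real),
    }


a = complex(1.0, 1.0)
b = complex(1.0, 0.9999999)   # product lies very close to the imaginary axis
reference = a * b             # double precision reference
print(error_report(complex_mult_single(a, b), reference))
```

A test-bench comparison that checks each component against a relative tolerance would flag the real part here, even though the complex number as a whole is accurate to within single precision.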


[Figure: complex plane (axes Re, Im). Labels: "Small absolute error in complex number"; "Large relative error in real component".]


5 FAQ

1) What do the ChannelIn and ChannelOut blocks do?

The optimizations performed by the tool operate within subdomains of the whole design: individually within each ModelIP block (FIR, NCO etc) and within each Primitive subsystem.

A Primitive subsystem is a Simulink subsystem with a ModelPrim SynthesisInfo block, with inputs and outputs (at the SynthesisInfo level) passing through the boundary blocks ChannelIn/Out or GPIn/Out at the subsystem I/Os, and containing that part of the design built from ModelPrim blocks. It can contain further Simulink subsystems, but no nested ModelIP blocks or further SynthesisInfo blocks.

The ChannelIn and ChannelOut blocks (and GPIn and GPOut) delimit the boundaries of a primitive subsystem. They group signals at the boundary, either with (ChannelIn/Out) or without (GPIn/Out) related channel and valid signals, to be scheduled together. When determining the pipelining to be added in order to achieve the desired Fmax, the tool needs to know which signals should be kept synchronized, such that adding latency to one will require balancing delays to be added to the synchronous signals. Pipelining is then added in balanced ‘cuts’ through the synchronized signals, such that the added delay can be corrected for (in most cases) at the subsystem level just by adding simulation delays in the appropriate boundary blocks.

See below for further details.

2) If I have a block with a data flow and a parameter (e.g. gain), where a single parameter value (the gain) may be given at any time without any timing relation to the data flow, how should those be used?

If the signals are independent and do not have to remain synchronous, then you can put them through separate boundary blocks.



3) What does the error message “Warning: Negative IO correction on block <design>/<subsystem>/ChannelOut. Simulink will not match hardware” mean?

See below.

4) What does the error message “Unable to determine data types for some ports, cannot continue” mean?

It means that the data type of a signal cannot be uniquely determined from the word growth and inheritance rules set on the blocks. This can arise in feedback loops with inherit or growth rules or in blocks with unconnected inputs. Consider the following example where the multiplier and the SampleDelay both have ‘Output data type mode’ as “Inherit via internal rule”.

The word growth rule for multipliers is to add the integer and fractional bit widths of the inputs to get the output type. The SampleDelay preserves the input type without change by default. It can be seen here then that the output type of the Mult cannot be determined:



If input A is sfix17_En16 and input B (which has the Mult output type) is sfixP_Q, then under the default inheritance rules the Mult output is sfix(17+P)_En(16+Q). For the loop to be consistent we would need P = P+17 and Q = Q+16, which has no solution for P and Q.
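The lack of a solution can be seen by iterating the growth rule round the loop, as in the hedged Python sketch below. It is a toy model of type propagation, not Simulink's algorithm: `FixType`, `mult_growth` and `propagate_loop` are hypothetical names, and the starting type and pass limit are arbitrary.

```python
from typing import NamedTuple


class FixType(NamedTuple):
    width: int  # total number of bits, the N in sfixN_EnM
    frac: int   # number of fractional bits, the M in sfixN_EnM


def mult_growth(a, b):
    """Default multiplier rule described above: widths and fraction lengths add."""
    return FixType(a.width + b.width, a.frac + b.frac)


def propagate_loop(a, start, forced=None, max_passes=8):
    """Repeatedly propagate types round the Mult -> SampleDelay -> Mult loop.

    forced, if given, models fixing the Mult output type explicitly.
    Returns the Mult output type seen on each pass; propagation stops early
    when the type stops changing (a consistent solution has been found).
    """
    history = []
    feedback = start
    for _ in range(max_passes):
        out = forced if forced is not None else mult_growth(a, feedback)
        history.append(out)
        if out == feedback:
            break
        feedback = out  # the SampleDelay preserves the type unchanged
    return history


a = FixType(17, 16)  # sfix17_En16
print(propagate_loop(a, start=FixType(17, 16)))                       # grows on every pass
print(propagate_loop(a, start=FixType(17, 16), forced=FixType(32, 30)))  # settles
```

With the inherited rule the width grows by 17 bits on every pass and never settles; once one type in the loop is fixed (here an arbitrary sfix32_En30), propagation reaches a consistent answer immediately.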

We must fix the type in this loop explicitly. This can be done by explicitly setting a type on one of the blocks; by adding a Convert block to set the type, or by using the Simulink “Data Type Prop Duplicate” block from the Simulink fixpt_dtprop library, which copies the type from the signal attached to ‘Ref’ to the signal attached to ‘Prop’.

This method may be favorable as it is flexible to input type, though an alternative approach is to write a flexible Matlab expression that is evaluated to set the type. Note that other simple propagation type blocks can also be used from this library, for example;

5) What does the error message “Failed to distribute memory in your design” mean?



The tool automatically inserts the pipelining required to meet the chosen clock speed, based on internal timing models. For example, to run an 18-bit multiplier at 400MHz, our timing models may suggest that 3 registering stages are required across the multiplier block.

Suppose you have a feedback loop where a latency of 5 clock cycles is specified (for example if your data is for 5 channels in sequence). We can satisfy both criteria: pipelining for fmax and re-circulating the data in 5 clock cycles. Rather than adding the 5 cycles of delay specified by the SampleDelay as 5 new registers, 3 of those delays will be formed from the delay required across the multiplier and only 2 will be implemented as external registers. The delay has been ‘distributed’ around the loop. More complex, multiply nested loops require more complex delay redistributions – but all of this is solved using standard mathematical linear programming techniques.

Now suppose you have a feedback loop where a latency of 1 clock cycle is specified. The two criteria: pipeline to achieve fmax (implying a latency of at least 3 clock cycles) and the loop criteria (re-circulate the data in 1 clock cycle), cannot simultaneously be satisfied, no matter where we distribute the 1 cycle of delay specified. In this case an error is given:

Failed to distribute memory in your design. Found insufficient delay attempting to satisfy fMax requirement for [subsystem]. Failed to satisfy the following latency constraints: (ParallelPathPair 0): Mult<3> SampleDelay (2 cycles deficient)

What this says is that the design as specified is imposing a restriction on the pipelining that is required to meet the clock speed requirement, such that both cannot be satisfied simultaneously.
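The arithmetic behind the two cases can be sketched as follows. This is only a single-loop illustration using the numbers from the example above; the real tool solves the general problem with linear programming, and the variable names are assumptions.

```matlab
% Values taken from the worked example in the text.
pipelineStages = 3;          % registers the timing model wants across the multiplier
loopDelays     = [5 1];      % SampleDelay lengths for the two cases discussed
slacks = loopDelays - pipelineStages;
for k = 1:numel(loopDelays)
    if slacks(k) >= 0
        fprintf('delay %d: %d absorbed by Mult, %d left as external registers\n', ...
                loopDelays(k), pipelineStages, slacks(k));
    else
        fprintf('delay %d: %d cycles deficient\n', loopDelays(k), -slacks(k));
    end
end
```

The second case gives the same "2 cycles deficient" figure that appears in the error message.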

It may be possible in some cases to re-implement the algorithm to avoid loops, or to run the design faster but push the data through at the same rate (for example, if running at 100MHz with a new data sample every clock cycle, instead run at 300MHz and have a new data sample every 3 clock cycles). DSPBA optimizations and timing characterization are currently targeted at high clock rates. Folding (manual or automatic) can then be used to reduce hardware resources elsewhere in the design if clock rate > sample rate.

6) My design worked, I turned on folding for a Primitive Subsystem and got a Simulink error: “S-function '<design>/<subsystem>/ChannelOut' method mdlSetInputPortSampleTime cannot change the sample time of ports once they have been set.”




Simulink has propagation and setting rules for data types, sample rates, etc. that attempt to resolve and fix these fields for each port. Folding changes the Simulink sample rate at which the primitive subsystem runs. If you get this message it’s because there is a conflict in sample time settings: the ChannelOut has been set to run at the folded sample rate, but a block within the primitive subsystem itself has an explicit sample time set on it that conflicts when propagated forwards to the ChannelOut. Check that the sample times of the blocks in the primitive subsystem are set to ‘-1’ (inherit) where appropriate.

7) How should we use the Avalon Blocks?

The Avalon-MM (ModelBus) and Avalon-ST blocks are used in different ways. Refer to the DSP Builder documentation on how to use these. For flow control the Avalon-ST output “ready” signal should be looped back to the Avalon ST input “ready” signal. This is shown in the diagram below.

8) Is it possible to have a portion of the graph depend on some variables, e.g. having clockrate/N adders, where clockrate is defined in the parameter file?

There are several ways to do this. The first is through the use of vectors, where the vector size determines the number of blocks that will be produced. Vectors are very useful in building parameterizable components. The other is to create a self initializing subsystem component – see 5.2.2.1 for an example of a block which is really a self initializing subsystem.

9) How do we initialize the value in a register (SampleDelay)?

This is not currently supported directly. Delays specified by SampleDelays can get redistributed around the system – and hence implemented as registers in memory blocks or multipliers where initialization is impossible.



10) What is the list of supported Simulink blocks that can be used for HW generation?

Mux, Demux, From, Goto, Subsystem ports (Out1, In1), Terminator, Constant, Selector (static Vector selection only), Complex to Real-Imag, Real-Imag to Complex, Configurable Subsystem (with some restrictions), Data-type propagation (with some restrictions).

11) Can I mix VHDL with Simulink (or other design languages?)

The HDL Import block can be used in a Standard Blockset level hierarchically above the DSPBA design. See documentation on HDL Import and mixed Blockset designs.

12) Can I create my own equivalent of the ‘Edit Params’ block?

Yes. The Simulink documentation covers such matters.

It is deliberate that we do not show the Edit Params block in DSPBA library browser: the block itself does nothing other than open a file for editing. The user would have to create the script and set up the pre-load functions on the model properties to use correctly. It is not something you can just drag and drop onto your model.

You can achieve the same by creating any m script which is run in the model's set-up stage (PreLoadFcn), or indeed any other such stage if necessary.

The Edit Params block assumes that the name of this script is “setup_<model name>.m”, but this is just the convention for this use case; you could call it whatever you like (e.g. my_script.m).

Use File > Model Properties > Callbacks to get your script to run before simulation. For a design demo_duc using Edit Params you have to add setup_demo_duc to the PreLoadFcn (so that the parameters exist on loading) and to the InitFcn (so that changes you make with the model open before running simulation are included in the simulation run). If you called the script my_script.m, then just put my_script; at the appropriate stages.
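The same callbacks can also be set from the MATLAB command line. The following is a minimal sketch, assuming the demo_duc model and setup_demo_duc script named above; the guard is an added precaution so that the model is only touched when Simulink is available and the model is loaded.

```matlab
% Model and script names follow the demo_duc example in the text.
model  = 'demo_duc';
script = sprintf('setup_%s', model);   % gives 'setup_demo_duc'
cmd    = [script ';'];                 % command string stored in the callbacks
% Only modify the model if Simulink is present and the model is loaded.
if exist('bdIsLoaded', 'file') && bdIsLoaded(model)
    set_param(model, 'PreLoadFcn', cmd);   % runs when the model is loaded
    set_param(model, 'InitFcn', cmd);      % runs at the start of each simulation
end
```

The model needs to be saved afterwards for the PreLoadFcn setting to take effect the next time it is opened.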



The Edit Params block is a masked subsystem that has been given an OpenFcn to open an m-script “setup_<model_name>” so you can edit it:

s = sprintf('edit setup_%s', eval('gcs'));   % set up the name of the script in an 'edit' command

eval(s);   % execute the command

Drop in a Simulink subsystem. Go in and remove the default ports. Back out, right click ... block properties … and set the OpenFcn.



Alternatively if you called your set-up script ‘my_script.m’ the OpenFcn would be …



The block will now open the script for editing when clicked, and will look like this:



You can call this subsystem block what you like, or hide the name; it doesn’t matter. All it is is a way of opening the set-up script for editing. The important thing is the OpenFcn.

You can even add a picture or graphic for it. For example, if you have a picture “ant.jpg” in a directory which is included in your MATLAB path (File > Set Path …) then you can right click on the subsystem block > Edit Mask … and add something like the following (which sets an image, sets the text color to white, and writes “Antonnios Set Up Script” across it)

To give



You can debate whether this is an improvement.

13) Is it possible to restrict the scope of the variables defined in the setup script to the model they apply to only?

The recommendation is to create a structure of variables for the model to avoid ambiguity if running multiple models. The Simulink help also has some information on the scope of workspace and model variables.



Design Style

5.1 Getting Started

We recommend you follow the checklists outlined above in setting up your system.

The demonstration designs often make good starting points, and can be copied, renamed and saved into a new working directory along with any setup scripts.

There are also a couple of scripts to get you started on building the basics of a DSPBA design.

5.1.1 createPrimitiveFIR

createPrimitiveFIR creates a complete FIR filter design using DSPBA primitive blocks. There are several ways to pass in parameters to this: an ordered short-list of parameters, a MATLAB struct, or as name – value pairs.

The command line call

createPrimitiveFIR(NAME,COEFFS,NUMCHANS,COEFTYPE,COEFSIGNALTYPE,DATASIGNALTYPE)

creates a design called NAME with COEFFS taps.

createPrimitiveFIR(NAME,PARAMSTRUCTURE), where PARAMSTRUCTURE is a MATLAB struct, allows you to pass the parameters in as a struct with any unset parameters reverting to the defaults listed below. The structure should have fields with the names of the parameters as below (case-insensitive).

createPrimitiveFIR(NAME, PARAMNAME1, PARAMVALUE1, PARAMNAME2, PARAMVALUE2, ...) is as above but with the parameters passed in as name-value pairs.

Parameter          Description / values

NAME               Name of model to create

COEFTYPE           Affects how the coefficients are stored, as follows:
                   Constant (default) - stored in constant blocks
                   Read - stored in RegField blocks and can be read via the bus interface
                   Write - stored in RegField blocks and can be written via the bus interface
                   Readwrite - stored in RegField blocks and can be read and written via the bus interface

COEFSIGNALTYPE     The Simulink type for coefficient values. It defaults to 'sfix16_En15'.

DATATYPE           The Simulink type for the input sources. It defaults to 'sfix16_En15'.

COEFFS             If this is an array these are the coefficients of the filter. The filter will have as many taps as there are coefficients.

TYPE               'single' creates a single rate FIR (default)
                   'decim' creates a decimating FIR
                   'interp' creates an interpolating FIR

SYMMETRY           'off' creates a non-symmetric FIR (default)
                   'on' creates a symmetric FIR
                   'anti' creates an anti-symmetric FIR

BAND               'full' creates a full-band FIR (default)
                   'half' creates a half-band FIR
                   num2str(X) creates a 1/X band FIR, e.g. '4' creates a quarter band FIR

DECHAN             false wires the resulting FIR directly to a scope
                   true wires the resulting FIR to a ChanView block which is then wired to a scope

CLOCKRATE          This is the clock rate in MHz. It defaults to 200.

SAMPLERATE         This is the rate of the data in MHz. It defaults to the same as CLOCKRATE except if interpolating, in which case it defaults to CLOCKRATE/2.

COMPAREMODELIP     If set to true, this generates an equivalent ModelIP filter alongside and creates assertion blocks to verify that both systems are equivalent.

MODELIPONLY        If set to true, a ModelIP FIR is created with no primitive FIR.

RUNCHECK           If set to true then the design is simulated immediately after creation.

QUARTUSCOMPILE     If set to true then the design is run through Quartus (requires RUNCHECK=true).

RAMTHRESHOLDBITS   This is the threshold set in the "CDelay RAM Block Threshold" parameter on the control block.

FAMILY             Device family. Accepted values are: Stratix, Stratix GX, Stratix II, Stratix II GX, Stratix III, Stratix IV, Cyclone II, Cyclone III, Arria II GX, Cyclone III LS

SPEEDGRADE         Device speed grade. Accepted values are: fast, medium and slow

REPLACEMODEL       If set to true, existing models with the same name will be closed and replaced.
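As an illustration of the struct calling form, the sketch below builds a parameter struct using field names from the table above. The model name 'my_fir' and the coefficient values are made up for the example, and the call is guarded so that it only runs where the createPrimitiveFIR script is on the path.

```matlab
% Field names come from the parameter table (case-insensitive according to the text).
% The coefficient values are invented for illustration.
firParams = struct();
firParams.coeffs     = [1 3 5 3 1] / 13;   % symmetric taps, consistent with SYMMETRY 'on'
firParams.symmetry   = 'on';
firParams.type       = 'single';
firParams.clockrate  = 200;                % MHz
firParams.samplerate = 50;                 % MHz
% Only call the generator when the DSPBA script is available.
if exist('createPrimitiveFIR', 'file')
    createPrimitiveFIR('my_fir', firParams);
end
```

Any parameter left out of the struct should fall back to the default listed in the table.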



5.1.2 createDSPBADesign

createDSPBADesign creates an empty primitive subsystem design with required blocks in place. The default parameters are:

defaults.dataInputs = 1;
defaults.dataOutputs = 1;
defaults.chanCount = 1;
defaults.sourceType = 'constant';
defaults.sourceValue = 'fixed';
defaults.sourceScale = 1;
defaults.sourceImpulseGap = 10;
defaults.sourceSignalType = 'sfix(32)';
defaults.wireUpValidAndChan = true;
defaults.dechan = false;
defaults.clockRate = 200.00;
defaults.sampleRate = 200.00;
defaults.scopeInputs = false;
defaults.primitiveSampleRate = 200.00;
defaults.primitiveChannels = 1;
defaults.subsystemNames = {'subsystem'};
defaults.subsystemTypes = {'prim'};
defaults.matchDelays = false;
defaults.filterReference = 'DSPBAFilters/SingleRateFIR';
defaults.filterParams = {'nInputRate', 'SampleRate', 'nchan', 'ChanCount', 'symmetry', 'Non Symmetrical', 'addr', '0'};
defaults.ramThresholdBits = '-1';
defaults.family = 'Stratix II';
defaults.replaceModel = false;
defaults.speedGrade = 'fast';

These are used if the function is called with just a design name to create, e.g. createDSPBADesign('foo') will create:



The defaults can be overridden by creating a corresponding struct of parameters and passing this as a second argument, e.g. createDSPBADesign('foo', myparams).

5.2 Using Vectors

5.2.1 Use of vectors: with ModelPrim blocks

The use of vectors has many advantages: making designs more parameterizable, speeding up simulation and simplifying the schematic. Vectors avoid cut-and-paste duplication in many instances and enable flexible designs which scale with input vector width.

This section illustrates the use of some vector features for building more parameterizable design components. Most also have an associated design example.

5.2.1.1 Matrix initialisation of vector memories

Demo: demo_dualmem_matrix_init

Both the dual memory and LUT primitive blocks can be initialized with matrix data.

This feature is useful in designs that handle vector data and require individual components of each vector in the dual memory to be initialized uniquely.

The addressable size of the dual memory is determined by the number of rows in the 2D matrix provided for initialisation. The number of columns must match the width of the vector data. So the nth column specifies the contents of the nth dual memory. Within each of these columns the ith row specifies the contents at the (i-1)th address (since the first row is address zero, the second row address 1 and so on).

The exception to this row / column interpretation of the initialization matrix is for 1D data, where the initialization matrix consists of either a single column or a single row. In this case the interpretation is flexible and maps the vector (row or column) into the contents of each dual memory (i.e. the previous behaviour, in which all dual memories have identical initial contents).
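The row / column convention can be illustrated with a small contents matrix. The sizes and values below are hypothetical and are chosen only so that the memory and address can be read back from each entry.

```matlab
% Hypothetical sizes: 4 addresses per memory, vector width 3.
depth = 4; vecWidth = 3;
contents = zeros(depth, vecWidth);
for n = 1:vecWidth
    contents(:, n) = 10 * n + (0:depth-1)';   % memory n holds 10n, 10n+1, ...
end
% Row i holds address i-1, so address 2 of the 3rd memory is row 3, column 3.
addr = 2; mem = 3;
value = contents(addr + 1, mem);
disp(contents);
```

A matrix like this would typically be built in the model's set-up script and referenced from the block's initial contents parameter.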

The demo_dualmem_matrix_init example shows use of this feature. It also uses complex values in both the initialisation and the data that is later written to the dual memory. The contents matrix is set up in the model’s set-up script, run on model initialization. Click on ‘Edit Params’ to see this.



5.2.1.2 Matrix initialization of LUT demo

Demo: demo_lut_matrix_init

LUTs (Look Up Tables) can be initialized in exactly the same way. The demonstration example feeds a vector of addresses to the primitive block such that each vector component is given a different address. This also shows LUTs working with complex data types.

The figure below shows the equivalent system, with each LUT initialized individually. Using the Matrix avoids having to demux – connect – and mux, so that parameterizable systems can be built.


Figure 4: Using a matrix allows simple initialization of a vector of LUTs each with different contents, without having to specify each separately



5.2.1.3 Vector initialization of sample delay demo

Demo: demo_sample_delay_vector

When the sample delay primitive block receives vector input, it is possible to independently specify a different delay for each of the components of the vector.

The demo_sample_delay_vector design example shows that one sample delay can replace what would have previously required a DeMUX-SampDelay-MUX combination.

Individual components may even be given zero delay resulting in a direct feed through of only that component. Care must still be taken to avoid algebraic loops if some components are chosen to be zero delays.

This of course only applies when vector data is being read and output. A scalar specification of delay length still has the prior behaviour of setting all the delays on each vector component to the same value. It is an error to specify a vector that is not the same length as the vector on the input port. A negative delay on any one component is also an error. However, as in the scalar case, it is allowable to specify a zero length delay for one or more of the components.

Figure 5: Vector initialization of SampleDelay when used with vectors (and equivalent version with individual sample delays)
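A behavioural sketch of the per-component delay is given below. It is a plain MATLAB model rather than the block itself; the input data is invented and the assumption that delayed outputs start at zero is made only for the illustration.

```matlab
% Rows are time samples, columns are vector components (made-up data).
x = [1 11 21; 2 12 22; 3 13 23; 4 14 24];
delays = [0 1 2];                       % one delay per vector component
% The two error conditions described in the text.
assert(numel(delays) == size(x, 2), 'delay vector must match input width');
assert(all(delays >= 0), 'negative delays are not allowed');
y = zeros(size(x));                     % outputs before data arrives assumed zero
for n = 1:size(x, 2)
    d = delays(n);
    y(d+1:end, n) = x(1:end-d, n);      % d = 0 is a direct feed-through
end
disp(y);
```

The first column passes straight through, while the second and third are delayed by one and two samples respectively.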

5.2.2 Use of vectors: Additional Libraries > Vector Util Library

Often the ability to build a vector parameterizable library component is stopped by the need for a parameterizable way to go to and from single connections to vectors, either by replication or selection; for example, replicating a single signal N times to form a vector. If you had to redraw and reconnect this whenever the desired vector width changed, the ability to parameterize would be lost. Fortunately it is fairly straightforward to use Simulink commands in the initialization of Masked Subsystems to do the parameterization and reconnection automatically.

There are some examples of this in the Vector Util Library. They all use standard Simulink commands for finding blocks, deleting blocks and lines, adding blocks and lines and positioning blocks. As such users could use this technique to build parameterizable utility functions themselves.

5.2.2.1 Expand Scalar

Expand Scalar just takes a single connection and replicates it N times to form a width N vector. This is done by passing on the width parameter to a Simulink mux under the mask, and using some standard Simulink commands to add the connection lines.


Figure 6: Expand Scalar block for parameterizable signal replication to vector

set_param(gcb, 'MaskSelfModifiable', 'on');
delete_line(find_system(gcb, 'SearchDepth', 1,...
    'FollowLinks', 'on', 'LookUnderMasks', 'all',...
    'FindAll', 'on', 'Type', 'line'));
if ~isempty(find_system(gcb, 'SearchDepth', 1,...
        'FollowLinks', 'on', 'LookUnderMasks', 'all',...
        'Name', 'Mux'))
    delete_block([gcb '/Mux']);
end
add_block('built-in/Mux', [gcb '/Mux'], 'Inputs', int2str(vecWidth), 'DisplayOption', 'bar');
for n=1:vecWidth
    add_line(gcb, 'scalar/1', sprintf('Mux/%d', n));
end
add_line(gcb, 'Mux/1', 'vector/1');


5.2.2.2 Vector Mux


Figure 7: Vector Mux Masked Subsystem for dynamic vector signal selection, reparameterizable in vector width

block = gcb;
set_param(block, 'MaskSelfModifiable', 'on');
delete_line(find_system(block, 'SearchDepth', 1,...
    'FollowLinks', 'on', 'LookUnderMasks', 'all',...
    'FindAll', 'on', 'Type', 'line'));
if ~isempty(find_system(block, 'SearchDepth', 1,...
        'FollowLinks', 'on', 'LookUnderMasks', 'all',...
        'Name', 'Mux'))
    delete_block([block '/Mux']);
end
if ~isempty(find_system(block, 'SearchDepth', 1,...
        'FollowLinks', 'on', 'LookUnderMasks', 'all',...
        'Name', 'Demux'))
    delete_block([block '/Demux']);
end
add_block('built-in/Demux', [block '/Demux'], 'Outputs', int2str(vecWidth), 'DisplayOption', 'bar');
add_block('DSPBAPrim/Mux', [block '/Mux'], 'value', int2str(vecWidth));
% position is [left top right bottom]
set_param([block '/Demux'], 'Position', [100, 100, 105, 100 + vecWidth * 20]);
set_param([block '/Mux'], 'Position', [200, 100 - 20, 270, 100 + vecWidth * 20]);
demuxPos = get_param([block '/Demux'], 'Position');
muxPos = get_param([block '/Mux'], 'Position');
midLineDemuxY = (demuxPos(2) + demuxPos(4)) / 2;
midLineMuxY = (muxPos(2) + muxPos(4)) / 2;
set_param([block '/in'], 'Position', [10, midLineDemuxY - 7, 40, midLineDemuxY + 7]);
set_param([block '/out'], 'Position', [400, midLineMuxY - 7, 430, midLineMuxY + 7]);
set_param([block '/sel'], 'Position', [10, demuxPos(1) - 20, 40, demuxPos(1) - 5]);
add_line(block, 'in/1', 'Demux/1');
add_line(block, 'sel/1', 'Mux/1');
for n=1:vecWidth
    add_line(block, sprintf('Demux/%d', n), sprintf('Mux/%d', n + 1));
end
add_line(block, 'Mux/1', 'out/1');


5.2.2.3 Tapped Delay Line


block = gcb;
% This block modifies itself
set_param(block, 'MaskSelfModifiable', 'on');
% delete all current connections
delete_line(find_system(block, 'SearchDepth', 1,...
    'FollowLinks', 'on', 'LookUnderMasks', 'all',...
    'FindAll', 'on', 'Type', 'line'));
% If there are any Zero-latency latches, delete them
zeroLatch = find_system(block, 'SearchDepth', 1,...
    'FollowLinks', 'on', 'LookUnderMasks', 'all',...
    'ReferenceBlock', 'DSPBAControl/latch_0L');
if ~isempty(zeroLatch)
    delete_block(zeroLatch);
end;
% If there are any Single-cycle-latency latches, delete them
singleLatch = find_system(block, 'SearchDepth', 1,...
    'FollowLinks', 'on', 'LookUnderMasks', 'all',...
    'ReferenceBlock', 'DSPBAControl/latch_1L');
if ~isempty(singleLatch)
    delete_block(singleLatch);
end;
% Delete the Simulink mux (by name)
if ~isempty(find_system(block, 'SearchDepth', 1,...
        'FollowLinks', 'on', 'LookUnderMasks', 'all',...
        'Name', 'Mux'))
    delete_block([block '/Mux']);
end;
% Now have an empty subsystem cleared of any previous parameterization
% add in the new blocks
if (numTaps < 1)
    % No taps
    add_line(block, 'e/1', 'qv/1', 'AUTOROUTING','ON');
    add_line(block, 'a/1', 'q/1', 'AUTOROUTING','ON');
else
    % First a zero-latency latch
    add_block('DSPBAControl/latch_0L', [block '/tap0'], 'Position', [130, 220, 190, 260]);
    % Then the remaining single-cycle-latency latches
    for n=1:(numTaps-1)
        add_block('DSPBAControl/latch_1L', [block '/tap' num2str(n)], 'Position', [130, 220 + 100*n, 190, 260 + 100*n]);
    end
    % Then the Simulink Mux
    add_block('built-in/Mux', [block '/Mux'], 'Inputs', int2str(numTaps), 'DisplayOption', 'bar');
    set_param([block '/Mux'], 'Position', [235, 190, 245, 190 + numTaps * 100]);
    % Position the ports
    set_param([block '/e'], 'Position', [65, 135, 95, 150]);
    set_param([block '/a'], 'Position', [180, 135, 210, 150]);
    set_param([block '/qv'],'Position', [73, 145 + 100*numTaps, 87, 175 + 100*numTaps]);
    set_param([block '/q'], 'Position', [290, 180 + numTaps * 50, 320, 195 + numTaps * 50]);
    % Now to connect - enable
    add_line(block, 'e/1', 'qv/1', 'AUTOROUTING','ON');
    add_line(block, 'e/1', 'tap0/1', 'AUTOROUTING','ON');
    for n=1:(numTaps-1)
        add_line(block, 'e/1', sprintf('tap%d/1', n), 'AUTOROUTING','ON');
    end
    % Now to connect - data chain
    add_line(block, 'a/1', 'tap0/2', 'AUTOROUTING','ON');
    for n=1:(numTaps-1)
        add_line(block, sprintf('tap%d/1', n-1), sprintf('tap%d/2', n), 'AUTOROUTING','ON');
    end
    % Now to connect - mux to output
    add_line(block, 'Mux/1', 'q/1');
    % Now to connect - chain to mux
    for n=1:numTaps
        add_line(block, sprintf('tap%d/1', n-1), sprintf('Mux/%d', n));
    end
end;


The Tapped Delay Line makes use of Latches from the Control library. See the section on Flow Control later. Again this is an auto-generating subsystem. The initialization script is shown above, alongside the subsystem it will generate for 4 taps. Note, gcb is Simulink shorthand for ‘get current block’ – i.e. get the current subsystem we’re parameterizing.

Note that vector signal input is not supported for this block.

5.2.3 Use of vectors: Other supported Vector blocks

5.2.3.1 Simulink Selector (partial support) for static Vector selection

The Simulink Selector block enables selection of some signals out of a vector of signals, including operations such as reordering. Currently only dialog selection is supported, which is equivalent to a static selection. Port selection is not currently supported. The index and input port size can be set via workspace variables.



Supported Features
    Number of input dimensions – 1 only
    Index mode: Zero-based or One-based
    Index Options:
        Select all
        Index Vector (dialog)
        Starting index (dialog)

Unsupported Features
    Multi-dimensional input
    The following Index Options are unsupported:
        Index Vector (port)
        Starting index (port)



5.2.3.1.1 Examples

5.2.3.1.1.1 Reverse Order


…note input size and Index parameterized by workspace variable selectWidth. For brevity other examples only show the Index Option entry.


5.2.3.1.1.2 Select every third wire

5.2.3.1.1.3 Select every third wire – reverse order

5.2.3.1.1.4 Interleaved replication


…where selectWidth is a workspace variable also used to set the vector width.


5.2.3.1.1.5 Just the first

5.2.3.1.1.6 First half of vector signals

… same as above, except Output Size is set, rather than inherited


This is a Simulink block, so you can refer to Simulink help for further information. In hardware, this just synthesizes to wiring.
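The index entries for the examples above were shown as dialog screenshots. As a rough guide, one-based index vectors of the following kind would produce the selections named in the headings. These expressions are plausible reconstructions rather than the exact entries from the original dialogs, and the reading of "interleaved replication" as each wire repeated twice in turn is an assumption.

```matlab
selectWidth = 8;                                   % workspace variable, as in the text
reverseOrder   = selectWidth:-1:1;                 % reverse order
everyThird     = 1:3:selectWidth;                  % every third wire
everyThirdRev  = fliplr(everyThird);               % every third wire, reverse order
interleavedRep = reshape([1:selectWidth; 1:selectWidth], 1, []);  % 1 1 2 2 3 3 ...
justFirst      = 1;                                % just the first
firstHalf      = 1:floor(selectWidth/2);           % first half of the vector
disp(interleavedRep);
```

Because each expression depends only on selectWidth, changing that workspace variable rescales the selection along with the vector width.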



5.3 Implementing Flow Control

The Advanced block-set tool encourages the use of Valid and Channel signals alongside data to indicate when data is valid and for synchronization. The user is encouraged to build designs using these to process valid data and ignore invalid data cycles in a streaming style that makes best use of the FPGA. This way designs can be built that run as fast as the data allows, that are not sensitive to latency or device fmax, and which can be responsive to back-pressure.

The style involves use of FIFOs for capture and flow control of valid outputs, Loops and For Loops for simple and complex nested counter structures, and ‘latches’ to enable only components with state – thus minimizing enable line fan-out, which can otherwise be a bottleneck to performance.

5.3.1 Flow control using latches

A latch normally has bad connotations for hardware designers. Here, however, these subsystems just synthesize to enabled flip-flops, so ‘flip-flop’ or ‘sample-and-hold’ might be other ways to describe these.

5.3.1.1 Additional Libraries > Control

Often designs require that signals are stalled or enabled. The approach of having an enable signal routed to all the blocks in the design can lead to high-fan-out nets, which become the critical timing path in the design. A way to avoid this is to enable only blocks with state, while marking output data as invalid when necessary.

To do this a number of utility functions have been created in the Additional Libraries > Control library. These are all just Masked Subsystems. Looking underneath shows the blocks used.

Aside: Note that some of these blocks make use of the Simulink Data Type Prop Duplicate block. This takes the data type of a reference signal ‘Ref’ and back-propagates it to another signal ‘Prop’. This is a good way of matching data-types without forcing an explicit type that can be used in other areas of your design.



5.3.1.1.1 Zero Latency Latch


For the zero latency latch the enable signal has an immediate effect on the output. While the enable is high the data passes straight through. When it goes low, the data input from the previous cycle is output and held.

5.3.1.1.2 Single-Cycle Latency Latch

For the single-cycle latency latch the enable signal affects the output on the following clock cycle.

These latches work for any data type, and for vector and complex.
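The difference between the two latches can be seen in a small behavioural model. This is a plain MATLAB sketch of the behaviour described above, not the masked subsystems themselves; the data, the enable pattern and the zero initial state are all assumptions.

```matlab
d  = [1 2 3 4 5 6];                % input data per clock cycle (made up)
en = [1 1 0 0 1 1];                % enable per clock cycle (made up)
held = 0; reg = 0;                 % initial stored values assumed to be zero
q0 = zeros(size(d));               % zero latency latch output
q1 = zeros(size(d));               % single-cycle latency latch output
for t = 1:numel(d)
    % Zero latency: the enable affects the output in the same cycle.
    if en(t), held = d(t); end
    q0(t) = held;
    % Single-cycle latency: output the register, then update it if enabled.
    q1(t) = reg;
    if en(t), reg = d(t); end
end
disp([q0; q1]);
```

With this input the zero latency output is 1 2 2 2 5 6, and the single-cycle output is the same sequence one cycle later, starting from the assumed initial value.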

5.3.1.1.3 Reset Priority Latch


Figure 8: Zero Latency Latch

Figure 9: Single-cycle latency latch


There are also 2 single-cycle latency latch subsystems for common operations for the valid signal, latching with set and reset. The SR latch gives priority to the reset input signal, whereas the SRlatch Priority Set gives priority to the set input signal. In both cases if set and reset inputs are both zero the current output state is maintained.

Table 1: Truth table for SRLatch (reset priority)

S  R  Q
0  0  Q
1  0  1
0  1  0
1  1  0

Figure 10: Set/Reset Boolean latch with reset priority

5.3.1.1.4 Set Priority Latch

Table 2: Truth table for SRlatch Priority Set

S  R  q
0  0  q
1  0  1
0  1  0
1  1  1

Figure 11: Set/Reset Boolean latch with set priority
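The two truth tables correspond to the following next-state expressions. This is a sketch in plain MATLAB for checking the tables, not the implementation used inside the library blocks; the function handle names are invented.

```matlab
% Next-state functions for the two Boolean latches (s, r, q are logical values).
srResetPriority = @(s, r, q) (s | q) & ~r;   % reset wins when S and R are both 1
srSetPriority   = @(s, r, q) s | (q & ~r);   % set wins when S and R are both 1
for q = [false true]
    for s = [false true]
        for r = [false true]
            fprintf('S=%d R=%d q=%d -> reset priority %d, set priority %d\n', ...
                    s, r, q, srResetPriority(s, r, q), srSetPriority(s, r, q));
        end
    end
end
```

The two functions differ only in the S = 1, R = 1 rows, which is the single difference between Table 1 and Table 2.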

5.3.1.2 Using latches to implement forward flow control

Example: demo_forward_pressure

Here we have a sequence of three FIR filters that stall when the valid signal is low, preventing invalid data polluting the data-path.

If we look inside one of these subsystems we see a regular filter structure, but with a delay line implemented in single-cycle latches; effectively an enabled delay line.

Note: We don’t need to enable everything in the filter (multipliers, adders etc), just those blocks with state (the registers), then take account of the output valid signal – pipelined alongside the logic by the tool - and look at the valid output data only.



Figure 4: Simple filter with enabled delay chain - full layout

Note here how the first latch is a zero latency latch, and all others are single cycle.



Figure 5: Output of demo_forward_pressure, stalled when valid is low – but without the need for a high fanout enable net.

Of course we could also use vectors to simplify the constant mults and adder tree, which would also speed up simulation.



Figure 6: Vectorised FIR stage



This can be improved further by making use of another Masked Subsystem utility block from the Vector Utils library – the TappedDelayLine. See above.

Figure 7: Vectorized FIR stage using TappedDelayLine from Vector Utils Library

5.3.1.3 Flow control using FIFOs

FIFOs can be used to build flexible, ‘self-timed’ designs insensitive to latency, and are an essential component in building parameterizable designs with feedback, such as those that implement back-pressure.

This section describes the basic operation of the FIFO block. A specific example is used to illustrate some of the behavior and the requirements for safe operation.

Note that the DSP Builder Advanced Blockset FIFO is a single clock FIFO in ‘show-ahead’ mode. That is the read input, r, is a read acknowledge which means ‘I have read the output data, q, from the FIFO, so you can get rid of it and show the next data output on q.’ The data presented on q is only valid if the output valid signal, v, is high.

5.3.1.4 FIFOs for flow control and back-pressure

Figure 8: demo_back_pressure example design

This design shows how back pressure from a downstream block can halt upstream processing. There are 3 FIRs that are designed using conventional DSPBA techniques (see Figure 4 above). Each FIR is followed by a FIFO that can buffer any data that is flowing through the FIFO. If the FIFO becomes half-full then the ready signal back to the upstream block is asserted. This prevents any new input (as flagged by valid) entering the FIR block. The FIFOs always show the next data if it is available and the valid signal is asserted high. This FIFO valid signal must be ANDed with the ready signal to actually consume the data at the head of the FIFO. If the AND result is high then we can consume data because (1) it is available and (2) we are ready for it.

Several blocks can be chained together in this way, and no ready signal has to feed back further than one block. This allows modular design techniques with local control to be used.

The delay in the feedback loop represents the lumped delay that will be spread throughout the FIR block. The delay must be at least as big as the delay through the FIR. This delay is not critical. Experiment with some values to find the right one. The FIFO must be able to hold at least this many data items after full has been asserted. This means that the full threshold must be at least this delay amount below the size of the FIFO (64-32 in this example).
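The sizing rule in the previous paragraph amounts to a simple subtraction, sketched below with the 64 and 32 from the example. The FIR latency value is hypothetical and is included only to show the second check.

```matlab
fifoDepth  = 64;                 % FIFO size from the example
loopDelay  = 32;                 % lumped delay in the feedback loop
firLatency = 20;                 % hypothetical latency through the FIR
% Full must assert while there is still room for loopDelay more items.
maxFullThreshold = fifoDepth - loopDelay;
% The lumped delay must cover the latency through the FIR.
delayIsSufficient = loopDelay >= firLatency;
fprintf('full threshold at most %d, delay sufficient: %d\n', ...
        maxFullThreshold, delayIsSufficient);
```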

The final block uses an external ready signal that will come from a downstream block in the system.

5.3.1.5 Some notes on safe operation of FIFOs

Figure 9: Simple FIFO model, used to illustrate considerations in safe usage

Where the user has to be careful is in acknowledging reading of invalid output data. This will be illustrated with an example. In the design shown above, the FIFO parameters are depth = 8, fill_threshold = 2, fill_period = 7. The resulting ModelSim behavior of the FIFO hardware:



Note that there is a three-cycle latency between the first write and valid going high. The q output has a similar latency in response to writes. The latency in response to read acknowledgements is only 1 cycle for all output ports. The valid out goes low in response to the first read, even though two items have already been written to the FIFO. This is because the 2nd write is not older than 3 cycles when the read occurs.

Note also that with the fill_threshold set to a low value, the t output can go high even though the v output is still zero. Also, the q output stays at the last value read when valid goes low in response to a read.

Problems can occur when no feedback is used on the read line, or if the feedback is taken from the t output instead with fill_threshold set to a very low value (< 3). A situation may arise where a read acknowledge is received shortly following a write but before the valid output goes high:

In this situation, the internal state of the FIFO doesn't recover for many cycles. Instead of attempting to reproduce this aberrant behavior, the Simulink implementation issues a warning when a read acknowledge is received while the valid output is zero. This intermediate state, between the first write to an empty FIFO and valid going high, highlights another aspect of FIFO behavior to be aware of: the input to output latency across the FIFO is different in this case. This is the only situation in which the FIFO behaves with a latency greater than 1 cycle. With other primitive blocks, which have a constant latency across each input to output path, the model designer never has to consider these intermediate states. This is not so for the FIFO.

This issue can be sufficiently mitigated by proper care when using the FIFO. The model needs to ensure that the read is never high when valid is low using the simple feedback as shown above. And if the read input is derived from the t output, ensure that a sufficiently high threshold is used. This is made explicit in the following points.

1. Due to differences in latency across different pairs of ports: from w to v is 3 cycles, from r to t is 1 cycle, from w to t is 1 cycle; it is possible to set fill_threshold to a low number (<3) and arrive at a state such that output t is high and output v is low. Should this situation arise, it is very important not to send a read acknowledge to the FIFO.


Figure 10: Hardware simulation output for example


Best practice is to ensure that when the v output is low, the r input is also low. A warning will appear in the Matlab command window if this is ever violated. If the read acknowledge signal is being derived via a feedback from the t output, ensure that fill_threshold is set to a sufficiently high number (for example, 3 or above). Likewise for the f output and the full_period.

2. It is allowable to supply vector data to the d input, and vector data on the q output will be the result. Vector signals on the w or r inputs are not supported, and the behavior is unspecified. The v, t, and f outputs are always scalar.
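The port latencies quoted in point 1 can be turned into a small behavioral sketch that shows how t can be high while v is still low. This is a deliberately simplified model under stated assumptions: no reads are issued, t is taken to mean "fill level at or above fill_threshold", and the function names are hypothetical rather than part of the blockset.

```c
#include <assert.h>
#include <stdbool.h>

#define W_TO_V_LATENCY 3  /* write-to-valid latency quoted in the text */
#define W_TO_T_LATENCY 1  /* write-to-threshold latency quoted in the text */

/* Count the writes that are at least `latency` cycles old at cycle `now`.
   writes[k] is 1 if a write was issued on cycle k, else 0. */
static int writes_visible(const int *writes, int now, int latency)
{
    assert(now >= 0 && latency >= 0);
    int total = 0;
    for (int cycle = 0; cycle + latency <= now; cycle++)
        total += writes[cycle];
    return total;
}

/* v output (no reads issued): high once any write is 3 or more cycles old. */
bool model_v(const int *writes, int now)
{
    return writes_visible(writes, now, W_TO_V_LATENCY) > 0;
}

/* t output (no reads issued): high once enough writes are 1 or more cycles old. */
bool model_t(const int *writes, int now, int fill_threshold)
{
    return writes_visible(writes, now, W_TO_T_LATENCY) >= fill_threshold;
}

/* Recommended guard: only acknowledge a read while v is high. */
bool guarded_read(bool v, bool read_request)
{
    return v && read_request;
}
```

In this model, two back-to-back writes with fill_threshold = 2 give t = 1 and v = 0 on cycle 2, which is exactly the state in which a read acknowledge must not be sent. With fill_threshold = 3, t cannot rise before cycle 3, by which time v is already high.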

5.3.2 Flow control using simple LOOP

Often designs require counters, or nested counters to implement, for example, indexing of multidimensional data. The Loop block provides a simple nested counter – equivalent to a simple software loop.

The Loop block maintains a set of counters that implement the equivalent of a nested for loop in software. The counted values range from 0 to limit values provided with an input signal. The dimension of the counter limit values vector determines the number of counters (nested loops).

When the go signal is asserted on the g input, limit-values are read into the block with the c input. When DSP Builder enables the block with the e input, it presents the counter values as a vector value at the q output each cycle. The valid output is set to 1 to indicate that a valid output is present.


Figure 19: Loop block and equivalent C code for two-dimensional count limit vector 'c'

For a two dimensional loop the equivalent C code to describe the general loop is:

for (int i = 0; i < c[0]; i++)
    for (int j = 0; j < c[1]; j++) {
        q[0] = i;
        q[1] = j;
        f[0] = (i==0);
        f[1] = (j==0);
        l[0] = (i==(c[0]-1));
        l[1] = (j==(c[1]-1));
    }

Loop block ports labelled in the figure: Go, Counter limit values and Enable (inputs); Valid, Counter output values, First loop flags and Last loop flags (outputs).


There are vectors of flags indicating when first values (output f) and last values (output l) occur.

A particular element in these vector outputs is set to 1 when the corresponding loop counter is set at 0 or at count-1 respectively.
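The same behavior can be expressed as a per-cycle state machine, which is closer to how the block steps in hardware than the nested for loop. The following is a minimal sketch under assumptions: the type loop_model, the functions loop_go and loop_step, and the bound MAX_DIMS are hypothetical names for illustration, block latency is ignored, and the model simply stops after the last count until the next go.

```c
#include <assert.h>
#include <stdbool.h>

#define MAX_DIMS 4  /* hypothetical bound chosen for this sketch */

typedef struct {
    int dims;
    int limit[MAX_DIMS];
    int count[MAX_DIMS];
    bool running;
} loop_model;

/* Go: latch the limit vector c and restart all counters at 0. */
void loop_go(loop_model *m, const int *c, int dims)
{
    assert(dims >= 1 && dims <= MAX_DIMS);
    m->dims = dims;
    m->running = true;
    for (int d = 0; d < dims; d++) {
        m->limit[d] = c[d];
        m->count[d] = 0;
        if (c[d] <= 0)
            m->running = false;  /* an empty loop produces no valid cycles */
    }
}

/* One enabled clock cycle. Returns the valid flag; when it is true,
   q, f and l hold the counter values and the first/last flags. */
bool loop_step(loop_model *m, int *q, bool *f, bool *l)
{
    if (!m->running)
        return false;
    for (int d = 0; d < m->dims; d++) {
        q[d] = m->count[d];
        f[d] = (m->count[d] == 0);
        l[d] = (m->count[d] == m->limit[d] - 1);
    }
    /* Advance like a nested for loop: the last index is the innermost. */
    int d = m->dims - 1;
    while (d >= 0 && ++m->count[d] >= m->limit[d]) {
        m->count[d] = 0;
        d--;
    }
    if (d < 0)
        m->running = false;  /* every counter wrapped: the nest is finished */
    return true;
}
```

For a limit vector of {2, 3}, calling loop_step repeatedly yields six valid cycles, ending with q = {1, 2} and both last flags set.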

The Loop block can be used to drive datapaths that operate on regular data, either from an input port or from data stored in a memory. The Kronecker product example (demo_kronecker) demonstrates this.

Figure 11: demo_kronecker; using nested loop to generate datapaths that operate on regular data.

5.3.3 Flow control using ForLoop blocks

The ForLoop block extends the basic loop, providing a more flexible structure capable of implementing all common loop structures, including for example triangular loops, parallel loops and sequential loops. Each ForLoop block manages a single counter together with a token-passing scheme that allows these counters to be linked in a variety of simple and not-so-simple ways.

Each ForLoop block has a static loop test parameter, which may be <=, <, > or >=. Loops that count up should use <= or <, depending on whether the limit value, supplied by the limit signal (see below) is considered to be within the range of the loop. Loops that count down should use >= or >.


Figure 11 annotation: Calculate Kronecker product for vectors A and B stored in memory.


The ForLoop block also has a large number of input and output signals, which are best understood by grouping them by function:

Loop parameterization inputs: the signals i, s and l set the initial value, step and limit value (respectively) of the loop. In conjunction with the loop test parameter, these signals control the operation of the loop. The loop parameter signals must be held constant while the loop is active (see below) but may be changed when the loop is inactive. This allows different activations of a ForLoop block to have different start or end points, which is useful for creating nested triangular loops, for example.

Loop outputs: the signal c is the count output from the loop. Its value is reliable only when the valid signal, v, is active.

Auxiliary loop outputs: the signals fl and ll are active on the first loop iteration and last loop iteration respectively. The signal el is active when the ForLoop block is processing an empty loop.

Enable input: the enable input, e, may be used to suspend and resume operation of the ForLoop block. When the loop is disabled the valid signal, v, will go low but no changes will be made to the internal state of the block. When the block is re-enabled, it will resume counting from the state at which it was suspended.

Token-passing inputs and outputs: the four signals ls (loop start), bs (body start), bd (body done) and ld (loop done) are used to pass a control token between different ForLoop blocks in order to create a variety of different control structures.

When a token is received on the ls port, the ForLoop block initializes; the loop counter is set to its initial value (specified by the i signal). When a token is received on the bd port, the loop counter is incremented by the step value (s). In either case, the new value of the counter is compared with the limit value (l) using the statically-configured loop test.

If the loop test passes, the ForLoop block outputs the control token on the bs port to initiate execution of the loop body and the valid signal, v, becomes active. If the loop test fails, the ForLoop block outputs the control token on ld port to indicate that execution of the loop is complete and v becomes inactive.

The ForLoop block becomes active when it receives a token on its ls port, and remains active until it finally outputs a token on its ld port. Changing any of the loop parameterization inputs (i, s or l) while the loop is active is not supported and will produce unpredictable results.
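The token handling described above can be summarized as a small C model. This is a sketch under assumptions, not the block's implementation: the type and function names are hypothetical, block latency and the enable input are ignored, and each call represents the arrival of one token on ls or bd.

```c
#include <assert.h>
#include <stdbool.h>

typedef enum { TEST_LE, TEST_LT, TEST_GE, TEST_GT } loop_test;
typedef enum { TOKEN_BS, TOKEN_LD } token_out;

typedef struct {
    loop_test test;  /* static loop test parameter */
    int c;           /* count output */
    bool v;          /* valid output */
    bool active;     /* true from a token on ls until the token leaves on ld */
} forloop_model;

static bool loop_test_passes(loop_test test, int count, int limit)
{
    switch (test) {
    case TEST_LE: return count <= limit;
    case TEST_LT: return count <  limit;
    case TEST_GE: return count >= limit;
    case TEST_GT: return count >  limit;
    }
    assert(!"unknown loop test");
    return false;
}

/* Compare the new counter value with the limit and route the token. */
static token_out forloop_evaluate(forloop_model *m, int limit)
{
    if (loop_test_passes(m->test, m->c, limit)) {
        m->v = true;
        return TOKEN_BS;      /* run the loop body */
    }
    m->v = false;
    m->active = false;
    return TOKEN_LD;          /* loop complete */
}

/* Token on ls: initialize the counter from i, then test against l. */
token_out forloop_on_ls(forloop_model *m, int i, int l)
{
    m->c = i;
    m->active = true;
    return forloop_evaluate(m, l);
}

/* Token on bd: add the step s, then test against l. */
token_out forloop_on_bd(forloop_model *m, int s, int l)
{
    assert(m->active);
    m->c += s;
    return forloop_evaluate(m, l);
}
```

With the <= test, i = 0, s = 1 and l = 2, the model emits the token on bs three times (counts 0, 1, 2) and then on ld. If the first test already fails, the token goes straight to ld, which corresponds to the empty-loop case.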



The latency of the ForLoop block is non-zero, which means that there is some overhead required to build nested loop structures. The second activation of an inner loop won't necessarily begin immediately after the end of the first activation. The user should therefore make sure the valid output of the loop block is used. Use of the ForLoop block is best illustrated through simple examples. Note that further, more complex examples of using FOR loops are given in reference to floating point.

5.3.3.1 Rectangular Nested Loop Example


Figure 21: Rectangular nested loop

for (uint8 countA = 0; countA <= 7; countA++) {
    for (uint8 countB = 0; countB <= 15; countB++) {
        qc1 = countA;
        qc2 = countB;
    }
}

Scope traces in the figure: CountB, CountA


In this example all initialization, step and limit values are constant. At the corners (that is, at the end of loops) there may be cycles where the count value goes out of range. Where this occurs, the output valid signal from the loop is low. This can be seen in the scope of the output.

Note the token-passing structure used in this example, which is typical for a nested loop structure. The bs port of the innermost loop (ForLoopB) is connected to the bd port of the same loop, so that the next loop iteration of this loop starts immediately after the previous iteration.

The bs port of the outer loop (ForLoopA) is connected to the ls port of the inner loop and the ld port of the inner loop loops back to the bd port of the outer loop. This ensures that each iteration of the outer loop runs a full activation of the inner loop before continuing on to the next iteration.

Finally, the ls port of the outer loop is connected to external logic and the ld port of the outer loop is left unconnected. This is typical of applications where the control token is generated afresh for each activation of the outermost loop.



5.3.3.2 Triangular Nested Loop Example

The initialization, step and limit values do not have to be constants. By using the count value from an outer loop as the limit of an inner loop, the counters effectively walk through a triangular set of indices.


Figure 22: Triangular nested loop

for (uint8 countA = 0; countA < 16; countA++) {
    for (uint8 countB = 0; countB <= countA; countB++) {
        qc1 = countA;
        qc2 = countB;
    }
}

Scope traces in the figure: CountB, CountA


Note that the token-passing structure for this loop is identical to that for the rectangular loop shown in the previous section; the only thing that needed to be changed was the parameterization of the loops.


5.3.3.3 Sequential loops

for (uint8 cOuter = 0; cOuter < 10; cOuter++) {
    for (uint8 cInnerA = 0; cInnerA < cOuter; cInnerA++) {
        qc1 = cInnerA;
    }
    for (uint8 cInnerB = cOuter; cInnerB < 10; cInnerB++) {
        qc2 = cInnerB;
    }
}

Scope traces in the figure: cInnerA, cInnerB, cOuter

Note that the valid signal is low when the qc2 value is temporarily out of the < 10 range.

In this example, we have two inner loops (InnerLoopA and InnerLoopB) both nested within the outer loop. To achieve this we daisy-chain the ld port of InnerLoopA to the ls port of InnerLoopB rather than (as previously) connecting it directly to the bd port of OuterLoop. This ensures that each activation of InnerLoopA is followed by an activation of InnerLoopB.
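The daisy-chained wiring can be sketched as a token that moves between ports, one hop per step. The following C model is illustrative only: the enum token_at, the struct run_totals and the function run_sequential are hypothetical names, the loop test is fixed to <, and block latency is ignored. It counts how many times each inner loop body runs.

```c
#include <assert.h>

/* Port that currently holds the control token. */
typedef enum { OUTER_LS, OUTER_BD, A_LS, A_BD, B_LS, B_BD, OUTER_LD } token_at;

typedef struct { int qc1_count; int qc2_count; } run_totals;

run_totals run_sequential(int n)
{
    assert(n >= 0);
    run_totals totals = { 0, 0 };
    int outer = 0, a = 0, b = 0;
    token_at token = OUTER_LS;

    while (token != OUTER_LD) {
        switch (token) {
        case OUTER_LS:
        case OUTER_BD:
            outer = (token == OUTER_LS) ? 0 : outer + 1;
            token = (outer < n) ? A_LS : OUTER_LD;  /* bs of OuterLoop -> ls of InnerLoopA */
            break;
        case A_LS:
        case A_BD:
            a = (token == A_LS) ? 0 : a + 1;
            if (a < outer) {
                totals.qc1_count++;   /* body of InnerLoopA: qc1 = a */
                token = A_BD;         /* bs wired back to its own bd */
            } else {
                token = B_LS;         /* ld of InnerLoopA -> ls of InnerLoopB */
            }
            break;
        case B_LS:
        case B_BD:
            b = (token == B_LS) ? outer : b + 1;
            if (b < n) {
                totals.qc2_count++;   /* body of InnerLoopB: qc2 = b */
                token = B_BD;
            } else {
                token = OUTER_BD;     /* ld of InnerLoopB -> bd of OuterLoop */
            }
            break;
        case OUTER_LD:
            break;
        }
    }
    return totals;
}
```

For the limit of 10 used in this example, InnerLoopA runs 0 + 1 + ... + 9 = 45 body iterations and InnerLoopB runs 10 + 9 + ... + 1 = 55.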

5.3.4 LOOP vs ForLoop

With two components (LOOP and ForLoop) available for building nested loops, which should the designer choose?

The advantages of the LOOP block are:

A single LOOP block can implement an entire stack of nested loops.

There are no wasted cycles when the loop is active but the count isn't valid.

The implementation cost is lower because there's no overhead for the token-passing scheme.

It accepts vector inputs.

The advantages of the ForLoop block are:

Loops may count either up or down.

It's possible to specify the initial value and the step, not just the limit value.

The token-passing scheme allows the construction of control structures that are more sophisticated than just nesting rectangular loops.

The conclusion is that when a stack of nested loops is the appropriate control structure (matrix multiplication would be an example of this) the best choice is a single LOOP block. When a more complex control structure is required multiple ForLoop blocks should be used instead.

5.4 Building System Components

In many systems the desire is to use the DSPBA design as a sub-component in a larger system design. The DSPBA component will be one part in a chain of sub-components all with streaming interfaces with which it must interact. The system will be composed in SOPC Builder, where the Avalon interfaces can be simply connected. The interfaces themselves are described by ‘Hardware Tcl’ (_hw.tcl) files.



Building system components that can be chained together in SOPC Builder is greatly simplified by defining Avalon interfaces (see footnote 1). DSP Builder Advanced builds a memory-mapped interface for the memory bus, and the Avalon-ST Input and Output blocks now allow the generation of the appropriate hw.tcl files to define Avalon-ST interfaces for the data plane.

This section refers to upstream and downstream components which are parts of the system outside of the DSPBA design.

Figure 12: Avalon Stream Library appears under Additional Libraries as the blocks in the Library are in fact just Masked Primitive subsystems

1 For specification of Avalon Streaming, see http://www.altera.com/literature/manual/mnl_avalon_spec.pdf




Figure 13: The Avalon Stream Interface Library

A design may have multiple Avalon-ST input and output blocks. In general, however, it will look something like this:

Figure 14: DSP design with Avalon-ST interfaces



where all paths across the ‘DSP Part’ must be registered in order to avoid algebraic loops.

The output of the DSPBA design is a source of Avalon Streaming data for downstream components. It supplies data (and corresponding valid, channel and start and end of packet information) and accepts as input from the downstream component(s) a Boolean flag as to whether the downstream block is ready to accept data.

The input of the DSPBA design is a sink of Avalon Streaming data for upstream components. It accepts data (and corresponding valid, channel and start and end of packet information) and provides as output to the upstream component(s) a Boolean flag as to whether the DSPBA component is ready to accept data.

When the hw.tcl file is generated, the name of the Avalon-ST masked subsystem block is used as the name of the interface.

The blocks themselves are just Masked Subsystems. You can look under the mask to see the implementation. If necessary you can break the library link and extend the definition further, by adding further ports that will be declared in the hw.tcl file, or by adding text that will be written unevaluated directly into the interface declaration in the hw.tcl file.

Note that these blocks do not enforce Avalon-ST behavior. What they do is encapsulate the common Avalon-ST signals into an interface, add FIFOs on the output (and if required on the input) to facilitate building designs supporting back-pressure, and declare the collected signals as an Avalon-ST interface in the hw.tcl file generated for the device level.



5.4.1 Avalon-ST Output

5.4.1.1 Output FIFO

The signals that interface to the external system are:

source_ready (input): indication from downstream components that they can accept source_data on this rising clock edge

source_valid (output): indicates that source_data, source_channel, source_sop and source_eop are valid

source_channel (output): channel number

source_sop (output): indicates start of packet

source_eop (output): indicates end of packet

source_data (output): the data to be output (which may be, or include, control data)



The signals that interface internally with the DSPBA design component are:

output_ready (output): indication from the output of the DSPBA component that it can accept sink_data on this rising clock edge

output_valid (input): indicates that output_data, output_channel, output_sop and output_eop are valid

output_channel (input): channel number

output_sop (input): indicates start of packet

output_eop (input): indicates end of packet

output_data (input): the output data (which may be, or include, control data)

The downstream system component may not be able to accept data and so may backpressure this block by forcing Avalon ST signal source_ready = 0.

However the DSPBA design may still have lots of valid outputs in the pipeline. These must be stored in memory. Along with the generation of the Avalon-ST source interface in the hw.tcl file, this is the purpose of this block.

The output data for the design is written into Data FIFO, along with Avalon ST signals channel, sop and eop being written into respective channel, SOP and EOP FIFOs.

The back-pressure signal (source_ready) from the downstream component should be connected to the ready port in this subsystem, so that the FIFOs are only read when the downstream block can accept data (read_fifo = 1) and there is data in the FIFO to output (fifo_empty_n = 1).

If the downstream component continually back-pressures this DSPBA design, these FIFOs will start to fill up. If data continues to be fed into the DSPBA component, the FIFOs will eventually overflow. This must not be allowed to happen, so when the FIFOs reach a certain fill level they assert the signal nearly_full = 1. This signal should be used to apply backpressure to the upstream component (i.e. by forcing the Avalon-ST signal sink_ready = 0), so that the upstream component stops sending more data and the FIFO does not overflow. The fill level at which nearly_full = 1 is asserted should be set to a value that depends on the latency of this DSPBA design. For example, if the design contains a single primitive subsystem and the ChannelOut component indicates a latency of L, then the nearly_full flag should be asserted at the latest when there are L free entries in the FIFO. Currently setting this threshold is a manual process: full_threshold <= (depth of FIFO) - L.
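The reason the threshold must leave L free entries can be checked with a small worst-case simulation. The sketch below is an illustration under assumptions, not a model of the generated hardware: the function name worst_case_fill and the bound MAX_LATENCY are hypothetical, the upstream component sends one sample on every cycle in which it is not back-pressured, backpressure takes effect in the same cycle, and the downstream component never reads.

```c
#include <assert.h>
#include <stdbool.h>

#define MAX_LATENCY 64  /* hypothetical bound chosen for this sketch */

/* Returns the FIFO fill level reached when the downstream side is stalled
   for `cycles` cycles and the design has `latency` cycles of pipelining. */
int worst_case_fill(int latency, int full_threshold, int cycles)
{
    assert(latency >= 1 && latency <= MAX_LATENCY);
    bool pipe[MAX_LATENCY] = { false };  /* samples in flight inside the design */
    int fill = 0;

    for (int n = 0; n < cycles; n++) {
        bool nearly_full = (fill >= full_threshold);
        if (pipe[latency - 1])
            fill++;                       /* oldest sample is written to the FIFO */
        for (int k = latency - 1; k > 0; k--)
            pipe[k] = pipe[k - 1];
        pipe[0] = !nearly_full;           /* upstream sends only while sink_ready is high */
    }
    return fill;
}
```

For a FIFO of depth 8 and a latency of 3, a threshold of 5 (depth minus L) fills the FIFO to exactly 8 entries, while a threshold of 6 would need 9 entries and so would overflow.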



5.4.2 Avalon-ST Input

The signals that interface to the external system are:

sink_ready (output): indicates to upstream components that the DSPBA component can accept sink_data on this rising clock edge

sink_valid (input): indicates that sink_data, sink_channel, sink_sop and sink_eop are valid

sink_channel (input): channel number

sink_sop (input): indicates start of packet

sink_eop (input): indicates end of packet

sink_data (input): the data (which may be, or include, control data)



The signals that interface internally with the DSPBA design component are:

input_ready (input): indication from the output of the DSPBA component that it can accept sink_data on this rising clock edge

input_valid (output): indicates that input_data, input_channel, input_sop and input_eop are valid

input_channel (output): channel number

input_sop (output): indicates start of packet

input_eop (output): indicates end of packet

input_data (output): the data (which may be, or include, control data)

5.4.3 Avalon-ST Input FIFO

Another version of the input interface includes FIFOs.

5.4.4 Extending the interface definition

These Avalon-ST interfaces are provided as Masked Subsystems. As such the user can look at the internals and make edits if required. To look under the mask, right click on the block and select ‘Look Under Mask’. Under the mask the user will see a primitive subsystem. The user is free to edit this, but will first have to break the library link. (Right click on the block and select ‘Link Options’ > ‘Disable Link’, then right click again and select ‘Link Options’ > ‘Break Link’.) If editing the interface blocks, the user should not edit the ‘Mask Type’ field. This is used internally to identify the subsystems defining the interfaces.

5.4.4.1 Adding further ports to the Avalon-ST blocks

Further ports can be added in the Avalon-ST Masked Subsystems. Internally these would have to be connected up by the user, most likely in the same fashion as the existing signals (for example, through FIFOs). If adding further input or output ports which are to be connected to the device-level ports, then these should be ‘tagged’ with the role the port will take in the Avalon-ST interface. Simulink ports are tagged via the Block Properties, General tab for the individual port. (This is used internally to get the port role, such as valid, ready, endofpacket or startofpacket, that the particular port will take in the interface.) Other ports you may want to add are ‘error’ and ‘empty’, for example.

5.4.4.2 Adding custom text

Any text written to the Description field of the Masked Subsystem (Block Properties, General tab on the subsystem block itself) will be written verbatim, with no evaluation, into the hw.tcl file immediately after the standard parameters for the interface and before the port declarations. It is the user’s responsibility to get the text of any addition correct.


Figure 26: Tag the Simulink Port internal to the Avalon-ST Masked Subsystem to define the port role


Figure 15: Example Block Properties GUI for Avalon-ST Masked Subsystem showing Description field



5.4.5 Restrictions on use

5.4.5.1 Intervening blocks

Although the Avalon-ST interface blocks can be put in different levels of hierarchy, no blocks (Simulink, ModelIP or primitive) should be placed between the interface and the device-level ports.

5.4.5.2 Interfaces with multiple data ports

The Avalon-ST specification only allows a single data port per interface. This means that adding further data ports, or even using a vector through the interface and device-level port (which creates multiple data ports), is not allowed.

To handle multiple data ports through a single Avalon-ST interface they must be packed together into a single (not vector or bus) signal, then unpacked on the other side of the interface. The maximum width for a data signal is 256 bits. The packing and unpacking can be done with BitCombine and BitExtract blocks, see below for example.
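As a software analogy for the BitCombine and BitExtract arrangement, two 8-bit data values can be packed into one 16-bit word and recovered on the other side. This is a sketch only: the function names are hypothetical, and placing d0 in the low byte and d1 in the high byte is an assumed ordering, not something the blocks mandate.

```c
#include <assert.h>
#include <stdint.h>

/* BitCombine analogy: d1 occupies bits 15..8, d0 occupies bits 7..0
   (this ordering is an assumption made for the example). */
uint16_t pack_data(uint8_t d0, uint8_t d1)
{
    return (uint16_t)(((uint16_t)d1 << 8) | d0);
}

/* BitExtract analogy: recover each field from the packed word. */
uint8_t unpack_d0(uint16_t packed)
{
    return (uint8_t)(packed & 0xFFu);
}

uint8_t unpack_d1(uint16_t packed)
{
    return (uint8_t)(packed >> 8);
}

/* Usage: a value packed on one side of the interface is recovered on the other. */
void pack_roundtrip_example(void)
{
    uint16_t word = pack_data(0x12, 0x34);
    assert(word == 0x3412);
    assert(unpack_d0(word) == 0x12 && unpack_d1(word) == 0x34);
}
```

The same idea extends to more fields, as long as the packed signal stays within the 256-bit maximum data width.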


# +-----------------------------------
# | connection point AStInputFIFO
# |
add_interface AStInputFIFO avalon_streaming end
set_interface_property AStInputFIFO errorDescriptor ""
set_interface_property AStInputFIFO maxChannel 0
set_interface_property AStInputFIFO readyLatency 0
set_interface_property AStInputFIFO ASSOCIATED_CLOCK clock
set_interface_property AStInputFIFO ENABLED true

add_interface_port AStInputFIFO sink_c channel Input 8
add_interface_port AStInputFIFO sink_d0 data Input 8
add_interface_port AStInputFIFO sink_d1 data Input 8
add_interface_port AStInputFIFO sink_eop eop Input 1
add_interface_port AStInputFIFO sink_ready ready Output 1
add_interface_port AStInputFIFO sink_sop sop Input 1
add_interface_port AStInputFIFO sink_v valid Input 1

Annotations on the listing:

Name from block: the interface name (AStInputFIFO).

Standard parameters: the set_interface_property lines.

Port names: from device-level ports which connect to Avalon-ST blocks.

Port roles: from Simulink port tags internal to Avalon-ST blocks.

Verbatim text from the Avalon-ST block’s Description field is written between the standard parameters and the port declarations.


5.5 Using ModelPrim subsystems

5.5.1 Interfaces as subsystem boundaries

Primitive subsystems allow users flexibility to build their own custom designs, while taking advantage of the optimizations applied by the tool. Optimizations operate hierarchically – that is each primitive subsystem or ModelIP block is optimized individually.

The boundary of a Primitive Subsystem is delineated by the primitive I/O blocks: ChannelIn, ChannelOut, GPIn or GPOut. A primitive subsystem should always include I/O blocks from this set.

A SynthesisInfo block, with style set to Scheduled should also be used at the same hierarchical level as these I/O blocks.

Further subsystems can be used within the Primitive subsystem. These are flattened and treated holistically with the primitive subsystem in the optimizations.

ModelIP blocks cannot be used inside primitive subsystems.


Figure 28: Packing and unpacking a vector into a single data connection. Figure labels: Pack (outside device level); Unpack (between interface and core design).


5.5.2 Interfaces as scheduling boundaries

The type of I/O boundary block used determines how the set of signals through it is scheduled during register pipeline insertion. I/O signals through the same I/O block (ChannelIn, ChannelOut, GPIn or GPOut) are scheduled together, i.e. they remain in sync. Using ChannelIn and ChannelOut allows specification of Advanced Blockset protocol valid and channel signals alongside the data. GPIn and GPOut are for other, general purpose, data which doesn’t necessarily have to be scheduled to start or end on the same clock cycle as other I/O signals.

So if you want all your signals to be pipelined such that inputs are all on the same clock cycle and outputs are all synchronized together on the same output clock cycle, use a single ChannelIn and a single ChannelOut block. This is the usual mode of operation. If your subsystem requires inputs appearing on different clock cycles, or outputs grouped on to different clock cycles, you can use multiple ChannelIns and ChannelOuts, or GPIns and GPOuts.

Note that to maintain cycle accuracy at a level outside the primitive subsystem, the pipelining inserted by the tool must be accounted for in simulation. This added latency is calculated by the scheduler and depends on factors such as vector widths, data types, and fmax requirements. So Simulink can only model this after the scheduler has run. Since each pipelining stage is added in slices, or cuts, through all parallel signals, this can be modeled by just adding a latency on the inputs or outputs.

Note that in some cases this accounting for the latency may give different results to hardware – in particular a) in the case of optimizing away parallel registering that just delays all signals by the same amount and b) in the case of multiple inputs and outputs where an input is scheduled after an output.

In these cases a warning is given:

“Warning: Negative IO correction on block <design>/<subsystem>/ChannelOut. Simulink will not match hardware.”

5.5.2.1 Optimizing away parallel registering

Suppose a primitive subsystem design had a line of registers on each input to output path (see the example below). These registers do not alter the function of what the primitive subsystem does – only alter when the output comes out – i.e. the latency.

Advanced Blockset seeks to optimize the hardware it produces, while minimizing the latency required to achieve the same functionality and still achieving the desired fmax. SampleDelays that specify relative differences in latency on paths are therefore important, but cuts of SampleDelays that specify identical latency on all paths are effectively ignored.



Figure 16: Simple primitive subsystem with line of identical registers across all input to output paths

As such, the registers in the above design which delay all paths by 1 cycle would be optimized away – as they can be removed without changing the functionality of the subsystem (only its latency).

In this case the generated hardware contains no registers, and so has zero latency, but the Simulink model of the subsystem as a whole will have a latency of 1 from the SampleDelays. The ChannelOut block cannot add negative latency to simulate the actual latency of the hardware produced, and so displays the ‘Negative IO correction’ warning quoted above.


If you want the latency to be higher than the optimal (lowest latency) solution the way to do this is to use the constraints on the SynthesisInfo block:



Here the latency for this subsystem is constrained to be greater than or equal to 1. Now on generation the vertical line of pipelining is preserved in hardware, such that simulation and hardware both have latency 1 and there is no error message. (Note: the SampleDelays aren’t needed at all in this case; it is sufficient to use the constraint alone to add latency.)

Here is another example.

Figure 17: Another simple primitive subsystem with line of identical registers across all input to output paths



Remember that adding vertical cuts of delays across all paths is not the recommended way of adding latency – if that is really what is desired. The way to do this would be to constrain the latency to the higher value using the SynthesisInfo block.

DSP Builder Advanced solves for the minimal pipelining delay required to attain the target fmax while maintaining the relative delay differences implied by the user-inserted SampleDelay blocks.

Adding vertical cuts of extra SampleDelays across all signals does not change the relative latency of the signals so does not alter the optimization problem, or the hardware that would be produced.

In the design shown there are no relative differences in the latencies between the paths (each has 10), so the optimization is free to remove them, and the hardware produced for this design will be the same as if no SampleDelays were specified. In this example, to be sure of attaining the desired fmax the multiplier needs a latency of 4 clock cycles (that is 4 pipelining registers), and this will be balanced by delays inserted on the other paths (the valid and the channel signal) to maintain no relative latency difference.

The final solution will therefore be that the HDL will have a multiplier with 4 pipelining stages, to meet timing on the data path, and 4 registers on each of the valid and channel paths, such that the outputs maintain their relative synchronization.

So to simulate this cycle accurately in Simulink the tool would have to delay the signal by a total of 4 clock cycles across the subsystem. However, the Simulink schematic design already has SampleDelays that are delaying the signal by 10 clock cycles – and the tool cannot correct for negative delays. It can’t reset these Sample Delays in simulation, or jump the simulation forward 6 cycles (applying a negative correction) in the ChannelOut to compensate.
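The arithmetic behind the latency correction can be sketched as follows. This is an illustration of the reasoning above rather than the tool's code: the struct io_correction and the function channel_out_correction are hypothetical names, and the model assumes the correction is simply the hardware latency minus the latency already present in the Simulink schematic.

```c
#include <assert.h>
#include <stdbool.h>

typedef struct {
    int correction;  /* delay added in the ChannelOut simulation model */
    bool mismatch;   /* true when a negative correction would be needed */
} io_correction;

io_correction channel_out_correction(int hardware_latency, int model_latency)
{
    assert(hardware_latency >= 0 && model_latency >= 0);
    io_correction r;
    int needed = hardware_latency - model_latency;
    r.mismatch = (needed < 0);               /* Simulink will not match hardware */
    r.correction = r.mismatch ? 0 : needed;  /* a delay cannot be negative */
    return r;
}
```

For the design discussed here, a hardware latency of 4 against 10 cycles of SampleDelays would need a correction of -6, so the sketch returns a correction of 0 with the mismatch flag set. Without the SampleDelays, the correction would be 4 and simulation would match hardware.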

What you will get now is zero latency correction applied in the ChannelOut block, together with the ‘Negative IO correction’ warning quoted earlier.


The above designs break two ‘good design rules’ that should be followed when designing with Advanced Blockset:

a) Don’t try to pipeline the design yourself – this is what the tool does for you using mathematical linear programming techniques.

b) Don’t try to synchronize the output of different parallel subsystems using explicit delays. Use FIFOs, as this will give a more device-portable, fmax target independent design.



5.5.2.2 Multiple Scheduled Outputs

Figure 18: Simple primitive subsystem with independently synchronized outputs

Consider the above model. Here there is a single set of inputs, scheduled to be input on the same clock cycle. Any pipelining inserted ensures these signals stay in sync. For the particular fmax chosen in this example the multipliers require a latency of 3 (that is, 3 pipelining stages). Parallel signals are pipelined by the same amount to keep the signals in sync. Thus on ‘ChannelOut’ there are three cycles of latency added to all signals through this I/O block and scheduled to be output together. For the ‘ChannelOut1’ signals, however, the latency of the path is 6 clock cycles (3 for each multiplier), so the output here has a latency of 6, appearing three cycles after the outputs from ‘ChannelOut’.

The latencies are modelled by delaying the signals through ‘ChannelOut’ by 3 cycles, and through ‘ChannelOut1’ by 6 clock cycles.

5.5.2.3 Multiple Scheduled Inputs & Outputs

Things are a little more complex in the case of multiple inputs and multiple outputs. Models such as demo_back_pressure have more than one input block and more than one output block in the single primitive subsystem.

Models with multiple input blocks imply that in hardware the data entering the system will not be synchronised between the different points of entry.



Figure 19: Simple primitive design with multiple input groups and multiple output groups

Simulink is unaware of the 1 cycle latency needed by the adders. This has to be worked out in the scheduler and corrected for afterwards during simulation using simulation delays in the input/output blocks A, B, C, X and Y.

This model implies the following constraints:

A + X = 1
A + Y = 0
B + X = 2
B + Y = 1
C + X = 1
C + Y = 1

This system of equations has no solution! However, this situation never actually occurs because of the scheduling constraints that have been enforced over the primitive subsystem. The scheduler actually inserts an extra sample delay between Add2 and ChannelOut X which makes the whole system solvable.
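The inconsistency can be checked mechanically. A minimal sketch, assuming each path latency is the sum of one input correction and one output correction as in the constraints above: such a system is solvable only if the difference between the X and Y columns is the same for every input. The function names are hypothetical, and the adjusted table assumes the extra delay raises only the C-to-X latency to 2, which is a guess at the effect of the inserted delay rather than something stated in the text.

```c
#include <assert.h>
#include <stdbool.h>

#define N_OUT 2  /* output blocks X and Y */

/* path[i][j] is the required latency from input i to output j.
   Solvable as in[i] + out[j] only if all rows differ by constants. */
bool corrections_solvable(const int path[][N_OUT], int n_in)
{
    assert(n_in >= 1);
    for (int i = 1; i < n_in; i++)
        for (int j = 1; j < N_OUT; j++)
            if (path[i][j] - path[i][0] != path[0][j] - path[0][0])
                return false;
    return true;
}

/* The six constraints from the text: rows A, B, C and columns X, Y. */
bool text_example_solvable(void)
{
    static const int path[3][N_OUT] = { {1, 0}, {2, 1}, {1, 1} };
    return corrections_solvable(path, 3);
}

/* Assumed effect of the extra sample delay: C + X becomes 2. */
bool adjusted_example_solvable(void)
{
    static const int path[3][N_OUT] = { {1, 0}, {2, 1}, {2, 1} };
    return corrections_solvable(path, 3);
}
```

In the original table, inputs A and B require X - Y = 1 while input C requires X - Y = 0, so no assignment exists. Under the assumed adjustment, A = 0, B = 1, C = 1, X = 1, Y = 0 satisfies every constraint.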


Figure 33: Schedule, after pipelining for above primitive subsystem (schedule diagram placing A, B, C, X and Y across Cycle 0, Cycle 1 and Cycle 2)
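The inconsistency of the constraints for Figure 19 can be checked mechanically. The following is a minimal Python sketch, not part of DSP Builder: each constraint is read as "input correction + output correction = path latency", the first input is pinned to 0, and the remaining values are propagated. The function name `solve_delays` is hypothetical. The modified system, in which only the C + X constraint grows by one cycle, is an assumption chosen purely for illustration; the text states only that an extra sample delay is inserted between Add2 and ChannelOut X.

```python
def solve_delays(constraints):
    """Try to satisfy every 'inp + out = latency' constraint.

    The first input is fixed at 0 and values are propagated through the
    constraints. Returns a dict of corrections, or None if inconsistent.
    """
    values = {constraints[0][0]: 0}
    changed = True
    while changed:
        changed = False
        for inp, out, total in constraints:
            if inp in values and out not in values:
                values[out] = total - values[inp]
                changed = True
            elif out in values and inp not in values:
                values[inp] = total - values[out]
                changed = True
    # Every constraint must hold with the propagated values.
    for inp, out, total in constraints:
        if inp not in values or out not in values:
            return None
        if values[inp] + values[out] != total:
            return None
    return values


# Constraints exactly as listed in the text.
original = [("A", "X", 1), ("A", "Y", 0),
            ("B", "X", 2), ("B", "Y", 1),
            ("C", "X", 1), ("C", "Y", 1)]

# Hypothetical system after one extra cycle on the C -> X path only.
modified = [("A", "X", 1), ("A", "Y", 0),
            ("B", "X", 2), ("B", "Y", 1),
            ("C", "X", 2), ("C", "Y", 1)]

print(solve_delays(original))
print(solve_delays(modified))
```

The original system fails because the A and B constraints both require X - Y = 1, while the C constraints require X - Y = 0.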


Note that there are some configurations of input and output blocks that, if accounting for latency at the inputs and outputs, would require latency correction of negative size, implying discarding the first few samples. Consider:

Figure 20: example of a multiple I/O primitive subsystem where an input is scheduled after an output

Assume that multipliers need 3 times the delay of adders. Any solution for latency correction requires a negative-sized buffer on at least one of the input/output blocks. A bigger example of this is demo_back_pressure.

Discarding the first few samples is not possible in the discrete simulation that Simulink uses; however, this can still be done for the stimulus files. When this occurs, a warning message is given to make the user aware that their Simulink model will not behave exactly as the hardware does.

5.5.3 ModelPrim subsystem design styles to avoid

This section describes some design styles to avoid. Usually this is because either hardware behavior will be determined by reset behavior or the hardware will be inefficient.

5.5.3.1 Primitive Subsystems with logic not driven from clocked inputs



Figure 21: Blocks not driven from clocked blocks

In this model, the counter will start straight out of reset. Perhaps you don’t care about the specific phase of the repeating count with respect to your data, but even so, because of the out-of-reset start, the design simulation in Simulink may not match that of the generated hardware. A better option here would be to start the counter off the valid signal, rather than the constant. If the counter should repeat without stopping after the first valid, this could be achieved by adding a zero-latency latch from the Additional Libraries > Control library into this connection.

Similarly, loops driven without clocked inputs should be avoided for the same reason.

5.5.4 Common Problems

5.5.4.1 Timed Feedback Loops

Care also has to be taken with feedback loops generally, in particular in providing sufficient delay around the loop.

In this model, there is a cycle containing two adders with only a single sample delay, which is not sufficient. In automatically pipelining designs, a schedule of signals through the design is created. From internal timing models, the tool knows how fast certain components, such as wide adders, can be run, or rather how many pipelining stages they require in order to run at a chosen clock frequency. The tool must account for the required pipelining while not changing the order of the schedule. The single sample delay is not enough to pipeline the path through the two adders at the chosen clock frequency. The tool is also not free to insert more pipelining here, as this would change the algorithm, accumulating every n cycles rather than every cycle. The scheduler detects this and gives an appropriate error indicating how much more latency would be required in the loop for it to run at the chosen clock rate. In designs with multiple loops this error may be hit a few times in a row as each loop is balanced and resolved.
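Why extra delay in a feedback loop changes the algorithm can be seen with a small behavioral model. The Python sketch below is an illustration only, not generated hardware or a DSP Builder API; the function name `accumulate` is hypothetical. It models y[t] = x[t] + y[t - d] for a loop delay of d cycles.

```python
def accumulate(samples, loop_delay):
    """Model y[t] = x[t] + y[t - loop_delay], with y = 0 before the start."""
    out = []
    for t, x in enumerate(samples):
        prev = out[t - loop_delay] if t >= loop_delay else 0
        out.append(x + prev)
    return out


ones = [1] * 8
single = accumulate(ones, 1)   # one running sum, updated every cycle
triple = accumulate(ones, 3)   # three interleaved sums, each updated every 3 cycles

print(single)  # [1, 2, 3, 4, 5, 6, 7, 8]
print(triple)  # [1, 1, 1, 2, 2, 2, 3, 3]
```

With a loop delay of 3 the same structure no longer produces a single running total, which is why the scheduler reports an error rather than silently adding latency inside the loop.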

5.5.4.2 Loops, clock cycles and data cycles

It is important not to confuse clock cycles and data cycles - particularly in relation to feedback loops where, for example, you may want to accumulate with previous data from the same channel. The Multi-Channel IIR Filter design (demo_iir) shows an example of feedback accumulators processing multiple channels. Note that in this example each consecutive data sample on any particular channel is 20 clock cycles apart. This number is derived from clock rate / sample rate.

Suppose we only had one channel, at a low data rate. This case is explored in the Folded IIR Filter Demonstration (demo_iir_fold2).

This model implements a single-channel infinite impulse response (IIR) filter with a subsystem built from primitive blocks folded down to a serial implementation.

The design of the IIR is identical to that in the multi-channel example, demo_iir. Note that as the channel count is 1, the lumped delays in the feedback loops are all one. This would present a scheduling problem if running at full speed, i.e. with new data arriving every clock cycle, as the lumped delay of one cycle would not be enough to allow for pipelining round the loops. However, here the data is arriving at a much slower rate than the clock rate, in this example 32 times slower (the clock rate in the design is 320 MHz, and the sample rate is 10MHz) - giving us 32 clock cycles between each sample.

One way to design for this would be to set the lumped delays to 32 cycles long - the gap between successive data samples; but this would obviously be very inefficient both in terms of register use and in underutilized multipliers and adders. Instead, we use folding to schedule the data through a minimum set of fully utilized hardware.

Set the SampleRate on both the ChannelIn and ChannelOut blocks to 10MHz. This informs the synthesis for the Primitive Subsystem of the schedule of data through the design: even though the clock rate is 320MHz, each data sample per channel is arriving only at 10MHz. The produced RTL is folded down, in terms of multiplier use, at the expense of extra logic for signal muxing and extra latency.
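The arithmetic behind the folding opportunity can be worked through numerically. The sketch below is plain Python, not a DSP Builder API. The helper names are hypothetical, the multiplication counts (5 and 40) are made-up inputs, and the sharing estimate is an idealization that assumes one multiplication per multiplier per clock cycle and ignores muxing and scheduling overhead.

```python
import math


def cycles_per_sample(clock_hz, sample_hz):
    """Clock cycles between consecutive samples on one channel."""
    if clock_hz % sample_hz:
        raise ValueError("clock rate is not a whole multiple of the sample rate")
    return clock_hz // sample_hz


def min_multipliers(mults_per_sample, clock_hz, sample_hz):
    """Idealized lower bound on multipliers needed after folding."""
    return math.ceil(mults_per_sample / cycles_per_sample(clock_hz, sample_hz))


gap = cycles_per_sample(320_000_000, 10_000_000)
few = min_multipliers(5, 320_000_000, 10_000_000)
many = min_multipliers(40, 320_000_000, 10_000_000)
print(gap, few, many)  # 32 1 2
```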



5.5.5 ModelPrim Blocks outside primitive subsystems

ModelPrim blocks can also be used outside of primitive subsystems (i.e. outside subsystems delineated with ModelPrim I/O blocks), but these will not be scheduled or pipelined for fmax. A common use is for constants; inside the synthesizable part of the design, a ModelPrim constant block should always be used in preference to a Simulink constant block.

As with inside primitive subsystems, logic dependent on initial behavior out of reset for synchronizing should be avoided. For example, in Figure 36 the logic driving the sample delay was intended to produce a single pulse after reset. A better solution is to use a 1-cycle latch on the valid signal, as in Figure 37.


Figure 36: Synchronizing logic dependent on reset (bad)

Figure 37: Synchronizing logic dependent on valid (good)


5.5.6 Convert blocks vs. specifying output types via dialog

There are two ways to change data type with Advanced Blockset primitives:

1. preserving real world value (Convert block)

2. preserving bit pattern (set ‘Output data type mode’ on any other primitive)

The Convert block converts a data type while preserving the real-world value, optionally rounding and saturating where this is not possible. Convert blocks can therefore sign extend or discard bits as necessary. For example, the following Convert block will discard 11 LSBs and sign extend the MSBs by 27 bits while preserving the real-world value, as far as possible.

Here for example you can see that some truncation has occurred.


Figure 38: Convert block changes data-type preserving real-world value (as far as possible), with options to round and saturate. It can grow the number of bits - sign extending or zero-padding where appropriate


Similarly you can convert the same number of bits while preserving the real world value (as far as possible, subject to rounding and saturation) (see below).


Figure 39: Convert block changes data-type preserving real-world value (as far as possible), with options to round and saturate


We can contrast that with setting the type using the “Specify via dialog” option on any other primitive. We can do this without any generated hardware by using a zero-length Sample Delay for example.


Figure 40: Setting an output type explicitly via a Primitive dialog for any other blocks changes type while preserving the bit pattern. The real world value will generally be scaled in such cases


WARNING: If you want to reinterpret the bit pattern and also discard bits, note that if the type specified via dialog in the Output data type mode is smaller than the natural (inherited) output type, MSBs (most significant bits) will be discarded. In the example below the output type is set to ufix14_En1 and the top two MSBs are discarded, giving a very different result.

Users should NOT set the type via dialog to be bigger than the natural (inherited) bit pattern: no zero-padding or sign extension will be done, and the result may generate hardware errors due to signal width mismatches. Any sign extension or zero padding should always be done with a Convert primitive block.


Figure 41: Setting an output type via dialog and reducing the bit-width will discard the top, 'Most Significant' bits
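The difference between the two kinds of type change can be modelled on raw integers. The following Python sketch is an illustration, not the blockset's implementation: the helper names are hypothetical, the rounding (truncation toward minus infinity) and saturation behavior of `convert` are assumed settings, and the 16-bit input pattern 0xC00A is a made-up example rather than the value shown in the figures.

```python
import math


def real_value(bits, width, frac_bits, signed):
    """Interpret the low `width` bits as a fixed-point number."""
    bits &= (1 << width) - 1
    if signed and bits >> (width - 1):
        bits -= 1 << width
    return bits / (1 << frac_bits)


def convert(value, width, frac_bits, signed):
    """Value-preserving change of type: truncate, then saturate (assumed options)."""
    raw = math.floor(value * (1 << frac_bits))
    lo = -(1 << (width - 1)) if signed else 0
    hi = (1 << (width - 1)) - 1 if signed else (1 << width) - 1
    raw = min(hi, max(lo, raw))
    return raw & ((1 << width) - 1)


def reinterpret(bits, width):
    """Bit-pattern-preserving change of type: keep the low bits, drop MSBs."""
    return bits & ((1 << width) - 1)


# Value preserved: 3.75 held as sfix8_En2, then widened to sfix16_En4.
src = convert(3.75, 8, 2, True)
widened = convert(real_value(src, 8, 2, True), 16, 4, True)

# Bit pattern preserved: a ufix16_En0 pattern narrowed to ufix14_En1.
pattern = 0xC00A
narrowed = reinterpret(pattern, 14)
saturated = convert(real_value(pattern, 16, 0, False), 14, 1, False)

print(real_value(widened, 16, 4, True))      # 3.75
print(real_value(narrowed, 14, 1, False))    # 5.0 (top two MSBs discarded)
print(real_value(saturated, 14, 1, False))   # 8191.5 (saturated)
```

Narrowing by reinterpretation silently drops the two most significant bits, whereas the value-preserving conversion saturates at the largest representable value.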


Often you may want to do both – sign extend or zero pad, then reinterpret the bit pattern (or vice-versa), in which case you can combine these methods.

In some instances all that may be desired is to set a specific format so that types can be resolved, in feedback loops for example. This is where setting a type via a dialog on an existing primitive, or inserting a zero-cycle Sample Delay with the type specified, is useful (we choose a zero-cycle delay as this generates no hardware and just casts the type interpretation).

Note that in some cases you may just want to ensure the data-type is equal to some other signal's data-type. In such cases you can force the data-type propagation using a Simulink data-type propagation block. An example of this is in the Latch masked subsystems from the Control library covered above.

6 Debugging Designs

(to be added)



7 Floating Point

The Advanced Blockset ModelPrim blocks now support floating point ‘single’ and ‘double’ data types. The tool generates a parallel data-path optimized for Altera FPGAs from the Simulink model.

In many cases, a 50% reduction in logic resources and a 50% reduction in latency are possible, over using discrete IEEE754 operators. The Advanced Blockset achieves these improvements by optimizing over the entire data-path: considering the sequence of operations. By using the hard logic resources (DSP Blocks) effectively, and by grouping functions in the data-path, many steps in IEEE754 implementations effectively become redundant.

The input and output values to and from the data-path will be IEEE754 compliant for floating point numbers, but a different format is used internally. There will likely be some small differences between the output generated by the data-path, and the Simulink simulation of the input file. As the Advanced Blockset generated data-path generally uses greater mantissa and exponent precision than IEEE754, many of these errors will be because floating point operations are non-associative.

7.1 Support Outline

7.1.1 Blocks

The Advanced Blockset now supports the single and double precision floating point data types. This section details the initial limitations of the support. First note that only primitives support floating point; ModelIP blocks such as FIR, NCO and CIC do not yet.

7.1.1.1 Support in existing ModelPrim blocks

Most of the existing primitive blocks now support floating point, as do the I/O blocks. These are shown below.



Single and double can be found as selections on the data type field.

Figure 22: Output data type selection UI - showing single and double as options

Note that in most cases, the output data type is used to fix the type, rather than convert.

7.1.1.2 Conversion between floating and fixed

For conversion to and from floating point, ReinterpretCast, Convert or BitExtract can be used. For example, floating point format numbers can be converted to a flat 32-bit representation using BitExtract for transmission through to a higher level DSP Builder Standard design. ReinterpretCast generates no hardware; it just changes how the bit pattern is interpreted and propagated by the tool.
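What a "flat 32-bit representation" of a single precision number looks like can be illustrated in software. The Python sketch below uses the standard `struct` module to reinterpret the bits without changing them; it is an analogy for the reinterpretation described above, not a model of the blocks themselves, and the helper names are hypothetical.

```python
import struct


def float_to_bits32(x):
    """Reinterpret an IEEE754 single as an unsigned 32-bit integer."""
    return struct.unpack(">I", struct.pack(">f", x))[0]


def bits32_to_float(bits):
    """Reinterpret an unsigned 32-bit integer as an IEEE754 single."""
    return struct.unpack(">f", struct.pack(">I", bits))[0]


def fields(bits):
    """Split the flat word into (sign, biased exponent, fraction)."""
    return bits >> 31, (bits >> 23) & 0xFF, bits & 0x7FFFFF


one = float_to_bits32(1.0)
minus_two = float_to_bits32(-2.0)
print(hex(one), fields(one))              # 0x3f800000 (0, 127, 0)
print(hex(minus_two), fields(minus_two))  # 0xc0000000 (1, 128, 0)
```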

7.1.1.3 New floating point ModelPrim blocks

In addition new floating point blocks have been added to the primitive library to support common math functions. Most of these have multiplier-based implementations and have a size typically about 3 to 4 times that of a corresponding floating point multiply.

Further details of each block can be found in the help for the specific block.

Figure 23: New floating point primitive blocks in 10.1 Advanced Blockset

Note that in the first release the trigonometric functions support single precision only, and that none of these blocks support fixed point. If desired they can be used in otherwise fixed point designs by converting from and to floating point either side of the block.

7.1.2 Interaction with other features

7.1.2.1 Folding

Currently the folding feature is not enabled for use with floating point blocks.



7.1.2.2 Pipelining flexibility within floating point operations

Currently some of the floating point functions, such as the trigonometric functions, are of fixed latency. As such the depth of pipelining within these does not vary with target fmax. These functions are targeted at high clock rates. Flexible pipelining control within floating point operations will be supported in a future release.

7.1.2.3 Accuracy, Testing & Automatic Test-benches

The Advanced Blockset uses IEEE floating point format at the inputs and outputs. Simulation is handled on the primitive blocks themselves using Matlab single and double precision arithmetic. Internally, within the hardware generated for the data path, more bits of precision are used. It is therefore possible that the hardware result, as seen when running a hardware simulation, may occasionally differ in the least significant bit from the Simulink simulation.

In the automatic test-benches therefore, we compare the numeric results to within a tolerance.

7.1.2.3.1 Understanding arithmetic accuracy

Note that while only a difference in the least significant bit would normally be expected, because floating point arithmetic is non-associative it is possible to get larger differences.
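Non-associativity is easy to demonstrate in any IEEE754 arithmetic. The short Python sketch below uses double precision (Python's native float) rather than the tool's internal formats, so it illustrates the principle only: the same operands summed in a different order give different results, sometimes by far more than one least significant bit.

```python
# Small difference: one unit in the last place.
left = (0.1 + 0.2) + 0.3
right = 0.1 + (0.2 + 0.3)
print(left == right)  # False

# Large difference: the 1.0 is absorbed when added to -1e16 first.
big_first = (1e16 + -1e16) + 1.0
small_first = 1e16 + (-1e16 + 1.0)
print(big_first, small_first)  # 1.0 0.0
```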

With floating point arithmetic, algorithms that are iterative and have large dynamic range are now implementable. In such algorithms it is possible that the designs themselves may be ill-conditioned, that is, sensitive to very small errors or differences.

For example consider the problem of QR decomposition and back/forward substitution using an ill-conditioned matrix. Matlab functions are available for checking for such cases. For example, cond() gives the condition number of a matrix, which measures the sensitivity of the solution of a system of linear equations to errors in the data. This gives an indication of the accuracy of the results from matrix inversion and the linear equation solution, with condition values near 1 indicating a well-conditioned matrix.
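To make the idea of a condition number concrete, the following Python sketch computes it by hand for a 2x2 matrix. This is an illustration under stated assumptions: it uses the 1-norm because it is simple to compute without a linear algebra library, whereas Matlab's cond() uses the 2-norm by default, so the numbers differ from cond(A) while telling the same story. The function name `cond1_2x2` and the example matrices are hypothetical.

```python
def cond1_2x2(m):
    """1-norm condition number ||A|| * ||inv(A)|| of a 2x2 matrix."""
    (a, b), (c, d) = m
    det = a * d - b * c
    if det == 0:
        return float("inf")
    # The 1-norm is the largest absolute column sum.
    norm = max(abs(a) + abs(c), abs(b) + abs(d))
    # inv(A) = (1/det) * [[d, -b], [-c, a]]
    inv_norm = max(abs(d) + abs(c), abs(b) + abs(a)) / abs(det)
    return norm * inv_norm


well = cond1_2x2([[1.0, 0.0], [0.0, 1.0]])      # identity
ill = cond1_2x2([[1.0, 1.0], [1.0, 1.0001]])    # nearly singular
print(well, ill)
```

The identity matrix has a condition number of 1, while the nearly singular matrix has one of roughly 4 x 10^4, meaning small input errors can be amplified by about that factor in the solution.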

For single precision, the HDL internal floating-point representation (which uses a 32-bit mantissa of which 26 bits are in use most of the time) is compared to Simulink single precision (24-bit mantissa, counting the sign bit).

At each individual step, it can be confirmed that the floating point additions and subtractions are being performed correctly, and that the differences are no larger than what one would expect.

Relatively large differences can still occur when subtracting numbers that are very close in value (i.e. such that after alignment of mantissas to equalize the exponent, the subtraction would zero out the first 6 to 16 most significant bits). Here we may introduce a deviation between the output of the Simulink model and the generated HDL, largely due to numerical round-off.
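The effect of subtracting nearly equal numbers can be reproduced in software. The Python sketch below emulates IEEE single precision by rounding through the standard `struct` module; it does not model the tool's wider internal format, and the operand values are chosen only for illustration.

```python
import struct


def to_f32(x):
    """Round a Python float (double) to the nearest IEEE754 single."""
    return struct.unpack("f", struct.pack("f", x))[0]


a, b = 1.0000001, 1.0
wanted = 1e-7                                 # intended difference
single = to_f32(to_f32(a) - to_f32(b))        # single precision result
rel_err = abs(single - wanted) / wanted

print(single)   # 1.1920928955078125e-07, i.e. 2**-23
print(rel_err)  # about 0.19
```

Each operand is accurate to about 1 part in 10^7, yet the difference is wrong by about 19%, because the leading bits cancel and only the rounding error of the inputs remains.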

The measuring of results against that produced using IEEE single precision computation has to be understood in terms of this accuracy, and not in terms of absolute error. Given the generally higher precision of the internal floating point format used by the generated HDL, it could be that the Simulink single precision answer is more "wrong" in this case – but the reason for potential differences should be understood.

Such numeric differences can be exacerbated by an ill-posed problem – for example by the ill-conditioning of the matrix used in forward/backward substitution. Here such differences can be iterated and multiplied. For this case, typically the way to address this ill-conditioning is to improve it via pivoting at the QR decomposition stage, which involves reordering matrix columns.

Users designing floating point algorithms should therefore understand concepts such as ill-conditioning and use Matlab features such as cond() to check their design and in the analysis and understanding of results.

7.1.2.4 Device Support

While all devices are supported, the hardware generated is currently most optimized for Stratix III, IV and V. Future releases will also optimize for other device families.



7.2 Floating Point Format

The internal word formats are important to understanding the generated hardware, should you need to debug it. The word formats are different during addition and subtraction, multiplication, division, and functions. Cast blocks are automatically inserted by the tool to convert from one format to another.

In the case of single precision, the internal mantissa is 32 bits wide with 1 sign bit, and the exponent is 10 bits.

7.2.1 Single Precision Word Formats

Internally a number of extended floating point formats are used across different floating point operations.

7.2.1.1 IEEE 754

minimum positive (subnormal) value: 2^{-149} ≈ 1.4 × 10^{-45}

minimum positive normal value: 2^{-126} ≈ 1.18 × 10^{-38}

maximum representable value: (2 - 2^{-23}) × 2^{127} ≈ 3.4 × 10^{38}

In IEEE754 format, the sign bit is in the most significant bit, followed by an 8 bit exponent, followed by the 23 bit fractional part of the mantissa.
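The limits of the IEEE754 single precision format can be confirmed from the bit layout just described. The Python sketch below builds the three boundary bit patterns and reinterprets them with the standard `struct` module; the helper name is hypothetical.

```python
import struct


def bits32_to_float(bits):
    """Reinterpret an unsigned 32-bit integer as an IEEE754 single."""
    return struct.unpack(">f", struct.pack(">I", bits))[0]


min_subnormal = bits32_to_float(0x00000001)  # exponent 0, fraction 1
min_normal = bits32_to_float(0x00800000)     # exponent 1, fraction 0
max_value = bits32_to_float(0x7F7FFFFF)      # exponent 254, fraction all ones

print(min_subnormal, min_normal, max_value)
```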

7.2.1.2 Internal Single Precision Floating Point Number

In addition to IEEE754 used at the subsystem boundaries and memories, there are two internal single precision formats; a signed one for addition and subtraction, and another unsigned for multiplication and division. Both formats have a 32 bit mantissa followed by the 10 bit exponent.

Signed Single Precision Format - Addition and subtraction

Unsigned Single Precision Format - Multiplication and division



Also there are 3 flag bits for Saturation (Inf), Zero, and ‘Not a Number’ (NaN).

7.2.1.2.1 Addition and Subtraction Format

For addition and subtraction operations, the format upon conversion from IEEE754 single precision is:




2^{-536} ≈ 4.4 × 10^{-162}
2^{-510} ≈ 2.98 × 10^{-154}
(32 - 2^{-26}) × 2^{511} ≈ 2.1 × 10^{155}

The format is just fixed point, plus an exponent. Conversion from IEEE is then easy – just pad with sign and zeros. Conversion from this format back to IEEE is harder, requiring detection of sign, use of absolute values, counting leading zeros, shifting etc. This is all done internally by the tool in the generated hardware.

Adding numbers together is also simple, with word growth into the overflow bits.

These four overflow bits (one is sign) allow for 16 un-normalized additions to feed into a single node without overflow. Underflow may happen more quickly, due to bit cancellation, but the effects of underflow are reduced by normalizing more often where necessary, again handled in the generated hardware.
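The headroom claim can be checked with integer arithmetic. The Python sketch below assumes that a normalized IEEE single significand (1.f, 24 bits) is placed in the sfix32_En26 mantissa with its binary point aligned at 26 fractional bits, which is an interpretation of the format name rather than something stated explicitly in the text. The constant and function names are hypothetical.

```python
WIDTH = 32
FRAC_BITS = 26
MAX_RAW = (1 << (WIDTH - 1)) - 1   # largest positive sfix32 raw value

# Largest normalized single significand, 2 - 2**-23, scaled to En26.
max_significand_raw = ((1 << 24) - 1) << (FRAC_BITS - 23)


def fits(n_terms):
    """True if n_terms worst-case significands sum without overflow."""
    return n_terms * max_significand_raw <= MAX_RAW


print(fits(16), fits(17))  # True False
```

Sixteen worst-case terms sum to just under 32, which still fits in the signed 32-bit mantissa; a seventeenth would overflow.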


Mantissa format: sfix32_En26 (32 bits)


7.2.1.2.2 Multiplication and division format

The multiplier has a slightly different input number format. A fully normalized multiplier input format for the 32 bit mantissa is a signed number:




2^{-540} ≈ 2.8 × 10^{-163}
2^{-510} ≈ 2.98 × 10^{-154}
(2 - 2^{-30}) × 2^{511} ≈ 1.34 × 10^{154}

The multiplier input is always normalized to prevent overflow. If there is significant underflow in the part of the data-path feeding the multiplier, the number could be very small. If the other number is very small as well, the multiplier could produce a zero output, as the new mantissa will be expected in the top half of the multiplier output.

In the internal format, the sign bit is part of the mantissa. The mantissa is a 32 or 36 bit signed number, with the entire mantissa (including the implied ‘1’) rather than just the fractional part. The exponent follows the mantissa.

In addition, two bits are always associated with every internal floating point number; a saturation signaling bit and a zero signaling bit. Rather than calculating an infinity or zero condition at every operation, the functions forward saturation and zero conditions detected at the input of the data-path. These are then combined with the conversion (cast) back to IEEE754 at the output of the data-path to determine special conditions.


Mantissa format: sfix32_En30 (32 bits)


7.2.2 Double Precision Word Formats

Generally, the double precision word formats are analogous to the single precision word formats.

7.2.2.1 IEEE




2^{-1074} ≈ 4.9 × 10^{-324}
2^{-1022} ≈ 2.2 × 10^{-308}
(2 - 2^{-52}) × 2^{1023} ≈ 1.8 × 10^{308}

In IEEE754 format, the sign bit is in the most significant bit, followed by an 11 bit exponent, followed by the 52 bit fractional part of the mantissa.



7.2.2.2 Internal Double Precision Floating Point Number

In addition to IEEE754 used at the subsystem boundaries and memories, there are two internal double precision formats: a signed one for addition and subtraction, and an unsigned one for multiplication and division. In the signed format, the 64 bit signed mantissa is followed by the 13 bit exponent, while the unsigned format has the 54 bit mantissa followed by the 13 bit exponent.

Signed Double Precision Format - Addition and subtraction

Unsigned Double Precision Format - Multiplication and division

The saturation and zero signaling bits operate in the identical way to the single precision case.

Also there are 3 flag bits for Saturation (Inf), Zero, and ‘Not a Number’ (NaN).

7.2.2.2.1 Addition and Subtraction Mantissa

A signed 64 bit mantissa is used internally. The mantissa from the IEEE format becomes part of the sfix64_En58 signed fractional number:




2^{-4152} ≈ 1.3 × 10^{-1250}
2^{-4094} ≈ 3.8 × 10^{-1233}
(32 - 2^{-58}) × 2^{4095} ≈ 1.7 × 10^{1234}


Mantissa format: sfix64_En58 (64 bits)


As with the single precision mantissa, there are four overflow bits (i.e. 4 additional integer bits compared to IEEE) so that 16 additions can feed into any node without overflow. There are six underflow (guard) bits.

7.2.2.2.2 Multiplication, Division and Function Mantissas

The multiplier and divider have the same format, which is different from the signed mantissa.


Mantissa format: sfix54_En52 (54 bits)





2^{-4146} ≈ 8.5 × 10^{-1249}
2^{-4094} ≈ 3.8 × 10^{-1233}
(32 - 2^{-52}) × 2^{4095} ≈ 1.67 × 10^{1234}

The sign bit is packed with the mantissa, but the multiplication or division operation is performed on an unsigned 54 bit mantissa. As with single precision, the function library mantissa is the same as the division mantissa, except that some functions only have a valid positive output.

The mantissa is 54 bits wide, consisting of a leading “01” and a 52 bit fractional part. The exponent is 13 bits wide, and is signed. As with the single precision internal format, the additional width is used for local overflow and underflow: i.e. the exponent can exceed 2046 locally and be less than 0 locally before normalization. As with the IEEE754 format, the exponent is offset, where a value of 1023 denotes 1 (2^{0}), and 0 denotes 2^{-1023}. De-normalized numbers are not supported, but cases where a node is temporarily less than 2^{-1023} can be accommodated if the node increases to 2^{-1022} before the next conversion to an IEEE754 number (i.e. an output).



7.2.3 Floating Point Type propagation

Cast blocks are automatically inserted to convert between formats, optimal for the type of operation. Below is an example:

Input and output are always in IEEE 754 format, single or double.

The IEEE format is propagated through to memory. Note that memory always stores data in IEEE format, even in feedback loops.

Multipliers can take IEEE format and can generate multiplier format.

In this example the 2nd multiplier produces add-format.

The adder needs add-format and generates add-format too.

The output is IEEE; a cast operation is inserted just before the output.



7.3 Special considerations when using floating point

Algorithms are often hand folded down to reduce the total resources used, while maintaining the required data throughput.

For example, most folded algorithm implementations assume single-cycle accumulators, which permit partial calculations to be performed in adjacent clock cycles, and for the control to be written in a natural way.

However, to meet high fmax, floating point accumulators are at least 6 cycles for single-, and 10 cycles for double-precision. This then requires a rethink of how such algorithms should be implemented.

A delay-line adder-tree is a typical structure in DSP designs. But with the latency required for floating point, this could be quite sizable in resources, and would also add latency to the overall calculation. If calculations are performed ‘out-of-order’, however, we can often build a more hardware-efficient implementation, at the expense of thinking carefully about the control.

The goal when designing with floating point in the Advanced Blockset is to build simple designs that are still efficient. The following sections describe a set of structures that can be used for efficient floating point design, and algorithmic transformations to build them automatically. They cover:

Using FIFO based flow control to eliminate need for state-machines

Data-flow structures for processing iterative algorithms

Latency insensitive implementation

These techniques apply to simple designs, as well as to more complex linear algebra functions such as Cholesky and QR Decomposition. They may also be applied to fixed-point designs.

7.3.1 Flow Control, latency hiding and avoiding data dependencies

FIFOs are used to provide self-timed control. Rather than relying either on cycle-counting or on state-machines, FIFOs offer simple controlled access to memories. The aim is to have the floating point arithmetic running as fast as it can. Rather than issuing a command to ‘start processing’ and then waiting for the latency of the calculation before being able to use the result, the arithmetic unit runs continually in advance. Results are continually pushed onto the back of the FIFO queue and pulled from the front by the downstream process. If this queue becomes too big (the FIFO is getting full), this is fed back to stall the processing for a while. The stall must be raised early enough that no results are lost: the FIFO must still have room to store any results currently in mid-calculation.
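The stall-early rule can be illustrated with a cycle-level behavioral model. The Python sketch below is not DSP Builder code: the function name `run`, the FIFO depth, the pipeline latency and the consumer pattern are all hypothetical values chosen for illustration. The producer issues one item per cycle into a fixed-latency pipeline unless stalled, and the stall is raised while there are still as many free FIFO slots as there are pipeline stages.

```python
from collections import deque


def run(depth, latency, n_items, consumer_ready):
    """Producer -> fixed-latency pipeline -> FIFO -> consumer."""
    assert depth > latency, "threshold must leave room to make progress"
    fifo = deque()
    pipeline = deque([None] * latency)   # results still in mid-calculation
    received = []
    next_item = 0
    max_occupancy = 0
    cycle = 0
    while len(received) < n_items:
        # Stall early: keep `latency` free slots for in-flight results.
        stall = len(fifo) >= depth - latency
        if not stall and next_item < n_items:
            pipeline.append(next_item)
            next_item += 1
        else:
            pipeline.append(None)        # pipeline bubble
        result = pipeline.popleft()
        if result is not None:
            if len(fifo) >= depth:
                raise OverflowError("FIFO overflow: a result was lost")
            fifo.append(result)
        max_occupancy = max(max_occupancy, len(fifo))
        if fifo and consumer_ready(cycle):
            received.append(fifo.popleft())
        cycle += 1
    return received, max_occupancy


# Consumer accepts a result only every third cycle, so the FIFO fills up.
received, max_occupancy = run(depth=16, latency=5, n_items=40,
                              consumer_ready=lambda c: c % 3 == 0)
print(received == list(range(40)), max_occupancy)
```

Because the threshold accounts for the pipeline latency, every result arrives in order and the FIFO never exceeds its depth, without the producer ever needing to know exactly when the consumer will read.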



An example of this can be seen in the Mandelbrot demonstration design, where such units are used together.

Note that pipelining cannot add extra latency around loops – only balance and redistribute existing algorithmic latency. Therefore, although we do not care particularly about the latency round the loop, we have to specify sufficient delay round it in the design that the pipelining solver will be able to redistribute it to meet timing without needing to add further delay. In the Mandelbrot example, this is seen as the ‘loop slack’ sample delays in each loop.

7.3.1.1 Example: Floating Point Mandelbrot Set calculation

This example plots the Mandelbrot set for a defined region of the complex plane.

A complex number c is in the Mandelbrot set if the sequence

z_{n+1} = z_n^2 + c

remains bounded. That is, if the value remains finite when repeatedly squared and added to the original number. Further, we can shade values of c depending on the speed of divergence.
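The iteration above can be sketched in a few lines of software. The Python example below is a reference model for the arithmetic only, not the hardware design: the starting value z_0 = 0, the escape radius of 2 and the iteration limit are the usual conventions for this calculation and are assumptions here, and the function name is hypothetical.

```python
def escape_iterations(c, max_iter=100, bailout=2.0):
    """Count iterations of z -> z*z + c until |z| exceeds the bailout.

    Returns max_iter if the point has not diverged by then, in which
    case it is treated as belonging to the set.
    """
    z = 0j
    for n in range(max_iter):
        z = z * z + c
        if abs(z) > bailout:
            return n + 1
    return max_iter


print(escape_iterations(0j))       # 100: stays at 0, bounded
print(escape_iterations(-1 + 0j))  # 100: alternates between -1 and 0
print(escape_iterations(1 + 0j))   # 3: 1, 2, 5 diverges
```

The returned count is what would be used to shade a point according to its speed of divergence.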


Figure 44: Use of FIFOs (and loops) to control running of floating point calculations without explicitly waiting for the start-to-finish calculation latency. The result can feed into similar downstream processes. Figure labels: GO; Control, loop or count; Floating Point Math; FIFO of results; Stall?; Use result when ready.


Single precision floating point complex numbers are used.

One thing to note is that the latency of the system is longer when performing floating point calculations than it would be for the corresponding fixed point calculations. You therefore can’t afford to wait around for partial results to be ready if you want to achieve maximum efficiency. Instead you must design to keep the floating point math calculation engines of your algorithm busy and fully utilized. In the summary below you can see there are two floating point math subsystems: one for scaling and offsetting pixel indices to give a point in the complex plane, and the other to do the main square-and-add iteration operation.

For this simple design, the total latency is approximately 25 clock cycles - depending on target device and clock speed – not excessive; but long enough that it would be very inefficient to wait around for partial results.

Instead we have the circulation of data through the iterative process controlled by FIFOs. The FIFOs ensure that if a partial result is available for a further iteration in the z_{n+1} = z_n^2 + c progression, then that point is worked on; otherwise a new point (new value of c) is started. Thus a full flow of data is maintained through the floating point arithmetic. This main iteration loop can exert back-pressure on the new point calculation engine. If new points are not being read off the ‘CommandQueue’ FIFOs quickly enough, such that they fill up, the loop iteration over points will be stalled. In this way we don’t explicitly signal the calculation of each point when it is required (and then pay the penalty of waiting around through the latency cycles before we can use it). Nor do we attempt to calculate this latency exactly in clock cycles and issue ‘generate point’ commands the exact number of clock cycles before we need them, which would take two compiles to do and would have to be changed each time we re-targeted the device or changed the target clock rate. Instead we calculate the points as fast as we can from the start and catch them in a FIFO. Only if the FIFO starts to get full do we catch this, a sufficient number of cycles ahead of being full that we can stop the calculation upstream without loss of data. This is a self-regulating flow that mitigates latency while remaining flexible.

Not designing algorithm implementation around the latency and availability of partial results would lead to significant inefficiencies. If you’re not careful, data dependencies in processing can stall processing.

There are several other things of note in this design.


Figure 45: Flow control for Mandelbrot calculation

Figure labels:

in: Generate pixel coordinates and corresponding complex number (unless stalled).

Store new pixel coordinates and corresponding complex numbers until required.

Iterate again with a previous point, or start processing a new complex point? If a previous point is ready for its next iteration, choose that; else start the iterations for the next point in the queue.

z_{n+1} = z_n^2 + c

Increment iteration count.

Finished with this point? (out)

Hold iteration data here until it is ready for the next iteration.

Key: floating point math; FIFO flow control; mux.


1. The 'FinishedThisPoint' signal is used as the valid. Thus although the system constantly produces data on the output, only when we have finished a point do we mark the data as valid. Downstream components can then just process valid data, just as the enabled subsystem in the design test-bench captures and plots the valid points.

2. In both feedback loops, we need to allow sufficient delay for the scheduler to redistribute as pipelining. In feed-forward paths pipelining can be added without changing the algorithm itself, just the timing of the algorithm. But in feedback loops, insertion of delay can alter the meaning of an algorithm. (Think, for example, of adding N cycles of delay to an accumulator loop: this would then accumulate N different interleaved sums, each updating every N clock cycles.) So in loops we have to give the scheduler in charge of pipelining for timing closure enough ‘slack’ in the loop to be able to redistribute this delay to meet timing, while not changing the total latency round the loop, and thus ensuring the function of the algorithm is unaltered. Such ‘slack’ delays can be seen in the top level of the synthesizable design in the feedback loop controlling the generation of new points, and in the FeedBackFIFO subsystem controlling the main iteration calculation.

These slack delays are set to the minimum possible delay that satisfies the tool’s scheduling solver using the Minimum Delay feature on the SampleDelays.


Figure 46: Insertion of sufficient lumped 'SampleDelay' to allow for pipelining.


The SampleDelay is set to the minimum latency that satisfies the schedule. This value is solved as part of the integer linear programming problem used to find an optimum pipelining and scheduling solution for the design.

Delays can be grouped into numbered ‘Equivalence Groups’ to match other delays. In the Mandelbrot_S example, the single delay around the coordinate generation loop is in one equivalence group, and all the slack delays round the main calculation loop are in another equivalence group. The equivalence group field allows any Matlab expression that evaluates to a string.

The actual delay that is used is displayed on the SampleDelay block.
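The effect of delay inside a feedback loop, described in point 2, can be illustrated with a small behavioural model. This is only a sketch in Python, not DSPBA code: the function name `delayed_accumulator` and the parameter `loop_delay` (the total number of registers round the loop) are assumptions made for illustration.

```python
def delayed_accumulator(inputs, loop_delay):
    """Accumulator whose feedback path holds `loop_delay` registers (illustrative model)."""
    pipeline = [0] * loop_delay      # values currently in flight round the loop
    outputs = []
    for x in inputs:
        fed_back = pipeline.pop(0)   # value that has completed one trip round the loop
        total = fed_back + x
        pipeline.append(total)
        outputs.append(total)
    return outputs


ones = [1] * 6
print(delayed_accumulator(ones, 1))  # [1, 2, 3, 4, 5, 6]: one running sum
print(delayed_accumulator(ones, 3))  # [1, 1, 1, 2, 2, 2]: three interleaved sums
```

With a loop delay of 3 the same hardware behaves as three independent accumulators, each updated every third cycle. This is why the total latency round a loop must stay fixed while the scheduler moves the slack delay around within it.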

3. The FIFOs operate in showahead mode; that is, they display the next value to be read. The 'read' signal is a read acknowledgement, i.e. a signal to say 'I've read the output value, you can now discard it and show me the next'. Note also that multiple FIFOs are used with the same control, so they will be full and present valid output at the same time. Thus we only need the output control signals from one of the FIFOs and can ignore the corresponding signals from the other FIFOs.

4. As floating point simulation is not bit-accurate to the hardware, some points in the complex plane may take fewer or more iterations to complete in hardware than in the Simulink simulation. This means that the results, each produced when we have decided we are finished with a particular point, may come out in a different order. We therefore have to build a test-bench mechanism that is robust to this. To do this we use the test-bench override feature detailed in the appendix. We set the condition on mismatches to ‘Warning’ and use the Run All Testbenches block to set an import variable, which brings the ModelSim results back into Matlab, and a custom verification function, which is responsible for setting the pass/fail criteria. The example script for Mandelbrot_S is also given in the appendix.
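The showahead behaviour described in point 3 can be sketched as a small behavioural model. This is an illustrative Python sketch under stated assumptions, not the DSPBA FIFO block: the class name `ShowaheadFifo` and the names `q`, `valid`, `full`, `write` and `read_ack` are hypothetical.

```python
from collections import deque


class ShowaheadFifo:
    """Behavioural model: the head value is visible before it is acknowledged."""

    def __init__(self, depth):
        self.depth = depth
        self.items = deque()

    @property
    def valid(self):
        # True while an output value is being shown
        return len(self.items) > 0

    @property
    def full(self):
        return len(self.items) >= self.depth

    @property
    def q(self):
        # The value currently displayed; looking at it does not consume it
        return self.items[0] if self.items else None

    def write(self, value):
        if self.full:
            raise OverflowError("write to full FIFO")
        self.items.append(value)

    def read_ack(self):
        # "I've read q, discard it and show me the next"
        self.items.popleft()


# Two FIFOs written and acknowledged together share one set of control signals.
coords, values = ShowaheadFifo(4), ShowaheadFifo(4)
for n in range(3):
    coords.write((n, n))
    values.write(n * 1.5)

drained = []
while coords.valid:                  # only one FIFO's valid flag is inspected
    drained.append((coords.q, values.q))
    coords.read_ack()
    values.read_ack()
print(drained)
```

Because both FIFOs are written and acknowledged on the same cycles, their flags always agree, which is why the control outputs of one FIFO are enough.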

7.3.1.2 Floating Point Matrix Multiply Example

For a matrix multiplication we need to compute a row x column dot product for each output element. Each element in the red row of A is multiplied by the corresponding element in the red column of B to produce the red result element in AB.

Here, for 8x8 matrices A and B,

$$(AB)_{ij} = \sum_{k=1}^{8} A_{ik} B_{kj}$$

The naive approach would be to accumulate the adjacent partial results, or to build adder trees, without consideration of any latency. However, suppose we want to implement this using a smaller dot product, folding to use a smaller number of multipliers rather than doing everything in parallel. We would do this by splitting up the loop over k into smaller chunks. We then need to accumulate the red and blue partial products, and we can re-order the calculations to avoid adjacent accumulations.
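The folding and re-ordering described above can be sketched in software. This is a minimal Python illustration, not the DSPBA design: the function name `folded_matmul` and the parameter `fold` (the width of the smaller dot product) are assumptions, and the matrix size is assumed to be a multiple of `fold`.

```python
def folded_matmul(A, B, fold):
    """Square matrix product using a dot-product engine only `fold` terms wide."""
    n = len(A)
    assert n % fold == 0, "sketch assumes the matrix size is a multiple of fold"
    C = [[0] * n for _ in range(n)]
    # The chunk loop is outermost, so two accumulations into the same C[i][j]
    # are separated by a full pass over all n*n output elements.
    for start in range(0, n, fold):
        for i in range(n):
            for j in range(n):
                partial = sum(A[i][k] * B[k][j] for k in range(start, start + fold))
                C[i][j] += partial
    return C


n = 8
A = [[i + j for j in range(n)] for i in range(n)]
B = [[(i * j) % 5 for j in range(n)] for i in range(n)]
direct = [[sum(A[i][k] * B[k][j] for k in range(n)) for j in range(n)] for i in range(n)]
print(folded_matmul(A, B, 4) == direct)
```

Putting the chunk loop outermost is one way to avoid adjacent accumulations: by the time the second partial product for an element arrives, the first has had n*n steps to pass through the adder pipeline.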


A traditional implementation of a matrix multiply design would be structured around a delay line and an adder tree.

$A_{11}B_{11} + A_{12}B_{21} + A_{13}B_{31} + A_{14}B_{41} + \dots$

The length and size of this structure grow with the folding size (typically 8-12). This implies an adder tree of 7-10 adders that are only used once every O(10) cycles. Each matrix size needs a different length, so the design must provision for the worst case.

A better implementation is to use FIFOs to provide self-timed control. Here new data is accumulated when both FIFOs have data. The advantages are that the design:

- Runs ‘as fast as it can’
- Is not sensitive to latency of dot-product on devices/fmaxes
- Is not sensitive to matrix-size (hardware just stalls for small N)
- Can be responsive to back-pressure which stops FIFOs emptying & full feedback to Control (not shown)
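The self-timed idea can be sketched with a cycle-by-cycle model. This Python sketch makes several assumptions that are not taken from the design itself: one FIFO holds incoming partial dot products, the other holds the running sums returning from a pipelined adder, and the names `self_timed_accumulate`, `n_outputs` and `adder_latency` are hypothetical.

```python
from collections import deque


def self_timed_accumulate(partials, n_outputs, adder_latency):
    """Accumulate interleaved partial products; fire only when both FIFOs hold data."""
    new_fifo = deque(partials)            # partial dot products, interleaved by output
    acc_fifo = deque([0] * n_outputs)     # running sums waiting for their next term
    in_flight = deque()                   # (cycle_ready, value) inside the adder pipeline
    cycle = 0
    while new_fifo or in_flight:
        # Results leaving the adder pipeline rejoin the running-sum FIFO
        while in_flight and in_flight[0][0] <= cycle:
            acc_fifo.append(in_flight.popleft()[1])
        # Self-timed control: accumulate only when both FIFOs have data
        if new_fifo and acc_fifo:
            total = acc_fifo.popleft() + new_fifo.popleft()
            in_flight.append((cycle + adder_latency, total))
        cycle += 1                        # otherwise the hardware simply stalls
    return list(acc_fifo), cycle


partials = [1, 10, 2, 20, 3, 30]          # two outputs, three partial products each
for latency in (1, 3, 7):
    sums, cycles = self_timed_accumulate(partials, n_outputs=2, adder_latency=latency)
    print(latency, sums, cycles)
```

The sums are the same for every latency; only the cycle count changes. That is the sense in which a self-timed design is not sensitive to the latency chosen for a given device or fmax.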



Appendix: Generated Test-benches

The Automatic TestBench (ATB) for an entity under test foo consists of:

foo.vhd – this is the HDL that is generated as part of the design (regardless of ATBs)

foo_stm.vhd – this is an HDL file that reads in data files of captured Simulink simulation inputs and outputs on foo

foo_atb.vhd – this is a wrapper HDL file that declares foo_stm and foo as components, wires the input stimuli read by foo_stm to the inputs of foo, and passes the output stimuli and the outputs of foo to a validation process. This process checks that the captured Simulink data and channel match the VHDL simulation of foo for all cycles where valid is high, and that the valid signals match.

<input>/<output>.stm – this is the captured Simulink data, written by the ChannelIn, ChannelOut, GPIn, GPout and ModelIP blocks. Each block writes out a single stimulus file capturing all the signals through it, writing them out in columns as doubles with one row for each time-step. For example, ChannelOut1.stm:

Time-step   qv   qc    q0    q1    q2    q3
        0    1    0     0     0     0     0
        1    1    0     0     0     0     0
        2    1    0     1     1     1     1
        3    1    0     1     1     1     1
        4    1    0     1     1     1     1
        5    1    0     1     1     0     0
        6    1    0     0     0     0     0
        7    1    0    -2    -2    -2    -2
        8    1    0    -2    -2    -2    -2
        9    1    0    -4    -4    -4    -4
       10    1    0    -4    -4    -4    -4
       11    1    0    -3    -3    -3    -3
       12    1    0    -2    -1     0     0
       13    1    0     0    -1    -2    -2
       14    1    0    -4    -5    -6    -6
       15    1    0    -9   -10   -11   -12
       16    1    0   -17   -19   -20   -22

The device-level testbenches make use of these same stimulus files, following connections from device-level ports to where the signals are captured. Device-level testbenches are therefore restricted to cases where the device-level ports are simply connected to stimulus capturing blocks. The picture below shows how these components are used to build a testbench around the generated HDL code.
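A stimulus file of the kind described above (columns of numbers, one row per time-step) could be read back for inspection along the following lines. This is a hedged Python sketch: it assumes whitespace-separated decimal values with no header row, which may not match the exact file format the tool writes, and the helper name `parse_stm` is hypothetical.

```python
def parse_stm(text, column_names):
    """Split whitespace-separated stimulus text into named columns (assumed format)."""
    columns = {name: [] for name in column_names}
    for line in text.splitlines():
        fields = line.split()
        if not fields:
            continue                      # skip blank lines
        if len(fields) != len(column_names):
            raise ValueError(f"expected {len(column_names)} fields, got {len(fields)}")
        for name, field in zip(column_names, fields):
            columns[name].append(float(field))
    return columns


# Three rows taken from the ChannelOut1.stm example (time-steps 12 to 14)
SAMPLE = """\
1 0 -2 -1 0 0
1 0 0 -1 -2 -2
1 0 -4 -5 -6 -6
"""
cols = parse_stm(SAMPLE, ["qv", "qc", "q0", "q1", "q2", "q3"])
print(cols["q1"])   # [-1.0, -1.0, -5.0]
```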



Figure 47: Generated Automatic TestBench files

[Figure: structure of the generated ATB. _stm.vhd reads in the stm files: xIn.stm (columns v, c, xIn_0) and xOut.stm (columns v, c, xOut_0, xOut_1). _atb.vhd contains the stm component, the entity component (.vhd, the generated HDL) and the checkxOut process. Signals shown: clk, areset, h_areset; xIn_v_stm, xIn_c_stm, xIn_0_stm; xOut_v_stm, xOut_c_stm, xOut_0_stm, xOut_1_stm; xOut_v, xOut_c, xOut_0, xOut_1. _atm.do creates and runs the ModelSim project; _atm.wav.do sets up the ModelSim signals display.]

The checkxOut process shown in the figure is:

    checkxOut : process (clk, areset)
    begin
        IF (areset = '1') THEN
            -- do nothing during reset
        ELSIF (clk'EVENT AND clk = '0') THEN
            -- falling clock edge to avoid transitions
            assert (xOut_v = xOut_v_stm)
                report "mismatch in xOut_v signal" severity Failure;
            if (xOut_v = "1") then
                assert (xOut_c = xOut_c_stm)
                    report "mismatch in xOut_c signal" severity Failure;
                assert (xOut_0 = xOut_0_stm)
                    report "mismatch in xOut_0 signal" severity Failure;
                assert (xOut_1 = xOut_1_stm)
                    report "mismatch in xOut_1 signal" severity Failure;
            end if;
        end if;
    end process;
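The checking rule used by the generated process (valid must match on every cycle; channel and data are compared only when valid is high) can be restated as a short software model. This Python sketch is illustrative only: the function name `check_outputs` and the per-cycle dictionary layout with keys `v`, `c` and `data` are assumptions, not part of the generated files.

```python
def check_outputs(dut, stm):
    """Return (cycle, field) pairs where the simulated output and the stimulus differ."""
    mismatches = []
    for cycle, (got, expected) in enumerate(zip(dut, stm)):
        # The valid signals must agree on every cycle
        if got["v"] != expected["v"]:
            mismatches.append((cycle, "v"))
        # Channel and data are only compared while the output is valid
        if got["v"] == 1:
            for key in ("c", "data"):
                if got[key] != expected[key]:
                    mismatches.append((cycle, key))
    return mismatches


stm = [
    {"v": 0, "c": 0, "data": (0, 0)},
    {"v": 1, "c": 0, "data": (1, 1)},
    {"v": 1, "c": 0, "data": (-2, -1)},
]
dut = [
    {"v": 0, "c": 0, "data": (9, 9)},     # ignored: output not valid
    {"v": 1, "c": 0, "data": (1, 1)},
    {"v": 1, "c": 0, "data": (-2, 0)},    # data mismatch while valid
]
print(check_outputs(dut, stm))            # [(2, 'data')]
```

Data that differs while valid is low is ignored, which is what lets a design produce arbitrary values on its outputs between valid samples.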


Appendix: Overriding Test-benches in Matlab

Override Verification Feature Overview

This feature allows the ModelSim simulation output to be imported into Matlab for verification and subsequent processing as required by the application. This offers the user complete freedom over what verification and post-simulation processing is to be applied, including setting new pass/fail criteria for designs. The Matlab verification functions that a user might create here are expected to be very specific to the application domain.

This feature is useful in verifying designs that are not expected to be bit accurate or cycle accurate. Example designs include those using DSPBA’s floating point system.

Default Verification

The current automated test bench (ATB) generated by DSPBA consists of a hybrid of Tcl and VHDL. It applies one of three checks to each output signal:

1. For traditional fixed point data the value must be an exact match with the stimulus files produced by the Simulink simulation

2. For floating point data-types, a relative error threshold is applied (currently set to 0.1% for single precision, 0.0001% for double precision)

3. For fixed point signals in a model that also uses floating point, a fuzzy comparison is made using a threshold equivalent to the sum of the two least significant bits. (For example, for integer data 4 and 7 are considered equal but not 4 and 8. For sfix8_En3, 3.125 and 2.750 are considered equal but not 2.5 and 3.0.)

These comparisons are made independently for each signal. The ATB checks the real and imaginary parts of a complex number separately, and vectors as individual components. This limits the utility of the ATB when applied to applications with vector and complex outputs.
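The three default checks can be written out as simple predicates. This is a Python sketch under stated assumptions, not the ATB implementation: the function names are hypothetical, the relative error is assumed to be measured against the Simulink reference value, and the fuzzy threshold is taken as the sum of the two least significant bits (three LSBs in value).

```python
SINGLE_REL_TOL = 0.1 / 100       # 0.1% for single precision
DOUBLE_REL_TOL = 0.0001 / 100    # 0.0001% for double precision


def fixed_exact(hdl, ref):
    """Check 1: fixed point data must match the stimulus exactly."""
    return hdl == ref


def float_close(hdl, ref, rel_tol):
    """Check 2: floating point data within a relative error threshold (assumed definition)."""
    return abs(hdl - ref) <= rel_tol * abs(ref)


def fixed_fuzzy(hdl, ref, frac_bits=0):
    """Check 3: fixed point data within the sum of the two least significant bits."""
    lsb = 2.0 ** -frac_bits
    return abs(hdl - ref) <= lsb + 2 * lsb


print(fixed_fuzzy(4, 7), fixed_fuzzy(4, 8))                      # True False
print(fixed_fuzzy(3.125, 2.750, 3), fixed_fuzzy(2.5, 3.0, 3))    # True False
print(float_close(1.0005, 1.0, SINGLE_REL_TOL))                  # True
```

The printed cases reproduce the examples given in check 3: with three fractional bits (sfix8_En3) one LSB is 0.125, so the threshold is 0.375.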

How To Use

To improve the flexibility of the ATBs, a new experimental feature allows users to verify their ModelSim simulation using a custom Matlab function. To do this, the ModelSim output has to be written to a file that can be imported into Matlab after the vsim process completes. The following steps enable this feature:

1. If the model does not already contain a [Run All testbenches] block, add one from the Additional Libraries > Beta Utilities. Double click on it to reveal the following dialogue window:



Example: verification function for Mandelbrot design

    function passed = verify_mb(vsim_mb)
    % Verify Mandelbrot results
    % The order of results is dependent on the floating point comparison,
    % pixel colors can appear in a different order in Simulink and HDL.
    % This function captures the outputs and plots both the ModelSim HDL
    % and Simulink simulation results and where there are any pixels that
    % differ.
        passed = 0;
        % Start with empty images so that the isempty checks below are
        % defined even when no valid data is found.
        hdl_p = [];
        sim_p = [];
        % In this design there is just one ChannelOut block
        % ... 'DUT/ChannelOut'
        % and the variables are
        %     qv:         [120000x1 embedded.fi]
        %     qv_stm:     [120000x1 embedded.fi]
        %     qc:         [120000x1 embedded.fi]
        %     qc_stm:     [120000x1 embedded.fi]
        %     qCoord:     [120000x2 embedded.fi]
        %     qCoord_stm: [120000x2 embedded.fi]
        %     qColor:     [120000x1 embedded.fi]
        %     qColor_stm: [120000x1 embedded.fi]
        results = vsim_mb('DUT/ChannelOut');
        if ~isempty(results)
            % Loop through the results, capturing the pixel colors from the
            % valid data associated with each coordinate
            for i = 1:length(results.qCoord)
                if (results.qv(i) == 1)
                    % hdl_p is the valid output from the ModelSim HDL simulation
                    hdl_p(int(results.qCoord(i,2))+1, int(results.qCoord(i,1))+1) ...
                        = int(results.qColor(i));
                end
                if (results.qv_stm(i) == 1)
                    % sim_p is the valid output from the Simulink simulation
                    sim_p(int(results.qCoord_stm(i,2))+1, int(results.qCoord_stm(i,1))+1) ...
                        = int(results.qColor_stm(i));
                end
            end
            if isempty(hdl_p)
                error('No valid ModelSim data generated. Aborting plot.');
            else
                % Plot the ModelSim simulation results
                figure('Name','ModelSim Results');
                imagesc(hdl_p);
            end
            if isempty(sim_p)
                error('No valid Simulink data generated. Aborting plot.');
            else
                % Plot Simulink simulation results
                figure('Name','Simulink Results');
                imagesc(sim_p);
            end
            if ~isempty(hdl_p) && ~isempty(sim_p)
                % Create an array of differences. This will be 0 at every
                % coordinate that matches, non-zero at every difference.
                diff_array = (hdl_p - sim_p);
                % Plot this to visualize the location of differences.
                figure('Name','Differences');
                imagesc(diff_array);
                % Count the number of mismatched pixels
                num_mismatches = sum(sum(diff_array ~= 0));
                % The number of mismatches should ideally be zero.
                % However, the algorithm determines the pixel color of a
                % coordinate not in the Mandelbrot set according to how many
                % iterations before the sequence is known to be unbounded.
                % The simple test for this is by comparing the magnitude
                % squared to 4.
                % For some pixels this iterative value may be very close to 4
                % on some iteration, such that in HDL the number of iterations
                % before exiting may differ compared to that in Simulink
                % simulation.
                % This is especially likely near the unit circle, for points
                % that take near the maximum number of iterations to determine
                % whether they remain bounded. Which is correct? Perhaps which
                % is closest to what a double precision calculation would give.
                passed = (num_mismatches <= 3);
            end
        end
    end

http://www.mathworks.com/help/techdoc/ref/containers_map.html

