# CSE4210 Architecture and Hardware for DSP

Chapter 6
Folding

# Folding

- The folding transformation is used to systematically determine the control circuits in DSP architecture where multiple algorithm operations are time-multiplexed to a single functional unit.
- The hardware is reduced by a factor of N, the time is increased by the same factor.
- May lead to a large number of registers, thus registers minimization techniques are studied.

#### **York University**

#### **CSE**



- The objective is to provide a systematic technique for designing control circuits for hardware where several algorithm operations are mapped to the same piece of hardware via time-multiplexing of course.
- · We start with a DFG for the algorithm.
- · We need the following definitions

- U and V are two nodes in the original DFG.
- U and V are connected via an edge e with a delay w(e)  $U \xrightarrow{w(e)} V$
- Folding factor is N
- Node (computation) U  $l^{th}$  iteration is performed at time Nl +u
- Node (computation)  $V_l^{th}$  iteration is performed at time  $N_l + v$ , u and v are the folding order of U and V < N-1 (time partition scheduled for).
- $H_u$  and  $H_v$  are the hardware units U and V are performed at
- $H_u$  and  $H_v$  are pipelined by  $P_u$  and  $P_v$  stages

- The results of the  $l^{th}$  iteration of node U is available at  $Nl+u+P_u$
- Since there are w(e) delays between U and V, the result is needed in the  $(l+w(e))^{th}$  iteration of V, which is executed at N(l+w(e))+v, we need to store it for

$$D_F(U \xrightarrow{e} V) = [N(l+w(e))+v]-[Nl+P_u+u]$$
$$= Nw(e)-P_u+v-u$$



#### Folding Set

- Is an ordered set of operations executed by the same functional unit.
- Each folding set contains N entries (some of which may be null operations)
- The J<sup>th</sup> position within the folding set is executed in the time partition j
- For example the folding set  $S_1 = \{A_1, \phi, A_2\}$  for N=3
- $A_1$  is performed during the  $0^{th}$  time partition  $S_1|0$ , while  $A_2$  is done in the  $2^{nd}$  time partition  $S_1|2$
- Folding set is obtained using a scheduling and allocation algorithm

# Example



- $\square N=4$
- □ Folding sets are adder  $S_1$ ={4,2,3,1} and a multiplier  $S_2$ ={5,8,6,7}
- ☐Addition takes 1 and multiplication 2 time units
- □1-stage adder and 2-stage multiplier

#### Example



$$D_{F}(1 \to 2) = 4(1) - 1 + 1 - 3 = 1$$

$$D_{F}(1 \to 5) = 4(1) - 1 + 0 - 3 = 0$$

$$D_{F}(1 \to 6) = 4(1) - 1 + 2 - 3 = 2$$

$$D_{F}(1 \to 7) = 4(1) - 1 + 3 - 3 = 3$$

$$D_{F}(1 \to 8) = 4(2) - 1 + 1 - 3 = 5$$

$$D_{F}(3 \to 1) = 4(0) - 1 + 3 - 2 = 0$$

$$D_{F}(4 \to 2) = 4(0) - 1 + 1 - 0 = 0$$

$$D_{F}(5 \to 3) = 4(0) - 2 + 2 - 0 = 0$$

$$D_{F}(6 \to 4) = 4(1) - 2 + 0 - 2 = 0$$

$$D_{F}(7 \to 3) = 4(1) - 2 + 2 - 3 = 1$$

$$D_{F}(8 \to 4) = 4(1) - 2 + 0 - 1 = 1$$





- What if some of the  $D_F$ 's are negative
- · Of course we can not implement that
- A condition:  $D_F \ge 0$
- · We can use retiming of the original graph to get a valid  $D_F$ 's
- · Recall, retiming equation

$$w_r(e) = w(e) + r(V) - r(U) \ge 0$$

$$\begin{split} D_F' & (U \stackrel{e}{\longrightarrow} V) \text{ is the delays in the folded retimed graph} \\ D_F & (U \stackrel{e}{\longrightarrow} V) = Nw(e) - P_u + v - u \\ D_F' & (U \stackrel{e}{\longrightarrow} V) = N\Big(w(e) + r(V) - r(U)\Big) - P_u + v - u \\ D_F' & (U \stackrel{e}{\longrightarrow} V) = Nw(e) - P_u + v - u + Nr(V) - Nr(U) \\ D_F' & (U \stackrel{e}{\longrightarrow} V) = D_F & (U \stackrel{e}{\longrightarrow} V) + Nr(V) - Nr(U) \ge 0 \\ r(U) - r(V) \le \frac{D_F & (U \stackrel{e}{\longrightarrow} V)}{N} \\ r(U) - r(V) \le \frac{D_F & (U \stackrel{e}{\longrightarrow} V)}{N} \\ \end{split}$$

- We can use the techniques in Chapter 4 to solve for retiming.
- Then we fold the graph → valid transformation

#### Exercise



#### Exercise

| $\overline{\text{Edge}}$ | Folding Equation    | Retiming for Folding Constraint |
|--------------------------|---------------------|---------------------------------|
| $1 \rightarrow 2$        | $D_F(1\to 2)=-3$    | $r(1)-r(2) \leq -1$             |
| $1 \rightarrow 5$        | $D_F(1 \to 5) = 0$  | $r(1) - r(5) \le 0$             |
| $1 \rightarrow 6$        | $D_F(1 \to 6) = 2$  | $r(1) - r(6) \le 0$             |
| $1 \rightarrow 7$        | $D_F(1 \to 7) = 7$  | $r(1)-r(7)\leq 1$               |
| $1 \rightarrow 8$        | $D_F(1 \to 8) = 5$  | $r(1)-r(8)\leq 1$               |
| $3 \rightarrow 1$        | $D_F(3\to 1)=0$     | $r(3) - r(1) \le 0$             |
| $4 \rightarrow 2$        | $D_F(4\to 2)=0$     | $r(4) - r(2) \le 0$             |
| $5 \rightarrow 3$        | $D_F(5\to 3)=0$     | $r(5) - r(3) \le 0$             |
| $6 \rightarrow 4$        | $D_F(6 \to 4) = -4$ | $r(6)-r(4)\leq -1$              |
| $7 \rightarrow 3$        | $D_F(7\to 3)=-3$    | $r(7) - r(3) \le -1$            |
| $8 \rightarrow 4$        | $D_F(8\to 4)=-3$    | $r(8)-r(4)\leq -1$              |

#### Exercise



# Registers Minimization Techniques

- The objective is to minimize the number of registers in the implementation of a DSP algorithm. Topics
  - >Life time analysis
  - Data allocation using forward-backward register allocation
  - > Register minimization in folded architecture
  - > Examples

#### Life Time Analysis

- A data sample (variable) is alive from the time it is produced, until the time it is consumed (dead).
- During that time, the variable is stored in a register.
- The maximum number of live variables at any time is the minimum number of registers required for the implementation.
- We use the convention that the variable is not alive during the cycle it is produced in, and alive during the cycle it is consumed in.

#### **York University**

**CSE** 

| cycle | a           | b | C        | # live     |
|-------|-------------|---|----------|------------|
| 0 -   |             |   |          | 0          |
| 1 -   |             |   |          | <b>-</b> 1 |
| 2 -   |             | + |          | - 2        |
| 3 -   |             | 十 |          | - 2        |
| 4 -   | $\multimap$ | + | 1        | - 2        |
| 5 -   |             | + | ╁        | - 2        |
| 6 -   |             | + | ╂        | - 2        |
| ž -   |             | O | <u>\</u> | - 2        |





# Example

$$\begin{bmatrix} a & b & c \\ d & e & f \\ g & h & i \end{bmatrix} \qquad \text{Transpose} \qquad \begin{bmatrix} a & d & g \\ b & e & h \\ c & f & i \end{bmatrix}$$



# Example

| Sample | $T_{input}$ | $T_{zout}$ | $T_{diff}$ | Toutput | Life               |
|--------|-------------|------------|------------|---------|--------------------|
| a      | 0           | 0          | 0          | 4       | 0->4               |
| b      | 1           | 3          | 2          | 7       | $1 \rightarrow 7$  |
| c      | 2           | 6          | 4          | 10      | $2 \rightarrow 10$ |
| d      | 3           | 1          | -2         | 5       | $3 \rightarrow 5$  |
| e      | 4           | 4          | 0          | 8       | $4 \rightarrow 8$  |
| f      | 5           | 7          | 2          | 11      | $5 \rightarrow 11$ |
| g      | 6           | 2          | -4         | 6       | $6 \rightarrow 6$  |
| h      | 7           | 5          | -2         | 9       | $7 \rightarrow 9$  |
| i      | 8           | 8          | 0          | 12      | $8 \rightarrow 12$ |

#### Circular Life-Time Chart



#### Data Allocation

- · Determine the min. number of rgisters
- Input each variable at the time its life starts. If more than one use multiple registers such that the longest lifetime is allocated to the initial register.
- · Each variable is allocated in a forward manner until it is dead or reaches the last register
- Allocation is periodic, all allocation to current iteration repeats itself after after N
- If reaches the last register and not dead allocate backward ((if more than one, choose one that has been allocated backward before), then forward again and so on.

| cycle | 2  | ı l | ) ( | e d      | e 1 | f g | h        | i | # live    |
|-------|----|-----|-----|----------|-----|-----|----------|---|-----------|
| 0     |    |     |     |          |     |     |          | _ | 0         |
| 1     | _  |     |     |          |     |     |          | _ | 1         |
| 2     | -  | Н   | Н   |          |     |     |          | _ | 2         |
| 3     | -  | Н   | Н   |          |     |     |          | _ | 3         |
| 4     | _( | ЪН  | Н   | -        | _   |     |          | _ | 4         |
| 5     | _  | _   | Ш   |          | L   |     |          | _ | 4         |
| 6     | _  |     | Ш   |          | Н   |     |          | _ | 4         |
| 7     |    | لم_ | Ъ   |          | Ц   |     | -        | _ | 4         |
| 8     | _  |     | _   |          | Ч   |     | Д,       | _ | 4         |
| 9     |    |     |     |          | 1   |     | 싀        | L | 4+0=4     |
| 10    |    |     | _,  | <u> </u> | _   |     | <u> </u> |   | 3+1=4     |
| 11    |    |     | _   |          | _/  |     |          | L | 2+2=4     |
| 12    |    |     |     |          |     |     |          | Ļ | 1+3=4     |
| 12    |    |     |     |          |     |     | •        | • | 1 1 J - T |

| Cycle | I/P | R1     | R2  | R3 | R4  | O/P |
|-------|-----|--------|-----|----|-----|-----|
| 0     | a < |        |     |    |     |     |
| 1     | b   | a .    |     |    |     |     |
| 2     | c \ | p,     | a   |    |     |     |
| 3     | d 、 | c      | Ъ   | à  |     |     |
| 4     | e 、 | d      | c   | b  | a   | a   |
| 5     | f   | ,<br>e | d   | Ĉ  | b   | d   |
| 6     | g   | f,     | è   | b  | _ c | g   |
| 7     | h 、 | c .    | f,  | e  | b   | b   |
| 8     | i、  | ħ      | c   | f  | e   | e   |
| 9     |     | 1 ,    | ĥ   | c  | f   | h   |
| 10    |     |        | i , | f  | c   | c   |
| 11    |     |        |     | i  | f   | f   |
| 12    |     |        |     |    | 1   | i   |
|       |     |        |     |    |     |     |
| ļ     |     |        | I   | I  | I   | 1   |