Extracting Steering Logic Duo Liu The project I am working on is to extract steering logic from HLS. I will use GAUT to do the initial scheduling with respect to resource and latency constraints and use TDS to improve it. With the optimized cdfg file, the final VHDL will be generated using GAUT. Then the expected DOT file will be transformed from the steering logic description which is extracted from the achieved VHDL file. The conversion process is totally automatic, which takes in a VHDL file (shall be generated by GAUT or it may not work so well) as input and produces exactly the expected DOT description of steering logic as output. Below is the concrete example of the implementation flow. In this example I choose the design of FIR8, it is adapted from an example, FIR16, in the test folder in GAUT just for simplicity. The codes are shown as follows in Fig.1. From the codes we can see that the input data is given to echantillon, and output = temporaire = echantillon*Coeffs[7] + Valeurs[1]*Coeffs[6] + Valeurs[2]*Coeffs[5] + Valeurs[3]*Coeffs[4] + … + Valeurs[7]*Coeffs[0] We also get the information that Valeurs[7]= Valeurs[6]= Valeurs[5]=…= Valeurs[1]=echantillon=data_in, thus the output temporaire = data_in*( Coeffs[7] + Coeffs[6] + Coeffs[5] + … + Coeffs[0] ). Generation of CDFG File and Optimization Using TDS First of all, we will use the Compiler block of GAUT. The compiler will extract all potential parallelism from the codes at the basis of ensuring the correctness of the algorithm syntactically and semantically. After this initial checking step, we can get the preliminary DFG generated by GAUT by compiling the codes. The fir8.cdfg file is shown in Fig.2. 1 Fig.1 Example Codes Fig.2 fir8.cdfg File The dark yellow circles represent Coeffs integer array and intermediate values, the bright yellow circle which at the most right is output, the green one in the middle is input data and the red ones are Valeurs variables. The blue boxes represent operations, addition and multiplication in this case. 2 This DFG can now be used as the input file to TDS, which provides behavioral transformations of specific designs for a particular objective. The transformations manipulated by TDS are based on graph-based canonical representation, Taylor Expansion Diagram (TED), and focus on the transformation of the DFG network. First of all, I copied the fir8. cdfg file from the folder in GAUT and pasted it into TDS’s folder. Then, I ran TDS in terminal and used command “read --no_dff fir8.cdfg” to read the file as a netlist without discovering the DFFs. Through command “show -n” we can observe the netlist: Fig.3 Netlist of fir8.cdfg in TDS Then I transformed the file into TED formation, as shown in Fig.4. In the subsequent steps I reordered the variables (as few variables in this case, the influence of reordering is not significant, at least as I see), and linearized and balanced the TED function. In the initial try I didn’t replace the multipliers with shifter, so only utilized multipliers, adders and subtractors. After transforming from ted to dfg using command “ted2dfg”, I balanced the dfg thus acquired the minimum-latency dfg, finally the dfg was transformed to netlist by running the command “dfg2ntl” and the cdfg file was generated by running “write” command. The cdfg with minimum-latency was then copied to GAUT, the DFG graphs in TDS (Fig.5) and GAUT (Fig.6) are as follows. In this process, I wasted a lot of time here for trying to find a correct combi nation of commands and a right flow to optimize TED and DFG by trial and error, although the result is still not the most satisfactory one, through many times of attempts I understand that the optimal result needs the combination of optimization both on TED level and DFG level, what we did in homework 1 is just on the surface. I gained the experience that the conversions among TED, DFG and NTL at the right time are also very important, and that’s why at the very beginning I always could not acquire a usable CDFG file for GAUT. 3 Fig.4 TED Formation in TDS Fig.5 Minimum-Latency DFG Graph in TDS Fig.6 Minimum-Latency DFG Graph in GAUT 4 After achieving the first graph successfully, I then implemented the fir8 function with only one adder, one multiplier and one subtractor. The process is similar but the latency is much higher while the area is decreased to the low est. The DFG graphs in TDS (Fig.7) and GAUT (Fig.8) are as follows. Fig. 7 Minimum-Area DFG Graph in TDS Fig.8 Minimum-Area DFG Graph in GAUT When all these are done, I decided to implement this circuit with shifters. To achieve this goal, some modifications are done with TDS commands, especially adding 2 important commands: “shifter” and “remapshift”. The DFG graphs in TDS (Fig.9) and GAUT (Fig.10) are shown as follows. 5 Fig.9 DFG Graph Using Shifters in TDS Fig.10 DFG Graph Using Shifters in GAUT Generation of Gantt and VHDL The next step is to use the synthesis block in GAUT. The purpose of this context is to synthesize under constraints of rate the formerly analyzed circuit carrying out the calculation in the algorithm. In this step, we can specify the cadency, which equals sampling rate, of the design, the memory and I/O constraints (which are less useful in this case because the goal network is the steering logic), and also the allocation and scheduling strategy. Step 1: the CDFG file which is generated aiming at achieving the minimum latency is synthesized. In this method no shifter is used. We choose the “operator optimization” option; other selected options of the configurations are illustrated in Table.1. 6 allocation strategy global_lb: the number of resources given by the global allocation is considered as a lower-bound value scheduling strategy ASAP promise the least latency Register allocation MWBM vhdl output Fsm_sig (at the very first I planned to use Structural + FUS_LEA4 as the output, because it utilizes a modified leftedge algorithm which reduces the number of registers as well as the cost of MUXs. However, through running the synthesis several times, I found that the operator resources and latency needed for Fsm_sig and Structural + FUS_LEA4 are exactly the same. As Structural + FUS_LEA4 always collapses when using large cadency, I choose fsm_sig as the output.) Table.1 Configuration of Minimum-Latency Synthesis Using GAUT As for which register allocation method to use, we can compare by running the same cdfg file (I choose the one with minimum latency) with different register allocation methods (other options remain unchanged), a table can be drew as Table.2. MWBM MLEA Left Edge None Adder No. 2 2 2 2 Substractor No. 2 2 2 2 Multiplier No. 8 8 8 8 Total Operator Area 696 696 696 696 Simple Register No. 14 10 10 0 2 to 1 MUX No. 96 176 256 96 Latency 60 60 60 60 Flip-Flop No. 224 160 160 368 7 Table.2 Comparison of Different Register Allocation Methods From table above we can observe that without any allocation algorithm, a great number of flip-flop are needed, and the middle 2 algorithms need a large number of MUX, MWBM also needs many flip-flops, but less MUXs, and the cost of MUX is apparently higher than flip-flop, hence I choose MWBM strategy. Through further investigation, we can find the fact that as the ASAP is used the minimum latency=60 is fixed, but the number of steering logic components differs due to different cadency settings. The details can be observed from the Table.3. As for allocation strategies, only distributed_lb, distributed_reuse_lb and global_lb can be used and the scheduling results are exactly the same given other parameters are configured the same way. Through comparison we set the cadency to 70 ns. Cadency(n Latency(ns s) ) + * - Area_oprtr Regs s FFs MUX2_1 Databus 50 60 2 8 2 696 14 224 96 1 60 60 2 8 2 696 12 192 96 2 70 60 2 8 2 696 12 192 96 1 80 60 2 8 2 696 12 192 96 1 90 … … … … … … … … ... Table.3 Steering Logic Components Comparison Using Different Cadencies The Gantt graph for minimum latency is as follows: 8 Fig.11 Gantt Graph for Minimum Latency Step 2: the DFG with minimum area will be synthesized. I tried using only 1 adder, 1 multiplier, 1 substrctor. As the latency is 180, the cadency must be no less than 180. And the ASAP algorithm cannot be used. The total operator area is 99. Moreover, we choose the “operator optimization” option. Other configurations are as follows. Allocation strategy manual Scheduling strategy ASAP Register allocation MEBM Vhdl output Fsm_sig Table.4 Configuration For Minimum Area Synthesis The resulting steering logic composition is as follows. Cadency(n Latency(ns + * - Area_oprtr Regs s) ) 180 180 1 1 1 99 4 FFs 64 Mux2_1 DataBus 256 Table.5 Steering Logic Components Under Minimum Area 9 1 However this is not the minimum area can be achieved because in the process of TED optimization the TDS recognizes the adder and subtractor as different operators. Thus when the optimized cdfg file transferred back to GAUT, the synthesis block will use different operators for scheduling which increases the total area. In comparison, the original cdfg file which is not op timized by TDS, however, can achieve an even smaller area by reusing the operator that is used in former time steps in a complete scheduling process. According to the configuration in Table.6 we can acquire the corresponding steering logic components in Table.7. Allocation strategy automatic Scheduling strategy Default/no_pipeline/no_more_stage Register allocation MWBM Vhdl output Fsm_sig Table.6 Another Configuration For Minimum Area Synthesis Cadency(n Latency(ns + s) ) 180 180 1 * - 1 0 Area_oprtr Regs 91 6 FFs Mux2_1 Datanus 96 64 2 Table.7 Final Steering Logic Components After Minimum Area Synthesis The Gantt view is illustrated below. Fig.13 Gantt View for Final Minimum Area Synthesis The VHDL files of all the syntheses above are achieved, as the length is a little verbose now, I don’t list them here. Extracting Steering Logic From VHDL File and Convert It to DOT Language In this part, I will first extract the steering logic from hardware description, VHDL file generated by GAUT and then convert the extracted VHDL-format 10 logic to a visible file format, the DOT file. Dot file can be read as a picture using Graphviz package which is pre-installed in UNIX operating system. All I need to do is changing the VHDL description into DOT description which has its own syntax, grammar and semantic meaning. The language I use to implement the automatic-converting program is Python due to its simplicity in grammar and sentence pattern. The whole program can be divided into 7 parts in order, each of which has its own function. We start from the first part. 1. The First Extraction The first extraction will extract the case bodies from the original VHDL file. Part of the extracted file is shown below. when S0 => -- time 0 when S1 => -- time 10 -- outputs of operation op0 reg_2 <= comp_1_mul_op_o; -- static memory read from buses reg_1 <= BUS_DONNEES_1_fir8; reg_3 <= BUS_DONNEES_2_fir8; when S2 => -- time 20 when S3 => -- time 30 -- outputs of operation op1 reg_5 <= comp_1_mul_op_o; -- static memory read from buses reg_1 <= BUS_DONNEES_1_fir8; reg_3 <= BUS_DONNEES_2_fir8; when S4 => -- time 40 when S5 => -- time 50 -- outputs of operation op2 reg_8 <= comp_1_mul_op_o; -- static memory read from buses reg_1 <= BUS_DONNEES_1_fir8; reg_3 <= BUS_DONNEES_2_fir8; when S6 => -- time 60 -- outputs of operation op7 reg_11 <= comp_0_add_op_o; when S7 => -- time 70 -- outputs of operation op3 reg_5 <= comp_1_mul_op_o; In the above lines, “reg_n” (n ∈ Z+) represents one register, “comp_1_mul_op_o” represents the output of multiplier 1. “ comp_1_mul_op_a” and “comp_1_mul_op_b” are two inputs of the multiplier operator. The reason why extract the case body is because all the steering logic information are contained here. For example, looking at the top of the above VHDL sentences, at S1 state, the output of the area should be sent to reg_2. The VHDL 11 file is easy to understand, but when extracting these sentences, attention shall be paid to Python language. In the sentence “pattern = re.compile('case \S+ is(.*?)\s*end case', re.DOTALL)”, “re” refers that the regular expression library is used, “'case \S+ is(.*?)\s*end case'” is the part of the whole VHDL file that we want to extract which is the case bodies. “re.DOTALL” means the search will go through all VHDL file (or else the search will terminate at the end of each line). “reg = pattern.findall(data) ” returns a list of strings, by “text = '\n'.join(reg)” the list is transformed to consecutive strings which are exactly what we want to extract in the first step. 2. Second Extraction In this step, I further extract the file we get in the first step. After this step what I get is pure data transfer relationship without state information. (We still need state information to help assure the data transfer occurs at the right time, this will be handled in the following steps) part of the achieved sentences is as follows: reg_11 <= comp_0_add_op_o; reg_5 <= comp_1_mul_op_o; reg_1 <= BUS_DONNEES_1_fir8; reg_3 <= BUS_DONNEES_2_fir8; reg_11 <= comp_0_add_op_o; reg_11 <= comp_0_add_op_o; reg_8 <= comp_1_mul_op_o; reg_11 <= comp_0_add_op_o; reg_1 <= BUS_DONNEES_1_fir8; comp_1_mul_op_a <= reg_1; comp_1_mul_op_b <= reg_0; comp_1_mul_op_a <= reg_1; comp_1_mul_op_b <= reg_0; comp_1_mul_op_a <= reg_1; comp_1_mul_op_b <= reg_3; comp_1_mul_op_a <= reg_1; The method I used to do the extraction is finding out all sentences that contains the key word “reg”. The reason why I don't use this means in the first step is because apart from the case bodies there are many other sentences in the original VHDL file that contain word “reg”, many useless information would be extracted in the meanwhile. 3. Convert The Extracted Contents to DOT Sentence Pattern In DOT program, the relationship between 2 nodes is expressed as “A -> B;” which means a directed edge from node A to node B. Part of converted sentences is as follows: 12 comp_1_mul_op_o -> reg_8 ; comp_0_add_op_o -> reg_11 ; BUS_DONNEES_1_fir8 -> reg_1 ; reg_1 -> comp_1_mul_op_a ; reg_0 -> comp_1_mul_op_b ; How I achieve this effect is simple, just switch the positions of names at different sides of sign “<=”, then change the sign “<=” to “->”. 4. Delete Redundant Lines The reason of adding this part is clear, because in the original VHDL file, the relationship between two names, for example, “reg_1 -> comp_1_mul_op_a ;”, may exist more than once, for this relationship may be used in different time state. Then according to the method I provided in the former steps, redundant lines may be extracted unnecessarily which may make the coming steps difficult to be implemented. The simplified steering logic is quite simple: comp_1_mul_op_o -> reg_2 ; BUS_DONNEES_1_fir8 -> reg_1 ; BUS_DONNEES_2_fir8 -> reg_3 ; comp_1_mul_op_o -> reg_5 ; comp_1_mul_op_o -> reg_8 ; comp_0_add_op_o -> reg_11 ; reg_1 -> comp_1_mul_op_a ; reg_0 -> comp_1_mul_op_b ; reg_3 -> comp_1_mul_op_b ; reg_8 -> comp_0_add_op_a ; reg_5 -> comp_0_add_op_b ; reg_11 -> comp_0_add_op_a ; reg_2 -> comp_0_add_op_b ; reg_11 -> comp_0_add_op_b ; reg_11 -> BUS_DONNEES_2_fir8 ; Above are all relationships that are needed to draw the steering logic for a minimum area scheme. 5. Construct MUX Structures and Place MUXs into Graph (1 st and 3rd levels are registers and operators) This step may be the trickiest one. First I need to know how many MUXs are needed. Through observing the result in Step 4, actually, each relationship requires one MUX (except the last one in which data is transferred to BUS). According to the experience from homework 1 in Synthesis and Verification of Digital Systems, the contents at the left of sign “->” can stay as they are because those are indeed where data come from, however, the contents at the right side need to be replaced with MUX. But things are not as easy as just replacing them with corresponding MUX, for the reason that there are 13 data in and out of MUX, together with control signals, if simply adding MUX to the picture, the structure of the whole picture will be in a mess and make the picture hard to read and to understand. Through studying the relevant contents in DOT tutorial, I solve this problem by establishing a structure for each MUX. In DOT language, a structure is a rectangle whose area can be split into several subparts with literal notations. One example is as follows: Fig.14 A Record Structure Example I split one MUX structure into 3 subparts: input, output and control signals. For a typical steering logic, the MUX for register receives input data from operators or BUS and forwards data to corresponding register. Fig.15 Register MUX The MUX for operator receives data from registers and forwards data to corresponding operator input (there are two inputs for an operator so two MUXs for one operator). Fig.16 Operator MUX Both of two kinds of registers have control signals connected. By consulting with Jiajun who handle logic extraction part, we reach an agreement, assign 3 control signals to each MUX. I only need to know which signals control which MUX, I don't need to ascertain how the signals control the MUX. This resolves the state identification problem we stated in Step 2. The corresponding DOT language to describe a MUX structure is like: “struct_MUX_reg_2[shape=record,color=red,label="b1,b2,b3| {<up>MUX_reg_2|<down>out}"];”. The process of choosing control sig- 14 nal is also very interesting. We scan the file in Step 4 from top to bottom, for each part by the right side of sign “->” in one line that never existed before, I assign 3 control signals for its corresponding MUX in sequence. Thus Jiajun and me can reach the agreement on which signals control which MUX. The connection for one MUX is as follows: Fig.17 Connection of One MUX From Fig.17 we can see the data flow which is related with Reg_5 (in the middle of the first level) clearly. For each MUX we can see one line goes from the bottom of itself to the top, this is done in case there is no new data into MUX and meanwhile it can keep original value, when this happens, the control signal for it is “000”. 6. Add DOT subgraphs for operators One operator has 3 ports, 2 for inputs and one for output. I make use of the subgraph format of DOT language and construct one subgraph for each operator. An example for the subgraph description is like “subgraph cluster_comp_1_mul_op{comp_1_mul_op_a->comp_1_mul_op_o; 15 comp_1_mul_op_b->comp_1_mul_op_o;}” and the corresponding graph is as follows: Fig.18 Operator Subgraph 7. Rearrange DOT Graph The steering logic graph achieved by the above steps still have some problems, one of them is that the layout is not very methodical. The registers sometimes appear at the first level, sometimes the fourth, the same problem exists with other components, like operators and MUXs. To avoid this ambiguity, I use DOT method “rank” to arrange the position of different components in steering logic. The concrete working method is : “{rank=same;reg_1;reg_0;reg_8;reg_5;reg_11;} {rank=sink;struct_MUX_reg_2;struct_MUX_reg_1;struct_MUX_reg_3;struct_M UX_reg_5;struct_MUX_reg_8;struct_MUX_reg_11;}” In this way, MUXs are spread in the second and fourth level. 8. Reduce Repeated Lines When constructing MUX structure and subgraphs, redundant lines may be created because every line is scanned in the process and some names may appear in parts of different lines for multiple times. The final DOT graph generated for the minimum area scheme is as follows: 16 Fig.19 The Final Dot Graph for Minimum Area UP to now, the flow seems to reach the end, but there are some flaws here. Firstly, this program is not so easy to be transplanted from one system to another. It is written totally according to the VHDL file that is generated by GAUT Synthesis, by totally, I mean even the whitespaces and names given to different components in the file play an important role in determine whether the program will work or not. Secondly, the layout of the final picture still has room to improve, for example, the operators take too much room in the graph, a trapezoid-shaped MUX may make sense much better. But these are related with DOT language functions which are not so hard to solve. 17
© Copyright 2024 ExpyDoc