International Journal of Electrical and Computer Engineering (IJECE) Vol. No. February 2019, pp. ISSN: 2088-8708. DOI: 10. 11591/ijece. Hardware/software co-design for a parallel three-dimensional bresenhamAos algorithm Sarmad Ismae1. Omar Tareq2. Yahya Taher Qassim3 1Electronic Engineering Department. College of Electronic Engineering. Ninevah University. Iraq 2Computer Engineering Department. College of Engineering. University of Mosul. Iraq 3School of Electronic Engineering. Griffith University. Australia Article Info ABSTRACT Article history: Line plotting is the one of the basic operations in the scan conversion. BresenhamAos line drawing algorithm is an efficient and high popular algorithm utilized for this purpose. This algorithm starts from one end-point of the line to the other end-point by calculating one point at each step. As a result, the calculation time for all the points depends on the length of the line thereby the number of the total points presented. In this paper, we developed an approach to speed up the Bresenham algorithm by partitioning each line into number of segments, find the points belong to those segments and drawing them simultaneously to formulate the main line. As a result, the higher number of segments generated, the faster the points are calculated. employing 32 cores in the Field Programmable Gate Array, a line of length 992 points is formulated in 0. 31s only. The complete system is implemented using Zybo board that contains the Xilinx Zynq-7000 chip (Z-7. Received Apr 8, 2018 Revised Jun 19, 2018 Accepted Jul 2, 2018 Keywords: BresenhamAos algorithm FPGA Line segmentation Parallelization System on Chip Copyright A 2019 Institute of Advanced Engineering and Science. All rights reserved. Corresponding Author: Yahya Taher Qassim. School of Electronic Engineering. Griffith University, 170 Kessels Road. Nathan. QLD 4111. Australia. Email: yahya-taher. qasim@griffithuni. INTRODUCTION One of the essential and main activities in computer graphics is the straight line drawing. To get the lines displayed in a short time, the computation speed of the applied algorithm is crucial . Several algorithms have been developed and employed for drawing straight lines . , . However, the utilization of complex and inefficient algorithms in generating straight lines then displaying their pixels could be slow and unacceptable to the user . Such slow algorithms may include heavy computations that consume time and Bresnham algorithm was highly regarded among other equivalent algorithms due to its suitability and efficiency since floating-point, multiplication and division operations do not exist in addition to its numerical scaling of integers only . There are several works in the literature concerning line rasterization including Bresenham algorithm, which had been implemented in different schemes and methods. In . , an improvement to Bresenham algorithm was performed by determining the number of pixels for each line then filling those pixels. The modified algorithm of . , which combines the symmetry nature of lines with Bresenham algorithm, improved the drawing speed by fifty percent. however, the work was only experienced in simulation. Drawing long straight lines negatively affects the efficiency of Bresenham algorithm, where the rasterization speed is extremely reduced in addition to missing the pixels of correlation . , . In order to speed up the drawing algorithm, hardware utilization is necessary. A popular hardware platform utilized for digital system design is the field programmable gate array (FPGA) . However, when the length of the line Journal homepage: http://iaescore. com/journals/index. php/IJECE Int J Elec & Comp Eng ISSN: 2088-8708 is extended, the time required to calculate the extra points would increase as well since Bresenham algorithm is based on finding the points of the specified line sequentially. Therefore, even with the involvement of hardware, the delay in drawing lengthy lines is likely to happen. In this paper, we have solved the problem of drawing lengthy straight lines in an extremely short Our approach of line segmentation and parallelization in implementing the 3D-Bresenham algorithm on single SoC (System on Chi. , . is presented. The parallelization approach is to handle several procedures at the same time and execute them independently . Therefore, we have segmented the line into several shorter lines of equal length and considered every segment as a separate process. In such a case, the time consumed in implementing one procedure is the same time in implementing multiple procedures. As a result, any extension in the length of the line can be covered by adding an additional procedure to compute the points reflecting that increase and this would not add further running time. This article is organized as follows. in Section 2, a brief description on the 3D Bresenham algorithm is outlined. Section 3 explains our line segmentation and parallelization approach utilized for drawing straight lines followed by the Zynq implementation in Section 4. In Section 5, the obtained results and analysis are given whereas the conclusions and future work are included in Section 6. THREE DIMENSIONAL BRESENHAMAoS LINE ALGORITHM The flow chart for three dimensional BresnhamAos algorithm . is presented in Figure 1, where the pixels of the line segment are generated in a three dimensional space. For each pixel, the x, y, z coordinates are calculated in the object space. BresnhamAos algorithm starts from one end-point of the line to the other end-point by calculating one point at each step. As a result, the calculation time for all the points depend on the length of the line thereby the number of the total points presented . Figure 1. The flow chart for the 3D-Bresenham algorithm From Figure 1, the vertices of the line segment . a, ya, z. b, yb, z. are read first then the greatest coordinate difference . x, dy, d. is calculated. The variables s1, s2, s3, inc1, inc2, inc3, p1, p2, p3 and m are given their assigned values according to the greatest coordinate difference. Following to this, the Hardware/software co-design for a parallel three-dimensional bresenhamAos algorithm (Sarmad Isma. A ISSN: 2088-8708 error values . 1 & e. are computed in order to find the increment value of x, y, z. The next step is to find the intermediate pixels of the line segment. Those points can be calculated each time by adding the incremented value to the x, y, and z coordinate. Finally, store the calculated point and return to calculate the new increment value. When the coordinate difference is equal to zero, the algorithm ends . MATERIAL AND METHODS As previously mentioned, in Bresenham algorithm, the line pixels are generated one by one beginning at the start point a . a, ya, z. towards the end point b . b, yb, z. In such a case, the time required to compute all the points is increasing as long as the line becomes longer, leading to a slower plotting process. Our approach in implementing the 3D-Bresenham algorithm is performed by line segmentation and procedure parallelization. The line is divided into equal-length segments using the ARM A9 Cortex or the processor system (PS) available on the Zynq chip followed by sending them to the FPGA located on the same chip. The FPGA or the programmable logic (PL) contains up to 32 individual cores and each of those cores can perform a separate procedure concurrently with other cores. Therefore, to draw a line, the latter can be divided into up to 32 equal-length segments then calculate the points of each segment separately in one of the cores and in parallel with the other segments. As a result, the number of cores employed divides the time required to find the total points of the line. In other words. Bresenham algorithm is implemented simultaneously with a number of times equivalent to the number of cores engaged in the The time required to draw a line can be controlled. For example, the increase of segmentsAo number decreases the total time required to draw the line or keep the time constant when the length of the original line increased by generating additional segments. Figure 2 describes the Zynq architecture . showing the PS part and only two cores . ut of . at the PL side. The AXI4-Lite is the communication bus interface between the ARM cortex A9 processor and the FPGA. It can be programmed to work in more than one We have appointed it as AXI4-lite bus, which is simple, easy and does not require memory mapping . Figure 2. The Xilinx Zynq SoC including the ARM cortex (PS) and showing two of the FPGA (PL) cores. The BR3D refers to Bresenham algorithm . hree-dimensiona. For the practical implementation of Bresenham algorithm, the line pixels are described by the coordinates . , y, . Each pixel is represented by 32 bit in total as follows: 10 bit for x, 10 bit for y, 10 bit for z coordinate and 2 bit are not applicable. Since the FPGA-BRAM . onnected to each cor. is of size 1024y32 bit, which is fixed for each line segment, the straight line plotting could start at the coordinate . , 0, . and ending at the coordinate . 3, 1023, 1. The maximum length of a single line segment is 1024 point. As mentioned, the number of cores employed depends on the number of segments where each core is performing pixel computation of one segment at a time. Table 1 shows different number of cores employed or number of line segments and the maximum length of rasterized line in pixels. When all the FPGA cores are involved in line computation, the maximum length of the line generated is 32768 pixels. The number of cores/segments versus maximum length of line in pixels shown in Table 1. Int J Elec & Comp Eng. Vol. No. February 2019 : 148 - 156 Int J Elec & Comp Eng ISSN: 2088-8708 Table 1. The Number of Cores/Segments versus Maximum Length of Line in Pixels No. of cores or line segments Max length of rasterized line . n pixe. 1024 * 2 = 2048 ( 2K pixel ) 1024 * 4 = 4096 ( 4K pixel ) 1024 * 8 = 8192 ( 8K pixel ) 1024 * 16 =16384 . K pixel ) 1024 * 32 =32768 . K pixel ) THE ZYNQ IMPLEMENTATION As mentioned, the Zynq chip located on the Zybo board contains two main computation parts: the ARM processor (PS) and the FPGA (PL) as in Figure 2. The PS is allocated for algorithms with high computation complexity whereas the PL is utilized to implement logical system design . Therefore, the whole system is programmed in C language to perform the following tasks for each of the PS and the PL The PS is directed to perform line segmentation by partitioning the main line into number of segments assigned according to the number of PL cores. In other words, the processor will generate the start and end coordinates, i. , y, . coordinate for the terminal points for every segment before sending them to the FPGA. Since the terminal points of each segment are known, the FPGA cores start calculating the total segmentAos points according to the 3D Bresenham algorithm. At the end, all the cores that involved in computation complete at the same time and the coordinates of all points are calculated and stored in the BRAM. When all the points are computed and stored, their values are send back to the PS part one by one for Finally, the points are rasterized on the computer screen formulating the specified line. Core operation Figure 3 shows the internal architecture of one of the FPGA 32 cores. It consists of numbers of 32-bit register utilized for initialization and signal separation. The coordinates of the terminal points P1 and P2 are entered through Reg0 and Reg1 respectively. The three dimensional Bresenham algorithm (BR3D) unit contains the logic configured reflecting this algorithm, which computes the points of the assigned When the unit completes calculating the points, bit10, the Ready (Rd. bit in Reg2 is set and bit0 to bit9 will hold the number of the calculated points (NCP). Each calculated point is stored in the dual port block RAM that can be fetched through Reg4 (P_ou. when the corresponding address is given through Reg3. Figure 4 illustrates the block diagram designed to implement Bresenham algorithm inside the BR3D unit in Figure 3. In every clock pulse, a new point is generated from point 1 . 1, y1, z. and point 2 . 2, y2. The architecture contains three subtraction units, three units for the absolute value (ABS) and three comparators, a multiplexer and finally the calculation unit. The intermediate signals s1, s2, s3, inc1, inc2, inc3, p1, p2, p3 and p4 were described in Figure 1. The combination of all of these blocks execute Bresenham algorithm. Figure 3. The internal architecture of the FPGA core designed for the 3D Bresenham algorithm Hardware/software co-design for a parallel three-dimensional bresenhamAos algorithm (Sarmad Isma. A ISSN: 2088-8708 Figure 4. The implementation of Bresenham algorithm inside the FPGA cores In Figure 5, the timing simulation of the main signals in Figure 4 is shown. The start point . 1, y1, z. and the end . 2, y2, z. coordinates are all inputs to the architecture giving the calculated output point . _out, y_out, z_ou. varies at each clock pulse. The output points combined represent the line segment. The start point in this example is . , 2, . and the end point is . , 15, . It can be noticed from Figure 5. that some coordinates may change value every two-clock pulses such as z-out whereas one clock pulse is enough to change value in case of the x-out coordinatedue due to the variation in the coordinate difference between its start and end. Figure 5. The timing waveform for the input/ output coordinates of one segment calculated using 3D-Bresenham algorithm Int J Elec & Comp Eng. Vol. No. February 2019 : 148 - 156 Int J Elec & Comp Eng ISSN: 2088-8708 Timing constraints Meeting timing constraints remains an essential part in any digital system design. In this work, all the design procedures are performed through the Vivado software package including design optimization, which is essential to meet timing requirements. For example, when a specified frequency of operation is assigned, such as 100MHz. Vivado will optimize the design according to the period of Aucreate clockAy. This is essential to achieve the targeted frequency according to the worst failing path in the design. The Xilinx recent release of Vivado Design Suit 2017. 2 supports the Zynq SoC with a wide variety of FPGA devices. supersedes previous design tool by its additional features of high-level synthesis and SoC . RESULT AND DISCUSSIONS Timing analysis In our design, the Zybo board operates at 100MHz clock rate and the architecture calculates one point in every segment in one clock pulse. The number of FPGA cores employed share calculating the number of line points. The hardware runtime is reduced to half when the number of cores employed is The fastest runtime achieved is 0. 31s when all the 32 FPGA cores are involved. This time represent the segmentation time . n the PS) in addition to the time required for pointsAo calculation . n the PL). Table 2 lists a comparison between this work and other relevant works. Although the comparison is not on exact coordinates, the runtime in this work clearly advances the running time of other works. Figure 6 reflects the decrease in the hardware running time against the increase in the number of cores employed for the line of 992 point. Table 2. Runtime Line Drawing Compared to the Literature Start , 0, . , 0, . , 0, . Reference Year This work . FPGA core. This work . ne FPGA cor. Bresenham 3D ess hardwar. Bresenham 3D line generation based on line pixels . Bresenham improved algorithm A fast line rasterization algorithm , 992, . , 992, . 0, 1000, 1. Operating Zybo SoC board Zybo SoC board Spartan 3E FPGA Running 31AAs 92AAs 16AAs , 0, . 0, 1000, 1. Spartan 3E FPGA 71AAs , . , . , . 0, 1. 0, 1. 0, 1. Personal computer Personal computer Personal computer End coordinates Figure 6. The reciprocal relationship between the FPGA cores employed and the hardware runtime Graphical analysis The main 3D line of 992 point is depicted in Figure 7 in four different cases. The number of segments generated from the main line using our approach of parallel implementation of Bresenham algorithm depends on and equal to the number of FPGA cores utilized. The generated points are plotted using Matlab. Colors are used to distinguish the start and end of each segment. Hardware/software co-design for a parallel three-dimensional bresenhamAos algorithm (Sarmad Isma. A ISSN: 2088-8708 . Figure 7. The line of 992 point computed in . one-FPGA core, . 2-FPGA cores, . 4-FPGA cores, . 32-FPGA cores and formulated with 32 different color segments each of 31 point Int J Elec & Comp Eng. Vol. No. February 2019 : 148 - 156 Int J Elec & Comp Eng ISSN: 2088-8708 Figure 7. The line of 992 point computed in . one-FPGA core, . 2-FPGA cores, . 4-FPGA cores, . 32-FPGA cores and formulated with 32 different color segments each of 31 point . Resource evaluation Each of the FPGA cores in the Zynq SoC contains variety of logic resources essential to build the assigned digital circuit by the user, such as the look-up tables (LUT), block RAM and flip-flops. The more FPGA cores are employed, the more utilization of resources is resultant. The parallel usage of these identical cores leads to a very fast implementation of Bresenham algorithm . for a line with a high number of points . The high ability of the Zybo platform and the Vivado software package lead to excellent achievement in the running time of Bresenham algorithm. However, the percentages of resource utilization are increased directly with the higher number of cores employed as Figure 8 shows. Figure 8. The increase in the number of FPGA cores employed against the increase in the percentages of resource utilization CONCLUSION The characteristic of procedure parallelization is extremely practical for any digital system design when the architecture of the available hardware supports parallelism. With the employment of the Zynq SoC that include 32 identical cores in its PL part, parallel processing of identical procedures was achievable. In this paper, this was applied to divide a line of 992 points into a maximum of 32 identical segments and implement Bresenham algorithm in parallel to compute the points of each of those segments. The hardware runtime for a line of 992 point was reduced from 9920ns . n case of normal implementation of Bresenham Hardware/software co-design for a parallel three-dimensional bresenhamAos algorithm (Sarmad Isma. A ISSN: 2088-8708 algorithm on the Zybo boar. down to only 0. 31s when the parallel implementation is handled. This makes the followed procedure suitable for real time graphical applications. Future applications to the concept of parallelization can include drawing a cube or other complex It can also be effective in drawing a 3D polygon shape using the same architecture in calculating one polygon line per core to end with formulating all the polygon lines at the same time, or even rendering a polygon mesh. In addition, drawing several lines in parallel . aximum 32 line. will speed up the overall graphical presentation. Moreover, the hardware-software co-design can also be expanded to facilitate the association between the PL (FPGA) and the PS (ARM-Corte. blocks in the Zynq chip. REFERENCES