Hardware accelerator for anti-aliasing Wu's line algorithm using FPGA

Digital images suffer from a stair-step effect because they are built from small pixels. This effect is termed aliasing, and the methods used to reduce it are called anti-aliasing. This paper presents a hardware accelerator, designed using high-level synthesis (HLS), for an anti-aliasing algorithm that operates along straight-line segments or edges. These straight-line segments are smoothed by modifying the intensity of the pixels. Hardware implementations of two different architectures, both based on a Zynq FPGA, are presented in this work. The first architecture is built from one core, while the second is built from multiple cores and uses a parallel technique to speed up the algorithm: the line is divided into sub-segments, which are smoothed and drawn simultaneously to form the main line. This parallelism leads to a very fast execution of Wu's algorithm, with a hardware runtime of one-tenth of that of a single core. In addition, the optimized resource utilization and power consumption for different numbers of cores are compared: the single-core design utilizes 8% of the resources and consumes 1.6 W, while the 10-core design utilizes 77% with a power consumption of 2 W.


INTRODUCTION
Digitizing continuous 2D graphic primitives is one of the essential procedures in computer graphics and must be done at the sampling rate of the device resolution. Information lost during this process produces aliasing (a staircase effect). Aliasing can be avoided either by increasing the resolution of the raster device, which raises cost and is therefore not an economical solution, or by using available techniques that exploit gray scales to increase the effective spatial resolution [1,2]. This paper uses the well-known Wu's algorithm [1], an antialiased line generator that efficiently smooths object edges. To speed up rendering, two hardware solutions are proposed, which can be embedded in a larger design to form part of a computer graphics system. Bresenham's algorithm, on the other hand, is one of the first published algorithms [3] for plotting straight lines on a display device or a plotter, where the line is drawn as discrete points (pixels) on a grid. This algorithm was later modified by many researchers, extended to three dimensions, and used in several applications and hardware implementations, as the following works show. The researchers in [4] sped up the algorithm using the properties of linear Diophantine equations, obtaining a speed-up factor of almost five in scan converting a line segment. They claimed that their technique could be easily implemented on most hardware systems. Other researchers [5] proposed a new fast line drawing algorithm that differs from the original Bresenham's algorithm. Their algorithm, based entirely on integer calculations, is 3 times faster than Bresenham's algorithm in terms of average time. The optimization exploits slope symmetry, and the algorithm achieves a significant increase in overall efficiency compared with Bresenham's algorithm.
An additional advantage of that algorithm is its simplicity and suitability for hardware implementation, similar to Bresenham's algorithm. A three-dimensional extension based on the same idea as Bresenham's algorithm is presented in [6]; it relies on the minimum distance between the grid points and the plotted line. A Voronoi diagram organization is proposed for the grid points to which line pixels may be approximated. To increase the efficiency of the calculation, integer arithmetic and symmetry are also used in this 3D extension. The three-dimensional Bresenham's algorithm is simplified in [7] to suit hardware requirements at the implementation stage. In that work, all hardware outputs are compared with results from OpenGL for verification. The graphics sub-system for the three-dimensional Bresenham's algorithm is built for real-time applications using a Spartan-3E FPGA.
The soft processor with a 3D graphics coprocessor designed in [8] uses Bresenham's line drawing algorithm in its Line Generator stage and runs on the cheapest available FPGA board with an HDMI connector, which contains only 8K logic elements. A 3D stereo rendering architecture presented in [5] has also been successfully tested and its results verified. These results include the performance of the 3D rendering operations, which are based on the off-axis technique to create stereo pairs and on Bresenham's line drawing algorithm to draw objects, all implemented on FPGA hardware. The researchers in [9] sped up Bresenham's algorithm by partitioning each line into a number of segments, finding the points that belong to those segments, and then forming the overall line by drawing the segments simultaneously. By employing 32 cores in the field programmable gate array, a line of 992 points is formed in only 0.31 μs. The whole system is implemented on a Zybo board, which includes the Xilinx Zynq-7010 chip.
Although Bresenham's algorithm plots lines very rapidly, it does not perform anti-aliasing. Wu's algorithm, in contrast, provides anti-aliasing and is relatively fast, but it is slower than Bresenham's algorithm. The algorithm draws pairs of pixels along the line, each colored according to its distance from the line. Pixels at the line ends are calculated separately. Aliasing along edges or straight-line segments is investigated in [10] from different points of view, including its origin and the effect of line slope. The aliasing problem is then solved by modifying the gray level of each pixel to produce a smooth line segment. A hardware implementation of this method is finally formulated and tested using field programmable gate arrays (FPGAs).

THEORY
Antialiasing was studied in depth with the goal of combining two graphics methods in a computer graphics hardware accelerator with acceptable performance. This section is therefore divided into two parts. The first part explains the theory of Bresenham's line drawing algorithm, while the second part explains the details of Xiaolin Wu's line algorithm, since both are used in the design of the proposed system. The two algorithms are discussed in detail in the next sections, with numerical and experimental examples that show results which may affect the designed hardware.

Bresenham's line generation algorithm
The 2D Bresenham's algorithm is an incremental scan conversion algorithm that uses the minimum difference between distances to calculate pixel positions. It moves along the x-axis in one-pixel intervals and at each step selects, from two candidate y coordinates, the one that is nearer to the ideal line [3,7]. Bresenham's algorithm has the advantage of being a fast incremental algorithm that uses only integer calculations. The simplest form of this algorithm is shown in Figure 1. Its only drawback is the aliasing effect, which can be mitigated. In this work, the algorithm is used to calculate the endpoints of the individual line segments; Wu's algorithm is then used to display these segments anti-aliased [11,12]. Although Wu's algorithm is widely used in modern computer graphics because it draws smooth lines and so addresses the aliasing effect, the speed and simplicity of Bresenham's line drawing algorithm keep it important. It is essential in many software graphics libraries and is also used in the hardware design of contemporary graphics cards [7,13].
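The incremental, integer-only selection described above can be sketched in software as follows. This is a minimal illustration of the commonly used all-octant form of Bresenham's algorithm, not the HLS source of this work; the function name `bresenham` and the list-of-tuples output are assumptions made for the example.

```python
def bresenham(x0, y0, x1, y1):
    """Return the integer pixels of the line from (x0, y0) to (x1, y1).

    Illustrative sketch only: uses integer additions and comparisons,
    as in the classic algorithm, and works for any slope.
    """
    dx = abs(x1 - x0)
    dy = -abs(y1 - y0)
    sx = 1 if x0 < x1 else -1      # step direction along x
    sy = 1 if y0 < y1 else -1      # step direction along y
    err = dx + dy                  # combined error term
    pixels = []
    while True:
        pixels.append((x0, y0))
        if x0 == x1 and y0 == y1:  # reached the endpoint
            break
        e2 = 2 * err
        if e2 >= dy:               # error favours a step along x
            err += dy
            x0 += sx
        if e2 <= dx:               # error favours a step along y
            err += dx
            y0 += sy
    return pixels


if __name__ == "__main__":
    print(bresenham(0, 0, 5, 2))
```

Each pixel receives full intensity, which is why the resulting line shows the staircase pattern that anti-aliasing is meant to remove.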

Xiaolin Wu's line algorithm
Wu's algorithm is slower than Bresenham's algorithm, but it solves the aliasing problem. It distributes intensity between the nearest neighboring pixels so that their total intensity is constant, while the intensity of each pixel is determined by its relative distance from the line [1]. A magnified view of the relation between the desired line and its neighboring pixels is shown in Figure 2 [1,14]. The upper part of this figure shows a line drawn using Bresenham's algorithm, while the lower part shows the output of Wu's algorithm. At each step, Wu's algorithm processes the two pixels closest to the line and assigns them different intensities according to their distance from the ideal line. The desired line is drawn in yellow, while the distance to the nearest cell is shown in green for the lower pixel and in red for the upper pixel. If these distances are equal, both pixels receive the same intensity, 50% each [2]. Otherwise, the intensity is divided between the pixels on either side of the line as illustrated in Table 1. Finally, the traditional Wu's line antialiasing algorithm is shown in Figure 3.

Figure 3. Wu's algorithm
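The pairwise intensity split described above can be illustrated with a short software sketch. It is a simplified version of Wu's algorithm written for this explanation, not the hardware core of this work: it assumes integer endpoints (so the end pixels simply receive full intensity), and the function name `wu_line` and the `(x, y, intensity)` output format are hypothetical.

```python
import math


def wu_line(x0, y0, x1, y1):
    """Return (x, y, intensity) triples for an anti-aliased line.

    Simplifying assumption: integer endpoints, so the separate endpoint
    handling of the full algorithm reduces to full-intensity pixels.
    """
    steep = abs(y1 - y0) > abs(x1 - x0)
    if steep:                      # iterate along y for steep lines
        x0, y0, x1, y1 = y0, x0, y1, x1
    if x0 > x1:                    # always walk from left to right
        x0, x1, y0, y1 = x1, x0, y1, y0
    dx = x1 - x0
    pixels = []
    for x in range(x0, x1 + 1):
        # Exact position of the ideal line in this column.
        y = y0 + (y1 - y0) * (x - x0) / dx if dx else float(y0)
        base = math.floor(y)
        frac = y - base            # distance of the line above the lower pixel
        # The pair of pixels shares a total intensity of 1.
        for py, intensity in ((base, 1.0 - frac), (base + 1, frac)):
            if intensity > 0.0:
                pixels.append((py, x, intensity) if steep else (x, py, intensity))
    return pixels


if __name__ == "__main__":
    for px in wu_line(0, 0, 4, 2):
        print(px)
```

For the line from (0, 0) to (4, 2), the columns x = 1 and x = 3 lie exactly halfway between two pixels, so each pixel of the pair receives 50% intensity, while the remaining columns hit a pixel center and receive full intensity.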

THE PROPOSED HARDWARE ACCELERATOR
High-level synthesis (HLS) has been used in many designs over the past several years [15][16][17]. Its adoption continues to grow because it offers a fast way to convert complex algorithms into efficient hardware implementations, and computer graphics algorithms are complex by nature [18][19][20]. We therefore use HLS in our work to implement Wu's algorithm. In Wu's algorithm, the line pixel pairs are generated one pair after another, starting from the point (xa, ya) and moving toward the endpoint (xb, yb). The time needed to calculate all the pairs therefore grows with the length of the line, which slows the plotting process. Our approach fuses Bresenham's algorithm with Wu's algorithm. First, Bresenham's algorithm is used to divide the line into equal-length segments and to calculate the endpoint coordinates of the individual segments. Then, Wu's algorithm is used to anti-alias these line segments simultaneously. Several optimization directives, such as PIPELINE, DATAFLOW, and LOOP_MERGE, are used within the HLS environment to reduce the execution time of the designed algorithm.

Implementation using single-core
The ZC702 board, which is populated with the Zynq-7000 XC7Z020 AP SoC, is used as the hardware platform for implementing the proposed design. The device combines a processing system (PS) and programmable logic (PL) on a single chip in an SoC style [21]. Figures 4 and 5 illustrate the proposed single-core design of Wu's algorithm on this platform, and the corresponding flowchart is shown in Figure 6. The Zynq AXI4-Lite interface connects the hardware core of Wu's algorithm to the other hardware elements in the FPGA environment, such as the shared memory block and the processing system (Figures 4 and 5). The shared memory holds the initial line endpoints and then receives all the coordinates of the line segments generated by the hardware cores, so that they can be sent to the PC via the serial port (Figure 6).

Implementation using multi-core
The Zynq hardware platform used in our design comprises a processing system (PS) organized around a dual-core ARM Cortex-A9 processor and programmable logic (PL), which is equivalent to a traditional FPGA with additional features such as integrated memory, a variety of peripherals, and high-speed communication interfaces [21][22][23]. Figure 7 and Figure 8 illustrate the overall designed system and its flowchart, respectively. Bresenham's algorithm, which divides the line into pieces, runs on the ARM Cortex-A9 in the processing system (PS) of the Zynq chip. The calculated endpoints are then passed over the AXI4-Lite bus, which is simple and easy to use, to the programmable logic (PL) on the same chip.
Wu's algorithm runs on the programmable logic (PL), which contains up to 10 parallel cores, each of which can operate independently and simultaneously with the others. To smooth a line, it is therefore first divided into up to 10 equal-length parts, the endpoints of each part are computed separately, and all the anti-aliased parts are then drawn in parallel. As mentioned in the previous section, each hardware core is associated with a dedicated block memory that stores its output coordinates (Figure 7). We also implement the binary tree algorithm, which is simple and fast [24][25][26], to divide the original line into equal segments according to the number of hardware cores (10 in our system). It calculates the initial endpoints of each segment, and each pair of endpoints is then sent to a specific Wu's hardware core (Figure 8). As a result, the cores reduce the time required to find all the points of the line. In other words, Wu's algorithm is executed concurrently as many times as there are cores involved in the process. To decrease the total time needed to draw the line, the number of segments must be increased.
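The division of a line into per-core sub-segments can be sketched as follows. This is only an illustration of the partitioning idea: it uses rounded linear interpolation as a stand-in for the endpoint calculation performed on the PS in this work, and the function name `split_line` and its return format are assumptions made for the example.

```python
def split_line(xa, ya, xb, yb, cores):
    """Split the line (xa, ya)-(xb, yb) into `cores` nearly equal sub-segments.

    Returns a list of ((x_start, y_start), (x_end, y_end)) pairs, one per core.
    """
    if cores < 1:
        raise ValueError("cores must be at least 1")
    # Breakpoints by rounded linear interpolation (an assumed stand-in for
    # the endpoint calculation described in the text).
    points = [
        (xa + round((xb - xa) * k / cores), ya + round((yb - ya) * k / cores))
        for k in range(cores + 1)
    ]
    # Consecutive breakpoints form the endpoint pair handed to each core.
    return list(zip(points[:-1], points[1:]))


if __name__ == "__main__":
    for core, segment in enumerate(split_line(0, 0, 100, 50, 10)):
        print(core, segment)
```

In this sketch adjacent sub-segments share one endpoint, so a shared pixel would be produced twice if every segment were drawn in full; how the hardware cores treat such boundary pixels is not specified here.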

RESULTS AND ANALYSIS
In this work, all design steps, including the design optimization needed to meet timing requirements, are carried out using the Vivado Design Suite. The Xilinx Vivado Design Suite 2016.1 release supports the Zynq ZC702 along with a wide variety of FPGA devices. It replaces the previous design tool and adds features for high-level synthesis and SoC design [22,23,27,28].

Graphical analysis
Line segments of the same length but different slopes are first drawn directly using the 2D Bresenham's algorithm without smoothing, as depicted in Figure 9 (a). The same lines are then drawn with antialiasing using Wu's algorithm, as shown in Figure 9 (b). The figure clearly shows that Wu's algorithm effectively removes the aliasing effect. Next, the main line is divided into segments, with Bresenham's algorithm rapidly calculating the segment endpoints; the number of segments depends on the number of FPGA cores used. These cores work concurrently to antialias the segments using Wu's algorithm. The points generated by the two algorithms are plotted using Matlab, as shown in Figure 10. Colors are used to mark the start and end of each section. The line pixels are computed with (a) 1 core, (b) 4 cores, and (c) 10 cores.

Timing analysis
The Xilinx Zynq ZC702 board used in our design operates at a frequency of 156.25 MHz [24]. The cores share the calculation of the antialiased line pixels. The hardware runtime is halved when the number of cores used in the design is doubled. The best time achieved is 0.31 μs when 10 cores are involved. This time comprises the splitting time (in the PS) plus the time needed to calculate the antialiased pixels (in the PL). Figure 11 shows how the hardware running time decreases as the number of cores increases.

Figure 11. Inverse relationship between number of cores used and hardware time

Power consumption analysis
This section presents the power consumed by the designed hardware accelerator on the Zynq for different numbers of cores. The power consumption for the different numbers of cores used in our design is compared in Figure 12. With one core (1.6 W in total), the power consumed by the PS (95%) is clearly much larger than that consumed by the PL (5%). This gap narrows as the number of cores increases, reaching 69% in the PS and 31% in the PL with ten cores (2 W in total).

Resource evaluation
Each FPGA core designed on the Zynq ZC702 kit uses a variety of logic resources that are needed to build the user's digital circuit, such as look-up tables (LUTs), block RAM, and flip-flops [28]. The more cores are used, the higher the resource utilization. The high capability of the Zynq ZC702 platform and the Vivado Design Suite leads to excellent execution times. In return, the percentage of used resources increases directly with the number of cores, as shown in Figure 13.

CONCLUSION
Since a multi-core parallel system is used to draw anti-aliased lines, the larger the number of segments, the faster the smoothed pixels are calculated. The parallel use of these identical cores leads to a very fast execution of Wu's algorithm (0.13 μs), which is one-tenth of the hardware runtime of a single core. We conclude that FPGAs are a valuable platform for studying problems related to multi-core CPUs. Their flexibility allows different designs to be evaluated, and their ability to run full-length programs provides an advantage over software simulators. Partitioning an application between a core processor and a co-processor can be attractive in both theoretical and manufacturing settings. The resource utilization and power consumption of the proposed technique are evaluated on the Zynq evaluation board. The single-core design utilizes at most 8% of the resources and consumes 1.6 W, while the 10-core design utilizes 77% with a power consumption of 2 W.