# Design and Implementation of Scalable and Reconfigurable Approximation of DCT

## B. Hymavathi<sup>1</sup>, Dr. S. Balaji<sup>2</sup>, D. Anjaneyulu<sup>3</sup>

<sup>1</sup>M.Tech, CMR Institute of Technology, Medchal Road, Hyderabad, Telangana, India

<sup>2</sup>M.Tech, Ph.D, Professor, CMR Institute of Technology, Medchal Road, Hyderabad, Telangana, India

<sup>3</sup>Assistant Professor, M.Tech, CMR Institute of Technology, Medchal Road, Hyderabad, Telangana, India

**Abstract:** Digital image processing (DIP) began as a branch of digital signal processing (DSP) and quickly became one of its leading areas. Its importance continues to grow across research fields, from low-level to high-level applications. Image and video compression is a key operation in DIP. Many compression algorithms have been proposed in the literature over the years, but achieving the desired performance remains challenging. The Discrete Cosine Transform (DCT) addresses many of the shortcomings of earlier algorithms and has largely replaced the Fast Fourier Transform (FFT) in compression. The DCT represents real-valued image data with real coefficients, whereas the FFT produces complex-valued coefficients. The proposed design can be used to compute a 64-point DCT, or four 16-point DCTs or eight 8-point DCTs in parallel. The proposed method achieves lower arithmetic and computational complexity than traditional methods. The DCT used in image compression can be replaced with the modified approximate DCT.

Keywords: Discrete Cosine Transform, image and video compression, 8-point DCT

## 1. Introduction

Very-large-scale integration (VLSI) has brought revolutionary changes to microchip design. According to Moore's law, the number of transistors on a chip doubles roughly every 18 months. VLSI is commonly defined as the integration of millions of transistors on a single chip, with feature sizes of tens of nanometers. Integrating millions of FETs through VLSI makes it possible to perform the operations required in major research areas such as medical data processing, digital signal processing, and advanced robotics for artificial intelligence.

An image in its original representation carries an enormous amount of information and therefore needs a large amount of memory for storage. Compression is an important area of image processing that efficiently removes visually insignificant information. Compressed images are sent over bandwidth-limited channels with some additional processing for robust (error-free) transmission. Transform-based compression is the most widely preferred choice; it consists of an image transform (on non-overlapping blocks), quantization of the transformed coefficients, and entropy coding. The Joint Photographic Experts Group (JPEG) is a committee that standardizes the compression algorithm. The 8×8 block-wise two-dimensional discrete cosine transform (2-D DCT) is used as the orthogonal transform in JPEG compression, and images compressed by this standard are used globally. The algorithm lets the user trade off the amount of compression against quality according to the needs of the application. This variable compression makes the algorithm well suited to transmission, because the user can adjust the bit rate to match the available data rate.

## 2. Motivation

Embedded processors are the most widely used processors in everyday applications, and the technology built around them has made them stable in real-time use. A digital system can be implemented in hardware using general-purpose processors, embedded digital signal processors, reconfigurable logic or processors, application-specific instruction set processors (ASIPs), or dedicated hardware. Each option has its own advantages and disadvantages, and choosing the right one for the application gives the best performance. Embedded processors offer low cost and low power consumption, while the cost of DSP processors varies with the application area. DSP processors are classified into three categories: low-end, midrange, and high-end. Modern real-time applications run at high speed and therefore demand high-speed processors as a minimum requirement; the main obstacles are high power consumption and hardware cost. A high-speed system needs an efficient compression approach, and traditional compression approaches introduce unnecessary complexity that ultimately degrades system performance. The introduction of the DCT changed the course of compression and replaced traditional approaches because of its simplicity and popularity.

The disadvantages of traditional algorithms call for a more advanced compression approach. The motivation for the proposed method lies in its structured flow of operations, which is highly effective in reducing computational complexity. The necessary features are satisfied by implementing an approximation of the DCT. A novel approximate DCT is proposed, based on a recursive sparse DCT matrix. Its arithmetic complexity is lower than that of existing DCT approximation approaches, and it is orthogonal for different transform lengths, which results in lower error energy than earlier methods. The proposed method extends to higher-size DCTs through the decomposition process. Interestingly, the proposed algorithm is easily scalable for hardware as well as software implementation of DCTs of higher lengths, and it can make use of the best of the existing approximations of the 8-point DCT.

#### Volume 6 Issue 1, January 2017 <u>www.ijsr.net</u> Licensed Under Creative Commons Attribution CC BY

## 3. Background

The Discrete Cosine Transform (DCT) expresses a finite sequence of data points as a sum of cosine functions oscillating at different frequencies. DCTs are important to numerous applications in science and engineering, from lossy compression of audio (e.g. MP3) and images (e.g. JPEG), where small high-frequency components can be discarded, to spectral methods for the numerical solution of partial differential equations. The use of cosine rather than sine functions is critical for compression, since fewer cosine functions are needed to approximate a typical signal, whereas for differential equations the cosines express a particular choice of boundary conditions. The DCT is mainly used in image and video compression. Specifically, a DCT is a Fourier-related transform similar to the discrete Fourier transform (DFT), but using only real numbers.
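For reference, the most common variant, the DCT-II, can be written as follows for an input sequence $X(0), \dots, X(N-1)$. The index names $n$ and $k$ are notation introduced here for illustration, and normalization factors are omitted.

$$F(k) = \sum_{n=0}^{N-1} X(n)\cos\left[\frac{\pi}{N}\left(n + \frac{1}{2}\right)k\right], \qquad k = 0, 1, \dots, N-1$$

Every coefficient $F(k)$ is real whenever the input is real, which is the sense in which the DCT uses only real numbers.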

Over the past couple of decades, demand for communicating multimedia data through telecommunication networks and for accessing multimedia data through the internet has grown explosively. Image compression is extremely important for efficient transmission and storage of images. Several compression techniques are available, but there is still a need to develop faster and more robust techniques to compress images. Image compression addresses the problem of reducing the amount of data required to represent a digital image. The underlying basis of the reduction process is the removal of redundant data. Three types of redundancy are relevant to images: spatial redundancy, psychovisual redundancy, and spectral redundancy. By using data compression techniques, it is possible to remove some of the redundant data. This avoids wasting file size and allows more images to be stored in a given amount of disk or memory space. Various transformation techniques are used for data compression.

The Discrete Cosine Transform (DCT) is the most commonly used transformation. The use of cosine rather than sine functions is essential for compression, since fewer cosine functions are needed to approximate a typical signal. The DCT has a high energy compaction property and requires fewer computational resources. The main advantage of approximating the DCT is to get rid of the multipliers that account for most of the computational complexity. Approximation here means rounding each fractional coefficient to the nearest integer value, i.e., zero or one in magnitude. This is done because fractional calculations take more computation time; if the fractional values are rounded to the nearest value, the computation time is reduced. Most existing algorithms for approximating the DCT target only small transform lengths. Extending them to higher lengths, such as 16-point and 32-point, is not feasible, and some of them are non-orthogonal. If the transform is orthogonal, its inverse always exists, and the kernel matrix of the inverse transform is obtained by simply transposing the kernel matrix of the forward transform.
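The rounding idea described above can be illustrated with a minimal sketch. It builds the exact orthonormal 8-point DCT matrix, scales it by 2, and rounds every entry to the nearest integer, which leaves only the values -1, 0, and 1. The scaling factor of 2 and all function names are assumptions made for illustration; the text does not specify them. The sketch then checks that the rows of the rounded matrix stay mutually orthogonal, which is what lets the inverse be obtained from the transpose up to a per-row scaling.

```python
import math

N = 8


def exact_dct_matrix(n=N):
    """Orthonormal DCT-II matrix as a list of rows."""
    rows = []
    for k in range(n):
        scale = math.sqrt(1.0 / n) if k == 0 else math.sqrt(2.0 / n)
        rows.append([scale * math.cos((2 * j + 1) * k * math.pi / (2 * n))
                     for j in range(n)])
    return rows


def rounded_dct_matrix(scale=2.0):
    """Round the scaled exact matrix (assumed scale factor: 2)."""
    return [[int(round(scale * c)) for c in row] for row in exact_dct_matrix()]


def gram(m):
    """Return M * M^T, the matrix of inner products between rows."""
    return [[sum(a * b for a, b in zip(r1, r2)) for r2 in m] for r1 in m]


T = rounded_dct_matrix()
G = gram(T)

for row in T:
    print(row)
print("row energies:", [G[i][i] for i in range(N)])
print("rows mutually orthogonal:",
      all(G[i][j] == 0 for i in range(N) for j in range(N) if i != j))
```

Because the row energies are not all equal, the rounded matrix is orthogonal but not orthonormal; a diagonal scaling, which can typically be folded into a later quantization step, restores unit-norm rows.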

## 4. Literature Survey

(1) A novel data transfer approach was proposed by Harada et al. in 2014. Data transfer in this approach is packet based, and it is named the packet data transfer scheme (PDTS). In traditional data transfer schemes the size of the control/configuration memory (CCM) is an unresolved issue; this scheme reduces it. In VLSI, the size of the CCM depends on the number of modules distributed in a reconfigurable manner and also on the read operations in the respective memories. Traditional data transfer schemes support only inter-LME transfer, whereas PDTS supports both inter-LME and inter-cell data transfer effectively. Data validity is indicated by flag information generated by Differential-Pair Circuits (DPCs), which are also used to control the current sources [10].

Information is transferred in the form of packets. Transmission is enabled when the control signal changes, and the ratio of LMEs per cell is proportional to the size of the CCM. Data transfer operations, especially write and load, are triggered automatically when the destination address matches the register ID address, and this matching reduces the CCM size. Finally, power saving is accomplished autonomously on the basis of the control flag information, and the CCM is distributed to every cell according to its priority.

(2) A new data transfer scheme based on register-level packets was proposed by Fujioka et al. in 2012. The proposed scheme is a routing scheme intended to reduce the size of the configuration memory of a Dynamically Reconfigurable Processor (DRP) [9]. Various schemes have been proposed in the literature to reduce the configuration memory size, but because they allocate packets in every clock cycle they fail to achieve the desired result. The proposed method avoids allocating packets in every clock cycle, which drastically reduces the configuration memory size. Packet collision is another major cause of large configuration memory, and it is resolved by using bufferless routers, which allows a compact DRP to be constructed.

#### International Journal of Science and Research (IJSR) ISSN (Online): 2319-7064 Index Copernicus Value (2015): 78.96 | Impact Factor (2015): 6.391

Data transfer is initiated by the local memories (LMs), and their reconfiguration is always an area of concern. In this method a packet data control scheme controls the dynamic reconfiguration of the LMs. Compared with a conventional DRP, the proposed DRP reduces the configuration memory size by 1/10 of a functional unit (a 10% reduction). Collision-free packet transmission results from the compact router, which is well suited to fine-grain packet transfer and allows a large number of routers to operate in ultra-parallel mode. Finally, this approach succeeds in fitting a large number of routers into the same chip area, where traditional methods fail, and the capability for ultra-parallel processing is greatly improved.

(3) In VLSI, the utilization ratio is a major drawback, as it keeps declining and limits performance. A novel approach named multiple-valued reconfigurable VLSI was proposed by Okada et al. in 2009. Improving the utilization ratio of hardware resources is considered more challenging than for software resources, because hardware resources are composed of different elements for different operations, and these elements have different utilization ratios, which degrades performance [8]. In conventional systems the utilization ratio estimate is based on a single architecture, and this approach fails to produce the desired result. In that paper a hybrid scheme is proposed to improve the utilization ratio of hardware resources. The hybrid architecture is composed of cells, and each cell is multiplexed using a 2-to-1 multiplexer.

Various works in the literature have attempted to realize distributed control, but only the proposed scheme achieves good distributed control, because the interconnections between the controllers and logic modules are relatively short. Estimating resource utilization incurs a small overhead in additional hardware elements. Finally, the hybrid architecture reduces the complexity of the interconnections, mainly through the superposition of control and data signals.

## 5. Existing Method

The discrete cosine transform (DCT) is a numerically intensive operation with a wide impact on applications and compression standards. The use of the 8-point DCT in digital circuit design has driven the development of fast DCT algorithms. Ever-increasing complexity is the primary concern in system design, and the complexity varies with the size of the application. A comprehensive class of low-complexity 8-point DCT approximations based on integer functions was proposed by R. J. Cintra, F. M. Bayer, and C. J. Tablada in 2014. The low-complexity 8-point DCT drastically reduces the computational complexity and is compatible with various architectures. That paper presents approximate 8-point DCTs, including a collection of 12 approximations obtained with integer functions: the ceiling, truncation, floor, and rounding-off functions. The 12 approximations are intended to meet the following criteria: (i) low numerical complexity, given that among the many algorithms reported in the literature in past decades the DCT is the most efficient and supports a wide range of applications; (ii) orthogonality, meaning that the rows of the transform matrix are mutually orthogonal; and (iii) low-complexity inversion.

The approximations are obtained by varying the scaling and integer functions, and earlier approximation algorithms appear as special cases of the methodology; in particular it includes the signed DCT and the rounded DCT. The paper also introduces four orthogonal approximations, and the fast algorithms for all four are based on matrix factorization. The resulting approximations require only additions and are free of multipliers. Finally, the approximations are compared with the exact DCT, and their performance is tested in JPEG compression of images.

## 6. Proposed Method

We discuss the proposed scalable architecture for the computation of approximate DCTs of lengths 16 and 32. We derive a theoretical estimate of its hardware complexity and discuss the reconfiguration scheme.

### A. Proposed Scalable Design

The basic computational block of the algorithm for the proposed DCT approximation is given in [6]. The diagram of the DCT computation based on $\hat{C}_8$ is shown in Fig. 1. For a given input sequence $X$, the approximate DCT coefficients are obtained by $F=\hat{C}_N X^t$. The basic 8-point approximation requires 22 additions and no multiplications.



**Figure 1:** Signal flow graph (SFG) of $\hat{C}_8$. Dashed arrows represent multiplications by -1

A further approximation of the DCT requires only 14 additions, as shown in Fig. 2. The reconfigurable and scalable 64-point DCT is implemented using this 14-addition approximation. The main objective is to reduce power and computation time. The area of the 64-point DCT built from the 14-addition approximation is smaller than that of the 64-point DCT built from the 22-addition approximation.



Figure 2: Approximation of DCT using 14 additions



**Figure 3:** Block diagram of the proposed DCT for N=16 ($\hat{C}_{16}$)

An example of the block diagram of $\hat{C}_{16}$ is illustrated in Fig. 3, where two units for the computation of $\hat{C}_8$ are used together with an input adder unit and an output permutation unit. The functions of these two blocks are given in (8) and (6), respectively. Note that the structure of the 16-point DCT of Fig. 3 can be extended to obtain DCTs of higher sizes. For example, the structure for the computation of a 32-point DCT can be obtained by combining a pair of 16-point DCTs with an input adder block and an output permutation block.
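The recursive structure described above can be sketched in a few lines. The sketch makes several assumptions that are not spelled out in the text: the 8-point kernel is taken to be the rounded {-1, 0, 1} matrix of [6], the input adder unit is modelled as sums and differences of mirrored samples, and the output permutation unit is modelled as an interleaving of the two half-size outputs. All function names are hypothetical. The helper addition_count assumes that each adder stage costs N additions on top of two half-size transforms, with the base cost of the 8-point kernel passed as a parameter.

```python
# Assumed 8-point kernel: rounded DCT matrix with entries in {-1, 0, 1}.
C8 = [
    [1, 1, 1, 1, 1, 1, 1, 1],
    [1, 1, 1, 0, 0, -1, -1, -1],
    [1, 0, 0, -1, -1, 0, 0, 1],
    [1, 0, -1, -1, 1, 1, 0, -1],
    [1, -1, -1, 1, 1, -1, -1, 1],
    [1, -1, 0, 1, -1, 0, 1, -1],
    [0, -1, 1, 0, 0, 1, -1, 0],
    [0, -1, 1, -1, 1, -1, 1, 0],
]


def approx_dct(x):
    """Unnormalised approximate DCT for a length of 8 * 2**m."""
    n = len(x)
    if n == 8:
        return [sum(c * v for c, v in zip(row, x)) for row in C8]
    half = n // 2
    # Input adder unit (assumed form): mirrored sums and differences.
    a = [x[i] + x[n - 1 - i] for i in range(half)]
    b = [x[i] - x[n - 1 - i] for i in range(half)]
    fa, fb = approx_dct(a), approx_dct(b)
    # Output permutation unit (assumed form): interleave the two halves.
    out = [0] * n
    out[0::2] = fa
    out[1::2] = fb
    return out


def addition_count(n, base=22):
    """Additions for an n-point transform built from the recursion."""
    return base if n == 8 else 2 * addition_count(n // 2, base) + n


def transform_matrix(n):
    """Recover the n x n matrix by transforming unit vectors."""
    cols = [approx_dct([1 if j == i else 0 for j in range(n)])
            for i in range(n)]
    return [[cols[j][k] for j in range(n)] for k in range(n)]


def rows_orthogonal(m):
    """True if every pair of distinct rows has zero inner product."""
    return all(sum(p * q for p, q in zip(m[i], m[j])) == 0
               for i in range(len(m)) for j in range(i))


for n in (16, 32, 64):
    print(n, addition_count(n), rows_orthogonal(transform_matrix(n)))
```

Under these assumptions the rows of the larger matrices stay mutually orthogonal at every size, because each stage only mirrors and sign-flips rows of the half-size matrix. Passing a different base value to addition_count models a kernel with a different addition count.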

### B. Complexity Comparison

To assess the computational complexity of the proposed $N$-point approximate DCT, we need to determine the computational cost of the matrices quoted in (9). As shown in Fig. 1, the approximate 8-point DCT involves 22 additions. Since the output permutation has no computational cost and the input adder unit requires $N$ additions for an $N$-point DCT, the arithmetic complexities of the 16-point, 32-point, and 64-point DCT approximations are 60, 152, and 368 additions, respectively. More generally, the arithmetic complexity of the $N$-point DCT is equal to $N(\log_2 N - 1/4)$ additions.

### C. Proposed Reconfiguration Scheme



Figure 4: Proposed reconfigurable architecture for approximate DCT of lengths N=8 and 16

As specified in the recently adopted HEVC [10], DCTs of different lengths such as 16 and 32 are required in video coding applications. Therefore, a given DCT architecture should be reusable for DCTs of different lengths instead of using separate structures for each length. We propose here such reconfigurable DCT structures, which can be reused for the computation of DCTs of different lengths. The reconfigurable architecture for the implementation of the approximate 16-point DCT is shown in Fig. 4. It consists of three computing units, namely two 8-point approximate DCT units and a 16-point input adder unit that generates a(i) and b(i). The input to the first 8-point DCT approximation unit is fed through eight MUXes that select either [a(0)...a(7)] or [x(0)...x(7)], depending on whether it is used for 16-point or 8-point DCT calculation. Similarly, the input to the second 8-point DCT unit (Fig. 4) is fed through eight MUXes that select either [b(0)...b(7)] or [x(8)...x(15)], depending on whether it is used for 16-point or 8-point DCT calculation. On the other hand, the output permutation unit uses 14 MUXes to select and reorder the output depending on the size of the selected DCT. The signal sel16 is used as the control input of the MUXes to select the inputs and to perform the permutation according to the size of the DCT to be computed. Specifically, sel16=1 enables the computation of a 16-point DCT and sel16=0 enables the computation of a pair of 8-point DCTs in parallel. Consequently, the architecture of Fig. 4 allows the calculation of a 16-point DCT or two 8-point DCTs in parallel.
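A behavioural model can make the role of sel16 concrete. The sketch below is not the hardware description used in this work; it assumes a rounded {-1, 0, 1} 8-point kernel, an input adder of the form a(i) = x(i) + x(15-i) and b(i) = x(i) - x(15-i), and an output stage that interleaves the two 8-point results in 16-point mode. The function names dct8 and reconfigurable_dct16 are hypothetical.

```python
# Assumed 8-point kernel with entries in {-1, 0, 1}.
C8 = [
    [1, 1, 1, 1, 1, 1, 1, 1],
    [1, 1, 1, 0, 0, -1, -1, -1],
    [1, 0, 0, -1, -1, 0, 0, 1],
    [1, 0, -1, -1, 1, 1, 0, -1],
    [1, -1, -1, 1, 1, -1, -1, 1],
    [1, -1, 0, 1, -1, 0, 1, -1],
    [0, -1, 1, 0, 0, 1, -1, 0],
    [0, -1, 1, -1, 1, -1, 1, 0],
]


def dct8(x):
    """One 8-point approximate DCT unit (multiplication-free in hardware)."""
    return [sum(c * v for c, v in zip(row, x)) for row in C8]


def reconfigurable_dct16(x, sel16):
    """sel16=1: one 16-point DCT; sel16=0: two parallel 8-point DCTs."""
    if len(x) != 16 or sel16 not in (0, 1):
        raise ValueError("expected 16 samples and sel16 in {0, 1}")
    # 16-point input adder unit (assumed form).
    a = [x[i] + x[15 - i] for i in range(8)]
    b = [x[i] - x[15 - i] for i in range(8)]
    # Input MUXes: adder outputs in 16-point mode, raw samples otherwise.
    in0 = a if sel16 else x[0:8]
    in1 = b if sel16 else x[8:16]
    y0, y1 = dct8(in0), dct8(in1)
    if sel16:
        # Output permutation (assumed form): interleave the two results.
        out = [0] * 16
        out[0::2] = y0
        out[1::2] = y1
        return out
    # 8-point mode: the two results are passed through side by side.
    return y0 + y1


samples = list(range(16))
print(reconfigurable_dct16(samples, 1))
print(reconfigurable_dct16(samples, 0))
```

Both modes produce 16 coefficients from the same two 8-point units, which is the reuse the reconfigurable structure is meant to provide.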



**Figure 5:** Proposed reconfigurable architecture for approximate DCT of lengths 8, 16 and 32

A reconfigurable design for the computation of 32-, 16-, and 8-point DCTs is given in Fig. 5. It performs the calculation of a 32-point DCT, two 16-point DCTs in parallel, or four 8-point DCTs in parallel. The architecture consists of a 32-point input adder unit, two 16-point input adder units, and four 8-point DCT units. Reconfigurability is achieved by three control blocks composed of 64 2:1 MUXes along with 30 3:1 MUXes. The first control block decides whether the DCT size is 32 or lower: it selects the input data either for the 32-point DCT or for the DCTs of lower lengths.



Figure 6: Proposed reconfigurable architecture for approximate DCT of lengths 8, 16, 32 and 64

A reconfigurable design for the computation of 64-, 32-, 16-, and 8-point DCTs, using either the 22-addition or the 14-addition 8-point approximation, is presented in Fig. 6. It performs the calculation of a 64-point DCT, two 32-point DCTs in parallel, four 16-point DCTs in parallel, or eight 8-point DCTs in parallel. Sel64, Sel32, and Sel16 are used as control signals to the 4:1 MUXes. Specifically, for $\{Sel64, Sel32, Sel16\}_2$ equal to {000}, {001}, {011}, or {111}, the 64 outputs correspond to eight parallel 8-point DCTs, four parallel 16-point DCTs, two parallel 32-point DCTs, or one 64-point DCT, respectively. Note that the throughput is 64 DCT coefficients per cycle irrespective of the desired transform size.
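The mapping from control word to operating mode can be written out as a small lookup. This is a sketch of the decoding only; the names MODES, decode_mode, and split_blocks are hypothetical, and the assumption that the parallel transforms operate on consecutive groups of input samples is made here for illustration.

```python
# Control words {Sel64, Sel32, Sel16} listed in the text -> transform length.
MODES = {
    (0, 0, 0): 8,
    (0, 0, 1): 16,
    (0, 1, 1): 32,
    (1, 1, 1): 64,
}
OUTPUTS_PER_CYCLE = 64


def decode_mode(sel64, sel32, sel16):
    """Return (transform_length, number_of_parallel_transforms)."""
    key = (sel64, sel32, sel16)
    if key not in MODES:
        raise ValueError(f"control word {key} is not one of the listed modes")
    length = MODES[key]
    return length, OUTPUTS_PER_CYCLE // length


def split_blocks(samples, sel64, sel32, sel16):
    """Partition 64 samples into the blocks transformed in parallel
    (assumed to be consecutive groups)."""
    if len(samples) != OUTPUTS_PER_CYCLE:
        raise ValueError("expected 64 input samples")
    length, count = decode_mode(sel64, sel32, sel16)
    return [samples[i * length:(i + 1) * length] for i in range(count)]


for word in MODES:
    length, count = decode_mode(*word)
    print(word, "->", count, "x", f"{length}-point DCT")
```

In every mode the product of transform length and number of parallel transforms is 64, which matches the stated throughput of 64 coefficients per cycle.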

## 7. Results

The proposed design is a reconfigurable 64-point DCT built from either the 22-addition or the 14-addition approximation of the DCT. Its simulation is shown in Figure 7.

Simulation waveform (1000 ns window) with input x[63:0] = 64'h1234567887654321:

| sel[2:0] | f[63:0]              |
|----------|----------------------|
| 3'h1     | 64'hFF0F30FCFC30F3CF |
| 3'h3     | 64'hCFF3CFF30FFF0FFF |
| 3'h7     | 64'hC00CC00CC00CC0…  |

Figure 7: 64-point reconfigurable architecture DCT

The area of the 64-point approximate DCT is compared for the 22-addition and 14-addition 8-point approximations. The 14-addition design requires less area than the 22-addition design.

Table 1: Area for 14 additions and 22 additions of 64-point approximation DCT

| Device utilization                         | Number of slices | Number of 4-input LUTs | Number of bonded IOBs |
|--------------------------------------------|------------------|------------------------|-----------------------|
| 14 additions of 64-point approximation DCT | 168              | 307                    | 131                   |
| 22 additions of 64-point approximation DCT | 244              | 434                    | 131                   |
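The relative savings implied by Table 1 can be computed directly. The sketch below only restates the table values and derives percentage reductions from them; the dictionary layout and the function name reduction_percent are introduced here for illustration.

```python
# Device utilisation figures copied from Table 1.
area = {
    "14-addition": {"slices": 168, "luts": 307, "iobs": 131},
    "22-addition": {"slices": 244, "luts": 434, "iobs": 131},
}


def reduction_percent(metric):
    """Percentage saved by the 14-addition design relative to the 22-addition one."""
    base = area["22-addition"][metric]
    new = area["14-addition"][metric]
    return 100.0 * (base - new) / base


for metric in ("slices", "luts", "iobs"):
    print(f"{metric}: {reduction_percent(metric):.1f}% fewer")
```

The number of bonded IOBs is identical for both designs, so the saving comes entirely from the logic resources.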

## 8. Conclusion

Existing algorithms for approximating the DCT target only small transform lengths. The main objective of this work is to reduce power and computation time. Multiplications are the operations in the DCT that consume the most time and power, and they make the DCT values costly to compute. This paper introduces a reconfigurable and scalable orthogonal approximation for the 64-point DCT that makes use of the symmetries of the DCT basis vectors. The proposed transformation matrix contains only ones and zeros, so bit-shift and multiplication operations are absent. The approximate DCT is obtained to meet low-complexity requirements. The proposed method offers advantages in hardware regularity, modularity, and complexity.

## References

- [1] A. M. Shams, A. Chidanandan, W. Pan, and M. A. Bayoumi, "NEDA: A low-power high-performance DCT architecture," IEEE Trans. Signal Process., vol. 54, no. 3, pp. 955–964, 2006.
- [2] C. Loeffler, A. Ligtenberg, and G. S. Moschytz, "Practical fast 1-D DCT algorithms with 11 multiplications," in Proc. Int. Conf. Acoust., Speech, Signal Process. (ICASSP), May 1989, pp. 988-991.

- [3] M. Jridi, P. K. Meher, and A. Alfalou, "Zero-quantised discrete cosine transform coefficients prediction technique for intra-frame video encoding," IET Image Process., vol. 7, no. 2, pp. 165–173, Mar. 2013.
- [4] S. Bouguezel, M. O. Ahmad, and M. N. S. Swamy, "Binary discrete cosine and Hartley transforms," IEEE Trans. Circuits Syst. I, Reg. Papers, vol. 60, no. 4, pp. 989-1002, Apr. 2013.
- [5] F. M. Bayer and R. J. Cintra, "DCT-like transform for image compression requires 14 additions only," Electron. Lett., vol. 48, no. 15, pp. 919–921, Jul. 2012.
- [6] R. J. Cintra and F. M. Bayer, "A DCT approximation for image compression," IEEE Signal Process. Lett., vol. 18, no. 10, pp. 579-582, Oct. 2011.
- [7] S. Bouguezel, M. Ahmad, and M. N. S. Swamy, "Low-complexity 8×8 transform for image compression," Electron. Lett., vol. 44, no. 21, pp. 1249-1250, Oct. 2008.

## Author Profile



Hymavathi Balla received the B.Tech degree in ECE from Pragati Engineering College, affiliated to JNTU (K), Kakinada, India, and is presently pursuing the M.Tech degree in VLSI system design at CMR Institute of Technology, affiliated to JNTU (H), Hyderabad, India. Her research interests include VLSI technology and design, digital signal processing and architecture, CMOS digital ICs, and image processing.



Dr. S. Balaji is a scholar with rich industry experience. He has 21 years of industrial and academic experience and has worked as Professor in various engineering colleges. He is working as Dean and Professor at CMR Institute of Technology, JNTUH. His industry experience includes work as a Customer Support Engineer at "BIG APPLE COMPUTERS", Hyderabad, and at ECIL, Hyderabad. He has coordinated many workshops and contributed 20 papers to various international journals. He also has to his credit the award of Best Lecturer.



D. Anjaneyulu received his B.Tech degree in Electronics and Communication Engineering (ECE) from JNTU Kakinada in 2012 and his M.Tech degree, with a specialization in Systems and Signal Processing (SSP), from JNTU Kakinada in 2015. He joined the Department of ECE at CMR Institute of Technology as Assistant Professor and is continuing his research in the area of embedded systems and VLSI.