Image processing is an emerging field: new applications are introduced every year, and they keep raising the hardware requirements. This thesis presents the design of a hardware accelerator for several filters used in image processing: 2D Convolution, Census Transform, and Local Binary Patterns. At Intel, these filters are used in Convolutional Neural Network, Gaussian Blur, Stereo Vision, and Face Detection applications. The new accelerator is based on an existing Intel accelerator (the Block Matching and Bilateral Filter accelerator) and reuses some of its components, such as multipliers, adder trees, and subtraction units, which considerably reduces the area overhead. Furthermore, the new accelerator is more flexible than the existing one: it accelerates both the new filters and the original filters, and it offers a variable window size of 5x5, 7x7, or 11x11, which is realized by combining the results of the smallest window. To analyze the performance of the accelerator implemented in this work, we compare it with the original one in terms of speed, processor utilization, throughput, and area. We demonstrate that the average speedup is 1.57x for the 2D Convolution and 1.50x for the 5x5 Census Transform. Higher speedups are possible when multiple instances of the new accelerator are used: the maximum speedup is 7.86x for the 2D Convolution and 4.50x for the Census Transform. In addition, processor utilization is reduced by a factor of 12.61 on average for the 2D Convolution and by a factor of 4.00 for the 5x5 Census Transform. This reduction allows the processor to perform other operations in the background, or it can be used to reduce dynamic power consumption. Throughput is evaluated in the context of real-time video processing: the 2D Convolution can process 4K video, and the 5x5 Census Transform can handle 8K video. 
Finally, the area of the new accelerator increases by 42% compared to the original accelerator, but at system level this results in a total area increase of only 6%. Varying the number of accelerator instances therefore provides a trade-off between speedup, processor utilization, and area usage.
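
For readers unfamiliar with the Census Transform named above, the following is a minimal software sketch of the textbook definition, not the accelerator's implementation. The function name `census_transform`, the row-major bit ordering, the "neighbour darker than centre" comparison, and the zero-filled border are all assumptions made for illustration; hardware designs commonly differ in these details.

```python
def census_transform(image, window=5):
    """Compute a census signature for every pixel of a grayscale image.

    image  : list of rows, each a list of integer intensities
    window : odd side length of the square window (5 gives a 24-bit signature)

    Pixels whose window does not fit inside the image are left at 0
    (an assumption of this sketch, not a property of the accelerator).
    """
    radius = window // 2
    height, width = len(image), len(image[0])
    out = [[0] * width for _ in range(height)]

    for y in range(radius, height - radius):
        for x in range(radius, width - radius):
            center = image[y][x]
            signature = 0
            # Visit the window in row-major order, skipping the centre pixel.
            for dy in range(-radius, radius + 1):
                for dx in range(-radius, radius + 1):
                    if dy == 0 and dx == 0:
                        continue
                    # Append one bit: 1 if the neighbour is darker than the centre.
                    bit = int(image[y + dy][x + dx] < center)
                    signature = (signature << 1) | bit
            out[y][x] = signature
    return out
```

Each output value depends only on comparisons between the centre pixel and its neighbours, which suggests why comparison or subtraction units are the relevant hardware resource for this filter, whereas 2D Convolution relies on multipliers and adder trees.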