1 Introduction

Since ancient times, writing has been used as a tool to communicate and disseminate information in forms that could be stored and accessed (Baird and Tombre 2014; Baird et al. 1992). Papyrus, palm leaves, clay tablets, stones, and metal plates all served the purposes of communication and documentation. However, since the invention of paper in 105 A.D., paper documents have become the preferred medium for recording all types of written communication (Kasturi et al. 2002; Marinai 2008a; Baird and Tombre 2014). Later, with the advent of computers and new storage technologies, along with an increasing demand for the preservation of precious and irreplaceable documents, digital versions of paper documents were introduced, popularly called ‘document images’ (Kasturi et al. 2002; Baird and Tombre 2014). Today, scanners, fax machines, mobile cameras and frame grabbers are important means of acquiring document images (Jain 1989; Rehman and Saba 2014). In an endeavor to move towards the paperless office (Kasturi et al. 2002; Marinai 2008a), large volumes of printed documents are being digitized and archived as document images over the internet and in different image databases (Jawahar et al. 2004b; Lee and On 2011; Doermann and Tombre 2014). Since the manual processing of such a vast number of documents is a cumbersome task, a need was perceived to automate the process of Document Image Analysis (DIA). As a consequence, with the birth of OCR (Optical Character Recognition) in the 1950s, automatic reading and analysis of document images became feasible (Kasturi et al. 2002; Marinai 2008a; Jawahar et al. 2004b; Doermann and Tombre 2014). Since then, DIA has seen tremendous growth in the form of many successful applications such as postal automation (Angadi 2007), form processing (Vasudev 2007), bank cheque processing (Marinai 2008a; Jayadevan et al. 2012), and invoice reading (Klein et al. 2004).

Digitally acquired document images generally occupy a large amount of storage space and thus require considerable download/transfer time (Kia 1997; Gonzalez and Woods 2009; Miano 1999). Therefore, for efficient storage and transmission, document images are subjected to compression before being archived in digital libraries or transmitted over the internet (Sayood 2012; Salomon et al. 2010). As a result, huge volumes of compressed documents are generated on a daily basis. In addition, for the same reason, the present trend is to keep and store the documents in a compressed form for as long as possible (Adjeroh et al. 2013; Mukhopadhyay 2011). Traditionally, the only way reported in the literature to work with compressed document images has been to decompress them first. Hence, direct analysis and processing of compressed document images has become a potential research issue in DIA (Javed et al. 2015a, b). This novel idea of carrying out both operations and analysis directly on the compressed representation is called compressed domain processing (Javed et al. 2013; Adjeroh et al. 2013; Mukhopadhyay 2011; Kia 1997). An operation is said to be in the compressed domain if the required analysis and processing are carried out with minimal or even no decompression of the compressed data. In the recent literature, owing to the increased importance attached to direct processing of compressed data (Mukhopadhyay 2011; Adjeroh et al. 2013; Lu and Jiang 2011; de Queiroz and Eschbach 1998; Kia 1997), many research works have been reported, opening up a new vista in digital image analysis.

The current survey paper makes two major contributions to the research community. First, it presents a study of image analysis and image compression techniques, and subsequently lays out the motivation for pursuing DIA in the compressed domain. Second, it critically reviews the related techniques proposed in the literature based on the type of compression and the type of operations performed. In all, the paper provides a perspective for pursuing further research in the areas of image processing, pattern recognition and computer vision directly on compressed data. The rest of the paper is organized as follows. In Sect. 2, we introduce document images and their related operations from the perspective of DIA. Further, in the same section, we discuss the compressed representation of document images as a solution to the Big-data problem, and also mention the pros and cons of such a representation. We then differentiate document images from document files in Sect. 3, and also highlight in the same section the merits of pursuing research in the compressed domain. In Sect. 4, we provide a detailed literature review on document images from the perspectives of image analysis, image compression, and processing of images in the compressed domain. In Sect. 5 we give future research directions, and in Sect. 6 we conclude the paper.

2 Document images and compressed representation

In this section, we introduce document images from the perspective of image analysis and image compression, and present the motives/reasons for pursuing document image analysis in the compressed domain.

2.1 Document images and analysis

In the phrase ‘document image’, the word ‘document’ is derived from the Latin word documentum, which in Medieval Latin came to mean a ‘written instruction’ or ‘official paper’ that is normally used to communicate and store information (Javed 2016). Text is the main source of information communicated through paper documents, which could be either in printed or handwritten form (Jawahar et al. 2004a, b; Jayadevan et al. 2012; Rehman and Saba 2012). The text contents in a document can also be accompanied by tables, graphs, images, etc., as additional sources of information, presented in different types of layouts (see Fig. 1; Antonacopoulos et al. 2009; Marinai 2008a; Ogier et al. 2009). Two typical printed documents containing text and non-text, formatted in double and triple column layouts, are shown in Fig. 1; they are taken from the Prima Layout Analysis Dataset (Antonacopoulos et al. 2009) published at ICDAR2009. Thanks to the contributions made by the DIA community over the past few decades in the effort to move towards a paperless office (see the International Conference on Document Analysis and Recognition, ICDAR, conference series running since 1991), automatic analysis and processing of such documents has today become a reality (Kasturi et al. 2002; Marinai 2008b; Doermann and Tombre 2014). In real life we encounter document images in both printed and handwritten forms.

Fig. 1
figure 1

Two sample documents from the Prima Layout Analysis Dataset, a 00000676, b 00000263

A decade ago, DIA was restricted only to static document images originating from scanners and fax machines (Jain 1989; Baird et al. 1992; Baird and Tombre 2014). However, with advancements in computers and digital technology, today we find instances where document images are captured by mobile phones, digital cameras, and surveillance video cameras too (Jawahar et al. 2004b; Jain 1989). Therefore, with the rapid pace of technology, the problem of DAR (Document Analysis and Recognition) has spread to other domains as well, such as natural scene image processing and video processing (Mukhopadhyay 2011). Since 2005, a separate workshop called CBDAR (Camera-Based Document Analysis and Recognition), dedicated to the processing of camera-based document images, has been held in conjunction with ICDAR. In addition, due to the extensive usage of video cameras in generating document images, we also anticipate VBDAR (Video Based Document Analysis and Recognition) (Jathanna and Nagabhushan 2015) in the near future. Overall, the scope of DIA is increasing day by day. In the current research paper, DIA on static compressed document images is reviewed; issues pertaining to natural scene images and video frames are not discussed.

DIA entails tools and techniques that allow the contents of the image to be analyzed through the computation of statistics and mathematical measurements over the pixel intensities present in the image (Marinai 2008b; Baird et al. 1992). Computing histograms, line profiles, correlation, and entropy are a few such analyses related to document images (Marinai 2008b; Nixon and Aguado 2012; Ding et al. 2012). DIA involves carrying out different operations on document images which can be grouped broadly into Textual processing and Graphical processing (see Fig. 2; Kasturi et al. 2002). The overall objective of DIA is to recognize the text and graphic components in the document image, and to generate the information required for further decision making.
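To make these measurements concrete, the following minimal sketch (in Python with NumPy, assuming a binarized page where foreground pixels are 1) computes a horizontal projection profile, a pixel histogram, and the entropy of the pixel distribution. It illustrates the kind of statistics used in DIA rather than any specific published method.

```python
import numpy as np

def dia_statistics(page):
    """page: 2-D NumPy array of 0/1 values (1 = foreground/ink)."""
    # Horizontal projection profile: number of ink pixels in each row;
    # valleys in this profile typically separate text lines.
    h_profile = page.sum(axis=1)

    # Histogram of pixel values (here just the two bins 0 and 1).
    hist = np.bincount(page.ravel(), minlength=2)

    # Shannon entropy of the pixel distribution, in bits.
    p = hist / hist.sum()
    p = p[p > 0]
    entropy = -np.sum(p * np.log2(p))
    return h_profile, hist, entropy

# Tiny synthetic example: two "text lines" separated by a blank band.
page = np.zeros((9, 12), dtype=int)
page[1:3, 2:10] = 1
page[6:8, 2:10] = 1
profile, hist, H = dia_statistics(page)
print(profile, hist, round(H, 3))
```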

Fig. 2
figure 2

Block diagram representing DIA techniques

In DIA, textual processing involves the development of OCR and page layout analysis techniques (Kasturi et al. 2002; Marinai 2008a). The job of OCR is to recognize the text contents printed in the image and convert them to ASCII form, which consumes less storage space and is convenient for editing and search operations (Doermann et al. 1998; Doermann 1998; Doermann and Tombre 2014). Page layout analysis techniques complement OCR by automatically extracting only the text contents from document images with complex layouts (Antonacopoulos et al. 2009; Baird et al. 1992); these techniques also include the detection of skew, text blocks and paragraphs in the document. Graphical processing, on the other hand, involves techniques for the detection and processing of non-text components such as logos, images, tables and graphs (Ogier et al. 2009). All of the operations in conventional DIA are carried out on the uncompressed version of the document image. In the present research paper, the DIA techniques that operate directly on compressed document images are reviewed.

2.2 Compression of document images

Digital images generally occupy a large amount of storage space, and hence consume more time during transmission and downloading (Sayood 2012; Salomon et al. 2010; Miano 1999). For example, a 1 GB flash memory card can store only about 41 uncompressed colour images from an 8 Megapixel camera, each of about 24 Mega Bytes (MB) (Gonzalez and Woods 2009). To elaborate further, consider the sample document image (00000676) shown in Fig. 1, whose size in uncompressed mode is 21.6 MB. By compressing the document image with JPEG (Joint Photographic Experts Group) and TIFF (Tagged Image File Format), the file sizes are reduced to 1.24 MB and 131 KB (Kilo Bytes) respectively. This also implies that, with a typical modem of speed 56 kbps (7 kBps), the entire document can be transmitted in approximately 52.66 min in uncompressed mode, 3.02 min in JPEG mode, and just 18.71 s (seconds) in TIFF mode; the difference is substantial. Therefore, from the point of view of efficiency of both storage and transfer, it has become essential, and the usual practice, to compress digital images before archiving or communicating them over the network (Javed 2016; Kia 1997).
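The transmission figures above follow directly from the file sizes and the line rate; the short calculation below reproduces them (assuming 1 MB = 1024 KB and a sustained rate of 7 kB/s, i.e. 56 kbps).

```python
# Transmission time = file size / line rate (7 kB/s for a 56 kbps modem).
rate_kBps = 7.0
sizes_kB = {"uncompressed": 21.6 * 1024, "JPEG": 1.24 * 1024, "TIFF": 131}

for name, size in sizes_kB.items():
    seconds = size / rate_kBps
    print(f"{name:>12}: {seconds / 60:6.2f} min ({seconds:7.2f} s)")
# uncompressed: ~52.66 min, JPEG: ~3.02 min, TIFF: ~18.71 s
```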

Image compression is a technique used to identify internal data redundancy and subsequently produce a compact representation which occupies less storage space than the original image data (Sayood 2012; Salomon et al. 2010; Miano 1999). The reverse of this process is called decompression. Different image compression techniques are available in the literature, such as Run-length coding, Huffman coding, Arithmetic coding, Transform coding, Predictive coding, Sub-band coding, Vector quantization, and Wavelet transforms. A detailed review of these compression algorithms is provided in Sayood (2012), Salomon et al. (2010) and Miano (1999). In general, compression algorithms can be classified into lossless and lossy types (Gonzalez and Woods 2009). Lossless algorithms compress the data without losing any information and give back the exact original data when decompressed. Out of the above listed algorithms, Run-length, Huffman, and Arithmetic coding belong to the lossless type of compression, and the rest come under the lossy type (Sayood 2012; Salomon et al. 2010). Lossy algorithms incur a loss of information during compression which cannot be recovered at the decompression stage.
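As a small illustration of the lossless family, the sketch below builds a Huffman code table for the symbols of a short input and reports the resulting average code length. It is a minimal teaching version under the usual textbook formulation, not an implementation of any particular standard.

```python
import heapq
from collections import Counter

def huffman_codes(data):
    """Return a {symbol: bitstring} Huffman code table for `data`."""
    freq = Counter(data)
    if len(freq) == 1:                       # degenerate input: one symbol
        return {next(iter(freq)): "0"}
    # Heap entries: (frequency, tie-breaker, subtree). A leaf is (symbol,),
    # an internal node is a pair (left_subtree, right_subtree).
    heap = [(f, i, (sym,)) for i, (sym, f) in enumerate(freq.items())]
    heapq.heapify(heap)
    counter = len(heap)
    while len(heap) > 1:                     # repeatedly merge the two rarest
        f1, _, left = heapq.heappop(heap)
        f2, _, right = heapq.heappop(heap)
        heapq.heappush(heap, (f1 + f2, counter, (left, right)))
        counter += 1
    codes = {}
    def assign(node, prefix):
        if len(node) == 1:                   # leaf: record its codeword
            codes[node[0]] = prefix
        else:
            assign(node[0], prefix + "0")
            assign(node[1], prefix + "1")
    assign(heap[0][2], "")
    return codes

text = "aaaaabbbbccd"
codes = huffman_codes(text)
encoded = "".join(codes[c] for c in text)
print(codes, f"average code length = {len(encoded) / len(text):.2f} bits/symbol")
```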

In the literature, different compressed image file formats (Miano 1999) are available; among them, the popular ones are TIFF, PDF (Portable Document Format), JPEG, PNG (Portable Network Graphics), and BMP (Windows Bitmap) (Javed 2016; Sayood 2012; Salomon et al. 2010). Each file format has internal support for several types of compression algorithms, as is generally found in TIFF and PDF. The file formats can be grouped into continuous tone still-image compression formats and bi-level still-image compression formats (Miano 1999). The following formats fall under the category of continuous tone still-image compression and are used mainly to compress natural scene images and video frames.

  • JPEG—A JPEG (Gonzalez and Woods 2009; Miano 1999; Mukhopadhyay 2011) is a standard format used to represent colour images of photographic quality. It is a lossy baseline coding system that uses the quantized Discrete Cosine Transform (DCT) on \(8\times 8\) image blocks. It also uses Huffman and run-length coding during entropy encoding. It is one of the most popular methods for compressing images on the internet (Mukhopadhyay 2011).

  • JPEG-LS—It is a lossless to near-lossless standard for compressing continuous tone images. It uses adaptive prediction, context modelling, and Golomb coding (Sayood 2012) techniques.

  • JPEG-2000—This is an extension to JPEG for increased compression of photographic quality images (Sayood 2012; Salomon et al. 2010). It uses Arithmetic coding and quantized Discrete Wavelet Transforms (DWT). The compression can be lossy or lossless.

  • BMP—Windows Bitmap (colour) (Sayood 2012; Salomon et al. 2010) is a file format used mainly for creating simple, typically uncompressed, images on the Microsoft Windows platform.

  • GIF—Graphic Interchange Format (Sayood 2012; Salomon et al. 2010) is a file format that uses lossless LZW (Lempel–Ziv–Welch) (Adjeroh et al. 2013) coding for 1- to 8-bit images. It is used popularly for making small animations and short low-resolution clips for internet applications.

  • PDF—Portable Document Format (Sayood 2012; Salomon et al. 2010) is a format for representing 2D documents independently of devices and platforms. It can incorporate JPEG, JPEG2000, CCITT, and other compression algorithms. Some PDF versions are also recognized as ISO standards.

  • PNG—Portable Network Graphics (Sayood 2012; Salomon et al. 2010) is a file format that losslessly compresses full colour images (up to 48 bits/pixel) and supports transparency. It codes the differences between each pixel value and a value predicted from past pixels.

  • TIFF—Tagged Image File Format (Sayood 2012; Salomon et al. 2010) is a file format supporting a variety of image compression standards, including CCITT, PackBits, JPEG, JPEG-LS and others.

  • ZIP—This is a lossless compression scheme (Sayood 2012; Salomon et al. 2010) used for greyscale or colour images; it is based on Deflate, a dictionary coder in the same family as LZW.

The following formats target bi-level images and are used specifically to compress black and white document images.

  • CCITT Group 3—CCITT (T.4-Recommendation 1985) stands for Consultative Committee for International Telephone and Telegraph, now known as ITU-T (Telecommunication Standardization Sector of the International Telecommunication Union). It is popularly used as a facsimile (FAX) method for transmitting binary documents over telephone lines. It supports 1D (MH-Modified Huffman) and 2D (MR-Modified Read) coding, which combine run-length and Huffman coding. TIFF with CCITT Group 3 compression is considered to be the universal benchmark for FAX and multi-line documents (Cvision Technologies 2015).

  • CCITT Group 4—An extended version of the CCITT Group 3 standard (T.6-Recommendation 1985), supporting only 2D (MMR-Modified Modified Read) run-length coding.

  • JBIG or JBIG1—A Joint Bi-level Image experts Group standard (Salomon et al. 2010) for progressive, lossless compression of binary images. Context-sensitive arithmetic coding is used, and the image is refined progressively from a low resolution as additional compressed data is added.

  • JBIG2—A replacement for JBIG1 (Salomon et al. 2010), intended for bi-level images in desktop, internet and FAX applications. The compression techniques used here are content based: dictionary-based methods are used for text and halftone regions, whereas Huffman or arithmetic coding is used for the remaining image content. The compression can be either lossy or lossless.

The important compression schemes and the compressed image file formats that support them are listed in Table 1. Table 2 summarizes the different compressed image file formats according to three modes of compression, namely uncompressed, lossy, and lossless. Depending on the application, one can choose the appropriate file type and file format.

Table 1 Different compression schemes and supporting compressed file formats
Table 2 Compressed image file formats in different compression modes

With respect to document images, the popular compressed file formats available are BMP, TIFF, PDF, JPEG and JBIG (Javed 2016; Salomon et al. 2010; Miano 1999). Each format has internal support for different compression algorithms. As observed from Tables 1 and 2, TIFF is a very flexible image format which supports all three modes of image compression; as a result, it is one of the most popular formats when it comes to the compression of document images, especially text image data. PackBits (a form of run-length compression), CCITT Group 3 1D (MH), CCITT Group 3 2D (MR), CCITT Group 4 (MMR), and JPEG are some of the popular compression schemes embedded in the TIFF format (TIFF 1992; Javed 2016).
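In practice, these TIFF compression schemes can be exercised with an image library such as Pillow, as in the hedged sketch below. The input file name is hypothetical, the sketch assumes a Pillow build with libtiff support, and the exact set of accepted compression names may vary between versions.

```python
import os
from PIL import Image  # pip install Pillow

# Load a scanned page and reduce it to a bi-level (1-bit) image,
# which is what the CCITT/PackBits schemes expect.
page = Image.open("scanned_page.png").convert("1")   # hypothetical input file

for scheme in ("packbits", "group3", "group4"):
    out = f"page_{scheme}.tif"
    page.save(out, format="TIFF", compression=scheme)
    print(f"{scheme:>9}: {os.path.getsize(out) / 1024:.1f} KB")

# For comparison, save with Pillow's default settings (typically uncompressed).
page.save("page_default.tif", format="TIFF")
print(f"  default: {os.path.getsize('page_default.tif') / 1024:.1f} KB")
```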

Run Length Encoding (RLE) and its variant algorithms constitute the backbone of document image compression in TIFF (TIFF 1992). Compression schemes such as RLE, Huffman coding, MH, MR, MMR, PackBits and JPEG encoding are all based on different forms of run-length data compression (Javed 2016). The performance of run-length based compression algorithms on a sample text document image (see Fig. 1a) is shown in Table 3.
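A minimal sketch of the run-length idea underlying these schemes is given below, assuming a bi-level scanline represented as a 0/1 list: the line is reduced to alternating run lengths and restored exactly. This is a toy illustration of the principle, not the actual MH/MR/MMR code tables.

```python
def rle_encode(row):
    """Encode a 0/1 scanline as (first_value, [run lengths])."""
    runs, current, count = [], row[0], 0
    for bit in row:
        if bit == current:
            count += 1
        else:
            runs.append(count)
            current, count = bit, 1
    runs.append(count)
    return row[0], runs

def rle_decode(first_value, runs):
    """Invert rle_encode: expand the run lengths back to a 0/1 scanline."""
    row, value = [], first_value
    for length in runs:
        row.extend([value] * length)
        value = 1 - value
    return row

scanline = [0] * 12 + [1] * 5 + [0] * 3 + [1] * 7 + [0] * 9
first, runs = rle_encode(scanline)
assert rle_decode(first, runs) == scanline        # lossless round trip
print(first, runs)   # 0 [12, 5, 3, 7, 9]
```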

Table 3 Effect of different document image compression algorithms on a sample document image (shown in Fig. 1a)

From Table 3, it can be observed that although JPEG is a popular compression algorithm on the internet, its compression ratio for text documents is very low in comparison with run-length based compression algorithms (Javed 2016). Moreover, JPEG is a lossy compression scheme suited to compressing colour images, and as a result it is the least preferred format when it comes to the compression of text document images (TIFF 1992; Javed 2016; Miano 1999). On the other hand, JBIG is a more recent document compression standard developed by CCITT, now ITU-T (Salomon et al. 2010). Though JBIG gives a good compression ratio with text documents, it is still not popular for compression purposes because of the many patent-related and implementation issues linked to JBIG technology (Salomon et al. 2010; Javed 2016). Moreover, JBIG uses arithmetic coding internally, which makes it less suitable for compressed domain processing (Kia 1997; Adjeroh et al. 2013), and JBIG is yet to find its place in popular image formats like TIFF.

From the study presented so far, it can be seen that most of the data compression techniques in the literature have been designed either from the perspective of efficient data storage and transfer, or according to the nature of the compression, which may be lossy or lossless (Sayood 2012; Salomon et al. 2010; Kia 1997; Adjeroh et al. 2013). Less importance was given, while designing the compression process, to the direct processing of the compressed data (Kia 1997). However, in the present day scenario, due to the availability of a large volume of documents in compressed form, it has become crucial to think of providing direct access to the objects present in the compressed data. In this direction, a few initial attempts have been made in the literature to develop operation-aware compression schemes for document images (Kia 1997; Garain et al. 2006a, b), which are friendly enough to support a few pre-defined operations on the compressed data. To the best of our knowledge, the first attempt to develop an operation-aware compression scheme was made by Kia (1997), who presented a customized symbol-based compression scheme similar in principle to JBIG compression. Here, the compression algorithm is made intelligent enough to handle compressed domain operations. The idea of developing an object-aware compression scheme was standardized in the form of JPEG2000 by the Joint Photographic Experts Group committee in the year 2000, and it is based on Wavelet compression (Salomon et al. 2010; Kia 1997). Other discussions related to compressed domain processing are presented in the upcoming sections.

2.3 Document images as a Big-data problem and compression as a solution

Big-data is the problem of handling enormous amounts of data, generally characterized by the 3 V’s (Volume, Velocity, and Variety) (Javed 2016). The United Nations Economic Commission for Europe (UNECE), on behalf of the international statistical community, identifies three major sources of Big-data, namely social networks, traditional business systems and websites, and automated systems, which generate huge amounts of data in the form of text, images, audio and video. As per recent UNECE statistics, data growth is expected to become exponential in the coming years, as shown in Fig. 3 (Javed 2016). Therefore, owing to this predicted exponential growth, the efficient handling of Big-data, and its processing, storage and transmission, have become challenging issues and a main focus of attention in the present day scenario (Javed 2016).

Fig. 3
figure 3

Exponential growth of Big-data predicted by UNECE

In the literature, data compression techniques have been used extensively to overcome the volume aspect of Big-data (Mukhopadhyay 2011; Salomon et al. 2010; Javed 2016). As observed from Table 3, compression techniques make a big difference in reducing the volume of uncompressed data (for the sample text document in Fig. 1a, a 99% data reduction is achieved). The overall objective of data compression is to transform big size data into small size data for efficient storage and transmission. Data compression techniques not only make Big-data space efficient but also communication efficient (download/upload), because transferring a large volume of data over the internet or personal networks is expensive in terms of time and bandwidth (Kia 1997; Miano 1999). Therefore, in order to maintain the efficiency of Big-data (text, images, audio and video), different compression strategies have been invented in the data compression literature (Anantharaman 2001; Sayood 2012; Salomon et al. 2010).

In the year 2012, it was reported by Venter and Stein (2012) that images and image sequences (videos) make up about 80% of all corporate and public Big-data. Since compression techniques have become an integral part of image and video acquisition systems, we can today confidently state that more than 80% of Big-data consists of compressed images and videos spread across digital libraries, the internet and personal databases (Javed 2016). Therefore, an automatic mode of analysis of compressed images and videos is becoming a challenging research issue in Big-data. This is because the traditional method of processing compressed data is through decompression, which requires additional computing resources and demands a large buffer memory (Mukhopadhyay 2011; Adjeroh et al. 2013; Javed 2016). Thus, for a huge volume of Big-data, decompression based processing may become very expensive in terms of time and space in the near future. Since compressed document images are also an integral part of the Big-data problem, the research study on direct processing of compressed document images is very much relevant from the Big-data perspective (Javed 2016). In today’s scenario, with the rapid growth of digital libraries, e-governance, and internet based applications, a huge volume of document images is being archived and communicated in compressed form (Mukhopadhyay 2011). The discussion of issues related to compressed domain processing of natural scene images and videos is not intended to be a part of the current research study; however, it is touched upon briefly in Sect. 4.

2.4 Limitations of the compressed representation of document images

No doubt, in the literature, data compression techniques have been considered to be one of the ways to overcome the volume aspect of Big-data, especially the issues related to data storage and transmission; but there are also some limitations in having data in the compressed form, which are outlined hereunder (Javed 2016). The overall objective of data compression is to transform big size data into small size data, so that the data can be handled efficiently (Miano 1999). On the one hand, data compression techniques overcome the issues of data storage and transmission; on the other hand, they make the processing of the compressed data expensive in terms of time and space (Mukhopadhyay 2011; Adjeroh et al. 2013). This is because the traditional way of processing compressed data is through decompression: in order to operate on the compressed data, the data needs to be decompressed first and then processed, which requires additional memory and computing resources (Kia 1997; Javed 2016). Therefore, the traditional model of decompression-based processing is generally considered to be expensive (Mukhopadhyay 2011; Adjeroh et al. 2013).

From the perspective of the Big-data problem, decompression implies that the compressed data (small size data) is brought back to the state of Big-data (large size data), which is not desirable. Therefore, the issue of volume in the Big-data problem remains unsolved when compressed data is processed via decompression (Javed 2016). From the viewpoint of storage and transmission, decompression is expensive in terms of the memory required for the decompressed data; further, once the decompressed data is processed, it needs to be re-compressed to make it storage and transmission friendly (Kia 1997; Mukhopadhyay 2011; Javed 2016). Therefore, in a decompression based processing model, the data needs to go repeatedly through the stages of decompression and re-compression, depending upon the number of times the compressed data requires processing.

Consequently, based on the different issues highlighted in this section, the overall concern in the present research theme is to keep the data in its compressed form and yet be able to directly analyze and process the data without involving the decompression stage. In DIA, OCR (Baird and Tombre 2014; Baird et al. 1992) is used to solve many problems related to document images, such as searching, editing, indexing and retrieval (Doermann et al. 1998; Doermann 1998). However, applying existing OCR to a compressed document requires the data to be in an uncompressed form, which is possible only when the compressed data is decompressed. Therefore, decompression becomes an additional burden while using OCR on compressed documents (Javed 2016). This issue has motivated researchers to investigate a decompression-less and OCR-less model of processing compressed document images, which is discussed in Sect. 4.

3 Document images vis-a-vis document files

The dictionary meaning of the word ‘document’ is a piece of written, printed, or electronically typed matter that provides information or evidence or serves as an official record. Though at present we use documents in both printed and electronic versions, paper continues to be a preferred medium of communication, due to the ease of navigation and annotation. The electronic version of a document is called a document file or ASCII file, which requires less memory and supports editing and searching operations with minimal effort (Adjeroh et al. 2013; Lloret and Palomar 2012; Gambhir and Gupta 2016). When the same document file is printed or typed on paper, we get a document in paper form. Any operation or information extraction from the paper document requires manual processing, which is a laborious task when it involves huge volumes of paper documents (Jawahar et al. 2004a; Javed 2016). Therefore, in practice, computers are the preferred choice for automating the process of document analysis (Marinai 2008a). Computers cannot directly handle paper documents; hence, documents have to be digitized with the help of scanners or fax machines (Jain 1989). A paper document in digital form is called a document image. A document image generally requires more storage space than a document file (Salomon et al. 2010; Adjeroh et al. 2013). For example, a typical document page which in ASCII form can easily be stored in 2–3 KB requires between 500 KB and 2 MB of storage space as an image (Kia 1997).

In the current digital era, due to the extensive use of digital technology, a huge amount of data is being generated in the form of texts and images (Venter and Stein 2012) by corporations, social media, and the internet at large. The text contents are the most informative component of document files and images. Moreover, text is a rich source of information in text books, journals, newspapers, magazines, legal documents, and so on (Baird and Tombre 2014). Therefore, the automatic extraction and analysis of text in both files and document images has been an important research issue for many decades (Adjeroh et al. 2013). Automatic analysis of text in files includes operations such as searching, editing, indexing, classification, text mining, information extraction, and text retrieval. A detailed survey of different operations and techniques related to text files is available in Salton (1988), Berry (2013), Lloret and Palomar (2012) and Gambhir and Gupta (2016). In the literature, many strategies have evolved to solve the problem of conventional pattern matching in text files. The issues of exact pattern matching (Faro and Lecroq 2013; Crochemore et al. 2007; Adjeroh et al. 2008) and inexact pattern matching (Navarro 2001) are reported in detail. The other variants, pattern matching with swaps (Lee et al. 1997), with fusion (Adjeroh et al. 1999), with don’t cares (Akutsu 1994), with scaling (Amir et al. 1992), with rotation (Amir et al. 2006), multiple patterns (Aho and Corasick 1975; Amir and Calinescu 1996), super patterns (Knight and Myers 1999), multi-dimensional patterns (Giancarlo and Gross 1997), and parameterized patterns (Fredriksson and Mozgovoy 2006), have also been attempted. In the case of compressed pattern matching, the key works reported are on LZW (Lempel–Ziv–Welch) (Amir et al. 1996; Adjeroh et al. 2013), LZ77 (LZ refers to Lempel–Ziv coding and 77 refers to 1977, the year when this coding was introduced) (Farach and Thorup 1995; Salomon et al. 2010), a hybrid approach using LZ77 and LZ78 (LZ78 refers to the advanced version of LZ77 coding introduced in the year 1978) (Navarro and Raffinot 1999, 2004; Salomon et al. 2010), BWT (Burrows Wheeler Transform) (Bell et al. 2002), Grammar based compression (Kieffer and Yang 2000; Yang et al. 2001), Run-length encoding (Eilam-Tzoreff and Vishkin 1988; Apostolico et al. 1997; Bunke and Csirik 1993, 1995; Adjeroh et al. 1999), and Huffman codes (Mukherjee and Acharya 1994; Moura et al. 2000; Ziviani et al. 2000).
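To give a flavour of compressed pattern matching, the sketch below counts occurrences of a binary pattern directly in a run-length encoded text, in the spirit of the run-length matching line of work cited above (Eilam-Tzoreff and Vishkin 1988; Bunke and Csirik 1993). It is a simplified illustration rather than a reproduction of those algorithms.

```python
def to_runs(bits):
    """Collapse a 0/1 sequence into maximal (value, length) runs."""
    runs = []
    for b in bits:
        if runs and runs[-1][0] == b:
            runs[-1][1] += 1
        else:
            runs.append([b, 1])
    return [tuple(r) for r in runs]

def count_in_rle(text_runs, pattern_bits):
    """Count occurrences of pattern_bits in a text given only as runs."""
    p = to_runs(pattern_bits)
    k = len(p)
    if k == 1:                                   # pattern is a single run
        v, l = p[0]
        return sum(tl - l + 1 for tv, tl in text_runs if tv == v and tl >= l)
    count = 0
    for i in range(len(text_runs) - k + 1):
        # First pattern run must be a suffix of text run i, last pattern run
        # a prefix of text run i+k-1, and the interior runs must match exactly.
        first_ok = text_runs[i][0] == p[0][0] and text_runs[i][1] >= p[0][1]
        last_ok = (text_runs[i + k - 1][0] == p[-1][0]
                   and text_runs[i + k - 1][1] >= p[-1][1])
        middle_ok = all(text_runs[i + j] == p[j] for j in range(1, k - 1))
        if first_ok and last_ok and middle_ok:
            count += 1                           # boundary alignment is unique
    return count

text = [0, 0, 1, 1, 1, 0, 1, 1, 0, 0, 1, 1, 1, 1]
pattern = [1, 1, 0]
assert count_in_rle(to_runs(text), pattern) == 2
```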

However, extracting information from a document image is not as simple as extracting it from a document file (Jawahar et al. 2004a; Adjeroh et al. 2013; Doermann and Tombre 2014). Before the document image is ready for analysis, it has to go through different stages such as pre-processing, feature extraction, segmentation, and finally recognition or analysis (Kasturi et al. 2002; Marinai 2008a). The invention of OCR was a major breakthrough in DIA, and its prime task was to transform a printed text image into a text file (Doermann 1998). Detailed surveys on OCR and other automatic document image processing techniques for uncompressed documents are given in Tang et al. (1996), Kasturi et al. (2002) and Doermann (1998). In today’s scenario, as a huge volume of text document images is being generated on a daily basis in digital libraries, e-governance, and internet based applications, the present trend is to store these documents in a compressed form for efficiency in storage and transmission. Therefore, automatic processing and analysis of text in compressed document images is becoming an important research issue (Mukhopadhyay 2011; Adjeroh et al. 2013).

4 State of the art

In this section we present a detailed review of the literature on document images from three perspectives: image analysis, image compression, and image processing in the compressed domain.

4.1 Document image analysis

The objective of DIA is to automatically analyze and extract different kinds of information from document images (Marinai 2008a, b; Baird et al. 1992; Baird and Nagy 1994). The stages generally identified in any typical DIA system are pre-processing, segmentation, recognition, and post-processing (Kasturi et al. 2002; Marinai 2008a; Rehman and Saba 2014). The pre-processing stage improves the quality of the image and increases the possibility of accurate detection of the objects of interest. The segmentation stage detects and extracts the objects from the image. The recognition/classification stage identifies the objects of interest and categorizes them into certain classes. The post-processing stage involves techniques that boost the accuracy and performance of DIA applications. All these stages involve feature extraction (Ding et al. 2012), which exposes the distinctive characteristics of the document images. In the literature, many operations have been defined which tackle various issues arising through the stages of DIA; a few important ones among them are listed in Fig. 4. A detailed survey is available in Kasturi et al. (2002), Marinai (2008a) and Ding et al. (2012). DIA involves the processing of both handwritten and printed document images.

Fig. 4
figure 4

Important operations through the different stages of a typical DIA system

The important operations involved in the pre-processing stage are filtering, geometrical transformations, and thinning (Nixon and Aguado 2012; Marinai 2008a). Image filtering produces an image where each pixel has a value that is influenced, according to some strategy, by the pixels in its neighborhood. Important filtering operations include binarization (Lu et al. 2010; Bolan 2012), noise removal (Farahmand et al. 2013), and image enhancement (Bolan 2012; Hendahewa 2010). Geometrical transformations (Saragiotis and Papamarkos 2008; Na and Jinxiao 2011; Shiraishi et al. 2013) include operations such as the detection and correction of skew in the document image. A thinning operation (Gonzalez and Woods 2009; Kasturi et al. 2002) is used to compute features based on the skeleton of the image components. Overall, the pre-processing stage improves the quality of the image and helps improve the performance of the subsequent stages and, eventually, of the entire system.
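A hedged sketch of two of these pre-processing steps, using OpenCV, is given below: Otsu binarization followed by median filtering for speckle removal. The file name is hypothetical and the parameter choices are illustrative only, not those of the cited methods.

```python
import cv2

# Load a scanned page as a greyscale image (hypothetical file name).
gray = cv2.imread("scanned_page.png", cv2.IMREAD_GRAYSCALE)

# Binarization: Otsu's method picks the threshold automatically from the
# grey-level histogram; text becomes black (0), background white (255).
_, binary = cv2.threshold(gray, 0, 255, cv2.THRESH_BINARY + cv2.THRESH_OTSU)

# Noise removal: a small median filter suppresses salt-and-pepper speckle
# that is common in scanned or faxed documents.
clean = cv2.medianBlur(binary, 3)

cv2.imwrite("page_binarized.png", clean)
```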

Any typical printed document image may contain text, images, tables, graphs, etc. (Namboodiri and Jain 2007; Ogier et al. 2009). Therefore, the first task in any document image analysis is to understand the layout of the document, namely its physical layout and its logical layout (Marinai 2008a). The segmented page layout is called the physical layout, and the corresponding labeled page layout is called the logical layout. Page segmentation techniques in the literature can be grouped into bottom-up and top-down approaches (Namboodiri and Jain 2007). Recent developments in the area are reported in Zirari et al. (2013), Chen et al. (2013) and Breuel (2003). The bottom-up approach (Zirari et al. 2013) involves repeated aggregation of lower level components to achieve segmentation, whereas in the top-down approach (Chen et al. 2013; Breuel 2003), larger components are detected first and then broken down into smaller components to the required level. Once the page is segmented, the important features are extracted from the segmented components and a logical meaning is attached to them (Nagy et al. 1992; Chen et al. 2013); for example, the decision of whether a component is text or non-text is made here. Other aspects of page segmentation are the segmentation of text into paragraphs, lines, words and characters, which is very critical in deciding the performance of the OCR (Rehman and Saba 2012; Dash et al. 2016). Detailed surveys on text segmentation and recognition issues are reported in the research works of Nagy (2000) and Marinai et al. (2005). Other important research concerns related to text document images are script identification (Ghosh et al. 2010; Dhandra et al. 2006), font detection (Ramanathan et al. 2009; Slimane et al. 2013), reading order determination (Meunier 2005; Yucun et al. 2010), document classification (Chen and Blostein 2007), etc. Another important application in text document images is word spotting, which involves searching for specified keywords within large repositories of document images. Today, word spotting (Ahmed et al. 2017) is a challenging problem in historical documents (Rath and Manmatha 2007), degraded documents (Yong et al. 2010), and handwritten documents (Wshah et al. 2012a, b; Ahmed et al. 2017). A detailed survey on keyword spotting techniques is available in Murugappan et al. (2011) and Ahmed et al. (2017). On the other hand, the processing of graphical components present within the document image is also an important aspect of DIA. Detailed work on graphics recognition can be traced through the series of GREC (Graphics RECognition) workshops (Ogier et al. 2009) associated with ICDAR conferences.
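As a small illustration of the top-down style of analysis, the sketch below segments a binarized page into text lines by thresholding the horizontal projection profile. It is a simplified textbook approach under the assumption of a clean, skew-free page, not the method of any specific work cited above.

```python
import numpy as np

def segment_lines(binary_page, min_ink=1):
    """binary_page: 2-D array with 1 for ink pixels.
    Returns a list of (top_row, bottom_row) intervals, one per text line."""
    profile = binary_page.sum(axis=1)          # ink pixels per row
    in_line, lines, start = False, [], 0
    for r, ink in enumerate(profile):
        if ink >= min_ink and not in_line:     # entering a text line
            in_line, start = True, r
        elif ink < min_ink and in_line:        # leaving a text line
            in_line = False
            lines.append((start, r - 1))
    if in_line:                                # line touching the page bottom
        lines.append((start, len(profile) - 1))
    return lines

page = np.zeros((10, 20), dtype=int)
page[1:3, 2:18] = 1                            # first synthetic text line
page[6:9, 2:18] = 1                            # second synthetic text line
print(segment_lines(page))                     # [(1, 2), (6, 8)]
```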

Throughout the stages of DIA, different features need to be extracted from the document images in order to carry out the desired operations (Nixon and Aguado 2012; Marinai 2008b; Ding et al. 2012). The feature extraction techniques reported in the literature can be grouped broadly into three classes (Nixon and Aguado 2012), namely low level features, shape based features, and flexible shape based features. Low level features are those that can be extracted without requiring shape information from the image (Nixon and Aguado 2012; Ding et al. 2012). All the point based features (projection profiling (Javed et al. 2013), entropy (Javed et al. 2015b), histogram, template operations, thresholding, correlation (Javed et al. 2015a)), together with edge based, line based, curvature based, and motion based features, come under the category of low level features (Javed 2016; Nixon and Aguado 2012). The shape based features are high level features which include techniques such as image subtraction, image thresholding, intensity templates, binary templates, and the Hough transform (Nixon and Aguado 2012). The flexible shape features include techniques such as deformable templates, snakes, symmetry operations, and active shape models (Nixon and Aguado 2012). The different DIA techniques discussed so far have the limitation of working only with uncompressed document images. This implies that the proposed techniques may not be directly suitable for carrying out operations on compressed document images. In the present research paper, however, compressed domain techniques related to feature extraction, segmentation, text-font characterization and word spotting directly in compressed document images are reported.

4.2 Document image compression

In the literature, many types of compression techniques have been proposed which compress images with either redundancy reduction or lossy coding techniques so as to achieve a high compression ratio. The redundancy reduction methods are run-length coding and Huffman coding (Sayood 2012; Salomon et al. 2010; Miano 1999): in run-length coding, compression is achieved by replacing a sequence of similar elements with the value of the element and its count, and in Huffman coding, the frequently occurring elements in the image are substituted with shorter codes that are computed after going through the whole data once (Sayood 2012). The redundancy reduction methods are a lossless type of compression and are the most preferred for compressing text document images. On the other hand, lossy coding methods incur a loss of information, due to which a higher compression ratio is achieved; they are usually preferred for compressing scene images and video frames (Miano 1999; Sayood 2012). The different types of compression algorithms under this category are Transform based coding (Jain 1989), Sub-band coding (Vetterli 1984; Woods and O’Niel 1986), Predictive coding (Habibi 1977; Pirsch 1982), Vector quantization (Lim 1990), and Wavelets (Mazzarri and Leonardi 1995).

Transform based coding popularly uses Fourier transforms (Jain 1989), Sine/Cosine transforms (Ahmed et al. 1974; Jain 1989), KL transforms (Andrews 1970; Jain 1989), etc. These coding techniques are pixel based, and they involve projecting the image onto an orthonormal basis to achieve a high level of compression. The technique is made lossy by retaining the low frequency coefficients, which carry most of the energy, and coarsely quantizing or discarding the high frequency coefficients. In Sub-band coding (Vetterli 1984; Woods and O’Niel 1986), the coding technique is similar to transform coding, with one difference: the different frequency components are treated at different resolutions. Predictive coding (Habibi 1977; Pirsch 1982) is based on predicting the next element from the knowledge of previous elements and coding only the error between the prediction and the given sample. Vector quantization (Lim 1990) based methods achieve compression by dividing the image into blocks which are treated as vectors in a higher-dimensional space; each vector is then represented by the prototype of the class it belongs to, so that only the prototype index needs to be coded. At present we also have compression algorithms based on Wavelets (Mazzarri and Leonardi 1995) which have higher compression capabilities; Fractal coding (Zhang et al. 1995) is one such example built on Wavelets. Though all these compression methods achieve a good compression ratio on scene images, they are the least preferred when it comes to compressing text document images, because when text document images are compressed using lossy techniques they tend to lose information in the presence of low resolution and degradation. Therefore, lossless compression algorithms are preferred for compressing text document images, and today we have many such standard algorithms that are specifically used for document images (Sayood 2012; Salomon et al. 2010).

Text contents in a document image can be losslessly represented and reproduced using a black and white pixel image (Javed 2016). An important characteristic of a text document is that it contains long stretches of contiguous black and white pixels. A continuous sequence of similar pixels is known as a run. Therefore, in a text document image we get many such sequences of white and black pixel runs. These pixel runs were exploited by CCITT (T.4-Recommendation 1985; T.6-Recommendation 1985), and a new compression standard was introduced in the form of CCITT Group 3 1D (Modified Huffman) that specifically targeted documents originating from FAX machines and digital libraries. Run based features also form the backbone of many text document compression algorithms like Run-Length Encoding (RLE), PackBits, Huffman, Modified Read (MR), and Modified Modified Read (MMR), which are widely supported by the TIFF and PDF image formats (Salomon et al. 2010).

In the literature, we also find compression algorithms like JBIG and JPEG being used to compress document images (Salomon et al. 2010; Sayood 2012). However, these compression schemes are not as popular as the TIFF-supported run-length schemes, for the following reasons. Though JPEG gives a good compression ratio with colour images, its compression ratio with binary text documents is low (see Table 3) and the compression is lossy in nature; hence it is usually not preferred for compressing text documents (Javed 2016). On the other hand, JBIG gives a good compression ratio with binary text documents, but the scheme is not very popular at present, and it is yet to find its position in the TIFF image format (Javed 2016). An experimental analysis for the sample printed text document in Fig. 1a, clearly showing the performance of different compression algorithms, is reported in Table 3.

A few document image compression schemes based on the repetitive nature of text components have also been proposed (Ascher and Nagy 1974; Inglis and Witten 1994; Kia 1997). The coding scheme of Ascher and Nagy (1974) uses symbol level coding instead of the pixel level coding found in most earlier methods. The method creates a catalogue of the characters present in the document image, and any new character that is encountered gets added to this library of character images; by using the library so created, compression can be attained. This scheme was further enhanced by Inglis and Witten (1994), who used pattern matching approaches to extract the unique image components from the text document, compressed the indexed components to achieve a high compression ratio, and finally coded the residues to achieve lossless compression. In the same direction, Kia (1997) developed a new compression algorithm that promotes compressed domain processing based on the symbols present in the text document image. The method consists of three stages: symbol prototype generation, symbol coding, and residue coding. With the prototype images and the component location table, one can generate a lossy version of the document; the residue coding can then be used to obtain a lossless version. In the next section we review the different techniques that have been developed to operate directly on compressed document images.

4.3 Processing of document images directly in the compressed domain

With the excessive use of the internet, computers, and personal gadgets, we are witnessing an explosion of information in the form of text, images, audio, and video (Bell et al. 2002). For efficient storage and transfer, the present day trend is to keep the data in a compressed form (Mukhopadhyay 2011). Since more and more data is being made available in compressed form, there is a need for direct analysis of the compressed data without involving the stage of decompression. This novel idea of operating directly on the compressed data without undergoing the stage of decompression is known as Compressed Domain Processing (CoDoP) (see Fig. 5) (Javed 2016; Mukhopadhyay 2011; Adjeroh et al. 2013; Kia 1997).

Fig. 5
figure 5

A general model of processing a compressed document image in compressed domain

The problem of CoDoP was introduced initially in the field of text data (ASCII files) by Amir and Benson in the year 1992. Their idea was to search for a pattern directly within the compressed text data, without the use of decompression, and to report the number of occurrences of the pattern in the file (Amir and Benson 1992). In recent years, compressed domain methods have been gaining momentum because the amount of data being handled is very large and such pattern matching techniques are comparatively faster than decompression based methods (Adjeroh et al. 2013). Based on this research theme, many interesting articles and books have been published in the areas of compressed texts (Adjeroh et al. 2013), images and videos (Mukhopadhyay 2011). However, in this section, we focus mainly on compressed domain processing techniques related to document images, and only touch upon the issues related to texts, scene images and videos briefly.

A general model for processing document images in the compressed domain is shown in Fig. 5 (Mukhopadhyay 2011). In the context of document images, different compressed domain methods have been proposed in the literature, based on the significance and the type of data being compressed. For processing text document images, CCITT Group 3 (Javed et al. 2013, 2015a, b, 2016), CCITT Group 4 (Spitz 1998; Hull and Cullen 1997; Lu and Tan 2003b) and JBIG (Kanai and Bagdanov 1998; Regentova et al. 2005) based models have been proposed. For processing grayscale documents, compressed domain models based on JPEG (de Queiroz 1998) and symbol based compression (Kia 1997) schemes have been proposed. Detailed remarks regarding these models are given in the following discussion.

There are plenty of efficient coding schemes available for compressing document images. Among them, the CCITT Group 3 1D (MH), CCITT Group 3 2D (MR), CCITT Group 4 2D (MMR), JBIG, run-length (RLE), and JPEG compression standards are popular for dealing with document images (Kou 1995; Salomon et al. 2010). CCITT is the most preferred compression family for bi-level document images and is widely supported by the TIFF and PDF formats. CCITT Group 3 and CCITT Group 4 compression are used in all facsimile transmissions by modem over analog telephone lines, as well as in digital storage and transmission of documents. Due to their popularity, researchers became interested in working directly with these compressed documents (Latifi and Kanai 1997; Javed 2016). Many initial attempts to perform different operations on these documents have been reported in the compressed domain literature, and they are discussed in the following paragraphs.

The first group of works reported is on CCITT Group 3 and CCITT Group 4 documents. For CCITT Group 3, there have been attempts to extract features like projection profiles, run-histograms, and entropy (Javed et al. 2013). Further, applications like text segmentation and word spotting have also been successfully attempted (Javed et al. 2015a, b, 2016). The proposed run-length compressed domain model is general and operates directly on the run-length compressed data, irrespective of the compression scheme applied. The model has also been extended to CCITT Group 4 documents by providing the option of generating run-length data from CCITT Group 4 (Javed et al. 2016a) compressed data. The initial work by Spitz (1998) describes the detection of dominant skew, multiple skew angles, and the presence of logotypes in compressed documents. It uses the Baird technique (Baird 1987), where fiducial points are found by using the pass modes in a CCITT Group 4 file. A fiducial point is the bottom center of the bounding box surrounding a connected component. Here, the bottom positions of black and white structures in the compressed image are extracted, and their alignment is determined with the help of peak strength. This technique is extended further to detect logotypes, which produce a unique alignment signature at different angular distributions in the compressed representation. A new compression technique derived from CCITT Group 4 is proposed in Deng et al. (1998, 1999) for the purpose of skew detection and correction, and the extraction of connected components in binary documents. The authors claim that their method is efficient in compression and is also flexible for operating in the compressed domain, with a good speed-up in comparison with the traditional method.
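The flavour of such run-length compressed domain feature extraction can be illustrated as follows: given each row of a page as alternating white/black run lengths (the CCITT Group 3 1D style of representation), the sketch computes the row-wise foreground projection profile and a black-run-length histogram without reconstructing the pixel image. It is a simplified illustration of the idea, not the algorithm of Javed et al. (2013).

```python
from collections import Counter

def features_from_runs(compressed_rows):
    """compressed_rows: one list of run lengths per row, alternating
    white, black, white, black, ... (a leading white run of 0 is allowed).
    Returns the per-row foreground projection profile and a histogram
    of black run lengths, computed without decompressing to pixels."""
    profile = []
    run_hist = Counter()
    for runs in compressed_rows:
        black_runs = runs[1::2]          # every second run length is black
        profile.append(sum(black_runs))  # ink pixels in this row
        run_hist.update(black_runs)      # run-length histogram
    return profile, run_hist

# Three compressed rows of a 20-pixel-wide toy page.
rows = [
    [4, 6, 2, 5, 3],     # white 4, black 6, white 2, black 5, white 3
    [20],                # an all-white row
    [0, 12, 8],          # a row that starts with black (leading white run = 0)
]
profile, hist = features_from_runs(rows)
print(profile)           # [11, 0, 12]
print(dict(hist))        # {6: 1, 5: 1, 12: 1}
```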

With respect to CCITT Group 4, some interest has also been shown in finding document similarity in the compressed domain. The works reported by Hull (1997) and Hull and Cullen (1997) extract pass codes from a rectangular patch of text and use the modified Hausdorff distance measure to determine the similarity between two documents. Later, in the work of Hull (1998) and Lee and Hull (2001), it was observed that the extraction of pass codes was not cost effective, and therefore a new feature based on up and down endpoints was proposed for document matching, which led to promising results. Further, the problem of document similarity in compressed documents was extended to applications like word searching (Lu and Tan 2003b), document retrieval (Lu et al. 2001; Lu and Tan 2003a), and pattern searching (Maa 1994). A word searching technique for CCITT Group 4 compressed documents was devised based on features extracted from the black and white pixel changing elements in the compressed file; in subsequent steps, using the connected components and the word bounding boxes, a dissimilarity measure was derived with which word searching was attained. The same concept was extended to document retrieval of CCITT Group 4 compressed images (Lu and Tan 2003a). Document similarity here is based on the scalar product of document vectors, and the weighted Hausdorff distance is used for the purposes of classification and retrieval. Another interesting work by Maa (1994) on CCITT Group 4 was to detect the existence of bar codes using the features of the vertical mode coding standard: based on the repeated sequence of vertical modes present within the compressed file, the existence of bar codes was detected. Further, an algorithm with reduced operations for detecting connected components in CCITT Group 3 2D and CCITT Group 4 2D has been reported by Regentova et al. (2002). The idea of developing an OCR for compressed documents using CCITT Group 3 2D is discussed in Marti et al. (2000). Finally, in the case of CCITT Group 3 1D, a formal model for processing compressed documents is defined in Latifi and Kanai (1997); they use a finite state transducer to perform point operations, and a pushdown transducer for algebraic and spatial transformations. However, it should be noted that most of the compressed domain techniques proposed for CCITT Group 3 and CCITT Group 4 documents use the compressed data only to extract the locations of coded pixels and to generate a prototype of the document which contains fewer pixels than the actual document image (Lu and Tan 2003a; Javed et al. 2016). This implies that the actual operation is not carried out directly on the compressed data, but on a partially decoded (prototype) version of the document.
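For reference, the modified Hausdorff distance used in this line of work has the general form sketched below (a minimal NumPy version over two 2-D point sets); extracting the pass-code or endpoint feature points from the CCITT Group 4 stream itself is format-specific and omitted here.

```python
import numpy as np

def modified_hausdorff(A, B):
    """A, B: arrays of shape (n, 2) and (m, 2) holding feature points
    (e.g. pass-code or endpoint locations). Returns the modified
    Hausdorff distance: the larger of the two mean nearest-neighbour
    distances between the point sets."""
    # Pairwise Euclidean distances, shape (n, m).
    d = np.linalg.norm(A[:, None, :] - B[None, :, :], axis=2)
    h_ab = d.min(axis=1).mean()   # mean distance from each point of A to B
    h_ba = d.min(axis=0).mean()   # mean distance from each point of B to A
    return max(h_ab, h_ba)

A = np.array([[0, 0], [2, 1], [4, 4]], dtype=float)
B = np.array([[0, 1], [2, 2], [5, 4]], dtype=float)
print(round(modified_hausdorff(A, B), 3))
```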

A few other works related to CCITT are based on run-length information extracted directly from uncompressed binary images. A run based connected component detection technique is discussed in Ronse and Devijver (1984), based on the coordinate information of runs. The work by Shima et al. (1989, 1990) extracts connected components by sorting and tracking runs in each scan line and then labelling the connected runs. The work in Hinds et al. (1990) uses run-length encoding together with the Hough transform to locate the skew angle present in a run-length coded binary image. More recently, the work by Breuel (2008) demonstrated morphological operations on run-length representations.
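A compact sketch of run-based connected component labelling in this spirit is given below. It assumes 8-connectivity and takes, for each scanline, the black-pixel intervals that a run-length decoder yields; it is a generic illustration with union-find, not the specific algorithms of Ronse and Devijver (1984) or Shima et al. (1989).

```python
def label_run_components(rows):
    """rows: for each scanline, a list of (start, end) inclusive column
    intervals of black pixels. Returns a component label per run, in the
    same nested structure, using union-find over vertically touching runs."""
    parent = []

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]       # path halving
            x = parent[x]
        return x

    def union(a, b):
        ra, rb = find(a), find(b)
        if ra != rb:
            parent[rb] = ra

    ids = []                                    # run index per (row, run)
    for r, runs in enumerate(rows):
        ids.append([])
        for (s, e) in runs:
            idx = len(parent)
            parent.append(idx)
            ids[r].append(idx)
            if r > 0:                           # link to overlapping runs above
                for (ps, pe), pid in zip(rows[r - 1], ids[r - 1]):
                    if s <= pe + 1 and e >= ps - 1:   # 8-connected overlap
                        union(pid, idx)

    # Relabel union-find roots with small consecutive component numbers.
    roots, labels = {}, []
    for row_ids in ids:
        labels.append([roots.setdefault(find(i), len(roots)) for i in row_ids])
    return labels

rows = [
    [(0, 2), (6, 8)],      # two separate runs on the first scanline
    [(1, 3)],              # touches the first run above -> same component
    [(6, 7)],              # isolated from the row above -> new component
]
print(label_run_components(rows))   # [[0, 1], [0], [2]]
```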

JBIG (Salomon et al. 2010) is a binary image compression standard adopted by CCITT (Kanai and Bagdanov 1998). JBIG supports a lossy model of compression, and provides strong support for progressive transmission. In the context of JBIG compressed images, the following research works are reported in the literature (Kanai and Bagdanov 1998; Regentova et al. 2005). Similar to the work reported in Spitz (1998), Kanai and Bagdanov (1998) use the concept of extracting fiducial points for skew detection during the decoding stage of JBIG compressed images. Further, with the help of the projection profile, the skew angle is determined: the angle of projection within a search interval that produces the maximum alignment of the fiducial points is taken as the skew angle. In Regentova et al. (2005), the researchers propose an interesting approach to document layout analysis with JBIG compressed documents. They employ knowledge of the JBIG encoding process, and perform horizontal smearing followed by connected component extraction. They use a hybrid strategy where the top-down analysis yields the structural layout while the bottom-up analysis yields the connected components. Simple geometrical features extracted from the connected components are then used to classify the document into text and non-text blocks. Further, they have extended their layout analysis to detect the form type and to perform form dropout.

Using JPEG (Miano 1999) compressed documents, segmentation of different regions has been demonstrated by de Queiroz and Eschbach (1997, 1998). JPEG images have been used to perform scaling, previewing, rotating, mirroring, cropping, re-compressing and segmentation in the work of de Queiroz (1998). The central idea is to build an encoding cost map (ECM) for the entire image in terms of the encoding cost of each block. The ECM, along with row sums and decimated DC maps, is embedded inside the JPEG compressed stream, which results in an increased file size. The ECM, which indicates the activity within each block, is decoded to extract just the informative blocks, so that operations such as segmentation and image cropping can be carried out. The ECM alone is not sufficient for good results; therefore, additional details such as the DC coefficients are used to obtain better results. In the case of JPEG2000, the merging of two different regions from two different compressed image streams is demonstrated in Rosenbaum and Taubman (2003). They use different synchronization codes to mark the code blocks of different regions; based on the synchronization codes extracted from the compressed stream, region merging is achieved. Another region segmentation method for compressed gray images using quadtree and shading representation is given in Chung et al. (2004).
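The role of a per-block cost/activity map can be illustrated with a crude proxy: counting the non-zero quantized DCT coefficients of each 8 × 8 block and marking busy blocks as informative. This is only a rough stand-in for the ECM of de Queiroz (1998); the array layout, function name and threshold below are our own assumptions.

```python
import numpy as np

def block_activity_map(quantized_blocks, threshold=3):
    """Mark 8x8 blocks with at least `threshold` non-zero quantized DCT
    coefficients as 'informative' (a crude proxy for per-block encoding cost)."""
    nonzero = (quantized_blocks != 0).sum(axis=(2, 3))   # shape: (rows, cols)
    return nonzero >= threshold

# Hypothetical 2x2 grid of quantized blocks: one busy (text-like) block
blocks = np.zeros((2, 2, 8, 8), dtype=int)
blocks[0, 0, :3, :3] = 5
print(block_activity_map(blocks))
# [[ True False]
#  [False False]]
```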

An attempt was also made by Kia (1997) to develop a new symbol based compression scheme with the ability to perform compressed domain operations such as keyword searching, sub-document retrieval, skew estimation, and document matching. The scheme exploits the repetitive nature of text components through symbol based coding. It generates a prototype map of the text under compression, the component layout, and a residual map. Using the prototype map and component layout, a lossy version of the image can be generated; to obtain a lossless reconstruction, the residual map must also be decoded.

We have so far discussed compressed domain methods related only to the processing of compressed document images. However, the literature also contains compressed domain techniques proposed for JPEG images, video frames and audio data. We do not discuss these in detail, but only touch upon them briefly.

JPEG (Miano 1999) is the most widely used image compression algorithm, and it is based on the Discrete Cosine Transform (DCT). Many algorithms have been developed to work in the DCT compressed domain. Operations such as image filtering (Kresch and Merhav 1999; Yim 2004), image enhancement (Lee 2007; Tang et al. 2003; Mukherjee and Mitra 2008), image resizing (Dugad and Ahuja 2001; Martucci 1995; Merhav and Bhaskaran 1997; Mukhopadhyay and Mitra 2009; Mukherjee and Mitra 2006), and image transcoding (Viswanath 2009; Viswanath et al. 2010) have been reported. Image editing operations such as digital watermarking (Lu 2002; Hernández et al. 2000; Chen and Wornell 2001), steganography (Avcibas et al. 2005; Chen and Wornell 2001; Provos 2001), indexing (Reeves et al. 1997; Shneier and Mottaleb 1996; Smith and Chang 1994), face detection and identification (Wang and Chang 1997; Chua et al. 2002), and image registration (Lee et al. 2005; Ito and Kiya 2007) are also reported in the JPEG compressed domain. Edge enhancement of remote sensing images and image enhancement without decompressing the image are reported in Chen et al. (1999) and Tang et al. (2003). Text and image sharpening is achieved in Bhaskaran et al. (1997) by scaling each element of the encoding quantization table. A set of algorithms for manipulating compressed images, including algebraic operations such as scalar addition, scalar multiplication and pixel-wise addition or multiplication, is demonstrated in Smith and Rowe (1993). Operations like image shearing and rotation in the compressed domain are seen in Chang et al. (1992) and Chang (1995a, b). Using algebraic operations, inner block manipulation and scan line methods in the compressed domain were developed by Shen and Sethi (1995, 1996). Image filtering, one of the most common operations in the spatial domain, has been applied to compressed images by Chiptrasert and Rao (1990), Ngan and Clarke (1980), Lee and Lee (1992) and Mukherjee and Mitra (2006). Algorithms for image translation, overlapping, scaling, linear filtering and pixel multiplication, applied to image subtitling, anti-aliasing and masking, are discussed in Chang et al. (1992) and Chang and Messerschmitt (1995). Direct feature extraction is achieved in Scotney et al. (2005) by applying a Laplacian operator directly to compressed images. Using the DC and AC coefficients of the DCT domain (Miano 1999), colour, texture, shape and statistical features are extracted from skin regions, and a classifier then detects objectionable images (Ye et al. 2003). Finding areas of interest within an image and detecting coarse edges are addressed in Shen and Sethi (1996). Pattern matching on Huffman encoded texts has been attempted by Klein and Shapira (2005). In Schaefer (2010), images compressed with lossless JPEG are retrieved using information derived from the Huffman coding tables of the compressed image.
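A typical example of such DCT-domain operations is previewing or 8:1 downscaling obtained directly from the DC coefficients, without inverse transforming every block. The sketch below assumes a grid of dequantized 8 × 8 DCT blocks with an orthonormal normalization, in which the DC term equals eight times the block mean; it illustrates the general principle rather than any specific cited method, and the function name is our own.

```python
import numpy as np

def dc_preview(dct_blocks):
    """Build a 1/8-scale preview directly from DCT blocks by keeping only the
    DC coefficient of each 8x8 block (scaled back to the pixel range).
    dct_blocks: array of shape (rows, cols, 8, 8) of dequantized DCT blocks."""
    dc = dct_blocks[:, :, 0, 0]      # DC term of every block
    return dc / 8.0                  # orthonormal DCT: DC = 8 * block mean

# Hypothetical 2x2 grid of blocks whose DC values encode block means 0..3
blocks = np.zeros((2, 2, 8, 8))
blocks[:, :, 0, 0] = np.array([[0, 8], [16, 24]])
print(dc_preview(blocks))            # [[0. 1.]
                                     #  [2. 3.]]
```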

Video compression techniques are frequently used to compress digital videos in different applications (Salomon et al. 2010). Because of the wide availability of videos over the internet and other multimedia sources in compressed form, there have been several attempts to tackle video processing problems in the compressed domain. Operations such as key-frame extraction (Avrithis et al. 1999; Shen and Delp 1995; Yeo and Liu 1995; Zhang et al. 1995), caption localization (Yeo and Liu 1995b; Gargi et al. 1998), object recognition (Jing and Zhang 2004; Wang and Chang 1997), and video transcoding (Moiron et al. 2009) have been attempted in the compressed domain video processing literature.

MPEG (Moving Picture Experts Group) (Mukhopadhyay 2011) audio compression techniques are in popular use to overcome the storage and transmission issues of high quality audio signals. Consequently, several attempts have been reported to process MPEG data directly in the compressed domain. A detailed study of MPEG audio compressed data, and of the corresponding analysis and processing strategies, is reported in the research work of Anantharaman (2001). Other significant contributions to MPEG audio compressed domain processing are by Tzanetakis and Cook (2000), Schuller et al. (2011), Rizzi et al. (2006), Shao et al. (2004) and Li and Han (2009).

In the literature, there are also attempts to perform pattern matching on compressed text files (Adjeroh et al. 2013). Searching for patterns is a crucial activity in many applications, performed constantly by both humans and machines. Pattern matching in compressed text is likewise intended to reduce storage requirements and run-time computation (Adjeroh et al. 2013). In conventional pattern matching, the text and the pattern are both given in uncompressed form, and the problem is to check for occurrences of the pattern in the text, which may involve exact or inexact matching. In the compressed pattern matching problem, the text, the pattern, or both are in compressed form (Amir and Benson 1992; Gasieniec and Rytter 1999).

The first attempt was made by Amir et al. (1996) to search for patterns in LZW (Lempel-Ziv-Welch) compressed files. LZW is a dictionary based compression method in which compression is achieved by replacing strings in the source text with unique codes generated for those strings in the dictionary. In practice, LZW compression is carried out using a tree-like data structure called a dictionary trie, which maps strings to their unique code words. Using this trie dictionary, the compressed patterns are matched and the occurrences reported. A detailed discussion is available in Amir et al. (1996) and Adjeroh et al. (2013), and extended works in Tao and Mukherjee (2005) and Gawrychowski (2012). On LZ77 compressed text, Farach and Thorup (1995) were the first to devise a search algorithm, by introducing informers which indicate the regions likely to contain the pattern; this scheme decompresses and analyzes only the regions identified by the informers. Later, a hybrid compression scheme based on LZ77 and LZ78 was introduced by Navarro and Raffinot (2004), which allowed more efficient pattern matching than that of Farach and Thorup (1995). The latest developments on the LZ family of compression can be traced through the work of Gawrychowski (2011, 2012). Finally, grammar based compression, which is closely related to dictionary based methods, was investigated for compressed pattern matching by Kieffer and Yang (2000) and Yang et al. (2001). In BWT (Burrows-Wheeler Transform) compression, the text is broken into blocks and then compressed; since pattern searching requires an entire block to be traversed, Bell et al. (2002) propose maintaining an inverted index at the locations where patterns are most likely to occur, and then carrying out the compressed pattern match. A detailed comparative study is provided by Adjeroh et al. (2008).
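To clarify the dictionary structure these methods operate on, the sketch below implements plain LZW compression with a string-keyed dictionary standing in for the trie; the compressed pattern matching algorithms of Amir et al. (1996) and later works build on (but go well beyond) this representation.

```python
def lzw_compress(text):
    """Dictionary (trie-like) LZW: the longest known phrase is replaced by its code."""
    dictionary = {chr(i): i for i in range(256)}    # initial single-character codes
    next_code, phrase, output = 256, "", []
    for ch in text:
        if phrase + ch in dictionary:
            phrase += ch                             # extend the current match
        else:
            output.append(dictionary[phrase])        # emit code of longest match
            dictionary[phrase + ch] = next_code      # add new phrase to dictionary
            next_code += 1
            phrase = ch
    if phrase:
        output.append(dictionary[phrase])
    return output, dictionary

codes, dictionary = lzw_compress("abababab")
print(codes)   # [97, 98, 256, 258, 98] -- repeated phrases reuse dictionary codes
```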

Run-length encoding is a compression algorithm that replaces a sequence of identical symbols with the symbol and the count of its occurrences. The initial ideas for pattern searching on run-length encoded text were given by Apostolico et al. (1997) and Bunke and Csirik (1993, 1995). In Adjeroh et al. (1999), video sequences were represented as strings, so that similar sequences produce similar strings; during the matching of video sequences, a special edit operation was defined to detect repeated strings based on the principle of run-length encoding. Pattern matching on other variations of run-length coded text was studied by Makinen et al. (2003) and Tamakoshi et al. (2013), and on 2D run-length coded data by Amir and Benson (1992) and Amir et al. (2006).
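A minimal sketch of the run-length representation, together with a simple query answered directly on the runs without expanding the text, is given below; the cited algorithms address general pattern matching on this representation, which is considerably more involved. The function names and example strings are our own.

```python
def rle_encode(text):
    """Run-length encode a string as (symbol, run length) pairs."""
    runs = []
    for ch in text:
        if runs and runs[-1][0] == ch:
            runs[-1][1] += 1
        else:
            runs.append([ch, 1])
    return [(c, n) for c, n in runs]

def count_long_runs(runs, symbol, min_run):
    """Count runs of `symbol` at least `min_run` long, directly on the runs."""
    return sum(1 for c, n in runs if c == symbol and n >= min_run)

runs = rle_encode("aaabbbbaacbbb")
print(runs)                           # [('a', 3), ('b', 4), ('a', 2), ('c', 1), ('b', 3)]
print(count_long_runs(runs, "b", 3))  # 2
```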

Direct searching techniques were also introduced for Huffman compressed files by Mukherjee and Acharya (1994). In order to search for patterns, they introduced a new data structure that separates byte boundaries and allows pattern matching with reference to the compressed pattern. Other strategies for searching patterns in compressed text are discussed in Moura et al. (2000), Ziviani et al. (2000) and Klein and Shapira (2005, 2011). Compressed pattern matching has also been reported for arithmetic coding (Bell et al. 2002), anti-dictionaries (Shibata et al. 1999), PPM (Prediction by Partial Matching) (Cleary and Teahan 1997), and byte pair encoding (Mandal et al. 1999; Manber 1997; Shibata et al. 1999).
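The underlying idea of searching in Huffman coded data can be sketched as follows: the pattern is encoded with the same code as the text and the resulting bit string is searched directly, with the caveat that a bit-level hit must still be verified to start on a codeword boundary. This is our own simplified illustration of the idea, not the data structure of Mukherjee and Acharya (1994).

```python
import heapq
from collections import Counter

def huffman_code(text):
    """Build a Huffman code (symbol -> bit string) from symbol frequencies."""
    heap = [[freq, i, {sym: ""}] for i, (sym, freq) in enumerate(Counter(text).items())]
    heapq.heapify(heap)
    next_id = len(heap)
    while len(heap) > 1:
        lo = heapq.heappop(heap)
        hi = heapq.heappop(heap)
        merged = {s: "0" + c for s, c in lo[2].items()}
        merged.update({s: "1" + c for s, c in hi[2].items()})
        heapq.heappush(heap, [lo[0] + hi[0], next_id, merged])
        next_id += 1
    return heap[0][2]

def search_compressed(text, pattern):
    """Encode text and pattern with the same code and search the bit strings.
    A hit is only a candidate: it must align with a codeword boundary to be real."""
    code = huffman_code(text)
    text_bits = "".join(code[ch] for ch in text)
    pattern_bits = "".join(code[ch] for ch in pattern)
    return text_bits.find(pattern_bits)   # bit offset of a candidate match, or -1

print(search_compressed("this is a test string", "test") >= 0)   # True
```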

Fig. 6 Overview of compressed domain techniques in the literature

An overview of the compressed domain techniques in the literature is given in Fig. 6.

5 Future research direction of DIA in the compressed domain

In the state-of-the-art section above, we saw that only a few DIA related operations have been attempted in the compressed domain, and that most of them have proven to be computationally efficient (Mukhopadhyay 2011; Javed et al. 2013; Adjeroh et al. 2013). Hence there is large scope for research in document image analysis carried out directly in the compressed domain. Important applications like graphics recognition (Ogier et al. 2009; Doermann and Tombre 2014), handwriting recognition (Wshah et al. 2012a, b; Rehman and Saba 2012; Ahmed et al. 2017; Dash et al. 2016; Jayadevan et al. 2012), image analysis (Kasturi et al. 2002; Marinai 2008a), texture based image representation (Dong et al. 2015a, b; Alvarez et al. 2012; Li et al. 2010), clustering analysis of document images (Li et al. 2016; Zeng et al. 2014; Xu et al. 2003; Shahnaz et al. 2006; Gambhir and Gupta 2016; Asghari and KeyvanPour 2015), document understanding (Ceci et al. 2005), camera based document processing (Iwamura and Shafait 2013), document image mining (Marinai 2008b), document forensics (Saini and Kaur 2016), historical document processing (Rath and Manmatha 2003, 2007), segmentation and restoration (Yong et al. 2010) and other innovative applications can be explored. Further, due to increasing digitization, there is also scope for automatic processing of digitally born documents (Doermann and Tombre 2014; Lee and On 2011; Asghari and KeyvanPour 2015) in the compressed domain.

6 Conclusion

Through this survey paper, we have presented a study of document image analysis techniques from the perspectives of image processing, image compression and compressed domain processing. The motivation for pursuing research directly on compressed document images was discussed. The document image processing techniques that have been attempted in the literature were reviewed on the basis of the data compression technique employed and the type of operations performed. Overall, the paper highlights the importance of compressed domain processing of document images and offers a perspective for further research in image processing and pattern recognition that directly employs compressed data.