1 Introduction

According to the increase in digital devices, a large part of information has been stored as a type of digital data. Therefore, digital evidences are recognized as an important factor in various criminal investigations. As such digital data can be easily fabricated and modified, it is necessary to carefully manage the data in order to use it as legal evidences because it can be very easily damaged using the conventional data management method. Thus, digital forensics has been raised to treat technical and procedural issues for collecting, producing, analyzing, and processing digital evidences. The objective of digital forensics is to clarify realistic truth by analyzing the stored data in a digital device that is related to a criminal case and accept it as an effective evidence in a court.

In a digital forensic investigation, an evidence recovering process is generally implemented through an evidence collection process, and an evidence analysis is carried out using the data in this process. The evidence recovering process is an important step to obtain evidences in a digital forensic investigation. The process can be classified as a method that uses the information or meta data in file systems and a file carving method that recovers files based on signatures and file architectures in the unallocated space of a storage region without any information [4, 7].

Studies on the file carving have been focused on the recovery of document or image files. Thus, there are few studies on the recovery of multimedia files like video and audio files. The largely used multimedia files, such as AVI, WAV, and MP3, can be exactly recovered using an internal format in files as they are continuously allocated.

Also, the file carving tools widely used at the present time cannot exactly recover NTFS compressed files. It is due to the fact that such carving tools are not considered for that files. As the NTFS compression has been frequently used to save data storage space, there is a high possibility in presenting deleted NTFS compressed files in unallocated space. If multimedia files are stored as the NTFS compressed files, the exact recovery of files is difficult. Therefore, it is necessary to study a recovery method for such deleted NTFS compressed files in file carving.

In this study, we propose a carving method for multimedia files and represent a recovery plan for deleted NTFS compressed files. In addition, we propose a way to apply such a recovery method to the carving of multimedia files.

2 Multimedia file carving method

The carving of multimedia files can be performed using an internal format by dividing the files using the Signature of the file header. In general, although the files, which are recovered using the file carving method, can be executed using a specific application of such files if the data of files is perfectly presented, the data of multimedia files can be verified by reproducing them up to the middle of the recovered section through executing it if a space between the header of files and a specific data area exists in most multimedia files.

In this study, we propose a file recovery method that can reproduce data for the carving of multimedia files based on the characteristics of multimedia files. The next section represents the carving of AVI, WAV, and MP3 files.

2.1 AVI files carving

AVI files are one of the most largely used video files at the present time. The AVI files internally use a resource interchange file format (RIFF). The RIFF is a common format used to store various types of data, such as image and audio, as a single file and used as a base format of the multimedia application of Windows. The RIFF consists of several Chunks, Lists, and Data. It is used in various files, but it shows different types of Chunks and Lists used inside the files [1, 2].

For carving AVI files, the Signatures of the header is first to be identified. The AVI file header is a size of 12 Bytes, and Fig. 1 shows its structure.

Fig. 1
figure 1

AVI file header

As shown in Fig. 1, it can be seen that the 4 Bytes 0x52494646 RIFF Signature with offset 0 and the 0x41564920 AVI Signature with offset 8 exist at the AVI file header. After verifying AVI files using such Signatures, it is possible to obtain the data of files using the size of four Bytes with offset 4 that represents the size of the file.

It should verify that whether the minimum data, which is reproducible, exists by checking the format of AVI files before obtaining the data of files. If there is no reproducible data, the file is not recovered.

Figure 2 shows the structure of AVI files. The hdrl List, movi List, and idx1 Chunk are located under the most upper Header. The hdrl List defines the format of data, the movi List stores the actual data of streams, and the idx1 Chunk stores the Data Block information of the movi List. As the original data from the Header to the header of the idx1 Chunk, the all streams that are actually reproduced can be recovered. Also, as a part of the movi List data exist, the recovery can be performed as much as the existing data and that can also be reproduced. If the movi List does not exist, the recovery will not be carried out under the judgement that cannot recover the subject files.

Fig. 2
figure 2

AVI file structure

The hdrl List exists right after the header. Figure 3 shows the structure of the hdrl List. It is possible to find the 4 Bytes 0x4c495354 List Signature with offset 0 and the four Bytes 0x6864726c hdrl Signature with offset eight at the beginning of the hdrl List. If the Signature of the hdrl List is checked, the beginning of the movi List can be seen using the List Size with offset four.

Fig. 3
figure 3

hdrl list format

Figure 4 shows the structure of the movi List. As shown in Fig. 4, the four Bytes 0x4c495354 List Signature with offset 0 and the four Bytes 0x6d6f7669 movi Signature with offset eight are to be seen at the beginning of the movi List. If the Signature of the movi List is checked and a some part of data exist, the recovery can be attempted. For checking whether files are perfectly presented, it can be verified by moving the position to the beginning of the idx1 Chunk using the List Size with offset four.

Fig. 4
figure 4

Movi list format

Figure 5 shows the structure of the idx1 Chunk. As shown in Fig. 5, if the four Bytes 0x69647831 idx1 Signature with offset 0 is verified at the beginning of the idx1 Chunk, it can be regarded that all practically reproducible data exist.

Fig. 5
figure 5

Idx1 Chunk format

2.2 WAV file carving

WAV files are one of the most widely used audio files at the present time. As well as the AVI files, the WAV files use the RIFF internally. For carving WAV files, it is necessary to first identify the Signatures in the header. The size of the WAV file header is 12 Bytes. Figure 6 represents the structure of the header.

Fig. 6
figure 6

WAV header format

As shown in Fig. 6, the four Bytes 0x52494646 RIFF Signature with offset 0 and the 0x57415645 WAV Signature with offset 8 are verified at the WAV file header. After verifying the WAV file using the Signature, the data of files can be obtained using the four Bytes file size with offset four.

It should verify that whether the minimum data, which is reproducible, exists by checking the format of WAV files before obtaining the data of files. If there is no reproducible data, the file is not recovered. The fmt Chunk and data Chunk are presented under the most upper Header. The fmt Chunk defines the format of data, and the data Chunk stores actual audio data as the unit of a specific block. As WAV files represent no footers of the data Chunk where actual stream data do not exist differed from AVI files, it is not possible to verify whether the data files exist perfectly. However, as the reproducing is possible as a part of data stream exist for a specific level, the recovery is attempted under the consideration that a some part of stream data exist as the Signature of the data Chunk is verified. However, if the Signature is not verified, the recovery will not be performed [6, 10].

The fmt Chunk exists right after the header. Figure 7 shows the fmt Chunk format. It is possible to verify the four Bytes 0x666d7420 fmt Signature with offset 0 at the beginning of the fmt Chunk. If the Signature of the fmt Chunk is verified, the beginning of the data Chunk can be checked using the Chunk Size with offset four.

Fig. 7
figure 7

fmt Chunk format

Figure 8 represents the data Chunk format. As shown in Fig. 8, the four Bytes 0x64617461 data Signature with offset 0 can be verified at the beginning of the data Chunk. If the Signature of the data Chunk is checked and a some part of data exist, the recovery can be attempted. As mentioned above, however, there are no footer, it is difficult to verify whether the data of files exist perfectly.

Fig. 8
figure 8

Figdata Chunk format

2.3 MP3 file carving

The MP3 (MPEG Audio Layer-3) is a lossy compression format developed as an MPEG-1 audio specification. It has been considered as the most popular audio file format due to the improvement version of the MP1 and MP2.

Figure 9 shows the MP3 file structure. The MP3 file stores data as a frame unit called AAU (Audio Access Unit). Also, each AAU consists of Header, CRC, Side Info, and Main Data. The AAU stores actual stream data in the Main Data area, and the stream data can be reproduced using a single AAU only because each AAU stores data independently.

Fig. 9
figure 9

MP3 file structure

Figure 10 represents the AAU Header format. There exists the two Bytes 0xfffb Signature with offset 0 from the beginning of the Header. The first four bits in the one Byte with offset two shows the Bit Rate, and the next two bits represent the Sampling Frequency, and the next one bit shows the Padding. Tables 1 and 2 represent the Bit Rate and Sampling Frequency, respectively, according to each bit [8].

Fig. 10
figure 10

AAU header

Table 1 Bit rate for each bit
Table 2 Sampling frequency for each bit

Then, each AAU size can be calculated using the equation of \( \left[ {{144}*{\hbox{Bit Rate}}/\left( {{\hbox{Sampling Frequency}} + {\hbox{Padding}}} \right)} \right] \) based on these properties. For instance, the Bit Rate presented in Fig. 10 is 00001001, it becomes 128 kbps. Also, as the Sampling Frequency is 00, it becomes 44100, and the Padding is 0. The the size can be calculated as 417 Bytes by substituting these values to the equation as \( \left[ {{144}*{128}000/\left( {{441}00 + 0} \right) = {417}} \right] \) [8].

The MP3 file can be classified into three different types, such as the format that is configured by using the AAU only, the ID3 Tag Ver.1 that stores tags at the AAU and the end of the file, and the ID3 Tag Ver.2 that stores tags at the AAU and both the beginning and the end of the file. Because each type shows different configurations, the carving for each type is also a bit different [8].

The carving of the file, which is configured by the AAU only, verifies whether the file is the MP3 file format using the Signature of the AAU Header. Then, the Signature of the next AAU is to be verified by checking the size of the AAU after that verification. If the Signature is agreed, the next AAU can be searched using the same way. However, the Signature is not the same, the verified AAU is to be recovered only. In the case of the MP3 file that is configured by the AAU only, it is difficult to verify whether the file is perfectly recovered because the footer of the file or the entire size does not exist. However, if all data are continuously allocated, the file can be fully recovered. Also, if all data are not presented, it is possible to recover the data up to the range where the AAU exists.

The carving of the ID3 Tag Ver.1 MP3 file can be performed as the same way used in the MP3 file that is configured by the AAU only. Regarding the difference between these two methods, as the tag data is presented at the end of the file in the ID3 Tag Ver.1, the 0x544155 3 Bytes determined by the Signature of the tag are to be determined as the end of the file and that leads to perfectly recover the file including the tag of 128 Bytes. If the tag data is not verified, the verified AAU is only recovered.

The carving of the ID3 Tag Ver.2 MP3 file is performed by verifying the tag because the tag is presented at the beginning of the file.

Figure 11 shows the tag header of the beginning of the ID3 Tag Ver.2 MP3 file. As the three Bytes 0x494433 with offset 0 is verified from the beginning of the tag header, it is necessary to move the position to the beginning of the AAU through verifying the tag of the four Bytes with offset six. The tag size can be obtained by adding the size of the tag header (ten Bytes) to the residual bits, which are continuously calculated, except for the left first bit in each byte. Then, the carving is to be carried out using the same way as the ID3 Tag Ver.1 MP3 file.

Fig. 11
figure 11

ID3 Tag Ver.2 header

3 NTFS conpression files carving

Since the NTFS version 5.0, it provides a compression function at a file system level. As the compression is executed at a file system level, it is not necessary to consider whether the subjective files are compressed in application programs [3].

As shown in Fig. 12, the configuration of the NTFS compression is implemented at the disk properties or at the window of the advanced attributes for files or folders as shown in Fig. 13.

Fig. 12
figure 12

Window for the configuration of drive compression

Fig. 13
figure 13

Window for the configuration of file compression

Although the NTFS compression file is useful to increase the efficiency of storage space, it represents a difficulty in carving the file. In practice, most carving tools and studies do not consider such NTFS compression file. Therefore, in this research, we describe the basic concept of the NTFS compression and represent a carving method for the file. Also, we propose a practical application of this method to multimedia files.

3.1 NTFS compression

In the NTFS compression, it is not carried out for the entire files, but for the unit size of 16 clusters. Also, this unit is used to process the reading and writing of data. If the size of clusters is defined by 512 bytes, the compression unit is 8 KB (512 * 16) [5, 9].

The compression unit consists of several compression blocks. After implementing a compression process, as represented in Fig. 14, each compression block shows three different states, such as compressed data, uncompressed data, and data with sparse properties, according to the condition of original files. Also, it may represent file slacks at the end of the compression unit due to the compression [5, 9].

Fig. 14
figure 14

State of the file compression

In the file compression policy in NTFS files, it can be varied by the case that newly generates files in the same storage space as the partition or folder, which applies compression properties, and the case that applies the compression for the files, which are already stored in a common partition.

In the case that newly generates compression files, the compression unit is continuously allocated as shown in Fig. 15 as it is allocated as a continuous manner.

Fig. 15
figure 15

Allocation state of the newly generated compressed file

In the case that applies the compression for the normal files, which are already stored, the compression is uniformly carried out with a constant interval with 16 clusters as illustrated in Fig. 16. Also, the data is not to be continuously stored due to the generation of unallocated space as much as the reduced data caused by the compression.

Fig. 16
figure 16

Allocation state of the compression of normal files

The NTFS compression uses an LZNTI algorithm based on the LZ77 algorithm. The LZNT1 algorithm implements the compression with the previously mentioned compression unit, and each compression block in the compression unit consists of a 2-byte header and several data groups. Table 3 shows the information of the header for each bit [5].

Table 3 Header information of the compression block

The first 12 bits in 16 bits represent the size of the compression block. The starting position of the next compression block can be determined by calculating the position of the present compression block + the size of the present compression block + 3. If the first two bytes in the next compression block is 0x0000, the position ranged from the next value to the end of the cluster is a file slack region. The last bit in 16 bits shows whether the data is compressed. As the subjective bit is one, the data stored in the compression block is compressed. In the case of the bit that is defined by 0, the data is not compressed. In the case of the data that is not compressed, the original data is to be stored without changes. If it is compressed, the data will be compressed according to the LZNT1 compression algorithm [5, 9].

Figure 17 illustrates the composition of the data group in the compression block.

Fig. 17
figure 17

Composition of the compression block data

As there is not data group in the compression block, the data consists of tag bytes and normal eight bytes as illustrated in Fig. 17. The data group located at the right side in Fig. 17 represents two compression bytes in the data group. A single compression byte is composed by two bytes. Therefore, an increase in the compression byte by one increases the size of the data group by one byte. The data group in Fig. 17 has a total of ten bytes because it has two compression bytes [5].

A tag byte plays a role in presenting the position of compression bytes in the data group. As the tag byte is considered as a bit unit, in the case of the bit value that is determined by one, the byte at the subjective position is the compression byte. The data group located at the right side in Fig. 17 represents the compression byte at the third and sixth bytes. Thus, the bit value of the tag byte is 0x00100100 (Bits) [5].

The LZNT1 compression implements a compression work that stores the offset and size of the subjective data as the previously stored data has duplicated contents.

Figure 18 represents an example of the storing of compressed data. The value of ‘00000000 (Bits)’ in Fig. 18 shows the expression of one byte to bits and is a tag byte that is used as an index value, which identifies a compression byte in the last eight bytes. The compressed byte stores the offset and size of the compressed original data. As shown in Fig. 18, the third line, ‘00000100 (Bits)’, shows the third byte, 0x8807, in the last eight bytes. By using it, it is possible to calculate the offset and size of the data [5].

Fig. 18
figure 18

Example of the storing of compressed data

The offset can be calculated by adding 1 after shifting the compressed data 11 times to the right side and multiplies −1 to the calculation. The size can be calculated by adding three after applying 0x07FF and AND with the compressed data. Figure 19 illustrates the results of the offset and size of the compressed data of 0x8807. Using the results of this calculation, it can be seen that the actual data of 0x8807 represents the data, which is moved by −18 from the present location, that is ten for the string of “abcdefghij”.

Fig. 19
figure 19

Calculation of the offset and size of compressed data

As the compressed data in Fig. 18 is fully decompressed, it represents the string as shown in Fig. 20

Fig. 20
figure 20

Decompressed strings

3.2 Method for the NTFS compressed file carving

3.2.1 LZNT1 compression determine whether measures

It is necessary to process a procedure that identifies whether the subjective clusters are compressed for carving NTFS compressed files.

As mentioned in Section 3.1, the LZNT1 compression algorithm is composed by a compression unit with 16 clusters, and the compression unit consists of several compression blocks. The position of the next compression block can be identified using the size information in the header of the compression block in which the last compression block represents the value of 0x0000 at the header of the next compression block. Regarding that a sequential round in the header of the compression block is carried out by using this position, it can be regarded that the corresponding 16 clusters are normal data without the LZNT1 compression as the value of 0x0000, which represents the last compression block, does not exist and that exceeds the range of the compression unit.

In addition, in the case of normal compression blocks, the size information of the header of the compression block should represent the same value as the summation of the size of data groups in actual compression blocks. If the value is different, the corresponding compression unit can be regarded as normal data without the LZNT1 compression.

In the case of the data compression using the LZNT1 compression, the compression unit is to be generally determined by the size of 16 clusters. That is, the original data that is actually compressed has the size below 16 clusters. Therefore, the decompression for a normal compression file cannot represent more than 16 clusters. If the decompressed data shows more than 16 clusters, the data can be considered as normal data without compression.

3.2.2 Carving algorithm of the NTFS compressed file

For carving a NTFS compressed file, the comparison of file signature is to be possibly implemented. The first two bytes in the NTFS compressed file is the header of the compression block, and the third byte represents the index of the compressed data. Thus, the signature of the file is stored from the fourth byte.

Figure 21 shows the header of the compression block and the file signature. Also, it is possible to identify the two-byte compression block header and the signature of a MP3 file with 0x494443. It is difficult to identify the duplication of the first data in the compression block. Also, because the signature does not apply duplicated values, the data is not compressed but stored. Thus, in a carving process, the start position of the compressed file can be identified by checking the signature from the fourth byte. If the signature of a compressed file is searched, the original data can be obtained by decompressing the corresponding compression block using the LZNT1 algorithm.

Fig. 21
figure 21

Compression block header and the file signature

Compressed file uses a 16-cluster compression unit, and the compression unit cannot be separated. That is, it is possible to exactly recover the file using a file carving process because the compressed file below 16 clusters is not separated.

In the continuous allocation of the NTFS compressed files with more than 16 clusters, the header of the file can be searched by checking the signature from the fourth byte as the same way as the carving of the compressed files with below 16 clusters. As the header of the file is searched, it is necessary to identify the LZNT1 compression of the next 16 clusters after decompressing the file. If the corresponding clusters are compressed, it can be identified that whether the data is composed by a single perfect file through connecting it to the previously obtained data after decompressing the compressed file. For implementing this work, a general carving method that compares the footer of a file is used. In the results of the comparison, as the file is not a perfect figure, the file recovering is carried out by repeating the comparison using the next 16 clusters until it represents a perfect file. If it is identified as a none-compressed cluster during the comparison, the carving will be halted due to the fact that the data of the corresponding file is vanished during the process (Fig.  22).

Fig. 22
figure 22

 

In the case of multimedia files, the data can be reproduced even though the entire data are not presented. Thus, it is possible to recover the data up to the section in which the format of multimedia files is verified using the carving method as mentioned above even though all data are not perfectly presented.

As mentioned in Section 3.1, the file allocation policy shows a different way for the new generation of files in the partition where the compression is applied as the same as the compression of the normal files, which are already stored. Therefore, it is necessary to consider two different cases that verify both the next clusters after the present compressed clusters and the clusters after the 16 clusters when the verification of the compression of data is carried out through moving it to the next clusters.

By using this method, it is possible to implement the carving for the continuously allocated compressed files with more than 16 clusters. Figure 12 shows the pseudo-code for the carving algorithm of the NTFS compressed file. Whereas, it did not consider the carving for the discontinuously allocated compressed file more than 16 clusters.

4 Experiments

We compared the multimedia file carving results of proposed method from this paper and commercial file carving tools. Next tables show AVI, WAV, MP3 file carving result Tables 4, 5, 6.

Table 4 AVI files carving result comparison
Table 5 WAV files carving result comparison
Table 6 MP3 files carving result comparison

First column of each table represents the capacity and used capacity of each hard disk. And the rest of column data represents the number of recovered files, the file number and percentage of normal execution. Recovered files number of the proposed method from this paper less than commercial file carving tools. But the number of normal executed file is same and rate is high. Therefore, recovered file verification time is reduced. And, after delete the NTFS compression multimedia files, we performed a file carving. As a result, if using proposed method from this paper, we confirmed that deleted NTFS compression files are recovered. But, the commercial file carving tools did not recover the files.

5 Conclusion

In this research, we proposed a carving method for multimedia files. Also, we proposed a carving method for the NTFS compression file and its practical application method for multimedia files. According to the increase in the performance of computers and in the storage capacity of hard disks, storing user’s data as video and audio files has been increasing in recent years. Therefore, it can be considered that the importance of such multimedia files has also been increased. The method proposed in this study is able to perfectly recover AVI, WAV, and MP3 files as the files are continuously allocated. In addition, although the files are discontinuously allocated or the entire data of the files are not allocated, it can be verified by recovering the data, which are continuously presented, only due to the characteristics of multimedia files.

In addition, in the case of the multimedia files that are stored as the NTFS compression format, this study proposed a carving method using the characteristics of the NTFS compression file. Although the NTFS compression represents an efficient way to use storage space, it shows a difficulty in obtaining deleted data. However, it is possible to recover the average files and multimedia files that are stored as the NTFS compression format using the method proposed in this study.

5.1 Future work

We are study on the carving method for other multimedia files such as MPEG(Moving Picture Experts Group), WMV(Windows Media Video), FLV(Flash Video). And we will develop a carving tool using multimedia files carving method.