
1 Introduction

The capacity of hard disk drives has increased tremendously; as a result, users store a large number of files on their personal computers. Users therefore often have difficulty locating a desired document even when they know it is saved somewhere on the disk. Nowadays, searching for documents can be faster on the World Wide Web than on a personal computer. Owing to the availability of a variety of web search engines and ranking algorithms such as the PageRank algorithm introduced by Google [1], web search has become more efficient than PC search. There is therefore a need for efficient search over desktop data. The main motivation of this work is to search files on a desktop system efficiently so that required data can be retrieved easily. Users also need to retrieve partial information from files. In this paper, we propose several ways of searching heterogeneous desktop data. The proposed system also retrieves partial contents from semi-structured data files, e.g., XML files. The rest of the paper is organized as follows: Sect. 2 describes the related work; Sect. 3 discusses the proposed system for desktop data search; and finally, Sect. 4 concludes the paper.

2 Related Work

Personal data refers to digital data that a person owns and accesses during his/her lifetime. It is a heterogeneous mix of word documents, pictures, XML data, audio files, video files, emails, and so on. This large amount of personal data may be spread over various devices such as desktop systems, laptops, homepage servers, e-mail servers, official websites, digital cameras, mobile phones, etc. Retrieving relevant information requires effective management of personal data [2]. Desktop data is the personal data on one’s desktop; unlike personal data in general, it is centralized in nature. Various desktop search engines (DSEs) have been developed for managing desktop data, including Windows Search [3], Google Desktop Search [1], Yahoo! [4], Copernic Desktop Search [5], and many more; some of them are compared on various parameters in [6]. DSEs are based on the file systems of the underlying operating systems and lack the capability to retrieve partial contents from files [7]. To search with a DSE, the user first submits a search query to the search engine, which then passes the query to the indexed database to obtain the required result [8]. A DSE employs one or more crawler programs that crawl desktop files and extract information, which the indexer uses to create the indexed database. The problems with DSEs are that they do not provide partial retrieval of information [7], do not support complex queries or semantic integration, and take significant initial indexing time. Modeling and querying heterogeneous desktop data is another important research issue. In iMeMex [9, 10], a graph data model has been proposed for modeling personal data. A new XPath-like query language named iQL is proposed to query over the uniform view; it is difficult for a novice to understand, as users are expected to know the underlying structure of the personal data. Similarly, various methods have been proposed to query XML data [11–13].

3 A Proposal for Searching Desktop Data

This section discusses a solution for managing desktop data that covers several aspects of searching: metadata, relationships, and the contents of XML files. The proposed system searches the file system based on metadata and on the contents of semi-structured files. Figure 1 depicts a context diagram of the proposed system. Users input queries to the desktop search system, which in turn interacts with the file system to retrieve the necessary information. The system returns results to the user after processing the queries. Figure 2 depicts a detailed data flow diagram (DFD) of the proposed desktop search system. The system is divided into two main modules: the first module searches files/folders based on their metadata, and the second module processes queries on XML documents. The search options offered by the system are summarized as follows:

Fig. 1 Context diagram of the proposed desktop search system

Fig. 2 Detailed DFD of the proposed work

  • A file is searched based on its metadata: name, size, extension, and last modified date.

  • A folder is searched based on its metadata: name, size, and last modified date.

  • The relationship hasfile directs the search to files.

  • The relationship hasfolder directs the search to folders.

  • The full contents of an XML file can be retrieved.

  • Partial contents of XML files can be retrieved based on tags and field names.

For a query over the metadata of files/folders, the user first enters the path of the file/folder and the relationship, either hasfile or hasfolder, to search files or folders, respectively. For example, a user searches drive “d” for all files that were last modified on January 10, 2015. Once the path and relationship are given, a hash table containing the metadata entries of the files/folders is created in memory, and the user gets the files/folders whose metadata match the query. Because the hash table is built in memory after the query is entered, this method of searching reflects the current state of the file system (an update guarantee). It also reduces the initial indexing time that desktop search engines incur. Algorithm 1 searches files/folders based on their metadata, and Algorithm 2 searches the contents of XML files.

Algorithm 1 (Metadata-based search)

  Step 1: Enter the path and relationship.

  Step 2: Map the corresponding metadata entries in a hash table: name, extension, size, and last modified date for files; name, size, and last modified date for folders.

  Step 3:
    if relationship is “hasfile” then
      read choice in ch for metadata from 1 to 4
      1. name 2. size 3. extension 4. last modified date
    else if relationship is “hasfolder” then
      read choice in ch for metadata from 1 to 3
      1. name 2. size 3. last modified date
    end if

  Step 4:
    if (ch == 1) then
      search hash map entry for file/folder name and print result
    else if (ch == 2) then
      search hash map entry for file/folder size and print result
    else if (ch == 3) then
      search hash map entry for file’s extension/folder’s last modified date and print result
    else if (ch == 4) then
      search hash map entry for file’s last modified date and print result
    end if
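The steps of Algorithm 1 can be illustrated with a minimal Python sketch. It is not the implementation of the proposed system: the function names (build_metadata_table, search, search_by_choice, folder_size) and the CHOICES table are hypothetical, and the sketch assumes that a file name is matched without its extension, that a folder's size is the total size of the files beneath it, and that dates are compared at day granularity in local time.

```python
import os
from datetime import date, datetime

# Menu choices of Algorithm 1 (Step 3), mapped to metadata fields.
CHOICES = {
    "hasfile": {1: "name", 2: "size", 3: "extension", 4: "modified"},
    "hasfolder": {1: "name", 2: "size", 3: "modified"},
}


def folder_size(path):
    """Total size in bytes of all files under `path` (assumed meaning of folder size)."""
    total = 0
    for root, _dirs, files in os.walk(path):
        for name in files:
            try:
                total += os.path.getsize(os.path.join(root, name))
            except OSError:
                pass  # unreadable file: leave it out of the total
    return total


def build_metadata_table(path, relationship):
    """Step 2: map every file (or folder) under `path` to its metadata."""
    if relationship not in CHOICES:
        raise ValueError("relationship must be 'hasfile' or 'hasfolder'")
    table = {}
    for root, dirs, files in os.walk(path):
        for name in (files if relationship == "hasfile" else dirs):
            full = os.path.join(root, name)
            try:
                info = os.stat(full)
            except OSError:
                continue  # entry vanished or is unreadable
            entry = {
                "name": name,
                "size": info.st_size,
                "modified": datetime.fromtimestamp(info.st_mtime).date(),
            }
            if relationship == "hasfile":
                stem, ext = os.path.splitext(name)
                entry["name"] = stem  # assumption: names are matched without extension
                entry["extension"] = ext.lower()
            else:
                entry["size"] = folder_size(full)
            table[full] = entry
    return table


def search(table, field, value):
    """Step 4: return the paths whose metadata `field` equals `value`."""
    return sorted(p for p, e in table.items() if e.get(field) == value)


def search_by_choice(path, relationship, ch, value):
    """Steps 1 to 4 combined: build the table, then look up choice `ch`."""
    field = CHOICES[relationship][ch]
    return search(build_metadata_table(path, relationship), field, value)


# Example (hypothetical path): files on drive d last modified on January 10, 2015
# search_by_choice("D:/", "hasfile", 4, date(2015, 1, 10))
```

Because the table is rebuilt from the file system on every call, the sketch needs no persistent index, at the cost of walking the given path for each query.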

Algorithm 2 (Content-based search on XML files)

  Step 1: Enter the path of the file.

  Step 2: Read choice in ch for file’s contents:
    (1) full contents (2) tag’s data (3) subtag/subfield’s data

  Step 3: Parse the query.

  Step 4:
    if (ch == 1) then
      get and print full contents of XML file
    else if (ch == 2) then
      get and print all data of tag name
    else
      get and print data of subtag
    end if
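The three retrieval options of Algorithm 2 can likewise be sketched in Python with the standard xml.etree.ElementTree module. This is an illustration under assumptions rather than the system's implementation: the function name xml_query and its parameters are hypothetical, and the sketch assumes that "all data of a tag" means every non-empty text fragment found inside each element with that tag.

```python
import xml.etree.ElementTree as ET


def xml_query(path, choice, tag=None, subtag=None):
    """Return XML contents according to the choice read in Step 2.

    choice 1: full contents of the file as a string
    choice 2: for each element named `tag`, the list of text fragments inside it
    choice 3: text of each `subtag` element nested inside an element named `tag`
    """
    root = ET.parse(path).getroot()
    if choice == 1:
        # Serialise the whole tree back to text.
        return ET.tostring(root, encoding="unicode")
    if choice == 2:
        # One list per matching element; whitespace-only fragments are dropped.
        return [
            [text.strip() for text in elem.itertext() if text.strip()]
            for elem in root.iter(tag)
        ]
    if choice == 3:
        # Partial retrieval: only the requested sub-field of each matching tag.
        return [
            (sub.text or "").strip()
            for elem in root.iter(tag)
            for sub in elem.iter(subtag)
        ]
    raise ValueError("choice must be 1, 2, or 3")
```

For a file whose employee records contain a name element with first and last children (a hypothetical layout), xml_query(path, 2, tag="name") would return the full names and xml_query(path, 3, tag="name", subtag="last") only the last names.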

Some sample queries that the proposed system processes are:

  1. Search files from drive d where the file size is 500 MB.

  2. Search file named nisha from e drive.

  3. Search all folders from drive g which are modified on January 10, 2015.

  4. Search for folder named nishafol from f drive.

  5. Search files from drive d which are modified on January 11, 2015.

  6. Search all .xls files from d drive.

  7. Display employee names from file EmpData.xml located in drive d.

  8. Display employee’s postal addresses from file EmpData.xml located in drive d.

  9. Display all information related to employees from Empdata.xml, which is located in f drive.

  10. Display contents of file nisha.xml from g drive.

  11. Display employee’s last names from file Empdata.xml from d drive.

4 Conclusion

Managing a user’s desktop data is a present-day need, as desktop data is large in volume and changes frequently. Various desktop search systems, such as Google Desktop Search and Copernic Desktop Search, have been developed for managing personal desktop data. However, these search engines require extra indexing time before they can start work and do not support partial retrieval of contents from files. In this paper, we propose the design of a desktop data search system that allows search over desktop data using metadata, as well as partial and full content retrieval from XML files. The implementation of the proposed system is at an advanced stage; extending its functionality is part of our future plan.