1 Introduction

Research on ancient Chinese characters matters both for the digitization of ancient documents and for the popularization and dissemination of Chinese civilization.

In a large-scale research project in this field, researchers are often confronted with a large number of characters to study. To identify the attributes of an ancient Chinese character, including its pattern, pronunciation and meaning, researchers must draw fully on their knowledge of ancient Chinese characters, read many related references and discuss frequently with other researchers. It is also necessary to manage, in a systematic way, the researchers, the research tasks and the research resources, including the character images under study and the related references. Information technology can make these activities more convenient and more efficient.

The total number of Chinese characters is hard to determine. At present, many ancient Chinese characters are still not included in any existing coded character set, which causes problems when they are processed by computers. Characters are stored in computers as codes so that they can be transmitted and processed conveniently. When a character is to be displayed or printed, the operating system loads the corresponding character pattern according to its code. A character outside the coded character set can therefore only be displayed or printed as an image, which consumes more memory and raises compatibility issues with normal text. Moreover, such character images cannot be searched with normal text retrieval techniques; they can only be found with the more complex technology of image retrieval. Unfortunately, most of the ancient Chinese characters under research are not coded and have to be treated as images, which causes many problems when they must be processed together with ordinary text consisting of coded characters.

The analysis above shows that a dedicated assisting model for ancient Chinese character research needs to be researched and developed.

Constructing a network-based managing model to assist research on ancient Chinese characters draws on theory and technology from many fields, including CSCW (Computer Supported Cooperative Work) [1], networks, database management systems, and image processing and retrieval. In detail, building the model involves website construction, ancient Chinese character resource management and cooperation, image-text integrated arrangement, ancient Chinese character sorting according to different dictionary radicals, ancient Chinese character image retrieval, and so on.

The theory of CSCW has been researched for many years and provides basic support for our model construction. Its objective is to construct a computer-based system with which people can cooperate to accomplish a common task [1]. Rama and Bishop compared CSCW groupware systems, including three commercial systems and four academic systems, and designed a set of multidimensional criteria for comparing CSCW systems [2]. Penichet et al. proposed a flexible classification method for CSCW systems based on logical principles [3]. Chen discussed the key problems of cooperative platform systems, including system hierarchy structure, user interface, consistency maintenance, concurrency control, access control and record management. A prototype system, RITIS (Real-time Image and Text Interactive System), composed of clients and servers with centralized and peer-to-peer structure, was constructed. It supports real-time image and text interaction in an Internet environment and a multi-user WYSIWIS interface [4].

Image-text integrated arrangement organizes text and images together in one layout. To achieve this, the spatial information of images and characters must be recorded and utilized [5, 6]. Compared with text information processing, image-text integrated arrangement is more complex in terms of input, editing, display and output [7–12]. Yang and Cheng designed an XML-based scheme for image-text mixed arrangement. Working in Java, they employed the MVC design pattern to separate text from its view, which minimizes the coupling among modules and improves the extensibility of the system. Because Java and XML are platform independent, the proposed prototype system has better flexibility [5]. Lu proposed a method for storing and editing online documents with image-text mixed arrangement, using the open source server control FreeTextBox and ASPJpeg with ASP.NET [6]. Zhang et al. put forward a question bank management system based on the B/S mode. Using the DSOFramer container and the editing functions of Word, they realized the input, editing and composition of test papers containing image-text integrated questions [7]. Fan discussed image-text mixed arrangement for test papers, in which images as well as text appear frequently. He classified the layouts into three types, namely non-image layouts, single-image layouts and multiple-image layouts, and used VB and SQL Server to process and manage the three types respectively [8]. Zhang and Chen studied methods for collecting, storing and extracting test papers that contain text, formulas, tables and images. They used Delphi as the development tool and realized the import and export of test papers [9].

In the field of image clustering and retrieval, two types of strategies, the text-based method (TBIR) and the content-based method (CBIR), are adopted to obtain required images from an image library [13–20]. Yang researched image clustering and its application in image retrieval. After analyzing existing image clustering features and algorithms, Yang designed an image retrieval system based on image clustering. The images in the library are first clustered with the AP algorithm and an image index is built. Then the sample image is searched in the index to find the corresponding class, and the subsequent image matching is carried out only in this class [13]. Xie studied image clustering and retrieval. To solve the problems of traditional image clustering algorithms, Xie proposed an MRF-based image clustering method that transforms the clustering task into an energy minimization process, and designed a local image retrieval method based on graph cutting [14]. Zhuang et al. proposed a novel method for retrieving Chinese calligraphic characters, in which character images are matched with an approximate point correspondence algorithm. After the contour points are extracted, the approximate point correspondence is computed and the character images are matched according to their accumulated matching cost [15]. Zhuang et al. put forward a retrieval method for Chinese calligraphic manuscript images based on a probabilistic indexing structure called PMF-Tree (Probabilistic Multiple-Feature-Tree). Integrated features such as the contour points of character images, character styles and types are used in retrieval, and users may select any one of these features as the retrieval component [16]. Chen proposed an image retrieval method based on the integration of a global statistical feature and a local bitmap feature. The mean-variance of the RGB values of an image is calculated as the global feature. The image is then divided into sub-areas, whose mean values are binarized to form the local feature. Finally, image retrieval is run on the combination of the global and local features [17]. Kong et al. designed a semi-supervised image retrieval method. The characteristic points of an image are extracted with an improved Harris algorithm. The image is divided into regions of interest, and color and texture features are extracted. The semantic relation between the image and its class is then established through semi-supervised learning in the image feature space. Finally, the similarity between images and class centers is computed [18].

The theory and technology for constructing websites that assist research work are mature, so their details are not discussed here.

The above work laid the foundation for our research and development. In this paper, considering the requirements of ancient Chinese character research, we construct a web-based cooperation and retrieval model of character images. It is composed of several modules, including ancient Chinese character image management and retrieval, research work cooperation, and research conclusion and reference management. We discuss the key techniques employed in this system, such as the character image cooperation mechanism, image-text integrated arrangement, sorting of research conclusion data by Chinese character radicals, and global and local retrieval of ancient Chinese character images.

The paper is organized as follows. Section 2 outlines the overall architecture and functions of the model. Section 3 analyzes the key techniques employed in the model. Section 4 presents and analyzes the experimental results. Finally, Sect. 5 gives conclusions and discusses further work.

2 Architecture of the Cooperation and Retrieval Model

The objective of the model is to manage the cooperation among the character images to be studied, the researchers and the records of research conclusions. It also provides image and document retrieval services to Chinese character researchers during their research work. The architecture of the model is shown in Fig. 1.

Fig. 1. The architecture of the cooperation and retrieval model

The input data of the model are images of single ancient Chinese characters.

Ancient books are first digitized with optical sensing devices (scanners or digital cameras) to form layout images. A page layout analysis and character image segmentation program then segments these layout images into a series of single-character images, which are supplied to the ancient Chinese character researchers.

The output data of the system are the research conclusion records, which include the pattern, pronunciation, meaning and other attributes of each character image.

Based on an analysis of the requirements of ancient Chinese character researchers, the design principles of the model are as follows.

Principle 1. Uniqueness principle. Each image is given a unique key code when it is stored into the library of character images, and it is allocated to only one researcher for study.

This prevents a character image from being assigned to more than one researcher at a time, which would lead to conflicting conclusions.

Principle 2. Hierarchy principle. The users of the model are divided into levels with different authority according to their roles in the research work.

Users with different authority have different scopes of operation, which effectively prevents erroneous operations on the research data.

Principle 3. Independence principle. The research conclusions of a researcher about a character image can be modified only by that researcher. Other people can offer suggestions but cannot change the research records.

This protects research conclusion data from being modified by anyone other than the original researcher of the character image.

Principle 4. Compatibility principle. The system organizes both coded characters and images of uncoded characters normally by means of image-text integrated arrangement.

The image-text integrated arrangement problem arises not only when research data are displayed but also when research records are imported into and exported from the database. A special structure is therefore needed to handle these cases, so that researchers can use the conclusion records normally.

The data flow diagram of the model is shown in Fig. 2.

Fig. 2. Data flow diagram of the model

3 Key Techniques in the Model

3.1 Cooperation Mechanism of Research Task

To enforce the principles governing the relationship among researchers, images and research conclusion records, control fields must be maintained in the data libraries of the different elements.

Let ResearchState be a field in the ancient Chinese character image library, SelectID a field in the researcher library, and ExpertNumber a field in the research conclusion library. The values these fields take during cooperation are defined in Table 1.

Table 1. Cooperation attributes in library.

By properly setting the flag fields (semaphores) shown in Table 1, the principles for assigning research resources are obeyed, which keeps the research work running normally.
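As an illustration, the following Python sketch shows how such flag fields could enforce the uniqueness and independence principles. The field names ResearchState, SelectID and ExpertNumber come from the text; their value encodings (`UNASSIGNED`, `IN_RESEARCH`), their exact meanings, and the `Workbench` class are assumptions made for this sketch, since the actual values are defined in Table 1.

```python
from dataclasses import dataclass
from typing import Dict, Optional

# Hypothetical encodings of ResearchState; Table 1 defines the real values.
UNASSIGNED, IN_RESEARCH = 0, 1


@dataclass
class CharacterImage:
    key_code: str                      # unique key code (Principle 1)
    research_state: int = UNASSIGNED   # plays the role of ResearchState


@dataclass
class Researcher:
    expert_number: int
    select_id: Optional[str] = None    # plays the role of SelectID (assumed meaning)


@dataclass
class ConclusionRecord:
    key_code: str
    expert_number: int                 # plays the role of ExpertNumber (assumed meaning)
    text: str = ""


class Workbench:
    """In-memory stand-in for the three libraries described in the text."""

    def __init__(self) -> None:
        self.images: Dict[str, CharacterImage] = {}
        self.conclusions: Dict[str, ConclusionRecord] = {}

    def add_image(self, key_code: str) -> None:
        if key_code in self.images:
            raise ValueError("key code must be unique")
        self.images[key_code] = CharacterImage(key_code)

    def assign(self, researcher: Researcher, key_code: str) -> bool:
        """Assign an image to a researcher unless it is already taken."""
        image = self.images[key_code]
        if image.research_state != UNASSIGNED:
            return False
        image.research_state = IN_RESEARCH
        researcher.select_id = key_code
        self.conclusions[key_code] = ConclusionRecord(key_code, researcher.expert_number)
        return True

    def update_conclusion(self, researcher: Researcher, key_code: str, text: str) -> None:
        """Only the researcher who owns the record may modify it (Principle 3)."""
        record = self.conclusions[key_code]
        if record.expert_number != researcher.expert_number:
            raise PermissionError("only the original researcher may edit this record")
        record.text = text
```

In a database-backed system the check and the update inside `assign` would have to run in a single transaction so that two concurrent requests cannot both succeed.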

3.2 Image-Text Integrated Arrangement

Image-text integrated arrangement consists mainly of three modules: image-text integrated display, image-text integrated editing, and image-text integrated import and export of research data.

3.2.1 Image-Text Integrated Display

In this module, a Literal control is employed to realize the image-text integrated display. The coded characters and the tags holding image addresses are stored together in the character string of the Literal. The Text attribute of the Literal control is bound with Bind(“literal”). The image-text integrated display is shown in Algorithm 1.
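The idea of one string that mixes coded characters with image-address tags can be sketched in Python as follows. The tag syntax `<img src="...">` and the function name `split_segments` are assumptions for illustration; the text only states that tags of image addresses are stored alongside the coded characters.

```python
import re
from typing import List, Tuple

# Assumed tag format for an uncoded character stored as an image.
IMG_TAG = re.compile(r'<img\s+src="([^"]+)"\s*/?>')


def split_segments(stored: str) -> List[Tuple[str, str]]:
    """Split a stored string into ("text", chars) and ("image", address) segments."""
    segments: List[Tuple[str, str]] = []
    pos = 0
    for match in IMG_TAG.finditer(stored):
        if match.start() > pos:
            segments.append(("text", stored[pos:match.start()]))
        segments.append(("image", match.group(1)))
        pos = match.end()
    if pos < len(stored):
        segments.append(("text", stored[pos:]))
    return segments
```

A display layer can then render text segments as ordinary characters and image segments as inline pictures, in their original order.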

3.2.2 Image-Text Integrated Editing

This module provides character input and the insertion of character images. An Iframe control, which creates an inline frame containing a document, is used to realize image-text integrated editing.

When a character image is inserted, it is renamed uniformly and stored into the database. The number of character images is recorded, and the corresponding image labels are created and stored together with the characters.

The algorithm for inserting images into text is shown in Algorithm 2.
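A minimal Python sketch of the insertion step is given below. It is not Algorithm 2 itself: the uniform naming scheme (`char_000001.png`), the `ImageStore` class standing in for the database, and the `<img src="...">` label format are all assumptions made for illustration.

```python
from typing import Dict


class ImageStore:
    """In-memory stand-in for the image table of the database."""

    def __init__(self, prefix: str = "char") -> None:
        self._prefix = prefix
        self._images: Dict[str, bytes] = {}

    def save(self, data: bytes) -> str:
        """Store the image under a uniform, sequential name and return that name."""
        name = f"{self._prefix}_{len(self._images) + 1:06d}.png"
        self._images[name] = data
        return name

    def __len__(self) -> int:
        return len(self._images)


def insert_image(text: str, cursor: int, data: bytes, store: ImageStore) -> str:
    """Insert an image label at the cursor position of the edited text."""
    if not 0 <= cursor <= len(text):
        raise ValueError("cursor outside the text")
    name = store.save(data)
    label = f'<img src="{name}">'
    return text[:cursor] + label + text[cursor:]
```

Because the label is plain text, the edited string can be stored in a single character field together with the coded characters.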

3.2.3 Image-Text Integrated Import and Export

The library field that stores the Literal string of characters and image labels has the type “nvarchar”. When the data are exported to a Word document, the HTML labels are transformed into images.

When data in Word documents are imported, the detected images are renamed and stored into the database. The locations of the images are replaced with address links, which are stored into the database. The corresponding library fields also have the type “nvarchar”. The image-text integrated export is shown in Algorithm 3.

The image-text integrated import is shown in Algorithm 4.

3.3 Research Conclusion Records Sorting by Radical Order

Various radical sets, each with its own order, exist for ancient Chinese characters. A mapping table is therefore needed so that conclusion records can be sorted according to different radical sets. The mapping operation for radical sorting of library records is shown in Fig. 3.

Fig. 3. Mapping operation of radical sorting of research conclusion records

Let \( ACC_{i} \) be the key value of a record in the conclusion table, \( RC_{j} \) the key value of the radical of \( ACC_{i} \) in the radical table, and \( RS(RC_{j}) \) the key value of \( RC_{j} \) in the current mapping table.

Then the serial number of record \( ACC_{i} \) in the sorted list can be calculated as

$$ RS(ACC_{i}) = \left( \sum\limits_{k = 1}^{n} N_{\text{PR}k} \Big|_{RS(RC_{k}) < RS(RC_{j})} \right) + \left( \sum\limits_{l = 1}^{m} N_{\text{EQ}l} \Big|_{\substack{RS(RC_{l}) = RS(RC_{j}) \\ ACC_{l} < ACC_{i}}} \right) + 1 $$
(1)

where \( N_{\text{PR}k} \big|_{RS(RC_{k}) < RS(RC_{j})} \) is the number of records whose \( RS(RC_{k}) \) is less than \( RS(RC_{j}) \), and \( N_{\text{EQ}l} \big|_{\substack{RS(RC_{l}) = RS(RC_{j}) \\ ACC_{l} < ACC_{i}}} \) is the number of records whose \( RS(RC_{l}) \) is equal to \( RS(RC_{j}) \) and whose key \( ACC_{l} \) is less than \( ACC_{i} \).
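The serial number in Eq. (1) can be computed directly by counting, as in the following Python sketch. The function name, the dictionary layout and the sample radicals are assumptions for illustration: `records` maps each conclusion key to its radical key, and `radical_order` plays the role of the current mapping table.

```python
from typing import Dict, Hashable


def sorted_position(
    records: Dict[int, Hashable],
    radical_order: Dict[Hashable, int],
    target_key: int,
) -> int:
    """Return the 1-based position of a record when sorting by radical order.

    Records whose radical comes earlier in the mapping table precede the
    target; ties within the same radical are broken by the record key.
    """
    rs_target = radical_order[records[target_key]]
    preceding = sum(1 for rc in records.values() if radical_order[rc] < rs_target)
    equal_before = sum(
        1
        for acc, rc in records.items()
        if radical_order[rc] == rs_target and acc < target_key
    )
    return preceding + equal_before + 1


# Hypothetical data: four records and two alternative radical orders.
records = {1: "wood", 2: "water", 3: "wood", 4: "fire"}
order_a = {"water": 1, "wood": 2, "fire": 3}
order_b = {"fire": 1, "wood": 2, "water": 3}
```

Switching from `order_a` to `order_b` changes the positions without touching the conclusion records themselves, which is the purpose of keeping the radical order in a separate mapping table.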

3.4 Ancient Chinese Character Image Retrieval

A retrieval algorithm for character images is specially designed in the model to help researchers find locally or globally similar character images in the database. It offers not only the traditional image retrieval functions, which operate on the whole area or a partial area of a character image, but also a new image searching style, image retrieval in symmetrical areas of character images, for searching radicals in Chinese characters.

Let \( A \) be the area of an ancient Chinese character image, composed of the sub-areas \( a_{ij} \):

$$ A = \left( {a_{ij} } \right)^{T} ,(i = 0, 1, \ldots ,m - 1 ;\;j = 0, 1, \ldots ,n - 1 ) $$
(2)

where \( m \) is the number of rows and \( n \) the number of columns of the mesh into which the character image \( A \) is divided according to the principle of elastic mesh [19], as shown in Fig. 4(a). The directional line element feature [20] is extracted in each sub-area \( a_{ij} \) to form the corresponding feature vector, as shown in Fig. 4(b).

Fig. 4. Area division and feature extraction of character image \( A \)

$$ F = \left( {f_{ij} } \right)^{T} ,(i = 0, 1, \ldots ,m - 1 ;\;j = 0, 1, \ldots ,n - 1 ) $$
(3)

where \( f_{ij} \) consists of four directional components.
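To make the shape of \( F \) concrete, the following Python sketch computes a four-component vector for every mesh cell of a binary image. It is a simplified stand-in and not the method of [19] or [20]: it uses a uniform mesh instead of an elastic one, and it counts adjacent stroke-pixel pairs in four directions as a crude directional feature. The function name and the direction encoding are assumptions.

```python
from typing import List, Sequence

# Neighbor offsets (row, col): horizontal, vertical, and the two diagonals.
DIRECTIONS = [(0, 1), (1, 0), (1, 1), (1, -1)]


def directional_features(
    image: Sequence[Sequence[int]], m: int, n: int
) -> List[List[List[int]]]:
    """Return an m x n grid of 4-component direction counts for a binary image.

    image[r][c] is 1 for a stroke pixel and 0 for background. Each stroke
    pixel votes, in the cell that contains it, for every direction in which
    its neighbor is also a stroke pixel.
    """
    height, width = len(image), len(image[0])
    features = [[[0, 0, 0, 0] for _ in range(n)] for _ in range(m)]
    for r in range(height):
        for c in range(width):
            if not image[r][c]:
                continue
            i = r * m // height   # uniform mesh row (elastic mesh not modeled)
            j = c * n // width    # uniform mesh column
            for d, (dr, dc) in enumerate(DIRECTIONS):
                rr, cc = r + dr, c + dc
                if 0 <= rr < height and 0 <= cc < width and image[rr][cc]:
                    features[i][j][d] += 1
    return features
```

For a single horizontal stroke, only the horizontal component of the cells crossed by the stroke is non-zero, which is the behavior a directional feature is meant to capture.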

To improve the efficiency of image retrieval, a hierarchical strategy is employed: character images are clustered into sub-clusters in advance according to the typical areas \( A_{\text{U}} \), \( A_{\text{D}} \), \( A_{\text{L}} \), \( A_{\text{R}} \), \( A_{\text{C}} \) and \( A_{\text{W}} \), which are defined by the structural characteristics of ancient Chinese characters.

The local areas \( A_{\text{U}} \) and \( A_{\text{D}} \) in the vertical direction are defined as:

$$ A_{\text{U}} = \bigcup\limits_{i = 0}^{{\theta_{\text{U}} }} {} \bigcup\limits_{j = 0}^{n - 1} {a_{ij} } $$
(4)
$$ A_{\text{D}} = \bigcup\limits_{{i = \alpha_{\text{D}} }}^{m - 1} {} \bigcup\limits_{j = 0}^{n - 1} {a_{ij} } $$
(5)

where \( \theta_{\text{U}} \) is the index of the last mesh row of \( A_{\text{U}} \) and \( \alpha_{\text{D}} \) is the index of the first mesh row of \( A_{\text{D}} \).

The pixel areas of \( A_{\text{U}} \) and \( A_{\text{D}} \) in the character image are shown in Fig. 5(a) and (b).

Fig. 5. Area division and feature extraction of \( A_{\text{U}} \) and \( A_{\text{D}} \) in character image \( A \)

The local areas \( A_{\text{L}} \) and \( A_{\text{R}} \) in the horizontal direction are defined as:

$$ A_{\text{L}} = \bigcup\limits_{i = 0}^{m - 1} {} \bigcup\limits_{j = 0}^{{\beta_{\text{L}} }} {a_{ij} } $$
(6)
$$ A_{\text{R}} = \bigcup\limits_{i = 0}^{m - 1} {} \bigcup\limits_{{j = \delta_{\text{R}} }}^{n - 1} {a_{ij} } $$
(7)

where \( \beta_{\text{L}} \) is the index of the last mesh column of \( A_{\text{L}} \) and \( \delta_{\text{R}} \) is the index of the first mesh column of \( A_{\text{R}} \).

The pixel areas of \( A_{\text{L}} \) and \( A_{\text{R}} \) in the character image are shown in Fig. 6(a) and (b).

Fig. 6. Area division and feature extraction of \( A_{\text{L}} \) and \( A_{\text{R}} \) in character image \( A \)

The local areas \( A_{\text{C}} \) and \( A_{\text{W}} \) are defined as:

$$ A_{\text{C}} = \bigcup\limits_{{i = \gamma_{\text{C1}} }}^{{\gamma_{\text{C2}} }} {} \bigcup\limits_{{j = \mu_{\text{C1}} }}^{{\mu_{\text{C2}} }} {a_{ij} } $$
(8)
$$ A_{\text{W}} = \bigcup\limits_{i = 0}^{m - 1} {} \bigcup\limits_{j = 0}^{n - 1} {a_{ij} } $$
(9)

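where \( \gamma_{\text{C1}} \) and \( \gamma_{\text{C2}} \) are the indices of the first and last mesh rows of \( A_{\text{C}} \), and \( \mu_{\text{C1}} \) and \( \mu_{\text{C2}} \) are the indices of its first and last mesh columns.

The pixel area of \( A_{\text{C}} \) in the character image is shown in Fig. 7, while \( A_{\text{W}} \) covers the whole area of the character image.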

Fig. 7. Area division and feature extraction of \( A_{\text{C}} \) in character image \( A \)
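The six typical areas in Eqs. (4) to (9) are unions of mesh cells, so they can be represented as sets of index pairs \( (i, j) \). The following Python sketch builds them from the margin parameters. The function name and the sample margin values are assumptions; the text does not state which values are used in practice.

```python
from typing import Dict, Set, Tuple

Cell = Tuple[int, int]


def typical_areas(
    m: int,
    n: int,
    theta_u: int,
    alpha_d: int,
    beta_l: int,
    delta_r: int,
    gamma_c: Tuple[int, int],
    mu_c: Tuple[int, int],
) -> Dict[str, Set[Cell]]:
    """Build the typical areas as sets of (row, column) mesh cells."""
    rows, cols = range(m), range(n)
    return {
        "U": {(i, j) for i in range(0, theta_u + 1) for j in cols},
        "D": {(i, j) for i in range(alpha_d, m) for j in cols},
        "L": {(i, j) for i in rows for j in range(0, beta_l + 1)},
        "R": {(i, j) for i in rows for j in range(delta_r, n)},
        "C": {
            (i, j)
            for i in range(gamma_c[0], gamma_c[1] + 1)
            for j in range(mu_c[0], mu_c[1] + 1)
        },
        "W": {(i, j) for i in rows for j in cols},
    }


# Hypothetical margins for an 8 x 8 mesh split into halves and a central block.
areas = typical_areas(8, 8, theta_u=3, alpha_d=4, beta_l=3, delta_r=4,
                      gamma_c=(2, 5), mu_c=(2, 5))
```

With these sample margins the upper and lower areas partition the mesh, as do the left and right areas, while the central area overlaps all four.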

The clustering operation is carried out according to the typical areas. The classes generated by the typical areas are \( C_{\text{U}} \), \( C_{\text{D}} \), \( C_{\text{L}} \), \( C_{\text{R}} \), \( C_{\text{C}} \) and \( C_{\text{W}} \) correspondingly.

When the user draws an area \( A_{\text{X}} \) within the character image, it is matched against the typical areas \( A_{\text{U}} \), \( A_{\text{D}} \), \( A_{\text{L}} \), \( A_{\text{R}} \), \( A_{\text{C}} \) and \( A_{\text{W}} \) to find the best one, \( A_{k} \), which guides the subsequent retrieval operation.

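$$ A_{k} = \mathop {\arg \max }\limits_{A_{i}} \left| {A_{\text{X}} \cap A_{i} } \right| $$
(10)

where \( A_{i} \in \{ A_{\text{U}}, A_{\text{D}}, A_{\text{L}}, A_{\text{R}}, A_{\text{C}} \} \) and \( \left| \cdot \right| \) denotes the size of the overlap.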
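The selection in Eq. (10) can be sketched in Python as follows, with areas represented as sets of mesh cells. Measuring overlap by the number of shared cells and breaking ties by dictionary order are assumptions of this sketch, as are the sample 4 x 4 areas. Following the list after Eq. (10), the whole-image area is left out of the candidates; with a raw overlap count it would never lose to a local area.

```python
from typing import Dict, Set, Tuple

Cell = Tuple[int, int]


def best_typical_area(drawn: Set[Cell], candidates: Dict[str, Set[Cell]]) -> str:
    """Return the name of the candidate area sharing the most cells with the drawn area.

    Ties are resolved in favor of the candidate listed first.
    """
    return max(candidates, key=lambda name: len(drawn & candidates[name]))


# Hypothetical local areas on a 4 x 4 mesh.
GRID = [(i, j) for i in range(4) for j in range(4)]
LOCAL_AREAS: Dict[str, Set[Cell]] = {
    "U": {(i, j) for i, j in GRID if i <= 1},
    "D": {(i, j) for i, j in GRID if i >= 2},
    "L": {(i, j) for i, j in GRID if j <= 1},
    "R": {(i, j) for i, j in GRID if j >= 2},
    "C": {(i, j) for i, j in GRID if 1 <= i <= 2 and 1 <= j <= 2},
}
```

The returned name identifies the sub-cluster in which the subsequent search is run.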

The character image is then searched only in the sub-cluster \( C_{k} \) corresponding to \( A_{k} \), using the feature extracted from \( A_{\text{X}} \). Similarity is measured by the feature distance, in terms of \( F = \left( {f_{ij} } \right)^{T} \), between \( A_{\text{X}} \) and every sample in \( C_{k} \). The selected character images are returned to the user in ascending order of this distance.
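The ranking step can be sketched in Python as follows. The text does not name the distance measure, so the Euclidean distance used here is an assumption, as are the function names and the `top_k` parameter. Features are stored as an \( m \times n \) grid of four-component vectors, and only the cells of the drawn area contribute to the distance.

```python
import math
from typing import Dict, Iterable, List, Sequence, Tuple

Cell = Tuple[int, int]
Features = Sequence[Sequence[Sequence[float]]]


def feature_distance(fa: Features, fb: Features, cells: Iterable[Cell]) -> float:
    """Euclidean distance between two feature grids, restricted to the given cells."""
    total = 0.0
    for i, j in cells:
        total += sum((x - y) ** 2 for x, y in zip(fa[i][j], fb[i][j]))
    return math.sqrt(total)


def retrieve(
    query: Features,
    cells: Iterable[Cell],
    cluster: Dict[str, Features],
    top_k: int = 10,
) -> List[Tuple[str, float]]:
    """Rank the images of one sub-cluster by ascending distance to the query."""
    cell_list = sorted(cells)
    scored = sorted(
        (feature_distance(query, feats, cell_list), image_id)
        for image_id, feats in cluster.items()
    )
    return [(image_id, dist) for dist, image_id in scored[:top_k]]
```

Restricting the comparison to one sub-cluster is what keeps the number of distance computations small relative to a scan of the whole library.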

When a symmetrical retrieval is required, the character image is searched not only in the class corresponding to the drawn area but also in the class of its symmetrical area \( A_{\text{XS}} \) in the vertical or horizontal direction, for example \( A_{\text{U}} \) paired with \( A_{\text{D}} \), or \( A_{\text{L}} \) paired with \( A_{\text{R}} \). The clustering and retrieval algorithm for ancient Chinese character images is shown in Algorithm 5.
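One plausible reading of the symmetrical area is the mirror image of the drawn area across the horizontal or vertical center line of the mesh. The following Python sketch implements that reading; the mirroring rule, the function name and the `PAIRED_CLUSTER` table are assumptions, and Algorithm 5 may define the symmetrical area differently.

```python
from typing import Dict, Set, Tuple

Cell = Tuple[int, int]

# Cluster pairs searched together in a symmetrical retrieval, as named in the text.
PAIRED_CLUSTER: Dict[str, str] = {"U": "D", "D": "U", "L": "R", "R": "L"}


def mirror_area(cells: Set[Cell], m: int, n: int, direction: str) -> Set[Cell]:
    """Mirror a set of mesh cells within an m x n mesh.

    "vertical" flips rows (upper <-> lower); "horizontal" flips columns
    (left <-> right).
    """
    if direction == "vertical":
        return {(m - 1 - i, j) for i, j in cells}
    if direction == "horizontal":
        return {(i, n - 1 - j) for i, j in cells}
    raise ValueError("direction must be 'vertical' or 'horizontal'")
```

Under this reading, a query drawn in the upper area is compared both with the upper-area class and, through its mirrored cells, with the lower-area class.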

4 Experimental Result and Analysis

A web-based cooperation and retrieval system of ancient Chinese character images, built on the model proposed in this paper, has been implemented and used in a research project on ancient Chinese characters. The system uses Visual Studio 2008 as the development tool, SQL Server 2005 as the database, and ASP.NET as the website framework.

When users register with the system, they are assigned an authority level according to their research task. The authority settings of the system are shown in Table 2.

Table 2. Setting of user authority.

A researcher studies the assigned character images in the system workbench and can use the image retrieval function of the system to find similar characters in the database, locally or globally.

The researcher writes the research conclusions for a character in the corresponding text editing boxes and stores them into the library. If a conclusion needs to include an image of a character, the image-text integrated editing mode can be used.

The efficiency of image-text integrated operations depends on the ratio between the number of coded characters and the number of character images. In our system, the table of conclusion results holds 1521 records with about 112437 characters. The numbers of coded characters and character images are shown in Table 3.

Table 3. The number of characters and images in image-text integrated arrangement.

The conclusion records of character research stored in the database can be displayed in different orders according to the needs of specialists. Table 4 shows several kinds of mapping tables for radical sorting.

Table 4. Radical mapping table.

For ancient Chinese character image retrieval, we scanned 3176 character images from ancient books for the experiment. After the character images are pre-processed, the features for clustering and retrieval are extracted. All character images are clustered into six classes according to the definitions of \( A_{\text{U}} \), \( A_{\text{D}} \), \( A_{\text{L}} \), \( A_{\text{R}} \), \( A_{\text{C}} \) and \( A_{\text{W}} \). Each character image is marked with a clustering class ID.

The distribution of character images among the classes is shown in Fig. 8.

Fig. 8. The distribution of character images in different clusters. (a) The number of character images in each cluster for the local area clustering of \( A_{\text{U}} \) and \( A_{\text{D}} \). (b) The number of character images in each cluster for the local area clustering of \( A_{\text{L}} \) and \( A_{\text{R}} \). (c) The number of character images in each cluster for the local area clustering of \( A_{\text{C}} \) and \( A_{\text{W}} \).

From Fig. 8 we can see that the numbers of character image samples are similar among the classes. Because the classes are of comparable size, a retrieval restricted to one class examines a comparable number of candidates whichever class is selected, which keeps retrieval efficient.

When the user draws a search area on the character image on screen, the feature of that area is extracted and the retrieval is carried out only in the corresponding clustering class.

5 Conclusion

In this paper, a web-based cooperation and retrieval model of ancient Chinese character images is proposed to assist ancient Chinese character research. It provides ancient Chinese character image management and retrieval, research work cooperation, research conclusion data management, and other functions. Because of the characteristics of ancient Chinese characters, many problems had to be solved specifically when constructing the system. We discussed the key techniques developed in this system: the image cooperation mechanism for ancient Chinese characters, the method of image-text integrated arrangement, the algorithm for sorting characters by Chinese character radicals, and the global and local retrieval method for ancient Chinese character images. The proposed system achieves the design objectives of the model, including resource management and research work cooperation, and helps improve the efficiency of character research.

Our further work is to improve the performance and complete the functions of the system. Firstly, we will optimize the ancient Chinese character image retrieval algorithm to obtain higher searching speed, which requires selecting more efficient features of character images and corresponding clustering and retrieval algorithms. Secondly, we will develop a function for recommending research references, which relies on organizing the knowledge of researchers in ancient Chinese character research. Thirdly, more user-friendly menus will be developed. These improvements should enhance the ability of our model to assist research on ancient Chinese characters.