Keywords

1 Introduction

Health information is an online popular pursuit being sought by 80% of U.S. Internet users [2]. In this type of searches, users frequently have difficulties finding the correct terms to include in their queries [7, 21], lacking the knowledge of the proper medical terms [19, 22] or misspelling them [6, 13]. For these reasons, support in query formulation may contribute to an improved retrieval experience.

Previous findings [8, 9] led us to develop a system that, based on an initial user query, suggests 4 different queries combining two languages (English and Portuguese) and two bodies of terminology (lay and medico-scientific). To the best of our knowledge, no previous works have explored cross-language query suggestions in the health domain.

The usage given to the suggestions provided by this system was studied before [10] as well as their effect on the medical accuracy of the knowledge acquired during the search session, considering different user characteristics [18].

In this work we assess the impact of presenting and using these suggestions on the precision of the search session. As search assistance should be personalized to achieve its maximal outcome [4], we have considered users’ English proficiency, health literacy and topic familiarity in this analysis.

2 Related Work

Query formulation is one of the most important aspects of information seeking. Query suggestion provides alternative ways to help users formulate queries and explore less familiar topics [5]. This technique is particularly important in topics about which users have little knowledge or familiarity. In these situations, users lack of vocabulary and knowledge may hinder query formulation.

In the health domain, the terminology gap between medical experts and lay people often causes additional difficulties in searches conducted by consumers [23]. To mitigate some of these difficulties, different query modification approaches have been proposed. Most of the approaches use specialized vocabularies from the Unified Medical Language System (UMLS). This is the case of the assistant proposed by Zeng et al. [21] that compute the semantic distance between the query and suggested terms using co-occurrences in medical literature and log data as well as UMLS semantic relations. iMed [11] and MedSearch [12] are two health search engines that suggest related medical phrases to help the user refine the query. In these systems, the phrases are extracted and ranked based on MeSH (Medical Subject Headings), the collection of crawled webpages, and the query. Similarly to our two-terminology query suggestion system, Zarro and Lin [20] also use MeSH along with social tagging to provide users with medico-scientific and lay terms.

All these works assess their systems through user studies, although focusing on different issues. In three of them [11, 12, 21], users were randomized into 2 groups, one receiving suggestions and the other not receiving them. In the study conducted by Zarro and Lin [20], 10 subjects were lay and the other 10 were expert. All subjects used the same system. The evaluation that is most resembled with a precision assessment is conducted by Luo et al. [12]. Authors combine relevance and diversity in a metric they call usefulness and each document is either useful or not. This metric is then used to compute the NDCG metric. Zeng et al. [21] assess the rates of successful queries, i.e., queries with at least one relevant result among the top 10. In the assessment of iMed [11], a search session is considered successful if the user can list one of the correct diseases associated with the medical case’s situation. Note that this evaluation does not consider the relevance of each document. Zarro and Lin [20] focused on the differences between lay and expert subjects. Zarro and Lin [20] found that both user groups preferred MeSH terms because their quality was considered superior to the quality of social tags. All the assistance approaches described here were considered successful.

Regarding multilingual query suggestion, Gao et al. [3] proposed a system providing suggestions in a language different from the original query’s language. After using query logs to translate queries, authors used word translation relations and word co-occurrence statistics to estimate the cross-lingual query similarity. They used French-English and Chinese-English tasks for the evaluation and found that these suggestions, when used in combination with pseudo-relevance feedback, improved the effectiveness of cross-language information retrieval.

Since 2014, the Conference and Labs of the Evaluation Forum (CLEF) eHealth lab began to propose a multilingual information user-centred health retrieval task, incorporating queries in several languages in its dataset. The low number of teams proposing multilingual approaches makes us conclude that this type of approaches could be more explored.

3 Suggestion Tool

We developed a search suggestion system that, given a health query, suggests queries in two languages, Portuguese (PT) and English (EN), using medico-scientific (MS) and lay terminology.

In Fig. 1, we present the architecture of the suggestion tool, which will be briefly described in the following paragraphs. More details on the suggestion system can be found in a previous publication [18].

Fig. 1.
figure 1

Architecture of the suggestion tool [18].

Our system uses the Consumer Health Vocabulary (CHV) [14], a vocabulary that connects informal expressions about health to technical terms used by health care professionals. Each expression is associated with an Unified Medical Language System (UMLS) concept which, in turn, may be associated with several expressions or strings. Each string is associated with a CHV and an UMLS preferred names. Given that queries will probably be formulated in Portuguese, we use a Portuguese translation of the CHV.

After stemming the terms included in the CHV, we created an inverted index in which we associate each term with an inverse string frequency (\(isf_{t}\)) and a postings list, i.e., a list of the strings in which the stemmed term appears. The inverse string frequency is computed as \(isf_{t}= log(N/sf_t)\), where \(sf_{t}\) is the number of strings in which the term appears and N is the total number of strings.

As the probability of finding multiple occurrences of the same term in a string is very small, we decided to ignore the term frequency in each string (\(tf_{t,s}\)). Each (query, string) pair is assigned the following \(score(q,s) = \sum _{t \in (q\cap s)}isf_{t}\). Because the length of strings and queries has a very small variance, we decided to not normalize the above formula.

To limit the number of suggestions, for each query, we only select the string with the maximum score. For this string, we identify the associated concept and return its CHV and UMLS preferred names in English and Portuguese. This results in a maximum of 4 suggestions for each query. As an example, a set of suggestions could be: “colectomia”, “remoção do cólon”, “colectomy” and “colon removal”.

Our retrieval system used the Bing Search API to obtain web results for users’ queries. To increase the usability of the interface with regard to learning, we decided to keep the interfaces very simple and similar to those used in the most popular search engines. All the suggestions are presented in a single line.

4 Experiment

We conducted a user study with 40 participants (24 female; 16 male), with a mean age of 23.48 years (standard deviation = 7.66). English proficiency was evaluated using an instrument developed by the European Council that grades English proficiency in the Common European Framework of Reference for Languages, a widely accepted European standard for this purpose. To evaluate the users’ health literacy, we have used the Medical Term Recognition Test, a brief and self-administered instrument proposed by Rawson et al. [15]. Users’ familiarity with each topic was self-assessed on a five-level scale. The sample of users is heterogeneous in these characteristics.

Each user was assigned a set of 8 tasks, half of them conducted in a system presenting the suggestions (SYS+) and the other half on a system without suggestions (SYS). Each task was associated with a simulated work task situation [1]. Situations were rotated and counter-balanced across subjects and systems. Before the user study, to define these situations, we asked 20 persons with no medical expertise from 30 to 68 years old and a wide range of education levels (from high school to PhD) to state the health topic for which they had most recently searched on the Web. From these, we randomly selected 8 and created a scenario for each. Note that these persons were not participants of the study. The information situations were described to the users in Portuguese.

In each task the user had to formulate 3 queries in a language of their choice and assess the relevance of the top 10 results for each query, considering his own context in a 3-value scale. In the first query, users formulated the query without help from the system. Users had no restrictions in query formulation, being able to use their preferred language and terminology. Based on the initial query, the system presents suggestions that can, or not, be used for the formulation of the second query. The same happens when the user is moving from the second to the third iteration. The set of 3 iterations constituted a search session. After the third iteration, they were asked to evaluate the feeling of success with the iterations in a 5-level scale. More details on the user study can be found in a previous publication [18].

Our experiment was motivated by the following research questions: (1) Does a system that includes this suggestion tool lead to a higher precision? (2) How does clicking on a suggestion and using suggestions as sources of terms affect precision? (3) Does this effect differ with the language and terminology of the suggestions? (4) Does this effect differ with the English proficiency, health literacy and topic familiarity of the user?

5 Data Analysis

To evaluate the impact of suggestions provided by this system, we considered that users might use them as suggestions, clicking or not on them, or as a source of terms they can use in later queries. Considering this last scenario, we computed the proportion of suggestion’s terms that were used in the subsequent query (termsUsed) and the proportion of the suggestion’s terms that were used in the following query and were not used in the previous query (newTermsUsed). The former is useful to assess the quality of suggestions’ terms and the latter is also useful to assess the utility of the suggestions for the users. Let \(Q_{ it }\) be the set of unique stemmed terms belonging to the query of the iteration \( it \) and \(S_{ it }\) the set of unique stemmed terms belonging to the suggestion presented in the iteration \( it \), these proportions are computed as follows: \(termsUsed_{ it }=\frac{|Q_{ it }\cap S_{ it }|}{|S_{ it }|}\) and \(newTermsUsed_{ it }=\frac{|(Q_{ it }\cap S_{ it })\backslash Q_{ it-1 }|}{|S_{ it }|}\). With these proportions, we were able to analyze three scenarios of suggestions’ use as source of terms: using or not using suggestions’ terms (Terms?), using or not using all the terms of a suggestion (All terms?), using or not using suggestions’ terms that were not used in the previous query (NewTerms?). Note that users may use all the terms from a suggestion without clicking it, or they can change the order of the terms or even mix them with other terms.

We used Graded Average Precision (GAP), a measure proposed by Robertson et al. [16], based on a probabilistic model that generalizes average precision to the case of multi-graded relevance in which the user has a binary view of relevance even when using a non-binary scale of relevance. Based on the results presented by GAP’s proponents, we used an equally balanced and , i.e., , meaning that the levels 1 and 2 of our relevance scale have the same probability of being the grade from which the user starts considering the documents relevant.

The analysis was done comparing iterations where suggestions were used with iterations where suggestions were not used. We compared GAP means between groups of iterations (with and without the use of suggestions) using the Student’s t-test. When the assumption of homogeneity of variances was not verified, we applied the Welch t-test. To compare groups of users, we applied the one-way ANOVA and the Tukey’s test to locate the differences. Reporting our results, we use a * to mark significant results at \(\alpha \) = 0.05 and a ** to mark significant results at \(\alpha \) = 0.01.

6 Results

We found that the first iteration has a higher mean GAP than the second iteration (Tukey’s adj. p = 0.009**) and the third iteration (Tukey’s adj. p = 3.3e−06**). Differences may be explained by users’ criteria in judging relevance. We found that documents with reoccurring contents, because they are no longer useful, are assigned lower relevance scores. Given these differences, we decided to base our analysis on the variation of GAP between iterations. For each iteration we have therefore computed a \(\varDelta {GAP_{it}=GAP_{it} - GAP_{it-1}}\). We found no significant differences between \({\Delta \mathrm{GAP}_{2}}\) and \({\Delta \mathrm{GAP}_{3}}\).

In SYS+, GAP tends to decrease less (\(\varDelta {\text {GAP}}\) mean of −0.031) than in SYS (\(\varDelta {\text {GAP}}\) mean of −0.033). In all the four use scenarios (Terms?, All terms?, NewTerms? and Click?), we did not find significant differences between using or not using suggestions. After this general analysis, we repeated it by suggestion’s language and terminology. Almost all the comparisons were non-significant. The only exceptions occur when new terms from Portuguese (t(147.5) = 2.4, p = 0.01**) or lay (t(139.5) = 2.78, p = 0.003**) suggestions are used, situations in which the impact of suggestions on precision is negative.

6.1 Analysis by English Proficiency

We compared the mean \(\varDelta {\text {GAP}}\) in the four use scenarios, in each group of English proficiency (Table 1). With respect to Portuguese suggestions, although we haven’t found significant differences, the general tendency is to have higher precision when users do not use the suggestions. In terms of English suggestions, we found an opposite effect when a suggestion is clicked or when all the terms of a suggestion are used. Yet, this tendency is only significant when proficient users click on English suggestions. Proficient users tend to benefit when they use new terms from an English suggestion and, surprisingly, the same happens with basic proficiency users when they use terms from English suggestions.

Table 1. \(\varDelta {\text {GAP}}\) means by language. Boldface represents the maximum in each group and scenario. Square brackets are used there are significant differences between scenarios. EP stands for English proficiency.

6.2 Analysis by Health Literacy

Comparing the precision of lay and medico-scientific queries by level of health literacy, we found few significant differences. As can be seen in Table 2, two of the three exceptions occur when marginal (t(89.9) = 2.3, p = 0.01*) and functional (t(33.7) = 2.0, p = 0.03*) health literate users employ terms from lay suggestions they have not used before. These suggestions have a negative impact on the precision of the search sessions of these users. The use of lay suggestions tends to be beneficial to precision when low health literate users use all the suggestion’s terms and when these users and the marginal health literate users click in the suggestions.

In medico-scientific suggestions, the use of all the suggestion’ terms tends to be favorable to precision in all levels of health literacy. Moreover, the use of new terms from suggestions and suggestions’ clicks are beneficial to precision in the low and functional health literacy groups. Of these, the only significant difference occurs when low literate users click medico-scientific suggestions (t(35.4) = −1.9, p = 0.03*), showing the positive impact of this type of suggestions.

6.3 Analysis by Topic Familiarity

Considering topic familiarity, whose results are also in Table 2, we found that familiar users (TF2) have significantly higher precision in iterations in which they do not use new terms from lay (t(72.9) = 2.2, p = 0.01*) or medico-scientific suggestions (t(76.4) = 2.0, p = 0.02*). We also found that extremely familiar users tend to have higher precision with medico-scientific suggestions or when they click or use all the terms from lay suggestions. Non-familiar users also seem to benefit from clicks in both lay and medico-scientific suggestions.

Comparing the mean \(\varDelta {\text {GAP}}\) of the several topic familiarity levels in each scenario and type of terminology, we found that, when using new terms from medico-scientific suggestions, non-familiar users have a significantly higher precision than familiar users (Tukey’s adjusted p = 0.01*). This is simultaneously due to the increase in precision in non-familiar users and the significant decrease found in the familiar users, when using these suggestions.

7 Discussion

On Table 3 we summarize the significant findings previously reported, only found when users use new terms from suggestions or when they click on them.

Table 2. \(\varDelta {\text {GAP}}\) means by terminology. Boldface represents the maximum in each group and scenario. Square brackets are used there are significant differences between scenarios. HL stands for Health Literacy.
Table 3. Summary of the significant findings. \(\uparrow \) denotes increases and \(\downarrow \) decreases in each outcome.

Answering the first research question, the system with suggestions (SYS+) tended to demonstrate better performance in terms of precision. The positive effects of the suggestion system has also been previously shown in the medical accuracy of the knowledge obtained in the session [18].

Pertaining the second research question, we found no significant differences between using or not using the overall set of suggestions, either as a whole or as a source of terms. Answering the third research question, negative effects of language and terminology are found when users use new terms from Portuguese and lay suggestions. No other significant effect was found.

Moving on to the fourth research question, as seen in Table 3, the effect differs in the two use scenarios, clicking on suggestions has positive effects on precision and using new terms from them has negative effects. English suggestions are advantageous to proficient users when they click on them. The use of lay and medico-scientific suggestions has also a good effect on precision. Surprisingly, lay suggestions increase precision not only in non-familiar users but also in the extremely familiar group.

The precision increase found when low health literate users click on medico-scientific suggestions is consistent with what has been found in a previous study, that is, “less subject expertise seems to lead to more lenient and relatively higher relevance ratings” [17]. This means that these users may be assessing documents regarding their relation with the topic instead of their utility to themselves. Findings in non-familiar users could be explained the same way but, since we have found in a previous study [18] that medico-scientific suggestions increase their answers’ correct contents and tends to decrease their incorrect contents, we have reasons to believe this is not the case. Moreover, this agrees with what we found in a previous study [9] where we concluded that health literacy is more important to comprehend medico-scientific documents than topic familiarity.

With the use of new terms from suggestions, precision decreases with the use of Portuguese and lay suggestions. The same happens with lay suggestions in higher levels of health literacy and with both lay and medico-scientific suggestions in users familiar with the topic.

We have also compared the performance of different groups of users in each use scenario for each type of suggestion. In these comparisons, we found that users familiar with the topic have lower precision than non-familiar users when new terms from medico-scientific suggestions are used. This may be explained by the benefits that non-familiar users seem to obtain from medico-scientific suggestions in every scenario.

8 Conclusion

We describe a query suggestion system for the health domain and study its impact on the precision of the search session considering several user characteristics. This analysis takes into account the utility of the suggestions as new whole queries and as sources of terms.

We found that a system with these suggestions is beneficial for the precision of the search session. In a previous work we have reached a similar conclusion regarding medical accuracy [18]. The best precision effects of the suggestion tool are achieved when users use it strictly as a suggestion tool, that is, when they click in suggestions.

Previously, we found that English suggestions are preferred to the Portuguese by the general user, both in terms of clicks and as source of new terms [10]. We have also found that clicking English suggestions is beneficial for the medical accuracy of the knowledge acquired during the search session. In this work, we have shown that the benefit of whole English suggestions is not restricted to medical accuracy but also applies to precision if the user is proficient in this language. The benefits of these suggestions allied with users preference for them show the potential of English suggestions for users with other native languages in the health domain. This corroborates the finding of a previous study [8] that suggest that non-English–speaking users having at least elementary English proficiency can benefit from a system that suggests English alternatives for their queries.

In terms of terminology, medico-scientific suggestions are preferred to lay ones by the general user, in clicks and as source of terms, a preference that increases with health literacy [10].

In terms of medical accuracy, both terminologies are favourable to the general user. In the present work we found that both terminologies are advantageous to specific groups of users but not to the general user. This suggests that personalizing the suggestion system might have good effects on precision.

As future work, we would like to explore the effects of this suggestion system on motivational relevance.