Introduction

The number of healthcare-related websites has been estimated at more than 100,000 and is increasing rapidly [1]. The potential for online health information to educate patients about the prevention and treatment of disease is tremendous. A well-informed patient is equipped to play an active role in healthcare decisions and to benefit from full participation in encounters with healthcare professionals. According to a recent nationwide Harris Poll, 117 million American adults sometimes look online for healthcare information; however, only 37% feel that the information obtained is “very reliable” [2]. In a systematic review of 79 studies assessing the quality of health information for consumers on the Internet, 55 (70%) concluded that quality was a problem [3].

There are no required standards for medical information on the Internet. Some websites that appear to be educational are actually promotional in nature [4, 5], while others may be inefficient, incomplete, out of date, difficult to understand, or contain conflicting information [6]. Studies suggest that patients may receive misleading or harmful information [7] in areas as diverse as complementary and alternative medicine [8], cancer [9], and bone-mineral-density testing [10].

Self-imposed regulation by sponsoring institutions or organizations, voluntary website accreditation, and adherence to a code of ethics are potential quality indicators. The Utilization Review Accreditation Commission (URAC), also called the American Accreditation HealthCare Commission, Inc., is an independent, nonprofit organization that has established standards and offers accreditation for healthcare websites [11]. Self-certifying standards for codes of conduct have been developed by the Internet Healthcare Coalition [12], the American Medical Association [13], the Health On the Net (HON) Foundation [14], and other organizations [15]. Many instruments have been used to evaluate the quality of healthcare websites, but few have been validated [16]. A criticism of rating instruments has been the lack of information on interobserver reliability and validity of the measurements [17].

Google (http://www.google.com), the website most often used for Internet searches, was used for 48% of searches for home and work users in the United States in mid-2005 [18]. It is estimated that about 44 million Americans have osteoporosis or low bone mass, with 1.5 million fragility fractures each year [19]. Many of these patients and their family members are likely to use Google to search the Internet for more information on osteoporosis. A recent (August 22, 2005) Google search for “osteoporosis” resulted in 6.27 million matches, suggesting a high level of interest in osteoporosis and a large amount of available information. The quality of the information found online by patients is uncertain. The aim of this study was to develop and evaluate a healthcare website assessment tool (HWAT) for measuring the quality of medical information on the Internet, using osteoporosis as the target disease. For the purposes of this study, quality is defined as information that is accurate, reliable, and complete [20].

Methods

HWAT development

HWAT uses assessment categories (content, credibility, navigability, currency, readability) adapted from published website evaluation resources [21–23]. One or more quality indicators were established for each category and weighted according to perceived relative importance. Website pages were classified into three types. The linked page (LP) was the first page visible after linking to the website from the Google search engine. The primary education page (PEP) was the main page for educational information, which was sometimes the same as the LP. The entire website (WS) included all pages of a website, but not linked pages on other websites. Each quality indicator was assigned to a specific class of web page, so that all evaluators were testing the same or similar pages for the same quality indicator. The first version of the tool, HWAT 1.0, was evaluated by nurse educators and physician experts using osteoporosis websites believed to represent a range of quality from poor to excellent. The assessment tool was reviewed by a panel of assessors for ease of use, objectivity, and consistency of results. It was then updated to HWAT 2.0, tested on websites, and reevaluated. The assessment tool was revised again to become the final study version, HWAT 3.0 (Appendix 1).

Assessment categories and quality indicators

Content

The WS was evaluated as to whether the originating person or organization was identified, educational information was distinguished from the sale of products or services, if any, and whether basic information about osteoporosis (definition, consequences, prevention, and treatment) was included. Websites viewed by linking from the evaluated website were not evaluated. The LP was scored for clearly stating the subject of the website.

Credibility

The LP/PEP was evaluated for a clear statement about the author or institution providing the information, and for identification of a credible source of the information: a university, professional society, government agency, nonprofit foundation, or published references. Additionally, the WS was scored for demonstrating a “seal of approval” or accreditation from a reputable organization: Health on the Net Foundation, Internet Healthcare Coalition, URAC, or equivalent.

Navigability

The LP was evaluated for full “printability,” which was considered to be present if the margins were not cut off when printed. The WS was tested for functionality of intra- and interwebsite links. The capability of communicating with the website provider was evaluated.

Currency

The LP was evaluated for showing a revision date or copyright date within 12 months of the viewing date.

Readability

The Flesch-Kincaid Grade Level [24, 25] for text on the PEP was obtained by using the “Readability Statistics” tool of Microsoft Word 2002. The measure was taken by selecting the educational and informational text on the PEP, using the “Copy” feature, then pasting the text into a blank Microsoft Word 2002 page using “Paste Special” with “Unformatted Text.” This isolated the text without hypertext markup language (HTML) code, thereby eliminating possible artifacts of the code on readability scores. The “Readability Statistics” function was accessed by first selecting and using the “Spelling and Grammar” check for the document. A readability cutoff was set at 8th grade or less according to published recommendations [26]. Most patient education material on the Internet and in print is written at a grade level of 10 or higher [27, 28]. Literacy levels are typically estimated to be 3 to 5 grade levels below the highest completed grade of school [29].
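For illustration, the Flesch-Kincaid Grade Level can also be computed directly from its published formula [24, 25]: grade = 0.39 × (words/sentences) + 11.8 × (syllables/words) - 15.59. The Python sketch below is an approximation only; in particular, the syllable counter is a simple vowel-group heuristic, so scores may differ slightly from those produced by Microsoft Word.

```python
import re

def count_syllables(word: str) -> int:
    """Approximate syllable count: runs of consecutive vowels,
    with a crude adjustment for a silent trailing 'e'."""
    word = word.lower()
    groups = re.findall(r"[aeiouy]+", word)
    n = len(groups)
    if word.endswith("e") and n > 1:
        n -= 1
    return max(n, 1)

def flesch_kincaid_grade(text: str) -> float:
    """Flesch-Kincaid Grade Level:
    0.39*(words/sentences) + 11.8*(syllables/words) - 15.59"""
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = re.findall(r"[A-Za-z']+", text)
    syllables = sum(count_syllables(w) for w in words)
    return (0.39 * (len(words) / max(len(sentences), 1))
            + 11.8 * (syllables / max(len(words), 1))
            - 15.59)

# A result above 8.0 would fail the study's 8th-grade cutoff [26]
sample = "Osteoporosis is a disease that makes bones weak and more likely to break."
print(f"Grade level: {flesch_kincaid_grade(sample):.1f}")
```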

Scoring and weighting

Each quality indicator received an absolute score of “1” if present or “0” if absent. The indicators were weighted with a factor of 9, 6, or 4, according to the level of importance assigned by the investigators, as follows: 9=essential, 6=desirable but not essential, 4=helpful but not absolutely necessary. The final weighted score was obtained by multiplying the absolute score by the weighting factor. The maximum possible total weighted score=100.
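A minimal sketch of the scoring arithmetic in Python follows. The indicator names and weight assignments shown are illustrative only; the actual set of 13 weighted indicators, which sums to a maximum of 100, is given in Appendix 1.

```python
# Importance levels and their weighting factors, as defined above
WEIGHTS = {"essential": 9, "desirable": 6, "helpful": 4}

# Hypothetical subset of quality indicators for one website:
# (indicator, importance level, absolute score: 1=present, 0=absent)
indicators = [
    ("basic osteoporosis information included", "essential", 1),
    ("credible source identified",              "essential", 1),
    ("education distinguished from sales",      "desirable", 0),
    ("revision date within 12 months",          "desirable", 1),
    ("seal of approval or accreditation",       "helpful",   0),
]

# Weighted score = absolute score (0 or 1) x weighting factor, summed
weighted_total = sum(score * WEIGHTS[level] for _, level, score in indicators)
print(f"Weighted score: {weighted_total}")  # 9 + 9 + 0 + 6 + 0 = 24
```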

Patient evaluation tool (PET)

Patient perceptions of osteoporosis-website quality were measured with the PET (Appendix 2). This consisted of five questions, each representing one of the categories of content, credibility, navigability, currency, and readability. Answer choices were “agree” (20 points), “not sure” (10 points), and “disagree” (0 points), with no weighting factor and maximum possible total score=100. PET scores were compared to osteoporosis nurse educator HWAT scores. Office practice patients age 50 and older were invited to participate in the website evaluations, and offered $25 as compensation for their time. When 30 PET reports were returned, patient evaluations were closed.
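The PET scoring arithmetic can be stated exactly; the Python sketch below uses a hypothetical set of responses for one website.

```python
# PET answer choices and their point values, as defined above
POINTS = {"agree": 20, "not sure": 10, "disagree": 0}

# Hypothetical responses from one patient for one website,
# one answer per assessment category (maximum total = 100)
responses = {
    "content":      "agree",
    "credibility":  "agree",
    "navigability": "not sure",
    "currency":     "disagree",
    "readability":  "agree",
}

pet_score = sum(POINTS[answer] for answer in responses.values())
print(f"PET score: {pet_score}")  # 20 + 20 + 10 + 0 + 20 = 70
```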

Internet search

An Internet search for “osteoporosis” was done on the Google website (http://www.google.com) with preference settings for interface language=English, search language=English, SafeSearch filtering=moderate, number of results=100 per page. The top 900 matches were saved in electronic form for screening.

Selection of websites

Matching websites from the Internet search were reviewed in descending order of match, beginning with #1, until 100 websites were found that met study entry criteria. Websites were excluded if they were duplicate matches or submatches, were clearly intended for healthcare providers or researchers, had a nonfunctional link, were not applicable to human osteoporosis, consisted primarily of links to other websites, or were not in English. The Uniform Resource Locators (URLs) of the 100 qualifying websites were saved electronically for future retrieval. HWAT development, interobserver reliability testing, and validity testing were based on random selections of 10 of the 100 qualifying websites. Websites for patient evaluations with PET were selected from among the 100 qualifying websites plus additional websites with lower Google matches in order to obtain a wide range of scores from very high to very low.

Interobserver reliability

This is a measurement of the agreement of HWAT 3.0 scores, for each quality indicator and averaged over all quality indicators, between different evaluators. It was measured by comparing the osteoporosis nurse educators (JRC, BMT) with each other and the physician osteoporosis experts (EML, LAR) with each other. For this study, interobserver reliability was expressed as percent agreement of scores. For example, if the osteoporosis nurse educators agreed with each other on the score of a quality indicator for 9 out of 10 websites, then the interobserver reliability was 90%. The higher the value, the better the interobserver reliability.
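Percent agreement as defined here is straightforward to compute; the same calculation also serves for validity (below), with the physician experts as the reference standard. A Python sketch, using hypothetical indicator scores for ten websites:

```python
def percent_agreement(rater_a, rater_b):
    """Percent of websites on which two evaluators gave the same
    score (1=present, 0=absent) for a quality indicator."""
    assert len(rater_a) == len(rater_b)
    matches = sum(a == b for a, b in zip(rater_a, rater_b))
    return 100.0 * matches / len(rater_a)

# Hypothetical scores for one indicator across 10 websites
nurse_1 = [1, 1, 0, 1, 0, 1, 1, 0, 1, 1]
nurse_2 = [1, 1, 0, 1, 1, 1, 1, 0, 1, 1]
print(percent_agreement(nurse_1, nurse_2))  # 90.0 (agree on 9 of 10)
```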

Validity

This is a comparison of mean HWAT 3.0 scores obtained by the osteoporosis nurse educators with those of the osteoporosis physician experts, who were designated as the reference standard for this study. For this study, validity was expressed as percent agreement. For example, if the osteoporosis nurse educators agreed with the physician experts on the score of a quality indicator for 9 out of 10 websites, then the validity was 90%. The higher the value, the better the validity.

Assessment of website quality

This is the analysis of HWAT 3.0 scores by an osteoporosis nurse educator for all qualifying websites. The higher the score, the higher the assessed website quality.

Patient perceptions of website quality

PET was used to evaluate patient perceptions of website quality. PET scores were then compared to the HWAT 3.0 scores of osteoporosis nurse educators to evaluate whether patients and healthcare professionals perceived website quality similarly. Results were expressed as percent agreement of quality class (excellent, good, fair, poor), since assessment tools and scoring scales were different.
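Because the two instruments use different scales (PET is unweighted; HWAT 3.0 is weighted), agreement was assessed on quality class rather than raw score. A Python sketch of this comparison follows; the class cutoffs shown are assumptions for illustration, as the study's exact boundaries are not restated here.

```python
def quality_class(score: float) -> str:
    """Map a 0-100 score to a quality class.
    These cutoffs are hypothetical, for illustration only."""
    if score >= 80:
        return "excellent"
    if score >= 60:
        return "good"
    if score >= 40:
        return "fair"
    return "poor"

# Hypothetical scores for five websites
pet_scores = [90, 75, 55, 65, 25]    # patient evaluators (PET)
hwat_scores = [88, 62, 50, 45, 18]   # nurse educator (HWAT 3.0)

agree = sum(quality_class(p) == quality_class(h)
            for p, h in zip(pet_scores, hwat_scores))
print(f"Class agreement: {100 * agree / len(pet_scores):.0f}%")  # 80%
```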

Additional statistical analysis

Differences between the mean scores of the URL suffix groups were compared using analysis of variance (ANOVA) followed by the Tukey-Kramer multiple comparisons test after confirming that the data were normally distributed and that standard deviations of the groups were not significantly different (Bartlett statistic). Mean scores of the top and bottom deciles were compared using the nonparametric Mann-Whitney test because the data were not normally distributed for both groups. Data were analyzed using GraphPad InStat version 3.00 for Windows 95, GraphPad Software, San Diego, CA, USA.
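An equivalent analysis pipeline can be sketched with open-source tools (the study itself used GraphPad InStat). The Python example below uses SciPy; all scores are hypothetical, and scipy.stats.tukey_hsd (which applies the Tukey-Kramer correction for unequal group sizes) requires a recent SciPy release.

```python
from scipy import stats

# Hypothetical HWAT 3.0 scores grouped by URL suffix
gov = [78, 82, 75, 80, 74]
edu = [72, 77, 70, 75]
org = [68, 74, 66, 71, 69]
com = [55, 48, 62, 50, 58, 45]

# Check equality of variances first, as the study did (Bartlett);
# a normality check (e.g., stats.shapiro per group) would precede ANOVA
print(stats.bartlett(gov, edu, org, com))

anova = stats.f_oneway(gov, edu, org, com)
print(f"ANOVA: F={anova.statistic:.2f}, p={anova.pvalue:.4f}")

# Pairwise multiple comparisons between suffix groups
print(stats.tukey_hsd(gov, edu, org, com))

# Top vs. bottom decile: nonparametric, since normality was not met
top = [96, 88, 85, 82, 70]
bottom = [66, 60, 72, 55, 75]
print(stats.mannwhitneyu(top, bottom, alternative="two-sided"))
```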

Results

The Google search for “osteoporosis” with the specified limits generated 2.13 million matches on March 7, 2004. The top 229 matches were screened in order to select 100 websites (Appendix 3) that met study entry criteria. The reasons for exclusion of websites were submatches or duplicate matches (15%), primarily links to other websites (12%), not in English (10%), intended for healthcare providers or researchers (11%), nonfunctional Google links (6%), and other (46%).

Interobserver reliability

There was an average agreement on HWAT 3.0 scoring for all quality indicators of 88% for the physician osteoporosis experts and 79% for the osteoporosis nurse educators (Table 1). Agreement for individual quality indicators ranged from 70% to 100% for the physicians and from 60% to 100% for the nurses. There was 100% agreement for both physicians and nurses in the readability category. Excluding readability, the category with the highest average agreement for physicians was content (95%), while for nurses it was currency (90%). The category with the lowest average agreement for physicians was credibility (83%), while for nurses it was content and credibility (73% for both).

Table 1 Agreement of HWAT 3.0 scoring for different evaluators

Validity

There was an average agreement on HWAT 3.0 scoring for all quality indicators of 71% between the physician osteoporosis experts and the osteoporosis nurse educators (Table 1). Agreement for individual quality indicators ranged from 40% (author or institution stated) to 100% (readability). Excluding readability, the category with the highest average agreement was currency (90%).

Assessment of website quality

HWAT 3.0 scores for evaluated websites ranged from 18 to 96 (distribution shown in Fig. 1), with a mean score of 63 and a median of 66. The mean weighted scores by website URL suffix are shown in Table 2. The highest mean score (78) was for .gov websites. There was a statistically significant difference between the mean scores of the various URL suffix groups [ANOVA, P<0.0001; the .net group was excluded because the small sample size (n=4) did not allow for confirmation of normality]. The URL suffix .com group scored significantly lower than the URL suffix groups .gov (P<0.05), .edu (P<0.01), and .org (P<0.01). There were no significant differences among the other groups. No attempt was made to ensure that equal numbers of websites for each URL suffix were evaluated, as comparing scores for the different URL groups was not the primary objective of the study. Thus sample size among the different URL groups varied greatly, which is not ideal for formal statistical analysis. However, statistical analysis was performed in an effort to gauge the relative differences in mean scores for each URL suffix, and results should be considered in view of the small and unequal sample sizes. The mean weighted score for the top decile (websites 1–10; mean=82) was significantly better (P=0.0375) than that for the bottom decile (websites 90–100; mean=66). Table 3 lists the quality indicators in order of rank according to the percent of evaluated websites with positive scores. This shows that almost all evaluated websites did well with functioning links, while only 2 of 100 had an acceptable Flesch-Kincaid grade level score, with a wide range for other indicators. A “seal of approval” or equivalent was found for 27 websites (qualifying order #s 1–6, 10–12, 21, 22, 28, 31, 32, 34–36, 41, 46, 53, 63, 66, 69, 75, 81, 82, 89), with the majority of these found in the top one-third of qualifying order.

Fig. 1 Distribution of weighted HWAT 3.0 scores

Table 2 Mean weighted HWAT 3.0 scores by URL suffix
Table 3 Percent of websites with positive scores for each quality indicator

Patient perceptions of website quality

The volunteer patient website evaluators included 12 men and 18 women with a mean age of 66.3 years (range 50–84). Educational level was beyond high school for 77%, high school for 17%, less than high school for 3%, and not given for 3%. Self-designated computer skill level was advanced for 10%, intermediate for 63%, and beginner for 26%. The websites selected for patient evaluations represented HWAT 3.0 weighted scores ranging from 18 to 96 (mean of 57). Data acquisition was confounded by the inability of some patients to connect to some websites. Four patients were unable to connect to three or more websites, and four websites had no responses from three or more patients. Of 300 possible website evaluations (30 patient evaluators × 10 websites), there were no data for 34 (11%) due to inability to connect. When nonconnected websites were excluded from analysis, there was 100% agreement on the three highest-quality websites combined, comparing the patient evaluators using PET and the osteoporosis nurse educator using HWAT 3.0 (Table 4). Agreement was less striking with lower HWAT 3.0 rankings, but nevertheless appeared to show some degree of concordance.

Table 4 Patient ranking of osteoporosis websites compared to HWAT 3.0

Discussion

Quality means doing it right when no one is looking. —Henry Ford

The quest to verify and promote excellence in healthcare is an elusive one. Investigators have struggled to find a universally acceptable definition of quality and the tools to measure it, whether it be for the business of medicine [30], delivery of health care [31], medical websites [15], or bone itself [32]. A URAC white paper identified accuracy, reliability, and completeness as the key elements of quality information on medical websites [20]. Others have described detailed criteria for assessment of Internet health information that include credibility (source, currency, relevance/utility, review process), content (accuracy, completeness, disclaimer), disclosure (purpose), links (selection, architecture, content, back linkages), design (accessibility, navigability, internal search), interactivity (feedback, communication), and caveats (clarification of marketing of products or services) [21]. Many instruments for rating website quality have been reported, mostly in the form of “report cards” that grade websites according to specified criteria, or “awards” and “seals of approval” that are displayed on the website [16, 33]. The application of these instruments has been limited by the variability and rapidly changing nature of the websites being evaluated, the short lifespan of many web-based rating instruments, uncertainty in defining quality standards, and the lack of data on interobserver reliability and validity of the instruments themselves [16, 17].

HWAT 3.0, the evaluation tool developed for use in this study, was designed to be comprehensive enough to discriminate important differences in the quality of information on osteoporosis websites, yet simple enough to be used efficiently and effectively by health educators who are not scientists or informatics experts. Five assessment categories (content, credibility, navigability, currency, and readability) were ultimately selected, with a total of 13 quality indicators within those categories. PET was developed to determine whether patients perceived quality in the same way as healthcare professionals did, using a single quality indicator for each of those categories. Content is arguably the most important of the categories, provided that the source of the material is trustworthy and verifiable, the technical functions of the website are usable, the information is up to date, and it is written in a manner that is understandable for most users. Evaluation of the rating instrument and the results of using the instrument to evaluate osteoporosis websites are reported here. The findings suggest that healthcare professionals usually agree with one another on website quality ratings, and that patients generally agree with healthcare professionals, especially for the top-rated websites. Some website URL suffixes are associated with better quality ratings than others. Some quality indicators, such as functioning website links, are almost always present, while others, such as acceptable readability, are almost never present. Websites with higher search engine matches generally, but not always, scored higher than those with lower matches.

These findings are mostly consistent with those of other studies evaluating medical website quality. An assessment of 69 arthritis websites with a similar tool also showed great variability in quality scores, with .gov websites having the highest mean score [22]. An evaluation of 116 websites about carpal tunnel syndrome showed that a high Google match was an indicator of the accuracy of the medical information provided [34]. A quality rating instrument used to assess 60 websites devoted to chronic liver disease found a correlation between quality scores and sponsorship, with commercial websites most likely to have low ratings [35]. An evaluation of 60 websites on low back pain showed that most of the patient information was of poor quality, concluding that patients should be discouraged from using the Internet as a source of information unless the viewed websites have been shown to be evidence-based [36]. Most reported rating instruments do not provide detailed information on the selection of criteria for quality assessment, and very few present data on interobserver reliability and validity.

While the hope for universal excellence in patient information provided on medical websites is laudable, it is not likely to be realized. It is unknown whether the use of instruments for rating website quality is effective in directing patients to some websites over others, or whether the instruments make a difference in the design and content of websites being developed. However, given the very large number of patients seeking medical information on the Internet, and the vast and variable panoply of information that is available, continued pursuit of quality verification may be worth the effort. More study is needed to work toward consensus in defining and measuring the quality of patient education on the Internet. If quality standards are then adopted by organizations that have the potential to influence website developers and patients using the Internet, perhaps the Internet can become a more reliable and trustworthy means of providing patient education. This study has limitations. A good score with any rating instrument is not a guarantee of future quality, given the rapid changes in medical knowledge and subsequent changes, or lack of change, in website content. Errors in extracting data may have occurred due to evaluator mistakes. The websites evaluated in this study were retrieved from matches on one search engine at one point in time, and may not be representative of matches with other search engines or searches done at other times.

Conclusions

Tools for measuring the quality of medical websites were developed for use by healthcare professionals and patients. Interobserver reliability and validity were assessed with osteoporosis nurse educators and physician experts. The tools were applied to a sample of osteoporosis websites. Significant variability in website quality was observed, with higher quality scores associated with a higher level of search engine match and with specific URL suffixes. The quality indicator most often present was functioning website links, while the one found least often was acceptable readability. Patients agreed with healthcare professionals on the quality of top-rated websites, with less agreement on lower-rated websites. More research is needed to achieve consensus on quality standards and to determine whether validated instruments for rating the quality of medical websites ultimately benefit patients.