Background of the study

For well over a decade, Web hyperlink analysis has been a growing area and a main topic of Webometrics research. Starting from the early Web Impact Factor concept (Ingwersen 1998), many studies, both quantitatively (e.g. Thelwall 2001) and qualitatively (e.g. Bar-Ilan 2005), have been carried out that developed different concepts and techniques that use Web hyperlink data to find various types of information. Inlink analysis and co-link analysis have been two common types of Web hyperlink analysis. Parallel to the inlink and co-link concepts are the concepts of the number of keywords and the number of co-words. Thelwall and Sud (2011) proposed organisation title mentions as a measure of academic impact while Vaughan and You (2010) proposed the Web co-word analysis as a way to measure and visualize relationships among organizations. Extending these earlier studies, the current study examined the two concepts in a multi-industry environment to determine if the keyword count of the company name can be a measure of business performance and whether the co-word method can measure relationships among companies in a heterogeneous context. Further, the study were carried out in various types of Web environments including general Websites, blog sites and Web news sites to find out if and how results differ in these different environments and which type of Websites is more conducive for this type of keyword analysis.

Many studies have shown that the number of inlinks to an organization can be a measure of the organization’ performance or position. For example, inlink counts have been found to correlate with university’s teaching or research performance (Li et al. 2003; Smith and Thelwall 2002) and company business performance measures (Vaughan and Romero-Frías 2010). Co-link studies have been applied to various types of organizations and were found to be able to show relationships among the organizations studied, for example academic relationships (Ortega and Aguillo 2009; Thelwall and Wilkinson 2004), business relationships (Romero-Frías and Vaughan 2010b), political relationships (Kim et al. 2010; Romero-Frías and Vaughan 2010a), and government relationships (Holmberg 2009). While these hyperlink studies have successfully contributed to our understanding of the Web link phenomenon and made significant contribution to Webometrics, the source of Web hyperlink data collection from commercial search engines has been decreasing over the years. MSN suspended its inlink search in 2007 (Seidman 2007). Although Google still provides inlink search, it only retrieves a sample of inlinks (Google 2011). Recent studies relied on Yahoo! for inlink and co-link data collection. However, Yahoo! stopped link search command from its Web interface (www.yahoo.com) in the summer of 2010 which made co-link search impossible there. Then in April 2011 Yahoo! terminated its API service (Yahoo! 2011) so no inlink nor co-link search could be done through API. At the time of writing (Oct 2011), Yahoo! Site Explorer still provides inlink search but it is not clear how long this service will be available.

Given the diminishing data source for link analysis, researchers have tried Web keywords as an alternative object to supplement Web hyperlinks. Vaughan and You (2010) proposed the Web co-word analysis concept and tested the method in the telecommunications industry. They found that the co-word method could generate business competition maps as the co-link method did. Later, Thelwall and Sud (2011) proposed the organisation title mentions (the number of hits of keyword searching of the organization title) as a Web impact measurement. Building on these earlier studies, the current study attempts to further our knowledge of Web keyword analysis in the following ways. First, we will determine if the number of occurrences of a company name on Websites can be used as an indicator of some commonly used business measures. While Thelwall and Sud (2011) showed that organization title mention can be academic impact measure, no study has examined if the company name mention can be a measure of business performance and our study attempted to determine this. Second, the current study extends the co-word analysis method to multi-industry environment to find out its feasibility there (Vaughan and You (2010) study was in a single industry environment). Third, the current study was carried out in different types of Web environments including general Websites, blog sites, and Web news sites to find out if and how results differ in these different environments and which environment is more conducive to the keyword analysis. While Thelwall and Sud (2011) study was carried out only on general Websites, Vaughan and You (2010) compared results from general Websites with that from blog sites and found that the latter is a better source to collect Web co-word data. The current study also includes the news Websites to find out how it compares with other types of sites that have been studied before. These purposes of the study lead naturally to our research questions as follows.

Research questions:

  1. 1.

    Does the Web visibility of a company as measured by the number of Webpages on which the company name appears correlate with the company’s business measures.

  2. 2.

    Can co-word analysis show business relationships among companies when applied to a multi-industry context.

  3. 3.

    Which type of Websites (general Websites, blog sites, or Web news sites) is more conducive for data collection.

Methodology

To address our research questions, we selected a group of American industries as well as companies within each industry to study, collected company financial data, selected search engines and collected various types of keyword data using the search engines. In addition, we also collected Web hyperlink data (both inlink and co-link data) because we want to contrast results from keyword analysis with that from inlink and co-link analysis to find out if keyword analysis can supplement inlink and co-link analysis in light of the shortage of commercial search engines from which to collect hyperlink data.

Industries and companies in the Study

Five diverse US industries were selected for the study: information technology, media, heavy construction and engineering, mining, and banking. These industries cover a broad range of economic features and various degrees of exposure on the Internet. They range from traditional industries (mining and construction) to more information-centred industries (IT and media). To make an objective selection of companies within each industry, we consulted industry reports produced by Mergent (http://www.mergentonline.com), a reputable business database. Mergent reports list top companies (usually 9–10) for each industry. All companies listed were included in the study. All the reports are dated 2010 (Mergent 2010a, b, c, d) except the one for the heavy construction which is dated 2009 (Mergent 2009) and which was the most recent report for that industry at the time the reports were consulted (September 14, 2010). The IT industry report listed nine companies and the other four industry reports each listed ten companies. All these 49 companies were included in the study. The complete list of all companies together with all data about the company that were used in the study is shown in Table 3 in Appendix.

Collecting company financial data

For the purpose of the study, we decided to use financial variables of revenue, profit, and assets because they are the most commonly used variables of financial performance (revenue and profit) and financial position (assets). We collected financial data from Yahoo! Finance (http://finance.yahoo.com/) as it contained these three types of data. Yahoo! Finance data were provided by Capital IQ (a Standard and Poor’s business). Specifically, we entered the company ticker (see Table 4 in Appendix) into the search box of “GET QUOTES” and then retrieved the financial data that we wanted. Company Massey Energy Co. was not available at Yahoo! Finance because the company was acquired by Alpha Natural Resources Inc. in June 2011. We obtained this company’s financial data directly from the 2010 Annual Report as registered in the SEC’s Edgar System (http://www.sec.gov/edgar.shtml). All financial data used in the study were for year 2010, the year that we collected all Web data, and they are shown in Table 4 in Appendix.

Collecting Web keyword data

Two types of Web keyword data were collected: the number of occurrences of company names (keyword count) and the number of co-occurrences of names of a pair of companies (co-word count). In both scenarios, the acronym, rather than the full name of the company, was used. For example, Intel is used instead of Intel Corp. while Cisco is used for Cisco Systems Inc. The decision to use the acronym rather than the full name was based on the fact that the former is more likely to be used on Webpages. This is also consistent with the co-word data collection method in earlier studies (e.g. Vaughan and You 2010). The proper acronym for each company was determined based on common use as shown on Webpages. See Table 3 in Appendix shows acronyms used in the study.

If an acronym consists of more than one word, it was searched as a phrase by using quotation marks around the acronym. For example, the acronym of Time Warner was searched as “Time Warner”. Keyword counts were collected by entering the company name as the query term and then recording the number of hits of the query. The co-word counts were determined by entering the pair of company names as the query and then recording the number of hits of the query. For example, the co-word count of companies Time Warner and Intel was searched as “Time WarnerIntel. Boolean operator AND was not used to connect the two acronyms because AND was the default search operator in Google which was used to collect Web keyword data.

Web keyword data were collected from three types of Web sources: the general Web, blogs, and Web news. Three Google search engines (www.google.com, www.google.com/blogsearch, and www.news.google.com) were used to collect the three types of Web data, respectively. Google was chosen because it is the most popular search engine on the Web and had the largest coverage of Websites. Another reason that we used Google was that at the time of the study, fall 2010, Bing and Yahoo! did not have blog search engines. Data from the general Web were collected on Oct. 6, 2010 while data from blogs and news sites were collected on Oct. 11, 2010.

Collecting Web hyperlink data

The Website address of each of companies in the study was searched using Google and then manually checked to ensure that it was correct. The vast majority of companies in the study have only one URL for their Websites. When a company had more than one valid URL, we checked each URL to find out which one had more inlinks and used that one for collecting inlink data. Ideally, we should use all URLs of a company in collecting inlink data. However, the search engine used for collecting inlink data, Yahoo!, could not handle the complex queries need for collecting co-link data with two or more URLs.

As discussed earlier in the “Background of the study” section of the paper, only Yahoo! could be used for inlink data collection at the time of the study (fall 2010). Further, co-link data could only be collected from Yahoo! API while inlink data were still available through Yahoo!’s Site Explorer. So we collected all inlink and co-link data through Yahoo! API. Yahoo! had two inlink search operators: link and linkdomain. The “link” operator retrieved links to a particular page while the linkdomain operator retrieved all links to all pages of a particular Website or domain. We used the linkdomain operator because all links to the Website or the domain of a company are relevant to the company’s Web visibility and connectivity.

The query syntax for inlink data was: linkdomain:website1.com–site:website1.com; whereas the query syntax to collect co-link data was: (linkdomain:website1.com–site:website1.com) (linkdomain:website2.com–site:website2.com). We truncated the www portion of the URLs in the queries in order to capture links to all subdomains (e.g. mail.website1.com). The “-site:website1.com” part of the query let us filter out internal links coming from within the domain of the company itself. All inlink and co-link data were collected on Oct. 5, 2010.

Methods of data analysis

Descriptive statistics were generated (1) for each industry individually and for all industry as a whole to provide an overall view of the industries; (2) for each type of Web data to for a comparison of different types of Web data. Correlation coefficient tests were carried out to address research question 1. Spearman correlation coefficient tests rather than the Pearson correlation coefficient tests were used because the frequency distributions of Web data were very skewed. Correlation coefficients for different types of Web data were compared to determine if the keyword count data can replace inlink data and which type of Websites (general Websites, blog sites, and news Websites) is better for data collection (research questions 3).

To address research question 2, co-link and co-word matrices were analyzed using multidimensional scaling (MDS) to generate MDS maps. The raw co-link and co-word counts were normalized by Jaccard index to obtain a relative measure of the relatedness and then fed into SPSS version 17 for MDS analysis. We compared MDS maps of co-link with co-word data to find out if co-word data can replace co-link data. We also compared co-word data from different Web data sources (general Websites, blog sites, and Web news sites) to find out which data source is better (research question 3). Vaughan and You (2010) showed that MDS analysis of co-word data can position companies in a particular industry according to their business relationships. The current study extended that study to a multi-industry context and attempted to find out if the co-word analysis would map companies in the way that reflects a multi-industry business scenario: (1) companies are clustered according to their industry membership; (2) similar industries would be positioned closer. These are the criteria that we used to compare different MDS maps.

Results

Descriptive statistics

Descriptive statistics of inlink and keyword count data (keyword search of company names) are shown in Table 1. The overall pattern is that the IT industry was the most visible on the Web (having the highest inlink and keyword counts) while the mining and the construction industries had the lowest inlink and keyword counts. This pattern echoes the Web profile of these industries as we know them: IT industry is the leader in Web use while mining and construction had lower use of the Web for business purposes. When all industries are combined, there are more blog counts than inlink counts, which suggests that there will be no shortage of blogs from which to collect keyword data if we are going to replace inlink data with keyword data.

Table 1 Descriptive statistics of inlink and keyword (keyword search of company names) data

Correlation between Web data and financial data

Spearman correlation coefficients between Web data and financial data are shown in Table 2. All correlation coefficients are statistically significant (p < 0.01). Relating to research question 1, data here show that the Web visibility of a company as measured by the number of Webpages on which the company name appears correlates with the company’s business performance measures of revenue, profits, and assets. Comparing the three types of keyword data sources, correlations are higher for data retrieved from blog and news sites than that from the general Websites. This suggests that blog and news sites are better than the general Websites for this type of keyword analysis, a conclusion that is also reached in our co-word analysis that will be reported below.

Table 2 Correlation between Web data and financial data

Are keyword count data as good as inlink data as Web visibility or impact measures? We suggest that they are comparable. This is based on the comparison of correlation coefficients in Table 2 (inlink vs. Google Blogs and Google News) where the numbers are very close. So we conclude that the keyword counts of company names could potentially replace inlink counts to company Websites especially when the latter are not available. A further evidence that supports our conclusion is that the correlation between inlink counts and Google blog counts is 0.81 while that between inlink counts and Google news counts is 0.82; both are very high and significant (p < 0.01).

Co-link and co-word analysis

Four MDS analyses were carried out, one for the co-links and the other for the three sets of co-word data collected from Google, Google Blogs, and Google News. The stress values are all under 0.05 (0.049, 0.034, 0.029, and 0.025, respectively) which indicate that the MDS maps fit the data well. In the MDS maps reported below (Figs. 1, 2, 3, 4), companies are labelled in a way that will easily identify its industry membership. The first two letters in the labels identify the industry, e.g. “Ba” for banking and “Mi” for mining. The number following the two letters is the order that the company shows up in Table 3 in Appendix, e.g. Ba1 is the first bank in Table 3 in Appendix. For the convenience of reading the maps, circles that represent the companies in the maps are shown in different shades (from solid black to transparent) for different industries.

Fig. 1
figure 1

MDS map based on co-link data

Fig. 2
figure 2

MDS map based on co-word data collected from blogs

Fig. 3
figure 3

MDS map based on co-word data collected from Web News

Fig. 4
figure 4

MDS map based on co-word data collected from general Websites

Table 3 Web data of companies in the study

Figure 1 is the MDS map generated from the co-link data. Companies are clustered by the industries except those of the media industry. All IT companies are clustered close together except companies Ingram Micro (IT8) and Tech Data (IT9). These two are computer wholesalers, much smaller and different from the giants such as Apple, IBM and Microsoft. Industries that rely more on information technology (IT, banking, and media) are positioned on one side, contrasting with mining and construction industries that are located on the other side (the dotted line in Fig. 1 shows the division). A contrast between traditional industries and the information centred industries was also seen in an earlier co-link study of multi-industry companies (Romero-Frías and Vaughan 2010b).

Figure 2 is the MDS map generated from the co-word data collected from Google Blog. Clustering by industries is clear here than in Fig. 1 where media companies are not clustered together. The contrast between traditional industry (mining and construction) and the three more IT oriented industries seen in Fig. 1 is also shown Fig. 2. Overall, co-word data collected from Google Blogs is as good as or even better than co-link data in showing business relationships among companies.

Figure 3 is the MDS map of co-word collected from Google News. There are 47 instead of 49 companies in this map. Two companies, MDU Resources group (Co6) and Martin Marietta Materials In (Co8), had to be omitted from the MDS analysis because the co-word counts between these two companies and other companies are too few to have proper MDS analysis. Like in Fig. 2, companies are clearly clustered into the five industries with the exception of the two smaller IT companies as explained earlier and a few other companies. However, the pattern of division between the traditional industries (mining and construction) vs. the other industries is not shown here.

The MDS map generated from co-word data collected from general Google search engine is shown in Fig. 4. There is a rough division between traditional industries (mining and construction) vs. other industries. However, companies are not clustered by industries except the IT industry. Overall, this map does not show relationship among companies, which suggests that the general Web is not an appropriate source from which to collect co-word data.

Discussion and conclusions

The study found that the Web visibility of a company as measured by the number of Webpages on which the company name appears correlates with the company’s business measures. This finding parallels findings from earlier research which showed that the number of inlinks pointing to a company’s Website correlates with the company’s business performance measures (Vaughan and Romero-Frías 2010). The current study also found significant correlations between inlink counts and the keyword counts (the number of pages on which the company name appears), suggesting that the keyword count could substitute inlink count as an alternative indicator of business measures. Thelwall and Sud (2011) found that the organisation title mentions could be a measure of academic impact. Tying all these findings together, we conclude that the keyword count (the number of mentions of the organization name) could be a measure of Web visibility or Web impact for academic and business organizations, replacing the role that the inlink count has played in this regard. In terms of sources for data collection, the study found blog sites and Web news sites to be better than general Websites.

The study also found that the co-word analysis could show business relationships among companies even in a multi-industry context. This extends earlier studies that tested the co-word method in a single industry (Vaughan and You 2010; Vaughan et al. in press). When different data sources are compared, the study found that blogs to be a better source than general Websites. Vaughan and You (2010) reached the same conclusion so the advantage of blog pages over general Webpages seems to be clear, at least for studies of business Websites. The study also tested data collection on news Website; no previous study used this data source. It found that Web news sites is a better data source than the general Web but may not be as good as blog sites. Comparing results of co-link data and that of co-word data collected from blog sites, the latter is as good or even slightly better. So co-word analysis could potentially replace co-link analysis if an appropriate co-word data source is used.

A limitation of the study is that it was focused on one particular environment (business related Websites), so the conclusions on the potential of co-word analysis replacing co-link analysis and the relative advantage of blog data over general Web data may not be applicable to other studies (e.g. mapping academic relationships). This is the first study that tried collecting data from news Websites and the study is limited in scale, so the conclusion on the usefulness of news Websites for data collection may not be generalizable.

It is very important to note that the value of the study is not limited to business related Websites. Earlier studies have shown that the co-link analysis can be used to map relationships among various types of organizations such as academic (Ortega et al. 2008; Thelwall and Wilkinson 2004), business (Vaughan and Romero-Frías 2010), political (Romero-Frías and Vaughan 2010a) and government (Holmberg 2009). The co-word analysis parallels the co-link analysis in logic and method, so it is conceivable that the co-word analysis could be useful in mapping other types of relationships as well. It is important that we develop new Webometrics method such as co-word analysis in light of the declining data source for hyperlink analysis. This will not only keep the healthy development of Webometrics but also let us take advantage of the rich information available on the Web.