1 Introduction

Technology has transformed the way we communicate in the modern digital age. No longer do we simply rely on speech and writing but also on a range of different forms of ‘e-language’. E-language is defined here as any communicative, interactive and/or linguistic stimulus that is digitally based and ‘incorporates multiple forms of media bridging the physical and digital’ (Boyd and Heer 2006: 1): from e-mails to discussion board threads, SMS messages and so on (‘e-language’ is also known as Computer Mediated Communication, CMC: see Walther 1996; Garcia and Jacobs 1999; Herring 1999 and Thurlow et al. 2004, and ‘netspeak’, Crystal 2003: 17). As a relatively new ‘genre’ of communication (Herring 2002), the definition and description of the features of e-language and how it compares and contrasts with spoken and written genres of communication is an on-going concern in studies of CMC, Applied Linguistics, Corpus Linguistics and beyond. This is something that will be examined in more detail in the current chapter.

Based on Crystal (2003: 17), there is a suggestion that spoken and written language effectively exist on a ‘continuum’ of formality (also see Condon and Cech 1996; Ko 1996; Herring 2007 for further discussions on the differences between spoken and written discourse). The ‘more’ formal language structures exist on the left of the continuum, where written language is conventionally positioned, and the least formal exists towards the right end of the continuum, where spoken language is conventionally perceived to be positioned (although obviously their positioning is somewhat fluid as no absolute positioning in this abstract notion can ever exist – it is a theoretical continuum not a static classification system).

Considered as a distinct genre of communication, Crystal suggests that ‘netspeak’ is perhaps somewhere in the middle, between spoken and written language (2003: 17). He suggests that there is essentially a blurring of traditional characteristics of spoken and written language, in digital communication, making it a combination of both of the more ‘traditional’ genres (also Biber 1993; Collot and Belmore 1996; Yates 1996; Crystal 2001 for further discussion). Others have added to this notion, instead suggesting that each e-language ‘mode’ (Murray 1988) is structurally, semantically and pragmatically different from one another as well as spoken and written language types, making their relative positioning along this continuum of formality highly variable (see Murray 1988; Baym 1995; Cherny 1999; Herring 1996).

Levels of formality in specific modes of e-language have already received attention from researchers (see works by Sutherland 2002; Hard af Segersteg 2002; Shortis 2007; Crystal 2008 for further details). For example, Tagg (2009) and Ling (2003) both report on the tendency for SMS messages to be immediate and personal, written in the first person and directed to specific recipients. Tagg adds to this, underlining that ‘the informal and intimate nature of texting encourages the use of speech-like language’ in this e-language mode (2009: 17, also see Crystal 2003; Oksman and Turtianen 2004). Similarly, Baron highlights that although email, as with texting and other common forms of e-language, is typed or ‘written’ rather than spoken, ‘participants exploit it for typically spoken purposes’ (1998: 36), and it therefore shares more similarities with communication situated at the spoken rather than written end of the continuum.

Levels of formality across e-language as a specific genre, and the relationships that exist between individual modes, however, remain under-explored in corpus-based analyses of real-life data. Initial developments in this area of research have been made by Knight et al. (forthcoming, 2012), who provided some preliminary observations about the frequency of pronouns and deictic markers in e-language, compared to written and spoken excerpts from the BNC. This study is extended in the present chapter but with a focus, instead, on the use of forms of hedging in e-language. The corpus used in this chapter is CANELC, the Cambridge and Nottingham e-language Corpus, a one-million-word corpus of digital discourse taken from British contributors or those posting to British websites in 2010–2011. It includes data from discussion boards, blogs, tweets, emails and SMS messages, distributed according to Fig. 1 (word counts for each mode are included in this figure).

Fig. 1 The contents of the CANELC corpus

CANELC was built to allow for the querying of data at the general level of the genre of interaction as well as at the level of the individual communicative mode. So, using results from corpus-pragmatic enquiries of CANELC, we aim to create a deeper understanding of how different modes of e-language relate to Crystal’s notion of the ‘continuum’ of formality.

2 Corpus Pragmatics

2.1 Overview

The study of the pragmatics of language use has traditionally concentrated on spoken registers rather than written language because the latter tends to be ‘referentially explicit’ (McEnery et al. 2006: 104) while the former allows for a more ‘extensive reference to the physical and temporal situation of discourse’ (Biber 1988: 144) in the construction of meaning. Spoken interaction is, in other words, highly context specific, and meaning is not only determined by the specific spoken or written ‘sign’ (Morris 1946: 287) used, but by a range of other ‘extrinsic’; ‘social, cultural and interactive’ factors, and ‘intrinsic’, ‘cognitive, affective and conative’ factors that exist (Kopytko 2003: 45; also see Labov 1972; van Dijk 1977; Duranti and Goodwin 1992; Eckert and Rickford 2001; Fetzer 2004, for further discussion on language and context).

There is no one-to-one relationship between language form and function as the interpretation of a given message is highly dependent on the communicative function of a word or utterance, in a specific discursive context (for discussions of language and context see Labov 1972; Bates 1976; Nelson et al. 1985; Brown 1989; Halliday and Hasan 1989; Duranti and Goodwin 1992; Widdowson 1998; Green 2002; Scollon and Scollon 2003). In spoken communication, much of the discursive context is ‘shared’ (McEnery et al. 2006: 105) between a speaker and an interlocutor.

This affects the type of language used as there is a temporal and/or physical closeness in spoken discourse between the individuals as well as a shared knowledge about the immediate communicative context. This provides a ‘clear advantage in using contextual expressions such as I, there, or now, [for example,] which are shorter and more direct’ (Heylighen and Dewaele 2002: 301). Depending on the relationship and social distance between the speaker and interlocutor, speakers can thus use less formal expressions and a larger number of pronouns and deictic markers in this shared communicative space (see Fowler and Kress 1979; Chafe and Danielewicz 1987; Biber 1992; Biber et al. 1999; Leech 2000; Carter and McCarthy 2006; Atkins 2011). There is more of a gulf in spatial distance and time between writers and readers of written texts as there is no guarantee of when a text may be read or by whom. Written texts are not as contextually bound and thus often lack the shared knowledge and understanding between writer and reader, which often correlates with a decrease in the use of contextual (deictic) expressions in these texts.

While not necessarily true of all forms of e-language (instant messaging, IM, for example), the different modes of data included in CANELC are somewhat similar to one another in that they do not ‘require that users be logged on at the same time in order to send and receive messages’ (Herring 2007: 13). Messages sent via these different modes are ‘stored at the addressee’s site until they can be read’ by the recipient (Herring 2007: 13). They are not forms of communication which necessarily require an instant response as, again, IMs do and face-to-face (spoken) interaction does. They are, therefore, asynchronous (for more detailed discussion of synchronicity see Condon and Cech 1996; Ko 1996; Herring 2007).

This asynchronicity means that the data in CANELC is arguably structurally organised in a way that is more consistent with written language (which is also asynchronous) than with spoken language. It is interesting, then, to note that often only a few seconds or minutes pass between the time when a message is sent and the time it is attended to across different e-language modes, despite this asynchronicity. There may in fact only be a short delay between the time a message is composed and read/responded to (although there is likely to be some inconsistency in the average time taken across the different modes of e-language). This is likely to reduce the temporal and social distance between sender and receiver as highly context-specific information about the message (related to time) is more likely to be shared and understood.

As a consequence of this, as outlined in Knight et al. (forthcoming, 2012), there is frequent use of ‘temporal referents….deictic marking (as with the prolific use of personal pronouns)’ in e-language. These discursive features again hint at forms of communication that are potentially allowing for an immediate or near-immediate information exchange, a forum for communicating reports of events and incidents in near real-time, as the understanding of the temporal referent is shared’. There is a shared digital space rather than physical space, within which ‘the social, physical and temporal context is frequently changeable’ (Knight et al. forthcoming, 2012). This is contrary to what is expected from asynchronous communication, aligning e-language more closely with informal, spoken discourse, despite the fact it is not synchronous and is typed/written rather than spoken.

2.2 Hedging

In addition to pronouns and deictic markers, another pervasive feature that relates to levels of formality in discourse is the use of hedging (first coined by Lakoff 1972: 195). In pragmatics, hedges are ‘expression[s] of tentativeness and possibility’ (Hyland 1996: 433) which operate to ‘mitigate the directness of what we say and so operate as face-saving devices’ (O’Keeffe et al. 2007: 174 – for more information on politeness theory and the notion of ‘face’, see Brown and Levinson 1978, 1987). They are ‘pragmatic markers’ (Carter and McCarthy 2006: 223) which can be used ‘to downtone…..the force of an utterance for various reasons e.g. politeness, indirectness, vagueness and understatement’ (Farr et al. 2004: 13). The specific form, frequency and functions that hedges adopt also ‘vary relative to context’ (O’Keeffe et al. 2007: 174). Examples of hedging are seen in Fig. 2:

Fig. 2 An example of hedging, taken from the discussion board data in CANELC

We see the use of four hedges (in bold) in this discussion board thread. The contributor is making plans for her birthday evening, discussing the possibility of inviting a party of friends to a local pub to celebrate. Kind of operates as an inexact stance adverb, softening the content of the thread. As with maybe, kind of acts almost as a ‘downtoner’, as instead of saying ‘it would be nice to go to the pub, especially since it is my birthday’, the use of this hedge provides an approximate reflection of what the contributor really means (Hübler 1983: 68). I figure functions in a similar way, acting as a verb with a modal meaning that softens the assumption about the pub in order to mitigate a potential face threat for the sender or receiver of the message. Particularly has a similar effect, as omitting the adverb in this context would result in the utterance seeming blunt.

As face-saving devices, or ‘softeners’ (Nikula 1997: 188), the frequent use of hedges is often linked to formal rather than informal contexts of communication (this is true of both spoken and written discourse, but given the tendency for written to be ‘more’ formal, the level of hedging is generally higher for written discourse vs. spoken discourse). Farr and O’Keeffe’s (2002) study of hedging in the spoken LCIE corpus (Limerick Corpus of Irish English) best illustrates this pattern. In this study, hedges were found to be most frequently used in institutional settings including teacher training contexts and radio discourse, with their use reducing in conversations between family and friends (see Farr et al. 2004) where there are ‘fixed relationships’ (Clancy 2002) and a closeness between speakers and listeners (creating less of a need for participants to save face). The context where the fewest hedges were used in the corpus was in shop encounters. This is ‘perhaps explained by the lesser need to protect face in service encounters, where a customer and a server do not know each other, and where they are interacting within transactional roles’ (O’Keeffe et al. 2007: 176). The potential face threat is lower so the use of the mitigating hedging devices is not as essential in such discursive contexts.

Having said this, other studies have suggested that since it is performed in ‘real-time’ (Leech 2000), spoken ‘conversation is [often] more vague than written genres’ (McEnery et al. 2006: 105), so an increase in the frequency of certain forms of hedging functioning as vague language markers is often seen. For example, based on queries of the World Edition of the BNC (British National Corpus), Gries and David (2007) discovered that kind of and sort of were both forms of hedges functioning as vague stance adverbs that are frequently used in spoken discourse, in comparison to written discourse. Of these two clusters, however, sort of was significantly more common in written mode than kind of, while the reverse was found to be true of the spoken mode. Of written communication specifically, Biber et al. reported that the clusters kind of and sort of are both used more frequently in formal, academic prose than in other written registers (based on a study of the Longman Spoken and Written English Corpus, 1999: 560–561; other studies of these clusters have been carried out by Crystal and Davy 1975 and Quirk et al. 1985, comparing their frequency of use between British and American English).

This pattern is inversely true of more private and personal forms of communication as opposed to more public forms (Carter and McCarthy 2006: 9–16). So written interaction, for example, that is most public (professional) and formal in nature (a government policy document for example), will likely see an increase in the number of vague stance adverbs used, when compared to a more personal expression of feelings, for example as this ‘softening’ function is unlikely to be required with close or intimate relationships.

Numerous other studies have been carried out on hedging in written discourse (Dubois 1987; Channell 1990; Drave 1995; Allison 1995), spoken interaction (see Crystal and Davy 1975; Brown and Yule 1983; McCarthy 1991; Cheng and Warren 1999; Jucker et al. 2003 for examples) and individual modes of e-language including SMS messages (Crystal 2001; Tagg 2009), Blogs (Myers 2010), Instant Messaging (IMs – Brennan and O’Haeri 1999), Discussion Boards (Atkins 2011) and Twitter (Benjamin 2011). Larger-scale corpus-based studies have also examined vague language (arguably a sub-set of hedging) in both spoken and written discourse (Channell 1985, 1994; Kennedy 1987). To date, however, no studies offer an insight into hedging use across these different communicative genres. The current study aims to fill this research ‘gap’.

3 Analysis

3.1 Study Questions

To build on the foundations of what was previously discovered about levels of formality in e-language (using CANELC – Knight et al. forthcoming, 2012), the following sections focus on the use of hedges in more detail. The analyses address the following research questions:

  • Is there a significant difference in the frequency of hedging used:

    • Between all modes of e-language in CANELC, compared with data from the spoken and written BNC?

    • Between the different topic categories of data included in CANELC?

  • What do the frequency and use of this phenomenon reveal about the levels of formality within and across the different modes of e-language in CANELC?

To answer these questions, the following sections present results from an analysis of the use of hedges in e-language compared to samples of approximately one million words each from the written and spoken BNC (which contain 968,267 and 982,712 words respectively). Given that the sizes of the corpora differ slightly, the results are normalised using statistical measures so that accurate comparisons can be made. The analyses are conducted using Rayson’s WMatrix software (2003), which includes utilities for carrying out word, cluster and parts of speech queries (centring around the production of key word lists and key-word-in-context, KWIC, outputs), and allows researchers to explore the patterned use of these features in a corpus. With the use of the WMatrix semantic tagger, common themes and semantic associations connected with corpora can also be queried using the software.
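The chapter does not spell out which normalisation was applied, so the following is only a minimal sketch of one common approach: expressing raw counts as a rate per million words. The function name `per_million` and the raw count of 500 are invented for illustration; only the two BNC sample sizes come from the text.

```python
def per_million(raw_count: int, corpus_size: int) -> float:
    """Relative frequency expressed as occurrences per million words."""
    if corpus_size <= 0:
        raise ValueError("corpus_size must be positive")
    return raw_count / corpus_size * 1_000_000


# Sample sizes quoted in the text for the written and spoken BNC samples.
WRITTEN_BNC_SIZE = 968_267
SPOKEN_BNC_SIZE = 982_712

# Hypothetical raw count: the same form occurring 500 times in each sample.
written_rate = per_million(500, WRITTEN_BNC_SIZE)
spoken_rate = per_million(500, SPOKEN_BNC_SIZE)

print(f"written: {written_rate:.1f} per million words")
print(f"spoken:  {spoken_rate:.1f} per million words")
```

An identical raw count yields a slightly higher rate in the smaller sample, which is why raw counts from corpora of unequal size are not compared directly.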

In addition to the ‘data’ taken from communication performed across the different e-language modes, CANELC also contains detailed metadata records: data about the data. Metadata is critical to a corpus as without it ‘the investigator has nothing but disconnected words of unknowable provenance or authenticity’ (Burnard 2005) to examine. As outlined by Knight (2011: 31, based on Burnard 2005) ‘the inclusion of this information assists in identifying the name of the corpus (administrative metadata), who constructed it, and where and when this was completed (editorial metadata), together with details of how components of the corpus have been tagged, classified (descriptive metadata), encoded and analysed (analytic metadata)’. Collectively, this information allows us to reconstruct aspects of the reality of the discursive context in which specific e-language messages were sent, allowing us to frame the language in a more contextually accurate way. The following metadata is included in CANELC:

• Author’s (and receiver’s) name, age, gender, nationality

• Content

• General topic of content

• Follow up comments/responses

• ‘Other’ relevant information

• Date and time composed

• Intended recipient

Regarding ‘general topic of content’, it is worth noting that in addition to the metadata information, data in CANELC is also broadly categorised by topic. This is based on the schema presented in Fig. 3.

Fig. 3 Topics featured in CANELC

Topics in category ‘A’ are aligned with more public concerns such as news, politics and current affairs, while those in category ‘F’ are more aligned with personal issues such as personal and daily life (with B-E existing almost on a continuum between these poles). The distribution of the CANELC data, by number of words, across these different topic categories is represented in Fig. 4.

Fig. 4 Approximate distribution of words across the 6 topic categories of CANELC (refer to Fig. 3 for data key)

Figure 4 illustrates that across the entire corpus there is a dominance of contributions in categories ‘F’ and ‘A’. The majority of data in category ‘F’ is included in the SMS messages and personal emails included in the corpus, which primarily contain language discussing topics concerning aspects of personal and daily life. More public, outward facing, topics such as business, finance and the news are frequently featured in the language of the blogs, tweets and discussion boards, although the tweet and blog sub-corpora have the most balanced distribution of contributions/word count across each of the thematic categories. Finally, CANELC also includes a number of business emails, which contribute to the high frequency of data type ‘A’.

While the assignment of the content to these thematic groupings was fairly transparent in some cases, other messages were slightly more ‘fuzzy’ and flexible, insofar as they discussed multiple topics ranging across the different categories. In these instances, when compiling CANELC, the data was given a range of category codes, so A/B/C rather than simply ‘A’. For the purpose of Fig. 4 and the analysis seen in Sect. 3.3, individual contributions are counted once across these groupings, so they are classified according to, crudely, their ‘best fit’. That is, even in instances where multiple categories were assigned, only one single category was counted. This was, subjectively, the category which is descriptively the ‘most’ appropriate for these contributions, that is, the one that is approximately the most representative/appropriate of that data. In other words if data was assigned the categories A/B/C, for example, and the content was described as being most dominantly ‘business related’ [i.e. category A], content was re-labelled as being category ‘A’ only.

The inclusion of this categorisation scheme provides a helpful way-in to querying levels of formality in CANELC as, in parallel with previous comments, the division of public vs. private can affect the levels of formality in a text. So comparisons of hedging within and across both the modes of data in CANELC and these different topics, can help us to assess how closely e-language compares with more formal (akin to the written end of the continuum) and informal discourse (positioned toward the spoken end of the continuum).

Given the level of contextual specificity, ‘hedging can be achieved in indefinite numbers of surface forms’ (Brown and Levinson 1987: 146), making it potentially difficult to draw up a ‘list of hedges’ (Clemen 1997: 236, 243; Nikula 1997: 190) to use as a basis of a study of this phenomenon. Despite this, across the literature there are specific words or expressions that are often used as hedges. For example, as outlined by Farr et al. (2004: 13–14) the most salient hedges are ‘core modal verbs’ and ‘verbs with modal meaning’ (O’Keeffe et al. 2007: 175 – e.g. might, may), ‘clausal items’ (e.g. I think, you know), ‘noun based expressions’ (e.g. the thing is), ‘degree adverbs’ (e.g. really, necessarily) and ‘stance adverbs’ (e.g. of course, sort of) and so on. The hedges that the present study will focus on are some of the most common forms that have been examined in past studies of this topic (based on Biber et al. 1999; Carter and McCarthy 2006; O’Keeffe et al. 2007: 175), and are forms which are frequent in the CANCODE (Cambridge and Nottingham Corpus of Discourse in English), BNC, CEC (Cambridge English Corpus) and CANELC corpora. These are listed in Fig. 5. These terms were queried in the CANELC data.

Fig. 5 Some common hedges in spoken and written discourse

Some of the adverbs listed here, such as just, have the softening hedging function, but are also often used with intensifying and specifying functions in discourse. Just do it; it’s just about five o’clock and we’ll only be a couple of minutes late are examples of this. Of course is another example: this cluster can be used as a hedge when it has a pragmatic function, but it can also be used emphatically and directly, as in Are you coming? Of course. So although we can define some frequent forms of hedges, a more qualitative screen by screen study is needed if we are to drill down into specific functions. The current study undertakes a more quantitative approach, but a more qualitative assessment of the data would be welcomed in future studies of this nature and is, indeed, necessary.
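The limits of a purely form-based query can be illustrated with a minimal sketch. This is not the WMatrix procedure used in the study; the function `find_hedges`, the short list of forms and the sample sentence are all invented for illustration. It counts candidate forms and returns a simple key-word-in-context view of each hit.

```python
import re
from collections import Counter


def find_hedges(text, hedges, window=30):
    """Return (form, left context, right context) for every candidate match."""
    # Try longer forms first so that multi-word clusters are matched whole.
    ordered = sorted(hedges, key=len, reverse=True)
    pattern = re.compile(
        r"\b(" + "|".join(re.escape(form) for form in ordered) + r")\b",
        re.IGNORECASE,
    )
    hits = []
    for match in pattern.finditer(text):
        left = text[max(0, match.start() - window):match.start()]
        right = text[match.end():match.end() + window]
        hits.append((match.group(1).lower(), left, right))
    return hits


# Invented sample text and an illustrative subset of forms.
sample = ("Just do it. I think it is just about five o'clock, "
          "so maybe we are kind of late.")
forms = ["just", "maybe", "kind of", "I think"]

hits = find_hedges(sample, forms)
counts = Counter(form for form, _, _ in hits)

for form, left, right in hits:
    print(f"{left:>30} [{form}] {right}")
print(counts)
```

Both occurrences of just are returned, although the first (Just do it) is directive rather than softening. Separating the two uses requires reading each concordance line, which is the qualitative step the paragraph above describes.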

3.2 Frequency of Hedges

The frequency of use of the terms in Fig. 5 was queried across the entire corpus as well as within each mode, and is presented and compared here along with the frequency of use seen in the written and spoken BNC sub-corpora. Results are shown in Fig. 6. Log-likelihood scores are also presented in this figure. These provide a statistical measure of the relationship between the frequencies, indicating whether specific differences are likely to have arisen by chance or not. In this figure, a ‘+’ log-likelihood score indicates that a particular rate of use is statistically higher in the CANELC corpus compared to the other parameter defined, while a ‘−’ log-likelihood indicates a statistically lower frequency of use in CANELC. Numbers in bold indicate that there is a statistically significant difference (measured using a log-likelihood score) in the frequency of usage across specific modes/genres to a p value of <0.01 (with a critical value range of 6.63–10.82) while those in italics mark significance to a p value of <0.001 (critical value of 10.83). So a ‘+’ indicates an overuse in CANELC compared to the listed parameter and thus an underuse in the given category.
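As an illustration of how such scores can be derived, the sketch below uses the standard two-corpus log-likelihood calculation commonly used in corpus comparison. It is assumed, not confirmed by the text, that this matches what WMatrix reports. The function names and the example counts are invented; the critical values are those quoted above.

```python
import math

# Critical values quoted in the text (one degree of freedom).
CRITICAL_P01 = 6.63
CRITICAL_P001 = 10.83


def log_likelihood(freq_a, size_a, freq_b, size_b):
    """Two-corpus log-likelihood for one item (corpus sizes must be positive)."""
    total_freq = freq_a + freq_b
    total_size = size_a + size_b
    # Expected counts if the item were used at the same rate in both corpora.
    expected_a = size_a * total_freq / total_size
    expected_b = size_b * total_freq / total_size
    score = 0.0
    if freq_a > 0:
        score += freq_a * math.log(freq_a / expected_a)
    if freq_b > 0:
        score += freq_b * math.log(freq_b / expected_b)
    return 2 * score


def signed_log_likelihood(freq_a, size_a, freq_b, size_b):
    """Positive when corpus A uses the item at the higher rate, else negative."""
    score = log_likelihood(freq_a, size_a, freq_b, size_b)
    return score if freq_a / size_a >= freq_b / size_b else -score


def significance(score):
    """Map a (signed) score onto the two thresholds used in the text."""
    magnitude = abs(score)
    if magnitude >= CRITICAL_P001:
        return "p < 0.001"
    if magnitude >= CRITICAL_P01:
        return "p < 0.01"
    return "not significant"


# Hypothetical counts: 200 hits in corpus A, 100 in corpus B, equal sizes.
example = signed_log_likelihood(200, 1_000_000, 100, 1_000_000)
print(round(example, 2), significance(example))
```

Because the expected counts are computed from the corpus sizes, the measure already allows for corpora of slightly different sizes. The sign is attached separately, following the ‘+’ and ‘−’ convention described above.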

Fig. 6 The frequency of common forms of hedges used in CANELC, compared to the spoken and written sub-corpora from the BNC

In Fig. 6 we see that, for the terms actually, just, you know, probably, quite, really, thing, there is a significant underuse in CANELC compared to the written BNC corpus, while there is a significant overuse compared to the spoken BNC sub-corpus (to p < 0.001). Probably is significantly underused in the twitter data and overused in the email data (to p < 0.01 and p < 0.001) while really is overused in the discussion boards and SMS messages compared to rate of use across CANELC (to p < 0.001). Just is significantly underused in the blog data and overused in the SMS data, while you know is underused in the blog and discussion board data but overused in the email and SMS data and just is underused in the email but overused in the discussion board data. Finally, there is no real significant difference in the rate of use of quite and actually across the different e-language modes.

The only item that is significantly overused, at p < 0.01, in the spoken BNC and underused in the written compared to CANELC is likely. There are, however, some terms which are overused in CANELC, compared to both sub-corpora. These include apparently, guess and maybe. Of these terms, apparently is used at a near-consistent rate across all of the modes in CANELC, while guess is underused (to p < 0.001) in the blogs and significantly overused in the SMS (to p < 0.01) when compared to the other modes. Maybe and likely, on the other hand, are both underused in the blogs (to p < 0.001 respectively) but the former is overused in the SMS messages and the latter in the tweets (both to p < 0.01).

I think, kind of, broadly, typically and, to some extent of course are used at a significantly higher rate in CANELC than the written BNC (to p < 0.01), but no significant difference exists between the rate that they are used in the spoken BNC (aside from of course where the difference is to (p < 0.001)). Conversely, there is an underuse of the expression normally in CANELC compared to the spoken data (to p < 0.01) while there is no significant difference between the use of this term when compared to the written corpus. Kind of is used at a consistent rate across all modes in the corpus, while typically and normally are used at consistent rates across all modes aside from tweets and SMS messages where a slight underuse occurs when compared to CANELC respectively (to p < 0.001). Similarly of course is slightly underused in the SMS messages but slightly overused in the discussion board data (to p < 0.001) and I think is slightly overused in the email data, but used consistently across the other modes in CANELC.

Figure 6 also indicates that there is a slight overuse of only, seemingly and surely compared to the spoken BNC (to p < 0.01) while no difference exists between the rate of use of these words in CANELC versus the written BNC.

Frequently, possibility, relatively and, to some extent, generally are all underused in CANELC compared to the written BNC, while there is a near-consistent rate of use of these terms when compared to the spoken BNC data (to p < 0.01 aside from generally which is to p < 0.001). The rate at which frequently is used across each of the modes in CANELC is near-consistent while there is an overuse of possibility in the email data, an underuse of relatively in the tweets (both to p < 0.001) and a significant underuse of generally in the SMS and tweet data (to p < 0.01). Similarly, only is used at a near-consistent rate across the different modes while seemingly is slightly underused in the twitter data and surely is underused in the SMS data but overused in the discussion board data (to p < 0.001).

Necessarily, usually and sort of are all underused in CANELC when compared to the spoken BNC (all to p < 0.01) and, similarly, the first two of these terms are also underused compared to the written data (to p < 0.001 and p < 0.01 respectively) while sort of is slightly overused compared to the written BNC (to p < 0.001). Necessarily and sort of are used at consistent rates across all modes aside from the tweets, where a significant underuse of sort of can be seen when compared to CANELC (to p < 0.01). Comparatively, usually is significantly overused in the discussion board data and underused in the email data compared to the other modes included in CANELC (to p < 0.01 and p < 0.001 respectively).

Finally, we see no statistical difference in the use of arguably and partially when comparing CANELC to the spoken and written BNC, or across the individual modes of e-language.

3.3 Patterns of Use Across Topics

In addition to exploring the use of the hedges across the different modes in CANELC, we are able to look in more detail at differences in use across the topic categories detailed in Fig. 3. Figure 7 documents the frequency of word use across the different topic categories and provides a log-likelihood score of difference in use for each category compared to CANELC (note − a ‘+’ indicates an overuse in CANELC compared to a category, thus an underuse in the given category), while Figs. 8 and 9 tabulate the frequency of use across these topics compared to the spoken and written BNC (note − a ‘+’ indicates an overuse in the BNC compared to a category). Six sub-corpora of the CANELC data were created (for A–F) to draw these comparisons in the data.

Fig. 7 The use of hedges in the topic categories in CANELC

Fig. 8 The rate of use of hedges in the topic categories in CANELC, compared to the spoken BNC

Fig. 9 The rate of use of hedges in the topic categories in CANELC compared to the written BNC

From Fig. 7 we can see that none of the hedging terms are overused in data classified under topic category ‘A’ compared to CANELC, although just, maybe, quite and really are all significantly underused (to p < 0.01) and actually and typically are slightly underused (to p < 0.001). Similarly, Fig. 7 shows an underuse of a bit, like and stuff in this category when compared to the corpus as a whole (to p < 0.01). As documented in Figs. 8 and 9, actually, as used in category ‘A’ in CANELC occurs at a far less frequent rate than it does in the spoken and written BNC (both to p < 0.01) and the converse is true for relatively (to p < 0.01). While for frequently, likely, seemingly and partially, there is a higher rate of use in category ‘A’ than the spoken BNC, but a near consistent rate of use to the written corpus (to p < 0.01, p < 0.01 and p < 0.001 respectively).

Surely and typically are used at a higher rate in the category ‘A’ data in the spoken BNC data, but while surely is used at a near consistent rate to the written BNC, typically is far less frequent in A. The converse of this is true for typically. While arguably, possibility, roughly, only and generally, when classified in category ‘A’ occur at near-consistent rates to the spoken and written BNC data (as seen in Fig. 8) and relatively, although nearly-consistent to the spoken BNC, is used at a much higher rate in the topic ‘A’ data than the written BNC (to p < 0.01, as seen in Fig. 9).

For topic ‘B’, that is, topics covering ‘culture, literature and the arts’, ‘fashion’ and ‘teaching, academia and education’, Fig. 7 indicates that the only significant differences are in the rates of use of quite and really, both of which are used at a rate higher than the average rate seen in CANELC.

Necessarily, normally, broadly and usually are terms that are most commonly classified under topic category ‘B’ in CANELC. The rates of use of these terms in this category are shown to be nearly consistent with the rates of use in the spoken and written BNC, as no significant differences are outlined in Figs. 8 and 9. There is, however, an underuse of sort of (which is also most commonly classified under category ‘B’) in the category ‘B’ data compared to the spoken BNC, while near-consistent rates to the written BNC are shown.

Figure 7 indicates that there are no significant differences in the use of the search terms for topic ‘E’. There is, however, a significant underuse of really in CANELC compared to ‘C’, and an underuse of quite and an overuse of surely compared to ‘D’. These are the only real differences seen for these categories (to p < 0.01). None of the hedges explored were more frequently used in the data classified under topic category ‘E’ or ‘C’ than in the other topic categories. The only ones frequently used in ‘D’ were arguably and sort of. Arguably is overused in this category compared to the average use in the spoken BNC, but near-consistent with rates of use in the written BNC, while sort of is used at a significantly lower rate in the topic ‘D’ data than in the spoken and written BNC (to p < 0.01).

Finally, Fig. 7 highlights that just, maybe and really are all used at a significantly higher rate in the data for category ‘F’ than the CANELC average (all to p < 0.01) and usually is used at a lower rate than the CANELC average (to p < 0.01). The first of these terms are also significantly overused compared to the spoken BNC, but significantly underused compared to the written BNC. It is in the use of terms in this category that we see the most marked difference in frequency rates when compared to the written and spoken BNC data (Figs. 8 and 9).

Apparently, guess, just, maybe, stuff, or so and a bit are all used at a significantly higher rate in CANELC compared to both the spoken and written data (all to p < 0.01, aside from a bit and or so, which are to p < 0.001 for the spoken and written data respectively), while like, quite, you know and thing are all underused in the category ‘F’ data compared to the spoken BNC but overused when compared to the written data (all to p < 0.01). Kind of, I think, probably and really are all significantly overused in the category ‘F’ data when compared to the written BNC but are used at near-consistent rates to the spoken excerpt (to p < 0.01). Conversely, sort of is significantly underused in this data compared to the spoken BNC, but used at a near-consistent rate compared to the written data, and of course is used at near-consistent rates in the category ‘F’ data compared to both the written and spoken BNC.

4 Discussion

Of the hedges examined, the most commonly used forms featured in CANELC were:

From this we can surmise that:

  1. Of the forms examined, the most frequent hedge used in CANELC is the adverb just, followed by really and only.

Seven of the top ten of these hedges featured in Fig. 10 were shown to be significantly underused in CANELC compared to the spoken BNC but overused compared to the written BNC. The first of these adverbs were also shown to be frequently used in the study of hedging in LCIE (Farr et al. 2004), but none were noted as common hedges in studies of written academic discourse (see Channell 1990; Clemen 1997; Gries and David 2007). As discussed by Atai and Sadr (2006), full verbs, nouns and adjectives (in that order) are often the most commonly used hedging forms in more formal, written contexts. Although hedges of these forms were common in the data, they were used far less frequently than the adverbial forms. This suggests that, by form alone, the use of hedging in e-language shows some clear similarities with that found in more informal, spoken discourse.

Fig. 10 Rank order of the 30 hedges in CANELC (by frequency of use)

More generally, of the 30 hedges examined, 15 were found to be more frequent in the spoken than in the written BNC sample. Of these terms, 11 were significantly underused in CANELC compared to the BNC (10 to p < 0.01 and 1 to p < 0.001) while only 2 were overused in CANELC. Similarly, there was a higher rate of underuse of the 15 terms most frequently used in the written data, although this was only seen with 7 of the terms (with 2 of these 15 being overused in CANELC). Across all 30 terms, we saw that 12 of them were significantly underused and 7 overused in CANELC compared to the spoken data, while 15 were overused and 8 were underused in CANELC compared to the written data. This can be summarised as follows:

  2. Hedges that were most frequently used in the spoken rather than written BNC sample (and vice versa) were used at a significantly lower rate in the e-language data.

  3. Of the forms analysed, a higher proportion were significantly overused rather than underused in CANELC when compared to the written data (15 vs. 8).

  4. Of the forms analysed, a higher proportion were significantly underused rather than overused in CANELC when compared to the spoken data (12 vs. 7).

These findings suggest that the rate of hedging use in the e-language data is inconsistent with typical rates in spoken and written discourse. While more hedges were used compared to the written data, far fewer were used than in the spoken data. This provides an argument for classifying e-language as its own distinct genre (as suggested in Sect. 2).

When comparing the patterns of use across the different modes of data we also see the following:

  5. Emails and discussion boards contained fewer disparities in the rate of under/overuse of specific hedging forms than other modes of e-language (i.e. they were the most ‘similar’).

  6. The SMS, discussion board and Twitter data contained the most disparities in the rate of under/overuse of specific hedging forms of all the modes of e-language (i.e. they were the least ‘similar’ modes of e-language).

In terms of relative frequencies (calculated as the number of hedges used per word in each of the modes) we see that:

  7. Hedges were used at a more frequent rate in the SMS and discussion board data than in the other modes (1:72 words and 1:86 words), while they were used at a near-consistent rate across the Twitter, email and blog modes (1:101, 1:103 and 1:105 respectively).
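The relative frequencies above are expressed as one hedge per N words. A minimal sketch of that arithmetic follows; the function names and the counts in the usage note are hypothetical and are not the actual CANELC mode totals, which are not given in this section.

```python
def words_per_hedge(hedge_count, word_count):
    """Return N, the (rounded) number of words per hedge in a sub-corpus."""
    if hedge_count <= 0:
        raise ValueError("hedge_count must be positive")
    return round(word_count / hedge_count)


def as_ratio(hedge_count, word_count):
    """Format the relative frequency in the 1:N style used in the text."""
    return f"1:{words_per_hedge(hedge_count, word_count)}"
```

For example, a hypothetical sub-corpus of 72,000 words containing 1,000 hedges would be reported by `as_ratio(1000, 72000)` as `1:72`. A smaller N therefore indicates denser hedging.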

Again, this is an interesting finding, as it is in the ‘most immediate’ form of e-language, SMS messages (which show a shorter delay in the response times to messages in CANELC), that there is a tendency for a higher number of hedges to be used. For the SMS messages, the relationship between the sender and recipient is often ‘fixed’, with messages being directed at individuals or groups of people known to the sender, and the messages are often classified as being of the ‘personal and daily life’ topic. The need for hedging to mitigate against potential face threats is therefore assumed to be reduced, so the reverse finding is interesting here. Similarly, while it is not necessarily the case that discussion board members ‘know’ each other personally, this mode of e-language often involves a fixed community of contributors who respond to each other regularly, creating a closeness between those involved.

The data also reveal that dramatic differences are seen in frequency rates across the different topic categories, compared to the corpus as a whole. Of all the hedges analysed, the most common topic of the content was classified under category ‘F’. When compared to the BNC, we saw that those terms in category ‘F’ were statistically overused in the ‘F’ data compared to both the written and spoken BNC. This was true of 8 of the 17 terms featured under the category ‘F’ data in Fig. 8 (to p < 0.01 or p < 0.001). These patterns can be summarised as follows:

  8. Based on frequency, content classified under the topics in categories ‘A’ and ‘F’ used more hedging than the other topic categories.

  9. Of the hedges analysed, all were, on average, used at a less frequent rate in each of the topic sub-corpora when compared to the written BNC.

  10. While all hedges were also used at a less frequent rate in the topic sub-corpora than in the spoken BNC, the difference in rate of use was less significant than when compared to the written BNC.

  11. Hedges used in topic categories ‘B’, ‘C’ and ‘D’ were underused and overused at a near-consistent rate when compared to the spoken BNC. Hedges used in the category ‘A’ data were most significantly underused when compared to the spoken BNC.

As is perhaps to be expected, then, the more formal and the more ‘spoken’ topic categories (i.e. interpersonal contexts, category ‘F’) witnessed a higher rate of hedging use than was the case with the other topics. As we saw earlier, spoken discourse often utilises more hedges than written discourse, but more formal spoken and written contexts use more hedges than the informal ones. The content which concerns matters related to personal and daily life is more akin to spoken discourse (although at the more informal end), so the more extensive use of hedging in this category is as expected. Similarly, the topics in category ‘A’ are most akin to ‘formal’ discursive contexts (across both written and spoken genres), so the frequent use of hedging also aligns with expectations.

If we look at some specific forms of hedging in more detail, we see that kind of and sort of are two hedges which have previously been found to be particularly frequent in formal language contexts, specifically academic discourse (Biber et al. 1999: 560–56; Poos and Simpson 2002: 1). We would thus expect them to be more prevalent in the content classified under category ‘B’, in ‘teaching, academia and education’. This pattern was not mirrored in the e-language content and, in fact, there was a general underuse of both of these terms across the topics, modes and corpus when compared to the spoken and written data.

5 Summary

This chapter has revealed that there is no clear-cut relationship between the use of hedging in e-language and its use in written and spoken genres of discourse. The use of hedging across different communicative contexts (defined by topic categories) and across the different modes of e-language is fluid rather than fixed. When compared to standard (BNC) written and spoken modes of discourse, however, the forms of hedging isolated for the purposes of this study appear to behave in a way that suggests greater internal similarity across the modes than similarity with the standard (BNC) written and spoken data. As initially suggested by Crystal (2003), there appears to be an argument for conceptualising e-language as its own distinct variety on the continuum of formality, between spoken and written discourse. The more immediate forms of e-language (e.g. SMS messages) are positioned closer to the ‘spoken’ end, while the emails and blogs are better positioned towards the more formal, written end (based on what we have found here).

To build on what has been found here, a more qualitative, screen-by-screen study of the data would allow us to examine more closely the specific functions of the common hedging forms analysed here. A closer observation of hedging use between specific contributors (according to gender and relationship, for example) may also help us to create a clearer profile of use across the different modes. Finally, a focus on a wider range of hedging forms, a clearer distinction between the individual functions of forms in specific contexts, and an extension of the focus to synchronous forms of e-language (e.g. IMs) would add to the discussions. There is scope to carry out such investigations in future studies of this nature.