Introduction

Management research is mainly carried out using psychological/managerial construct analysis. A construct is “a theoretical label that is given to some human attribute or ability that cannot be seen or touched, because it develops in the brain” (Brown 1988, p. 103). The standard inputs for empirically studying these constructs are obtained from structured questionnaires. For example, transformational leadership (TL), “a process by which one or more people influence others to pursue a commonly held objective” (Bohara and Tiwari 2015, p. 383), has been extensively measured through structured questionnaires (e.g., Avolio and Bass 1991; Carless et al. 2000; Dönmez and Toker 2017; Rafferty and Griffin 2004). According to Edwards (2008), these instruments differ in the content representing TL (nonstandardized TL content) and commonly require administration fees. Moreover, Andersen (2018) asserts that more clarity in the construct operationalization is required, and Valero and Jang (2020) warn of frequent problems concerning social desirability bias in TL instruments.

Recently, however, there has been increased scientific interest in developing and validating construct scales based on term dictionaries using text mining (TM). For example, some authors have studied entrepreneurial orientation (Short et al. 2010), communication, critical thinking, and leadership skills (Campion et al. 2016), health responsibility (Kjellström and Golino 2018), personal values (Ponizovskiy et al. 2020), and organizational culture (Pandey and Pandey 2017).

Although previous studies have generated valuable insights for organizational management research and related fields, we aim to address two concerns to encourage and efficiently facilitate the future development and validation of constructs from texts.

The first concern is the traditional procedure to study content validity, which is almost the same as that currently used in structured questionnaire data. It is based on traditional expert rater and the theory-driven approach (for construct definition and dimensionality) (Pandey and Pandey 2017; Ponizovskiy et al. 2020; Short et al. 2010). Such a standard procedure is valuable from the interpretivism paradigm; however, it is abstract and difficult to replicate on a large scale (e.g., big data) and presents complications in reproducing, replicating, and generalizing any findings. Besides, such a procedure is rarely informative about the linguistic properties of construct scales. These properties are fundamental, because the content reflected in textual data is a function of its linguistic constituents, structural relationships, and variables of the linguistic context (Espinosa 2017; Firth 1957). However, “most studies on terminology focus on nouns and do not take into account the use of terms in context.” (Campos and Castells 2010, p. 872). Moreover, considering Wüster (1979) (e.g., the importance of nouns), Cabré (1999, 2002), and Campos and Castells (2010) (e.g., verbs/adjectives as potential specific-domain units), a comprehensive dictionary construction should consider all these parts of speech (PoS).

Furthermore, the standard procedure (expert rater, theory-driven approach, and multiple/representative corpora) to examine the content validity of construct scales based on texts could be complemented by considering measures based on a combination of linguistics, computational, and psychometrical resources from a text-driven perspective. Previous studies (e.g., Campion et al. 2016; Kjellström and Golino 2018; Pandey and Pandey 2017; Pérez-Rave et al. 2021a; Ponizovskiy et al. 2020) highlight the value of adopting a mixed perspective (based on theory/deductive and data/inductive). However, the use of a text-driven approach in this psychological/managerial field has been limited to the use of multiple/representative corpora during dictionary construction to expand its vocabulary (Ponizovskiy et al. 2020). Hence, the nature of its linguistic properties and how these properties can support construct content validity are rarely identified and discussed.

The second concern is the construct representation based on observable indicators. The traditional representation comprises a consolidated list of words for each construct (one-item perspective, e.g., Campion et al. 2016; Ponizovskiy et al. 2020) or a multiple-item view. In the latter, a merged list of terms is randomly split into parcels (frequently three parcels for each construct, e.g., Pandey and Pandey 2017). Although this procedure tends to produce correlated items and enables confirmatory factor analysis (CFA), the resulting representation of the construct reflected in texts is justified merely on numerical grounds, lacks a theoretical rationale, and does not guarantee the representativity of valuable linguistic manifestations of the construct. For example, the traditional merged list of terms or random parcels of terms, without controlling PoS, does not guarantee the representativity of perceptual entities (e.g., nouns or noun phrases), qualities (e.g., adjectives or adjective phrases), and actions (e.g., verbs or verb phrases) in the measurement model, which involve different brain mechanisms (Fyshe et al. 2019; Haan et al. 2000; Martin et al. 1995). Providing empirical evidence on this linguistic representation of models during the construct development and validation stages enables more construct completeness and new lexical, syntactic, and semantic analyses. Moreover, this representation differentiates the construct variability and errors/PoS during the CFA from an adapted correlated-uniqueness perspective (Batista-Foguet and Coenders-Gallart 2000; Marsh et al. 1992). However, it is currently unknown whether this representation (based on correlated PoS) could provide good performance concerning its psychometric properties.

The paper addresses these two concerns to propose a comprehensive alternative for studying psychological/managerial latent variables reflected in texts. The paper is based on and extends valuable suggestions from Pandey and Pandey (2017), Ponizovskiy et al. (2020), and the works of their predecessors (e.g., Short et al. 2010) by proposing a framework for developing and validating scales of psychological/managerial constructs from a theory/text-driven approach.

The paper provides four contributions. The first is complementary measures (coherence, polarity, PoS balance, convergence/differentiation, and commonality) to inform/guide content validity from a text-driven approach; these measures provide insights within and between dictionary construction steps. The second is a method to refine construct dictionaries, considering linguistic properties, entitled embedded voting, which uses word2vec and cosine similarity and emulates expert judgments from a text-driven perspective. The third is a dictionary of terms representing transformational leadership (TL) and original evidence on six performance criteria (content/internal/external validity, reliability, equity, and practical value). The fourth is an integrative framework to develop and validate measures of psychological/managerial constructs from texts. It consists of three stages and 16 steps that incorporate the first three entitled contributions and employ linguistic, psychometrical, and computational resources.

In addition, the contributions of this paper allow us to extend frameworks for processing/analyzing textual data to discover and examine latent topics in marketing domains, such as those by Markham et al. (2015) in new product development (defining questions, identifying data sources, creation of dictionaries and rules exploring marketing terms, data collection/analysis/assessment, data scoring, and decision-making); Amado et al. (2018) in big data in marketing (collecting literature, dictionary creation, building a document-term matrix, and performing topic modeling); Mathaisel and Comm (2021) in political marketing (preprocessing texts, performing topic modeling methods, sentiment analysis, and data visualization); and Pérez-Rave et al. (2021a) in multi-criteria decision-making processes at the strategic level with several applications to customer analytics (multi-criteria case, structuring of the case from a data-driven approach, scoring of latent topics, alternative prioritization, and robustness of the solution). These previous works are valuable for information extraction/recovery (e.g., topic analysis) from texts but do not cover the field of developing and validating psychological/managerial constructs from texts. This field demands a psychometrical perspective, a mixed approach (theory/data; Short et al. 2010; Pandey and Pandey 2017; Pérez-Rave et al. 2021a; Ponizovskiy et al. 2020), and a move beyond the traditional focus on lexical analysis, predefined computational methods (e.g., topic modeling; Latent Dirichlet Allocation: LDA, Blei et al. 2003), and typical performance indicators merely focused on prediction capabilities. The area of psychological/managerial constructs requires qualitative and quantitative psychometrical rigor to ensure the quality of the outputs (e.g., dimensionality and construct scores): theoretically supported and empirically plausible/valid/reliable (Bergkvist and Eisend 2021; Fried 2017; Lievens et al. 2002; Martínez et al. 2006). However, unlike organizational management disciplines, the consideration of psychometrical resources in computational fields (e.g., text/data analytics) and their application to management/marketing domains are still in their infancy (Strohmeier and Piazza 2013; Pérez-Rave et al. 2020, 2021a, b). Thus, we adopt a mixed perspective to provide a more comprehensive, efficient, objective, and controlled framework in the development and validation of constructs from texts considering linguistic, psychometrical, and computational resources.

In addition, marketing research is intensive in latent variables and covers psychological/managerial constructs intrinsic to customers’ viewpoints, such as service quality, satisfaction, loyalty, perceived value, service experience, and service engagement (Cronin Jr et al. 2021; Pérez-Rave and Muñoz-Giraldo 2014). However, the study of latent marketing variables has been confined to structured questionnaires, which are invasive and only partially explore the construct manifestations because they limit the respondents’ ability to freely express their opinions about an object of study (Kugbonu 2020). Indeed, recent studies warn of inconsistencies and a lack of standardization in the definition and measurement of marketing constructs (Bergkvist and Eisend 2021; Bergkvist and Langner 2017; Ptok et al. 2018). Furthermore, it is accepted in the literature that questionnaire data typically rely on small sample sizes and are commonly affected by social desirability, acquiescence bias, blank responses, and straight-lining (Cornesse and Blom 2020; Grimm 2010; Holmes et al. 2019; Hume 2017; Kalugina et al. 2019). The developed framework is therefore also valuable for marketing research/practice, since it is necessary to discuss and propose new/comprehensive/reproducible forms for operationalizing marketing constructs (Cronin Jr et al. 2021).

This paper is organized into six sections. The "Introduction" section argues the need for the study within the literature. The "Related works on psychological/managerial constructs reflected in texts" section presents the basis of the study. The "Methodology" section provides the developed framework and explains the proposed properties and methods. The "Results of the empirical application of PMTM in the study of TL" section presents the results accompanied by interpretations. The "General discussion" section discusses these results. The "Conclusions and future work" section draws the conclusions and suggests future work.

Related works on psychological/managerial constructs reflected in texts

Table 1 provides a comparative analysis between our framework and the most representative frameworks for the study of psychological/managerial constructs using TM from a theory/data-driven approach.

Table 1 Comparing frameworks suggested by representative studies on construct measurement from texts

Based on Table 1, the present study aims to:

  (i) enable the automated (or semi-automated) execution of dictionary creation and content validation concerning psychological/managerial constructs from texts, controlling PoS, by incorporating linguistic, psychometrical, and computational resources. This configuration of resources aims to address the first concern discussed in the Introduction, to favor its use in large-scale studies (massive data), and to reduce the subjectivity/abstraction in the traditional procedure of developing/validating construct measures from texts (expert raters, theory-driven approach, e.g., Short et al. 2010; Pandey and Pandey 2017; Ponizovskiy et al. 2020). Thus, we develop and use five original properties (coherence, PoS balance, polarity, commonality, convergence/differentiation) that guide the development and content validation of psychological/managerial constructs reflected in texts, which have not been considered in previous works.

  (ii) represent and confirm constructs reflected in texts using a correlated PoS model and considering internal validity, external validity, reliability, equity, and practical value. Thus, we argue for and validate a new psychometrical representation of constructs reflected in texts to ensure the representativity of valuable linguistic manifestations of the construct (perceptual entities, qualities, and actions); hence, this representation transcends the traditional representation based on word parcels (Campion et al. 2016; Ponizovskiy et al. 2020), which is justified merely on numerical grounds. Moreover, our methodological framework empirically validates this construct representation (measurement model based on correlated PoS) using a more comprehensive psychometrical perspective that considers internal/external validity, reliability, equity, and practical value.

  (iii) provide original evidence from a systematic application of such proposals from a theory/text-driven approach to develop and validate a new scale of TL. Thus, we extend the uses of text analytics by considering a construct (TL) not addressed in the reference works, which has been at the top of leadership research for more than three decades with support from structured questionnaires (Alqatawenh 2018; Jackson 2020; Kotamena et al. 2020). “MLQ (Multifactor Leadership Questionnaire, Avolio and Bass 1991, 1999, 2004) is the most frequently utilized TL instrument; its name has become synonymous with TL” (Brown and Keeping 2005, p. 249); however, it still requires effort to increase its operationalization quality (Andersen 2018; Edwards 2008). Our application case represents TL dimensionality from the four dimensions put forth by Avolio and Bass (1991, 1999, 2004): individual consideration; idealized influence; intellectual stimulus; and inspirational motivation.

Methodology

Developed framework

Based on the considerations presented in Table 1, Fig. 1 details the proposed framework, entitled Psycho-Managerial Text Mining (PMTM). This comprises three integrative stages and 16 steps. Stage “A” refers to dictionary development and content validation, assuming that content validity can be “ensured by the plan and procedures of construction” (Nunnally 1978, p. 92, also cited by Pandey and Pandey 2017, p. 17) by adopting a theory/text-driven approach. Stage “B” focuses on confirming internal validity by developing a CFA representation comprising PoS and the correlated-uniqueness approach (Batista-Foguet and Coenders-Gallart 2000). This stage also examines scale reliability and confirms external validity based on a correlation analysis with reference/output variables and group contrasts (Martínez et al. 2006; Pandey and Pandey 2017). Moreover, this stage examines scale equity by considering personal factors, such as gender and age, usually implied in the diversity–validity dilemma (Martínez et al. 2006; Pérez-Rave et al. 2021a). Stage “C” focuses on exploring potential practical uses derived from the scale, which is essential in the implementation science that underlies TM.

Fig. 1 PMTM framework for the creation/validation of construct dictionaries. Note: PoS (parts of speech)

From a text-driven approach, PMTM (see Fig. 1) introduces five proposed linguistic properties, one method to emulate expert voting in lexical decisions (entitled embedded voting), and a representation of the measurement model based on correlated PoS. These are argued and explained in the following section.

Proposed properties for exploring content validity

We conceptualize and operationalize the five properties (content commonality, content polarity, PoS balance, embedded-content convergence/differentiation, and content coherence). The ideation and development of these properties emerge from a creative process utilizing an inductive perspective based on the authors’ knowledge of theoretical/methodological resources in management, psychometry and data science; and recognizing: (a) conceptualizations and methods for studying psychological/managerial constructs from psychometrical (Fried 2017; Lievens et al. 2002; Martínez et al. 2006) and data science approaches based on structured questionnaires (Pérez-Rave et al. 2021b, 2022); (b) the value of PoS to understand speaking styles and the thinking of individuals and organizations (Cabré, 1999, 2002; Ghosh and Mishra 2020; Wüster 1979); (c) the fact that nouns, adjectives, and verbs involve different brain mechanisms during text production (Fyshe et al. 2019; Haan et al. 2000; Martin et al. 1995); (d) the lack of understanding/exploitation of the linguistic context during tasks involved in the development and content validation of constructs reflected in texts, and the high subjectivity and limited reproducibility underlying these tasks (see "Related works on psychological/managerial constructs reflected in texts" section); and (e) the need to incorporate a data-driven approach to complement the theory-driven approach in the study and use of psychological/managerial constructs reflected in texts (Illia et al. 2014; Nunnally 1978; Short et al. 2010; Pandey and Pandey 2017; Ponizovskiy et al. 2020). In "Results of the empirical application of PMTM in the study of TL" section (Results), we empirically examine the proposed properties in the case of TL.

Content commonality

A frequent practice in multidimensional construct dictionary development is to remove shared words (not necessarily stopwords), arguing that this prevents lexical correlation during the dictionary design (Pandey and Pandey 2017; Ponizovskiy et al. 2020). However, using shared words is a natural situation that should be considered in multidimensional constructs. For example, based on attentional homogeneity, it is possible to assume that certain writers with an interest in a psychological/managerial construct may share aspects (e.g., a common body of knowledge, events, trends) about the construct, which are reflected in words used by them (Abrahamson and Hambrick 1997). In addition, according to semantic contextualism, the meaning of a word is defined by the context surrounding it (e.g., “you shall know a word by the company it keeps”; Firth 1957, p. 11). Hence, two lexically similar words can have different meanings depending on their uses/contexts. Likewise, the role of a word in a sentence can vary (e.g., quality, perceptual entity, or action) depending on the PoS that it is representing (e.g., adjective, noun, or verb) (Cheng et al. 2020). The proposed framework introduces content commonality as the construct’s global content reflected on both PoS and the construct dimensions. Table 2 illustrates the data visualization proposal to explore such a commonality considering d dimensions.

Table 2 Frequency matrix of common terms by PoS in exploring content commonality

Here, fijk represents the number of terms in the intersection between the lists of type k (PoS) terms for dimensions (D) i and j; if j = i, then fijk is the total number of terms of dimension i for PoS k (the diagonal of each matrix in Table 2). From Table 2, it is possible to calculate content commonality (C) metrics (by pairs of dimensions, Cijk, and by PoS, Ck) based on classic probability (see Eq. 1):

$$\begin{gathered} C_{ijk} = \frac{{2f_{ijk} }}{{f_{iik} + f_{jjk} }},\quad C_{k} = \frac{{fc_{k} }}{{\mathop \sum \nolimits_{i} f_{iik} }} \hfill \\ \forall k; i \ne j; i, j \in \left\{ {1, \ldots ,d} \right\}, \quad k \in \left\{ {a, n, v} \right\} \hfill \\ \end{gathered}$$
(1)

Based on Maier (1993), who studied commonality in the design of systems/structures in operation management contexts (also cited in Stake 2001), a rule of thumb for commonality between components based on proportions is: zero (0) indicates divergent systems (e.g., structures, family products, programs), 0.25 indicates similar designs, and one (1) indicates identical designs. Stegmann (2014) and Sari and Adriani (2019) assume that similarity scores greater than 0.25 are a minimally acceptable representation of similarity between words. Thus, the dimensions of a multidimensional construct from texts should share certain global aspects, but the shared content should not cover the entire content of any single dimension. Hence, based on Maier (1993), Stegmann (2014), and Sari and Adriani (2019), the suggested C value (and its derived measures, see Eq. 1) should be neither near zero nor high. The study of content commonality based on the proposed procedure can provide insights to improve decision-making processes from a text-driven approach during the development and content validation of psychological/managerial constructs reflected in texts.
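As an illustration of the pairwise metric in Eq. 1, the following minimal Python sketch computes Cijk for one PoS from term lists per dimension. It assumes that the denominator uses the per-PoS totals on the diagonal of Table 2; the function name `pairwise_commonality` and the noun lists are hypothetical and serve only as an example, not as the dictionary developed in this paper.

```python
from itertools import combinations


def pairwise_commonality(terms_by_dim):
    """Return C_ij = 2 * f_ij / (f_ii + f_jj) for every pair of dimensions.

    terms_by_dim maps a dimension name to its list of terms for ONE PoS;
    f_ij is the number of shared terms and f_ii the size of dimension i.
    """
    result = {}
    for a, b in combinations(sorted(terms_by_dim), 2):
        terms_a, terms_b = set(terms_by_dim[a]), set(terms_by_dim[b])
        shared = len(terms_a & terms_b)
        result[(a, b)] = 2 * shared / (len(terms_a) + len(terms_b))
    return result


# Hypothetical noun lists for two TL dimensions (illustrative only).
nouns = {
    "idealized_influence": ["leader", "trust", "ethics", "model"],
    "inspirational_motivation": ["leader", "vision", "trust", "goal", "future", "team"],
}

commonality = pairwise_commonality(nouns)
for pair, value in commonality.items():
    print(pair, round(value, 2))
```

In this toy case, two shared nouns out of four and six terms give a value of 0.4, which would sit above the 0.25 reference point discussed above without approaching one.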

Content polarity

In sentiment analysis in computational fields, polarity is the positive, negative, or neutral score representing the opinion of a linguistic expression (e.g., word, phrase, sentence, or document) (Korayem et al. 2012; Taboada et al. 2011). Several dictionaries enable this purpose; for example, SentiWordNet (derived from WordNet) is a domain-independent accessible resource for studying polarity (Esuli and Sebastiani 2006). Another valuable resource for the study of polarity is AFINN, which is based on the Affective Norms for English Words (Nielsen 2011) and comprises 2477 words scored from − 5 (very negative) to + 5 (very positive) (Sharma et al. 2015).

The importance of dictionary polarity detection in providing complementary evidence for content validity is that it is sensitive to the context in which the construct is used. For example, if a particular construct theoretically represents a favorable attitude or behavior in organizations or individuals (e.g., TL, teamwork, labor motivation), such a construct is expected to be reflected in texts through more positive than negative expressions (e.g., “those satisfied with life use more positive than negative words,” Quercia et al. 2012, p. 965). Hence, if a researcher is designing a dictionary on a favorable construct and its global polarity is negative, it signals that something may be wrong. Likewise, if the research addresses a construct such as laissez-faire or burnout syndrome, the researcher should be warned when global polarity is positive. Thus, polarity detection can be a useful measure to inform researchers of possible content validity problems in construct dictionaries early and automatically, supporting better decision-making processes during the development and content validation stages.

The proposed framework suggests the appraisal of two types of measures: one numerical (e.g., aggregating − 1, 0, and + 1 scores, or − 5 to + 5) and another categorical (negative, neutral, and positive). Numerical polarity facilitates the exploration of such a property based on conventional statistical summary (e.g., mean, deviation). Categorical polarity enables frequency analysis individually (negative, positive, and neutral) for each dimension (contingency tables). Several formal tests can be used analytically to contrast related hypotheses from this representation. For example, determining whether the global polarity is greater than zero in a construct dictionary using mean or median tests from a numerical perspective, or the chi-squared test from a categorical viewpoint.
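The two polarity measures can be sketched in a few lines of Python. The example below is a minimal illustration under stated assumptions: the lexicon `TOY_LEXICON` holds hypothetical AFINN-style scores (not values taken from the real AFINN list), terms missing from the lexicon are treated as neutral, and an exact one-sided sign test stands in for the median test mentioned above.

```python
from collections import Counter
from math import comb
from statistics import mean

# Hypothetical AFINN-style scores in [-5, +5]; NOT the real lexicon values.
TOY_LEXICON = {"inspire": 3, "trust": 2, "support": 2, "respect": 2,
               "fail": -2, "threat": -2}


def polarity_summary(terms, lexicon):
    """Return the mean numerical polarity and the categorical frequencies."""
    scores = [lexicon.get(term, 0) for term in terms]  # unlisted terms are neutral
    labels = Counter(
        "positive" if s > 0 else "negative" if s < 0 else "neutral"
        for s in scores
    )
    return mean(scores), labels


def sign_test_p(n_pos, n_neg):
    """Exact one-sided sign test: P(X >= n_pos) under H0 p = 0.5 (neutrals dropped)."""
    n = n_pos + n_neg
    return sum(comb(n, x) for x in range(n_pos, n + 1)) / 2 ** n


# Hypothetical dictionary terms for a favorable construct.
terms = ["inspire", "trust", "support", "respect", "vision", "fail"]
avg, labels = polarity_summary(terms, TOY_LEXICON)
p_value = sign_test_p(labels["positive"], labels["negative"])
print(round(avg, 2), dict(labels), p_value)
```

With only six terms the test has little power, so the sketch shows the mechanics rather than a realistic decision; a real dictionary would contain far more scored terms per dimension.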

PoS balance

This property is valuable when the dictionary construction is based on the narrow–wide approach (contrary to wide–narrow: an extensive list of terms evaluated through expert judgment to provide a refined list). This approach (narrow–wide) starts with a list of seed terms (e.g., theoretically supported definitions of a construct; narrow list), which is used to identify other semantically similar terms (wide list) to cover more linguistic contexts and to be more generalizable (e.g., “using in domain seed words improves the task performance over generic dictionaries,” Shakurova et al. 2019, p. 13). This narrow–wide approach is based on distributed-dictionary representations that use measures such as cosine similarity to expand the dictionary from a list of seed terms (Garten et al. 2018). In the development/validation of a construct dictionary, the representative list of seed terms should be theoretically supported and consider construct definitions, questionnaire items, and formal descriptions from construct guides (Pandey and Pandey 2017; Ponizovskiy et al. 2020; Short et al. 2010). Consequently, if the chosen seed terms are a reasonable representation of a construct, its linguistic manifestations in terms of PoS are also expected to be maintained, at least regarding the essential components: perceptual entities (nouns), actions (verbs), and qualities (adjectives).
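The narrow–wide expansion described above can be illustrated with a minimal Python sketch. It assumes that the seed list is summarized by the average of its embedding vectors and that candidate terms are kept when their cosine similarity to that average exceeds a cut-off; the 0.25 value, the function names, and the three-dimensional vectors in `TOY_EMBEDDINGS` are hypothetical choices made for this example only.

```python
from math import sqrt


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = sqrt(sum(a * a for a in u)) * sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def expand(seeds, embeddings, threshold=0.25):
    """Return non-seed terms whose similarity to the seed centroid exceeds threshold."""
    seed_vectors = [embeddings[s] for s in seeds if s in embeddings]
    centroid = [sum(column) / len(seed_vectors) for column in zip(*seed_vectors)]
    return sorted(
        term for term, vector in embeddings.items()
        if term not in seeds and cosine(centroid, vector) > threshold
    )


# Hypothetical 3-dimensional embeddings; a real study would use a trained model.
TOY_EMBEDDINGS = {
    "inspire": [0.90, 0.10, 0.00],
    "motivate": [0.80, 0.20, 0.10],
    "encourage": [0.85, 0.15, 0.05],
    "invoice": [0.00, 0.10, 0.90],
}

wide_list = expand(["inspire", "motivate"], TOY_EMBEDDINGS)
print(wide_list)
```

Tagging the seed and expanded lists by PoS after each such step is what makes it possible to check whether the seed PoS distribution is maintained.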

Considering the General Theory of Terminology (Wüster 1979) [the importance of nouns for designating a concept] and the Communicative Theory of Terminology (Cabré 1999, 2002) [adjectives/qualities and verbs/actions can become domain-specific expressions] (Campos and Castells 2010), the proposed framework introduces PoS balance as a valuable property to inform researchers about complementary measures of construct validity. Word tagging from the PoS perspective: (i) describes different components in sentences (e.g., adjectives, nouns, verbs) (Cheng et al. 2020); (ii) provides a reasonable indication of document content without having to examine the entirety of the document considering unique/specific patterns or encoding additional information about the language used in the document (Katyshev et al. 2020; Whittaker and Tucker 2007); and (iii) allows the analyses of statistical properties of words within each class separately (Drożdż et al. 2009). Thus, when a seed dictionary of a construct presents content validity, PoS should reflect a balance according to the studied scenario: (i) In multidimensional constructs, all dimensions have the same PoS statistic distribution; and (ii) in uni/multidimensional constructs, this distribution is maintained during the dictionary expansion steps.

For example, in the first case, if a construct consists of two dimensions, it is expected that both “A” and “B” dimensions present the same PoS distribution (e.g., percentage of nouns, adjectives, and verbs). Such homogeneity ensures that variations in construct scores are mainly derived from construct content instead of considerable PoS variations (e.g., tone, structure, type of text, writing style). In the second case, a seed dictionary based on corpus is not merely a list of words. It also represents underlying linguistic patterns that the construct reflects in textual data. Hence, when PoS distribution is not reported/controlled during the dictionary expansion steps, the seed patterns could be significantly altered, affecting content validity (e.g., atypical PoS). For example, in a notably emotional construct (e.g., happiness), adjectives are expected to represent a crucial portion of PoS distribution, compared with a more rational construct, such as critical thinking. Figure 2 describes the proposed strategy for exploring the PoS balance of the construct content.

Fig. 2 Proposed visualization and conditions for PoS balance. Notes: PoS (parts of speech); a (adjectives); n (nouns); v (verbs)

Based on Fig. 2, Pearson’s chi-squared test is useful to explore the PoS balance between dimensions within each moment (initial, final). To explore the balance between moments, the two contingency tables (initial, final) must be compared with each other. Since each table corresponds to the crossing of the same two response variables under the same PoS classification method, several tests can be used for this comparison, such as the chi-squared test. An adaptation from Grizzle et al. (1969) is proposed in Eq. 2 to calculate the expected values (eijr) under the null hypothesis (homogeneity of the distributions of the two tables; that is, PoS balance between moments):

$$e_{ijr} = f_{ + jr} \left( {\frac{{f_{ij1} + f_{ij2} }}{{f_{ + j1} + f_{ + j2} }}} \right)$$
$$\chi_{{\left( i \right)*\left( {j - 1} \right)}}^{2} = \mathop \sum \limits_{r} \mathop \sum \limits_{i} \mathop \sum \limits_{j} \frac{{\left( {f_{ijr} - e_{ijr} } \right)^{2} }}{{e_{ijr} }}$$
(2)

Here, fijr and eijr are the observed and expected frequencies, respectively, for dimension (row i), PoS (column j), and moment (r; two tables: r = 1, initial, and r = 2, final), and f+jr represents the sum of all observations of column j within moment (table) r. Under the null hypothesis, the statistic in Eq. 2 follows a chi-squared (\({\chi }^{2}\)) distribution with \((i)*(j-1)\) degrees of freedom. The procedure to contrast the hypothesis of homogeneous tables (moments of dictionary development) was automated in Python and R. The R code will be shared on reasonable request to the corresponding author.
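A minimal Python sketch of the computation in Eq. 2 is given next. It is not the authors' implementation; the function name `homogeneity_chi2` and the frequency tables are hypothetical, all column totals are assumed to be positive, and the degrees of freedom follow the expression stated in the text, reading i and j as the numbers of rows and columns.

```python
def homogeneity_chi2(initial, final):
    """Chi-squared statistic of Eq. 2 for two dimension-by-PoS frequency tables.

    initial, final: lists of rows (dimensions), each row a list of PoS counts.
    Returns (statistic, degrees of freedom as stated in the text).
    """
    tables = [initial, final]
    n_rows, n_cols = len(initial), len(initial[0])
    # f_{+jr}: total of column j within table r.
    col_totals = [
        [sum(table[i][j] for i in range(n_rows)) for j in range(n_cols)]
        for table in tables
    ]
    statistic = 0.0
    for r, table in enumerate(tables):
        for i in range(n_rows):
            for j in range(n_cols):
                pooled = (initial[i][j] + final[i][j]) / (
                    col_totals[0][j] + col_totals[1][j]
                )
                expected = col_totals[r][j] * pooled  # e_{ijr}
                statistic += (table[i][j] - expected) ** 2 / expected
    return statistic, n_rows * (n_cols - 1)


# Hypothetical counts: rows are two dimensions, columns are (a, n, v).
seed_table = [[10, 20, 5], [8, 16, 4]]
balanced_final = [[20, 40, 10], [16, 32, 8]]   # same proportions as the seeds
shifted_final = [[40, 20, 10], [10, 32, 30]]   # altered PoS pattern

chi2_balanced, df = homogeneity_chi2(seed_table, balanced_final)
chi2_shifted, _ = homogeneity_chi2(seed_table, shifted_final)
print(round(chi2_balanced, 4), round(chi2_shifted, 4), df)
```

When the final table keeps the seed proportions, the statistic is zero. Assuming SciPy is available, a p-value could be obtained with `scipy.stats.chi2.sf(statistic, df)`.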

Embedded-content convergence/differentiation

In structured questionnaire data, a construct measure is attributed with convergence traits when it correlates with other measures of the same construct or with reference constructs with which such an association is expected (Martínez et al. 2006). Likewise, there are discriminant traits when the construct presents low association with constructs assumed to be less correlated or noncorrelated. In interpreting this procedure in the TM field, we consider a construct dictionary and three reference variables theoretically/logically associated with the construct under study in three categorical magnitudes: high, medium, and low associations. Furthermore, we consider four embedded vectors, one representing the construct dictionary (Vc) and the others the reference variables (Vh, Vm, Vl). Then, if the construct presents convergence/differentiation capabilities, several conditions are linguistically reflected, some of which are represented in the following equation:

$$\begin{gathered} S\left( {Vc_{ik} ,Vh_{ik} } \right) > S\left( {Vc_{ik} ,Vm_{ik} } \right) > S\left( {Vc_{ik} ,Vl_{ik} } \right), \hfill \\ S\left( {Vc_{ik} ,Vh_{ik} } \right) > 0.25\,\forall i \in \left\{ {1, \ldots ,d} \right\},\quad k \in \left\{ {a, n, v} \right\} \hfill \\ \end{gathered}$$
(3)

where S is the cosine similarity between two embedded vectors, i is the dimension, and k is the PoS (nouns, adjectives, verbs). Equation 3 represents the theoretically or logically established relationships in the nomological network of the construct under study. Regarding the linguistic context to explore such relationships, a pretrained word embedding model can be used, such as the standard Google pretrained Word2Vec model (https://code.google.com/archive/p/word2vec/), which covers a vocabulary of 3 million terms (words and phrases) trained using roughly 100 billion expressions from a Google News data set (Mikolov et al. 2013). For example, if a researcher is studying critical thinking, a word embedding vector of the construct can be extracted from the pretrained model using the sum or average of each word embedding present in the construct dictionary (Garten et al. 2018). Likewise, three potential reference word embeddings can be extracted for representing the following expressions: “creativity” (high expected association, Vh), “extraversion” (medium expected association, Vm), and “sport” (low expected association, Vl). Consequently, the researcher can calculate cosine similarity between the embedded vectors and verify Eq. 3. The use of cosine similarity is supported by its success and popularity in TM and related fields to summarize and describe association patterns between term vectors, generally accepting values greater than 0.25 (Sari and Adriani 2019; Stegmann 2014).
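The check in Eq. 3 can be sketched for one dimension and one PoS as follows. This is a minimal illustration, assuming the construct vector is the average of its term embeddings; the function names and the three-dimensional vectors in `TOY_VECTORS` are hypothetical stand-ins for vectors that would, in practice, come from a pretrained model.

```python
from math import sqrt


def cosine(u, v):
    """Cosine similarity S between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = sqrt(sum(a * a for a in u)) * sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def average_vector(words, embeddings):
    """Average the embeddings of the dictionary words found in the model."""
    vectors = [embeddings[w] for w in words if w in embeddings]
    return [sum(column) / len(vectors) for column in zip(*vectors)]


def satisfies_eq3(vc, vh, vm, vl, floor=0.25):
    """True if S(Vc,Vh) > S(Vc,Vm) > S(Vc,Vl) and S(Vc,Vh) > floor."""
    s_high, s_medium, s_low = cosine(vc, vh), cosine(vc, vm), cosine(vc, vl)
    return s_high > s_medium > s_low and s_high > floor


# Hypothetical 3-dimensional embeddings (illustrative values only).
TOY_VECTORS = {
    "analyze": [0.9, 0.3, 0.0],
    "reason": [0.8, 0.4, 0.1],
    "creativity": [0.7, 0.5, 0.1],
    "extraversion": [0.3, 0.8, 0.2],
    "sport": [0.0, 0.2, 0.9],
}

vc = average_vector(["analyze", "reason"], TOY_VECTORS)
eq3_holds = satisfies_eq3(
    vc, TOY_VECTORS["creativity"], TOY_VECTORS["extraversion"], TOY_VECTORS["sport"]
)
print(eq3_holds)
```

With the Google News model, the toy dictionary could be replaced by vectors loaded, for instance, through gensim's `KeyedVectors.load_word2vec_format`, and the check repeated for each dimension i and PoS k.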

Content coherence

Liman et al. (2020) and Sekaran and Bougie (2016) both highlight that expert opinions about the constructs’ items, wordings, and phrases are a reasonable representation of content validity. Patrick et al. (2011) assert that content validity aims to demonstrate that a construct measure is appropriate and comprehensive. On the other hand, based on a text-driven approach, coherence means that humans understand and interpret the patterns discovered from TM (Chang et al. 2009). Thus, the meaning/interpretation of terms included in a construct dictionary is a reasonable (albeit not sufficient) proxy of construct content validity reflected in texts.

A resource increasingly used in big data analytics (intensive in text-driven approaches) is the word cloud, which allows researchers to summarize and describe language graphically and “is more significant if it conveys more information by itself with less information shared by other word clouds” (Cui et al. 2010, p. 123). The proposed framework uses word clouds to summarize and describe the content of the construct under study, combined with surveys of n participants who independently evaluate two word clouds in the case of a unidimensional construct (one for the studied construct and another as a distractor) or d word clouds (one per dimension) in the case of a multidimensional construct (d dimensions; d > 2).

In any case, consider the following instructions: Please (i) analyze each word cloud; you will then be presented with statements about the construct (or its dimensions); (ii) read them in detail; and afterward (iii) link each statement to the word cloud to which it has the greatest affinity. Here, a statement is a theoretically supported formal definition of the investigated construct (or dimension). For example, for “idealized influence” (II; a dimension of TL), a possible statement could be: a leader who is admired, respected, and conceived as a role model, thanks to their high standards of performance, ethics, health, and outstanding behavior (Avolio and Bass 2004).

To conclude whether there is empirical evidence in favor of the coherence of a construct, the proposed framework uses the contingency table shown in Fig. 3, illustrated for d dimensions.

Fig. 3

Contingency d × d table to summarize results about content coherence. Notes f (frequency); n (total observations); d (number of dimensions)

Note in Fig. 3 that this procedure summarizes, in tabular format, categorical responses for the affinity between d word clouds (e.g., construct dimensions) and d statements (formal definitions). Based on Howe (1985), the normalized diagonal represents the proportion of coherence/agreement between the two components (word cloud and statement) from the respondent’s perspective. Thus, to conclude in favor of content coherence, three criteria are proposed. The first two must hold simultaneously: (i) each element on the diagonal (see Fig. 3) is greater than any element outside the diagonal; and (ii) the sum of elements within the diagonal is greater than half of the sample consulted. These two results imply that the human interpretation of the content summarized in the construct word clouds tends to converge toward the correct definitions and to discriminate from other possible content. In addition, when criteria (i) and (ii) are satisfied, this paper suggests employing (iii) a Pearson’s χ2 test, with (d − 1) × (d − 1) degrees of freedom, to support analytical content coherence by testing the possible association between rows and columns. That is, levels of word clouds (rows) are associated with (or are not independent of) statement categories (columns).
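The three criteria can be sketched in Python as follows. This is an illustrative sketch under stated assumptions: the function name `coherence_check` and the 2 × 2 table are hypothetical, and criterion (ii) is evaluated against the total number of observations in the table, which is one possible reading of “half of the sample consulted.”

```python
def coherence_check(table):
    """Evaluate criteria (i)-(iii) on a d x d table (rows: word clouds, columns: statements)."""
    d = len(table)
    n = sum(sum(row) for row in table)  # total observations in the table
    diagonal = [table[i][i] for i in range(d)]
    off_diagonal = [table[i][j] for i in range(d) for j in range(d) if i != j]

    # (i) every diagonal cell exceeds every off-diagonal cell
    crit_i = min(diagonal) > max(off_diagonal)
    # (ii) the diagonal accounts for more than half of all observations
    crit_ii = sum(diagonal) > n / 2

    # (iii) Pearson chi-squared statistic for independence of rows and columns
    # (assumes every row and column total is nonzero)
    row_totals = [sum(row) for row in table]
    col_totals = [sum(table[i][j] for i in range(d)) for j in range(d)]
    chi2 = 0.0
    for i in range(d):
        for j in range(d):
            expected = row_totals[i] * col_totals[j] / n
            chi2 += (table[i][j] - expected) ** 2 / expected
    df = (d - 1) * (d - 1)
    return crit_i, crit_ii, chi2, df

# Hypothetical responses (not data from this study).
toy_table = [[8, 2],
             [1, 9]]
crit_i, crit_ii, chi2, df = coherence_check(toy_table)
```

The p value for criterion (iii) is obtained by comparing the statistic with a chi-squared distribution with `df` degrees of freedom, for instance with a statistics library. With small samples, expected cell counts can be low, which is one reason to complement the test with resampling.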

Embedded voting

In this subsection, we propose an original method entitled “Embedded voting.” In "Results of the empirical application of PMTM in the study of TL", we describe the practical use of this method considering TL.

A frequent practice in the dictionary development/validation of psychological/managerial constructs from texts is to use expert evaluations to decide upon controversies in the dictionary content. For example, the framework proposed by Short et al. (2010) and also employed by Pandey and Pandey (2017) includes the notion that researchers should “validate word lists using content experts.” Likewise, when there is no agreement between experts (e.g., to include a specific word in the dictionary or deciding in which dimension to include the word), a discussion process is held until a consensus is reached (Pandey and Pandey 2017; Ponizovskiy et al. 2020). We intend to emulate and refine this procedure by developing an automated-computational perspective assisted by machine learning using word embeddings. Details of the procedure of embedded voting are shown in Fig. 4.

Fig. 4

Developed method for emulating expert raters: General illustration of embedded voting of n.experts considering “A” and “B” construct dictionary options

As shown in Fig. 4, when there is controversy concerning whether a specific word is more pertinent to one dimension (“A”) or another (“B”), embedded voting receives the list of words of each dimension, among other inputs shown in Fig. 4 (e.g., number of experts to emulate: n.experts). Then, preprocessing tasks are performed to obtain (i) a vector of controversial words between the dimensions (Controv), and (ii) two vectors of unique words (Aun, Bun). Likewise, two additional vectors (Vote.A, Vote.B) and one output matrix (Res) are initialized. After that, embedded voting emulates the decision-making process of each expert in the panel (from j in 1 to n.experts). Thus, each “expert” evaluates the association between the embedded vector of each controversial word (from term in Controv) and two aggregated embedded vectors of the unique words of the “Aun” and “Bun” dimensions. That is, the expert should decide whether each controversial word (vector) is more associated with dimension “A” or “B,” a decision that is made internally using a word embedding approach. Subsequently, if the “expert” votes to include the controversial word k in dimension “A”, this dimension receives one point (Vote.A[k] = Vote.A[k] + 1); similarly, when dimension “B” is more related to the controversial word, it receives one point (Vote.B[k] = Vote.B[k] + 1). Once all “experts” have voted, embedded voting decides to which dimension each controversial word should be assigned, based on the most votes (“Res,” Fig. 4). To generate the vectors, embedded voting considers relevant content representing both “A” and “B” dimensions, using as its input a combination of the corpora (CorpAuB) from which the “A” and “B” dictionaries were formed. That is, both “A” and “B” constructs provide linguistic contexts to be considered in the decision-making process. Thus, a word embedding model is trained, and then the required embedded vectors (vA, vB) are generated.
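The voting logic can be sketched in Python as follows. This is not the authors' R code. It assumes that each emulated expert is represented by its own word-to-vector lookup (for example, one embedding model per expert trained on CorpAuB with a different random seed), which the description above does not specify. The function names, the toy two-dimensional vectors, and the tie rule are likewise hypothetical.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

def aggregate(words, embedding):
    """Average the vectors of the words covered by the embedding."""
    vectors = [embedding[w] for w in words if w in embedding]
    return [sum(column) / len(vectors) for column in zip(*vectors)]

def embedded_voting(dict_a, dict_b, expert_embeddings):
    """Assign each controversial word to dimension 'A' or 'B' by majority vote."""
    controv = sorted(set(dict_a) & set(dict_b))  # words claimed by both dimensions
    a_un = sorted(set(dict_a) - set(dict_b))     # words unique to A (Aun)
    b_un = sorted(set(dict_b) - set(dict_a))     # words unique to B (Bun)
    vote_a = {term: 0 for term in controv}
    vote_b = {term: 0 for term in controv}

    for embedding in expert_embeddings:          # one pass per emulated expert
        v_a = aggregate(a_un, embedding)
        v_b = aggregate(b_un, embedding)
        for term in controv:
            if term not in embedding:
                continue                         # this expert abstains
            if cosine(embedding[term], v_a) >= cosine(embedding[term], v_b):
                vote_a[term] += 1
            else:
                vote_b[term] += 1

    # Ties go to "A" here; this rule is an arbitrary assumption of the sketch.
    return {t: "A" if vote_a[t] >= vote_b[t] else "B" for t in controv}

# Three hypothetical experts with toy two-dimensional embeddings.
experts = [
    {"vision": [1.0, 0.0], "inspire": [0.9, 0.1], "idea": [0.0, 1.0],
     "question": [0.1, 0.9], "goal": [0.8, 0.3]},
    {"vision": [1.0, 0.1], "inspire": [0.8, 0.2], "idea": [0.1, 1.0],
     "question": [0.0, 1.0], "goal": [0.7, 0.4]},
    {"vision": [0.9, 0.2], "inspire": [1.0, 0.0], "idea": [0.2, 0.9],
     "question": [0.1, 1.0], "goal": [0.4, 0.8]},
]
result = embedded_voting(["vision", "inspire", "goal"],
                         ["idea", "question", "goal"], experts)
```

In this toy run, two of the three emulated experts place “goal” closer to dimension “A,” so the word is assigned to “A.”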

Embedded voting is one automated resource to help overcome eventual dictionary problems regarding four properties: commonality, coherence, PoS balance, and convergence/differentiation; hence, it is helpful in steps 4, 7, and 10 of the PMTM framework (see Fig. 1). The R code will be shared on reasonable request to the corresponding author.

Representation of the measurement model based on PoS

We operationalize a new representation for psychological/managerial constructs by considering PoS. In "Results of the empirical application of PMTM in the study of TL", we empirically examine this representation in the case of TL.

We intend to extract lexical proxies of qualities (adjectives, a, or adjective phrases, ap), perceptual entities (nouns, n, or noun phrases, np), and actions (verbs, v, or verb phrases, vp) commonly used in texts about the psychological/managerial construct under study. This approach is consistent with the importance of nouns as the essential representation of concepts (Wüster 1979) and adjectives/verbs as lexical units with the potential for capturing qualities/actions from specific domains (Cabré 1999, 2002). For example, according to Campos and Castells (2010), domain-specific adjectives should be treated independently during the development of dictionaries. Moreover, nouns, adjectives, and verbs involve different brain mechanisms during text production (Fyshe et al. 2019; Haan et al. 2000; Martin et al. 1995). Thus, the framework develops two formats of observable indicators based on correlated PoS:

  • Basic: n, a, and v.

  • Composed: np (using the n scores), ap (averaging scores of a and n), and vp (averaging scores of v and n). This format supports phrase formation based on context-free grammar (Chomsky 1955) and the aggregated vectors of words (e.g., Garten et al. 2018).
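The derivation of the composed indicators from the basic ones can be sketched in Python as follows. The function name and the scores are hypothetical and serve only to make the averaging rule concrete.

```python
def composed_indicators(scores):
    """Derive np, ap, and vp scores from the basic n, a, and v scores of one text."""
    n, a, v = scores["n"], scores["a"], scores["v"]
    return {
        "np": n,            # noun phrases reuse the noun score
        "ap": (a + n) / 2,  # adjective phrases: mean of adjective and noun scores
        "vp": (v + n) / 2,  # verb phrases: mean of verb and noun scores
    }

# Hypothetical basic scores of one text for one dimension.
basic = {"n": 0.40, "a": 0.20, "v": 0.60}
composed = composed_indicators(basic)
```

Because the noun score enters all three composed indicators, these indicators are expected to correlate more strongly with one another than the basic ones do.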

Figure 5 illustrates the two formats considering two construct dimensions (i, j) and a first-order model to ease understanding. However, these can be extended to more dimensions or complex structures (e.g., second-order model) or parsimonious models based on PoS parcels (e.g., averaging a, n, and v vector scores) that summarize higher order constructs.

Fig. 5

Measurement model representation based on correlated PoS

Note in the proposed representation (see Fig. 5) that each error variance is a combination of natural errors and systematic components of each type of PoS. Analogously to questionnaire studies, one possible systematic component could be the method employed (e.g., personality measured based on a checklist of adjectives; Gough 1979). Another systematic component represented in correlated PoS can be linguistic considerations, such as the synonymy or polysemy of words. For example, multiple adjectives (or nouns or verbs) can present the same meaning, and various meanings can be attributed to a single adjective (or noun or verb) (Lochter et al. 2016). Thus, the proposed representation can be understood as an adaptation of the correlated-uniqueness CFA (Batista-Foguet and Coenders-Gallart 2000; Marsh et al. 1992) by considering PoS. The statistical expression of an observable indicator based on PoS type k for a particular dimension (D) is presented in the following equation:

$${\text{PoS}}_{k} = \lambda_{k} D + \delta_{k} ;\quad \delta_{k} = e + \gamma_{k}$$
(4)

where PoSk denotes the scores (e.g., sum, average, or weighted average) obtained from a merged list of manifestations of type k (PoS: a or ap, representing qualities; n or np, perceptual entities; and v or vp, actions) derived from a particular dimension (D), \({\lambda }_{k}\) is the corresponding factor loading, and \({\delta }_{k}\) is the indicator error term, comprising both random error (e) and systematic components of the PoS effect (\({\gamma }_{k}\)).

To reach a conclusion concerning the plausibility of this representation (Fig. 5), the suggested metric for CFA, according to Credé and Harms (2015), is the chi-squared (\({\chi }^{2}\)) metric, which contrasts a hypothesized model vs. a null model (latent variables are not required to reproduce the evidence: covariance). However, due mainly to the sensitivity of \({\chi }^{2}\) to the sample size, other complementary fit indexes are also used (Credé and Harms 2015; Lévy and Varela 2006): RMSEA (root mean square error of approximation), SRMR (standardized root mean square residual), CFI (comparative fit index), and TLI (Tucker–Lewis index). To interpret these indexes, it is typical to use cutoff values (e.g., those proposed by Hu and Bentler 1999: RMSEA < 0.06; SRMR < 0.08; CFI and TLI > 0.95) in constructs based on structured questionnaires (Lévy and Varela 2006; Lorenz et al. 2021). However, there is controversy about the usefulness of these cutoff values, especially in higher order factors. Credé and Harms (2015) argue that the better models are those that present a nonsignificant \({\chi }^{2}\) and, at the same time, low values for RMSEA and SRMR and high values for CFI and TLI.
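The comparison of fit indexes with the Hu and Bentler (1999) cutoffs can be sketched in Python as follows. The dictionary layout, the function name, and the fit values are hypothetical; given the controversy noted above, the resulting flags are a screening aid rather than a decision rule.

```python
# Cutoffs cited in the text: RMSEA < 0.06; SRMR < 0.08; CFI and TLI > 0.95.
HU_BENTLER_CUTOFFS = {
    "rmsea": ("<", 0.06),
    "srmr": ("<", 0.08),
    "cfi": (">", 0.95),
    "tli": (">", 0.95),
}

def check_fit(indices, cutoffs=HU_BENTLER_CUTOFFS):
    """Return, for each index, whether it meets its cutoff."""
    flags = {}
    for name, (direction, cutoff) in cutoffs.items():
        value = indices[name]
        flags[name] = value < cutoff if direction == "<" else value > cutoff
    return flags

# Hypothetical fit indexes for one model (not results from this study).
hypothetical_fit = {"rmsea": 0.05, "srmr": 0.04, "cfi": 0.97, "tli": 0.94}
flags = check_fit(hypothetical_fit)
```

In this hypothetical case, all indexes meet their cutoffs except TLI.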

Results of the empirical application of PMTM in the study of TL

This section is structured in the stages of the proposed framework (see Fig. 1) to develop and validate a new scale of TL from a theory/text-driven approach using data from organizational documents, interviews, essays, blog posts, surveys, and speeches by former US presidents.

Dictionary development and content validation (Outputs of steps 1–10 of PMTM)

Regarding “Step 1: establishing construct definition and dimensionality from a theory-driven approach” (see Fig. 1), the construct under study is TL. Considering Avolio and Bass (1991, 2004), this construct represents the capacity to inspire and guide followers towards collective goals and, simultaneously, modify their motivational basis, following their desire for achievement, self-improvement, and self-realization. Moreover, at present, their questionnaire (the MLQ™, Avolio and Bass 1991, 2004) is the most commonly used instrument to measure TL (Brown and Keeping 2005; DeDeyn 2021), and it operationalizes the construct in four dimensions: (1) Individualized consideration (IC): treating each collaborator in a differentiated way, considering their needs, capacities, goals, and expectations; (2) Intellectual stimulus (IS): stimulating followers to assume creative thoughts and behaviors and to search for innovative ideas and solutions through the promotion of reasonable doubt, the filtering of information, argumentation, and the questioning of supposed values and beliefs; (3) Idealized influence (II, two types: attributed and behavior): the leader is admired, respected, and conceived as a model to follow; followers identify with the leader and aim to emulate them; and (4) Inspirational motivation (IM): fostering enthusiasm among employees, using teamwork to channel resources and capabilities to overcome the personal and organizational status quo and achieve better performance.

Concerning “Step 2: conforming three corpora types: seed, scientific, and pragmatic” (see Fig. 1), Table 3 describes the empirical data compiled to develop the dictionary.

Table 3 Data used for steps 2–10 of PMTM

After applying steps 3–10 of PMTM (see reproducibility in the supplementary material: “Supp1”), we present the results for the proposed properties (see "Proposed properties for exploring content validity"), considering the final TL dictionary (1073 words; nouns: 424; adjectives: 209; verbs: 440):

PoS balance

Table 4 provides evidence to examine whether there is a PoS balance across the TL dimensions.

Table 4 PoS balance analysis considering initial (seed) and final dictionaries

Note, in Table 4, that the linguistic structure of the dictionary satisfies the PoS balance: chi-squared (6 degrees of freedom in each table) for the initial (based on seed corpus) and final (all corpora: seed, scientific, and pragmatic) moments are 0.919 and 0.6603, respectively. Moreover, the contrast between moments using Eq. 1 suggests that the initial dictionary expansion did not destroy the PoS structure underlying the seed corpus (formal definitions/descriptions of each TL dimension), with a chi-squared of 0.9061 and 8 degrees of freedom. Thus, the latent context (e.g., use, interpretation, and brain processing of words) underlying the initial content (formal definitions/descriptions) of the TL dimensions continues to keep its PoS balance, even though such content was extended based on additional contexts/corpora.

Content commonality

Table 5 presents evidence about content commonality in the final TL dictionary, specifying the frequency and proportion of shared terms by dimension (Cij) and PoS (Ck).

Table 5 Content commonality analysis for the final dictionary

Note, in Table 5 (considering the final TL dictionary), that in the case of nouns, 7 of the 424 terms are shared by all TL dimensions, representing 1.7% of content commonality (Ck) at the noun level and around 6.7% between pairs of dimensions (Cij). In adjectives and verbs, these Ck are 1.9% and 1.6%, respectively, and Cij range between 5.9 and 11.5%.
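The commonality indexes can be sketched in Python as follows. The formal definitions of Ck and Cij are given earlier in the paper and are not reproduced here, so this sketch rests on two assumptions: Ck is taken as the share of all distinct terms of a PoS that appear in every dimension (which matches the 7 of 424 nouns reported above), and Cij is taken as the share of the combined terms of two dimensions that both contain. The function name and the toy word lists are hypothetical.

```python
from itertools import combinations

def commonality(dimension_terms):
    """Return (C_k, C_ij) for the term lists of one PoS, keyed by dimension."""
    sets = {name: set(terms) for name, terms in dimension_terms.items()}
    all_terms = set().union(*sets.values())
    shared_by_all = set.intersection(*sets.values())
    c_k = len(shared_by_all) / len(all_terms)  # assumed definition of C_k
    c_ij = {
        (i, j): len(sets[i] & sets[j]) / len(sets[i] | sets[j])  # assumed C_ij
        for i, j in combinations(sets, 2)
    }
    return c_k, c_ij

# Hypothetical noun lists for three dimensions (not the actual dictionary).
toy_nouns = {
    "II": ["leader", "model", "trust", "team"],
    "IM": ["vision", "team", "goal", "leader"],
    "IS": ["idea", "problem", "team", "solution"],
}
c_k, c_ij = commonality(toy_nouns)
```

In the toy lists, only “team” is shared by all three dimensions, so Ck is 1/9, while II and IM share two of their six combined terms.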

The evidence about content commonality (Table 5) helps prevent future problems of high lexical correlations that falsely show a convergence between TL dimensions. Likewise, the global terms comprising the final dictionary can be assumed as general-domain contexts, which also apply to theoretical and practical TL contexts. It is worth noting that global content commonality (Ck) before the use of embedded voting in both the initial dictionary (from the seed corpus) and its expansions (scientific and pragmatic) presented high values. For example, considering the seed corpus, commonality for pairs of dimensions (Cij) by PoS ranged from 27.9 to 39% in nouns, 18.6 to 32.7% in adjectives, and 18.6 to 30.3% in verbs. In the pragmatic corpus, these commonalities, before embedded voting, ranged from 32 to 43.6% in nouns, 25.4 to 52.2% in adjectives, and 30.3 to 38.4% in verbs (detailed outputs in Python are in “Supp2”, supplementary material). These values justify the traditional practice based on expert discussions to generate a consensus when construct dictionaries are developed based on human tasks. However, in this paper, we proposed and applied embedded voting, a method to emulate this agreement process computationally using embedded vectors (e.g., word2vec) from a machine-learning perspective.

Embedded-content convergence/differentiation

Table 6 describes the associations (cosine similarity) between the vectors of TL dimensions (merging nouns, verbs, and adjectives) and the reference vectors with which it is expected that the dimensions have high (Vh), medium (Vm), and low (Vl) associations. Based on the conditions stated in Eq. 3, the evidence provided (Table 6) supports the content convergence/differentiation of the developed dictionary from a linguistic-computational perspective. For example, the cosine similarities between TL dimensions ranged from 0.924 to 0.944; these are greater than the similarities between TL dimensions and reference vectors (0.352–0.835).

Table 6 Content convergence/differentiation analysis for the final dictionary

Table 6 shows that TL dimensions present more similarity with Vh (0.751–0.835; words used frequently by leaders rated high on TL, e.g., “team,” “members,” “provide,” “purpose,” “teach,” “information,” “concerns,” “solutions,” “creating”; Salter et al. 2013, p. 65) than with Vm (0.575–0.6; words frequently used by leaders rated low on TL, e.g., “results,” “schedule,” “tasks,” “colleagues,” “measured,” “budget”; Salter et al. 2013, p. 65) and Vl (0.352–0.461; negativity words). Associations between TL dimensions and Vl should be lower and not necessarily negative, because the TL vectors are not exclusively based on adjectives but also on nouns and verbs.

In all cases, the similarities calculated from embedded vectors (TL dimensions and reference vectors) satisfy the conditions stated in Eq. 3 in favor of content convergence/differentiation. In other words: (i) the TL dimensions share more information with one another than with other measures; (ii) the TL dimensions are sensitive to changes in the measures with which they are associated; and (iii) the TL dimensions reflect the expected association patterns with other measures (high, medium, and low associations).

Content coherence

Figure 6 provides the word clouds used to explore the content coherence of the construct dictionary. “A,” “B,” “C,” and “D” labels are II, IM, IS, and IC, respectively.

Fig. 6

Word clouds (a random sample of words from the dictionary vocabulary) used to explore human interpretation of the TL dictionary. “A,” “B,” “C,” and “D” labels are II, IM, IS, and IC, respectively

The labels were hidden from the ten evaluators (one Ph.D. professor, six M.Sc. professors, and three practitioners, all in management areas). The statements (formal definitions for TL dimensions) were presented to the evaluators in separate sections and in the following order: IM, II, IC, and IS statements (e.g., inspirational motivator: fosters enthusiasm among employees and inspires them to overcome personal/organizational status quo and achieve higher levels of performance).

Table 7 presents the contingency table that describes the affinity perceived by the evaluators, considering two scenarios: original (n: 10 respondents) and resampled (5000 replicas).

Table 7 Content coherence analysis considering word clouds from the final dictionary

Note that in all cases (Table 7), the correct options were chosen by at least 50% of the respondents (A–A: 9/10 respondents; C–C: 5/10); likewise, the independence hypothesis between word clouds and correct statements was rejected, with a chi-squared value of 43.2 (9 degrees of freedom) and a p value near zero. In other words, the evidence supports that humans adequately interpret the content of the dictionary under study.

Content polarity

It is expected that a transformational leader has/employs more positive than negative emotions/words (Diebig et al. 2017). Table 8 categorically and numerically summarizes the results.

Table 8 Content polarity analysis for the final dictionary

Table 8 illustrates that the number of positive words is greater than negative ones in all cases, with ratios (pos/neg) ranging from 2.25 (IS) to 23.5 (IM). In addition, as expected, IM presented the highest positivity (21% of its terms) and IS the lowest (13% of its terms), which is consistent with the notion that (i) motivational thinking/behavior is characterized by positive emotional language; and (ii) intellectual thinking/behavior is characterized by reasonable doubt, questions, consideration of risks and opportunities, rational persuasion, impartiality, and fewer emotional decisions (e.g., “a sense that the structures and processes are orderly and rational,” Poghosyan and Bernhardt, 2018, p. 3).

In addition, PMTM automatically controls the polarity of adjectives of all TL dimensions during the dictionary construction steps. For example, in the case of IS, adjectives such as “stupid,” “obsolete,” and “fear” [a non-adjective expression] were automatically detected and removed during the seed corpus processing; and “negative,” “poor,” and “wrong” during the final stage (pragmatic corpus). This content polarity was controlled without human intervention, and the evidence provided in Table 8 satisfies the expected results. For details of the history of these results across all steps of the dictionary construction, see the supplementary material (“Supp1”).

Validity, reliability, and equity (outputs of steps 11–15 of PMTM)

Confirming the plausibility of the internal structure and examining reliability

We employ three corpora (text data): one from organizations (letters to shareholders from 2018 Fortune companies; this data type was also used by Pandey and Pandey 2017 to study organizational culture and by Josef and Helena 2019 to study effective leadership using LIWC2015, Pennebaker et al. 2015) and two from individual contexts (emulated online job interviews, a formal context, and blog posts, a casual context; these blog posts were also used by Ponizovskiy et al. 2020 to study personal values). Subsequently, we calculate scores for each TL dimension using the expression (in log scale) proposed by Pérez-Rave et al. (2020). Table 9 describes the corpora processed to examine the internal validity of the dictionary.

Table 9 Data used for step 11 of PMTM “Confirming the plausibility of the internal structure of the scale, and reliability”

For each format type of PoS (basic: a, n, v; and composed: ap, np, vp; see Fig. 5), we contrast three models with the four theoretical TL dimensions (II, IM, IS, and IC; Avolio and Bass 1991, 2004): single, first-order, and second-order structures (each including the described correlated PoS). The analyses were performed in R (R Core Team 2021) using lavaan (Rosseel 2012) under both maximum likelihood (ML) estimation and MLM (maximum likelihood with robust standard errors and a Satorra–Bentler scaled test statistic), considering the corpora described in Table 9. The results are provided in Table 10. Additional detailed outputs are presented in the supplementary material (“Supp2…”).

Table 10 Results of CFA with correlated PoS using basic (a, n, v) and composed (ap, np, vp) formats

Table 10 reveals that the two formats of the observable PoS (basic: a, n, v; composed: ap, np, and vp) support the plausibility of the construct (TL). The first-order and second-order structures (the latter except in “Corpus 3: Blog Authorships Corpus”; here, it did not converge) using the basic format were notably plausible (e.g., Chisq/df from 1.32 to 2.32 using ML and 1.14 to 1.46 with MLM; and CFI min: 0.933, SRMR max: 0.07). However, a more parsimonious model of TL (single factor) was also plausible. Moreover, in the second-order model based on the basic format (a, n, v), the first-level factors (TL dimensions) presented moderate/low reliabilities (e.g., composite reliability: 0.451 to 0.7019 in “letters to shareholders”; 0.42 to 0.61 in “online job interview”), but at the second level, the factor (TL) presented high composite reliability (e.g., 0.94 and 0.91 in “letters to shareholders” and “online job interview,” respectively).

Globally interpreting these results, the basic format of PoS (a, n, v) helps represent TL from texts in two cases: (i) when the interest is focused on TL globally (e.g., single-factor model) or (ii) when its dimensions will be measured as PoS parcels ensuring the representation of qualities (adjectives), perceptual entities (nouns), and actions (verbs) of the global construct (TL).

Parcels are frequently employed as indicators of multidimensional constructs in CFA (Weng 2019), such as TL (Aryee et al. 2012, averaged items into TL dimensions). In addition, these are helpful in cases with relatively small sample sizes (e.g., type 2 errors are reduced; Rahaman et al. 2020; Xie 2020), high numbers of indicators for constructs (Lan and Chen 2020), or restricted correlations, among others (Kishton and Widaman 1994). However, instead of using the traditional random parcels based on merged lists of words, we ensure that each parcel in each TL dimension includes (by averaging) three essential elements: qualities (adjectives), perceptual entities (nouns), and actions (verbs) concerning the construct under study.

To illustrate this strategy under the basic format of PoS, we perform a CFA with four PoS parcels (2 degrees of freedom). In the corpus of letters to shareholders (n: 186 obs), the results were (in parentheses MLM estimations): χ2/df: 1.945 (1.576); CFI: 0.994 (0.995); TLI: 0.982 (0.984); RMSEA: 0.071 (0.062); SRMR: 0.021. Likewise, factor loadings were: 0.69 (II); 0.77 (IM); 0.70 (IS); and 0.89 (IC); and composite reliability for TL was 0.849. Using the corpus of online interviews (128 obs), we found: χ2/df: 0.1995 (0.1755); CFI: 1 (1); TLI: 1 (1); RMSEA: 0.000; SRMR: 0.009. Factor loadings were: 0.6 (II); 0.82 (IM); 0.68 (IS); and 0.66 (IC); and composite reliability of TL was 0.784.
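As a plausibility check on the figures above, the following Python sketch recomputes RMSEA from the reported χ2/df values and composite reliability from the reported standardized loadings. Both formulas are common textbook expressions and are assumptions here, since the text does not state which expressions the software applied: RMSEA uses n − 1 in the denominator (some implementations use n), and composite reliability assumes uncorrelated errors.

```python
import math

def rmsea(chi2, df, n):
    """Point estimate of RMSEA with n - 1 in the denominator (assumed convention)."""
    return math.sqrt(max(chi2 - df, 0.0) / (df * (n - 1)))

def composite_reliability(loadings):
    """Composite reliability from standardized loadings, assuming uncorrelated errors."""
    true_part = sum(loadings) ** 2
    error_part = sum(1 - loading ** 2 for loading in loadings)
    return true_part / (true_part + error_part)

# Letters to shareholders: chi2/df = 1.945, df = 2, n = 186 (ML estimates).
rmsea_letters = rmsea(1.945 * 2, 2, 186)
cr_letters = composite_reliability([0.69, 0.77, 0.70, 0.89])

# Online interviews: chi2/df = 0.1995, df = 2, n = 128 (ML estimates).
rmsea_interviews = rmsea(0.1995 * 2, 2, 128)
cr_interviews = composite_reliability([0.60, 0.82, 0.68, 0.66])
```

Under these assumptions, the letters-to-shareholders values come out at approximately 0.071 (RMSEA) and 0.849 (composite reliability), in line with the reported results. For the online interviews, RMSEA is 0 because χ2 is below its degrees of freedom, and composite reliability is approximately 0.786, close to the reported 0.784; the small gap is consistent with the loadings being rounded to two decimals.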

On the other hand, concerning the composed format (ap, np, vp), the evidence (see Table 10) suggests that when the interest is to contrast theories comprehensively (e.g., higher order factor), such a format is preferable to the basic PoS. In the three corpora, the first-/second-order models of TL were both plausible and reliable. For example, for the second-order model using the corpus of letters to shareholders, we obtained: χ2/df: from 2.49 to 4.86 (with ML estimations) and 2.21 to 2.44 (with MLM); CFI min: 0.959, and SRMR max: 0.09. However, the single-factor model presented a poor fit (e.g., χ2/df ranging from 18.7 to 56.6 using ML and 10.79 to 27.9 with MLM). In addition, the composite reliabilities of the second-order model ranged from 0.89 to 0.93 for dimensions and 0.85 for the global factor (second level). In this same corpus, Cronbach’s alpha for dimensions ranged from 0.88 to 0.93 and for global TL (averaging the dimension scores) it was 0.84.

Note that, in both cases (CFA with correlated PoS based on either the basic or the composed PoS format), the multidimensionality of TL is supported; this is consistent with several works using self-report data, such as Avolio et al. (1999) and Tejeda et al. (2001). Moreover, as expected, correlations between aggregated scores derived from basic (a, n, v) and composed PoS (ap, np, vp) are high (II: 0.89, IM: 0.93, IS: 0.9, IC: 0.94; and TL: 0.96), which supports the convergence between the two types of measures.

To illustrate the application of the remainder of the PMTM steps, the following sections are developed using the basic format of PoS, averaging scores of adjectives, nouns, and verbs within each TL dimension and averaging dimension scores to obtain TL scores.
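The scoring used in the remaining steps can be sketched in Python as follows. The function names and the input scores are hypothetical; the sketch only makes the two averaging steps explicit and does not reproduce the log-scale expression used to obtain the PoS scores themselves.

```python
from statistics import mean

def pos_parcel(scores):
    """Average the adjective, noun, and verb scores of one dimension into a parcel."""
    return mean([scores["a"], scores["n"], scores["v"]])

def tl_scores(dimension_scores):
    """Return the parcel of each dimension and the overall TL score of one text."""
    parcels = {dim: pos_parcel(s) for dim, s in dimension_scores.items()}
    return parcels, mean(list(parcels.values()))

# Hypothetical PoS scores of one text for the four TL dimensions.
text_scores = {
    "II": {"a": 0.2, "n": 0.4, "v": 0.6},
    "IM": {"a": 0.1, "n": 0.2, "v": 0.3},
    "IS": {"a": 0.3, "n": 0.3, "v": 0.3},
    "IC": {"a": 0.5, "n": 0.1, "v": 0.3},
}
parcels, tl = tl_scores(text_scores)
```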

Confirming external validity (convergent and discriminant capabilities)

To analyze associations with reference variables from textual and nontextual data (e.g., Pandey and Pandey 2017; Ponizovskiy et al. 2020; Short et al. 2010), we collect and use five corpora from formal and casual contexts, which are described in Table 11.

Table 11 Data used for step 12 of PMTM “Confirming external validity”

Regarding data set “1. Blog Authorships Corpus scores for personal values, and LIWC2015 variables” (n: 8869 obs; see Table 11), the correlation analysis is shown in Table 12.

Table 12 Correlations based on data from the Blog Authorships Corpus and LIWC2015 (n: 8869 obs.)

Table 12 identifies that the scores for all TL dimensions are positively correlated (most of which are significant) with variables, such as “conformity” (0.22 to 0.65), “stimulation” (0.58 to 0.76), “achievement” (0.48 to 0.56), “power” (0.22 to 0.61), “focuspresent” (0.35 to 0.67), and “work” (0.29 to 0.60). These results reinforce the scale convergence, considering measures that are expected to be positively associated with TL. Likewise, the scores of TL dimensions present negative associations (most are significant) with variables, such as “hedonism” (− 0.41 to − 0.06), “percept” (− 0.41 to − 0.35), “focuspast” (− 0.68 to − 0.47), “motion” (− 0.63 to − 0.34), “leisure” (− 0.45 to − 0.3), “home” (− 0.64 to − 0.49), “informal” (− 0.54 to − 0.22), and “swear” (− 0.52 to − 0.28). These are evidence in favor of the discriminant capability of the developed scale.

In addition, all correlations between the TL dimension scores presented high values (all significant), ranging from 0.76 (IS) to 0.89 (IS, IC), which also favor the scale convergence. Likewise, these correlations (between TL dimensions) were greater than those between TL dimensions and other variables, which supports the discriminant capability of the scale. Furthermore, there is also discriminant capability between TL dimensions; for example, although all TL dimensions were negatively associated with “hedonism” (− 0.41 to − 0.06), only the association with IS was statistically significant. Similarly, IS was the only dimension that showed significant associations with “risk” (0.36) and “self-direction” (0.33); in addition, the highest correlation between “power” and TL dimensions was with IS (0.61, significant). Likewise, II and IM evidenced a positive (nonsignificant) association with “reward” (0.2 and 0.22, respectively), whereas the association for IS was negative (nonsignificant; − 0.16) and that for IC was practically nil (− 0.03). Furthermore, the aggregate measure of TL (averaging the TL dimension scores) presented the expected significant correlations. For example, “conformity,” “universalism,” “self-direction,” “achievement,” “insight,” “focuspresent,” and “work” are positively and significantly correlated with TL, whereas “percept,” “focuspast,” “relativ,” “motion,” “leisure,” “home,” “informal,” and “swear” are negatively and significantly correlated with TL.

Detailed correlations between the TL scale scores and the remaining 13 reference variables (described in Table 11) are in the supplementary material (“Supp3”). Next, we will summarize the main findings:

Regarding data set “2. MBTI personality including the last 50 things blog posted” (n: 8675 obs; see the data described in Table 11), we calculate the point-biserial correlation between scores of TL dimensions and variables representing personality types from MBTI (after binarization: e.g., 1. Extraversion, 0. Introversion; 1. Sensing, 0. Intuition). Again, we found evidence from perceptual data (MBTI) supporting the convergence and discriminant capabilities of the TL scale. In all cases, correlations between the scores of TL dimensions were positive and significant (0.26 to 0.28); in addition, these were greater than the correlations between the TL dimensions and the other variables (from − 0.04 to 0.09). Moreover, IM was negatively and significantly associated with ‘think’ (− 0.04; i.e., it tends toward ‘feeling’), while IS presented a positive and significant association with ‘think’ (0.09). Likewise, TL showed a tendency toward decision-making preferences instead of perceiving (correlation with “judging”: 0.07; significant).
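The point-biserial correlation is Pearson's correlation with one variable coded as 0/1, which the following Python sketch makes explicit. The function name and the six observations are hypothetical and do not come from the MBTI data set.

```python
import math

def point_biserial(binary, scores):
    """Point-biserial correlation, computed as Pearson's r with a 0/1 variable."""
    n = len(scores)
    mean_x = sum(binary) / n
    mean_y = sum(scores) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(binary, scores))
    var_x = sum((x - mean_x) ** 2 for x in binary)
    var_y = sum((y - mean_y) ** 2 for y in scores)
    return cov / math.sqrt(var_x * var_y)

# Hypothetical data: 1 = thinking, 0 = feeling; scores are IS scores from texts.
is_thinking = [1, 1, 0, 0, 1, 0]
is_scores = [0.8, 0.6, 0.3, 0.4, 0.7, 0.2]
r_pb = point_biserial(is_thinking, is_scores)
```

With samples as large as the one described above (n: 8675), even correlations below 0.1 can be statistically significant, so the magnitude should be interpreted alongside the significance.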

Regarding data set “3. Managerial essay corpus and questionary data” (n: 179 obs for some items of MLQ; and 96 obs for items of NEO-PI-R; see the data description in Table 11), we found several logical and expected patterns: IM and TL (this latter, averaging scores) are positively and significantly correlated with ‘extraversion’ (0.21 and 0.238, respectively), and negatively correlated (non-significant) with ‘neuroticism’ (− 0.229 and − 0.134, respectively). Besides, IC and TL showed negative and significant correlations with ‘laissez faire’ (− 0.186 and − 0.147, respectively); IM, IS, and TL revealed negative/significant correlations with a reactive style of leadership (Management-by-Exception: Active, MBEA), obtaining values − 0.222, − 0.157, and − 0.181, respectively. Hence, the evidence and the correlations between TL dimensions (from 0.273 to 0.445, all significant) suggest the convergence and discriminant capability of the TL scale.

Now, focused on the data set “4. Online job interview” (n: 128 obs; see the data described in Table 11), we calculate correlations between TL dimensions and two variables both from MLQ items (Avolio and Bass 2004): satisfaction with leadership (‘Sat.mlq’) and one TL proxy measure derived from averaging four items (one item for each TL dimension), entitled ‘TL.one.mlq.’

We found that ‘Sat.mlq’ presented a positive and significant association with IS (0.217, at the 0.05 significance level) and TL (0.16, at the 0.1 significance level). Likewise, the proxy of TL based on self-reports (‘TL.one.mlq’) was positively and significantly correlated with IS (0.233, at the 0.01 significance level) and TL derived from texts (0.191, at the 0.05 significance level). Moreover, II and IM evidenced positive (nonsignificant) associations with ‘TL.one.mlq’ (0.147 and 0.132, respectively). These results are relevant, because the values of both positive/significant correlations and positive/nonsignificant correlations are consistent with the validity criteria used by several studies, among them Ponizovskiy et al. (2020, n.p.): “a typical correlation found between linguistic measures and self-reports is in the range of 0.1–0.2.” Hence, we suggest that the evidence derived from online interviews and questionnaire data favors the convergence capability of the developed scale.

On the other hand, focused on the data set “5. Champion vs. Contender companies using 2018 Annual reports from 1000 Fortune list” (n: 60 obs, 30 for each company type; see the data description in Table 11), Fig. 7 provides a visual comparative analysis.

Fig. 7 Mean plots of TL scores using reference groups and data derived from 2018 Annual Reports. Notes: II (idealized influence); IM (inspirational motivation); IS (intellectual stimulus); IC (individual consideration)

Figure 7 shows that in all dimensions and TL (averaging scores of its dimensions), except for IS, champion companies had higher TL scores than contenders, which was also analytically supported using t tests, obtaining the following 95% confidence intervals: II (0.047, 0.211); IM (0.040, 0.255); IS (− 0.034, 0.137); IC (0.045, 0.247); and TL (0.036, 0.201). This complementary evidence from a formal-organizational context reinforces the convergence (toward champion group) and discriminant capabilities (from contender group) of the developed TL scale.

Interpreting globally all the evidence presented, derived from textual and nontextual data, formal and casual contexts, 43 reference variables, and one reference group (champions vs. contenders), we can suggest that the PMTM framework provided a TL dictionary with notable external validity in terms of convergence and discriminant capabilities.

Confirming criterion validity

We employ two corpora/data sets: “MBTI personality including the last 50 things blog posted” (see the data description in Table 11) and “2018 letters to shareholders including two financial variables from 1000 Fortune list.” In the first case (8675 obs), we intend to separately predict each dichotomy of MBTI personality from TL dimension scores using three machine-learning methods. In the second, we calculate Pearson’s correlations between the scores of the construct of interest (TL) and two financial criterion variables (Cameron and Bohannon 2000; Martínez et al. 2006): revenues/assets and profits/assets (182 complete obs). For the first case, Fig. 8 describes the predictive capability of three machine-learning models (logistic regression, classification trees, and bagging) regarding the output variable (preferences according to MBTI) using two samples: training (70% of obs., 5951) and validation (30%; 2550).

Fig. 8 Accuracy of TL dimensions based on textual data using machine learning methods. Notes: TL (transformational leadership)

Figure 8 shows that models based on the developed scale can reasonably contribute to the prediction of dichotomies of personality traits (based on MBTI), with accuracies of around 77% in extraversion–introversion, 62% in sensing-intuition, 86% in thinking-feeling, and 57% in judging-perceiving.
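The training/validation logic behind these accuracies can be sketched as follows. This is only an illustrative Python example under stated assumptions: the data are synthetic (four simulated predictors standing in for TL dimension scores), the logistic regression is fitted with plain gradient descent, and all function names are hypothetical rather than the study's implementation.

```python
import math
import random

def train_validation_split(rows, train_share=0.7, seed=0):
    """Shuffle the rows and split them into training and validation samples."""
    rng = random.Random(seed)
    shuffled = rows[:]
    rng.shuffle(shuffled)
    cut = int(round(train_share * len(shuffled)))
    return shuffled[:cut], shuffled[cut:]

def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)

def linear_part(weights, x):
    """Intercept (last weight) plus the weighted sum of the features."""
    return weights[-1] + sum(w * v for w, v in zip(weights, x))

def fit_logistic(rows, epochs=200, lr=0.5):
    """Fit a logistic regression by batch gradient descent.

    Each row is (features, label), with label in {0, 1}.
    """
    k = len(rows[0][0])
    weights = [0.0] * (k + 1)
    for _ in range(epochs):
        grad = [0.0] * (k + 1)
        for x, y in rows:
            err = sigmoid(linear_part(weights, x)) - y
            for j in range(k):
                grad[j] += err * x[j]
            grad[-1] += err
        weights = [w - lr * g / len(rows) for w, g in zip(weights, grad)]
    return weights

def accuracy(weights, rows):
    """Share of rows whose predicted class matches the observed label."""
    hits = sum(int((linear_part(weights, x) > 0) == (y == 1)) for x, y in rows)
    return hits / len(rows)

# Synthetic data (an assumption): four scores per observation and a 0/1 outcome
rng = random.Random(42)
data = []
for _ in range(200):
    label = rng.randint(0, 1)
    center = 1.0 if label == 1 else -1.0
    data.append(([rng.gauss(center, 1.0) for _ in range(4)], label))

train, valid = train_validation_split(data)
model = fit_logistic(train)
print("validation accuracy:", round(accuracy(model, valid), 3))
```

Reporting accuracy on the held-out 30% rather than on the training sample avoids an optimistic estimate of predictive capability.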

In the second case, Table 13 provides correlations between the TL dimensions (based on letters to shareholders) and two financial indicators: revenues/assets and profits/assets.

Table 13 Correlations with financial output variables of Fortune companies (182 obs)

Table 13 reveals significant correlations between several TL dimensions and the two output variables. For example, II is positively and significantly correlated with revenues/assets (0.214) and profits/assets (0.22), and IC with profits/assets (0.158). Likewise, TL (averaging scores of its dimensions) is also positively correlated with revenues/assets (0.123, at the 0.1 significance level) and profits/assets (0.187, at the 0.05 significance level). Furthermore, IS is positively correlated with profits/assets (0.10; nonsignificant at the 0.1 level).

Considering the evidence from Table 13 and Fig. 8, we can suggest that the developed scale of TL from texts presents traits of external validity based on criterion validity.

Examining equity of the scale

We employ data sets from two different sources. The first data set consists of 93 observations (“3. Managerial essay corpus…”, described in Table 11); the second comprises two random subsamples (1000 and 4000 obs) obtained from “1. Blog authorships corpus scores…” (see the data description in Table 11). We estimate five regression models for each sample size (93, 1000, and 4000 obs) using TL dimensions and their average (TL) independently as response variables, and gender and age as regressor variables (two of the most frequent factors involved in the diversity–validity dilemma; Martínez et al. 2006; Pérez-Rave et al. 2021b). In all cases, gender and age were not statistically significant (α: 0.05). In addition, we carried out bootstrap regressions using 8000 replicas in each sample size scenario. The percentile intervals at the 95% level for gender coefficients in IS regressions were: (− 0.1918, 0.0061) in small, (− 0.0425, 0.0363) in medium, and (− 0.0375, 0.0008) in large scenarios. Likewise, the intervals for II were: (− 0.1142, 0.1373) in small, (− 0.0348, 0.0335) in medium, and (− 0.0309, 0.0023) in large sample sizes. In all cases, the intervals included zero.
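The percentile bootstrap used for these intervals can be sketched in Python as follows. This is a simplified illustration, not the authors' code: the data are simulated with no true gender or age effect, the number of replicas is reduced to 1000 for speed, and the names `ols` and `bootstrap_percentile_ci` are hypothetical.

```python
import random

def ols(X, y):
    """Ordinary least squares via the normal equations (X'X) b = X'y,
    solved by Gauss-Jordan elimination with partial pivoting."""
    k = len(X[0])
    # Augmented matrix [X'X | X'y]
    A = [[sum(r[i] * r[j] for r in X) for j in range(k)]
         + [sum(r[i] * t for r, t in zip(X, y))] for i in range(k)]
    for c in range(k):
        pivot = max(range(c, k), key=lambda r: abs(A[r][c]))
        A[c], A[pivot] = A[pivot], A[c]
        for r in range(k):
            if r != c:
                f = A[r][c] / A[c][c]
                A[r] = [a - f * b for a, b in zip(A[r], A[c])]
    return [A[i][k] / A[i][i] for i in range(k)]

def bootstrap_percentile_ci(X, y, coef_index, replicas=1000, level=0.95, seed=1):
    """Percentile interval for one coefficient, resampling observations with replacement."""
    rng = random.Random(seed)
    n = len(y)
    estimates = []
    for _ in range(replicas):
        idx = [rng.randrange(n) for _ in range(n)]
        coefs = ols([X[i] for i in idx], [y[i] for i in idx])
        estimates.append(coefs[coef_index])
    estimates.sort()
    lower = estimates[int((1 - level) / 2 * replicas)]
    upper = estimates[int((1 + level) / 2 * replicas) - 1]
    return lower, upper

# Simulated data (an assumption): columns are intercept, gender (0/1), and age
rng = random.Random(7)
n = 93
X = [[1.0, float(rng.randint(0, 1)), rng.uniform(20, 60)] for _ in range(n)]
y = [rng.gauss(0.0, 0.3) for _ in range(n)]

low, high = bootstrap_percentile_ci(X, y, coef_index=1)
print(f"95% percentile interval for the gender coefficient: ({low:.4f}, {high:.4f})")
```

An interval that contains zero is read, as in the text above, as no evidence of a gender effect on the score at the chosen level.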

Interpreting the regression results globally with and without bootstrapping, the evidence suggests that traditional personal factors do not significantly affect the scores produced by the developed scale. Thus, a classic problem in structured questionnaires (diversity–validity dilemma) may not be so in textual data using PMTM.

Practical value of the scale (output of step 16 of PMTM)

To illustrate the practical value of the developed scale, we used a collection of 208 speeches by former US presidents, publicly available from Brown (2016; http://www.thegrammarlab.com). We chose the last four US presidents in the period 01/2001–01/2021: B.Clinton (39 obs); B.Obama (48 obs.); D.Trump (82 obs); and GW.Bush (39 obs). Figure 9 details the TL scores (dimensions and average) for the US presidents under analysis, using mean plots with confidence intervals at the 95% level.

Fig. 9 Mean plots with 95% confidence intervals for scores of the last four US presidents (01/2001–01/2021). Notes: We added a constant (10) to the scores for better visualization. TL (transformational leadership); II (idealized influence); IM (inspirational motivation); IS (intellectual stimulus); IC (individual consideration)

Figure 9 shows that the developed scale is valuable for analyzing individual differences. This analysis type is essential in describing the extent to which individuals are like one another (Loughry and McDonough 2002) and in predicting performance. Thus, the TL scale developed using PMTM allowed us to discover significant differences between President Trump and the others analyzed. Trump presented the lowest score in IM, IS, IC, and TL (average). However, regarding II, the results did not show notable differences between the presidents (“GW.Bush–B.Clinton” presented the largest difference, but the p value was 0.0695 using Tukey’s range test).

The results in Fig. 9 are consistent with other studies on TL in which language, behavior, and other aspects of Trump showed themselves more oriented toward transactional leadership than TL (Sternberg 2020). For example, according to Salter et al. (2017, p. 65), “Donald Trump used a greater percentage (M = 2.51%) of transactional words than Ted Cruz (M = 1.28%).” Furthermore, the high scores of Trump found in II (see Fig. 9) compared with the results in other dimensions (IM, IS, IC, and TL) are consistent with Lunbeck (2017), who highlights the fact that followers have a fascination with Trump and that he knows about this fascination. This charisma attribution is also found in Williams et al. (2020, p. 11), who use MLQ items (Avolio and Bass 2004, 1991) and conclude: “…Donald Trump [had] a higher score on the computed variable indicating perceptions that he is highly charismatic.”

In summary, this practical illustration, derived from the TL scale developed and validated using the PMTM framework, reveals that Trump presents a leadership style that differs from the other presidents considered. This distinctive feature is consistent with Fenner and Piotrowski (2018, p. 11): “the executive style of President Donald Trump has generated substantial empirical and theoretical attention.”

General discussion

This paper considers the fundamental basis for developing and validating psychological/managerial constructs from texts, which has been systematically nourished with linguistic, psychometrical, and computational resources derived from three integrative stages and 16 steps. From this systematization, the developed framework (PMTM) extends the previous valuable works by providing four contributions:

The first, focused on the creation and content validation of measurement models of psychological/managerial constructs reflected on textual data, is a procedure comprising the initialization, expansion, and use of PoS during the creation and validation of construct dictionaries. This strategy is inspired by a logical and frequent method of construct operationalization based on structured questionnaires in sentences (items) formed by combinations of adjectives, nouns, and verbs (and auxiliary words). Thus, this strategy assumes that the study of psychological/managerial constructs is not limited to self-reports, because how a person manifests their feelings, skills, or beliefs also comprises their natural language (speaking or writing). Thus, PMTM has demonstrated that language can be used as a means for this type of analysis, which is consistent, for example, with the notion that “the ability to talk about an emotion without it being physically present is a key component of natural language description of emotion” (Kazemzadeh et al. 2016, p. 5). According to Moulin (1992), a writer or orator chooses relevant information from the world, then builds a conceptual map (concepts and relationships) and expresses this through an oral or written discourse (linguistic level) to describe beliefs, sentiments, sensations, knowledge, behaviors, and attitudes. These expressions are highly subjective (Zhou and Zhang 2003) and, as responses to a structured questionnaire, natural language also represents underlying perceptions and diverse expressions about specific phenomenological/behavioral manifestations of constructs.

In summary, the proposed/developed framework offers more control, reproducibility, efficiency, and evidence-based decision-making during the construction and content validation of psychological/managerial constructs reflected in texts, thanks to five proposed/developed properties informing patterns of the linguistic environment of the text producer in an automated (or semi-automated) manner. In other words, after preparing the seed corpus, PMTM automatically transforms unstructured data (texts) into a structured format and carries out dictionary expansion steps whose performance (content validity) is based on and guided by five linguistic properties (commonality, polarity, coherence, convergence/differentiation, and PoS balance). This procedure expands the traditional standard for content validation in psychological/managerial constructs from texts, which is highly dependent on human tasks, subjective, limited to small data, and challenging to reproduce/replicate. Thus, PMTM facilitates the implementation of more complete and efficient tasks based on seed, scientific, and pragmatic corpora, and multiple linguistical/computational/psychometrical resources.

The second contribution is a new representation of constructs’ measurement models: a CFA with correlated-PoS. This new model representation considers three observable variables based on entities (nouns or noun phrases), qualities (adjectives or adjective phrases), and actions (verbs or verb phrases). Thus, from the semantic compositionality principle (phrase creation from word combinations; Fyshe 2015; Mitchell and Lapata 2010), a construct operationalized as a function of separated or aggregated PoS (in basic or composed formats) is more justifiable and linguistically complete than one merely comprised of a merged list of terms or random parcels of word lists. For example, Fyshe (2015) applied such a principle to advance, among others, the understanding and interpretability of phrase formation from combinations of separated nouns and adjective vectors. Haan et al. (2000), in the field of neuropsychology, provide evidence supporting a distinction in how the brain represents and processes nouns and verbs, reporting that verbs demanded more brain involvement than nouns. Martin et al. (1995) found a distinction in how actions (verbs) and color words (adjectives) are processed in the brain. Fyshe et al. (2019) also inform the literature of differences in brain representations of nouns and verbs during phrase formation processes.

On the other hand, PMTM recognizes that the success of a TM solution in business management fields should be determined by evidence that is standard in the usage domain (Strohmeier and Piazza 2013). However, the traditional exploratory scope of text/data solutions (data-driven approach) is insufficient to confirm psychological/managerial latent variables. Therefore, PMTM adopts a theory/text-driven approach to take advantage of both construct theory (e.g., definitions, dimensionality) and data methods/technologies based on linguistic, computational, and psychometrical resources, and demonstrates (using multiple data sets; primary/secondary data; individual/organizational environments; and casual/formal domains) that a CFA representation with correlated PoS satisfies internal/external validity, reliability, equity, and practical value properties.

The third contribution is the provision of original empirical evidence derived from the framework application in creating and validating a novel scale for measuring TL based on textual data, consisting of 1073 expressions: 424 nouns, 209 adjectives, and 440 verbs. This scale is not invasive and allows studies to cover multiple empirical manifestations (entities, qualities, and actions) of the most frequently recognized dimensions of this construct: II, IM, IS, and IC. Likewise, the scale can be used in multiple types of texts, such as individual (e.g., blog posts, interview transcriptions, open questions from surveys) or organizational documents (e.g., annual reports and letters to shareholders) in several natural settings, including formal (e.g., public enterprises reports, presidential speeches, job interviews) and casual (e.g., social media or autobiographies) environments.
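To make the idea of a dictionary organized by PoS more concrete, the following Python sketch derives three indicators (nouns, adjectives, verbs) from a text. It rests on loud assumptions: the word lists are invented placeholders and not entries of the actual TL dictionary, the scoring rule (share of matching tokens) is assumed for illustration, and multiword expressions are ignored.

```python
import re

# Hypothetical toy entries; NOT taken from the actual TL dictionary
TOY_DICTIONARY = {
    "nouns": {"vision", "team", "trust"},
    "adjectives": {"inspiring", "creative"},
    "verbs": {"motivate", "encourage", "listen"},
}

def pos_indicator_scores(text, dictionary=TOY_DICTIONARY):
    """Share of tokens matching each PoS word list (assumed scoring rule)."""
    tokens = re.findall(r"[a-z']+", text.lower())
    if not tokens:
        return {pos: 0.0 for pos in dictionary}
    return {
        pos: sum(token in terms for token in tokens) / len(tokens)
        for pos, terms in dictionary.items()
    }

example = "Leaders encourage the team and listen to build trust"
scores = pos_indicator_scores(example)
for pos, value in scores.items():
    print(pos, round(value, 3))
```

Under these assumptions, each text yields one observed value per PoS, which is the kind of input that a measurement model with three PoS-based indicators would require.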

Comprehensively considering the discussed insights as a whole, the fourth contribution of the paper is a novel framework (PMTM) capable of inspiring, supporting, guiding, renewing, and reconfiguring future academic or business works in the creation, validation, and use of constructs (reflected in texts) involved in business management and related areas, such as marketing research. For example, PMTM provides conceptual and methodological guides to contribute to the challenge recently put forth by Cronin Jr. (2021), which comprises the need for new perspectives and proposals to comprehensively and efficiently conceptualize and operationalize marketing constructs and transcend the traditional dependency on questionnaire data.

Thus, marketing research and practice can benefit from the proposed/developed framework by exploiting service and customer texts to create measurement models (and obtaining their constructs’ scores) of psychological/managerial constructs considering scientific papers, social media, open questions from questionnaires, interview transcriptions, and stakeholder questions/complaints/claims. Likewise, organizations can use the measurement models (and their construct scores) produced through the proposed/developed framework to (a) diagnose levels of service quality, satisfaction, trust, value, and behavioral intentions; (b) create customer hyper-segmentation strategies to transcend the traditional use of sociodemographic factors (e.g., age, gender); and (c) relate construct scores with financial/social/environmental performance measures to better understand the contributions of marketing strategies on organizational performance in the context of the digital economy.

On the other hand, from a theoretical perspective of data science, the proposed/developed framework (PMTM) can also be understood (and hence used) considering the three (compression, probabilistic, and microeconomic) theoretical perspectives of data mining argued by Mannila (2000). Applying the compression perspective to the present scenario, PMTM can be understood (and used) as a set of capabilities oriented to find, in texts from multiple domains (for dictionary creation), the underlying valuable qualities, perceptual entities, and actions (compressed data) suggested by a construct under study, thus controlling its PoS. From the lens of the probabilistic TM perspective, PMTM can be understood (and used) as a set of capabilities focused on discovering certain structures from compressed data either reflected in PoS frequencies (e.g., during content validation) or construct scores (e.g., during internal/external validation).

Therefore, linguistical and computational resources such as cleaning data, PoS tagging, collocations based on grammar patterns derived from chunks, cosine similarity, distributed dictionary representations, and word embeddings are justified and included in PMTM to discover valuable underlying patterns in the format of compressed data. Likewise, traditional procedures in structured questionnaire data analysis, such as Pearson’s chi-squared tests, correlation analysis, and CFA (with correlated PoS using basic or composed PoS formats), are justified and included in PMTM to examine whether there is reasonable empirical evidence regarding certain beliefs/assumptions about the object of study (probabilistic perspective).
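Among the resources listed, cosine similarity is simple enough to show directly. The Python sketch below is a generic illustration; the three-dimensional vectors are made-up stand-ins for word embeddings and do not come from any trained model or from the study.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two vectors of equal length."""
    if len(u) != len(v):
        raise ValueError("vectors must have the same length")
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        raise ValueError("cosine similarity is undefined for zero vectors")
    return dot / (norm_u * norm_v)

# Hypothetical toy vectors standing in for embeddings of a seed term and two candidates
seed_term = [0.9, 0.1, 0.3]
candidate_a = [0.8, 0.2, 0.4]
candidate_b = [-0.5, 0.9, 0.0]

print(round(cosine_similarity(seed_term, candidate_a), 3))
print(round(cosine_similarity(seed_term, candidate_b), 3))
```

Values close to 1 indicate vectors pointing in nearly the same direction, which is why the measure is commonly used to compare a candidate term with seed terms during dictionary expansion.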

In addition, considering the TM microeconomic perspective (Mannila 2000), PMTM can be understood (and used) as a set of capabilities to discover actionable patterns; that is, decision “x” that leads to the maximum utility f(x). This view enabled the inclusion of machine-learning methods in PMTM, which involve splitting the sample into at least two subsets, one for training models (or patterns) and another for validation (Pérez-Rave et al. 2019). Although machine learning is one of TM’s most popular approaches, its use in developing and validating psychological/managerial constructs is scarce. Hence, we consider the suggestions of Pandey and Pandey (2017) and Ponizovskiy et al. (2020) concerning the need to examine ways to incorporate machine-learning methods into the development/validation of constructs. Thus, PMTM includes one new method (embedded voting) and a new property (embedded-content convergence/differentiation), both of which are based on embedded vectors (e.g., word2vec). In addition, we also use logistic regression, classification trees, and bagging to analyze predictive capability from a supervised perspective.

Conclusions and future work

The present study recognizes the contributions of Pandey and Pandey (2017) and Ponizovskiy et al. (2020) and their predecessors (e.g., Short et al. 2010) and extends them toward a more enriched framework (PMTM) to develop and validate psychological/managerial constructs using TM considering linguistic, psychometrical, and computational resources. The evidence obtained from the application of the framework demonstrates that (a) the combination of linguistic properties and psychometrical/computational resources from a text-driven approach effectively/efficiently guides the development and validation of psychological/managerial constructs from texts; and (b) measurement models (from texts) operationalized as a function of correlated PoS emulating qualities, perceptual entities, and actions present a good performance in several corpora regarding internal/external validities, reliability, equity, and practical value.

In future work, the PMTM framework should be validated considering other psychological/managerial constructs and contexts. Another future line of study is conducting a sensitivity analysis of the proposed linguistic properties considering several violations and nonviolations of content validity. This opportunity will facilitate the examination of the extent to which such automated properties (and their methods, for example, embedded voting) may not only complement human tasks but also replace them. Future studies can also benefit from this paper by integrally using the PMTM framework or one or more of its stages, properties, and methods, considering the compression, probabilistic, and microeconomic perspectives from which PMTM was interpreted/discussed.

In the marketing field, this paper can be used as a foundation on which to exploit customer/service textual data using the proposed/developed framework, which considers a theory/text-driven approach and incorporates linguistic, psychometrical, and computational resources. Thus, this paper serves as a comprehensive, reproducible, and efficient template to transcend the dependency on questionnaire data and overcome the current lack of standardization and inconsistency in the operationalization of marketing constructs.