Physical punishment is a concerning societal problem that violates children’s rights to protection from violence. Solid research shows that physical punishment negatively affects children’s adjustment across multiple domains of functioning (e.g., emotional, interpersonal, cognitive) on both a short- and long-term basis (Gershoff, 2002; Gershoff & Grogan-Kaylor, 2016). A meta-analysis of 75 studies that focused exclusively on spanking yielded 79 unique significant effect sizes; of these, 99% indicated associations between spanking and adverse child outcomes, including aggression, antisocial behaviour, externalising and internalising behaviour problems, negative relationships with parents, lower moral internalisation, and slower cognitive development (Gershoff & Grogan-Kaylor, 2016). The effect sizes did not vary based on whether studies were longitudinal, experimental, or cross-sectional in design. Furthermore, the relationships between spanking and child outcomes were similar to those between more severe physical abuse and the same child outcomes. Given the pervasive adverse effects of punitive parental behaviour, examining some critical cognitive components underlying such rearing practices is essential.

Social Information Processing (SIP; Milner, 2003) theory focuses on some critical cognitive processes underlying parental punitive behaviour. SIP theory suggests that physical force against a child can be explained by a series of parental cognitions, including unrealistic expectations of the child, lack of perceived parental control, and hostile attribution biases (Milner, 2003). This important set of parental cognitions can also be classified as follows: (1) evaluative cognitions, such as attitudes toward child-rearing, which are characterised by pre-existing beliefs regarding the use of specific parenting practices; (2) analytical cognitions, including “conflict-promoting attributions” that are centred around attributing responsibility to the child for challenging behaviour; and (3) self-efficacy cognitions that involve parents’ perceptions of their sense of competence in their parenting role and their disciplinary styles (Bugental & Johnston, 2000).

Given the importance of parental cognitions, it would seem critical to design assessment tools that capture various related aspects, such as attitudes, attributions, and parenting self-efficacy linked to parental punitive behaviour. Therefore, the main objective of this study was to examine the factor structure of the Positive Discipline in Everyday Parenting Questionnaire (PDEPQ), a measurement tool designed to assess critical cognitive components of punitive parental practices. Moreover, we made some revisions to the original version of the instrument and assessed the validity and reliability of the revised measure of parental cognitions.

The Positive Discipline in Everyday Parenting Questionnaire

The Positive Discipline in Everyday Parenting questionnaire (PDEPQ) is rooted in SIP theory, which attributes punitive behaviour to punishment-related parental cognitions. Durrant et al. (2014) designed this questionnaire to measure cognitive predictors of physical and non-physical punishment, which are the targets of the Positive Discipline in Everyday Parenting programme (PDEP; for more information, see Durrant, 2007, 2013). The main programme objectives are to reduce parents’ support for physical punishment, shift attributions around child behaviours, and improve parenting self-efficacy. The measure was initially developed in 2012 to gather data about PDEP outcomes as the programme evolved and was piloted in many countries, particularly throughout Asia. However, several issues with the initial version of the PDEPQ were identified.

First, there is limited information on the psychometric characteristics of the PDEPQ. Moreover, Durrant et al. (2014) reported that the reliability of the Parenting Self-Efficacy scale is in the low range (α = 0.49–0.52). Therefore, a detailed psychometric analysis is required to increase confidence in the primary questionnaire used to evaluate the PDEP programme. This paper presents two studies involving a series of item-level analyses to determine the psychometric properties of the PDEPQ and then apply all the necessary modifications to achieve a more reliable and valid measurement instrument. These two steps resulted in a revised version of the PDEPQ, which can be reliably used in programme evaluation and parenting assessment. The PDEPQ includes the following scales: Approval of Physical Punishment; Approval of Non-Physical Punishment; Subjective Norms; and Parenting Self-Efficacy.

Approval of Physical Punishment

The Approval of Physical Punishment scale measures the degree to which parents favour using physical punishment (i.e., favourable attitudes) and covers the age range from young children to adolescents. This scale includes items tapping into parental attitudes toward various types of physical punishment, including spanking, slapping, and pinching. Generally, attitudes are “a mental and neural state of readiness, organised through experience, exerting a directive or dynamic influence upon the individual’s response to all objects and situations with which it relates” (Allport, 1935, p. 810). In child-rearing practices, attitudes toward physical punishment are critical factors in determining the strategies parents are likely to use (Holden, 2020; Vittrup et al., 2006). For instance, Ateah and Durrant (2005) identified four variables as most predictive of parental physical punishment use, namely (1) approval of physical punishment use, (2) attributions for the child’s transgression, (3) perceived seriousness of the child’s transgression, and (4) anger in response to the transgression. These four variables (except for parental anger, which is not classifiable as a parental cognition) accounted for 54% of the variance in parents’ physical punishment use.

Moreover, approval of physical punishment emerged as the strongest predictor in a model that included other variables such as a mother’s childhood history of physical punishment, knowledge of child development, and repertoire of positive disciplinary strategies (Ateah & Durrant, 2005). Perron et al. (2014) studied 2340 Canadian parents of children 2 to 12 years old. They found that, after controlling for child- and family-related factors, a favourable parental attitude toward spanking was the strongest predictor of the use of spanking. In addition, parents who held pro-spanking attitudes were nearly five times more likely to endorse physical punishment.

Interestingly, research on child-rearing attitudes has highlighted a gap in the measurement of parental attitudes toward punishment, partially due to the limited availability of valid and reliable measures (Holden & Buck, 2002). Current measurement instruments used to assess punitive parenting have limitations in that they focus on spanking and exclude other forms of physical punishment, such as slapping (i.e., hitting any part of a child’s body with the open palm), commonly used with older children and adolescents. Moreover, no existing measure is available on attitudes toward non-physical punishment. The PDEPQ is unique because it includes a scale to measure common forms of non-physical punishment, such as verbal threats, ridicule and time-out. Such responses could potentially lead to a child’s sense of rejection and isolation and, as such, are considered detrimental in that they might undermine the child’s confidence and self-esteem (Gittins & Hunt, 2019).

Approval of Non-Physical Punishment

The Approval of Non-Physical Punishment scale measures the degree to which parents favour using non-physical punishment (e.g., time-out, grounding, shaming). This is a construct that has not received much empirical investigation at the assessment and measurement levels. The use of non-physical punishment is a widely debated topic that has generated interest among researchers in parenting (Holden et al., 2017; Larzelere et al., 2017). The most common type of non-physical punishment is time-out, which is a form of negative punishment that involves removing a child from the situation that caused challenging behaviour and placing them in an environment that is “less” reinforcing for an amount of time deemed reasonable (usually 2–5 min; Dadds & Tully, 2019). Dadds and Tully’s (2019) review highlighted variations in the delivery of time-out, whether inclusionary (e.g., time-out chair) or exclusionary, and how different modalities might impact behavioural outcomes and the child’s emotional and social development.

Time-out is presented as an effective strategy to deal with challenging behaviour in several widely used evidence-based programmes, including Triple P (Sanders et al., 2014), Parent–Child Interaction Therapy (Querido et al., 2002), and Incredible Years (Webster-Stratton, 2001). Furthermore, this method has also been explicitly recommended by the American Academy of Pediatrics (2018) as a viable alternative to physical punishment. Despite immediate positive effects of time-out on child compliance that have been proposed as a potentially beneficial aspect of this procedure, other essential features of the context in which these disciplinary measures are implemented (e.g., attachment relationships, communication style, social context) need to be considered to gauge their overall impact on child development (Dadds & Tully, 2019).

Although there are claims about the effectiveness of time-out in addressing children’s behavioural challenges (Kaminski & Claussen, 2017; Larzelere et al., 2017), its use as an appropriate (and effective) disciplinary strategy remains debatable. Results from an international study involving China, India, Kenya, the Philippines, and Thailand, with a sample of 292 families, indicated that the frequency of use of time-outs was linked to greater internalising symptoms among 8- to 12-year-olds (Gershoff et al., 2010). Moreover, no evidence exists that using time-out is superior to non-behavioural disciplinary techniques in fostering positive developmental outcomes. For instance, a recent longitudinal study reported no differences in the emotional and behavioural functioning of children whose parents used time-outs regularly relative to those who did not engage in this parenting practice. Knight et al. (2020) concluded that time-outs have neutral (non-detrimental) effects on children’s psychological well-being. However, the authors did not include positive outcomes in their study, making it difficult to reach a definite conclusion on the beneficial effects of time-outs on children’s behaviour. Furthermore, longitudinal studies on the prolonged positive effects of time-out are lacking, pointing to the need for additional research to determine the impact of this parenting technique on children’s outcomes. Therefore, it is essential to design measurement instruments that capture other aspects of non-physical punishment, including time-out.

Subjective Norms

The Subjective Norms scale of the PDEPQ measures parental attributions of children’s behaviours; it includes items for all primary age groups (infancy through to adolescence) and is not constrained to a target child. Subjective norms are parents’ interpretations of children’s behaviour as developmentally normative, as opposed to an intentional act on the part of the child (Durrant et al., 2014). For instance, a negative subjective norm might be the belief that children say “no” to make their parents mad. Such attributions are likely to increase punitive parental responses (Holden, 2020). Similarly, parents’ negative attributions for children’s behaviour have been shown to relate to harsh parenting (Crouch et al., 2017; Sturge-Apple et al., 2020) and an increased likelihood of physical punishment use (Ateah & Durrant, 2005; Rodriguez & Wittig, 2019).

Concerning subjective norms, several parental self-report questionnaires tap into attributions for children’s challenging behaviours. Most measures assess the degree to which parents attribute negative intentionality to child misbehaviour (e.g., the child is mad at the parent, wants to embarrass the parent, does not care). However, most existing measures focus on specific child behaviours and exclude underlying assumptions concerning children of various age groups. A novel aspect of the Subjective Norms subscale of the PDEPQ is the focus on parents’ negative attributions for typical behaviours of children of different ages to allow the capture of various ranges of attributions from parents.

Parenting Self-Efficacy Scale

Finally, the Parenting Self-Efficacy scale measures parents’ sense of self-efficacy and competence in their parenting role. This scale was designed to measure parenting self-efficacy both in general and in dealing with challenging child behaviours without conflict. Parenting self-efficacy refers to beliefs concerning one’s abilities to manage the daily tasks of parenthood and a sense of competence in one’s parenting role (Fang et al., 2021). Parenting self-efficacy can also impact how parents respond to conflict with their children. For example, low parenting self-efficacy has been related to frequent use of physical punishment (Khoury-Kassabri et al., 2014; Rominov et al., 2016). Conversely, high levels of parenting self-efficacy have consistently been associated with competent and positive parenting practices (Hamovitch et al., 2019).

Concerning existing measures of parenting self-efficacy, domain-general self-efficacy focuses on the global assessment of self-efficacy (i.e., overall sense of competence) and is considered independent of domain-specific self-efficacy (i.e., the individual’s self-perceived ability in a specific parenting task; Coleman & Karraker, 2000). A systematic review of 76 studies identified 34 self-report measures of parenting self-efficacy (Wittkowski et al., 2017), 22 of which measured domain-specific self-efficacy. These scales were compared on their psychometric qualities (e.g., content and construct validity, reliability) and general characteristics (e.g., ease of scoring, interpretability). Among the scales reviewed, the Self-efficacy for Parenting Tasks Index (SEPTI; Coleman & Karraker, 2000) did not receive a high ranking compared to other domain-specific scales, particularly in terms of content validity (i.e., receiving a score of 1—indicating doubtful methodology) and test-retest reliability (i.e., receiving a score of 0—indicating no information available on the replicability of the score; Wittkowski et al., 2017). Moreover, the Discipline subscale of the SEPTI assesses perceived self-efficacy concerning disciplinary measures but does not reference physical punishment or managing conflict with children. The Parenting Self-Efficacy scale of the PDEPQ is unique in that it includes elements of both global and domain-specific self-efficacy specifically tailored to assess confidence in resolving conflict without using punishment, and it is not constrained to only young children. Therefore, these items are applicable to both young children and adolescents. In addition, there was an effort to avoid language that assumed the child’s intent, such as the word ‘misbehaviour.’ Instead, specific behaviours were described so parents could generate spontaneous responses.

In sum, the four scales that are part of the PDEPQ allow researchers to capture some critical predictors of parental punishment that existing measures lack. In this study, the validity of the PDEPQ was examined in terms of content and construct. Moreover, a detailed psychometric evaluation of the measurement instrument, including a series of item-level analyses and a reliability assessment, was undertaken to ensure the development of psychometrically sound scales designed to measure cognitive dimensions of parenting related to parental punishment.

Study Objectives

Through two studies, we aimed to evaluate the PDEPQ-R (revised version) and gather data on its psychometric properties. Study 1 assessed its content validity via two samples: (1) a group of professionals in parenting research, child development, and psychometric assessment and (2) a group of parents. Following questionnaire modifications based on Study 1 findings, the goal of Study 2 was to gather data from a sample of 400 parents to examine the psychometric properties of the PDEPQ-R using Item Response Theory (IRT) analyses. The sample data were also used to gather evidence to support the construct validity of the four scales and their internal consistency. Finally, a sub-sample (n = 100) of parents was randomly selected among the 400 parents who completed the PDEPQ-R to complete the same questionnaire (i.e., no other measure was administered) four weeks following the initial administration to assess its test-retest reliability.

One significant advantage of this study was the combination of Classical Test Theory (CTT) and IRT methodologies to assess the psychometric characteristics of the PDEPQ-R scales. Generally, Cronbach’s alpha has been criticised as being population-dependent (i.e., the coefficient changes depending on the test-taker sample) and subject to possible inflation as the number of items in a given scale increases (Reise & Henson, 2003). On the other hand, the IRT approach is not only population-independent (i.e., consistent estimates across different populations), but it also provides a reliability measure for each item (e.g., item information curve) at different levels of the underlying trait so that each item is associated with information parameters (De Ayala, 2013). Thus, the use of IRT, in addition to the traditional reliability assessment (i.e., Cronbach’s alpha), constitutes an advantage in interpreting the reliability of a measurement instrument in that it provides valuable information at various stages of scale development. In addition, this study used quantitative and qualitative techniques to examine the psychometric characteristics of the PDEPQ-R. These techniques were used sequentially, beginning with the content validity assessment (professional and parent feedback) and associated modifications, followed by the IRT analyses.
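
To make the CTT side of this comparison concrete, Cronbach’s alpha can be computed directly from item-level responses. The following is a minimal Python sketch and not the procedure used in this study; the function name `cronbach_alpha` and the response matrix are hypothetical and serve only as an illustration.

```python
from statistics import variance


def cronbach_alpha(responses):
    """Cronbach's alpha for a list of respondents, each a list of item scores."""
    n_items = len(responses[0])
    # Sample variance of each item across respondents
    item_vars = [variance([row[j] for row in responses]) for j in range(n_items)]
    # Sample variance of the summed scale score
    total_var = variance([sum(row) for row in responses])
    return (n_items / (n_items - 1)) * (1 - sum(item_vars) / total_var)


# Hypothetical ratings from five parents on a four-item, 6-point scale
data = [
    [1, 2, 1, 2],
    [2, 2, 3, 2],
    [3, 4, 3, 3],
    [5, 4, 5, 4],
    [6, 5, 6, 6],
]
alpha = cronbach_alpha(data)
print(round(alpha, 3))
```

Because the coefficient is built from the item and total-score variances observed in a particular sample, the same items can yield a different alpha in a different group of respondents, which is the population dependence noted above.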

Study 1. Content Validity Analysis of the Revised PDEP Questionnaire

A content-based validity assessment of the PDEPQ-R evaluated whether the items composing the four scales were consistent with the constructs that they were intended to measure.

Methods

Participants and Procedures

The first sample comprised eight professionals from Canada and the U.S. whom we identified as having expertise in child development, parenting research, programme evaluation, and psychometric assessment. Twelve individuals were initially contacted by email to participate as content experts for the PDEPQ-R, of whom eight took part. The sample of eight professionals was predominantly female (87.5%) and White (87.5%), with a mean age of 44.9 years (SD = 12.2).

The second sample was composed of 10 parents recruited through two local community centres offering parenting services. Participants were included in the study if they were adults who reported being comfortable with the English language. Two focus groups (n = 5 each) were held, and this sample size of 10 is consistent with similar research employing focus groups as part of a content validity assessment (Barnes & Adamson-Macedo, 2007; Marsteller et al., 2017). Each parent focus group lasted about two hours. In each group, an overview of the study was presented, and then parents completed a consent form, a socio-demographic questionnaire, and a content validity assessment form. Subsequently, the parents took part in the focus group discussion. The focus group was not audiotaped to increase parents’ comfort level; however, field notes were taken. For both objectives (i.e., expert panel and focus group), approval was obtained from our university’s Office of Research Ethics and Integrity before study commencement. Consent for the expert panels was sought when the online questionnaire was administered, whereas physical copies of consent were provided to parents participating in the focus group.

The parent sample was composed exclusively of mothers whose average age was 35.2 (SD = 4.6) years. Most participants (75%) were married. One-third of the participants indicated that their ethnic/racial background was Arab/Middle Eastern (33.3%), followed by White (16.7%), Latin American (16.7%), South Asian (16.7%), West Asian (8.3%), and Black (8.3%). Most mothers completed post-secondary education (66.7%) and were not employed outside the home (66.7%). Household income, in Canadian dollars, was distributed as follows: $20,000–$49,999 (40%); $50,000–$89,999 (20%); and $90,000–$109,999 (40%).

Measures

PDEP questionnaire (PDEPQ)

The first column of Table 1 shows the version of the PDEPQ used as the starting point for the content validity evaluation. It is composed of four scales. Approval of Physical Punishment consists of 13 items on parental attitudes toward physical punishment. Each item is rated along a 6-point Likert scale from 1 (strongly disagree) to 6 (strongly agree), with higher scores indicating greater approval of physical punishment. Approval of Non-Physical Punishment consists of 10 items on parental attitudes toward non-physical punishment. Each item is rated along a 6-point Likert scale from 1 (strongly disagree) to 6 (strongly agree), with higher scores indicating greater approval of non-physical punishment. The Subjective Norms scale has 10 items to assess parents’ negative attributions of typical child behaviours. Participants are asked to indicate their agreement with statements attributing typical parent-child conflict to intent on the part of the child. Parents respond to each item on a 6-point Likert scale from 1 (strongly disagree) to 6 (strongly agree), with higher scores indicating a stronger belief that typical child behaviour is attributable to intentional misbehaviour. Parenting Self-Efficacy consists of 9 items that measure the degree to which parents feel competent and confident in their ability to manage challenging parenting situations. Each item is rated along a 5-point Likert scale from 1 (never) to 5 (always), with higher scores indicating greater parenting self-efficacy.

Table 1 Changes to PDEP Items Based on the Content Experts’ and Parents’ Feedback

For the content validity assessment, the items of each of the four scales were rated on their representativeness (i.e., how reflective the item is of the underlying construct) using a 3-point scale from 1 (slightly representative) to 3 (highly representative) and their clarity (i.e., how clear the item is) using a 3-point scale ranging from 1 (low clarity) to 3 (high clarity). Participants were also asked to provide any suggestions they might have on how to improve the items.

Data Analysis

Descriptive analyses were employed to examine mean representativeness and mean clarity for each item using SPSS version 26. We also considered participants’ comments by way of focus group field notes that were examined using a note-based analysis, which is appropriate when the purpose of a study is narrowly defined (e.g., pilot-testing items; Krueger, 2014). The first author read all the notes to understand the main ideas communicated by the participants. Next, codes were generated to capture themes, and then similar participant comments were assigned to the same codes.

Results

Experts’ mean ratings of the representativeness of the Approval of Physical Punishment items were high (M = 2.8; SD = 0.3), whereas parents rated the items as moderately representative (M = 2.5; SD = 0.5). The mean clarity rating for professionals and parents was 2.7 (SD = 0.5), indicating high clarity for the items on this scale. The language in item APP6 was flagged as problematic because the term “bad words” was too specific. The term “drag him out” in item APP9 was considered inappropriate by several professionals and parents. Moreover, professionals and parents highlighted the term “dangerous” in item APP13 as lacking contextual information. See Table 1 for additional details on the items flagged as problematic during this phase.

The mean representativeness of the Approval of Non-Physical Punishment items was high for professionals (M = 2.7; SD = 0.3) and parents (M = 2.6; SD = 0.4). Mean clarity ratings were also high among professionals (M = 2.6; SD = 0.5) and parents (M = 2.7; SD = 0.4). However, parents reported several concerns with item wording. They indicated that the term “extra chores” in item ANP1 was too general. Both professionals and parents also indicated that the term “punish” in item ANP2 was too general and could be misinterpreted as physical punishment. The term “hairstyle” in ANP3 was also noted as odd. Both professionals and parents indicated that item ANP4 was too general and could include children’s favourite activities (e.g., electronics, sleepovers) instead. Several parents flagged item ANP6 because sibling rivalry is not exclusive to one sex. Most parents (and several professionals) agreed that item ANP8 was too harsh and implied humiliation and should be changed to reflect more inductive forms of punishment. Parents indicated that this item could be reworded to match the definition of non-physical punishment (i.e., taking away privileges).

Subjective Norms items’ mean representativeness was high for professionals (M = 2.7; SD = 0.3) and parents (M = 2.6; SD = 0.6). Mean clarity ratings were also high among professionals (M = 2.7; SD = 0.5) and parents (M = 2.8; SD = 0.4). The term “are rude” in SN3 was substituted with an alternative phrase. One parent expressed difficulties understanding some terminology (e.g., defiant in SN1, tantrum in SN2). Most parents thought Item SN10 reflected a gender stereotype and suggested removing it.

Professionals rated the Parenting Self-Efficacy items as moderately representative (M = 2.5; SD = 0.6), whereas parents’ mean representativeness rating was high at 2.8 (SD = 0.5). Mean clarity was moderate for professionals (M = 2.5; SD = 0.2) and high for parents (M = 2.9; SD = 0.4). Professionals raised concerns that items SE6 and SE9 were not representative of the self-efficacy construct.

Conclusion

Overall, the ratings from professionals and parents indicated that the items adequately represented the four PDEPQ constructs and were clearly worded. However, the two participant groups also made several suggestions for rewording. We reviewed the questionnaire items considering the feedback and made several changes, especially when the feedback was similar across both participant groups. The revised version of the PDEPQ is outlined in the second column of Table 1. It is also important to note that each PDEPQ scale was retained as a unidimensional measure of a separate construct. Therefore, scale scores are not meant to be summed or combined into a total score.

Study 2. Examining the Psychometric Properties of the Revised PDEP Questionnaire

This second study had two objectives. The first objective was to assess the revised PDEPQ (PDEPQ-R) psychometric properties using Item Response Theory (IRT) methodology. The second objective was to gather information on the convergent validity and reliability of the PDEPQ-R after incorporating changes from the IRT analyses.

Methods

Participants and Procedures

Data collection was conducted online. Parents were recruited across several community organisations in Canada through study links posted on agency websites and social media platforms. Inclusion criteria were: (1) having at least one child between birth and 18 years; (2) being at least 18 years of age; (3) having access to a computer; and (4) being able to read English. Parents who agreed to participate in the study were invited to enter a draw for one of four $40 gift cards. Online recruitment for the first part of the study ended after achieving the minimum sample size required to carry out our analyses with sufficient statistical power (n = 400; Hulin et al., 1983). A subsample of parents (n = 100) was randomly selected and then contacted via email four weeks after the initial assessment for the test–retest reliability check.

The sample of 400 parents was predominantly female (93.8%) and White (92.8%). The mean age was 35.2 years (SD = 4.6), and the majority (68.5%) were married. The number of children in the home ranged from 1–5, with a mean of 2.0 (SD = 0.9) and a mean age of 7.1 years (SD = 4.9). Most parents had obtained a post-secondary degree (81.6%), and 62.8% were employed outside the home. Household income was distributed as follows, in Canadian dollars: less than $20,000 (10.3%); $20,000–$49,999 (14.5%); $50,000–$89,999 (33.6%); $90,000–$119,999 (21.3%); $120,000–$150,000 (10.3%); and over $150,000 (10%).

The subsample of 100 parents who participated in the test-retest reliability portion of the study was also predominantly female (94%) and White (92.8%). Their mean age was 36.3 years (SD = 7.3), and the majority (67%) were married. The number of children in the home ranged from 1–5 (M = 1.9, SD = 0.8), and the mean child age was 8.1 years (SD = 6.7). Most parents had obtained a post-secondary degree (86.0%) and were employed outside the home (63%). Household income was distributed as follows, in Canadian dollars: less than $20,000 (13%); $20,000–$49,999 (17%); $50,000–$89,999 (34%); $90,000–$119,999 (14%); $120,000–$150,000 (6%); and over $150,000 (16%).

Measures

Revised PDEP Questionnaire

The PDEPQ-R has four scales: Approval of Physical Punishment, Approval of Non-Physical Punishment, Subjective Norms, and Parenting Self-Efficacy. A total of 14 items were modified, and one was deleted (see Table 1).

Attitudes Toward Spanking Scale (ATS; Holden et al., 1995)

This scale consists of 10 statements (e.g., Sometimes a spank is the best way to get my child to listen) rated on a 7-point scale from 1 (strongly disagree) to 7 (strongly agree). The ATS measures parents’ attitudes toward physical punishment, with a higher score indicating a more positive attitude. The measure has demonstrated good internal consistency in a sample of college-educated mothers (α = 0.81–0.91; Holden et al., 1995) and the current sample of 400 parents (α = 0.87).

Parent Cognition Scale (PCS; Snarr et al., 2009)

The PCS consists of 30 items to assess how parents endorse child-responsible or parent-causal attributions for challenging child behaviour. Respondents are asked to think about their child’s challenging behaviour over the past two months and rate possible causes on a 6-point scale from 1 (always true) to 6 (never true). Items are reverse-scored so that higher scores indicate greater endorsement of the attributions (either child-responsible or parent-causal). There are nine child-responsible attributions (e.g., My child tries to get my goat or push my buttons), seven parent-causal attributions (e.g., I’m not patient), and 14 distractor items that are not used in the scoring (e.g., My child is in a stage). For this study, the Child-Responsible Attribution subscale was used to assess convergent validity because of its conceptual similarity with the Subjective Norms scale of the PDEPQ-R. The Child-Responsible subscale has been found to possess good internal consistency in a community sample of couples (α = 0.81; Snarr et al., 2009). In addition, the Child-Responsible subscale had excellent internal consistency in the current sample of 400 parents (α = 0.90).
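
The reverse-scoring step described above can be illustrated with a short Python sketch. It assumes a 1 to 6 response format as described; the helper name `reverse_score` and the ratings are hypothetical, and summarising the subscale as a mean is an assumption made here for illustration and is not taken from the PCS scoring instructions.

```python
def reverse_score(rating, low=1, high=6):
    """Flip a rating so that 1 (always true) becomes 6 and 6 (never true) becomes 1."""
    if not low <= rating <= high:
        raise ValueError(f"rating {rating} is outside the {low}-{high} range")
    return low + high - rating


# Hypothetical raw ratings from one parent on the nine child-responsible items
raw = [1, 2, 6, 3, 2, 1, 4, 5, 2]
scored = [reverse_score(r) for r in raw]

# Assumption: the subscale is summarised as the mean of the reverse-scored items
subscale_mean = sum(scored) / len(scored)
print(scored, round(subscale_mean, 2))
```

After reverse-scoring, a parent who rated most child-responsible causes as "always true" obtains a high subscale score, in line with the interpretation that higher scores indicate greater endorsement.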

Self-Efficacy for Parenting Tasks Index, Discipline subscale (SEPTI; Coleman & Karraker, 2000)

The Discipline subscale of the SEPTI includes eight items (e.g., I have more trouble with discipline than any other aspect of parenting) measured along a 6-point scale ranging from 1 (strongly disagree) to 6 (strongly agree). It assesses perceived self-efficacy concerning disciplinary measures and has been found to possess high internal consistency in a sample of mothers (α = 0.86; Coleman & Karraker, 2000). In addition, we found good internal consistency for this subscale in the current sample of 400 parents (α = 0.86).

Data Analyses

IRT Analyses

Before the IRT analyses, the PDEPQ-R items were examined for the accuracy of data entry, missing values, and fit between their distributions and the assumptions of multivariate analysis. Samejima’s (1969) Graded Response Model (GRM) was used as the primary response function model for the IRT analyses. The GRM is appropriate for polytomous data, such as Likert-type items with an underlying response continuum. IRTPRO 4.20 (Cai et al., 2011) was used to conduct these analyses, with Maximum Likelihood (ML) as the default estimation method (Cai et al., 2011). For each of the four PDEPQ-R scales, unidimensionality and local dependence were examined. Unidimensionality was assessed by computing the ratio of the first to the second eigenvalue in a principal component analysis, with values greater than 2 indicating unidimensionality (Hambleton et al., 1991; Lord, 1980). Moreover, the variance explained by the first factor should be at least 20% of the total variance (Reckase, 1979). This test was conducted prior to the IRT analyses to ensure that unidimensionality held. Local dependence (i.e., dependence among items) was assessed using the LD-χ2 statistic, with values exceeding 10 indicating a violation of local independence (Chen & Thissen, 1997).
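To illustrate the response function, the following minimal sketch computes GRM category probabilities for a single item. It is not the IRTPRO implementation; the function name and the parameter values are hypothetical and were chosen only for demonstration.

```python
import math


def grm_category_probs(theta, a, thresholds):
    """Category probabilities under Samejima's Graded Response Model.

    theta: latent trait level; a: discrimination (slope);
    thresholds: ordered b parameters (one fewer than the number of categories).
    """
    # Cumulative probabilities P*(X >= k), padded with 1 (lowest category)
    # and 0 (beyond the highest category).
    cumulative = [1.0]
    for b in thresholds:
        cumulative.append(1.0 / (1.0 + math.exp(-a * (theta - b))))
    cumulative.append(0.0)
    # Each category probability is the difference of adjacent cumulative curves.
    return [cumulative[k] - cumulative[k + 1] for k in range(len(cumulative) - 1)]


# Hypothetical 6-category item (five thresholds); values are not from Table 3.
probs = grm_category_probs(theta=0.0, a=1.5, thresholds=[-0.5, 0.5, 1.5, 2.5, 3.5])
print([round(p, 3) for p in probs])
```

Because the thresholds are ordered, the cumulative curves decrease from one category boundary to the next, so every category probability is positive and the probabilities sum to one.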

Following an assessment of model dimensionality, model-level fit was assessed through goodness-of-fit statistics, which quantify the discrepancy between the observed responses and those expected under the estimated model. The Root Mean Square Error of Approximation (RMSEA; Steiger & Lind, 1980) was used to assess the model’s fit, with values lower than 0.06 indicating a close fit (Hu & Bentler, 1999). The Akaike information criterion (AIC; Akaike, 1974), Bayesian information criterion (BIC; Schwarz, 1978), and M2 (Maydeu-Olivares & Joe, 2005, 2006) were used to compare competing non-nested models. Lower values of these indices indicate a better fit.
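As a rough sketch of how these indices are defined, the functions below compute AIC, BIC, and an RMSEA from a chi-square-type statistic such as M2. The log-likelihoods, parameter counts, and statistic values are hypothetical, and the RMSEA denominator (N − 1 rather than N) is an assumption, since conventions differ across software.

```python
import math


def aic(log_likelihood, n_params):
    """Akaike information criterion: -2 lnL + 2k."""
    return -2.0 * log_likelihood + 2.0 * n_params


def bic(log_likelihood, n_params, n_obs):
    """Bayesian information criterion: -2 lnL + k ln(N)."""
    return -2.0 * log_likelihood + n_params * math.log(n_obs)


def rmsea(statistic, df, n_obs):
    """RMSEA from a chi-square-type statistic such as M2.

    Uses sqrt(max(stat - df, 0) / (df * (N - 1))); dividing by N - 1 rather
    than N is an assumption made for this sketch.
    """
    return math.sqrt(max(statistic - df, 0.0) / (df * (n_obs - 1)))


# Hypothetical values for two competing models fitted to N = 400 respondents.
models = {"longer": (-4100.0, 66), "shorter": (-3300.0, 54)}
for name, (ln_l, k) in models.items():
    print(name, round(aic(ln_l, k), 1), round(bic(ln_l, k, 400), 1))
print(round(rmsea(statistic=250.0, df=200, n_obs=400), 3))
```

BIC penalises each additional parameter by ln(N) rather than 2, so it favours the more parsimonious model more strongly than AIC once N exceeds about 7.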

Several parameters and statistics were also examined at the item level. Item retention was based on concurrent consideration of all the available indices. Discrimination parameters are equivalent to factor loadings and represent the strength of the relationship between the item and the latent trait. Discrimination values were interpreted as follows: 0.01–0.24 is ‘none’; 0.25–0.64 is ‘low’; 0.65–1.34 is ‘moderate’; 1.35–1.69 is ‘high’; and >1.7 is ‘perfect’ (Baker, 2001). The Item Information Curves (IICs) were inspected visually. Typically, narrower and higher curves imply higher accuracy, whereas lower and wider curves imply lower accuracy. Generally, very low IICs may indicate a poor fit of the item and, as such, could be used to make judgments regarding the removal of a specific item (De Ayala, 2013).
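The item-level criteria can be sketched as follows. The first function applies the discrimination cut-offs quoted above (the treatment of values falling exactly on a boundary is an assumption), and the second evaluates the GRM item information function at a given trait level. The function names and the thresholds used in the example are hypothetical; only the two discrimination values (1.04 and 5.46) are taken from the results reported for APP9 and APP2.

```python
import math


def discrimination_label(a):
    """Label a discrimination value using the cut-offs quoted in the text."""
    # Handling of values exactly on a boundary is an assumption.
    if a < 0.25:
        return "none"
    if a < 0.65:
        return "low"
    if a < 1.35:
        return "moderate"
    if a < 1.70:
        return "high"
    return "perfect"


def grm_item_information(theta, a, thresholds):
    """GRM item information: sum over categories of (dP_k/dtheta)^2 / P_k."""
    cum = [1.0] + [1.0 / (1.0 + math.exp(-a * (theta - b))) for b in thresholds] + [0.0]
    info = 0.0
    for k in range(len(cum) - 1):
        p_k = cum[k] - cum[k + 1]
        # The derivative of each cumulative curve is a * P* * (1 - P*).
        dp_k = a * (cum[k] * (1 - cum[k]) - cum[k + 1] * (1 - cum[k + 1]))
        info += dp_k ** 2 / p_k
    return info


# Hypothetical thresholds shared by both items so that only the slope differs.
b = [0.0, 1.0, 2.0, 3.0, 4.0]
for a in (1.04, 5.46):
    print(a, discrimination_label(a), round(grm_item_information(1.0, a, b), 2))
```

With identical thresholds, the item with the larger slope yields a taller information curve near its thresholds, which is the pattern the visual inspection of the IICs is meant to detect.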

Reliability and Construct Validity Analyses

Internal consistency was assessed using Cronbach’s alpha coefficients. The following cut-offs were used: >0.9 is excellent, >0.8 is good, >0.7 is acceptable, >0.6 is questionable, >0.5 is poor, and <0.5 is unacceptable (George & Mallery, 2003). Test-retest reliability was assessed using Pearson correlations between scores on two administrations of the PDEPQ-R four weeks apart. For the test-retest section of this study, only the PDEPQ-R questionnaire was administered. Pearson correlation coefficients were also used to assess convergent validity. Correlations were examined between both the Approval of Physical Punishment and Approval of Non-Physical Punishment scales from the PDEPQ-R and the Attitudes Toward Spanking scale (ATS; Holden et al., 1995); the PDEPQ-R Subjective Norms scale and the Parent Cognition Scale (PCS; Snarr et al., 2009); and the PDEPQ-R Parenting Self-Efficacy scale and the Discipline subscale of the Self-Efficacy for Parenting Tasks Scale (Coleman & Karraker, 2000). The following cut-off values were used: r = 0–0.19 is a very weak correlation; r = 0.20–0.39 is a weak correlation; r = 0.40–0.59 is a moderate correlation; r = 0.60–0.79 is a strong correlation; and r = 0.80–1 is a very strong correlation (Mueller, 1997).
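The reliability and validity statistics described above can be sketched in a few lines. The functions below compute Cronbach’s alpha and Pearson’s r and attach the interpretive labels listed in the text. The function names and the small data set are hypothetical, and treating the lower bound of each alpha band as inclusive is an assumption of this sketch.

```python
import math
import statistics


def cronbach_alpha(items):
    """Cronbach's alpha; `items` holds one list of scores per item."""
    k = len(items)
    totals = [sum(scores) for scores in zip(*items)]
    item_variance = sum(statistics.variance(col) for col in items)
    return k / (k - 1) * (1 - item_variance / statistics.variance(totals))


def pearson_r(x, y):
    """Pearson product-moment correlation between two score lists."""
    mx, my = statistics.fmean(x), statistics.fmean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def alpha_label(alpha):
    """Label for alpha; inclusive lower bounds are an assumption."""
    for cutoff, label in ((0.9, "excellent"), (0.8, "good"), (0.7, "acceptable"),
                          (0.6, "questionable"), (0.5, "poor")):
        if alpha >= cutoff:
            return label
    return "unacceptable"


def r_label(r):
    """Label for the absolute size of a correlation coefficient."""
    for cutoff, label in ((0.80, "very strong"), (0.60, "strong"),
                          (0.40, "moderate"), (0.20, "weak")):
        if abs(r) >= cutoff:
            return label
    return "very weak"


# Hypothetical responses: three items answered by five respondents.
items = [[1, 2, 3, 4, 5], [2, 2, 3, 5, 5], [1, 3, 3, 4, 4]]
alpha = cronbach_alpha(items)
print(round(alpha, 3), alpha_label(alpha))
print(r_label(pearson_r(items[0], items[1])))
```

The same `pearson_r` function serves for both the test-retest and the convergent validity analyses, since each reduces to a correlation between two sets of scale scores.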

Results

Data Screening

Data were initially screened for coding and data entry errors by checking the range of values. There were no missing data or significant departures from multivariate normality, except for two items from the Approval of Physical Punishment scale, APP6 (Skew = 3.6; Kurtosis = 14.1) and APP10 (Skew = 6.3; Kurtosis = 50.7). These two items had minimal variability across response categories (87–94% of parents strongly disagreed with these items). For the Approval of Non-Physical Punishment scale, ANP7 (Skew = 3.4; Kurtosis = 14) had low variability across response categories (85% of parents strongly disagreed with this item). Attempts were made to render these items less skewed by using simple methods (i.e., collapsing some low-frequency categories) and more complex methods (i.e., transformations). However, the programme did not provide a viable solution whenever these items were entered into the IRT model (i.e., no appropriate statistics were produced). Therefore, these three items were removed from further analyses.
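The effect of a heavily endorsed response category on skewness and kurtosis can be illustrated with a minimal sketch. The response vectors are hypothetical, and the moment-based (population) formulas with excess kurtosis are an assumption; the estimator used for the values reported above is not specified.

```python
def skew_kurtosis(values):
    """Moment-based skewness and excess kurtosis (population formulas)."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((v - mean) ** 2 for v in values) / n
    m3 = sum((v - mean) ** 3 for v in values) / n
    m4 = sum((v - mean) ** 4 for v in values) / n
    return m3 / m2 ** 1.5, m4 / m2 ** 2 - 3.0


# Hypothetical item: 90 of 100 parents choose category 1 (strongly disagree).
concentrated = [1] * 90 + [2] * 5 + [3] * 3 + [4] * 2
# Hypothetical item with responses spread evenly over six categories.
balanced = [1, 2, 3, 4, 5, 6] * 10

for name, data in (("concentrated", concentrated), ("balanced", balanced)):
    skew, kurt = skew_kurtosis(data)
    print(name, round(skew, 2), round(kurt, 2))
```

When nearly all respondents select the same category, the few remaining responses sit far out in one tail, which inflates both statistics in the way observed for APP6, APP10, and ANP7.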

Testing the Assumptions of Unidimensionality and Local Dependence

The results of the principal component analysis supported a primary dimension for the four PDEPQ-R scales. The first three eigenvalues for each scale were as follows: 6.05, 0.98, and 0.83 for Approval of Physical Punishment; 2.81, 1.32, and 0.91 for Approval of Non-Physical Punishment; 2.71, 1.28, and 0.91 for Subjective Norms; and 4.37, 0.69, and 0.64 for Parenting Self-Efficacy. Lord’s condition for essential unidimensionality was met by all scales, as demonstrated by the greater-than-two ratio of the first to second eigenvalue (Hambleton et al., 1991; Lord, 1980; Reckase, 1979). Specifically, the first to second eigenvalue ratio was 6.17 for Approval of Physical Punishment, 2.12 for Approval of Non-Physical Punishment, 2.12 for Subjective Norms, and 6.33 for Parenting Self-Efficacy. Moreover, the first factor of each scale accounted for more than 20% of the variance, supporting the unidimensional nature of all four scales (Reckase, 1979). The variance accounted for by the primary dimension of the four scales was 55.03% for Approval of Physical Punishment, 28.03% for Approval of Non-Physical Punishment, 30.05% for Subjective Norms, and 48.54% for Parenting Self-Efficacy (Fig. 1).
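The eigenvalue-ratio criterion can be checked directly from the first and second eigenvalues reported above. The sketch below is only an arithmetic illustration; the function names are hypothetical.

```python
# First and second principal-component eigenvalues as reported in the text.
eigenvalues = {
    "Approval of Physical Punishment": (6.05, 0.98),
    "Approval of Non-Physical Punishment": (2.81, 1.32),
    "Subjective Norms": (2.71, 1.28),
    "Parenting Self-Efficacy": (4.37, 0.69),
}


def eigenvalue_ratio(first, second):
    """Ratio of the first to the second eigenvalue."""
    return first / second


def meets_ratio_criterion(first, second, cutoff=2.0):
    """True when the first-to-second eigenvalue ratio exceeds the cut-off."""
    return eigenvalue_ratio(first, second) > cutoff


for scale, (first, second) in eigenvalues.items():
    ratio = eigenvalue_ratio(first, second)
    print(scale, round(ratio, 2), meets_ratio_criterion(first, second))
```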

Fig. 1 Principal Component Analysis Scree Plot

IRT Results for the Approval of Physical Punishment Scale

The revised version of the Approval of Physical Punishment scale (11 items) showed a good model fit, as indicated by low AIC, BIC, and RMSEA values (see Table 2). As shown in Table 3, most discrimination values for individual items were above 1.70, falling in the ‘perfect’ discrimination category (Baker, 2001). The only exception was item APP9, which presented a moderate discrimination value (a = 1.04). The highest discrimination value was recorded for item APP2 (a = 5.46), indicating a strong relation of this item with the latent trait. The spread of category thresholds (−0.14 to 7.91) across items was reasonably broad, implying broad coverage of the latent trait. The IICs of APP2 and APP8 were noticeably higher than those of all the other items, suggesting high reliability for these two items (see Fig. 2). The lowest IIC was detected for item APP9, indicating high levels of measurement error for this item compared to other scale items. This item was deleted, and the model was rerun. The model fit improved, as indicated by lower AIC, BIC, and M2 values (see Table 2). Most of the LD-χ2 values for the items were acceptable (<10), except for item APP3, which shared unwanted variance with most of the scale items (LD-χ2 range 12–29.3). Hence, item APP3 was removed, and the IRT model was rerun. The reduced model with nine items showed a slightly better fit than the 11-item model and the 10-item model (APP9 deleted), as indicated by lower goodness-of-fit statistics such as the AIC, BIC, M2, and RMSEA (see Table 2).

Table 2 Model Fit Indices for the PDEP Scales after Revisions
Table 3 Discrimination and Threshold Parameters for the PDEP Scales Items
Fig. 2 Information Functions for Approval of Physical Punishment Items

IRT Results for the Approval of Non-Physical Punishment Scale

For the revised version of the Approval of Non-Physical Punishment scale (10 items), there was a good model fit, as demonstrated by low AIC, BIC, and M2 values and an RMSEA value below the recommended cut-off (<0.06; see Table 2). As shown in Table 3, most discrimination values (i.e., slope parameters, a) for individual items were moderate, with values ranging between 0.66 and 1.65 (Baker, 2001), implying moderate similarity among items in measuring the latent trait. Examining the statistics for each item revealed that the thresholds followed an ascending trend from the response category strongly disagree (threshold b1) to strongly agree (threshold b5) and covered a wide range of the latent trait (from −0.33 to 4.72). The highest discrimination value was obtained for ANP10, indicating a strong relationship between this item and the latent construct. Moreover, the IIC of ANP10 was noticeably higher than those of all the other items, indicating the high reliability of this item.

Conversely, ANP9 received the lowest discrimination value, indicating a weak relationship with the latent trait compared to other items on the same scale. The lowest IIC was also detected for ANP9, which signifies a high level of measurement error for this item (see Fig. 3). Most LD-χ2 values for the items were acceptable (<10), except for ANP4, which shared unwanted variance with ANP3, ANP6, and ANP9, as indicated by LD-χ2 values above the cut-off (LD-χ2 range 11.00–24.5). To address the problem of local dependence between the items, ANP4 was eliminated and the model was rerun. Removing ANP4 and eliminating ANP9 (which had low discrimination and information) improved the model’s fit, as demonstrated by lower goodness-of-fit statistics that included the AIC, BIC, M2, and RMSEA (see Table 2).

Fig. 3 Information Functions for Approval of Non-Physical Punishment Items

IRT Results for Subjective Norms Scale

The model fit statistics of the Subjective Norms scale (9 items) indicated a good fit, as demonstrated by low AIC and BIC values and an RMSEA value below the recommended cut-off (<0.06; see Table 2). Discrimination values (i.e., the slopes for individual items) ranged from 1.00 to 2.05, indicating a moderate to strong relationship to the latent trait (see Table 3). SN8 demonstrated a higher slope parameter, suggesting that this item has a stronger relationship with the latent trait and provides more information about different latent trait levels. All the LD-χ2 values for the items were acceptable (<10), indicating no local dependence. The range of thresholds along the latent trait was broad for most items (overall range from −0.22 to 7.03) and in the correct order, implying that most items could measure a reasonable range of the latent trait. The IIC for SN8 was slightly higher than those of all the other items, indicating the high reliability of this item (see Fig. 4). However, all IICs were approximately in the same range with no exceptionally shallow curves, suggesting moderate reliability for all the scale items.

Fig. 4 Information Functions for Subjective Norms Items

IRT Results for the Parenting Self-Efficacy Scale

For the revised version of the Parenting Self-Efficacy scale (9 items), there was a good model fit, as demonstrated by low AIC and BIC values and an RMSEA value below the recommended cut-off (<0.06; see Table 2). Most discrimination values were above 1.70, falling in the ‘perfect’ discrimination category (Baker, 2001). The only exception was item SE6, which presented a moderate discrimination value (a = 1.17; see Table 3), indicating a somewhat weaker relationship of this item with the latent construct compared to other scale items. The highest discrimination values were for SE9 and SE3, indicating a stronger relationship between these items and the latent trait. The range of thresholds along the latent trait was reasonably broad for most items (from −7.42 to 9.22) and in the correct order, implying that most items could measure a broad range of the latent trait. LD-χ2 values for the items were acceptable (<10), supporting local independence. The IICs for both SE9 and SE3 were the highest and indicated high reliability for these two items (see Fig. 5). The lowest IIC was detected for SE6, indicating high measurement error. The removal of SE6 improved the model’s fit, as demonstrated by lower goodness-of-fit statistics for the AIC, BIC, M2, and RMSEA (see Table 2).

Fig. 5 Information Functions for Self-Efficacy Items

Reliability and Internal Consistency Results

This second set of results relates to the classical reliability and validity analyses conducted on the PDEPQ-R after the confirmatory IRT analyses.

Internal Consistency

For the sample of 400 parents, the 9-item version of the Approval of Physical Punishment scale was found to have excellent internal consistency (α = 0.92). The 7-item version of the Approval of Non-Physical Punishment scale was found to have acceptable internal consistency (α = 0.74). The 9-item version of the Subjective Norms scale showed good internal consistency (α = 0.82). Finally, the 8-item version of the Parenting Self-Efficacy scale showed good internal consistency (α = 0.86). For the test-retest reliability assessment using the subsample of 100 parents, correlations for the PDEPQ-R scales were very strong from the first to second administration four weeks later and ranged from 0.81 to 0.94. The highest correlation was for the Approval of Physical Punishment scale (r = 0.94, p < 0.001), followed by Approval of Non-Physical Punishment (r = 0.85, p < 0.001), Subjective Norms (r = 0.83, p < 0.001), and Parenting Self-Efficacy (r = 0.81, p < 0.001).

Convergent Validity

There was a very strong and significant positive correlation between Attitudes Toward Spanking (ATS) and the PDEPQ-R Approval of Physical Punishment scale (r = 0.92; p < 0.001), indicating that greater parental approval of spanking was associated with greater approval of physical punishment more generally. Moreover, there was a moderate and significant positive correlation between the ATS and the Approval of Non-Physical Punishment scale (r = 0.47; p < 0.001), indicating that parents who held more favourable attitudes toward spanking were more likely to approve of non-physical forms of punishment, such as removal of activities or time-out. In addition, there was a small positive correlation between the Child-Responsible Attributions subscale of the Parent Cognition Scale and the PDEPQ-R Subjective Norms scale (r = 0.14; p = 0.004), indicating that parents who tended to attribute challenging child behaviours to children’s characteristics also had more negative general beliefs regarding normative child behaviours. However, the association between these two variables was very weak. Finally, Discipline Self-Efficacy subscale scores were significantly and positively related to Parenting Self-Efficacy scores of the PDEPQ-R (r = 0.62; p < 0.001). This finding indicates that parents who reported greater effectiveness in implementing their parenting strategies (regardless of whether they are punitive) also reported greater effectiveness in engaging in non-punitive disciplinary strategies.

Discussion

This study evaluated and modified a questionnaire designed to assess constructs that characterise some of the critical determinants of parental use of physical and non-physical punishment. The four scales of the PDEPQ-R assess parental approval of physical punishment, parental approval of non-physical (emotional) punishment, subjective norms, and parenting self-efficacy. In Study 1, the content of the original version of the scales was examined by two samples, namely professionals in child development and parenting research as well as parents. The items were examined regarding representativeness and clarity, and the two samples offered suggestions about wording changes. Following this content validity assessment, we made changes when both participant groups highlighted the same suggestions.

Study 2 examined the psychometric characteristics of the newly revised questionnaire, the PDEPQ-R, using a sample of 400 parents. A subsample of the parents was also re-assessed four weeks later to establish test-retest reliability. The validity of the constructs was examined by correlating the PDEPQ-R scales with other measures assessing similar constructs. The results provided evidence of good psychometric qualities for the PDEPQ-R. However, the analyses highlighted several problematic items that subsequently were removed. Following the removal of these items, the final version of the PDEPQ-R was confirmed. Using this final version of the PDEPQ-R questionnaire, a reliability (i.e., internal consistency and test-retest) and validity (i.e., convergent) assessment of the scales was conducted. Three of the four scales (the exception was the Approval of Non-Physical Punishment) had good internal consistency, indicating reasonably good agreement among items within the scales. Moreover, the test-retest results showed that all four PDEPQ-R scales were reliable over four weeks in the subsample of 100 parents. Test-retest reliability coefficients ranged from good (0.81 for Parenting Self-Efficacy) to excellent (0.94 for Approval of Physical Punishment), providing strong evidence supporting the PDEPQ-R’s temporal consistency.

Furthermore, convergent validity was generally supported based on the sample of 400 parents. In line with the first hypothesis, the Attitudes Toward Spanking (ATS) score was significantly and positively associated with both the Approval of Physical Punishment and the Approval of Non-Physical Punishment scales from the PDEPQ-R. Therefore, the Approval of Physical Punishment scale taps into aspects of parenting closely related to facets of the ATS scale; indeed, many of the items on the two scales are similar in that they assess parental approval of spanking. This close alignment between the two measures presumably accounts for their high correlation. Interestingly, ATS scores were also positively related to Approval of Non-Physical Punishment, indicating that attitudes regarding spanking are linked with attitudes toward using non-physical punishment. This finding is novel in that it is the first to indicate a link between attitudes toward using “mild” forms of physical punishment, such as spanking, and approval of non-physical punishment. Parents with more favourable attitudes toward physical punishment appear more likely to hold favourable attitudes toward punishment strategies more generally (e.g., removal of activities, time-out). These findings align with research studies that have found small to moderate correlations between parental use of physical and non-physical punishment, as measured by the Corporal Punishment and Non-Violent Discipline subscales of the Conflict Tactic Parenting Scale (CTPS; Straus et al., 2003; Rodriguez, 2010). However, these subscales measure parents’ endorsement of behaviour rather than their attitudes. In this light, the PDEPQ-R contributes to existing research by providing additional evidence supporting these relationships. However, further research is needed to corroborate these findings.

In line with the second hypothesis, the Child-Responsible Attributions subscale of the Parent Cognition Scale (PCS) was positively and significantly related to the PDEPQ-R Subjective Norms scale. These findings indicate a link between parental beliefs about age-appropriate behaviours and attributions relative to the intentionality of the child (i.e., the child as the cause of challenging behaviour). Despite the significant correlation, the small magnitude of the coefficient indicates a weak alignment between the two constructs. However, it is essential to note that items on the Child-Responsible Attributions subscale tap into parents’ negative attributions about their children. In contrast, the Subjective Norms scale assesses parental beliefs about childhood behaviours at different ages and, as such, is not related exclusively to their child. This difference might explain the weak relationship between these two measures.

Finally, in line with the third hypothesis, the Discipline Self-Efficacy subscale was significantly and positively related to the Parenting Self-Efficacy scale of the PDEPQ-R, indicating a strong alignment between these two constructs. It is important to note that, although both scales assess one’s general sense of competence in the parenting role, the items of the PDEPQ-R’s Parenting Self-Efficacy scale were more specific to the use of punishment and the ability to manage challenging parenting situations constructively (i.e., without anger or the use of punishment). In contrast, the Discipline Self-Efficacy scale items focus on self-efficacy in the global sense of discipline (e.g., setting rules and overseeing decisions around parenting) with no direct reference to using punishment or managing conflict with children.

Overall, the study results indicated that the PDEPQ-R possesses good psychometric properties and constitutes a promising instrument for assessing parenting practices. Although the PDEPQ-R was designed to assess the outcomes of the PDEP programme, it can be considered an evaluation tool for any parenting programme whose aim is to reduce child physical and non-physical punishment and promote positive disciplinary strategies. Moreover, PDEPQ-R’s focus on factors related to the use of child punishment (rather than a total score) makes it possible to examine in greater detail such critical factors as the approval of physical and non-physical punishment, subjective norms, and parenting self-efficacy. As such, the PDEPQ-R can be a helpful tool for advancing knowledge on the parental cognitions underlying child punishment strategies. Furthermore, the PDEPQ-R could potentially be used in clinical settings to provide health professionals with additional information on factors influencing parents’ use of various child disciplinary practices. This information can help guide clinical decisions around ways to tailor interventions to meet the specific needs of parents seeking support around child discipline.

Limitations and Future Directions

One study limitation concerns sample representativeness, in that the sample of parents was almost exclusively composed of mothers. Fathers made up only 6.2% of the sample used in the IRT analyses, and no fathers were involved in the focus groups used to assess the content validity of the PDEPQ-R. It is possible that the questionnaire’s psychometric properties would differ depending on the parent’s gender, so future research should attempt to validate (and perhaps adapt) these scales in father samples. Similarly, most parents in this study identified as White and were highly educated, so additional questionnaire validity efforts with more diverse populations of parents are warranted. It should be noted, however, that an advantage of using IRT parameter estimates is that this statistical method is considered population-independent (i.e., it should provide relatively stable estimates across different populations). Second, potential weaknesses of the Subjective Norms scale indicate the need to further validate this scale as a measure of parental attributions for children’s behaviour.

Third, this study did not use attention checks or other manipulation techniques, which may have negatively affected scale validity. In addition, other forms of psychometric evidence remain to be tested (e.g., predictive and discriminant validity), so additional research in these domains needs to be considered. Fourth, the exclusive reliance on parent self-report could represent a validity threat due to shared method variance. Therefore, expanding the current results by including observational measures of attitudes and attributions would be important in subsequent validation research efforts of the PDEPQ-R. Finally, rewording the gendered pronouns in the PDEPQ-R to be more inclusive by using gender-neutral language would be important.