Introduction

Current developments in large language models (LLMs) have led to a surge of interest in the application of generative AI technologies in creative domains1,2,3,4,5,6. Recent research has largely emphasized the significant opportunities associated with human–AI collaboration, with generative AI found to exhibit human-like levels of creativity7,8. Given this track record of creativity in a selection of narrow domains, scholars have been quick to reason that generative AI can also augment and enhance human creativity9,10. However, research on human–AI creative collaboration remains surprisingly limited. In this research, we therefore set out to improve our understanding of the factors that drive effective creative collaboration with AI.

First, we begin with the observation that generative AI has meaningful limitations in its production of creative work, which suggests that publicity surrounding its creative prowess may not yet map onto reality. For example, seasoned writers have been found to outperform (by a large margin) current LLMs (e.g., Claude, GPT-4) across a diverse range of creativity measures11. When judging a short story produced by GPT-4, an expert writer observed that “the events of this piece feel arbitrary, almost random” and “while that does grant it an unpredictability and a vague form of originality, it feels thoughtless”11, p. 20. In addition, AI-generated creative writing has been found to be repetitive, following a homogeneous form and writing style12,13, further contributing to its lack of originality and novelty14. This point becomes especially evident when comparatively evaluating large bodies of AI-generated work, where the similarities become obvious.

In addition, there are important omissions and limitations in the present literature on human–AI creative collaboration. The first limitation is methodological. Specifically, a great deal of the empirical research that has sought to test the creative capabilities of LLMs has not used “expert” evaluators (i.e., domain experts evaluating creative outcomes). Instead, this body of work has made use of “lay” evaluations of creativity (laypeople evaluating creative outcomes) to substantiate claims of creative generative AI7,8,15,16. This is important because obtaining evaluations from experts is widely considered the “gold standard” when evaluating creative outcomes17,18. Furthermore, although several studies have compared AI-generated creativity against a human benchmark8,16, surprisingly little work has examined the creative output of human–AI dyads and the collaborative processes that undergird it. Hence, much remains to be learned about how AI systems should be designed to best support human creativity.

Finally, of the limited body of work that has researched the creativity of human–AI dyads (e.g.15), most lack the internal validity of controlled and robustly designed experiments. A typical approach, for example, is to give participants instructions on a set of prompts to provide to GPT-4 and to request that the creative output derived from this process be entered back into the experimental interface (e.g., Qualtrics). Such an approach, however, poses a clear threat to internal validity because the experimenter has little to no control over participant adherence to instructions19. Moreover, the evolving nature of publicly available LLMs presents further challenges because identical prompts at different points in time can produce very different results20 and changes to model specifications can lead to fluctuations in performance, even across short periods of time21. Thus, as many scholars seek to utilize widely accessible and publicly available LLMs to conduct research on the creativity of human–AI dyads, they also invite important threats to the validity of their findings.

We assume that the way people view their role in creative collaborations has an influence on the outcome of their endeavours22. This assumption implies that how generative AI is utilized, and thus the role humans occupy in the creative process, will be highly consequential for creative outcomes. In the present research, we focus on people serving either as a co-creator or as an editor. Occupying one of these two roles is expected to have a significant influence on people’s experience of creative self-efficacy in human–AI collaboration. When people are placed in the role of an editor, as is often the case by default in human–AI collaboration2,23, AI determines the initial direction of the creative process because it provides a default creative product that the editor must evaluate. This will likely pose a threat to the editor’s experience of self-efficacy because AI occupies the proactive role of generating a creative product and the editor occupies the reactive role of helping to improve its work24,25. Moreover, placing people in the role of an editor can invoke an anchoring effect, where ‘editors’ are highly influenced by the default they are presented with and make insufficient edits as a consequence15. Such constraints, whether consciously or unconsciously processed, may also lead to insufficient editing because they undermine intrinsic motivation, an important antecedent to creative processes26.

Alternatively, when people are placed in the role of a co-creator, this resembles a Cyborg archetype of human–AI collaboration, where people work in tandem with AI rather than divide the labour between the two27. When occupying this co-creator role, people’s belief in their ability to produce creative outcomes is nurtured because they set the creative direction, rather than simply react to AI-generated output28. In addition, opportunities for spontaneous improvisation are preserved when the creative product is developed iteratively because possibilities remain open29. Such facets of co-creation will likely nurture intrinsic motivation26, which is constructive for the development of self-efficacy30. We therefore argue that human–AI creativity will be best fostered when humans are placed in the role of a co-creator (vs. an editor), and that this effect will be explained by the promotion of creative self-efficacy. Finally, as poetry embodies the most prototypical expression of creativity, spanning centuries of human history31,32, and enables us to test the effect of being a co-creator vs. an editor on creative outcomes, this research focuses on poetry writing as the medium of creative expression.

We conducted two controlled experiments for which we carefully designed and programmed state-of-the-art human–AI interfaces. We find that people are less creative (as judged by professional poets) in a poetry writing task when they collaborate with a generative AI system and are placed in the role of an editor, relative to writing a poem on their own (Study 1). Interestingly, however, we further find that this creativity deficit disappears when people are instead placed in the role of a co-creator, and that this effect is explained by greater reported levels of creative self-efficacy (Study 2). Both experiments were approved for use with human subjects by an institutional review board. All conditions, measures and exclusions are reported; data are available at https://osf.io/3jqga/.

Results

Study 1

Participants were 101 individuals recruited via Prolific Academic in exchange for GBP2.50 (approximately USD3.00). Six participants failed to correctly answer an instrumental attention check and were removed from subsequent analyses (see below). The remaining 96 participants were, on average, 27.23 years old (SD = 8.97), and 76% were female. Participants were randomly allocated to one of two experimental conditions. In the first condition (human condition; n = 48), participants completed a poetry task unassisted. In the second condition (human–AI condition; n = 48), participants completed the poetry task in collaboration with an artificially intelligent poetry generation system. A third condition (AI condition) was added in which the poetry task was completed entirely by the artificially intelligent poetry generation system (n = 50).

Participants were invited to take part in a study about creativity and were informed that they would respond to some questions regarding their experiences of completing a creative task. The study was accessible via a URL link shared on our Prolific study advertisement. For participants in both conditions, the study began with a welcome message, followed by the poetry task. Participants allocated to the human condition were provided with the following instructions:

The first half of our study will involve writing a poem.

For this study, you will be asked to produce a poem that is 8 lines in length, consisting of two stanzas that each have 4 lines (2 × 4 lines). A stanza in poetry is like a paragraph in prose writing.

The rest is up to you. It’s entirely up to you whether you want to make it rhyme or not.

There will be no time limit, please do take whatever time you need to write your poem.

Participants were then asked whether they understood the instructions for the task (yes/no) and were given no time limit to write their poem. All participants in this condition selected ‘yes’. The instructions for participants in the human–AI condition were more elaborate. In addition to being told that they would be asked to write a poem 8 lines in length (two stanzas, four lines each), they were also told that they would first be provided with a poem produced by an algorithm (a set of instructions that are carried out in a series of computations/calculations, typically performed by a machine) and that they could edit the poem as freely as they wished before submitting it and finishing the task. More specifically, we instructed participants:

Once you are presented with the novel poem produced by the algorithm, you are free to edit and change the poem in whatever way you want. This part is entirely up to you. You can leave the poem as it is, make only a few minor adjustments, or delete the poem and start from scratch.

The algorithm used to generate novel poetry is a state-of-the-art, scientifically validated, neural poetry generation system that has been shown to generate output comparable to human-written poetry33. The poetry system was trained exclusively on a corpus of standard, nonpoetic text derived from the CommonCrawl corpus (https://commoncrawl.org/). After a series of filtering rules were applied to this vast and generic corpus (e.g., retaining only sentences written in English that are 20 words or fewer), the resulting corpus on which the poetry generation system was trained contains 500 million words in total, constructed from a vocabulary of 15 thousand unique words.

The poetry generation system uses a recurrent neural encoder-decoder architecture to generate individual lines of poetry. For each line of the poem, the system generates 2048 candidate lines for consideration. The candidates generated are subject to rhyme and topical constraints. The rhyme constraint imposed by the system follows an ABAB rhyme structure (i.e., lines 1 and 3 rhyme, and lines 2 and 4 rhyme). Therefore, potential lines generated by the system for line 3 must rhyme with the line chosen for line 1. The topical constraint ensures that the poem generated by the system is topically coherent. For this, the system makes use of a latent topic model based on non-negative matrix factorization34. The topic model provides interpretable, topically coherent semantic dimensions, which are exemplified by the three most salient words. In short, these three topical words represent the ‘theme’ of the poem. Examples of combinations of three topical words are “sorrow, longing, admiration” and “goddess, ritual, shrine”. For this study, the system generates a poem based on one set of three topical words, randomly selected from a list of 100 sets of three topical words.

As the generation of potential lines is a sampling process, the sample of lines generated by the system is then subject to a global optimization framework to identify the line that best matches the constraints mentioned above. To rank-order the sample of potential lines, each line is evaluated according to its compliance with the rhyme structure, its compliance with the topical constraint, the number of syllables it contains, and the log-probability score of two further mathematical criteria35. The line with the highest aggregated score across these dimensions is then selected for inclusion in the poem. This process is followed for all eight lines of each poem generated. Evaluations of the quality of poems produced by this poetry generation system revealed that raters considered the poems to be comparable to works by well-established English poets (W.H. Auden, E.E. Cummings, Philip Larkin, Sarojini Naidu, and Sylvia Plath), and superior to previous poetry systems (Hafez and Deep-speare) across the dimensions of fluency, coherence, meaningfulness, and poeticism. This indicates that the poetry generation system we adopted for this research is state-of-the-art.
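For illustration, the line-selection step described above can be sketched as follows. This is a simplified sketch, not the published system33: the scoring functions, the equal weighting, and the omission of the log-probability criteria are assumptions made purely for exposition.

```python
# Simplified sketch of the candidate-ranking step described above; the scoring
# functions and equal weighting are illustrative assumptions, not the published
# system's implementation, and the log-probability criteria are omitted.

def rhyme_score(candidate: str, target_line: str) -> float:
    """Crude rhyme check: 1.0 if the final words share their last two letters
    (a real system would use a pronunciation lexicon)."""
    return 1.0 if candidate.split()[-1][-2:] == target_line.split()[-1][-2:] else 0.0

def topic_score(candidate: str, topic_words: list[str]) -> float:
    """Crude topical coherence: fraction of the three topical words that appear
    in the candidate line (the actual system scores similarity to an NMF topic)."""
    words = set(candidate.lower().split())
    return sum(w in words for w in topic_words) / len(topic_words)

def syllable_score(candidate: str, target_syllables: int = 10) -> float:
    """Penalize deviation from a target syllable count (counting vowels as a rough proxy)."""
    vowels = "aeiouy"
    count = sum(ch in vowels for ch in candidate.lower())
    return 1.0 / (1.0 + abs(count - target_syllables))

def select_best_line(candidates: list[str], rhyme_target: str,
                     topic_words: list[str]) -> str:
    """From the (e.g., 2048) sampled candidates, return the line with the
    highest aggregated score across the rhyme, topic, and syllable criteria."""
    def aggregate(line: str) -> float:
        return (rhyme_score(line, rhyme_target)
                + topic_score(line, topic_words)
                + syllable_score(line))
    return max(candidates, key=aggregate)
```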

The interface for the assistive poetry generation system we incorporated in our study was modified and designed to include several features that can support the creative process of poetry writing and provide participants with a high degree of freedom. This was necessary because the original poetry generation system was designed to automate poetry generation, not to augment human poetry writing. We therefore developed an interface that builds upon this existing poetry generation system33 by implementing design changes with the goal of assisting and facilitating human creativity (full details regarding the required back- and front-end development are provided in the Methods section). We described each of these features to participants in detail before the poetry task began. The first feature is the ability for participants to directly edit any line in the poem. The second feature is a dropdown function, available for each line, that contains alternative lines participants can choose from if they prefer an alternative to an existing line. A third feature provides participants with an updated list of alternative lines whenever they directly edit and update a line for which they wish to browse alternatives; in this way, the dropdown list of alternative lines updates in real time based on the newly edited line. Finally, as the poetry system generates poems that follow an ABAB rhyme scheme (i.e., lines 1 and 3 rhyme, and lines 2 and 4 rhyme), if a participant edits and updates line 1 or 2, the updated alternative lines for line 3 or 4 will rhyme with the newly edited line (see Fig. 1 for a visual illustration of the poetry interface). The goal of designing the interactive poetry system in this way was to provide participants with assistive, artificially intelligent input for the generation of their own poems. As in the human condition, participants were reminded that no time limit would be imposed and that they could spend however long they wished to write and submit their poem. All participants in this condition responded ‘yes’ when asked whether they understood the instructions of this poetry task (yes/no).
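To make the real-time candidate refresh concrete, the following is a minimal sketch of how such an endpoint could look with the Flask back end described in the Methods. The route name, request fields, and the generate_candidates() helper are hypothetical placeholders, not the interface's actual implementation.

```python
# Hypothetical sketch of the candidate-refresh behaviour: when a participant
# edits a line, the front end posts the new text and the back end returns a
# fresh dropdown of alternative lines that respect the ABAB rhyme scheme.
from flask import Flask, request, jsonify

app = Flask(__name__)

def generate_candidates(edited_line: str, line_index: int,
                        topic_words: list[str]) -> list[str]:
    """Placeholder for the neural poetry system: return alternative lines that
    fit the topic and rhyme with the newly edited line where required."""
    raise NotImplementedError

@app.route("/refresh_candidates", methods=["POST"])
def refresh_candidates():
    data = request.get_json()
    # e.g., editing line 1 triggers regeneration of rhyming alternatives for line 3
    candidates = generate_candidates(
        edited_line=data["edited_line"],
        line_index=data["line_index"],
        topic_words=data["topic_words"],
    )
    return jsonify({"candidates": candidates})
```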

Figure 1

Visual depiction of the poetry interface in the human-AI condition (Study 1).

Once participants submitted their finished poems, they subsequently provided their self-reported evaluation of the poem’s creativity.

To measure participants’ self-reported evaluations of the creativity of their poem, we instructed them to “rate the extent to which you think the poem you submitted is creative” (1 = lowest creativity score; 40 = highest creativity score).

To evaluate the creativity of participants’ poems, we utilized the Consensual Assessment Technique (CAT), one of the most highly regarded assessment tools in creativity research17. In the CAT, evaluations of creativity are obtained from experts who are asked to evaluate creative products using their own expert sense of what is creative. We obtained evaluations from 10 poets in exchange for USD40 each. All poets we obtained evaluations from are professional writers and have had their poems published either in their own poetry books or as part of a poetry anthology containing works from different poets. In addition, 6 of the poets possess a graduate degree (MA, MFA, PhD) in creative writing, English literature, or a related field. Poets were asked to evaluate the creativity of a collection of poems. This collection contained all the poems produced by participants in the human condition (n = 48) and the human-AI condition (n = 48), as well as the poems generated in the AI condition (n = 50). Therefore, each poet evaluated the same set of 146 poems. Importantly, poets were not provided with any information about the authors behind the poems or the circumstances under which they were written. We included poems generated autonomously by the poetry generation system to obtain a benchmark value for comparison with creativity ratings obtained in our two experimental conditions. In line with creativity research that utilizes the CAT35,36,37, poets were given the poems in different, randomly generated orders and asked to rate the creativity of each poem on a 1 (lowest creativity score) to 40 (highest creativity score) scale. They were also informed that their ratings of poems should be relative to one another, that they should use the full range of scores (1–40), and that they would not have to justify or defend any of their ratings.

An independent-samples t-test revealed that participants in the human condition self-reported their poems as significantly more creative (M = 23.88, SD = 8.40) than did participants in the human–AI condition (M = 19.13, SD = 10.40; p = 0.016, Cohen’s d = 0.47). Next, we analysed how creative the poems were judged to be by expert evaluators (poets). A one-way ANOVA revealed that the creativity of the poems differed significantly across the three experimental groups, F(2, 145) = 63.48, p < 0.001, partial η2 = 0.47 (see Table 1). We then performed post hoc pairwise comparisons of expert evaluations of creativity across conditions. Poems in the human condition (M = 18.23, SD = 4.94) were regarded as more creative than poems in the human-AI condition (M = 12.55, SD = 4.32; p < 0.001) and poems in the AI condition (M = 9.45, SD = 2.00; p < 0.001), while poems in the human-AI condition were regarded as more creative than poems in the AI condition (p < 0.001).
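For readers who wish to reproduce this style of analysis, the following is a brief sketch of the reported tests in Python. The data frame, its column names, and the file name are hypothetical, and because the post hoc procedure is not specified above, Tukey's HSD is shown purely as an example.

```python
# Sketch of the analyses reported above, assuming a data frame `df` with one row
# per poem and hypothetical columns "condition" (human / human_ai / ai),
# "self_creativity" (participant ratings; absent for AI-generated poems), and
# "expert_creativity" (mean of the 10 poets' ratings).
import pandas as pd
from scipy import stats
from statsmodels.stats.multicomp import pairwise_tukeyhsd

df = pd.read_csv("study1_poems.csv")  # hypothetical file name

# Independent-samples t-test on self-reported creativity (human vs. human-AI).
human = df.loc[df["condition"] == "human", "self_creativity"]
human_ai = df.loc[df["condition"] == "human_ai", "self_creativity"]
t, p = stats.ttest_ind(human, human_ai)
print(f"t-test: t = {t:.2f}, p = {p:.3f}")

# One-way ANOVA on expert evaluations across the three conditions.
groups = [g["expert_creativity"].values for _, g in df.groupby("condition")]
f, p = stats.f_oneway(*groups)
print(f"ANOVA: F = {f:.2f}, p = {p:.3f}")

# Post hoc pairwise comparisons (Tukey's HSD shown as an example correction).
print(pairwise_tukeyhsd(df["expert_creativity"], df["condition"]))
```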

Table 1 Variable means and standard deviations by condition (Study 1).

Study 2

Study 2 builds on the findings of Study 1 by testing whether a redesigned poetry generation system—one that fosters creative self-efficacy—would improve expert evaluations of creativity. We predicted that participants would exhibit greater creativity when placed in the role of a co-creator (co-creator human–AI condition), relative to being placed in the role of an editor (human–AI condition), and this effect would be explained by greater perceptions of creative self-efficacy.

Participants were 152 individuals recruited via Prolific Academic in exchange for GBP2.50 (approximately USD3.00). No participants failed to correctly answer an instrumental attention check. On average, participants were 35.11 years old (SD = 12.06) and 34.9% were female. Participants were randomly allocated to one of three experimental conditions. In the first condition (human condition; n = 51), participants completed a poetry task unassisted, and in the second condition (human-AI condition; n = 50), participants completed the poetry task in collaboration with the same artificially intelligent poetry generation system used in Study 1. A third condition (co-creator human-AI condition) was included in which participants completed the poetry task in collaboration with a redesigned version of the poetry generation system (n = 51).

The study followed the same procedure as Study 1. However, we also examined the effects of participants collaborating with a new poetry generation system that was designed to support creative self-efficacy (co-creator human–AI condition). In the co-creator human–AI condition, participants collaborated with the same poetry generation system used in Study 133; however, the interface was redesigned so that participants and the poetry system would take turns writing one line of poetry each in a step-wise fashion. In other words, this collaborative and iterative process was akin to a conversation, where the participant begins the conversation by writing the first line of poetry, which is then followed by the AI system generating the next line, and so forth (see Fig. 2 for a visual depiction). In addition, to ensure that the system generates subsequent lines of poetry that align with the direction the participant wishes to take the poem, we had participants select a set of three topical words (e.g., “sorrow, longing, admiration”) from a list of 100 sets of three topical words. Thus, participants were also empowered to determine the “theme” of the poem by choosing their topical words before initiating the writing process. This differs from Study 1, where the poetry generation system produced a poem based on a randomly selected set of three topical words. Similar to Study 1, the poetry system in this condition followed an ABAB rhyme scheme, where the 4th (or 8th) line generated would rhyme with the 2nd (or 6th) line. It was also made clear to participants in this condition that they were free to edit any and all aspects of the poem at any stage. Thus, whereas the original poetry generation interface (human-AI condition) provided participants with a finished poem and a selection of advanced features to edit this poem, the redesigned interface (co-creator human–AI condition) empowered users to initiate the creative process by selecting their set of topical words and writing the first line of poetry. In turn, the system responded to each user-generated line by returning a subsequent line. Further building on Study 1, we also measured participants’ perceptions of creative self-efficacy as a potential mediating mechanism, in addition to their self-reported evaluations of the poem’s creativity.
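The turn-taking logic of the co-creator interface can be summarized with the following sketch. The generate_next_line() function is a hypothetical stand-in for the poetry generation system, and the control flow shown is a simplification of the interface described above.

```python
# Simplified sketch of the co-creator condition's turn-taking flow: the
# participant writes the odd-numbered lines and the system returns each
# even-numbered line, with lines 4 and 8 rhyming with lines 2 and 6 (ABAB).
# generate_next_line() is a hypothetical placeholder for the poetry system.

def generate_next_line(poem_so_far: list[str], topic_words: list[str],
                       rhyme_with: str | None) -> str:
    """Placeholder: return one system-generated line constrained to the chosen
    topical words and, when rhyme_with is given, to rhyme with that line."""
    raise NotImplementedError

def co_create_poem(topic_words: list[str], get_human_line) -> list[str]:
    """Alternate human and system lines until an 8-line poem is complete."""
    poem: list[str] = []
    for line_number in range(1, 9):
        if line_number % 2 == 1:       # participant writes lines 1, 3, 5, 7
            poem.append(get_human_line(line_number, poem))
        else:                          # system writes lines 2, 4, 6, 8
            # Lines 4 and 8 must rhyme with the line two positions back
            # (line 2 or line 6); lines 2 and 6 are unconstrained.
            rhyme_with = poem[-2] if line_number in (4, 8) else None
            poem.append(generate_next_line(poem, topic_words, rhyme_with))
    return poem
```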

Figure 2

Visual depiction of the poetry interface in the co-creator human-AI condition (Study 2). The participant begins by writing the first line of the poem (a). After selecting the ‘update’ button on the right-hand side, the poetry generation system returns the second line of the poem (b). The option to select alternative lines is reflected in the feature ‘select candidate verse’ and participants can directly edit the line by clicking on the line. Once satisfied, the participant can write the third line of the poem in the entry below (c). This iterative process continues until the participant co-creates an 8-line poem consisting of 2 stanzas (2 × 4 lines).

To measure creative self-efficacy, we adopted a three-item (α = 0.92) creative self-efficacy scale developed by Tierney and Farmer28. The scale captured the extent to which participants believed that they possessed the capacity to perform creative work effectively (e.g., “I was good at coming up with ideas for the poem”). The scale ranged from 1 (strongly disagree) to 7 (strongly agree).
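As a point of reference, the reliability reported above (α = 0.92) corresponds to Cronbach's alpha over the three scale items; a minimal sketch of that computation is shown below, with a hypothetical participants-by-items response matrix.

```python
# Minimal sketch of Cronbach's alpha for the three-item scale; `responses` is a
# hypothetical participants-by-items array of 1-7 ratings.
import numpy as np

def cronbach_alpha(responses: np.ndarray) -> float:
    """alpha = k/(k-1) * (1 - sum of item variances / variance of total score)."""
    k = responses.shape[1]
    item_variances = responses.var(axis=0, ddof=1).sum()
    total_variance = responses.sum(axis=1).var(ddof=1)
    return (k / (k - 1)) * (1 - item_variances / total_variance)
```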

To measure participants’ self-reported evaluations of the creativity of their poem, we again instructed them to “rate the extent to which you think the poem you submitted is creative” (1 = lowest creativity score; 40 = highest creativity score).

To evaluate the creativity of participants’ poems, we utilized the Consensual Assessment Technique (CAT), one of the most highly regarded assessment tools in creativity research17. As in Study 1, we obtained ratings from 10 (different) poets in exchange for USD40 each.

Creative self-efficacy

A one-way ANOVA revealed that participants’ creative self-efficacy differed significantly across the three experimental conditions, F(2, 149) = 7.01, p = 0.001, partial η2 = 0.09 (see Fig. 3). We then performed post hoc pairwise comparisons of creative self-efficacy across conditions. Participants in the human condition (M = 4.71, SD = 1.45) reported greater creative self-efficacy than participants in the human–AI condition (M = 3.74, SD = 1.43; p = 0.001). Creative self-efficacy reported by participants in the co-creator human–AI condition (M = 4.62, SD = 1.43) did not differ significantly from that reported in the human condition (p = 0.755) but was significantly greater than creative self-efficacy reported in the human–AI condition (p = 0.003).

Figure 3

Creative self-efficacy reported across experimental conditions (Study 2).

Self-reported creativity

Next, a one-way ANOVA revealed that self-reported creativity differed significantly across the three experimental conditions, F(2, 149) = 6.08, p = 0.003, partial η2 = 0.08. We then performed post hoc pairwise comparisons of self-reported creativity across conditions. Participants in the human condition reported greater creativity than participants in the human-AI condition (M = 23.29, SD = 9.24 vs. M = 16.96, SD = 9.56; p = 0.001). Self-reported creativity in the co-creator human-AI condition (M = 21.16, SD = 9.03) was significantly higher than in the human-AI condition (p = 0.025) but did not differ significantly from the human condition (p = 0.247).

Expert evaluations of creativity

A further one-way ANOVA revealed that expert evaluations of creativity differed significantly across the three experimental conditions, F(2, 149) = 5.77, p = 0.004, partial η2 = 0.07 (see Fig. 4). We then performed post hoc pairwise comparisons of expert evaluations of creativity across conditions. Participants in the human condition received higher expert evaluations of creativity than participants in the human–AI condition (M = 15.74, SD = 9.24 vs. M = 12.53, SD = 4.45; p = 0.001). Expert evaluations of creativity in the co-creator human–AI condition (M = 14.70, SD = 5.06) were significantly greater than in the human–AI condition (p = 0.026) but did not differ significantly from the human condition (p = 0.278). See Table 2 for cell means and standard deviations.

Figure 4

Expert evaluations of creativity across experimental conditions (Study 2).

Table 2 Variable means and standard deviations by condition (Study 2).

Mediation analyses

Next, we conducted mediation analyses to examine whether creative self-efficacy mediated the link between being a co-creator (co-creator human–AI condition) versus an editor (human–AI condition) in creative collaboration with AI and expert evaluations of creativity. To do this, we used Model 4 of the PROCESS macro38 with 95% bias-corrected confidence intervals and 10,000 bootstrapped samples. In this path model, creative self-efficacy was entered as the mediator and expert evaluations of creativity were entered as the dependent variable (Y). For the independent variable, we created a dummy-coded variable using indicator coding: the human–AI condition was given the value 0, and the co-creator human-AI condition was given the value 1. A significant indirect effect was observed from this dummy variable to expert evaluations of creativity via creative self-efficacy: indirect effect, B = 0.78, SE = 0.39, 95% CI [0.16, 1.68]. We also performed the same mediation analysis with a dummy-coded independent variable where the co-creator human–AI condition was given the value 0, and the human condition was given the value 1. For this model, the indirect effect was not significant: indirect effect, B = 0.09, SE = 0.29, 95% CI [−0.50, 0.78] (see Fig. 5).
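The same indirect-effect logic can be sketched outside of the PROCESS macro, for instance in Python, as below. This is a simplified illustration: it uses a percentile bootstrap rather than the bias-corrected intervals reported above, and the column names (X, M, Y) and file name are hypothetical.

```python
# Simplified sketch of the mediation (indirect-effect) test described above,
# using a percentile bootstrap instead of the bias-corrected CIs reported in
# the text. Hypothetical columns: X (0 = editor/human-AI, 1 = co-creator),
# M (creative self-efficacy), Y (expert-evaluated creativity).
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

def indirect_effect(data: pd.DataFrame) -> float:
    a = smf.ols("M ~ X", data=data).fit().params["X"]      # path a: X -> M
    b = smf.ols("Y ~ X + M", data=data).fit().params["M"]  # path b: M -> Y, given X
    return a * b

def bootstrap_ci(data: pd.DataFrame, n_boot: int = 10_000, seed: int = 0):
    rng = np.random.default_rng(seed)
    estimates = []
    for _ in range(n_boot):
        idx = rng.integers(0, len(data), len(data))  # resample rows with replacement
        estimates.append(indirect_effect(data.iloc[idx]))
    return np.percentile(estimates, [2.5, 97.5])

# Example usage (hypothetical file):
# df = pd.read_csv("study2.csv")
# print("indirect effect:", indirect_effect(df))
# print("95% bootstrap CI:", bootstrap_ci(df))
```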

Figure 5

Mediation analyses of the effect of our experimental conditions on expert evaluations of creativity via creative self-efficacy (Study 2). *p < 0.05, **p < 0.01, ***p < 0.001.

Discussion

Taken together, our findings demonstrate the importance of how AI systems are designed and the role humans are intended to serve in the production of creative goods. We find that creative collaboration with AI is most effective when AI systems are designed to nurture end-users’ beliefs in their abilities to produce creative outcomes. More specifically, we find that this is accomplished through placing people in the role of a co-creator, rather than an editor. Our finding aligns fully with the overarching goal of effective human–computer interaction (HCI): to create intuitive and empowering interfaces that seamlessly integrate with human capabilities39,40.

Our results run counter to recent suggestions and beliefs espoused by technology leaders and scholars that generative AI can promote creativity by having AI generate outputs and people provide the final judgment on the quality of these outputs41,42,43; put differently, by placing people in the role of an “editor”. This is because the preservation of autonomy is crucial for promoting people’s beliefs about their ability to produce creative products44,45. When occupying the role of an editor, the presentation of a default body of text for editing restricts autonomy because people are unconsciously influenced by default effects15,46. According to role theory47,48, occupying the role of an editor will also implicitly set expectations about the behaviours the role entails. Thus, the implicit assumption of a rigidly defined role will undermine self-efficacy because people will interpret their role as not requiring creative and explorative behaviours. In turn, this inhibits cognitive flexibility and divergent thinking49. Alternatively, assuming the role of a co-creator will give rise to expectations that one should fully express their ideas, experiment with different possibilities, and challenge themselves50,51. Hence, the roles given to people in creative collaboration with AI have a large influence on their self-efficacy and resulting creative contributions.

From a cognitive sciences vantage point, our research also highlights an interesting array of future research directions relating to the cognitive processes underpinning creative collaboration with AI. For instance, when people occupy the role of a co-creator, this might activate broader associative networks and broaden cognitive focus52,53,54. Alternatively, occupying the role of an editor could activate analytical thinking and elicit a narrower cognitive focus55,56. The production of creative ideas has been found to be poorer in virtual teams than in face-to-face teams because virtual work focuses people’s attention on a screen and narrows their cognitive focus57. Similarly, when people edit AI-generated content, their cognitive focus likely narrows in on the generated output. Future research could delve into the specific cognitive mechanisms activated during these roles, such as differences in attention (“does editing influence eye gaze?”), decision-making style (“does co-creating promote spontaneous improvisation and risk taking?”), or affect (“does editing heighten cognitive load and induce stress?”). Finally, longitudinal studies could explore the long-term cognitive effects of consistent collaboration with AI on creative tasks, potentially identifying shifts in cognitive processes over time. For example, if persistent collaboration with AI habituates analytical thinking and a narrower cognitive focus, it would be worthwhile to examine whether this affects creative processes even in the absence of AI (a deskilling effect).

Our findings have important implications for the way practitioners understand human-AI interaction in the context of creative collaboration. First, in creative industries such as advertising, design, and content creation58, implementing AI systems that emphasize co-creation can ensure that organizations tap into the strengths of both human capital and generative AI, though caution should be exercised to prevent deceptive practices59. This approach can empower employees and foster a more engaging and satisfying work environment60. Similarly, in education, these insights have implications for how AI can be leveraged to maintain and cultivate students' creative self-efficacy61. For instance, in creative writing classes, AI tools designed to place students in the role of co-creator could help them develop their writing skills by providing iterative feedback and suggestions, thereby enhancing their self-efficacy and creative work.

This research also has some limitations which at the same time provide exciting opportunities for future research. First, the artificial settings of the experiments may affect the generalizability of the results to real-world environments. The controlled experimental conditions do not fully capture the complexities of real-world creative processes where additional factors such as task type and organizational climate62 can influence creativity. Future research should thus aim to replicate these findings in more naturalistic settings, such as classrooms or professional work environments.

A further methodological limitation of this research is that the long-term impact of co-creating (vs. editing) was not tested. Understanding how these roles influence creative self-efficacy and creative outcomes over extended time periods is important63, as initial improvements may diminish or change with prolonged use. Longitudinal studies are needed to examine the sustained effects of these roles.

Additionally, the expertise level of the human writer was not considered as a potential boundary condition in this research. The benefits of being a co-creator (vs. editor) may vary significantly between novice and expert writers. Expert writers could benefit less from co-creation because they are less likely to take AI-generated text at face value and to insufficiently edit this text as a consequence64. Future research should investigate how varying levels of expertise affect our reported findings.

Finally, our research does not address whether using a more advanced generative AI system would unearth different results. Although we argue that co-creating (vs. editing) should elicit positive effects for creative self-efficacy and outcomes irrespective of the AI system in question, this assumption does warrant empirical validation. Future research should explore whether our findings hold in other systems, such as GPT-4o or Claude 3.5, to confirm whether the observed benefits of co-creation persist as AI technologies evolve.

Concluding remarks

As the production of creative work is increasingly shaped by generative AI, our conceptualization of what it means to be creative has witnessed a notable evolution. We have examined human-AI creative collaboration through two methodologically rigorous experimental studies. Together, our findings suggest that for AI to successfully augment human creativity, it must promote creative self-efficacy and place humans in the role of a co-creator, not an editor.

Methods

The present research involved no more than minimal risks, and all study participants were 18 years of age or older. All research studies received ethical approval from the Institutional Review Board (IRB) at the National University of Singapore (Ethical Approval Code: BIZ-MNO-20-0213), and all methods of the reported studies were performed in accordance with the ethical guidelines and regulations of this committee (https://www.nus.edu.sg/research/irb/resources/references). Informed consent was obtained for all participants. All manipulations and measures are reported. Data and syntax files are available on the Open Science Framework at https://osf.io/3jqga/.

In Studies 1 and 2, we recruited participants from the online recruitment platform Prolific (ProA; http://www.prolific.ac). To be eligible, participants’ first language needed to be English. ProA is an online platform explicitly designed for online participant recruitment by the scientific community65 and has been empirically demonstrated to provide higher-quality data than alternative online platforms66. Moreover, in both studies, we utilized attention checks, an important method for enhancing data quality67, particularly when utilizing online platforms such as Prolific66. However, only in Study 1 did any participants fail this check (n = 6). Following prior research, we selected participants with an approval rating of at least 97%. Condition assignment was random in both studies, with randomization administered via front-end programming.

In both studies, integrating the poetry generation system with our web application (URL link) required both back- and front-end development. For our back-end framework, we chose Flask68, a small and lightweight Python web framework that provides useful tools and features that make creating web applications in Python more convenient; the programming language adopted for the back end is therefore Python. To record and store data submitted by participants, Firebase was utilized. Firebase is a platform developed by Google that not only facilitates data storage but also enables the development of mobile and web applications. We opted to use Google Cloud Platform (GCP) as our server and Gunicorn for deployment. Gunicorn is a widely used, high-performance Python WSGI HTTP server for UNIX. For front-end development, we mainly used Jinja2, a full-featured template engine for Python, to implement our functions (e.g., login verification, submission of forms) and to facilitate asynchronous transfers (data exchange between the back and front end). The interface we developed is built on Bulma, a free, open-source framework that provides ready-to-use front-end components that can be combined to build responsive web interfaces. More complicated components and functions required jQuery and AJAX. The programming languages we adopted for front-end development were HTML, CSS, and JavaScript. HTML provides the basic structure of our web application, which is enhanced and modified via CSS and JavaScript. CSS is used to control presentation, formatting, and layout, while JavaScript is used to control the execution of specialized functions (e.g., imposing time constraints on webpages).
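For illustration, a minimal sketch of how the pieces described above could fit together is shown below. All route names, template names, credential paths, and field names are hypothetical and not taken from the study's actual codebase.

```python
# Minimal sketch of the stack described above: Flask serves the study pages via
# Jinja2 templates and a submission endpoint writes poems to Firebase
# (Cloud Firestore). Route names, template names, credential paths, and field
# names are all hypothetical.
from flask import Flask, render_template, request, jsonify
import firebase_admin
from firebase_admin import credentials, firestore

cred = credentials.Certificate("serviceAccountKey.json")  # hypothetical path
firebase_admin.initialize_app(cred)
db = firestore.client()

app = Flask(__name__)

@app.route("/")
def poetry_task():
    # Render the poetry interface via a Jinja2 template for the assigned condition.
    return render_template("poetry_task.html")

@app.route("/submit", methods=["POST"])
def submit_poem():
    data = request.get_json()
    # Store the submitted poem and participant metadata in Firestore.
    db.collection("poems").add({
        "participant_id": data["participant_id"],
        "condition": data["condition"],
        "poem": data["poem"],
    })
    return jsonify({"status": "ok"})

if __name__ == "__main__":
    app.run()  # served behind Gunicorn on Google Cloud Platform in production
```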

Ethical approval

All reported studies received ethical approval from the Institutional Review Board (IRB) at the National University of Singapore (Ethical Approval Code: BIZ-MNO-20-0213), and all methods were performed in accordance with the ethical guidelines and regulations of this committee (https://www.nus.edu.sg/research/irb/resources/references).