1 Objectives and structure of the paper

1.1 Premise

This paper presents the results of an online market research study conducted in the spring and summer of 1999 by the Department of Computer and Management Sciences of Trento University, Italy. The study is part of a larger project whose principal aim is to identify the advantages and disadvantages of market research done online with respect to traditional methods and channels, and to look at its applicability in diverse product markets.Footnote 1 In methodological terms the objective of the research presented in this paper was to demonstrate the benefits of conducting online market studies for innovative products. Problems with such innovative products derive firstly from the fact that their characteristics cannot be thoroughly defined before conducting the research, and secondly from the fact that their availability in commercial form usually requires further sizeable investments in research and trialling. Both of these issues are critical for CASE (computer-aided software engineering) tools that use linguistic instruments to analyse documents in natural language, and that are therefore based on technologies for natural language processing (NLP) developed in the field of artificial intelligence. Working from the perspective of a company attempting to decide which products to develop (from among different projects related to NLP-based applications), our objective was to evaluate the potential demand for NLP-based CASE tools. In conducting the study we made the reasonable assumption that the respondents (people involved in developing software systems) could be contacted easily via the Internet; this prerequisite could not be guaranteed, mainly at a national level, for other sectors studied previously (e.g. tourism or electronic commerce in groceries).Footnote 2 At the same time, a certain reluctance to participate in the study was to be expected, whether because of time constraints (noted even during the initial exploratory interviews) or because of an already high level of saturation.
In fact, both of these assumptions were confirmed during the course of the research. Nonetheless, we emphasise that this paper focusses on the results of the actual content of the research, and hereinafter we describe only methodological aspects that are pertinent to the interpretation of the results obtained.Footnote 3

1.2 Objectives

As previously mentioned, the aim of the research was to analyse the potential demand for a CASE tool integrating linguistic instruments as a support to requirements analysis [2]. To give the context in which such a tool could be designed and used, the next section first describes the role of natural language in requirements engineering and then classifies the possible applications of linguistic instruments, making reference to the architecture of an ideal NLP system and to the three fundamental activities of requirements analysis: elicitation, modelling, and validation [3]. Our market research refers principally to the support of conceptual modelling, an activity that requires the design of a dedicated modelling module if it is to benefit from the use of linguistic instruments. The other activities could be supported by existing functionalities of an NLP system, with varying levels of performance.

It was found early in the study that none of the commercial CASE tools exploited linguistic instruments to support requirements modelling [4]; the market research therefore had to focus on a new product whose features could not be defined in relation to similar existing products (analysis of the competition). Numerous research projects do exist in this area, however, and testify to the considerable interest in the use of linguistic instruments in requirements engineering [5, 6].Footnote 4 Their common objective is to carry out a linguistic analysis of requirements documents in order to produce conceptual models of them.Footnote 5 Among the most recent projects, we can cite as examples those described in [8, 9]. While a complete review is beyond the scope of this paper, it is worth noting that the different approaches can be analysed by looking at two principal aspects (depending on the characteristics of the linguistic tools adopted):

  a. How “natural” the input language is, which is normally subject to restrictions regarding grammar, vocabulary, or both;

  b. How much intervention by an analyst is needed in order to process “semi-automatically” the text or to identify the key elements for conceptual modelling.

The survey described in this paper focusses on the first of these points, one that we deem of vital importance because, whatever the approach adopted, the “naturalness” of the language directly affects the amount of effort needed to extract useful information from the documents. First, it was necessary to establish whether the documents gathered in the requirements elicitation phase were in ‘real’ natural language or in some type of restricted language, and if they were in natural language, whether the user or customer could be asked to describe the requirements using a more restricted language. In fact, if the documents are written in a ‘controlled’ language (restrictions on grammar or vocabulary), information can be extracted using syntactic or ‘shallow’ techniques, such as parse trees.Footnote 6 To obtain equivalent performance with documents in unrestricted natural language it is necessary to have a semantic representation of knowledge that embeds reasoning techniques. Such applications are currently being studied.Footnote 7 Moreover, the language used in the documents can be more or less linked to a particular application domain (for example, software for telecommunications), thus determining the degree of specialisation of the supporting linguistic tool to be used in the conceptual analysis, and therefore of its knowledge base. In other words, assuming that the basic NLP technologies are available, a company that must decide whether or not to invest in the development of an NLP-based tool for requirements analysis needs to establish first whether it is possible to design and build a general-purpose tool to support software development for different application domains, or whether it will instead be necessary to make further investments later to customise the tool for the different companies or customers it will eventually serve.
These are all essential considerations in determining the investment necessary to convert a research prototype – like those developed in the existing research projects – into a commercial tool.

Results of preliminary interviews as well as the state of the art of existing prototypes led us to decide not to investigate the degree of analyst intervention required nor the performance required of the tool (point b); on this point we limit ourselves to giving some general findings that emerged while conducting the research. To do so would have required further investment in more extensive market research; such a study would be justifiable only given a positive outcome, certainly not guaranteed, on the issues related to point a). Moreover, to assess the potential market for an NLP-based tool for requirements analysis, we studied aspects related to the diffusion of methods and instruments of software engineering. In particular, we intended to verify whether requirements analysis is in fact considered critical in relation to other important activities in software development (testing, documentation, etc.).

1.3 Structure of the paper

The paper is organised as follows: the next section describes the context of an NLP-enabled CASE tool and summarises possible applications of linguistic tools for requirements engineering. This provides information on the design of the questionnaire and the eventual interpretation of the results. The third section outlines the plan of the market research, noting the different phases and focussing on the questionnaire and on the characteristics of the respondents. The main results of the online survey are presented in the fourth section, where they are analysed using a statistical technique referred to as correspondence analysis. The profiles obtained have revealed the existence of two market niches characterised by their diverse approaches to software development. Finally, some observations are given regarding the characteristics of the survey and the extendibility of the results. The conclusions summarise how the results of the survey can be used by those who develop software in general, and by those who design tools and environments for requirements analysis in particular.

2 The role of natural language in requirements engineering

Much has been written on the importance of requirements analysis. In order to show why environments and tools to support such analysis are less satisfactory than those available for the other phases of the software life cycle, we shall briefly review the distinctive features of requirements engineering, defined as:

...the systematic approach of developing requirements through an iterative cooperative process of analysing the problem, documenting the resulting observations in a variety of representation formats, and checking the accuracy of the understanding gained. [3, p. 13]

The central importance of communicationFootnote 8 and knowledge is thus evident. Compared with other phases of software engineering, requirements analysis and conceptual modelling [15] present unique difficulties. Many of the activities involved are cognitive and require creativity as well as knowledge about information technologies and the application domain. Moreover, the recent advances brought about by business process re-engineering (BPR) and the inclusion of innovative components in information systems are broadening the scope of projects. As a consequence, the number of actors, interactions, and languages involved has increased. Completing the picture are the needs of companies, which operate at ever higher levels of competitiveness and which demand increasingly flexible information systems.

In this context, the use of linguistic tools – more precisely of NLP systems – to support the development of software systems in general and requirements analysis in particular, may help the analyst to:

  • Concentrate on the problem rather than on the modelling;

  • Interact with other actors;

  • Take into account the various kinds of requirements (organisational, functional, etc.);

  • Achieve traceability from the first documents produced onwards;

  • Manage more efficiently the problem of changing user requirementsFootnote 9.

As regards the possible applications of NLP systems to requirements engineering, it is worth noting that they are able to process both vocal and textual input, sometimes imposing restrictions such as limiting the vocabulary or the grammar.

NLP systems can be used to obtain, with different levels of performance, essentially three types of output:

  • Syntactic, semantic, or pragmatic analysis;

  • Text either in the same language or another one, natural or artificial;

  • Syntheses in the form of differently structured summaries or templates.

Figure 1 is a simplified scheme of an ideal general-purpose NLP system. It is important to remember that the systems for real applications are usually highly dependent on the task and on the domain.Footnote 10

Fig. 1. The architecture of a general-purpose NLP system

With reference to this scheme, linguistic tools of differing complexity and especially of differing maturity can be used:

  a. In the requirements elicitation phase:

    • To facilitate the digitising of requirements documents using speech recognition systems or NLP-based interrogation interfaces;

    • To reveal ambiguities and contradictions in documents describing user needs (see, for example, [12, 18, 19]);

    • To design questionnaires or interviews, by verifying the ambiguity of the questions;

    • For automatic analysis of replies to open-ended questions, interpreting and classifying their contents [20].

  b. To model requirements by extracting (directly from the text) the descriptions of the elements to be included in the conceptual models envisaged by the development method adopted, in particular UML (Unified Modelling LanguageFootnote 11) diagrams (see Fig. 2).

    Fig. 2. The models generation process

  c. To support requirements validation, by exploiting the generation functionality of NLP systems to produce descriptions in natural language based on the structures used to represent knowledge.

For completeness, it should be noted that NLP tools can also be used for documentation, generating reports on the various stages of requirements collection and modelling; for traceability, allowing a link to be maintained between the texts used and the models produced; and for the translation of documents into various languages, something that becomes increasingly necessary in the design of international information systems.

The survey described in this paper concerns the second of these points, that is, the use of NLP techniques to support the development of conceptual models, given that it requires the design of a modelling module. All the other activities could be supported by existing functionalities of an ideal NLP system, albeit with varying levels of performance. The most important assumption is that the requirements documents, once analysed, can contribute to a “knowledge base” from which to extract elements deemed useful for modelling activities. There are two important aspects to note regarding projects for developing this type of instrument: (i) many of these projects are based on ad hoc NLP systems, and therefore do not appear to meet the scalability and robustness requirements of real applications; and (ii) given the complexity of natural language, almost all of them expect that documents will be written in restricted language or that some revision of the text will have taken place before the automatic analysis. These two facts are worth remembering when interpreting the results of market research and when estimating potential investments in NLP technologies, and certainly when developing a CASE module to support requirements analysis.

3 Plan and realisation of the market research

The decision to investigate the market for an NLP-based tool for requirements analysis was made in the context of a joint research project with the Department of Computer Sciences of Durham University (UK), in which a prototype CASE tool – called NL-OOPSFootnote 12 – was developed for requirements modelling according to the object-oriented approach [21, 22].

The market research described here was based on the administration of a questionnaire whose design required consideration of the experience gained throughout the development of NL-OOPS and of the methodology and techniques of online market research. Specifically, the research progressed in the following phases:

  • Preliminary survey

  • Identification of interview subjects

  • Designing and testing of the questionnaire

  • Selection of the contact method

  • Distribution of the questionnaire and reminders

  • Collection and analysis of the data

A description of each phase follows, with greater emphasis on the third phase (designing the questionnaire) and on the final stage (analysis of data).

Preliminary survey

The first step in the research project was to create a focus group composed both of companies that develop linguistic instruments and of big and small businesses that develop software or offer services linked to the introduction of information technologies in the workplace. The goal of this phase was to collect information about the users’ needs that could be satisfied with an NLP-based CASE tool and to gather other information useful in designing the questionnaire. The researchers were immediately confronted with pessimistic views of tools that use NLP techniques to support requirements analysis. In particular, some focus group members expressed serious doubts that the language in the documents gathered for requirements analysis was sufficiently ‘natural’ to justify the adoption of a tool based on NLP techniques. Others questioned the technical feasibility of such tools, citing their own unsatisfactory experiences with other NLP applications such as translation programs.

Identification of interview subjects

In accordance with the objective of the study, the questionnaire was directed principally to persons involved in software development, and in addition to managers responsible for important decisions regarding the process of software development, including the decision to adopt methodologies and support instruments. From a statistical viewpoint, when dealing with a survey conducted via the Internet, one of the main problems is to establish the degree to which the sample is representative of the target population, in this case the people or companies involved in software development. On the one hand, it is reasonable to assume that the intended respondents are reachable via the Internet, while on the other hand the population has characteristics (number, size, geographic distribution, etc.) that are not documented. Given this, and also considering the chosen methods of contact, the approach to the study is conceptually similar to sequential sampling. Statistically, this classifies it as a descriptive study, and as such it requires caution when extending the results beyond the survey sample.

Designing and testing of the questionnaire

Again considering the objectives of the study, in terms of both methodology and content, the survey was conducted only via the Internet and consisted of a questionnaire on a web page (see the Appendix).Footnote 13 This choice was the driving force during the design and testing stage, the aim being to have a concise questionnaire with closed-ended questions in language as clear as possible.Footnote 14 As for the questions themselves, the choices were made as logical and pertinent issues emerged throughout the course of the focus group. After a phase of testing in which the questionnaire underwent the scrutiny – first directly and then online – of a select group of analysts and project managers, the final version was produced. The final questionnaire was divided into two sections, for a total of eighteen questions, and a final open question for further observations. The first section consisted of questions relating to the company (questions 1–4) and to the respondent (questions 5 and 6). The second section investigated processes of software production: one group of questions concerned the use of methodologies (questions 7–10) and tools (questions 13 and 14) in software development; another group dealt with documents used in requirements analysis (questions 11, 12, and 15); and the last three were about the efficiency of the development process (questions 16, 17, and 18). The respondents were also asked if they were interested in obtaining the results of the research or in viewing a demonstration of a prototype of an NLP-based CASE tool. The decision to introduce questions associated with an engineering approach to software development was made after verifying the possibility of using existing data. SurprisinglyFootnote 15, only a small amount of data was found, whether for the diffusion of object-oriented methodology or for the use of ‘classic’ models such as entity-relationship models.
These are important because the early research and conceptual models for linguistic analysis of requirements [7] looked to produce entity-relationship diagrams; moreover, these models can be seen as a particular case of the class models foreseen by the object-oriented approach. As regards the market for CASE toolsFootnote 16, in many cases they did not meet expectations and as a consequence did not have the desired market success [25]. We will have to wait for the adoption of the UML – developed about one year before the present research project began – as a standard for conceptual modelling by the OMG (Object Management Group); only then will there be a significant growth in the market for CASE tools, repackaged and renamed as object modelling tools or visual modelling tools. In short, the scarcity of data on the penetration and role of an engineering approach to software development influenced not only the choice of questions for the survey but also, as we shall see, the ability to validate and extend the results.

The questions considered most important for verifying the existence of a market niche for an NLP-based CASE tool are those related to the documents used to collect requirements. In fact, as we have already seen, if documents are in real natural language, an even more sophisticated (and costly) technology is needed to develop an environment that effectively supports analysis using linguistic instruments. It is therefore useful to establish whether the company is in a position to require clients or analysts to describe requirements in a restricted language. Typical restrictions can include: (i) grammar – aiming to have syntactic constructions that are easier to analyse by requiring, for example, shorter sentences, use of the active voice, and avoidance of anaphoric references; and (ii) vocabulary – aiming to reduce ambiguity of terms. Moreover, in order to determine the degree of customisation required of a possible NLP-based tool, further questions dealt with the level of specialisation of the terminology and the domain knowledge required to develop the software.
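To make the two kinds of restriction concrete, the following is a minimal sketch of a checker for a restricted requirements language. It is not taken from any of the tools discussed in this paper: the sentence-length limit, the pronoun list, the passive-voice pattern, and the function names are all assumptions chosen for illustration, and the heuristics are deliberately crude (a real tool would rely on a parser).

```python
import re

# Hypothetical limits and word lists, chosen only for illustration.
MAX_WORDS_PER_SENTENCE = 20
PRONOUNS = {"it", "its", "they", "them", "their", "he", "she", "him", "her"}
# Crude passive-voice cue: a form of "to be" followed by a word ending in -ed/-en.
PASSIVE_PATTERN = re.compile(
    r"\b(is|are|was|were|be|been|being)\s+\w+(ed|en)\b", re.IGNORECASE
)


def split_sentences(text):
    """Split text after '.', '!' or '?' (a deliberately naive rule)."""
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    return [p for p in parts if p]


def check_sentence(sentence, vocabulary=None):
    """Return a list of possible violations of the restricted language."""
    words = re.findall(r"[A-Za-z][A-Za-z'-]*", sentence.lower())
    issues = []
    # Grammar restriction: shorter sentences.
    if len(words) > MAX_WORDS_PER_SENTENCE:
        issues.append("too long")
    # Grammar restriction: active voice.
    if PASSIVE_PATTERN.search(sentence):
        issues.append("possible passive voice")
    # Grammar restriction: no anaphoric references (approximated by pronouns).
    if any(w in PRONOUNS for w in words):
        issues.append("possible anaphoric reference")
    # Vocabulary restriction: only terms from an agreed glossary, if one is given.
    if vocabulary is not None:
        unknown = sorted(set(words) - set(vocabulary))
        if unknown:
            issues.append("unknown terms: " + ", ".join(unknown))
    return issues


def check_document(text, vocabulary=None):
    """Map each sentence with at least one issue to its list of issues."""
    report = {}
    for sentence in split_sentences(text):
        issues = check_sentence(sentence, vocabulary)
        if issues:
            report[sentence] = issues
    return report
```

For example, `check_document("The system stores the order. It is validated by the clerk.")` flags only the second sentence, for both a possible anaphoric reference and a possible passive construction. The point of the sketch is that such restrictions can be checked with shallow techniques, which is what makes a controlled language cheaper to support than unrestricted text.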

In the questions related to the efficiency of production processes, respondents were asked in particular about the improvements that they would like to see (choosing from a list of eight possible activities considered critical, two of which are fundamental for the phase of requirements analysis) and how they could be achieved, the choice being between ‘internal delegation’, ‘outsourcing’, and ‘automation’. The final question was designed to ascertain whether the company was able to deliver the software systems or products without delays. Finally, in keeping with the general rule of market research, an incentive to participate was provided in the form of a random draw among respondents for tickets to an opera performance at the Arena in Verona.Footnote 17
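The grouping of the eighteen questions described in this section can be written down as a small data structure, which also makes it easy to check that every question belongs to exactly one group. The group labels below are our own shorthand rather than names used in the questionnaire; only the question numbers come from the description above.

```python
# Question groups as described in the text; the labels are hypothetical shorthand.
QUESTION_GROUPS = {
    "company": [1, 2, 3, 4],
    "respondent": [5, 6],
    "methodologies": [7, 8, 9, 10],
    "requirements_documents": [11, 12, 15],
    "tools": [13, 14],
    "process_efficiency": [16, 17, 18],
}


def covered_questions(groups):
    """Return the sorted list of all question numbers that appear in any group."""
    return sorted(q for numbers in groups.values() for q in numbers)


def group_of(question, groups=QUESTION_GROUPS):
    """Return the label of the group that contains the given question number."""
    for label, numbers in groups.items():
        if question in numbers:
            return label
    raise KeyError(f"question {question} is not assigned to any group")
```

Here `covered_questions(QUESTION_GROUPS)` yields the numbers 1 to 18 with no gaps or repeats, which is consistent with the total of eighteen closed questions stated above.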

Selection of the contact method

The objectives of the research and the characteristics of the tool inherently required a contact method that would permit efficient use of time and resources while at the same time reaching the largest number of potential respondents. On this point, to take into account the high level of saturation – due to the large number of such survey requests that the respondents receive – we initially decided to send the questionnaire to some specialised newsgroupsFootnote 18, highlighting the academic nature of the research. In the first phase we identified three newsgroups whose work is related to the research topic (comp.object, comp.software-eng, alt.comp.software-tools); another 21 newsgroups were later added to the list (the complete list is available at http://online.cs.unitn.it/). Nonetheless, after this method of contact proved less successful than expectedFootnote 19, we decided to contact the companies directly by email, supplying them with the address of the Web page where they could find and complete the questionnaire. The companies’ addresses were acquired online using search engines, in particular a Yahoo! directory (http://www.yahoo.com – Computer > Software > Developers).

Distribution of the questionnaire and reminders

As described above, the questionnaire was administered in two different ways. In the first phase it was publicised on a number of newsgroups devoted to software development (resulting in 44 completed questionnaires from 39 software companies), and in the second, requests to take part in the survey were sent by e-mail to 1541 addresses corresponding to 1234 software companies. By means of this second method, 107 completed questionnaires, corresponding to 103 companies, were obtained. To get these results, it was necessary in many cases to send a message reminding the recipient to participate in the study, while at the same time allowing him or her to explain the decision not to complete the questionnaire. Reasons given for not completing the questionnaire frequently referred to a lack of time and the large number of requests of this kind received (the email messages sent are accessible online at http://on-line.cs.unitn.it/). In addition, several addresses were incorrect, although the percentage was rather low (7.6%, or 6.1% if calculated by number of companies).Footnote 20 Consequently, the number of valid contacts was 1424, corresponding to 1159 companies.

Collection and analysis of the data

A total of 151 questionnaires were returned, 91% within five days of sending the initial request or the questionnaire itself. The response rate calculated for the questionnaires sent via email was around 8%. This can be regarded as a satisfactory result when compared with traditional surveys conducted by post or fax, and with other surveys of software development, for which the response rate has been 3% [25].Footnote 21 In strictly statistical terms, the group of companies contacted – while constituting in itself a large number – cannot be taken as a representative sample of the population of software development companies. Given this, it is important that the results be interpreted in a descriptive mode, which requires caution in extending them. We shall see, however, that for some questions the quality of the survey results can be evaluated by comparing them with those obtained from other surveys and with data relative to the CASE market. The results of these comparisons are provided at the end of the next section.
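The contact and response figures reported above can be cross-checked with a few lines of arithmetic. The sketch below uses only the counts given in the text; the variable names are ours, and since the text does not say whether the response rate of around 8% was computed per address or per company, both are calculated.

```python
def percentage(part, whole):
    """Return part as a percentage of whole."""
    return 100.0 * part / whole


# Counts reported in the text for the email phase.
addresses_sent = 1541
companies_contacted = 1234
valid_addresses = 1424
valid_companies = 1159
email_questionnaires = 107
email_companies = 103

# Count reported for the newsgroup phase.
newsgroup_questionnaires = 44

# Share of incorrect addresses, by address and by company.
invalid_address_rate = percentage(addresses_sent - valid_addresses, addresses_sent)
invalid_company_rate = percentage(companies_contacted - valid_companies, companies_contacted)

# Response rate for the email phase, by valid address and by valid company.
response_rate_by_address = percentage(email_questionnaires, valid_addresses)
response_rate_by_company = percentage(email_companies, valid_companies)

total_questionnaires = newsgroup_questionnaires + email_questionnaires

print(f"incorrect addresses: {invalid_address_rate:.1f}%")
print(f"incorrect addresses by company: {invalid_company_rate:.1f}%")
print(f"response rate by address: {response_rate_by_address:.1f}%")
print(f"response rate by company: {response_rate_by_company:.1f}%")
print(f"total questionnaires: {total_questionnaires}")
```

The two error percentages reproduce the reported 7.6% and 6.1%, the total matches the 151 questionnaires returned, and the response rate comes out at roughly 7.5% per valid address and 8.9% per valid company, both compatible with the figure of around 8%.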

On a methodological level, the use of newsgroups confirmed that little effort was required to ask respondents to participate, but the low number of questionnaires completed may nullify this advantage. Furthermore, the use of newsgroups should be evaluated on the basis of the following factors: level of specialisationFootnote 22, number of messages, and presence of a moderator. In light of the results of our survey, in the case of very specialised newsgroups, even if the contents of the survey are relevant to them, in order to increase the response rate it is advisable to ask for the moderator’s consent, or to identify one or more newsgroup leaders who can legitimate the survey with their participation.

The initial analysis noted the geographic distribution of the respondents, most of whom are residents of European states or of North America (see Fig. 3). This first result of the research is supported by the analysis of similarities among different geographic distributions (using appropriate indices) showing, in fact, that these markets have similar characteristics. Given this, we present here results of the survey in its entirety, highlighting only those aspects where the geographic area of residence influenced the responses.

Fig. 3. The respondents by geographical area of residence

Eighty-six percent of the respondents filled roles relating to software development projects, 68% having occupied the role for more than six years.Footnote 23 Moreover, as was to be expected, length of service influenced the position occupied in the company, so that programming work was more frequently performed by persons employed for the shortest periods, while those who had worked in their companies for 6–10 years were almost uniformly distributed among roles. It should be noted that the majority of European respondents selected ‘System Engineer/Architect’ whereas their American counterparts selected ‘Project Manager’, which may be because different terms are used to denote the same role in the two areas. Some 29% of the respondents worked in companies with more than 100 employees, although small-sized companies were also well represented (Table 1).

Table 1. Company size

The core business of the companies surveyed in 77% of the cases is ‘Software’ and in 23% is ‘Websites’ or ‘Other’. As expected, the highest percentage of companies engaged in other types of business (or rather, also in other types of business) consisted of larger-sized ones. As regards the type of software produced, 42% of the companies developed software for niche markets (Fig. 4), with a high of 48% for North America. This may be due to the presence of a larger number of small-sized companies, given that 59% of companies with five or fewer employees, and 24% of those with more than 100, operated in niche markets. Software products were mostly sold to the end user: 84%;Footnote 24 only 13% sold to another software company, and 3% to software shops. Interestingly, given the nature of this type of product, all the companies that developed websites sold their products directly to the end users.

Fig. 4. Type of software

The next section provides a detailed analysis of the results of research into the existence of a potential market for an innovative tool to support conceptual analysis – a tool that has the capability to analyse documents written in varying levels of natural language.

4 The results of the survey and the potential demand for an NLP-based tool to support requirements analysis

We can identify three groups of elements that are useful in evaluating potential demandFootnote 25 for a CASE tool to support requirements analysis for documents written in natural language. They can be described as follows, taking into account their interrelatedness:

The market for instruments supporting software development and requirements modelling

How extensive is the market? How much competition is there? Do software developers use CASE tools? If so, which ones? (Normally the use of a CASE tool presupposes the adoption of a development methodology.) This last point was important both for establishing which conceptual models the tool should support (an aspect that became less important with the diffusion of UMLFootnote 26), and for reasons of compatibility and integration with existing tools.Footnote 27 Some information on this point could be obtained by means of the data on sales of CASE tools, but one question on this topic was inserted regarding the tools supporting requirements analysis and top-level design.

Features of the tool

The requirements principally influencing the investments necessary to develop a tool for requirements analysis based on linguistic instruments are (a) the language found in the documents gathered in the elicitation of requirements phase, crucial in identifying appropriate techniques and linguistic instruments, and (b) the degree of specialised domain knowledge required of the tool, which determines the degree of specialisation required of the producer of the CASE tool (generality). Also, given the state of the art of linguistic instruments, an important consideration is the performance required of the tool; in other words, how ‘good’ does it have to be to merit purchase?Footnote 28

Requirements analysis viewed as crucial

This is a vital element in identifying potential market niches and in ascertaining the tendency of users to invest in a tool that supports requirements analysis, as well as their willingness and ability to accept the changes that accompany the adoption of a new tool. Companies that have an engineering approach to software development have highly standardised processes and should therefore consider the activities lacking structure or support as crucial points demanding attention. A company employing a more informal or ‘craft’ process would not necessarily share this concern but would, however, be more interested in the use of natural language.

To glean the most useful information on these three points, we analysed the completed questionnaires in two phases. In the first phase we looked at individual answers, studying reciprocal relationships and dependencies. In the second phase we applied correspondence analysis [28], aiming to unveil the existence of profiles corresponding to potential market niches for an innovative CASE tool.

4.1 The market for instruments supporting software development and requirements modelling

Only 30% of respondents reported using a tool supporting requirements analysis and top-level design. As expected, these tools were used more in large companies, reaching 51% in those with more than 100 employees, as shown in the table of conditional distributions (Table 2). Not surprisingly, the use of these tools increases with length of service (rising from 17% to 36%), with analysts being the category of employee using them most frequently.
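A table of conditional distributions of this kind is obtained by normalising each row of a cross-tabulation. The following minimal sketch illustrates the computation; the counts and the size labels are invented for illustration and are not the survey data.

```python
import numpy as np

# Hypothetical counts (NOT the survey data): rows = company size class,
# columns = [uses a tool, does not use a tool].
size_classes = ["small", "medium", "large"]
counts = np.array([[10, 40],
                   [15, 35],
                   [25, 25]], dtype=float)

# Conditional distribution of tool use given company size:
# divide each row by its own total, so every row sums to 1.
conditional = counts / counts.sum(axis=1, keepdims=True)

# Marginal share of tool users over the whole (made-up) sample.
overall_share = counts[:, 0].sum() / counts.sum()

for name, row in zip(size_classes, conditional):
    print(f"{name}: {row[0]:.0%} use a tool")
print(f"overall: {overall_share:.0%}")
```

Comparing each row's share with the overall share is what allows statements such as "greater use in large-sized companies" to be read directly from the table.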

Table 2. Use of tools for requirements analysis and top-level design by company size

Moreover, 84% of the respondents stated that they used specific methodologies for software development. Size was a determining characteristic here: 78% of companies with five or fewer employees used specific methodologies, compared with 93% of those with more than 100. The type of software or the sales channel does not significantly influence the use of methodologies, although role and experience seem to do so to some extent.

The best-known diagrams for data modelling, entity-relationship (E-R) diagrams, were used by 63% of respondents who adopted a methodology. They were used less often in smaller companies (52% in companies with fewer than five employees, 73% in those with more than 100). The use of E-R diagrams was substantially greater among respondents who had worked longer in the computer business (increasing from 35% among those who had worked in the field for less than three years to 66% among those who had done so for more than ten). Finally, as regards the type of software, E-R diagrams were used to very different extents by respondents who developed general-purpose software (93%) and by those who developed network software (25%), while there were no substantial differences for the other items.

The percentage of respondents who used an object-oriented (OO) method was 68%, similar to that of E-R diagram users. The classification by company size shows a difference between companies with five or fewer employees (60% of which used OO methods) and those with more than 100 (74% of which did so). There are no significant variations with respect to years of experience, while there is a closer association with the position occupied within the company: the percentages ranged from 45% for programmers to 78% for system engineers/architects. Table 3 offers an interesting comparison: respondents who adopted OO methods were already accustomed to using E-R diagrams, which suggests that E-R users were more inclined to adopt an OO approach.

Table 3. Entity-relationship diagrams and object-oriented methods

As for the most widely used OO method, 77% of respondents who replied in the affirmative to the previous question declared that they used UML. This result confirms the establishment of UML as the industrial standard for OO modelling. It is worth mentioning that the survey was carried out approximately one and a half years after the adoption of UML by the OMG.

It also emerged that the great majority of the respondents who said that they did not use methodologies did not use tools for requirements analysis and top-level design either (90%): indeed, there is an association between the use of methodologies and CASE tools. Another finding to be emphasised is the connection between the use of CASE tools for requirements analysis or top-level design and the type of language employed in documents. Not unexpectedly, these tools were used more frequently when the language was more formal (24% with ‘Common natural language’ and 63% with ‘Formalised language’). Although these results should be treated with caution, given the low number of companies surveyed, they appear to confirm that currently available CASE tools cannot handle natural language well enough to provide genuinely useful environments. As far as the tools used are concerned, 52% of respondents who replied in the affirmative to the previous question declared that they used Rational Rose.Footnote 29 Rational Rose was the tool with the highest market share both worldwide and in Europe.Footnote 30 In 1998 it accounted for 33% of the market, with an increase of 79% on the previous year.Footnote 31 For this reason, the percentage found by our survey (52% for the year 1999) is in line with what one would expect.Footnote 32

4.2 Features of the tool

As noted, the type of language used in requirements documents determines the complexity of the linguistic instruments and of the NLP techniques to be used. When documents are written in a constrained language (a subset of natural language) – which imposes restrictions on the grammar, vocabulary, or both – simpler and more mature linguistic tools can be used. However, it is not usually possible to impose restrictions on the language employed. This is so, firstly, because it is necessary to adopt a customer-oriented approach in the development of software applications and, secondly, because it is necessary to reduce the risk that the restrictions imposed on the language and the formalisms adopted will force the user, or even the analyst, to express what the models permit to be represented, rather than the real requirements of the system. The survey shows that, in both Europe and North America, requirements documents are furnished directly by the customer and integrated with interviews in around two-thirds of projects. The main difference between the two regions considered was the percentage of companies that conducted interviews with customers: 73% in North America and 58% in Europe, without significant differences in behaviour between small- and large-sized companies.

With regard to the level of the terminology in requirements documents, one finds that 79% of the latter are couched in natural language (Fig. 5). For the correspondence analysis, the final two modalities (structured and formalised language) have been merged.

Fig. 5. Level of terminology in the requirements documents

An analysis of the interdependence of the use of natural language with the other factors examined did not show any significant association with type of company, nor with the adoption of a methodology.

Another important aspect concerning both the potential demand for an NLP-based CASE tool in particular and software development in general is the domain knowledge required for an adequate understanding of the problem so that the user’s requirements can be defined. In fact, in the presence of high levels of specialist knowledge, the tool must be adapted to the needs of every customer if it is to operate efficiently in different corporate settings. By contrast, a very low level permits the development of a single standard tool able to operate in different fields of application. In this regard, it was found that respondents required an average (54%) to high (34%) level of domain knowledge. It also emerged that the higher the level of domain knowledge required to develop the software, the greater the use of methodologies (9% for low levels, 53% for average ones, and 38% for high ones) and of tools for requirements analysis and top-level design (2%, 56%, and 42%, respectively).

4.3 Requirements analysis viewed as crucial

As regards the efficiency of production processes, upon conclusion of the market study it was important to determine which software activities were viewed as crucial, as well as their weight relative to requirements (question 16).

In interpreting the answers to this question, it is worth noting that respondents were asked to make two selections, so the percentages sum to more than 100. Fig. 6 shows that ‘Identify user requirements’ and ‘Model user requirements’ were cited as priorities by a high percentage of respondents.Footnote 33 ‘Identify user requirements’ was largely independent of the language used to model requirements (46% for ‘Common natural language’, 37% for ‘Structured natural language’, and 50% for ‘Formalised language’), as was ‘Testing the software’ (35%, 32%, 38%, respectively); for ‘Model user requirements’, by contrast, the percentages were 38% for ‘Common natural language’ and 13% for ‘Formalised language’, in accordance with expectations. Another noteworthy finding is that testing was viewed as crucial by higher percentages (ranging from 19% to 46%) of the respondents who used no tools at all. A similar pattern is displayed by the level of domain knowledge necessary, where at low levels of knowledge, testing was perceived as more important than all the other activities (63%, compared to 32% and 30% for medium to high levels of knowledge). Also of interest is the fact that ‘Learn to use a new tool’ was selected by a higher percentage of respondents declaring that they did not use a tool for requirements analysis than by those who said that they used a tool of this kind.Footnote 34

Fig. 6. Activities perceived as crucial in software development
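Because each respondent could select two activities, the shares are computed per respondent rather than per selection, and so add up to more than 100 percent. A minimal sketch of this calculation follows; the four responses are invented for illustration and are not taken from the questionnaires.

```python
from collections import Counter

# Hypothetical answers (NOT survey data): each respondent picks two activities.
responses = [
    ("Identify user requirements", "Testing the software"),
    ("Identify user requirements", "Model user requirements"),
    ("Model user requirements", "Testing the software"),
    ("Identify user requirements", "Learn to use a new tool"),
]

# Count how many respondents selected each activity.
counts = Counter(activity for pair in responses for activity in pair)

# Share of respondents citing each activity (denominator = respondents).
shares = {activity: n / len(responses) for activity, n in counts.items()}

# With two selections per respondent the shares sum to 2.0, i.e. 200%.
total = sum(shares.values())

for activity, share in sorted(shares.items(), key=lambda kv: -kv[1]):
    print(f"{activity}: {share:.0%}")
print(f"sum of shares: {total:.0%}")
```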

The importance of this question requires a comparison of the results for Europe and North America (see Fig. 7). Also the correspondence analysis – reported in the second part of this section – was done taking into account the centrality of this question with respect to the objectives of the market research, in which the activities considered most critical become determinative when identifying profiles.

Fig. 7. Activities perceived as crucial in software development (Europe vs North America)

To the question ‘What would be the most useful thing to improve general day-to-day efficiency?’, the majority (64%) chose the option ‘Automation’, while ‘Outsourcing’ was selected by 7% and ‘Internal delegation’ by 29%. Contrary to expectations, no particular differences emerged among the replies to this question with respect to company size. The only significant difference concerned companies with 6 to 20 employees, in which the percentage selecting ‘Internal delegation’ was nearly double that for other company groups, a difference that may be due to organisational shortcomings. Interestingly, the percentage of respondents who used a methodology or a requirements analysis tool and believed it less important to increase the level of internal delegation was above the average of the entire sample. By contrast, there were no differences regarding the documents available for requirements analysis.

Joint analysis of the two questions on the efficiency of software production processes shows that a larger percentage of respondents who believed it important to increase the level of automation had previously selected ‘Learn to use a new tool’ and ‘Model user requirements’ (Table 4).

Table 4. Efficiency of software development processes

For the final question, regarding the average delay in delivery of the software, the best performances were achieved by companies with 6–20 employees (29% of which delivered with less than one week of delay and 59% with less than one month) and by those who sold directly to the end consumer (probably for contractual reasons). Though not to a statistically significant extent, companies using formalised language delivered with the least delay, although there were no substantial differences as regards delays of more than one month (26% for common natural language, 33% for structured natural language, 25% for formalised language). A fair interpretation of these results requires one to remember that the answers do not factor in the length of the projects. Nonetheless, assuming that an average delay of less than one week corresponds to companies which on average deliver the software within the designated time, similar findings are reported in [32], where more than 80% of the respondents stated that their projects were sometimes or usually late.

Considering the purpose of this study, and particularly the question of whether there is a market for an NLP-based CASE tool for requirements analysis, the results presented thus far confirm the perception of requirements analysis as crucial for the development of systems, the widespread use of the object-oriented approach and of UML, and the important role of natural language. Specifically:

  • More than 80% of the companies adopt a methodology to develop their software, and nearly 68% of them adopt an object-oriented method (UML or one of the methods merged into UML).

  • The majority of the documents available for requirements analysis are in natural language and are either furnished by the customer or obtained by means of interviews.

  • The domain knowledge required is medium to high.

  • Tools supporting requirements analysis and top-level design are used in less than one-third of cases.

  • However, identifying and modelling requirements are perceived as being at least as important as testing the software.

  • A higher level of automation is indicated by around 64% of the respondents as the most useful means to improve day-to-day efficiency.

All of these elements work together to confirm the existence of a potential demand for a CASE tool based on NLP. To justify this claim, we undertook a correspondence analysis (CA) study. CA is a statistical technique suited to studying the relationships between the modalities (categories) of two or more variables, usually qualitative. The main steps of correspondence analysis can be concisely described as follows:

  1. Define a cloud of points (rows and columns of a contingency table) in a multidimensional vector space.

  2. Choose the metric structure on this space.

  3. Produce the fit of the cloud in step 1 to a variable low-dimensional subspace onto which the points (row and column profiles) are projected for display.

  4. Give an interpretation of the clusters of points corresponding to the projections of the rows and columns of the original contingency table; analyse their absolute contributions as guides to the interpretation of the underlying dimensions and their relative contributions (the so-called squared correlations) to indicate how well the points are described along the considered dimension.
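The four steps can be illustrated with a minimal sketch of simple CA for a single two-way table, computed through a singular value decomposition. This is not the implementation used in the study (the software employed is not specified here); the function name, the choice of the chi-square metric, and the contingency table are assumptions made for illustration only.

```python
import numpy as np

def correspondence_analysis(table, n_dims=2):
    """Simple CA of a two-way contingency table (illustrative sketch)."""
    N = np.asarray(table, dtype=float)
    P = N / N.sum()                  # step 1: relative frequencies
    r = P.sum(axis=1)                # row masses
    c = P.sum(axis=0)                # column masses
    expected = np.outer(r, c)        # frequencies under independence
    # Step 2: standardised residuals embed the chi-square metric.
    S = (P - expected) / np.sqrt(expected)
    # Step 3: the SVD gives the best low-dimensional fit of the cloud.
    U, sv, Vt = np.linalg.svd(S, full_matrices=False)
    rows_all = (U * sv) / np.sqrt(r)[:, None]      # row principal coordinates
    cols_all = (Vt.T * sv) / np.sqrt(c)[:, None]   # column principal coordinates
    k = n_dims
    return {
        "inertia": sv ** 2,                  # principal inertia per axis
        "row_coords": rows_all[:, :k],
        "col_coords": cols_all[:, :k],
        # Step 4: absolute contributions (mass * coord^2 / axis inertia).
        "row_contrib": U[:, :k] ** 2,
        "col_contrib": Vt.T[:, :k] ** 2,
        # Step 4: squared correlations (quality of display of each row).
        "row_cos2": rows_all[:, :k] ** 2
                    / (rows_all ** 2).sum(axis=1, keepdims=True),
    }

# Hypothetical counts (NOT survey data): rows = critical activity cited,
# columns = an arbitrary three-level explanatory variable.
table = [[30, 12, 8],
         [10, 25, 9],
         [6, 11, 22],
         [14, 9, 15]]

result = correspondence_analysis(table)
share = result["inertia"] / result["inertia"].sum()
print("share of inertia on the first two axes:", share[:2].round(3))
print("row coordinates:\n", result["row_coords"].round(3))
```

In this sketch the absolute contributions of the rows to an axis sum to one, so a row with a large value can be read as one that "defines" that axis, while the squared correlations indicate how faithfully a point is represented in the displayed plane.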

The geometry of CA is very similar to Karl Pearson’s [33] geometric description of principal components analysis. The closeness of the points to a line, plane, or in general to a low-dimensional subspace is defined as the sum of squared distances from the points to the subspace. In general, it is important to avoid the direct comparison of the distances among the projections of row and column profiles because they belong to different low dimensional subspaces and the raw interpretation of their distances may produce misleading conclusions.
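For reference, the closeness criterion can be written in the notation commonly used for CA. The symbols below are assumptions introduced here and do not appear elsewhere in this paper: $p_{ij}$ are the relative frequencies of the contingency table, $r_i$ and $c_j$ the row and column masses, $a_i$ the $i$-th row profile, $\bar{a}$ the average row profile, and $n$ the total count. In this standard formulation the squared distances are weighted by the masses and measured in the chi-square metric.

```latex
% Chi-square distance between two row profiles i and i'
\[
d_{\chi^2}^2(i, i') \;=\; \sum_{j} \frac{1}{c_j}
  \left( \frac{p_{ij}}{r_i} - \frac{p_{i'j}}{r_{i'}} \right)^{2}
\]
% Best-fitting k-dimensional (affine) subspace S for the cloud of row profiles
\[
\min_{S \,:\, \dim S = k} \;\; \sum_{i} r_i \, d_{\chi^2}^2(a_i, S)
\]
% Total inertia of the cloud, linked to the chi-square statistic of the table
\[
\sum_{i} r_i \, d_{\chi^2}^2(a_i, \bar{a})
  \;=\; \sum_{i,j} \frac{(p_{ij} - r_i c_j)^2}{r_i c_j}
  \;=\; \frac{\chi^2}{n}
\]
```

The same construction applies to the column profiles with the roles of rows and columns exchanged, which is why distances between a row point and a column point are not directly comparable.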

Here we have considered a CA in which one item of the questionnaire (‘What should be done more efficiently’) is treated as the dependent variable and a set of other collected variables (number of employees, core business, kind of software produced, use of any methodology, starting documentation, level of terminology, use of any tool, knowledge of domain, thing to improve the day-to-day efficiency, average delay in delivering the software) as independent variables. The aim was to verify whether, and to what extent, the answer to this item is influenced by the modalities of the other variables, and to identify relevant aggregations of modalities that can reveal the potential market demand for a CASE tool based on NLP.

We present here the result of the application of the CA based on the responses to the question regarding which activities are considered most critical (see Fig. 8).Footnote 35

Fig. 8. Output of the correspondence analysis [Two points (‘Other’ in question 16 regarding critical activities, and ‘Outsourcing’ for question 17) have not been represented because of their great distance from the centre (low frequency), thereby making the graph more comprehensible.]

An initial interpretation of the graph can be reached by looking at the axes. Specifically, one can interpret the vertical axis in organisational terms, assuming that the request for more automation rather than internal delegation is due to an already more or less solid organisational structure. The horizontal axis, meanwhile, distinguishes an engineering approach from a more informal approach to software development, depending on the use or non-use of methodologies and instruments to support analysis and design.

According to this interpretation of the graph, there are two potential market niches. The first market niche corresponds to companies that adopt methodologies and instruments to support requirements analysis and top-level design. We can safely assume that they use an ‘industrial’ rather than ‘craft’ software development process. For this type of company, project evaluation is considered a critical activity, along with requirements identification. These two activities, among the possible activities listed in the questionnaire, are the most interdisciplinary and at the same time the most difficult to structure. In particular, for purposes of our study, requirements identification can be efficiently supported by tools able to analyse documents in natural language. Moreover, for this type of company, the tool should be specialised to have an appropriate level of domain knowledge for the given area of software development. The client provides requirements documents and the software produced is in turn delivered to the client. For a customer-oriented approach, this means having only a limited possibility to ask the client to write the documents in a restricted form of natural language; however, these companies sometimes receive the documents in a somewhat structured (formalised) form. In these cases it is possible to envision the use of less sophisticated linguistic techniques to analyse requirements documents in order to produce conceptual models using the object-oriented approach.

The second market niche includes medium- or large-sized companies that use neither methodologies nor instruments to support requirements analysis and top-level design. They do, however, perceive requirements modelling as critical, along with other activities such as software documentation and testing, which are already supported in varying ways by existing CASE tools. One can reasonably conclude that this second group of companies also constitutes a market niche for a CASE tool enabled by linguistic instruments. In fact, a CASE tool of this type could integrate the functionalities of a traditional CASE tool, favouring the adoption of an engineering approach in software development. Another activity deemed critical is learning new tools, an obstacle that could be surmounted by adopting a CASE tool that makes extensive use of natural language. The indication of requirements modelling rather than identification brings to light the fact that a problem at the level of requirements specification can hide deeper problems related to requirements elicitation (these can be supported by speech recognition systems and by all the functionalities envisaged in point (a) of Sect. 2). This is confirmed to some extent by the fact that identification, rather than modelling, of requirements is considered critical by the companies that adopt a more structured approach to software development.

An important aspect of this research is the broader application of the results. As noted, this research is descriptive, based on a large number of questionnaires (among the highest we have seen in our studiesFootnote 36), yet not fully representative of the population. The fact is that for the software industry, there simply is not enough information on the reference population to permit a meaningful and statistically correct extension of the results.

Having said this, we maintain that it is useful to make a comparison with data available in the literature. Table 5 summarises the most significant of these. Worth noting is the scarcity of existing data. Although the surveys to which these results refer are very different,Footnote 37 their similarities do stand out.

Table 5. Comparison with results relative to other surveys and the CASE market

We can also cite here some data found in [34], which contains detailed indications of the percentage of pages in natural language or similar forms – text with keywords, hierarchical enumeration, and tables – for three projects, having values ranging from 82% to 99% (73%, 43.9%, and 34.4%, respectively, only for natural language text).

Another aspect that enables positive assessment of the outcome of the survey is the low percentage of non-replies (1.65%) and the fact that in the case of replies for which the option ‘Other’ was selected, in 91% of cases a specification was given.

5 Conclusions

As the principal aim of this research project was to assess if there is a market for NLP-enabled CASE tools, the most important finding is that the majority of the documents available for requirements analysis are provided by the customer and couched in ‘real’ natural language, leading to the conclusion that the use of linguistic techniques and tools may perform a crucial role in providing support for requirements analysis.

Because an engineering approach suggests the use of linguistic tools suited to the language employed in the narrative description of user requirements, we find that in a majority of cases it is necessary to use NLP systems capable of analysing documents in full natural language. If the language used in the documents is controlled (giving a subset of natural language), it is possible to use simpler and therefore less costly linguistic tools, which in some cases are already available. Instruments of this type can also be used to analyse documents in full natural language, even if in this case more analyst consultation is required to reduce the complexity of the language used in input documents or to intervene automatically in the models produced as output. Moreover, in many cases specialised knowledge of the domain is needed in addition to an adequate representation of the shared/common knowledge. Once again, the management of expert knowledge requires more substantial investments to adapt the tool to the company’s needs.

As for the potential demand for NLP-based CASE tools, two company profiles have been identified, corresponding to two distinct market niches. The first is composed of companies that have an engineering approach to software development and that indicated – of the two activities linked to requirements analysis – the identification of requirements as the more critical. In this case the tool could be configured as a module to integrate with the CASE tool already used by the company, and would provide support for phases where existing tools are insufficient. In the second market niche, the technologies of natural language are used to facilitate the adoption of a CASE tool and more generally of ‘best practices’ of software development, given that, along with requirements modelling, these companies have also indicated as crucial some activities in which the contribution of software engineering is well developed (testing or software documentation, for example).

We can also make some preliminary observations here regarding the features expected of a tool based on NLP, proceeding from interviews with systems analysts/engineers and project managers in both small- and medium-sized companies. Specifically, they confirm assumptions made regarding potential demand and interest in the following features:

  • The possibility to accelerate the production of analysis models and to rapidly create models to be used in interactions with users and in project groups. The fact that, for example, the class models may contain spurious classes or that some classes may be missing was regarded as less important if the models are produced automatically.

  • The tool was also regarded as useful for the training of analysts, with the presentation of texts and the corresponding models, both for junior analysts and for the retraining of those unfamiliar with the object-oriented approach (the latter problem seems to be more important for small-sized companies).

  • The possibility of integrating the tool with CASE tools for drawing diagrams using the elements singled out by the algorithm and with tools for document management.

Finally, for some questions in the survey (e.g. the use of methodologies and E-R models or the use of support tools in the initial phases of development) the contributions this paper makes to the field go beyond the confines of the market research as described by the title. It confirmed some expectations (the diffusion of the object-oriented approach), which on the surface could appear obvious, yet have not been sufficiently supported by hard data. It also confirmed the presence of significant possibilities for the adoption of instruments and methods of software engineering [35].