7.1 Introduction

Increasing volumes of data in a wide variety of disciplines are forcing academia, industry, and government to find effective ways to analyze such data. To assist with this, institutions are seeking to provide collaborative research environments in order to increase critical mass and be more competitive for research funding and graduate students. Supercomputers have been used for a number of years to process data from the physical sciences relatively successfully, but the data science revolution caused by the rapid increase in data from a variety of sources could not be satisfactorily addressed solely within the existing disciplines and structures. Therefore, it has been necessary to initiate a transformation of both the processes of knowledge discovery and the institutional environments in which this discovery takes place.

7.2 Sources of Data

The Moore-Sloan Data Science Environments, comprising New York University, the University of California at Berkeley, and the University of Washington (2018), outlined the key elements of a data science environment and included a summary of current data sources as follows:

Consider the scale and complexity of data sources coming online: simulations of scale and resolution unimaginable only a few years ago (e.g., global climate models, universe-scale n-body simulations), networks of tiny but powerful sensors (e.g., on the seafloor; in the forest canopy; in living organisms; in buildings, roads and bridges), high-bandwidth remote-sensing platforms (e.g., satellites like Terra and Aqua with Moderate Resolution Imaging Spectroradiometers, telescopes used for survey astronomy projects like the Sloan Digital Sky Survey and the Large Synoptic Survey Telescope), high-throughput laboratory instruments (e.g., gene sequencers, micro- and macro-scopic imaging equipment, flow cytometers, mass spectrometers), city-wide urban sensing platforms (e.g., connected vehicles, environmental sensors, ubiquitous cameras), repositories of open government data driven by a new culture of transparency, and social science data created in digital form (e.g., global economic indicators; social network data; consumer activities, including purchasing, mobile phone usage, and internet clickstreams). These advances share a common trait: they produce data with relentlessly increasing volume, velocity, and variety – data that must be captured, transported, stored, organized, curated, accessed, mined, visualized, and interpreted. Data-intensive discovery, or data science, is a cornerstone of 21st-century discovery [1].

7.3 Rationale for Data Science Institutes and Centers

The drivers for the creation of centers to provide a focus for these initiatives include many of the following:

  • Alignment with national and international priorities.

  • Provision of a top-slice of the institution’s budget to support the center.

  • Ability to attract significant grant support from funding agencies with priorities in the area of big data.

  • Ability to attract advanced level skills and expertise.

  • Infrastructure to provide a secure environment to deal with large datasets.

  • Opportunities to support research collaboration on big data across an institution.

  • Provide research expertise in machine learning, data mining, data visualization, data management, and statistics.

  • Provision of interdisciplinary expertise to address a wide variety of application areas.

  • Provide an appropriate environment to support major national initiatives.

  • Gain local and national industry support.

  • Gain civic interest and support.

  • Opportunities for graduate students to work on a variety of research problems.

  • Attractive environment to deliver Masters courses in Data Analytics.

Many institutions with significant investments in research and development have already set up such centers and others have plans to do so. Because the senior management of the university has decided such an institute/center is a high priority for the institution, it often provides funding for the setting up and operation of the facility. It is therefore not a direct charge on existing faculty budgets. Many such centers do not take faculty from existing faculty structures but are an additional resource center which is more centralized within the institution and where faculty and researchers may come together to access and share computational resources, expertise, and ideas.

7.4 Objectives and Functions of Data Science Institutes and Centers

Tables 7.1, 7.2, 7.3, and 7.4 show a sample of institutes and centers in a variety of institutions and countries. These are not listed in any order of priority or importance. The tables are only indicative and are not intended to be representative of particular countries, nor are all countries included.

Table 7.1 A sample of Institutes and Centers in Data Science in the USA
Table 7.2 A sample of Institutes and Centers in Data Science in Canada
Table 7.3 A sample of Institutes and Centers in Data Science in the UK
Table 7.4 A sample of Institutes and Centers in Data Science in France, Switzerland, Singapore, and Australia

A common driver in all the above organizations is the continually increasing volume of data to be processed in many disciplines.

There is a high degree of commonality in the objectives and functions of the data institutes and centers as illustrated in Tables 7.1, 7.2, 7.3, and 7.4. These include the following:

  • Advance R&D in data analytics in all disciplines.

  • Enable all fields, professions, and sectors to develop through the application of data science.

  • Connect government, industry, and academia.

  • Accelerate research, innovation, and training in data-intensive science.

  • Support interdisciplinary research.

  • Address current problems in society and the environment.

  • Develop new mathematical and statistical theory, and quantitative and computational methods.

  • Support graduate studies in data science (e.g., Masters degrees in Data Analytics and research).

  • Support postdoctoral researchers.

7.5 Graduate Education and Research

Most of the institutes and centers in the tables also provide a focus for graduate students to study for a Masters degree in Data Analytics and/or to do graduate research for a Ph.D. Many have access to faculty expertise for research and teaching by affiliated appointments with their home department/faculty, or by other linking arrangements such as becoming an Academic Fellow in the institute [2]. Many employ data scientists at Ph.D. level for the research programs within the institute, although attention has to be paid to the issue of the status of such staff within an academic institution to ensure that their career paths and prospects are supported.

7.6 National Centers in Data Science

Many countries have identified big data as an area requiring significant development and support. Their governments established working groups to assess current trends in data applications and the needs of organizations and businesses, in order to determine what resources and skills are needed to meet current and future requirements. Some countries have established National Centers to lead and coordinate R&D in the field (for example, [3,4,5]).

7.6.1 USA

In the USA, the National Consortium for Data Science (NCDS) was established in 2013 as a collaboration of leaders in academia, industry, and government to address the data challenges and opportunities of the twenty-first century. The NCDS helps members take advantage of data in ways that result in new jobs and transformative discoveries. Its objective is to enable research, innovation, and economic development, and to build a data science community for the economy of the future [6].

The National Science Foundation (NSF) made a number of strategic recommendations with regard to realizing the potential of data science [7]. These included the following:

  • Create Data Science Research Centers.

  • Invest in research into Data Science Infrastructure that furthers effective data sharing, data use, and life cycle management.

  • Support research into effective reproducibility.

  • Fund research into models that underlie evidence-based data policy and decision-making.

  • Expand funding into deep learning, smart environments, and other artificial intelligence-empowered areas and their use in data-driven applications.

It also made a number of recommendations with regard to education and training, creating collections of datasets, addressing aspects of the Internet of Things (IoT) in areas such as security and data privacy, and addressing architectural issues to support emerging data-intensive tasks.

In February 2018, the NSF announced $30 million in funding through its Critical Techniques, Technologies and Methodologies for Advancing Foundations and Applications of Big Data Sciences and Engineering (BIGDATA) program. The grants were linked with support from Amazon Web Services (AWS), Google Cloud Platform (GCP), and Microsoft Azure, each of which committed up to $3 million in cloud resources for relevant BIGDATA projects over a 3-year period. A key objective of this collaboration is to encourage research projects to focus on large-scale experimentation and scalability studies [8].

Data.gov [9] is a U.S. government website launched in late May 2009 by the then Federal Chief Information Officer (CIO) of the United States, Vivek Kundra. Data.gov aims to improve public access to high-value, machine-readable datasets generated by the Executive Branch of the Federal Government. The site is a repository for federal, state, local, and tribal government information made available to the public.

7.6.2 UK

In the UK, the Alan Turing Institute was established in 2015 to be the national institute for data science and artificial intelligence. It was created by five founding universities—Cambridge, Edinburgh, Oxford, UCL, and Warwick—and the UK Engineering and Physical Sciences Research Council. Eight further universities—Leeds, Manchester, Newcastle, Queen Mary University of London, Birmingham, Exeter, Bristol, and Southampton—joined the institute in 2018. This linking into a UK Research Council and support by leading universities in the UK effectively established it as an important national initiative and resource.

In the UK, the Diamond Report in 2015 [10] made a number of recommendations for improved education in skills with regard to data analytics, and also proposed the following:

Many breakthroughs in the development of analytical methods and tools have happened at the intersection between different disciplines. An implication is that we need to support interdisciplinary, innovative research projects involving advanced data analytics, statistics and quantitative skills, and that calls for cross-research council collaboration and funding. Our recommendation is for a top slice of the RCUK budget to establish a strategic fund through which interdisciplinary research is funded. RCUK could itself take a strategic and convening role in this space

and

There are currently many agencies in the UK exercising leadership to address the skills shortages arising in industry from the data revolution. However, no single body has all the answers to what are system-wide challenges. Collaboration is needed to address the national challenges identified in our research. We call on relevant stakeholders, including the Tech Partnership, the Royal Statistical Society, the UK Commission for Employment and Skills, the E-Infrastructure Leadership Council, the Digital Economy Council, techUK, the ODI, HEFCE and the research and sector skills councils to set up a cross-cutting taskforce around data analytics to identify good practices for education and skills provision and spur collaboration across industry.

and

…grassroots activities could be usefully complemented by a higher visibility network following the example of an organisation like the US National Consortium for Data Science, or Scotland’s Data Lab, a £11.3 million initiative supported by the Scottish Funding Council, Highlands and Islands Enterprise and Scottish Enterprise, which ‘enables new collaborations between industry, public sector and universities driven by common interests in the exploitation of data science, provides resources and funding to kick start projects, deliver skills and training, and helps to develop the local ecosystem by building a cohesive data science community.’ We believe that a data science network along these lines should be developed with involvement from existing communities of analytical practice, the Alan Turing Institute, the Data Lab, the Catapults, and major Data Science institutes at universities like Imperial College, UCL, Manchester and Warwick.

A report from Nesta in 2015 [11] also proposed the following:

The ‘big data explosion’ requires new analytics skills to transform big datasets into good decisions and innovative products.

Key findings

  • There isn’t a one-size fit all to creating value from data. Our research reveals three types of ‘Data active’ businesses: Datavores who base their decisions on data and analysis, Data Builders working with big datasets, and Data Mixers who combine data from different sources. We also find 30% of ‘Dataphobe’ businesses who seem to have given the data revolution a pass.

  • Data-active companies (especially Datavores and Data Builders) perform better than the Dataphobes. Our econometric analysis reveals that they are over 10% more productive than the Dataphobes after controlling for other factors.

  • Data-active companies are recruiting more analysts, and combining more disciplines to build a data science capability. This isn’t proving easy: For example, two thirds of Datavores struggled to fill at least one vacancy. 80% of them identified problems in at least one skills area. Data-active companies are particularly concerned about the lack of domain knowledge in analysts, the lack of people with the right mix of skills and the lack of experienced analysts.

  • Technology is changing fast in the data space, and employers are keeping the skills of their data analysts fresh through a variety of approaches. 80% do internal training. Significant proportions (between a third and two thirds) are using innovative training methods like data competitions, online courses and meetups. Only a fifth use universities to train their staff.

In addition, the Ada Lovelace Institute was set up in the UK in 2018 to coordinate R&D in Artificial Intelligence (AI) and ensure that AI is able to work for people and society [12].

7.7 Opportunities for Data Science Research

One of the principal opportunities for a Data Science Institute is to host major research initiatives. An example is the Leeds Institute for Data Analytics, which hosts the following:

  • MRC Medical Bioinformatics Centre (£7 million funding) [13] and

  • ESRC Consumer Data Research Centre (£11 million funding) [14].

The MRC Medical Bioinformatics Centre aims to create and sustain the infrastructure, facilities, understanding, and culture changes required to enable groundbreaking and productive bioinformatics research at the interface between the clinic, health records, and high volume molecular and phenotypic datasets. It also collaborates with a wide range of business organizations and healthcare providers. In addition, it has been able to leverage an additional £14 million of funding from other sources.

The Consumer Data Research Centre creates, supplies, and maintains data for a wide range of users [15]. It works with private and public data suppliers to ensure efficient, effective, and safe use of data in the social sciences. It is led by the University of Leeds and University College London, with partners at the Universities of Liverpool and Oxford. One of its functions is to provide a national service using the data that can be accessed via the data store, such as point of sale receipts, travel records, and market research data.

Being able to host national data science research initiatives has the following advantages for a Data Science Institute:

  • Provides infrastructure for data storage, access, and analysis.

  • Provides a center and focus for data science expertise.

  • Opportunities for interdisciplinary collaborations.

  • Opportunities for cross-fertilization of ideas across disciplines.

  • Able to leverage further research grant funding in associated disciplines.

  • Opportunities to attract external sponsorship, endowments, and donations.

  • Attract interest within the institution in data science.

  • Attracts data scientists and graduate students from outside the institution.

  • Attract national and international interest in the research.

  • Seminar programs with national and international researchers.

A number of these advantages also apply to a Data Science Institute that does not have an associated national facility under its aegis, but they may not be evident to the same extent. In some cases, the setting up of a Data Science Institute has attracted a national facility; in other cases, the institute has focused initially on a pre-existing national facility within the institution, and then broadened the infrastructure to support other disciplines.

7.8 Challenges in Data Science Research

Data science research faces a number of challenges. These include the following:

  • Discipline-based faculties and budgets within academic institutions do not generally support wider ways of working, because they are based on historic structures which can be difficult to change.

  • Interdisciplinary collaborations can be difficult to initiate and operate.

  • Grant awarding mechanisms can be biased toward single disciplines because of the way the funding agencies are compartmentalized along disciplinary lines.

  • Promotion criteria for faculty can be difficult to formulate and agree on (particularly for tenure).

  • Career advancement within academia can be difficult in interdisciplinary areas.

Harvey [16] lists the current big data challenges as follows:

  • Dealing with data growth.

  • Generating insights in a timely manner.

  • Recruiting and retaining big data talent.

  • Integrating disparate data sources.

  • Validating data.

  • Securing big data.

  • Organizational resistance.

Because of the complexity of the analysis of large amounts of data, questions can arise with regard to the verifiability of the results that are obtained. To provide confidence in published results, publications are increasingly required to specify clearly how the results were obtained and to give access to the data so that the results can be verified. This can be facilitated by the use of open-source tools and by supporting reuse and open science [17].
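As an illustration of what specifying how results were obtained can mean in practice, the following minimal Python sketch bundles a result with a checksum of the input data and the parameters used, so that a third party holding the same data can re-run the analysis and compare. It is a sketch under stated assumptions, not a method taken from the cited sources: the function names, the record layout, the sample data, and the trimmed-mean analysis are all hypothetical and chosen only for illustration.

```python
import hashlib
import json


def sha256_of_bytes(data: bytes) -> str:
    """Return the hex SHA-256 digest of the raw input data."""
    return hashlib.sha256(data).hexdigest()


def analyze(values, trim):
    """Hypothetical analysis: mean after dropping `trim` values from each end."""
    ordered = sorted(values)
    kept = ordered[trim:len(ordered) - trim] if trim else ordered
    return sum(kept) / len(kept)


def build_record(raw: bytes, params: dict) -> dict:
    """Run the analysis and bundle the result with what is needed to redo it."""
    values = [float(x) for x in raw.decode("utf-8").split(",")]
    return {
        "data_sha256": sha256_of_bytes(raw),  # identifies the exact input data
        "params": params,                     # settings used for this run
        "result": analyze(values, **params),
    }


def verify(raw: bytes, record: dict) -> bool:
    """Re-run the analysis on `raw` and check that it reproduces the record."""
    if sha256_of_bytes(raw) != record["data_sha256"]:
        return False  # the data differ from those used originally
    redone = build_record(raw, record["params"])
    return redone["result"] == record["result"]


# Illustrative data only; a real study would read a published dataset.
raw_data = b"4,8,15,16,23,42"
record = build_record(raw_data, {"trim": 1})
print(json.dumps(record, indent=2, sort_keys=True))
print(verify(raw_data, record))            # same data and parameters
print(verify(b"4,8,15,16,23,41", record))  # altered data fail the checksum
```

Publishing such a record alongside the data and the open-source analysis code gives a reader enough to repeat the computation; a real project would typically also record software versions and the execution environment.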

The Moore-Sloan Data Science Environments: New York University, UC Berkeley, and the University of Washington (2018) [1] detail the aspects of institutional change that are required in order to implement data science successfully. These include the following factors:

  • Career paths and alternative metrics.

  • Education and training.

  • Software tools, environments, and support.

  • Reproducibility and open science.

  • Working spaces and culture.

  • Ethnography and evaluation.

Earnshaw [18] details the issues associated with interdisciplinary research and development. The following aspects were considered to be potential advantages:

  • Further the growth at the boundaries of existing disciplines.

  • Potential for knowledge transfer to industry and society.

  • Freedom and opportunity in new areas of research and development.

  • Publication of significant results.

  • Lines of reporting can be more flexible.

The following were considered to be areas where significant challenges could arise:

  • Lack of high-ranking interdisciplinary journals and conferences.

  • Lack of general respect in the peer community for interdisciplinary research publications—they can be perceived to be “less pure” than those in a single discipline.

  • Blue sky research is still perceived to carry more kudos than more applied work.

  • Applications to grant awarding bodies and agencies—where reviewers from the different disciplines may have different views.

  • Reconciling different cultures and working practices in different disciplines.

  • Relationship to senior faculty and the university.

  • Tenure committee considerations, where criteria are well established for single disciplines but are often not so clear for interdisciplinary research. This can discourage junior faculty from working in this area as it can be perceived as high risk from a career point of view.

  • Employment of faculty on part-time or zero-hours contracts.

As interdisciplinary research gains traction, momentum, and international acceptance through initiatives such as data science, it is possible that some of the challenges listed above may be addressed more effectively than has been the case to date.

7.9 Conclusions

The growth in the significance and extent of data from many disciplines has been outlined. This presents a major challenge, as such data cannot be analyzed effectively using traditional data processing methods. New approaches are needed, and the opportunities and advantages of data science and Data Science Institutes have been reviewed. Such solutions often require interdisciplinary methods, and therefore effective collaborations are necessary and important in these new environments. These collaborations represent a significant shift in the way traditional academic research has been performed within disciplines, and therefore major efforts are needed to address the associated challenges. However, the rewards are great, as the majority of future research and development is likely to be in the context of large datasets which traverse disciplines in the way they arise, and in the methods used to extract verifiable information and knowledge from them.

Further Reading