1 Introduction to Nursing Research

The purpose of this chapter is to examine how research in the digital age and big data are transforming health-related research and scholarship, suggesting a paradigm shift and a new epistemology.

Health science and the translation of research findings are not new. Florence Nightingale, a social reformer and statistician who laid the foundation for nursing education, conducted early health-related research during the Crimean War by collecting data on the causes of death among soldiers. Nightingale was a data collector and statistician concerned with data visualization, as evidenced by her Rose Diagram, itself a topic of much research. The diagram, originally published in Notes on Matters Affecting the Health, Efficiency, and Hospital Administration of the British Army. Founded Chiefly on the Experience of the Late War. Presented by Request to the Secretary of State for War, graphically presented data indicating that more soldiers died of disease than of battle-related injuries. Nightingale was a pioneer of discovery and of the data-based rationale underpinning the practice of nursing and its health-related implications. Informed by philosophy, she was a systematic thinker who understood the need for systematic data collection (Fig. 10.1). Nightingale designed survey instruments and determined their validity through vetting with experts in the field. She was a statistician who based her findings on mathematics. By using her findings to change practice and policy, Nightingale reformed conditions for the workhouse poor, patient care standards, and the right to a meaningful death (McDonald 2001; Selanders and Crane 2012). Imagine what she might have done in the digital age, with computers, and big data.

Fig. 10.1 Nightingale rose diagram. Source: HistoryofInformation.com (accessed 5 Dec 2015)

Nursing research, as defined by the National Institute of Nursing Research, is knowledge development to “build the scientific foundation for clinical practice; prevent disease and disability; manage and eliminate symptoms caused by illness, and enhance end-of-life and palliative care” (NINR 2015). Grady and Gough (2015) further suggest that nursing science provides a bridge from basic to translational research via an interdisciplinary, or team science, approach to increasing the understanding of prevention and care of individuals and families through personalized approaches across the lifespan.

1.1 Big Data and Nursing

The profession of nursing has been intricately involved with healthcare data since the first nurses’ notes documenting patient care and outcomes. Notes and plans of care are reviewed, shared, and modified to guide future care and improve outcomes. The magnitude and complexity of healthcare data require nontraditional methods of analysis. Interweaving multiple data streams to visualize, analyze, and understand the entirety of human health demands powerful methods of data management, association, and aggregation. The digitization of healthcare data has consequently increased the ability to aggregate and analyze these data, focusing on historical, current, and predictive possibilities for health improvement.

The foundation of nursing research is a hypothesis-driven research question supported by an appropriate theoretical framework. The fourth paradigm, or data science, a phrase originally coined by Jim Gray (Hey et al. 2009), is considered by some to mark the end of theory-based research, creating a reborn empiricism whereby knowledge is derived from sense experience. This view is not without its detractors, and there were big data skeptics before the explosion took place. Dr. Melvin Kranzberg, professor of technology history, commented in his 1986 Presidential Address that “Technology is neither good nor bad; nor is it neutral … technology’s interaction with the social ecology is such that technical developments frequently have environmental, social, and human consequences that go far beyond the immediate purposes of the technical devices and practices themselves” (Kranzberg 1986, p. 545). Kitchin (2014) further suggested that “big data is a representation and a sample shaped by technology and platform used, the data ontology employed and the regulatory environment” (p. 4). Kitchin’s statement reinforces the idea that data cannot explain themselves but require a lens (e.g., a theoretical framework) through which to be interpreted. Raw data carry no interpretation and cannot account for outliers or aberrancies suggesting bias; they provide a bulk of information to which a specific analytic process is applied, but in the end the data must be interpreted. A theoretical framework provides a pathway of understanding that can support the statistical findings of data analysis.

Bell (2009) suggests that “data comes in all scales and sizes … data science consists of three basic activities, capture, curation, and analysis” (p. xiii). He also comments on Jim Gray’s proposal that scientific inquiry is based on four paradigms: experimental, theoretical, computational, and data science (Hey et al. 2009). Table 10.1 integrates Gray’s paradigms with general research paradigms; the fourth paradigm supports the integration of the first three.

Table 10.1 Research paradigms

The explosion of big data (defined by volume, velocity, and variety) in healthcare provides both opportunities and challenges. Healthcare providers, researchers, and academics can visualize individual participant data from multiple sources (hospital, clinic, urgent care and school settings, claims data, research data) and in many forms (laboratory, imaging, provider notes, pharmacy, and demographics). The challenge in aggregating and analyzing these data streams lies in possessing usable, standardized data and analytic tools with the power to aggregate and understand disparate data types. With appropriate data and tools comes the ability to improve healthcare outcomes and develop predictive models for the prevention and management of illness. Other advantages relate to clinical operations, research, public health, evidence-based medicine and care, genomic analytics, device (wearable and static) monitoring, patient awareness, and fraud analysis (Manyika et al. 2011). Platforms have been developed to assist with the analysis of the various data streams, e.g., Hadoop, Cloudera CDH, Hortonworks, Microsoft HDInsight, IBM Big Data Platform, and Pivotal Big Data Suite. These platforms frequently use cloud computing: ubiquitous elastic compute and large data storage engines from the likes of Google, Amazon, and Microsoft. Thus not only is the scientific paradigm changing, but so is the computing paradigm, from local to distributed processing. These changes are accompanied by a new software industry focusing on such areas as data compression, integration, visualization, and provenance.
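To make the shift from local to distributed processing concrete, the following minimal sketch uses PySpark, the Python interface to Apache Spark, an engine commonly run atop Hadoop-style platforms and cloud storage. The bucket paths, column names, and the choice of glucose as the analyte are hypothetical; the sketch simply illustrates unioning two data streams and aggregating them across a cluster.

```python
# A minimal sketch of distributed aggregation across two hypothetical
# data streams. Paths, column names, and the analyte are illustrative.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("lab-aggregation").getOrCreate()

# Each source exports newline-delimited JSON; Spark infers structure on read.
hospital = spark.read.json("s3://example-bucket/hospital/labs/*.json")
clinic = spark.read.json("s3://example-bucket/clinic/labs/*.json")

# Union the streams even if their columns differ, then compute a
# per-patient mean for one analyte; the work is distributed over the cluster.
combined = hospital.unionByName(clinic, allowMissingColumns=True)
summary = (combined
           .filter(F.col("test_name") == "glucose")
           .groupBy("patient_id")
           .agg(F.avg("value").alias("mean_glucose"),
                F.count("*").alias("n_results")))
summary.show()
```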

More specifically, analytical methods other than those driven by a theoretical pathway, which determines in advance what data should be collected and how they should be analyzed, should be considered, allowing the data rather than the theory to provide the pathway. Many methods are becoming available to analyze and visualize big data. One method, the point cloud, uses a set of data points in a three-dimensional coordinate system for data visualization (Brennan et al. 2015). Other methods use various forms of data clustering, such as cluster analysis, which groups a set of objects or data points into similar clusters (Eisen et al. 1998), and progeny clustering, which applies cluster analysis while determining the optimal number of clusters required (Hu et al. 2015). A variety of methods exist to examine and analyze healthcare data, providing rich results for improving patient outcomes, predictive modeling, and publication.
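As a concrete instance of cluster analysis, the sketch below groups synthetic, entirely illustrative patient features with scikit-learn's KMeans. To choose the number of clusters it uses silhouette scoring, a common heuristic standing in for the optimal-cluster-number question that progeny clustering addresses; it is not the Hu et al. (2015) algorithm itself.

```python
# Cluster analysis on synthetic patient features (age, BMI, systolic BP);
# all values are illustrative, not clinical data.
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

rng = np.random.default_rng(42)
patients = np.vstack([
    rng.normal([35, 24, 118], [5, 2, 8], (100, 3)),   # younger, healthier group
    rng.normal([68, 31, 142], [6, 3, 10], (100, 3)),  # older, higher-risk group
])

# Pick the number of clusters k that maximizes the silhouette score.
best_k, best_score = None, -1.0
for k in range(2, 6):
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(patients)
    score = silhouette_score(patients, labels)
    if score > best_score:
        best_k, best_score = k, score

print(f"clusters chosen by silhouette: {best_k} (score = {best_score:.2f})")
```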

Theory-based research uses the method of schema-on-write (Deutsch 2013), which usually yields a clean and consistent dataset, but one that is more limited or narrow. This method, in which data are fitted to a predefined plan, requires less work initially but may also produce a more limited result. Research opportunities today broaden the scope and magnitude of the data by allowing the expansion and use of multiple data types and streams via the method of schema-on-read. Schema-on-read identifies pathways and themes at the end of the process: it allows the researcher to cast a wide data net incorporating many types of structured and unstructured data, and only then applies the theoretical pathway so the analysis can ‘make sense’ of the data. The data are generally not standardized or well organized at first but become more organized as they are used; they are also more flexible and thus more informative (Pasqua 2014). In summary, schema-on-read makes it possible to create a dataset with a multidimensional view, a trait that magnifies its usability and expands the nurse researcher’s ability to address the research question, leading to the development of preventive or treatment interventions.
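The contrast between the two schemas fits in a few lines of Python. In this sketch the table definition, field names, and records are all hypothetical: schema-on-write declares the structure before any data arrive, while schema-on-read keeps heterogeneous raw records and imposes a schema only when a question is asked.

```python
import json
import sqlite3

# Schema-on-write: columns are declared up front, and rows must conform.
db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE vitals (patient_id TEXT, heart_rate INTEGER)")
db.execute("INSERT INTO vitals VALUES ('p1', 72)")

# Schema-on-read: heterogeneous raw records are stored as-is...
raw_records = [
    '{"patient_id": "p1", "heart_rate": 72, "source": "clinic"}',
    '{"patient_id": "p2", "steps": 8450, "device": "wearable"}',
]

# ...and a structure is imposed only at query time, keeping every record
# (even those lacking the queried field) available for other questions.
heart_rates = [r["heart_rate"] for r in map(json.loads, raw_records)
               if "heart_rate" in r]
print(heart_rates)  # [72]
```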

The digital environment and the diversity of data have created the need for interdisciplinary collaboration using scientific inquiry and a team science approach. This approach provides an environment that maximizes self-management of illness, increases and maintains individual independence, and predicts the usefulness of interventions within and beyond professional health environments. Individual empowerment allows people to participate, compare outcomes, and analyze their own data, accomplished with simple smartphone applications and wearable devices. It is the epitome of self-management and participatory research.

1.2 Nursing and Data

The Health Information Technology for Economic and Clinical Health (HITECH) Act, passed in 2009, provides financial support for electronic health records (EHRs) to promote their meaningful use, thereby expanding EHR adoption and the store of healthcare information. Nurse scientists previously gathered data from smaller, more narrowly focused sets, such as individual small research studies and selected data points within the EHR. This approach provided a limited or constricted view of health-related issues, and the unstructured, non-standardized nature of the data made extraction difficult. The narrow focus also inhibited generalization of findings to larger populations, often necessitating additional studies. Moreover, EHR data focused on physician diagnoses and related data, failing to capture the unstructured but more descriptive data, e.g., nursing data (Wang and Krishnan 2014).

The big data tsunami gives nurse scientists access to multiple data streams, expanding insight: EHRs; environmental data supporting an exposome, or human environment, approach; genetic and genomic data allowing individualized treatment; and technology-driven data, such as wearable devices and biosensors, allowing nearly real-time physiologic analysis. Moreover, data sharing provides previously unavailable power to detect differences in health disparities, and data aggregation generates a large participant pool for pragmatic studies of the depth and breadth of those disparities (Fig. 10.2).

Fig. 10.2 Challenges of big data. Adapted from the high-level research design in The Fourth Paradigm (Hey et al. 2009, p. 169)

One example of technology-driven data that impacts elder care and can prevent adult injury is fall-related data retrieved from hospital or home systems (Rantz et al. 2014). These data widen the scope of evidence-based healthcare by providing a multidimensional interpretation. Larger, more inclusive datasets, such as claims data from the Centers for Medicare and Medicaid Services (CMS), add to the ability to create a more complete view of health and healthcare, although such data are constrained by what claims-form information is captured and available for research. Data digitization and open access publication provide a rich environment for all disciplines to access, correlate, analyze, and predict healthcare outcomes, and future data sharing and reuse will likely capture more of the research continuum and process.
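The kind of technology-driven fall data just described can be illustrated with a minimal sketch: threshold-based screening of a wearable accelerometer stream. The 2.5 g threshold, the simulated samples, and the function names are assumptions made for illustration; they do not represent the validated sensor systems of Rantz et al. (2014).

```python
# Flag possible falls when 3-axis acceleration magnitude exceeds a
# threshold. The threshold and stream below are illustrative only.
import math

FALL_THRESHOLD_G = 2.5  # hypothetical impact threshold, in g

def magnitude(sample):
    """Euclidean magnitude of a 3-axis accelerometer sample (x, y, z in g)."""
    return math.sqrt(sum(axis ** 2 for axis in sample))

def detect_falls(stream):
    """Yield indices of samples whose magnitude suggests a possible fall."""
    for i, sample in enumerate(stream):
        if magnitude(sample) > FALL_THRESHOLD_G:
            yield i

# Simulated stream: ordinary movement, then a sharp impact at index 3.
stream = [(0.0, 0.1, 1.0), (0.2, 0.0, 1.1), (0.1, 0.2, 0.9), (2.4, 1.8, 1.5)]
print(list(detect_falls(stream)))  # -> [3]
```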

2 The New World of Data Science

Data science and the resulting uses of data are not new, yet both are growing at warp speed. It was estimated that the digital universe contained 2.7 zettabytes of data in 2012, with 35 zettabytes anticipated by 2020; 90% of current data were collected in the past 2 years, and roughly 5 exabytes are generated daily (Karr 2012). A brief history of data science shows that Tukey (1962), in describing his transformation from an interest in statistics to one in data analysis, initiated the thought that the two were different. It was not until the mid-1990s that a more formalized approach to data science developed, one that began to examine the increasing accumulation of data and data analytics.

The advent of data science can be compared to Fordism. The world changed when Henry Ford discovered new methods of building automobiles: manufacturing processes were modernized, modifying knowledge and altering methods of understanding the world. Fordism changed society and behaviors, impacting everyone (Baca 2004). The big data explosion follows the same trajectory. New methods of capturing, storing, and analyzing data have had, and will continue to have, an impact on society. Today’s data have exploded into multiple streams, structured and unstructured.

Rapid growth of the big data, or data science, ecosystem emphasizes the need for interdisciplinary approaches to interpreting and understanding health and healthcare data. The data explosion creates an environment of varied data types that requires an equally broad range of scientists to interpret the data accurately: just as data are generated from a plurality of sources, so must the team assigned to their analysis draw on a plurality of disciplines. The explosion gives nurse scientists, along with other disciplines, the ability to work within highly diverse teams to provide deep knowledge integration and comprehensive analysis of the data. This expansive inclusion of expertise extends to employing the skills of citizen scientists.

The data explosion also provides a new world of data alchemy, allowing the transformation, creation, and combination of data types to benefit healthcare outcomes through accurate decision-making and predictability. Data standards are becoming more prevalent; one key example is the NIH Big Data to Knowledge (BD2K) program’s recent establishment of a Standards Coordinating Center (SCC). Another example of standards work is Westra and Delaney’s success in having the Nursing Management Minimum Data Set (NMMDS) incorporated into Logical Observation Identifiers Names and Codes (LOINC), a universal system for tests, observations, and measurements (Westra et al. 2008). Computers are becoming ever more powerful, and the ubiquitous nature of data and data-driven algorithms allows deeper and more complex analyses, resulting in greater accuracy in patient care-related decision making (Provost and Fawcett 2013). Making use of existing and future data necessitates the training of data scientists, the need for whom is growing, with an anticipated shortage of between 140,000 and 150,000 people (Violino 2014). The combination of these elements (standards, data, analytics, and a trained workforce) increases the accuracy and predictive use of data, with numerous opportunities for scholarship, including publication and analytic developments.

3 The Impact of Data Proliferation on Scholarship

Scholarship (FreeDictionary: academic achievement; erudition; Oxford Dictionary: learning and academic study or achievement) was once solely paper journal publication (p-journal) and an academic requirement for tenure. The era of big data and data science enlarges the realm of scholarship by adding a variety of publication and dissemination forms, such as electronic journal publication (e-journal), web-based formal or informal documentation, reference datasets, and analytics in the form of software and database resources. Digitization of online information and data provides fertile ground for new scholarship. The technological provision of shared data, cloud computing, and dissemination of publications places scholarship in the fast lane for nursing and other disciplines. The information superhighway is clearly the next-generation infrastructure for scholarship, even as the academic establishment’s adoption lags behind the pace of change. That gap, and the migration of skilled data scientists from academic research to the private sector, are causes for concern.

“Scholarship represents invaluable intellectual capital, but the value of that capital lies in its effective dissemination to present and future audiences” (AAU, ARL, CNI 2009).

Much of academic scholarship is based on Boyer’s model, which holds that original scholarship centers on discovery, integration, application, and teaching (Boyer 1990). The American Association of Colleges of Nursing (AACN) adopted Boyer’s model, defining scholarship in nursing as: “… those activities that systematically advance the teaching, research, and practice of nursing through rigorous inquiry that (1) is significant to the profession, (2) is creative, (3) can be documented, (4) can be replicated or elaborated, and (5) can be peer-reviewed through various methods” (AACN 2006). A hallmark of scholarship dissemination continues to be peer-reviewed publication in refereed journals. The Association of American Universities (AAU), the Association of Research Libraries (ARL), and the Coalition for Networked Information (CNI) published a report emphasizing the need to disseminate scholarship. Big data now raises the question of whether the scholarship model requires updating to be more inclusive of the cultural, social, and philosophical sea change that information technology has introduced (Boyd and Crawford 2012).

Today’s digital environment and the need to disseminate scholarly work suggest expanding the definition to allow for other methods of scholarship. Borgman and colleagues, in their 1996 report to the National Science Foundation (NSF), developed the information life cycle model to describe the activities involved in creating, searching, and using information (Borgman et al. 1996). The outer ring denotes the life cycle stages (active, semi-active, inactive); the cycle itself has three major phases (creation, searching, and utilization), of which creation is the most active, all set within a social context. The model also includes six types of information uses or processes, which further situate scholarship utilization (Fig. 10.3).

Fig. 10.3 Information life cycle. Source: Borgman et al. (1996) report to NSF. Note: The outer ring indicates the life cycle stages (active, semi-active, and inactive) for a given type of information artifact (such as business records, artworks, documents, or scientific data). The stages are superimposed on six types of information uses or processes (shaded circle). The cycle has three major phases: information creation, searching, and utilization. The alignment of the cycle stages with the steps of information handling and process phases may vary according to the particular social or institutional context

Incorporating the information life cycle model into the AACN scholarship definition adds the need for dissemination and for inclusion of sources beyond the normal process of p-journal publication. It would highlight that publication is a multidimensional continuum meeting three main criteria. First, the work must be publicly available through sources such as subscriptions, abstracts, and databases or datasets, allowing awareness of and access to the work. Second, the work should be trustworthy, generally established through peer review and identified institutional affiliation. Third, the work must be disseminated and made accessible so that others can see it (Kling 2004).

The digital enterprise is no longer relegated to p-journals but has grown to include e-journals, datasets, repositories (created to collect, annotate, curate, and store data), and publicly shared scholarly presentations. Citations of datasets now receive credibility and validity through this new type of scholarship, in part through the work of the National Institutes of Health Big Data to Knowledge (BD2K) initiative. The ability to access data, analyze them through cloud computing or novel big data software, and disseminate results through multiple channels gives nursing and all disciplines a more rapid means to publicize and legitimize their scholarly work.

4 Initiatives Supporting Data Science and Research

Many government agencies have initiated work designed to build processes within the digital ecosystem to assist teams focusing on data science. These initiatives support faculty and student training, collaboration with centers of excellence, development of software designed to facilitate the analysis of large datasets, and sharing of data and information through a cloud-based ecosystem that maximizes the use of existing multidimensional data to better understand and predict patient outcomes. Further, multiple agencies have open funding sources consistent with nursing science. Examples follow.

4.1 National Institutes of Health

The National Institutes of Health (NIH) spearheaded the big data program with the creation of the Big Data to Knowledge (BD2K) initiative in 2012, when an advisory committee convened by Dr. Francis Collins, the NIH Director, investigated the depth and breadth of big data’s potential. Dr. Collins and key members of his leadership team reviewed the committee’s findings and committed to appointing a ‘data czar’ to facilitate data science across the 27 Institutes and Centers that comprise NIH. The BD2K team, led by Dr. Phil Bourne (co-author of this work), created a data science ecosystem incorporating (1) training for all levels of data scientists, (2) centers that work independently and in concert with all BD2K centers to build a knowledge base, (3) a software development team focused on creating and subsequently maintaining new methods for big data analysis, and (4) a data indexing team focused on creating methods for indexing and referencing datasets (https://datascience.nih.gov). Taken together, the intent is to make data FAIR: Findable, Accessible (i.e., usable), Interoperable, and Reusable.

4.1.1 Training

Training focused on establishing an effective and diverse biomedical data science workforce using multiple methods across educational and career levels, from students through established scientists, and spanned the continuum from those who see biomedical data science as their primary occupation to those who see it as a supplement to their skill set. Development and funding of a training coordination center that ensures all NIH training materials are discoverable is paramount.

4.1.2 Centers

Centers included the establishment of 11 Centers of Excellence for Big Data Computing and two centers that are collaborative projects with the NIH Common Fund LINCS program (the LINCS-BD2K Perturbation Data Coordination and Integration Center and the Broad Institute LINCS Center for Transcriptomics). The centers are located throughout the United States and provide training to advance big data science in the context of biomedical research across a variety of domains and data types.

4.1.3 Software

The software focus included targeted Software Development awards funding tools and methods that address data management, transformation, and analysis challenges in areas of high need for the biomedical research community.

4.1.4 Commons

Commons addressed the development of a scalable, cost-effective electronic infrastructure that simplifies locating, accessing, and sharing digital research objects such as data, software, metadata, and workflows, in accordance with the FAIR principles (https://www.force11.org/group/fairgroup/fairprinciples).

4.1.5 Data Index

Data Index is a data discovery index (DDI) prototype (https://biocaddie.org/) that indexes data stored elsewhere. The DDI will play an increasingly important role in promoting data integration through the adoption of content standards and alignment to common data elements and high-level schemas.
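Conceptually, a discovery index holds metadata and pointers rather than the data themselves. The following minimal sketch captures that idea; the dataset identifiers, keywords, and repository URLs are hypothetical, and the actual DDI prototype is of course far richer.

```python
# A toy data discovery index: only metadata and locations are stored here;
# the datasets themselves remain in external repositories.
from collections import defaultdict

index = defaultdict(set)  # keyword -> set of dataset ids
catalog = {}              # dataset id -> external location

def register(dataset_id, url, keywords):
    """Record a dataset's metadata; the data itself stays remote."""
    catalog[dataset_id] = url
    for kw in keywords:
        index[kw.lower()].add(dataset_id)

def discover(keyword):
    """Return (dataset id, location) pairs matching a keyword."""
    return [(d, catalog[d]) for d in sorted(index.get(keyword.lower(), ()))]

register("ds-001", "https://repo.example.org/ds-001",
         ["falls", "accelerometer", "older adults"])
register("ds-002", "https://repo.example.org/ds-002",
         ["falls", "claims", "CMS"])
print(discover("falls"))  # both datasets, with their external locations
```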

4.2 National Science Foundation

The National Science Foundation (NSF) is a United States government agency supporting research and education in non-medical fields of science and engineering. NSF’s mission is “to promote the progress of science; to advance the national health, prosperity, and welfare; to secure the national defense…” (http://www.nsf.gov accessed 12/23/2015). The annual NSF budget of $7.3 billion (FY 2015) is the funding source for approximately 24% of all federally supported basic research conducted by U.S. colleges and universities.

NSF created the Directorate for Computer & Information Science & Engineering (CISE) with four data science-related goals:

  • Uphold the U.S. position of world leadership in computing, communications, and information science and engineering;

  • Promote understanding of advanced computing, communications, and information systems;

  • Support and provide advanced cyberinfrastructure for the acceleration of discovery and innovation across all disciplines; and

  • Contribute to universal, transparent and affordable participation in an information-based society.

CISE consists of four divisions responsible for managing research and education, each organized into smaller programs. These four divisions (the Division of Advanced Cyberinfrastructure; the Division of Computing & Communication Foundations; the Division of Computer and Network Systems; and the Division of Information and Intelligent Systems) incorporate program directors who act as points of contact for sub-disciplines and work across divisions and directorates. NSF CISE provides funding in the areas of research infrastructure, advancing women in academic science and engineering, cybersecurity, big data hub-and-spoke designs to advance big data applications, and computational and data science solicitations to enable science and engineering (http://www.nsf.gov/cise/about.jsp accessed 12/23/2015).

4.3 U.S. Department of Energy

The U.S. Department of Energy’s (DOE) mission is to “ensure America’s security and prosperity by addressing its energy, environmental and nuclear challenges through transformative science and technology solutions” (http://energy.gov/mission accessed 12/23/2015). A prime focus of the DOE is understanding open energy data through access to and use of solar technologies. The DOE collaborated with its National Laboratories to harness data and analyze new information from large data sets (Pacific Northwest National Laboratory), train researchers to think about big data, and focus on issues of health-related data (Oak Ridge National Laboratory). As an example, the Oak Ridge National Laboratory’s Health Data Sciences Institute, in concert with the National Library of Medicine (NLM), developed a new and more rapid process to accelerate medical research and discovery: Oak Ridge Graph Analytics for Medical Innovation (ORiGAMI), an advanced tool for literature-based discovery.

4.4 U.S. Department of Defense

The U.S. Department of Defense (DOD) focuses on cyberspace to enable military, intelligence, and business operations as well as personnel management and movement. The DOD focuses on protection from cyber vulnerabilities that could undermine U.S. governmental security. The four DOD foci are (1) resilient cyber defense, (2) transformation of cyber defense operations, (3) enhanced cyber situational awareness, and (4) survivability against sophisticated cyber-attacks (http://www.defense.gov/ accessed 12/23/2015).

The Defense Advanced Research Projects Agency (DARPA) is an agency within the DOD that deals with military technologies. DARPA focuses on a ‘new war’ it calls network war, centered on cyber security. Understanding that nearly everything contains a computer, including phones, televisions, watches, and military weapons systems, DARPA is using a net-centric data strategy to develop mechanisms to thwart potential or actual cyber-attacks.

5 Summary

The big data tsunami has created fertile ground for the conduct of research and related scholarship. It has opened doors to healthcare research previously unimagined, providing large datasets containing massive amounts of information with the potential to increase knowledge and support a proactive approach to healthcare. It has also created a firestorm of change in how access to data is accomplished. Big data is the automation of research. It is also an epistemological change that questions certain ethical mores; for example, just because we can access the data does not mean we should. Fordism changed the manufacturing world with a profound societal impact; big data is the new Fordism, impacting society.

The societal and ethical impact of big data requires the attention of all disciplines. That impact, while requiring a priori decisions, will provide an unprecedented opportunity to influence healthcare and add to the global knowledge base and scholarly work. Big data has opened a vast expanse of opportunity to explore what is, hypothesize what could be, and provide methods to change practice, research, and scholarship.

Big data is central to all areas of nursing research. From an NIH perspective, the areas with the most prominent interface with other major healthcare initiatives include the Precision Medicine Initiative, seeking to further personalize a patient’s health profile; the U.S. Cancer Moonshot, which has at its core greater sharing and standardization of data supporting cancer research; and the Environmental influences on Child Health Outcomes (ECHO) program. We are truly entering the era of data-driven healthcare discovery and intervention.