Introduction

Large-scale data processing has been happening throughout evolution. Target recognition in large-format persistent surveillance systems is easily accomplished when the human eye tracks an object in a distant landscape. The flood of visual data entering the eye's 130 million neurons [1] is quickly pushed up the Data, Information, Knowledge, Wisdom (DIKW) pyramid to information using past experience with that object. All of this is performed in real time by a computing system (the brain) that uses a fraction of the energy required by today's modern data farms.

Historically, data collected from sensors or other sources had to be reduced to information early in the pipeline via feature generation. For example, intelligence analysts would be trained to spot a military tank by attending to telltale details such as a long gun barrel. They would then look for this feature in a photograph to quickly identify a target tank.

The recent explosion in Big Data technology uses advanced data analytic tools to skip the information level of the DIKW pyramid altogether. A 2008 essay argues that Big Data analytics make the scientific method obsolete [2]. These tools, along with significant computational resources, can sort and analyze vast quantities of raw data directly and identify trends that enable predictive analysis. An additional advantage of this method is that the original data can be left intact, allowing future study if needed. These benefits have encouraged the collection of data from all facets of day-to-day life.

In the following sections, we discuss data analytic tools and resources, followed by applications of Big Data analytics. We then highlight several current and future challenges in this area and finally suggest emerging technology enablers.

Big Data Technologies and Tools

Big Data is an often over-generalized term for the processing of data sets so large that it is time-prohibitive to transfer or copy them. Consequently, new technologies were developed to process data in place rather than transferring it to the customer or researcher location. The 2000 U.S. census data set is 200 GB in size, the Human Genome data set is 210 GB, and the Common Crawl Corpus data set of web crawl information is an incredible 541 TB. All are freely available through Amazon Web Services (AWS). These and other data sets are stored and made available via data analytic services such as AWS, Microsoft Azure, and IBM Big Data services. These companies also offer cloud storage and computing services to business users who want to store and process their own collected data, such as data gathered from websites.

Many companies, such as Facebook, Google, and Twitter, also house large data sets for their own commercial benefit. Storing vast quantities of data and supplying it to users requires massive facilities. Facebook's two Prineville, Oregon data centers are each 330,000 square feet in size and in 2012 consumed 153 GWh of energy [3]. The facilities have grown since then.

With such large data sets, the emphasis is on reducing the data to just what is needed to gain meaningful insights. Various tools have been developed that use batch processing, in which common instructions are executed across numerous computing cores on different pieces of data in parallel. The MapReduce function in Hadoop, for instance, can find the largest value in a large data set by splitting the data into smaller parts, searching for the largest value in each part in parallel, and then iteratively performing smaller searches on the previous round's winners [4]. Once basic functions such as sort and search can be performed at large scale, complex data mining tools such as neural networks can be employed to develop predictive models. Leo Breiman observed in [5] that there are two cultures in statistics: one assumes the data are generated from a stochastic model, while the other allows the data to determine the statistical model. Big Data processing methods follow the latter culture. The number of applications benefiting from such data analysis is growing quickly. Demand for data scientists has risen sharply in recent years, and many colleges and universities have added degrees in this area [6]. In the following sections, we describe some of the areas in which data analytics has had great success.
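
To make the map/reduce pattern concrete, the sketch below finds a global maximum by computing per-partition maxima in parallel and then reducing the winners, as described above. It is a minimal Python illustration using the standard multiprocessing module, not Hadoop's actual Java API:

```python
from multiprocessing import Pool

def local_max(chunk):
    """Map step: find the largest value within one partition."""
    return max(chunk)

def distributed_max(data, n_parts=4):
    """Split the data, find each partition's maximum in parallel,
    then reduce the per-partition winners to one global maximum."""
    size = max(1, len(data) // n_parts)
    chunks = [data[i:i + size] for i in range(0, len(data), size)]
    with Pool(n_parts) as pool:
        winners = pool.map(local_max, chunks)   # parallel map
    return max(winners)                         # final reduce

if __name__ == "__main__":
    import random
    values = [random.randint(0, 10**9) for _ in range(1_000_000)]
    print(distributed_max(values))
```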

Medical Data Analysis

The field of statistics has always found application in medicine. Causation and correlation research linking drugs and therapies to diseases is what doctors rely upon to make diagnoses for their patients. Pharmaceutical companies also rely on statistics to determine the real effect of their drugs in combating disease. In the past, the number of trials was limited by location and by the amount of data that could be processed and stored. With Big Data technology, however, many more variables about each patient can be stored and processed. In addition, health data can now be retained from different patients over long periods, and this historical data provides a larger sample set from which to draw insight and meaningful diagnoses. For instance, in [7] an attending physician was able to create a database for childhood lupus by making boxes of medical records electronically searchable. With this tool, she correctly administered an anti-coagulant treatment based on the recorded medical histories of past patients. The digitization of medical records mandated by the Affordable Care Act will increase the amount of information available for medical research. Although there are many justified privacy concerns about pooled medical data, the opportunities for medical progress cannot be denied.

Another field of medical research benefiting from data analytics is genetic research. Human DNA consists of 3 billion base pairs. While the Human Genome Project took an army of computers, $3 billion in funding, and 13 years to complete, current projects funded by the US National Human Genome Research Institute (NHGRI) are close to achieving the same feat for $1000 per person in far less time [8]. Genetic research holds the promise of designer drugs tailored to the specific person rather than the average person, as is currently done. New research advances in cancer treatment, Alzheimer's disease [9], and other conditions also depend on the analytical capabilities of Big Data technologies. With companies such as Human Longevity promising to sequence as many as 40,000 genomes per year [10], the volume of genomic data will grow rapidly. Big Data analytics will serve to identify the effects of the various genes on the human body. Several research efforts have already started in this field and offer promising roads toward future development [10, 11].
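
A back-of-envelope calculation suggests the scale involved. The 2 bits per base pair below is the theoretical minimum encoding, used here purely for illustration; real sequencer output with quality scores and metadata is many times larger:

```python
# Back-of-envelope estimate of raw genomic data volume.
# Assumes the minimal 2 bits per base pair (A/C/G/T); actual
# sequencing files are far larger than this floor.
base_pairs = 3_000_000_000                    # one human genome
bytes_per_genome = base_pairs * 2 / 8         # 2 bits per base
genomes_per_year = 40_000                     # Human Longevity's target [10]
total_tb = bytes_per_genome * genomes_per_year / 1e12
print(f"{bytes_per_genome / 1e6:.0f} MB per genome")   # ~750 MB
print(f"{total_tb:.0f} TB of raw sequence per year")   # ~30 TB, at minimum
```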

One of the barriers to data analysis is incomplete or incompatible data records. Medical records collected at one hospital are often not available to other hospitals. In addition, several medical record software vendors exist, none of which are completely compatible with one another. Thus, although more medical data is being collected than ever before, most of it is not collectively searchable. However, as data analytics reduces costs by increasing rates of early detection and correct diagnosis, insurance companies are likely to require standardization in record keeping. As new drug therapies are developed more accurately and quickly through data analytic research, pharmaceutical companies are also likely to drive standardization. Government participation may help as well. In April 2013, the National Institutes of Health (NIH) announced the Big Data to Knowledge (BD2K) program to address Big Data challenges such as lack of appropriate tools, poor data accessibility, and insufficient training [12]. Among the goals of BD2K is facilitating the broad use and sharing of complex biomedical data sets. The 2014 awardees include projects to create centralized data hubs for nuclear receptor data, cardiovascular research, and fMRI-based biomarkers. Each of these projects will encourage data sources such as hospitals and universities to contribute to and populate joint databases where all partners can collaborate on furthering medical research.

Big Data and Image Processing

The recent trend toward higher-resolution images from greater-megapixel cameras has vastly increased the amount of data storage required. Higher-resolution video formats such as 4K only exacerbate the issue, as these video files store 8 MB images at 30 frames per second. Despite these daunting file sizes, some companies have found methods to extract useful information. Google's Picasa software and Facebook can both scan uploaded images and perform facial recognition with surprising accuracy. In addition, they can analyze images to infer the context in which they were taken and thereby provide better-personalized advertisements to the user. It is estimated that Facebook users post an average of 300 million images per day, resulting in many petabytes of stored information that is routinely processed and analyzed [13].
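
A rough estimate shows how quickly such images accumulate; the 2 MB average file size below is purely an illustrative assumption:

```python
images_per_day = 300_000_000            # estimated daily uploads [13]
avg_mb = 2                              # assumed average image size
tb_per_day = images_per_day * avg_mb / 1e6
print(f"{tb_per_day:.0f} TB per day")                  # ~600 TB
print(f"{tb_per_day * 365 / 1000:.0f} PB per year")    # ~219 PB
```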

Medical imaging from MRI systems offers unprecedented views of the human body, but in some cases requires 15 TB of storage per year [14]. Projects such as the Cardiac Atlas Project (CAP) [14] have begun foundational research into using machine learning techniques to recognize abnormalities in 3-D medical images. These images are data cubes containing millions of pixels of information and can therefore benefit from Big Data processing methods.
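
One way to apply such methods without loading an entire data cube into memory is to process it slice by slice. The sketch below illustrates that out-of-core pattern with a synthetic on-disk volume and a stand-in anomaly test; a real detector would be a trained model:

```python
import numpy as np

# Create a synthetic 3-D "scan" on disk, then process it slice by
# slice so the whole cube never has to fit in memory.
shape = (256, 256, 512)                        # rows, cols, slices
volume = np.memmap("scan.dat", dtype=np.uint16, mode="w+", shape=shape)
volume[:] = np.random.randint(0, 4096, size=shape, dtype=np.uint16)
volume.flush()

flagged = []
for k in range(shape[2]):
    s = np.asarray(volume[:, :, k], dtype=np.float64)
    # Stand-in anomaly test: flag slices whose mean intensity deviates
    # sharply from the slice median (a real detector would be learned).
    if abs(s.mean() - np.median(s)) > 5 * s.std() / np.sqrt(s.size):
        flagged.append(k)
print(f"{len(flagged)} slices flagged for closer review")
```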

The field of target recognition in imaging systems has existed for many decades. The US Air Force's Gorgon Stare consists of 9 different cameras, can send 65 different video feeds, and, according to Maj Gen James O. Poss, "will be looking at a whole city" [15]. The challenge, however, is to sift through the mountains of data that such systems generate. If machine learning systems can pre-process data with sufficient accuracy before sending it to the intelligence analyst, the operator workload can be vastly reduced. Even simple tasks such as moving-target tracking or scene-change detection become very difficult with such large data files. Current Big Data tools combined with pattern recognition methods can help greatly, if only in an offline capacity, since the data must be processed on high-end servers. The Air Force Distributed Common Ground System (DCGS) consists of 27 geographically separated, networked sites for the collection and processing of intelligence information gathered from a variety of sensors [16]. This system can process imagery with both data analytics and human operators to help turn Data into Knowledge for the military commander. The problem of real-time processing remains, however, as many of the imaging files are far too large to allow in-place processing.
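
As a concrete example of such pre-processing, the sketch below flags scene changes with simple frame differencing on synthetic video. It is a crude stand-in for the learned detectors such systems would actually use:

```python
import numpy as np

def scene_changes(frames, threshold=12.0):
    """Flag frames whose mean absolute difference from the previous
    frame exceeds a threshold -- the kind of cheap pre-filter that
    could cull static footage before an analyst ever sees it."""
    changes = []
    prev = frames[0].astype(np.int32)
    for i, frame in enumerate(frames[1:], start=1):
        cur = frame.astype(np.int32)
        if np.abs(cur - prev).mean() > threshold:
            changes.append(i)
        prev = cur
    return changes

# Synthetic 8-bit video: sensor noise only, then an abrupt change at frame 60.
rng = np.random.default_rng(0)
video = [rng.integers(100, 110, (480, 640), dtype=np.uint8) for _ in range(120)]
for f in video[60:]:
    f[100:300, 200:500] = 255          # a new bright object appears
print(scene_changes(video))            # -> [60]
```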

Big Data and Computer Security

In 2010, the DARPA CRASH program sought proposals for computer systems that mimic the human immune system [17]. The goal was to design a system that could survive a cyber-attack by learning about the malware even while under attack and relaying that information to other systems, which could then harden their defenses before being attacked themselves. Although the program had a 4-year life cycle and focused on single or networked computer systems, the same theory can be applied to cyber defense on a grander scale. Already, anti-virus companies find signatures for new cyber threats and convey them to all systems running their software via new virus definitions. However, this approach relies on the anti-virus company's ability to see the total threat space, which is effectively intractable. A more reliable method would be for the attacked system itself to convey details of its attacker to all other systems. New data analysis tools will have to be developed to parse the mountains of log data generated by millions of connected computing systems and determine what constitutes a true threat. Information from log files, abnormal system behavior, and the duration and effects of an attack provide inputs to a cyber security model that can correlate symptoms with attack effects. This view was echoed by Gartner Research Director Lawrence Pingree at the Intel Security FOCUS conference in 2014, where he likened security threat sharing to a police All Points Bulletin (APB), in which the criminal's description is conveyed to all police officers to maximize coverage. In security suites such as Intel's Threat Intelligence Exchange, once a system identifies that it is being attacked, symptoms of the attack are sent to a central repository, which then informs all other systems [18]. IBM likewise pairs its QRadar Security Intelligence with Big Data analytics to detect abnormal behavior and events across years of activity [18]. Such long-range assessment is invaluable for detecting stealthy malware that produces no overt effects on a system and can only be found by analyzing system behavior for extended periods before and after the attack.
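
The sketch below illustrates this pipeline in miniature: extracting one simple behavioral feature (requests per source) from web-style log lines, flagging statistical outliers, and packaging the indicator for sharing. The log format, threshold, and report schema are all illustrative assumptions, not any vendor's actual interface:

```python
import json
import re
from collections import Counter
from statistics import mean, stdev

LOG_LINE = re.compile(r'^(\S+) .* "(?:GET|POST) (\S+)')

def request_counts(log_lines):
    """Extract one simple feature from raw logs: requests per source."""
    counts = Counter()
    for line in log_lines:
        m = LOG_LINE.match(line)
        if m:
            counts[m.group(1)] += 1
    return counts

def suspicious_sources(counts, z=3.0):
    """Flag sources whose request volume sits far above the mean --
    a stand-in for a learned model of abnormal behavior."""
    vals = list(counts.values())
    if len(vals) < 2:
        return []
    mu, sigma = mean(vals), stdev(vals)
    return [s for s, n in counts.items() if sigma and (n - mu) / sigma > z]

def threat_report(source):
    """Package the indicator so other systems can harden themselves
    before they are attacked -- an APB for machines. The schema is
    illustrative, not a real exchange format."""
    return json.dumps({"indicator": source, "type": "excessive-requests"})

logs = ['10.0.0.%d - - "GET /index" 200' % (i % 50) for i in range(500)]
logs += ['203.0.113.9 - - "POST /login" 401'] * 400   # one noisy attacker
for src in suspicious_sources(request_counts(logs)):
    print(threat_report(src))
```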

The January 2013 RSA Security Brief [19] further highlights the benefits of Big Data for cyber security. The authors cite two main risk factors in the coming years. The first is the dissolution of network boundaries, as distributed private networks expand to include more customers and devices and take advantage of cloud storage and computing; the larger and more decentralized the network, the more vulnerable its nodes can be to attack. The second is more sophisticated adversaries. The sophisticated hacking attacks reported weekly today come in two flavors: large-scale, very public attacks such as those on Target, Home Depot, Sony, and other companies, in which websites were shut down or customers' information was stolen; and covert attacks that secretly and continuously steal and transmit the victims' information. These attackers have found and exploited vulnerabilities even in systems with sophisticated protection suites.

Upcoming Challenges for Big Data

Big Data analytics have yielded some stunning benefits, and there is no sign that we are near the end. Engineers are using insights gleaned through data mining to optimize assembly lines, discover new medicines, protect computer systems, and perform anomaly detection. However, as Big Data's popularity increases, so do its unique challenges. In this section, several upcoming challenges are described.

The Internet of Things

The number of connected devices is set to increase dramatically with the recent push for the Internet of Things (IoT). Through the IoT revolution, devices we use in day-to-day life will be connected to the internet and capable of conveying information. This is of great interest in building automation and control, where the open/closed status of a door or window can be tracked. Faulty components of sophisticated machinery can be identified by monitoring the health of key parts. The health of a warfighter on the battlefield can be monitored, with automated messages sent to centralized repositories that provide accurate information on battle casualties.
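
As a small illustration, an IoT door sensor's report to a central repository might be no more than a few bytes of structured data. The field names below are illustrative; a real deployment would publish over a protocol such as MQTT with an agreed-upon schema:

```python
import json
import time

def door_status_message(device_id, is_open):
    """Build the kind of compact status report a door-contact sensor
    might publish to a central repository. Field names are
    hypothetical, chosen only for illustration."""
    return json.dumps({
        "device": device_id,
        "type": "door-contact",
        "open": is_open,
        "timestamp": time.time(),
    })

print(door_status_message("bldg4-door-17", True))
```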

With this increased data collection comes the burden of archiving the data. Apple, Amazon, Facebook, and Google have all built massive data centers based on predictions of their storage needs. However, the amount of data to be stored increases monotonically: there will always be more data that needs storing. With IoT, the volume of data requiring storage will grow to the point where even current data centers may not suffice. New low-cost, low-power storage technologies are needed to keep the cost of storing information manageable.

The only way around ever-increasing storage requirements is to determine which data needs to be archived and which does not. Choosing to discard information is a difficult problem: although data may seem uncorrelated with any effect today, new data mining tools such as deep learning may find a connection in the near future. One option is to store only the usable features generated from the raw data rather than the data as a whole. For instance, in persistent surveillance, if the user only wishes to know the time at which an event happened, only the time stamp need be recorded and the image need not be stored. However, if the user later wishes to determine what the event was, that data would not be available, as the images were never stored.
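
The sketch below captures this trade-off under stated assumptions: the detector and the feature set (just a timestamp) are placeholders, and the raw frame is deliberately thrown away:

```python
import time

def archive_event(frame, detector, store):
    """If the frame contains an event, record only the feature we
    know we need (the timestamp) and discard the pixels. Should a
    future question require knowing *what* the event was, that
    information is unrecoverable -- the trade-off described above."""
    if detector(frame):
        store.append({"timestamp": time.time()})   # feature only
    # the frame itself is never written to storage

def brightness_detector(frame):
    """Placeholder detector: an 'event' means the frame is mostly bright."""
    total = sum(sum(row) for row in frame)
    return total / (len(frame) * len(frame[0])) > 128

events = []
archive_event([[255] * 64 for _ in range(64)], brightness_detector, events)
print(events)    # one timestamp; the image itself is gone
```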

Security and Privacy

A 2014 White House report on Big Data and privacy [20] found that over 80% of survey respondents were "very much concerned" about proper transparency and oversight of data practices. It is well known that internet companies such as Google, Facebook, Microsoft, and others collect and store information on their users for commercial benefit. This may be only mildly irritating when an ad for a product you searched for on Amazon follows you as you try to read your Gmail. Many of these companies have data handling policies in place to protect user identities. The fact remains, however, that these companies collect vast amounts of personal data over which users have little to no say. Moreover, even though identifying information can be removed from collected data in most cases, techniques like data fusion can piece together enough information from different sources to identify the person behind a particular record. Also, "de-identifying" information in this way can in some cases remove its usefulness for predictive analysis.
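
A toy example of this data fusion risk, with entirely fabricated records: removing names is not enough when quasi-identifiers such as ZIP code, birth date, and sex survive and can be joined against a public data set:

```python
import pandas as pd

# "De-identified" medical records: names removed, but quasi-identifiers
# (ZIP code, birth date, sex) remain. All records here are fabricated.
medical = pd.DataFrame({
    "zip":   ["97701", "97701", "10001"],
    "birth": ["1970-03-02", "1985-11-30", "1970-03-02"],
    "sex":   ["F", "M", "F"],
    "diagnosis": ["lupus", "flu", "asthma"],
})

# A second, public data set (e.g., a voter roll) that includes names.
public = pd.DataFrame({
    "name":  ["A. Smith"],
    "zip":   ["97701"],
    "birth": ["1970-03-02"],
    "sex":   ["F"],
})

# Joining on the shared quasi-identifiers re-attaches a name to the
# "anonymous" diagnosis -- the data fusion risk described above.
print(public.merge(medical, on=["zip", "birth", "sex"]))
```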

Closely related to privacy concerns are the security concerns of adequately protecting vast amounts of personal data from falling into the wrong hands. Storing data from various sources in centralized locations turns those locations into big targets for malicious cyber-attack. The challenges of cloud security include protection and trust of hardware, facilities, operators, code, and the companies using the data [21]. Often, many of these layers are not known to the subject whose data is being gathered. Even for a company engaging in Big Data analytics, it is difficult to establish a fully trusted data flow in which it can truthfully assure users that their data is well guarded. Some larger companies such as Google, Facebook, and Apple have established their own data centers to improve their level of trust by using their own hardware, but this too does not ensure security [21]. These data centers must be physically secured, access to massive computation systems must be physically restricted, and theft detection systems must be well established. Many Software-as-a-Service (SaaS) applications that collect user data run in internet browsers, which places security risk at both ends: at the user terminal, where it may not be known whether even basic security measures such as anti-virus software, firewalls, or routers are in use; and at the receiving data center, whose security measures are usually not known to the user.

Technical Challenges

The scale of Big Data is what sets it apart from previous generations of data analytics. Data sets an order of magnitude or more larger than previously usable data are now routinely processed, and storage requirements are predicted only to increase. The most apparent near-term technical challenge is the need for denser memory and faster processors. In [22], data size in kB/day was shown to increase exponentially from 2003 to 2011. In 2010, enterprises and data users stored over 13 exabytes of new data. While data archiving technologies such as tape drives exist [23, 24], they typically allow only sequential access, which is very difficult to work with for Big Data analytics. New low-power, dense, random-access memory technologies are required for the next generation of data analytics.

In addition to data storage, there is a need for the next generation of computer processors. Computer processors have famously followed Moore's law for decades, doubling in processing power roughly every 18 months. However, as incoming data increases exponentially, new architectures will be needed to keep up with demand. Processors built inherently for large-scale distributed processing are needed: processors that can sort and process data in parallel while also combining information from multiple locations to draw insights.

Finally, there must be continual development of Big Data analytical software. Research in machine learning and pattern recognition has been cyclical, with intense interest in some decades and little in others. Currently there is an uptick in research on these algorithms, driven by the many Big Data success stories. However, for every improvement in classification accuracy by a next-generation classifier, there must be a corresponding development in software that can run these classifiers in parallel across massive data centers and spatially separated data sets. Next generations of data retrieval and indexing software will also need to be developed for such massive data sets.

Future Trends in Large Data Handling

The processing of massive data sets has yielded many significant results, and the benefits more than justify the improved technologies that will be needed in the years to come. The many near- and long-term challenges for Big Data have been matched by a corresponding increase in research and development in this field. Technology is being developed on multiple fronts to allow the successes achieved through the processing of massive data sets to continue in future years.

Dr. Michio Kaku writes that we are rapidly entering the stage of computer commoditization, in which computing power is sold as a commodity [25]. The rapid rise of Big Data facilities and services such as Amazon AWS cloud computing and storage bears this out. In most circumstances, companies using Big Data need large computational resources only for a relatively short period, after which those systems would sit idle. Multiple models of computing commoditization are available now, including SaaS, Analytics as a Service (AaaS), and Infrastructure as a Service (IaaS); these service models let businesses pay only for what they need. Complementing computational commoditization, various cloud services allow data storage in third-party data centers, which saves companies cost by consolidating IT personnel and resources.

With the increasing amount of information being collected, the need for more advanced storage will outpace the need for more compute power, and much research is therefore directed at next-generation storage technologies. A good example is non-volatile Resistive Random Access Memory (RRAM). As transistor size decreases, electron leakage becomes significant and can affect the stored values in adjacent memory cells. RRAM seeks to replace the traditional transistor-based memory model with an alternative that lowers leakage current while reducing size and increasing endurance. RRAM represents a memory cell with a switching material sandwiched between two electrodes; the motion of charge under an applied electric field causes a measurable change in resistance, which is read out as a memory value. The company Crossbar has recently announced commercialization of 3-D stacked RRAM, which increases memory density [26], and has indicated that the first application of its technology will be high-capacity enterprise systems. Other memory technologies, such as Magnetoresistive RAM (MRAM) and Phase-Change Memory (PCM), are also being developed in an effort to find a suitable replacement for traditional DRAM- and NAND-based memory architectures. Commercialization of these technologies in the coming years will significantly improve storage.

The current trend toward mobile computing has increased the need for low-power compute and data storage technology. As previously mentioned, large data centers consume very large amounts of power. New lower-power processors with smaller transistors will help bring power requirements down. These low-power processors also make real-time data analytics possible: instead of sending raw data to the data center for processing on massive compute servers, the data is sifted in real time at the source and only the useful portion is sent to the central repository for storage. Further, some processors are changing the traditional microprocessor architecture from a serial to a parallel one. The GPU manufacturer Nvidia has developed a programming interface called Compute Unified Device Architecture (CUDA) that allows programmers to exploit the large number of parallel processors in a GPU. Nvidia sees great potential in machine learning and has developed the deep learning library cuDNN to decrease processing time [27]. In some cases, a speedup of 14× versus a 12-core CPU was realized when training a neural network on an image dataset.
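
As an illustration of tapping those parallel processors from high-level code, the sketch below trains a tiny convolutional network on synthetic data, falling back to the CPU when no GPU is present. PyTorch is used here as one example of a cuDNN-backed framework (our choice for illustration; the speedups reported in [27] were measured with other software):

```python
import torch
import torch.nn as nn

# Use the GPU (and thus cuDNN's tuned convolution kernels) when one
# is available; otherwise run the same code on the CPU.
device = "cuda" if torch.cuda.is_available() else "cpu"

model = nn.Sequential(
    nn.Conv2d(3, 16, 3, padding=1), nn.ReLU(),
    nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(16, 10),
).to(device)

x = torch.randn(64, 3, 32, 32, device=device)   # one synthetic batch
y = torch.randint(0, 10, (64,), device=device)  # synthetic labels
opt = torch.optim.SGD(model.parameters(), lr=0.01)

for step in range(5):
    opt.zero_grad()
    loss = nn.functional.cross_entropy(model(x), y)
    loss.backward()
    opt.step()
    print(step, loss.item())
```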

GPU computing’s Achilles heel has always been memory transfer. GPUs can only store small amounts of memory on board and the latency of transferring memory back and forth between the GPU and CPU often makes them unusable for Big Data applications. However, recent work in MapD architecture has shown that GPUs can be used in data sorting and produce significant time savings. In [28], the author generated a hierarchical memory structure similar to that in the classical CPU architecture, with hard drives at the lower level, main memory buffer pool in the middle and GPU memory at the top. Similar to memory cache behavior, data is read and stored in chunks and if the required data is not in the cache, it is searched for at the next lower memory level. With this memory framework in place, the author was able to achieve speedup of over 1,000,000 times from using the algorithm on the GPU versus an in-memory CPU implementation. Real-time twitter search throughout the entire United States for particular words is now possible with sub-$1000 systems, which can be useful in a variety of ways to sense words that are trending on social media and correlate them to points on a map [28, 29]. By knowing what is popular among users in a particular location, an estimate of the collective emotions in the area can be predicted. This can be useful for marketing the right product for the correct audience, but also for allocating appropriate crowd control resources in volatile situations.

Conclusions

Many benefits have been realized with Big Data analytics and cloud computing resources. Data an order of magnitude or more larger than in previous generations can now be routinely analyzed to discover hidden patterns. The fields of medicine, cyber security, image processing, and defense can all benefit by tracking collected information and drawing conclusions from bigger data sets than were ever previously available. Future progress in Big Data technology will depend on research to overcome significant technological, security, and policy barriers. Big Data, however, takes advantage of a general rule of statistics: in most cases, more data yields better statistical models. With the aid of these better models, the potential exists to draw Wisdom from Data more accurately than ever.