Introduction

Sleep deprivation is one of the most common problems in children. Approximately 69 % of American adolescents report insufficient sleep [1]. Lack of sleep can cause daytime drowsiness, impair memory and attention, and influence other cognitive and emotional functions [2], which can result in poor academic performance [3]. In addition, sleep deprivation has been linked to other physiological problems, including obesity and diabetes, in children [4, 5]. Therefore, it is important to be able to measure sleep.

Studies in the sleep laboratory are very accurate but expensive, and only measure one night’s sleep in an artificial environment. Sleep diaries are inexpensive, but depend entirely on an accurate report by the patient. Some actigraphy devices can accurately measure sleep in the home over a prolonged (1–2 weeks) period [6], but they are expensive and specialized equipment is needed to download and interpret the data. This creates a great need for a simple, inexpensive means to measure the quantity and quality of sleep in the home.

The use of smartphones in medicine has become increasingly common, as almost two-thirds of Americans own them [7], creating a vastly accessible platform through which patients can be reached. Applications (apps) have been created to keep track of Alzheimer’s patients at risk of wandering away [8], to monitor blood pressure [9], to track diet and fitness, and to monitor patients’ progress during rehabilitation [10]. Apps for measuring sleep have also been developed. Many of these require an external device connected to the smartphone, including a Bluetooth-linked external single channel ECG [11], a snoring detector [12], an application that combines snoring data with sleep apnea education [13], several wristwatch devices [14], and a single channel EEG [15].

Integrating medical devices with smartphones is a promising idea, but most of the aforementioned apps still require patients to buy a separate medical device. Therefore, a smartphone application that accurately measures sleep without the aid of external devices would be useful. Apps like this have been developed and tested against actigraphy [16, 17], but only recently have they begun to be compared with results from polysomnography (PSG). One study found one of these apps to correlate poorly with PSG when it comes to basic sleep parameters and sleep staging [18]. Another app was found to be a weak differentiator of wake and sleep [19].

Maciek Drejak Labs (now Northcube AB) developed a smartphone app for the iPhone (Sleep Cycle App, (SCA)) that purports to assess sleep for people of all ages non-invasively by using the accelerometer of the phone to detect motion. Using proprietary software, it claims to be able to provide a report on the timing of different “stages” of sleep (classified into “wake”, “sleep”, and “deep sleep”) by sensing the motion of the bed. A new feature has recently been added that claims to make use of the phone’s microphone to estimate sleep. Our study focused solely on the motion-detecting function since this would compare more directly with medical-grade actigraphy, as well as consumer-oriented wrist devices. We assessed the accuracy of this app by comparing it with sleep staging determined simultaneously in children who are undergoing overnight PSG for clinical reasons.

Methods

Patients were recruited from the sleep laboratory at the Children’s Hospital of Philadelphia. All had been referred for a clinical suspicion of sleep disorders, usually obstructive sleep apnea (OSA). This study was approved by the Institutional Review Board and informed consent was obtained from all individual participants included in the study. Anthropometric data, including height, weight, sex, and age were obtained in the sleep lab at the time of recruitment. At the start of each subject’s sleep study, an iPhone 4S running SCA was placed on the bed at a consistent location near the subject’s head, as recommended on the developer’s website. The same device was used throughout the study.

The PSGs were scored by an experienced sleep technologist using standard pediatric scoring criteria and reviewed by a Board-certified sleep physician. The following parameters were recorded (Rembrandt, Embla, Broomfield, CO): electroencephalogram (EEG) (C3/A2, C4/A1, F3/A2, F4/A1, O1/A2, O2/A1), electrooculogram (EOG) (left and right), submental electromyogram (EMG), tibial EMG, modified lead 2 electrocardiogram, chest and abdominal wall motion by respiratory inductance plethysmography (SensorMedics, Yorba Linda, CA), airflow by nasal pressure (Pro-Tech Services, Inc, Mukilteo, WA) and three-pronged oronasal thermistor (Pro-Tech Services, Inc, Mukilteo, WA); end-tidal PCO2 (ETCO2) by capnography (Novametrix 7000; Novametrix, Wallingford, CT), arterial oxyhemoglobin saturation (SpO2; Novametrix 7000 or Masimo, Irvine, CA), oximeter pulse waveform, and digital video. Studies were scored utilizing the American Academy of Sleep Medicine standards [20, 21]. Sleep latency was calculated as the number of minutes from “lights out” to sleep onset. The apnea/hypopnea index (AHI) was calculated as the number of apneas + hypopneas per hour of sleep. The arousal index (AI) was calculated as the number of arousals + awakenings per hour of sleep.

The total time in each of the app’s sleep state classifications, the latency to “sleep,” and the latency to “deep sleep” were acquired using the graphs generated by the SCA. The lengths of the graph segments were converted into minutes by comparing them to the length of the entire graph, which represented a known total number of minutes. The investigators reviewing these data and performing these calculations had no prior knowledge of the PSG results.

The calculated data from the SCA were compared with analogous parameters from the PSG data. The SCA’s latency to “sleep” and latency to “deep sleep” were compared with the PSG’s sleep latency, and the SCA’s total sleep time (TST) was compared with the PSG’s TST. Bivariate linear regression and Bland-Altman plots were used for these comparisons. The concordance correlation coefficient for each PSG/SCA pair and its associated 95 % confidence interval were computed to assess the agreement between PSG and SCA. In addition, paired t test and Wilcoxon signed-rank test were performed to evaluate the difference in means and medians between PSG and SCA.

We hypothesized that certain stages of sleep might contribute more to the app designation than others. For example, we hypothesized that “deep sleep” might represent REM and/or N3 sleep. To assess the relationship between app staging and the PSG, stepwise regression was used with app “stage” as the dependent variable (i.e., minutes spent “awake,” “sleep,” and “deep sleep”), and PSG stages as independent variables (i.e. minutes spent in “Wake,” “N1,” “N2,” N3,” “REM,” and all combinations of summing them together (e.g., “Wake + N1,” “Wake + N1 + N2,” etc.)).

Results

Twenty-five patients participated in the study, ranging from 2 to 14 years of age. Twenty-two had been referred for suspected OSA, and three were participating as volunteers for other studies. Subject demographics are displayed in Table 1, PSG parameters are shown in Table 2, and SCA parameters are shown in Table 3. Eight subjects had AHI ≤ 1.5, nine had between 1.5 and 5, one had between 5 and 10, three had between 10 and 20, and four had AHI ≥20.

Table 1 Anthropometric data of subjects
Table 2 General sleep statistics from PSG
Table 3 SCA results

Most graphs from the app visually resembled the hypnograms from the PSG (Fig. 1), but some did not (Fig. 2). There was no significant relationship between the app and PSG measurements of TST (Fig. 3), sleep latency (Fig. 4), or sleep efficiency. The differences were random and without a systematic bias, as determined by Bland-Altman plots.

Fig. 1
figure 1

Visual comparison between PSG hypnogram and SCA graph: example of good visual concordance. Progression of sleep stages throughout the night is shown, where top graph is PSG, with y-axis sleep stages of REM (R), wake (W), N1 (1), N2 (2), and N3 (3), and bottom graph is SCA, with y-axis sleep states of “awake,” “sleep,” and “deep sleep”

Fig. 2
figure 2

Visual comparison between PSG hypnogram and SCA graph: example of poor visual concordance. Progression of sleep stages throughout the night is shown, where top graph is PSG, with y-axis sleep stages of REM (R), wake (W), N1 (1), N2 (2), and N3 (3), and bottom graph is SCA, with y-axis sleep states of “awake,” “sleep,” and “deep sleep”

Fig. 3
figure 3

Comparison of PSG (Lab) and SCA (Phone) measurement of total sleep time (TST). Each point represents a subject. a Linear regression is dotted line, line of identity (if SCA were to perfectly match PSG) is solid line. b Bland-Altman plot, concordance correlation coefficient (CCC), and 95 % confidence interval (95 % CI) values given

Fig. 4
figure 4

Comparison of PSG (Lab) and SCA (Phone) measurement of sleep latency. a Linear regression is dotted line, line of identity is solid line. b Bland-Altman plot, concordance correlation coefficient (CCC) and 95 % confidence interval (95 % CI) values given

There was a significant relationship between sleep latency on the PSG and the latency to “deep sleep” on the app (Fig. 5; r-squared = 0.33, p = 0.03). There remained no relationship between TST and total “deep sleep” time on the app. There were no effects of AHI, weight, age, or AI on app accuracy.

Fig. 5
figure 5

Comparison of PSG (Lab) sleep latency and SCA (Phone) latency to “deep sleep.” a Linear regression is dotted line, line of identity is solid line. b Bland-Altman plot, concordance correlation coefficient (CCC), and 95 % confidence interval (95 % CI) values given

When comparing sleep stages in PSG versus the app, there was a borderline (p = 0.056) relationship between “deep sleep” and minutes in “N1 + N3.” No other matches between PSG sleep stages and app “stages” were found.

Discussion

Although the visual images suggest that this simple, consumer-oriented app may yield a gross approximation of the amount and quality of sleep, our hard numerical data do not support its use in a quantitative fashion. Only latency to “deep sleep” on the app showed a significant relationship with any PSG parameters (sleep latency).

We considered other factors that might affect the accuracy of the app. As this was a highly selected population being investigated for OSA, we considered whether abnormal sleep architecture in the subjects could have affected its accuracy. For example, patients with OSA might have more movement/arousals than a healthy individual, which could cause deviation from the model used to develop the algorithm. Weight was considered, as heavier subjects would move the mattress with more intensity and movements might be more easily detected by the app. AI was also considered, as more arousals throughout the night without consequent movement could create discrepancies between the PSG and app wakefulness readings. However, none of these variables affected the discrepancy between app and PSG readings for the parameters analyzed.

There are other possible explanations for the lack of accuracy. The SCA measures movement. It is possible that the bed dampens movement enough to impair the accuracy of the app (we can only speculate on the effects of a waterbed!). The sensitivity of the accelerometer, at least, does not seem to be the problem, as we found the phone could detect even the most miniscule motions from the opposite edge of a normal mattress. The website for the Sleep Cycle App claims that it works on all types of beds, including water beds and memory foam mattresses. It even claims that it works when two people share a bed. We used only firm, hospital-style mattresses, and there was no co-sleeping so we were unable to evaluate this claim. It is possible that a wrist device might yield better results without the potentially dampening effects of the bed.

Even though there was a significant relationship between the app’s latency to “deep sleep” and the PSG’s sleep latency, the correlation between the two parameters was low. Because they did not have a 1:1 relationship, sleep latency cannot be predicted directly by the app’s latency to “deep sleep,” but rather as a fraction of this variable. This finding could be useful for developers who are building sleep measuring algorithms and suggests that better, perhaps less stringent criteria for “sleep” might improve the app’s accuracy in measuring sleep latency.

Although there was little numerical agreement between the PSG and the app, the strong relationship between “deep sleep” and stages N1 + N3 is encouraging, since N3 sleep usually has less movement than other stages, suggesting that with further refinement of the algorithm, and/or the accelerometer sensitivity, a closer relationship between the app and PSG may emerge.

This study has some limitations. Firstly, the sample size was only 25 subjects. Perhaps a larger sample might validate a relationship between “deep sleep” and N3. Our subjects were all children, varying in weight, height, and age. However, there was no effect of age or weight on accuracy of the app. Finally, the subjects recruited for this study were already scheduled for an overnight sleep study, meaning some of them were suspected for OSA or some other sleep disorder. Subjects with complicated conditions such as genetic or craniofacial disorders were excluded, and neither AHI nor AI was found to contribute to the variability of the app.

The app was unable to provide actual minutes spent in each reported sleep state. These data were gleaned by physically measuring the lengths of the graphs in an analog fashion to determine number of minutes. This may introduce an element of error caused by the extra step of human measurement.

Since the PSG assesses sleep stage in 30 s epochs, brief state changes and arousals were recorded, while the app graphs had more smoothing (Fig. 1). A noticeable consequence of this was in the app’s measurement of time spent awake. There were many moments where in the PSG, a subject had multiple arousals but spent most of the actual time asleep. The app registered these entire periods of time as “wake.” Some REM periods were also registered as “wake” by the app because of arousals before and after the stage. This observation follows a similar trend among sleep measuring devices [22], including actigraph devices [23], of having low specificity in detecting wake from sleep.

Our use of the iPhone 4S as our study instrument was not ideal, as new generations of Apple and Android phones have since emerged. These new phones, which have more sensitive accelerometers and greater hardware capabilities, could perhaps allow for more accurate results from sleep measurement apps. We used the technology available at the time we started the study; Apple will always be able to make phones faster than the IRB can approve studies, so this will remain a problem.

Finally, it is important to emphasize that these results are only directly applicable to the Sleep Cycle App, as algorithms and design vary among the different sleep-measuring apps available on the market. However, this app was the most widely purchased at the start of the study.

This study is not meant to promote or defame a particular device. The use of these consumer sleep-measuring devices is a reality in the patient base that we are trying to investigate. In clinical settings, patients increasingly show up with data from these devices [24]. Our data suggest that they should not be used as diagnostic tools. Medical devices using similar principles [6] have been well studied and validated. Until consumer-oriented devices are shown to be accurate, physicians might unfairly dismiss their graphs or give them more legitimacy than deserved. Likewise, consumers need to know if these apps are usable for self-diagnosis, or if they should be treated as nothing more than entertainment.

Further research into the field of smartphone medical tools may include potential integration with other sleep applications, such as snoring detectors or utilization of other phone hardware such as the camera. Finding innovative ways to use them to measure sleep parameters and then integrating them into full sleep architecture could be promising, especially as smartphones become increasingly advanced. In addition, integration with smartwatches could potentially offer a more accurate measure of movement, as it would not be dependent on the movement of the bed.

There is evidence that the poor performance of consumer-oriented sleep measuring devices is mainly due to imperfect algorithms. While some actigraph devices estimate sleep with some degree of validity, their low specificity is a point of concern, particularly in pediatric samples [25]. Even though other consumer devices work similarly by measuring motion, they do not measure sleep with the same accuracy as medical-grade devices [22]. Even actigraphy tends to be more accurate with human—rather than computerized—scoring [26]. Its algorithms seem to be improving over the long term, which suggests an optimistic potential for consumer sleep-measuring devices. In fact, one fitness tracker device was shown to have good concordance with PSG in adolescents [27].

Finally, even if the app may not be accurate, it may at least be useful in educating users about sleep. In today’s technology-driven society, a flashy app might just be able to make the patient population more scientifically literate. As people get excited about measuring their own sleep, they will learn enough to have more meaningful conversations with physicians, increasing the quality of their healthcare. If the app can be improved, it has the potential to encourage users to fix poor sleep habits on their own, reducing unnecessary trips to the clinic. It could also be used to improve outcomes from behaviorally based interventions, which are already highly effective [28], or as a critical detection tool for early sleep problems such as behavioral insomnia of childhood [29], as it could allow physicians to monitor their patients’ long-term sleep patterns more reliably than with sleep diaries or self-reports. From keeping track of bedtimes to checking for abnormal awakenings or sleep quality, the app could provide the physician with better data while also eliminating a chore for the patient.

``We conclude that the Sleep Cycle App is not currently useful as a tool for sleep estimation. The difficulty is likely with the computation of the raw data picked up by the app. Some of these devices are marketed as “alarm clocks” that claim to awaken the user during a “light stage” of sleep. This study calls into question the ability of these “alarm clocks” to act as advertised. Our hope is that the results of this study can give developers and scientists new information to refine their models and that someday a cheap, accurate measurement for sleep is publicly available.