Data for our analysis were extracted from the ongoing prospective, multicenter, international REALITY (Long-Term Real-World Outcomes Study on Patients Implanted with a Neurostimulator) study (NCT03876054). Before the study began, Institutional Review Board or Ethics Committee approval was obtained at each site, and all patients provided written informed consent. The devices used in this study are FDA-approved or approved by a corresponding national agency for this indication. Study visits occur at enrollment, at baseline, peri-operatively, and at six months, one year, and yearly thereafter until the subject has been followed for five years post-implantation.
The eligibility criteria included a baseline pain score ≥ 6 on the 0–10 NRS and being scheduled for implantation of an Abbott SCS or dorsal root ganglion stimulation (DRGs) neurostimulation system (Abbott Neuromodulation, Austin, Texas) within 60 days of the baseline visit. To replicate the range of complex patients seen in daily medical practice, the REALITY study inclusion criteria were developed with few restrictions on the pain indication, as permitted by regulatory guidelines in each geographical area and according to standard clinical practice.
Demographics, pain etiology, and chronic pain history were collected at baseline. Various patient-reported outcome measures, described in detail below, were collected at baseline and at each follow-up study visit in accordance with the IMMPACT recommendations (37) to capture the effects of therapy on subjects’ pain, function, disability, and mental health. Pain intensity was measured using the NRS, where 0 is no pain and 10 is the worst pain imaginable (38). Physical function and pain-related disability were measured with the 10-item Oswestry Disability Index (ODI) (16). Each section of the scale covers a different domain (pain intensity, personal care, lifting, walking, sitting, standing, sleeping, sex life, social life, and traveling). Emotional distress was assessed with the 13-item PCS, which yields a total score and three subscale scores assessing rumination, magnification, and helplessness (39). PROMIS-29 was used to assess the following domains: Physical Function and Ability to Participate in Social Roles and Activities, as well as the seven-day average of the subject’s Depression, Anxiety, Fatigue, Sleep Disturbance, Pain Interference, and Pain Intensity (40, 41). Other standardized metrics, such as the PGIC (42), a 7-question scale to assess patient global health and a subject’s impression of clinical change, were also collected at each follow-up. Subjects were also asked to report their satisfaction with the pain relief provided by the therapy on a five-point scale: very satisfied, satisfied, neither satisfied nor dissatisfied, dissatisfied, or very dissatisfied.
REALITY Wearable Sub-Study
The REALITY wearable sub-study was devised to investigate the feasibility of using smartwatch recordings to measure physiological and behavioral markers and to characterize patient experience from baseline to six months post-implantation in a subgroup of patients with access to a smartwatch. The sub-study visits occurred at enrollment, at baseline, and at three and six months post-implantation. All sub-study participants were given an Apple® Watch (Series 3) at enrollment and were instructed to enter NRS scores in a custom watch application daily from baseline to six months after implantation. In addition to the NRS scores, the watch application passively collected several HealthKit metrics for activity, behavior, and cardiac measures, such as step count, stand time, distance walked/run, heart rate, and heart rate variability. Participants were asked to start the watch application at least once a day, and the NRS data were sent to secure cloud storage after they selected their current pain level from 0 (no pain) to 10 (the worst pain imaginable). The custom REALITY iPhone application was used to collect behavioral data and PROMs such as PROMIS-29, ODI, PCS, and PHQ-9 on a regular basis (at least three times pre-implant and monthly post-implant). In addition, PGIC was collected at least monthly post-implant, and many subjects reported PGIC multiple times a month.
Data Preprocessing and Missing Data
Statistical features of the daily windowed data, including the maximum, minimum, sum, mean, standard deviation, and the 25th, 75th, and 90th percentiles, were included in the feature set. To balance weights and missing data for low-resolution wearable features in our analyses, we used daily window averaging for data points with the same pain level. However, because of the large amount of missing and low-resolution sleep data acquired through HealthKit on the Series 3 smartwatch, this measure was excluded from further analysis. In addition, the heart rate variability (HRV) calculated through Apple® HealthKit had missing data points, and the inter-beat interval was not accessible. Heart rate values were therefore used to estimate the inter-beat interval, and HRV was calculated using three different methods: the root mean square of successive differences between inter-beat intervals (RMSSD), the standard deviation of the average artifact-free inter-beat intervals (NN intervals) for each 5-minute period over a 24-hour HRV recording (SDANN), and the average of the standard deviations of all the NN intervals for each 5-minute period over a 24-hour HRV recording (SDNNI) (43). To handle missing subjective data from PROs, we imputed each missing value with the average score across all data points at the same NRS or PGIC level. The grouping variable depended on the output of the machine learning model: PGIC level was used to impute missing PROs for NRS modeling and, conversely, NRS level was used for PGIC modeling.
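As an illustration of how the three HRV measures can be estimated when only heart rate is available, the following is a minimal sketch, not the study's implementation. It assumes each heart-rate sample (in beats per minute) can be converted to an approximate inter-beat interval of 60000 / HR milliseconds, that samples carry a timestamp in seconds, and that the population standard deviation is used. The function names are hypothetical.

```python
import math
import statistics


def hr_to_ibi_ms(heart_rates_bpm):
    """Approximate inter-beat intervals (ms) from heart-rate samples (bpm).

    Assumption: one interval per sample, IBI = 60000 / HR.
    """
    return [60000.0 / hr for hr in heart_rates_bpm if hr > 0]


def rmssd(ibis_ms):
    """Root mean square of successive differences between intervals."""
    diffs = [b - a for a, b in zip(ibis_ms, ibis_ms[1:])]
    return math.sqrt(sum(d * d for d in diffs) / len(diffs))


def five_minute_segments(samples, seg_seconds=300):
    """Group (timestamp_s, ibi_ms) pairs into consecutive 5-minute bins."""
    bins = {}
    for t, ibi in samples:
        bins.setdefault(int(t // seg_seconds), []).append(ibi)
    return [bins[k] for k in sorted(bins)]


def sdann(segments):
    """Standard deviation of the per-segment mean intervals."""
    means = [statistics.fmean(s) for s in segments if s]
    return statistics.pstdev(means)


def sdnni(segments):
    """Mean of the per-segment standard deviations of the intervals."""
    sds = [statistics.pstdev(s) for s in segments if len(s) > 1]
    return statistics.fmean(sds)
```

In use, a 24-hour recording would be passed through `hr_to_ibi_ms`, paired with its timestamps, and binned with `five_minute_segments` before calling `sdann` and `sdnni`. Because smartwatch heart-rate samples are sparse, such estimates are coarser than HRV computed from true beat-to-beat intervals.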
Dimensionality Reduction
Principal component analysis (PCA) was used to understand the similarities across the many subjective measures in both the REALITY main study and the sub-study. The same analysis was performed to compare the subjective measures and the wearable objective measures (WOMs) collected for all REALITY sub-study patients. PCA is a statistical method used to reduce the dimensionality of large datasets while increasing interpretability with minimal information loss (44). PROMs from all scales and WOMs were treated as unique entries. Data points from the various scales were standardized using the z-score (standard score) prior to this analysis. The z-score expresses the distance between a data point and the population mean in units of standard deviation. Similarities between the clusters of PROMs and WOMs were compared.
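The standardization step that precedes PCA can be sketched as follows. This is an illustrative example only: the table layout (one row per observation, one column per measure), the use of the population standard deviation, and the function name are assumptions, and the PCA itself is not shown.

```python
import statistics


def zscore_columns(rows):
    """Standardize each column of a list-of-rows table to mean 0 and SD 1.

    Columns with zero variance are mapped to 0.0 to avoid division by zero.
    """
    columns = list(zip(*rows))
    means = [statistics.fmean(c) for c in columns]
    sds = [statistics.pstdev(c) for c in columns]
    return [
        [(x - m) / s if s > 0 else 0.0 for x, m, s in zip(row, means, sds)]
        for row in rows
    ]


# Two measures on different scales (for example, a 0-10 and a 0-100 scale)
# become directly comparable after standardization.
table = [[2.0, 10.0], [4.0, 30.0], [6.0, 50.0]]
standardized = zscore_columns(table)
```

Standardizing first keeps measures with large numeric ranges from dominating the principal components purely because of their units.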
Predictive Models using Machine Learning
Machine learning models were developed from baseline demographics and medical history, wearable objective measurements, and subjective PROMs collected as described previously. The training sets were balanced across the classes of each output variable. Subject response to SCS therapy was evaluated based on how well various objective and subjective input variables to the machine learning model could predict pain and PGIC categories. For prediction of PGIC, PGIC scores were categorized into three responder groups: 1) subjects who selected “No change (or the condition has gotten worse)”, “Almost the same, or hardly any change at all”, “A little better, but no noticeable change”, or “Somewhat better, but the change has not made a real difference” were considered non-responders; 2) subjects who selected “Moderately better, and a slight but noticeable change” or “Better and a definite improvement that has made a real and worthwhile difference” were considered moderate responders; and 3) subjects who selected “A great deal better and a considerable improvement that has made all the difference” were considered super responders. Similarly, for prediction of pain, NRS scores were categorized into three groups: mild (NRS < 4), moderate (NRS ≥ 4 and NRS ≤ 6), and severe (NRS > 6).
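The two categorizations can be written as simple lookup rules. The sketch below is illustrative; it assumes the seven PGIC response options are coded 1 to 7 in the order listed above, which the text does not state explicitly, and the function names are hypothetical.

```python
def nrs_category(nrs):
    """Map a 0-10 NRS score to mild (< 4), moderate (4-6), or severe (> 6)."""
    if not 0 <= nrs <= 10:
        raise ValueError("NRS must be between 0 and 10")
    if nrs < 4:
        return "mild"
    if nrs <= 6:
        return "moderate"
    return "severe"


# Assumption: PGIC options are coded 1..7 in the order they are listed.
PGIC_GROUPS = {
    1: "non-responder",
    2: "non-responder",
    3: "non-responder",
    4: "non-responder",
    5: "moderate responder",
    6: "moderate responder",
    7: "super responder",
}


def pgic_category(pgic):
    """Map a PGIC response code (1-7) to its responder group."""
    try:
        return PGIC_GROUPS[pgic]
    except KeyError:
        raise ValueError("PGIC must be an integer from 1 to 7") from None
```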
For the sub-study, a Random Forest (RF) model (45) was implemented to predict PGIC and NRS levels using the PROMs and WOMs collected from the Apple® Watch. Eighty percent of the data was used to train the model, and the model was tested on the remaining 20%. To increase the robustness of the predictions, the Random Forest model was trained 50 times, each time on a randomly selected 80% of the available input data, and the reported outcomes were averaged across all 50 runs. The accuracies of predictive models developed on both the main study and the sub-study, with and without the objective measures, were compared. The models were tested with different sets of input variables: PROMs models for the REALITY main study, in which PROs were the inputs to the Random Forest model to predict NRS and PGIC, and PROMs or WOMs models for the REALITY wearable sub-study, in which PROMs and objective measures were the inputs to predict NRS and PGIC.
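The repeated 80/20 evaluation scheme can be sketched independently of the classifier. The example below is a minimal illustration under stated assumptions: the classifier is passed in as a callable, and the `majority_class_fit_predict` placeholder is a trivial baseline, not a Random Forest. In practice a Random Forest implementation from a machine learning library would be supplied in its place. Class balancing of the training sets is not shown, and all names are hypothetical.

```python
import random
import statistics
from collections import Counter


def split_80_20(n_samples, rng, train_fraction=0.8):
    """Return shuffled (train, test) index lists for one random split."""
    indices = list(range(n_samples))
    rng.shuffle(indices)
    cut = int(round(train_fraction * n_samples))
    return indices[:cut], indices[cut:]


def majority_class_fit_predict(X_train, y_train, X_test):
    """Placeholder classifier (NOT a Random Forest).

    Predicts the most common training label for every test sample.
    """
    most_common = Counter(y_train).most_common(1)[0][0]
    return [most_common] * len(X_test)


def repeated_holdout_accuracy(X, y, fit_predict, n_runs=50, seed=0):
    """Average test accuracy over n_runs independent random 80/20 splits."""
    rng = random.Random(seed)
    accuracies = []
    for _ in range(n_runs):
        train_idx, test_idx = split_80_20(len(y), rng)
        predictions = fit_predict(
            [X[i] for i in train_idx],
            [y[i] for i in train_idx],
            [X[i] for i in test_idx],
        )
        correct = sum(p == y[i] for p, i in zip(predictions, test_idx))
        accuracies.append(correct / len(test_idx))
    return statistics.fmean(accuracies), accuracies
```

Averaging over many random splits reduces the dependence of the reported accuracy on any single partition of the data, which matters most when the dataset is small.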