One of the greatest challenges in clinical trial design is dealing with the subjectivity and variability introduced by human raters when measuring clinical endpoints. We hypothesized that robotic measures that capture the kinematics of human movements collected longitudinally in patients after stroke would bear a significant relationship to the ordinal clinical scales and potentially lead to the development of more sensitive motor biomarkers that could improve the efficiency and cost of clinical trials.
We used clinical scales and a robotic assay to measure arm movement in 208 patients 7, 14, 21, 30 and 90 days after acute ischemic stroke at two separate clinical sites. The robots are low-impedance, low-friction interactive devices that precisely measure speed, position and force, so that even a hemiparetic patient can generate a complete measurement profile. These profiles were used to develop predictive models of the clinical assessments employing a combination of artificial ant colonies and neural network ensembles.
The resulting models replicated commonly used clinical scales to a cross-validated R2 of 0.73, 0.75, 0.63 and 0.60 for the Fugl-Meyer, Motor Power, NIH stroke and modified Rankin scales, respectively. Moreover, when suitably scaled and combined, the robotic measures demonstrated a significant increase in effect size from day 7 to 90 over historical data (1.47 versus 0.67).
These results suggest that it is possible to derive surrogate biomarkers that can significantly reduce the sample size required to power future stroke clinical trials.
All relevant data are within the manuscript and its Supporting Information files.
Stroke is the leading cause of permanent disability in the United States [
An important potential advantage of robotic devices over “traditional” clinical instruments is that the measurement variability due to the skills and expertise of the rater can be removed from the assessment process. It has been shown repeatedly that standard clinical scales, such as the Fugl-Meyer assessment (FM) [
However, the correlation between robotic assays and established clinical scales—such as the NIH stroke scale (NIH) [
Here, we provide methods and modeling details of a longitudinal study involving 208 patients who had suffered severe to moderate acute ischemic stroke and were assessed with four commonly used clinical instruments [
In this study, 208 patients who had suffered acute stroke (defined as patients with a baseline NIH of 7–20 recorded 7 days after stroke onset) were enrolled and were given a battery of standard clinical assessments including the NIH, FM, MR and MP [
The RMK battery consists of several metrics derived from various directed unassisted reaching tasks, circle drawing, resistance to external forces, and shoulder strength measurement. These metrics are listed in
Measurement | Metrics | Abbreviation | Additional Description |
---|---|---|---|
Primary Motion | Aim | Aim | |
 | Deviation of Path | Deviation | Maximum distance between the straight-line path and the patient's motion |
 | Average Speed | Mean Speed | |
 | Peak Speed | Peak Speed | |
 | Movement Duration | Duration | Time to reach target |
 | Jerk Metric | Smooth M/P | Mean speed / peak speed |
 | Jerk Metric 1 | Smooth J1 | Jerk metric normalized by peak speed |
 | Jerk Metric 2 | Smooth J2 | Jerk metric normalized by duration |
Circle Drawing | Ellipse | Ellipse | Difference between the major and minor axes of a drawn circle |
Sub-Movements | Number of Sub-movements | Numb Subm | Number of sub-movements |
 | Duration of Sub-movements | Dur Subm | Average width of the sub-movement velocity profile |
 | Sub-movement Overlap | Overlap Subm | Degree of overlap between sub-movements |
 | Sub-movement Peak | Max Subm | Maximum height of the sub-movements |
 | Sub-movement Skewness | Sigma Subm | Statistical skewness of the sub-movements |
 | Sub-movement Intervals | Dist Subm | Inter-peak interval of the sub-movements |
Power | Static Resistance | Plbck | Resistance against force generated by the robot |
 | Dynamic Resistance | Rnd Dyn | Average distance moved vs. set resistance level |
Shoulder Strength | | Mean Z | Resistance against force generated by the robot in the vertical direction |
The same reaching tasks are broken down further into sub-movements as described by Novak et al [
Since the RMK endpoints used a variety of units and the assessments fell across widely divergent ranges, each endpoint was linearly normalized from 0 to 1 using the formula:
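The normalization step can be sketched as follows (the formula itself is not reproduced in this extract; standard linear min–max scaling to [0, 1] is assumed):

```python
def minmax_normalize(values):
    """Linearly scale a list of endpoint values to [0, 1].

    Assumes standard min-max scaling, x' = (x - min) / (max - min);
    the paper's exact formula is not shown in this extract."""
    lo, hi = min(values), max(values)
    if hi == lo:                       # constant endpoint: map everything to 0
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]
```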
As mentioned earlier, our working assumption is that the different clinical scales are functions of gross motor movements which are implicitly captured by the various RMK variables recorded by the robotic apparatus. To test this hypothesis, we used a machine learning approach aimed at predicting the clinical scores of a given patient on a given day from the RMK variables measured for that patient on that same day. Models were derived independently for each clinical scale, as different scales may capture different aspects of motor movement and thus require a different subset of RMK variables for effective reconstruction. Each patient contributed at most 6 records to the training set, one for each day on which both an RMK and a clinical assessment were made (days 7, 14, 21, 30 and 90; some patients were also evaluated at day 3). If either the clinical score or any of the RMK variables were missing, that record was excluded from the data set. (The day on which the measurement was made was not included as an independent variable.)
To build robust models, one must guard against over-fitting. Over-fitting arises when the number of features or adjustable parameters in the model substantially exceeds the number of training samples. The presence of excessive features can cause the learning algorithm to focus attention on the idiosyncrasies of the individual samples and lose sight of the broad picture that is essential for generalization beyond the training set. A common solution to this problem is to employ a feature selection algorithm to identify a subset of relevant features and use only them to construct the actual model [
In the present work, we use a feature selection algorithm based on artificial ant colonies that was originally designed to model the biological properties of chemical compounds [
For feature selection, we consider the selection of a variable as a step of the real ant’s path; therefore, the whole path represents a choice of a particular subset of
The path length
As can be seen from this plot, L increases 10-fold for every 0.2-unit decrease in R2 until R2 reaches ~0.2, and at a much greater rate for R2 values below 0.2.
After
The process is repeated for the specified number of ants, and the best selection found is reported. Variables that contribute to good solutions (small
In this work, we used 3,000 ants, and set the initial weights
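A minimal sketch of this style of ant-colony feature selection is shown below. It illustrates the general technique only, not the paper's exact scheme: the evaporation rate, reinforcement rule and run length here are placeholder assumptions (the study itself used 3,000 ants), and the scoring function is supplied by the caller.

```python
import random

def ant_colony_select(score, n_features, k, n_ants=300, evap=0.98, seed=1):
    """Toy ant-colony feature selection (illustrative stand-in only).

    Each feature carries a pheromone weight. Every ant samples k distinct
    features with probability proportional to weight; the sampled subset is
    scored, and its members are reinforced in proportion to that score while
    all weights evaporate slightly each iteration."""
    rng = random.Random(seed)
    weight = [1.0] * n_features
    best, best_score = None, float("-inf")
    for _ in range(n_ants):
        chosen, avail = [], list(range(n_features))
        for _ in range(k):
            # roulette-wheel pick proportional to pheromone weight
            r = rng.uniform(0, sum(weight[i] for i in avail))
            acc, pick = 0.0, avail[-1]
            for i in avail:
                acc += weight[i]
                if acc >= r:
                    pick = i
                    break
            chosen.append(pick)
            avail.remove(pick)
        s = score(chosen)
        if s > best_score:
            best, best_score = sorted(chosen), s
        weight = [w * evap for w in weight]   # pheromone evaporation
        for i in chosen:
            weight[i] += s                    # reinforce by subset quality
    return best, best_score
```

With a score that rewards a hidden set of informative features, the pheromone weights concentrate on those features over successive ants, which is the positive-feedback mechanism the text describes.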
For each candidate set of
Neural networks were chosen because of their ability to capture complex nonlinear relationships. However, neural networks are inherently unstable in that small changes in the training set and/or training parameters can lead to large changes in their generalization performance. A proven way to improve the accuracy of unstable predictors is to create multiple instances of them and aggregate their predictions [
In the present work, each subset of features identified by the artificial ant algorithm was used to construct 10 independent neural network models using exactly the same network topology and training parameters but a different random seed number (and thus different initial synaptic parameters and presentation sequence of the training samples). The predictions of these 10 models were averaged to produce the aggregate prediction of the ensemble, as illustrated in
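The ensembling step can be illustrated with a toy example. The network below is a generic one-hidden-layer regression net written in NumPy, not the actual model used in the study; as described above, all ensemble members share the same topology and training parameters, only the random seed differs, and their predictions are averaged.

```python
import numpy as np

def train_net(X, y, hidden=2, seed=0, epochs=2000, lr=0.1):
    # generic one-hidden-layer tanh regression net, batch gradient descent;
    # the seed controls the initial synaptic weights
    rng = np.random.default_rng(seed)
    n, d = X.shape
    W1 = rng.normal(0.0, 0.5, (d, hidden)); b1 = np.zeros(hidden)
    W2 = rng.normal(0.0, 0.5, hidden);      b2 = 0.0
    for _ in range(epochs):
        H = np.tanh(X @ W1 + b1)             # hidden activations
        err = (H @ W2 + b2) - y              # residuals
        dH = np.outer(err, W2) * (1 - H ** 2)
        W2 -= lr * H.T @ err / n;  b2 -= lr * err.mean()
        W1 -= lr * X.T @ dH / n;   b1 -= lr * dH.mean(axis=0)
    return lambda Xn: np.tanh(Xn @ W1 + b1) @ W2 + b2

def ensemble_predict(Xtrain, ytrain, Xnew, n_models=10):
    # identical topology and training parameters; only the seed differs,
    # and the member predictions are averaged
    nets = [train_net(Xtrain, ytrain, seed=s) for s in range(n_models)]
    return np.mean([net(Xnew) for net in nets], axis=0)
```

Averaging reduces the seed-to-seed variance of individual networks, which is the stabilizing effect the text attributes to aggregation.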
Following common practice, the quality of the models was assessed using 10-fold (leave-10%-out) cross-validation, and quantified using the cross-validated correlation coefficient,
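The 10-fold cross-validated R2 can be computed as in the following sketch, here using an ordinary least-squares model as a stand-in predictor in place of the neural network ensemble:

```python
import numpy as np

def r2_cv(X, y, k=10, seed=0):
    """10-fold (leave-10%-out) cross-validated R^2: pool the out-of-fold
    predictions, then R^2 = 1 - SS_res / SS_tot over the pooled values.
    A linear least-squares model stands in for the actual predictor."""
    rng = np.random.default_rng(seed)
    idx = rng.permutation(len(y))
    pred = np.empty_like(y, dtype=float)
    for test in np.array_split(idx, k):
        train = np.setdiff1d(idx, test)
        A = np.c_[X[train], np.ones(len(train))]
        coef, *_ = np.linalg.lstsq(A, y[train], rcond=None)
        pred[test] = np.c_[X[test], np.ones(len(test))] @ coef
    return 1 - np.sum((y - pred) ** 2) / np.sum((y - y.mean()) ** 2)
```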
Feature selection, neural network modeling and cross-validation were implemented in the C++ and C# programming languages and are part of the DirectedDiversity® [
The specific study from which the data has been collected has been expressly approved by the MIT Committee on the Use of Humans as Experimental Subjects (COUHES), the Burke Rehabilitation Hospital IRB, and the NHS National Patient Safety Agency / Gardiner Institute Western Infirmary of Glasgow University IRB. All participants provided written consent to participate in this study, and copies of their signed consent forms have been archived. This consent procedure was approved by all the aforementioned ethics committees/IRBs.
Our trial had two primary goals: 1) test whether the RMK metrics can predict the clinical scales with sufficient accuracy to serve as their surrogates for measuring impairment and recovery in an objective and reproducible manner, and 2) test whether it is possible to design a more sensitive RMK-based endpoint to measure effect size and thus reduce the sample size of future clinical trials. Endpoint sensitivity was assessed using the standardized paired effect size, defined as the mean divided by the standard deviation of the day 7 to day 90 changes, aggregated over all patients.
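The standardized paired effect size defined above is straightforward to compute:

```python
import numpy as np

def paired_effect_size(day7, day90):
    # standardized paired effect size: mean / SD of within-patient changes
    diff = np.asarray(day90, dtype=float) - np.asarray(day7, dtype=float)
    return diff.mean() / diff.std(ddof=1)
```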
To enable these analyses, we identified two complementary patient populations: 1) those with complete data (i.e., no missing values) for days 7 and 90 for all 35 RMK variables and all four clinical scales (87 patients, 67 from Burke and 20 from Glasgow, hereafter referred to as
 | | Completers | | | | | | Non-Completers | | | | | | Total | |
Category | Variable | Count | Min | Max | Mean | StdDev | Median | Count | Min | Max | Mean | StdDev | Median | Count | p-value |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
Demographics | Age | 87 | 29 | 96 | 70.552 | 13.860 | 73 | 121 | 22 | 97 | 73.521 | 13.522 | 76 | 208 | 1.3E-01 |
Sex | Male | 43 | 63 | 106 | ||||||||||||
Sex | Female | 44 | 58 | 102 | ||||||||||||
Ethnicity | Caucasian | 66 | 98 | 164 | ||||||||||||
Ethnicity | Hispanic | 6 | 8 | 14 | ||||||||||||
Ethnicity | Asian | 1 | 2 | 3 | ||||||||||||
Ethnicity | African American | 14 | 13 | 27 | ||||||||||||
Handedness | Right | 58 | 80 | 138 | ||||||||||||
Handedness | Left | 8 | 10 | 18 | ||||||||||||
Handedness | Left / Right Writing | 0 | 1 | 1 | ||||||||||||
Handedness | Unknown | 21 | 30 | 51 | ||||||||||||
Affected Side | Right Body | 32 | 40 | 72 | ||||||||||||
Affected Side | Left Body | 35 | 51 | 86 | ||||||||||||
Affected Side | Unknown | 20 | 30 | 50 | ||||||||||||
Site | Burke | 67 | 79 | 146 | ||||||||||||
Site | Glasgow | 20 | 42 | 62 | ||||||||||||
Clinical Scales | NIH Admission | 51 | 1 | 27 | 10.860 | 6.470 | 11 | 77 | 1 | 27 | 10.390 | 5.910 | 9 | 128 | 6.8E-01 |
FM Day 7 | 87 | 4 | 66 | 40.250 | 21.650 | 45 | 91 | 4 | 66 | 38.490 | 21.930 | 41 | 178 | 5.9E-01 | |
MP Day 7 | 87 | 2 | 70 | 45.490 | 20.140 | 50 | 73 | 0 | 70 | 39.290 | 21.710 | 43 | 160 | 6.5E-02 | |
NIH Day 7 | 87 | 0 | 24 | 5.750 | 4.270 | 5 | 93 | 0 | 21 | 6.490 | 4.860 | 6 | 180 | 2.8E-01 | |
FM | 403 | 4 | 66 | 48.790 | 20.030 | 58 | 391 | 0 | 66 | 44.430 | 22.600 | 54 | 794 | 4.2E-03 | |
MP | 402 | 2 | 70 | 52.270 | 17.830 | 56 | 314 | 0 | 70 | 45.360 | 21.500 | 53 | 716 | 5.2E-06 | |
NIH | 404 | 0 | 24 | 3.540 | 3.720 | 2 | 408 | 0 | 21 | 4.590 | 4.500 | 3 | 812 | 3.1E-04 | |
MR | 165 | 0 | 5 | 2.350 | 1.280 | 2 | 129 | 0 | 5 | 2.820 | 1.250 | 3 | 294 | 1.7E-03 | |
RMK Metrics | Aim Aff | 404 | 0.003 | 1 | 0.176 | 0.122 | 0.144 | 377 | 0 | 0.79 | 0.210 | 0.144 | 0.172 | 781 | 4.1E-04 |
Aim NonAff | 404 | 0 | 0.931 | 0.245 | 0.138 | 0.215 | 380 | 0.031 | 1 | 0.272 | 0.151 | 0.245 | 784 | 9.3E-03 | |
Deviation Aff | 402 | 0.005 | 0.848 | 0.093 | 0.106 | 0.06 | 381 | 0 | 1 | 0.128 | 0.164 | 0.072 | 783 | 4.5E-04 | |
Deviation NonAff | 404 | 0 | 1 | 0.092 | 0.101 | 0.067 | 377 | 0.001 | 0.873 | 0.122 | 0.141 | 0.078 | 781 | 7.2E-04 | |
Dist Subm Aff | 404 | 0.01 | 1 | 0.364 | 0.126 | 0.358 | 368 | 0 | 0.754 | 0.374 | 0.135 | 0.369 | 772 | 2.9E-01 | |
Dist Subm NonAff | 404 | 0.036 | 1 | 0.344 | 0.129 | 0.334 | 381 | 0 | 0.749 | 0.376 | 0.137 | 0.369 | 785 | 8.0E-04 | |
Dur Subm Aff | 404 | 0.173 | 1 | 0.493 | 0.134 | 0.491 | 369 | 0 | 0.831 | 0.479 | 0.140 | 0.486 | 773 | 1.6E-01 | |
Dur Subm NonAff | 404 | 0.262 | 1 | 0.600 | 0.134 | 0.589 | 381 | 0 | 0.977 | 0.612 | 0.130 | 0.613 | 785 | 2.0E-01 | |
Duration Aff | 404 | 0.044 | 1 | 0.234 | 0.134 | 0.201 | 380 | 0 | 0.894 | 0.266 | 0.162 | 0.226 | 784 | 2.8E-03 | |
Duration NonAff | 404 | 0.035 | 0.855 | 0.207 | 0.126 | 0.175 | 381 | 0 | 1 | 0.257 | 0.149 | 0.227 | 785 | 5.2E-07 | |
Ellipse Aff | 403 | 0.999 | 1 | 1.000 | 0.000 | 1 | 379 | 0 | 1 | 0.992 | 0.089 | 1 | 782 | 8.1E-02 | |
Ellipse NonAff | 404 | 0.002 | 1 | 0.755 | 0.183 | 0.818 | 380 | 0 | 0.988 | 0.734 | 0.188 | 0.793 | 784 | 1.1E-01 | |
Max Subm Aff | 404 | 0.031 | 0.77 | 0.339 | 0.133 | 0.323 | 369 | 0 | 1 | 0.319 | 0.156 | 0.292 | 773 | 5.7E-02 | |
Max Subm NonAff | 404 | 0.018 | 0.775 | 0.321 | 0.152 | 0.303 | 381 | 0 | 1 | 0.283 | 0.160 | 0.258 | 785 | 6.9E-04 | |
Mean Speed Aff | 404 | 0.007 | 0.689 | 0.299 | 0.111 | 0.29 | 380 | 0 | 1 | 0.281 | 0.138 | 0.267 | 784 | 4.5E-02 | |
Mean Speed NonAff | 404 | 0.008 | 1 | 0.318 | 0.136 | 0.313 | 377 | 0 | 0.814 | 0.279 | 0.146 | 0.269 | 781 | 1.3E-04 | |
Mean Z Aff | 404 | 0.816 | 1 | 0.849 | 0.029 | 0.842 | 328 | 0 | 0.914 | 0.829 | 0.094 | 0.835 | 732 | 2.2E-04 | |
Numb Subm Aff | 404 | 0 | 0.858 | 0.181 | 0.128 | 0.156 | 369 | 0.008 | 1 | 0.218 | 0.156 | 0.185 | 773 | 3.6E-04 | |
Numb Subm NonAff | 404 | 0 | 1 | 0.161 | 0.133 | 0.124 | 381 | 0.006 | 0.922 | 0.213 | 0.152 | 0.179 | 785 | 4.5E-07 | |
Overlap Subm Aff | 404 | 0.056 | 1 | 0.453 | 0.128 | 0.443 | 367 | 0 | 0.761 | 0.432 | 0.125 | 0.434 | 771 | 2.2E-02 | |
Overlap Subm NonAff | 404 | 0.16 | 1 | 0.491 | 0.149 | 0.482 | 379 | 0 | 0.93 | 0.482 | 0.130 | 0.48 | 783 | 3.7E-01 | |
Peak Speed Aff | 404 | 0.05 | 0.833 | 0.397 | 0.133 | 0.377 | 380 | 0 | 1 | 0.379 | 0.152 | 0.36 | 784 | 7.9E-02 | |
Peak Speed NonAff | 404 | 0.057 | 0.801 | 0.359 | 0.150 | 0.343 | 377 | 0 | 1 | 0.318 | 0.155 | 0.306 | 781 | 1.9E-04 | |
Plbck Mean Aff | 404 | 0 | 0.961 | 0.170 | 0.171 | 0.098 | 377 | 0.003 | 1 | 0.212 | 0.188 | 0.147 | 781 | 1.2E-03 | |
Plbck Mean NonAff | 403 | 0.003 | 0.823 | 0.139 | 0.170 | 0.063 | 383 | 0 | 1 | 0.164 | 0.167 | 0.089 | 786 | 3.8E-02 | |
Rnd Dyn Mean Dist Aff | 404 | 0.017 | 0.975 | 0.728 | 0.258 | 0.866 | 380 | 0 | 1 | 0.670 | 0.300 | 0.86 | 784 | 3.9E-03 | |
Rnd Dyn Mean Dist NonAff | 404 | 0 | 0.982 | 0.770 | 0.120 | 0.797 | 381 | 0.106 | 1 | 0.752 | 0.148 | 0.796 | 785 | 6.3E-02 | |
Sigma Subm Aff | 404 | 0.165 | 1 | 0.456 | 0.124 | 0.46 | 369 | 0 | 0.801 | 0.436 | 0.124 | 0.447 | 773 | 2.5E-02 | |
Sigma Subm NonAff | 404 | 0.212 | 1 | 0.553 | 0.132 | 0.546 | 381 | 0 | 0.905 | 0.550 | 0.117 | 0.55 | 785 | 7.4E-01 | |
Smooth J1 Aff | 404 | 0.029 | 0.942 | 0.170 | 0.101 | 0.143 | 375 | 0 | 1 | 0.192 | 0.128 | 0.154 | 779 | 8.2E-03 | |
Smooth J1 NonAff | 404 | 0 | 0.466 | 0.123 | 0.059 | 0.114 | 381 | 0.005 | 1 | 0.128 | 0.091 | 0.109 | 785 | 3.6E-01 | |
Smooth J2 Aff | 404 | 0 | 0.588 | 0.114 | 0.075 | 0.096 | 378 | 0.003 | 1 | 0.121 | 0.118 | 0.086 | 782 | 3.3E-01 | |
Smooth J2 NonAff | 404 | 0 | 0.39 | 0.084 | 0.055 | 0.074 | 380 | 0 | 1 | 0.076 | 0.077 | 0.059 | 784 | 9.6E-02 | |
Smooth M/P Aff | 404 | 0 | 0.967 | 0.579 | 0.129 | 0.601 | 380 | 0.111 | 1 | 0.550 | 0.150 | 0.562 | 784 | 3.9E-03 | |
Smooth M/P NonAff | 404 | 0 | 1 | 0.500 | 0.135 | 0.524 | 377 | 0.017 | 0.848 | 0.470 | 0.147 | 0.484 | 781 | 3.1E-03 |
RMK metrics are normalized across all patients and assessment points. Statistics for clinical scales and RMK metrics are based on the total number of patient assessments. p-values that indicate a statistically significant difference between completers and non-completers are highlighted in red.
We were well aware that in a relatively large group of patients 7 days post stroke the FM might include an occasional patient who showed a ceiling effect for the measurement [
An intuitive way to visualize the correlation structure of the RMK data set is to embed the 35 robotic and four clinical variables into a two-dimensional nonlinear map in a way that preserves as much as possible the pairwise correlations between them. The map shown in
The map was derived by computing the pairwise Pearson correlation coefficients (R) for all pairs of features, converting them to correlation distances (1-abs(R)), and embedding the resulting matrix into 2 dimensions in such a way that the distances of the points on the map approximate as closely as possible the correlation distances of the respective features. The clinical parameters are highlighted in red, the RMK parameters on the affected side in blue, and the RMK parameters on the unaffected side in green. The map also shows distinct clusters of correlated variables which are preserved on both the affected and unaffected sides (outlined by green and blue ellipses, respectively).
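A map of this kind can be approximated as follows; classical metric MDS is used here as a stand-in for the stochastic proximity embedding employed in the paper, but the distance definition (1 - abs(R)) is the same:

```python
import numpy as np

def correlation_map(data, n_dim=2):
    """Embed the variables (columns of `data`) in 2-D so that map distances
    approximate the correlation distances 1 - |R|.  Classical (Torgerson)
    MDS stands in for the paper's stochastic proximity embedding."""
    R = np.corrcoef(data, rowvar=False)
    D = 1 - np.abs(R)                         # correlation distance matrix
    n = D.shape[0]
    J = np.eye(n) - np.ones((n, n)) / n       # centering matrix
    B = -0.5 * J @ (D ** 2) @ J               # double-centered Gram matrix
    vals, vecs = np.linalg.eigh(B)
    order = np.argsort(vals)[::-1][:n_dim]    # top n_dim components
    return vecs[:, order] * np.sqrt(np.maximum(vals[order], 0))
```

Highly correlated variables land close together on the map, which is what produces the clusters of affected- and unaffected-side metrics described above.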
Several observations emerge from this map. First, the four clinical scales (highlighted in red) show a substantial degree of correlation to each other as compared to the majority of the RMK variables, with FM and MP exhibiting very similar correlation profiles and being highly correlated themselves (R = 0.933). This is consistent with the findings of Bosecker et al [
Second, the RMK variables on the affected side (in blue) exhibit substantially greater correlation to the clinical scales compared to the non-affected side (in green). Among all the RMK metrics,
Given the degree of redundancy among the RMK metrics, we used principal component analysis (PCA) to estimate the number of underlying independent variables and thus the intrinsic dimensionality of the RMK data. The first 3 PCs account for 59% of the total variance in the data, while 10, 14 and 22 PCs are required to reach the 90%, 95% and 99% levels, respectively. (Note that the limited size of our data set precluded the use of more elaborate geodesic approaches for detecting nonlinear manifolds, such as isometric SPE [
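The dimensionality estimate can be reproduced with a few lines of linear algebra (PCA via SVD of the centered data matrix):

```python
import numpy as np

def n_components_for(data, threshold=0.90):
    # number of principal components needed to capture the given
    # fraction of total variance
    Xc = data - data.mean(axis=0)
    s = np.linalg.svd(Xc, compute_uv=False)
    var = s ** 2
    cum = np.cumsum(var) / var.sum()
    return int(np.searchsorted(cum, threshold) + 1)
```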
Models were derived independently for each clinical scale, using the completer population for training and cross-validation, and the non-completer population for external validation. Since the number of optimal features is not known
The models with two hidden units were slightly better than those with one and virtually identical to those with three, so the remaining discussion is based on the models with two hidden units. Similarly, other training parameters, such as momentum, initial synaptic weights and number of training epochs, had minimal impact on the generalization error and were set to the values outlined in the Methods section. Although, as we discuss later, there were distinct differences among clinical scales, all models showed good predictive power, with the cross-validated R2s ranging from 0.48 to 0.73 for individual networks, and 0.50 to 0.75 for network ensembles. Model aggregation improved the results in all cases, both in terms of predictive ability (the R2CVs of the ensembles were on average 0.02 units greater than those of the corresponding individual predictors) and robustness (the standard deviation was reduced by a factor of two to three). The training R2s were on average 0.05 units higher than the cross-validated R2s.
The results are summarized in
The figure shows the ability of the robot-derived RMK models to predict the clinical scales with an increasing number of features. The model performance exhibits asymptotic behavior with respect to the number of RMK features, reaching the point of diminishing returns at approximately 8 features for all four clinical scales. Note the small variance in the predictions on the training data, as shown by the small “whiskers,” which, for the most part, are not visible in the figure.
More importantly, the ensemble models retain much of their predictive power on the non-completer population, as illustrated by the dotted lines in
Our models are markedly better than those derived by Bosecker et al. on patients with chronic stroke [
As can be seen in
Every model was cross-validated using the same 10-fold cross-validation procedure described in the Methods section.
One of the goals of our analysis was to gain a more quantitative understanding of what each of the clinical instruments is trying to measure, how they differ from each other, where they fall short, and how we can design alternative scales with greater sensitivity and ability to detect finer differences in motor function.
Given the complex correlation structure of the RMK metrics, the features that are selected by the feature selection algorithm are not necessarily the only ones that can produce a high quality model. The SPE map in
To assess the importance of each feature in predicting the various clinical scales, we systematically removed each feature from our training sample and repeated the feature selection, aggregation and cross-validation procedure for each derived data set. Given the computationally intensive nature of this exercise and our previous observations regarding the optimal number of features and hidden units, this process was only tested with models with 8 input and 2 hidden units. The results are summarized in
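This leave-one-feature-out analysis can be sketched as follows, using a plain least-squares R2 as a stand-in for the full feature-selection, aggregation and cross-validation pipeline that the paper reruns for each omitted feature:

```python
import numpy as np

def ols_r2(X, y):
    # training R^2 of an ordinary least-squares fit (stand-in scorer)
    A = np.c_[X, np.ones(len(y))]
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)
    resid = y - A @ coef
    return 1 - resid @ resid / ((y - y.mean()) @ (y - y.mean()))

def feature_importance(X, y, fit_score=ols_r2):
    # leave-one-feature-out importance: drop in score when feature j
    # is withheld from the candidate pool
    full = fit_score(X, y)
    return {j: full - fit_score(np.delete(X, j, axis=1), y)
            for j in range(X.shape[1])}
```

A large score drop marks a feature whose information is not redundantly carried by the remaining variables, which is exactly the distinction the apo-model analysis is designed to draw.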
Only models with 8 input and 2 hidden units are shown. The left-most data point and the horizontal solid line on the top part of the plot represent the cross-validated R2 of the best model with all features included, averaged over all 10 cross-validation runs (standard deviations shown as error bars). Each subsequent point shows the R2 of the corresponding apo model, i.e., the model derived by omitting the feature shown on the x axis (the model still includes 8 features, just not the one shown on the x axis). The individual markers at the bottom part of the plot indicate whether there is a statistically significant difference between the R2 distributions of the all-feature and the respective apo models (the presence of a marker indicates that the difference between the two distributions is statistically significant, and the absence that it is not).
For FM and MP, the most critical feature is
A number of features appear significant for NIH, including
At this point, we have demonstrated that a small number of RMK invariants can predict the clinical scales with sufficient accuracy to serve as a proxy for measuring impairment in an objective and unbiased manner. As seen in
The horizontal lines show the day 7 to day 90 effect size for comparable patients of the historical VISTA data for the NIH, as well as the effect sizes for the NIH, FM and MP assessment scales for our
Thus, our second goal was to determine whether we could improve the sensitivity of the clinical endpoints by means of a novel RMK-based composite that could be used to measure effect size in future clinical trials. We have already seen that motor impairment plays a dominant role in all clinical scales, but the relative weight of its components differs from one scale to another. We hypothesized that by rebalancing these weights we could detect finer improvements in a patient’s condition over a short period of time. Therefore, we sought to create a composite scale made exclusively of RMK metrics and determine whether it could improve our ability to distinguish patient improvement from day 7 to day 90. While no treatment was administered, our assumption was that there would be some natural recovery during the acute and sub-acute stroke phase.
Effect size was assessed using Cohen’s d for paired observations, defined as the mean divided by the standard deviation of the day 7 to day 90 changes over the entire completer population. The composite itself is defined as a linear combination of RMK features:
We solved this problem using a greedy forward selection algorithm. Briefly, the algorithm constructed composites by adding one feature at a time until all 8 preselected RMK endpoints were included. The process started by identifying the feature that yielded the maximum effect size and assigning to it a weight of 1. Each remaining feature was then examined in turn, and the one that yielded the largest effect size in combination with the previously selected feature was added to the composite. The algorithm continued in this fashion, progressively building larger composites until all 8 features were included. At each step, each candidate feature was evaluated using 18 discrete weights ranging from -1 to +1 in increments of 0.1, while keeping the coefficients of the already selected features at their previously optimized values. Once the feature was selected, the weights of all the features in the current composite were refined in an iterative fashion until the effect size no longer improved. (An alternative backward elimination algorithm was also employed but produced inferior results. That method started by including every preselected RMK endpoint in the composite and optimizing their coefficients using the Newton-Raphson gradient minimization procedure. The feature with the smallest weight was then identified and removed from the composite, the weights of the remaining features were re-optimized, and the process continued in the same fashion until a single feature remained.)
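The forward-selection procedure can be sketched as follows. This is a simplified version of the algorithm described above: it uses a 21-point weight grid (including 0) and omits the iterative refinement pass applied after each feature is added.

```python
import numpy as np

def build_composite(delta):
    """Greedy forward construction of a weighted composite that maximizes
    the paired effect size (mean / SD of per-patient composite changes).
    `delta` holds the day 7 to day 90 changes, one row per patient.
    Simplified sketch: fixed weight grid, no refinement pass."""
    grid = np.round(np.arange(-1.0, 1.05, 0.1), 1)   # -1.0, -0.9, ..., 1.0

    def effect(w):
        c = delta @ w
        sd = c.std(ddof=1)
        return c.mean() / sd if sd > 0 else 0.0

    w = np.zeros(delta.shape[1])
    remaining = set(range(delta.shape[1]))
    while remaining:
        # try every remaining feature at every grid weight, keep the best
        _, j, wt = max((effect(np.where(np.arange(len(w)) == j, wt, w)), j, wt)
                       for j in remaining for wt in grid)
        w[j] = wt
        remaining.discard(j)
    return w, effect(w)
```

Because the grid includes a zero weight, each step can only maintain or improve the running effect size, so the final composite is never worse than the best single feature.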
As with the prediction of clinical scales, cross-validation is necessary to ensure that the resulting composites are meaningful beyond the training set. Thus, for each of the three groups of 8 features used in the most predictive models of MP, FM, and NIH, respectively, the forward selection algorithm was repeated 100 times, each time using a different, randomly chosen 80% of the patients to build up the composites and reserving the remaining 20% for testing.
The results are summarized in
For the RMK measurements, no single feature performs as well as the clinical scales, which is not surprising given that the latter encapsulate multiple RMK measures, as demonstrated earlier. The effect size increases sharply as additional features are added, exceeds the clinical scales by as much as 30% for the training and 20% for the validation set with only four features, and then plateaus, offering little additional improvement. As expected, the effect size is significantly lower for the validation than for the training set and is also more variable.
In order to test the sensitivity of these robot-assisted surrogate markers against historical data, we selected a subset of 2,937 patients from the Virtual International Stroke Trials Archive (VISTA) [
As seen in
While the underlying RMK metrics were identical to those used to reconstruct the NIH stroke scale, our modeling effort reweighted each of the individual components, enhancing our ability to detect smaller improvements in our patient population. One surprising aspect of our modeling was that sub-movements do not seem to play a significant role in the new composites. We speculate that sub-movement features might be less important when dealing with coarse aspects of motor abilities and the clinical scales generally employed to measure them.
As illustrated in
These results strongly suggest that robotic measurement of motor function may be a viable and improved method for capturing clinical outcomes compared with traditional clinician-rated measures, and can greatly reduce the sample size required for future clinical trials, thus improving study cost and efficiency. The computational methodology described in this work is not limited to stroke; it can also be applied to a broad range of problems in medical diagnostics and remote monitoring. While the general clinical findings were described in our earlier publication, the present paper offers greater insight into the relative significance of strength and coordination, and the importance of each robotic feature in capturing different aspects of stroke recovery.
In this work, we have established that we can replicate the traditional stroke evaluation scales from robotic measurements with a high degree of accuracy, and that a straightforward re-weighting of the features needed to reconstruct the traditional scales can yield a novel composite that is significantly more sensitive than the traditional scales for measuring improvement over time. This allowed us to establish that, for a fixed time interval, we can greatly reduce the number of patients needed to power a clinical trial.
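The sample-size implication can be illustrated with a standard normal-approximation power calculation for a paired pre-post comparison (this calculation is ours, not the paper's): with the historical effect size of 0.67 roughly 18 patients are needed for 80% power at a two-sided alpha of 0.05, versus about 4 with the composite's effect size of 1.47.

```python
from math import ceil
from statistics import NormalDist

def paired_sample_size(d, alpha=0.05, power=0.80):
    """Patients needed to detect a standardized paired effect size d with a
    one-sample z-test: n = ((z_{1-alpha/2} + z_{power}) / d)^2.
    Normal approximation; a t-based calculation would add a few patients."""
    z = NormalDist().inv_cdf
    return ceil(((z(1 - alpha / 2) + z(power)) / d) ** 2)
```

For example, `paired_sample_size(0.67)` returns 18 while `paired_sample_size(1.47)` returns 4, a more than four-fold reduction for the same power.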
Our current work has established a composite that works well over the 90-day assessment window. However, given sufficient amounts of data, it would be possible to tune the composite to a patient’s individual level of impairment. The reason for this is that stroke recovery may not progress in a consistent fashion, and early recovery from stroke may register more sensitively on some metrics than on others. Being able to identify and pre-specify sensitive composites over a specific range of severities would allow us to better tailor them to a patient population and further shrink the number of patients needed for a given clinical trial. Given the design of this trial and the confounding issue of patient improvement over time, it was not possible for us to assess the inherent variability in human performance. Answering this question could provide valuable insight into how long a trial needs to run for patients to have a good chance of showing functional improvement.
The results described above are extremely promising but must be interpreted with appropriate caution. The population enrolled in our study was highly selected, with a day 7 mean NIH score of 5.7 ± 4.1. Clearly, any gains in statistical power will need to be balanced against lower enrollment rates imposed by the selection criteria and against potential failure to complete follow-up or to comply with the RMK measurements. Additionally, while the RMK lends itself to repeated assessments that can be averaged to reduce variability, similar gains might be achieved with ordinal analysis, central adjudication by multiple raters, and global testing procedures that combine complementary scales across clinical domains and across time (albeit at much greater expense). Finally, there was substantially greater improvement in the NIH scores achieved by our current pool of
Despite the generally limited penetration of robotic technologies in the post-stroke neurorehabilitation arena (only 200 InMotion Arm robots have been produced so far), taken together, our results suggest that robotic measurements may enable early decision making in clinical testing, reduce required sample sizes, and offer a more reliable method to track longitudinal change in patients affected by stroke than using current clinical instruments. More importantly, this study marks a novel beginning for technology-enabled measurement of outcomes, and offers a proof-of-principle for other robotic and wearable devices potentially affording further improvements and efficiencies.