Discriminative Ability and Predictive Validity of the Timed Up and Go Test in Identifying Older People Who Fall: Systematic Review and Meta-Analysis
Abstract
Objectives
To investigate the discriminative ability and diagnostic accuracy of the Timed Up and Go Test (TUG) as a clinical screening instrument for identifying older people at risk of falling.
Design
Systematic literature review and meta-analysis.
Setting and Participants
People aged 60 and older living independently or in institutional settings.
Measurements
Studies were identified with searches of the PubMed, EMBASE, CINAHL, and Cochrane CENTRAL databases. Retrospective and prospective cohort studies comparing the times taken by fallers and non-fallers to complete any version of the TUG were included.
Results
Fifty-three studies with 12,832 participants met the inclusion criteria. The pooled mean difference between fallers and non-fallers depended on the functional status of the cohort investigated, ranging from 0.63 seconds (95% confidence interval (CI) = 0.14–1.12 seconds) for high-functioning cohorts to 3.59 seconds (95% CI = 2.18–4.99 seconds) for those in institutional settings. The majority of studies did not retain TUG scores in multivariate analysis. Derived cut-points varied greatly between studies, and with the exception of a few small studies, diagnostic accuracy was poor to moderate.
Conclusion
The findings suggest that the TUG is not useful for discriminating fallers from non-fallers in healthy, high-functioning older people but is of more value in less-healthy, lower-functioning older people. Overall, the predictive ability and diagnostic accuracy of the TUG are at best moderate. No cut-point can be recommended. Quick, multifactorial fall risk screens should be considered to provide additional information for identifying older people at risk of falls.
Falls are a major public health problem affecting at least one-third of people aged 65 and older.1 Early identification of older people at risk of falling is crucial for timely initiation of falls prevention strategies. Many falls occur while older people are performing mobility tasks, such as transfers (e.g., getting up from a chair) and walking.2-4 It is therefore not surprising that mobility assessment tools have been identified as an important component in determining fall risk of older individuals.
In community settings, the Timed Up and Go Test (TUG) has been recommended as a simple fall risk screening tool, primarily to identify people warranting more-detailed assessment of gait and balance.5, 6 It measures the time taken (in seconds) for a person to rise from a chair with armrests, walk 3 m with usual assistive devices, turn, return to the chair, and sit down.7 It has been proposed as a more-objective, quantitative version of the original Get Up and Go Test.8 In the most widely used version of the TUG, people are asked to complete the task at a comfortable walking speed,7 but some modified versions have been developed that include walking as fast as possible;9 adding cognitive and motor tasks;10, 11 measuring separate times for completing different component tasks;12 using an armless chair;13 walking around a cone;9 placing an additional chair at the 3-m line;14 walking around the chair before sitting down;15 and using distances of 8 feet (2.44 m),9 5 m,16 and 10 m.12
The TUG is a composite measure of functional mobility. It includes transfer tasks (standing up and sitting down), walking, and turning and therefore incorporates neuromuscular components such as power, agility, and balance.9, 17 TUG performance has been shown to be poorer in older age and in at-risk populations, including those with cognitive impairment,12, 18-21 but no sex differences have been demonstrated.22 Poor TUG performance has been associated with poor muscle strength, poor balance, slow gait speed, fear of falling, physical inactivity, and impairments relating to basic and instrumental activities of daily living (ADLs).7, 23, 24 Each of these associated factors is also a known risk factor for falls in older people,25, 26 so it seems logical that the TUG would be useful for discriminating between fallers and non-fallers.
Two systematic reviews have recently been published on the discriminative ability of the TUG in relation to fall risk in older people.27, 28 One found that TUG performance should not be used to discriminate between persons with a high or low fall risk.28 The other concluded that, although TUG performance was consistently associated with past falls, findings were less consistent for future falls.27 The two reviews contained different study sets because they had different study aims and search strategies, and both reviews included a small number of studies. Both reviews also included case–control studies, even though they provide incomplete control of potentially important variables and are likely to overestimate test diagnostic accuracy.29
To gain a more-accurate indication of the discriminative ability and diagnostic accuracy of the TUG as a clinical screening instrument for identifying older people at risk of falling, a comprehensive systematic review and meta-analysis of the TUG, including its several modifications and the different published cut-points, was conducted. To maximize the utility of the review, case–control studies were excluded, restricting the search to retrospective and prospective cohort studies that included samples of older populations.
Methods
Literature Search Strategy
Four electronic databases (PubMed, EMBASE, CINAHL, Cochrane CENTRAL) were searched for articles published from their inceptions to September 2011. Medical subject heading terms and key words used included different variations of the Timed Up and Go Test and fall-related outcome measures ((‘TUG’ OR ‘Timed Up and Go’ OR ‘Timed-Up-and-Go’ OR ‘Timed-Up and Go’ OR ‘Timed Up & Go’ OR ‘Timed-Up-&-Go’ OR ‘Timed Up-and-Go’ OR ‘Timed Up-&-Go’) AND (‘accidental falls’ OR ‘fall’ OR ‘falls’ OR ‘faller’ OR ‘fallers’ OR ‘time-to-first’)). No language or other restrictions were applied to this initial search. Reference lists of included studies and review articles were also searched for further appropriate articles.
Inclusion Criteria
Articles were included if they met the following criteria: a timed version of the Up and Go Test was used and its administration procedure was sufficiently described (description of the test in the article, citation of an original reference, or contact with the author); 95% of participants were aged 60 and older or were recruited from geriatric institutions; a comparison of fallers and non-fallers classified according to fall events was included; information regarding setting was provided; and the study design was a prospective or retrospective cohort study or randomized controlled trial. Articles published in languages other than Dutch, English, French, or German were excluded.
Quality Assessment
Prospective study design with follow-up periods of 12 months (independent living), 6 months (care institution), or individual length of stay (hospital); at least monthly intervals of reporting falls (to minimize reporting bias); and sample size of 150 or more (to provide more-precise results with smaller confidence intervals and less likelihood of incorrect conclusions) were considered to be indicators of high-quality studies investigating diagnostic accuracy in falls. The Quality Assessment of Diagnostic Accuracy in Systematic Reviews (QUADAS) tool was also used to provide a general standardized rating of methodological quality.30 Representativeness was assessed using sampling (random) and response rate (>70%) (Appendix S4).
Data Extraction and Analysis
Two independent reviewers performed initial screening (abstracts and full-text if necessary) and data extraction, and any disagreements were solved by discussion involving a third person. In some cases, information was sourced from more than one publication that described an included study. Authors were contacted if eligibility could not be determined from the published articles. Fields of interest were study design, sample size, age, demographic characteristics, setting, country or ethnicity of study population, fall outcome, fall definition, method of fall ascertainment, description of the TUG version, proportion of fallers or falls rate, and timed results of the TUG.
For description of outcomes, studies were classified according to the method of analysis used. Means or medians were used for discrimination between groups, and regression procedures (logistic, Poisson, negative binomial) were used for prediction. For diagnostic accuracy, sensitivity and specificity, the area under the receiver operating characteristic curve (AUC), positive and negative predictive values, positive and negative likelihood ratios, and error rate were considered appropriate.31 Where possible, required values for the review tables were calculated using data provided in the published articles or sought from the authors.
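As an illustration of how the diagnostic accuracy measures listed above relate to one another, the following is a minimal sketch that derives them from a single 2 × 2 classification table. The counts are hypothetical and are not taken from any included study; the class and function names are assumptions introduced only for this example, and the test is assumed to be "positive" when the TUG time exceeds a chosen cut-point.

```python
from dataclasses import dataclass


@dataclass
class TwoByTwo:
    """Cross-classification of TUG result (above/below cut-point) by fall status."""
    tp: int  # fallers classified as at risk (true positives)
    fn: int  # fallers classified as not at risk (false negatives)
    fp: int  # non-fallers classified as at risk (false positives)
    tn: int  # non-fallers classified as not at risk (true negatives)


def accuracy_measures(t: TwoByTwo) -> dict:
    """Return the standard diagnostic accuracy measures for one 2 x 2 table."""
    sensitivity = t.tp / (t.tp + t.fn)
    specificity = t.tn / (t.tn + t.fp)
    total = t.tp + t.fn + t.fp + t.tn
    return {
        "sensitivity": sensitivity,
        "specificity": specificity,
        "ppv": t.tp / (t.tp + t.fp),  # positive predictive value
        "npv": t.tn / (t.tn + t.fn),  # negative predictive value
        "lr_positive": sensitivity / (1 - specificity),
        "lr_negative": (1 - sensitivity) / specificity,
        "error_rate": (t.fp + t.fn) / total,  # proportion misclassified
    }


# Hypothetical counts, for illustration only.
example = TwoByTwo(tp=30, fn=20, fp=25, tn=75)
measures = accuracy_measures(example)
for name, value in measures.items():
    print(f"{name}: {value:.3f}")
```

The sketch does not guard against empty rows or columns (for example, a specificity of exactly 1 would make the positive likelihood ratio undefined), which a real analysis would need to handle.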
Results from regression analyses and tests of diagnostic accuracy were not pooled because of publication bias favoring positive results, many missing data points, and the use of previously published cut-points in some studies (rather than data-driven threshold values). A random-effects meta-analysis of mean differences in time taken to complete the TUG between faller groups was conducted. To pool studies with comparable study samples, separate analyses were conducted for independent-living older people and older people from institutional settings. Separate analyses were also performed for independent-living people categorized as healthy and high-functioning (no cognitive impairment, no use of walking aids, good physical performance) and studies that included a mix of higher- and lower-functioning people. Heterogeneity between studies was assessed using chi-square (χ2) tests (P < .1) and the I2 statistic. Sensitivity analysis was conducted to determine the influence of study design (prospective/retrospective). The level of significance for differences in TUG times between groups was set at 5%. Analyses were conducted using Review Manager (RevMan, Version 5.1; The Nordic Cochrane Centre, The Cochrane Collaboration, Copenhagen, Denmark).
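The pooling step can be sketched as follows. This is a minimal illustration of a random-effects meta-analysis of mean differences, assuming the DerSimonian-Laird estimator of between-study variance; the text does not state which estimator was used, and the code is not a reproduction of the RevMan analysis. All study values below are hypothetical, and the function names are introduced only for this example.

```python
import math


def mean_difference(mean_f, sd_f, n_f, mean_nf, sd_nf, n_nf):
    """Return (mean difference, standard error) for fallers minus non-fallers."""
    diff = mean_f - mean_nf
    se = math.sqrt(sd_f ** 2 / n_f + sd_nf ** 2 / n_nf)
    return diff, se


def dersimonian_laird(effects, ses):
    """Pool study effects with a DerSimonian-Laird random-effects model."""
    weights = [1.0 / se ** 2 for se in ses]  # inverse-variance weights
    sum_w = sum(weights)
    fixed = sum(w * y for w, y in zip(weights, effects)) / sum_w
    # Cochran's Q: weighted squared deviations from the fixed-effect estimate.
    q = sum(w * (y - fixed) ** 2 for w, y in zip(weights, effects))
    df = len(effects) - 1
    c = sum_w - sum(w ** 2 for w in weights) / sum_w
    # Between-study variance, truncated at zero.
    tau2 = max(0.0, (q - df) / c) if c > 0 else 0.0
    re_weights = [1.0 / (se ** 2 + tau2) for se in ses]
    pooled = sum(w * y for w, y in zip(re_weights, effects)) / sum(re_weights)
    se_pooled = math.sqrt(1.0 / sum(re_weights))
    return {
        "pooled": pooled,
        "ci_low": pooled - 1.96 * se_pooled,
        "ci_high": pooled + 1.96 * se_pooled,
        "tau2": tau2,
        "q": q,
        "df": df,
    }


# Hypothetical studies: (mean, SD, n) for fallers, then for non-fallers, in seconds.
studies = [
    (11.0, 3.0, 40, 10.2, 2.5, 80),
    (13.5, 4.0, 60, 11.8, 3.5, 90),
    (9.8, 2.0, 30, 9.5, 2.2, 70),
]
effects, ses = zip(*(mean_difference(*s) for s in studies))
result = dersimonian_laird(effects, ses)
print(f"Pooled mean difference: {result['pooled']:.2f} s "
      f"(95% CI {result['ci_low']:.2f} to {result['ci_high']:.2f})")
```

Running the same function separately on each subgroup of studies (for example, high-functioning, mixed, and institutional samples) mirrors the stratified approach described above.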
Results
Identified Studies
The initial search yielded 443 potentially relevant articles, of which 64 were identified as eligible after applying the selection criteria. Eleven of these were multiple publications, leaving 53 studies (English = 51, French = 1, German = 1) for validity assessment and data extraction (Figure 1). (Complete list of included references provided in Appendix S1.)

Description of Studies
Characteristics of the studies are summarized in Tables 1–3 (Appendix S2). Of the 53 studies reviewed, 43 examined the TUG at normal walking speed and seven at fast walking speeds; one used a distance of 5 m, one placed a second chair at the 3-m line, and in one, participants had to walk around the chair before sitting down again. (These three studies required the participant to walk at normal walking speed.) Twenty-five studies had prospective designs. The control group of a randomized controlled trial was used for analysis in one study. A variety of settings were identified: independently living older adults (40 studies), residents from long-term care facilities (four studies), individuals attending day care (one study), outpatient clinics (two studies), day hospitals (two studies), and geriatric inpatient settings (four studies).
Study Populations
The combined sample size comprised 12,832 people who were able to complete the TUG and provided retrospective or prospective falls data. Sample size of studies varied from 12 to 1,200 participants. Follow-up periods ranged from 4 weeks to 5 years. The following faller categories were used or could be deduced from data provided in the text: fallers versus non-fallers (50 studies), multiple fallers (≥2 falls) versus non-fallers (four studies), multiple fallers versus non-multiple fallers (three studies), single fallers versus multiple fallers (three studies), and two-time fallers versus the remainder (non-fallers, single fallers and ≥3 fallers; one study). One study separated indoor fallers from outdoor fallers.
Methodological Quality
The quality assessment using the QUADAS showed similar results across studies (Appendix S4). Explanation of participant withdrawal is relevant only to prospective studies, so studies with this superior design may have been disadvantaged in fulfilling this criterion compared with retrospective studies. Additional quality assessment using criteria of relevance to fall risk research indicated that only nine studies used representative samples. Fourteen studies followed the recommendations for falls data collection. Overall, eight studies were classified as high-quality trials, of which four included representative samples.
Discrimination of TUG Performance Between Fallers and Non-Fallers
The pooled estimate of the mean difference between fallers and non-fallers in the healthy, higher-functioning samples was 0.63 seconds (95% confidence interval (CI) = 0.14–1.12, P = .01), and the heterogeneity was moderate (χ2 = 12.6, degrees of freedom (df) = 6, P = .05; I2 = 52%) (Figure 2). The pooled estimate of the mean difference between fallers and non-fallers in studies that included a mix of higher- and lower-functioning people living independently was 2.05 seconds (95% CI = 1.47–2.62, P < .001), and the heterogeneity was substantial (χ2 = 50.7, df = 20, P < .001; I2 = 61%) (Figure 2). The pooled estimate of the mean difference between fallers and non-fallers in institutional settings was 3.59 seconds (95% CI = 2.18–4.99, P < .001), and there was no sign of heterogeneity (χ2 = 7.7, df = 8, P = .47; I2 = 0%) (Figure 2). Sensitivity analysis showed slightly lower effects in the two comparisons in independent-living people when the meta-analysis was repeated with studies using only prospective study designs (healthy, higher functioning: mean difference = 0.46 seconds, 95% CI = 0.06–0.87 seconds, P = .03; less healthy, lower functioning: mean difference = 1.86 seconds, 95% CI = 0.96–2.76 seconds, P < .001). In institutional settings, the sensitivity analysis with prospective studies showed only a slightly greater effect (4.23 seconds, 95% CI = 0.09–8.37 seconds, P = .05). There was considerable overlap between fallers, multiple fallers, and non-fallers in TUG times within and between studies independent of the TUG version used, ethnicity, study design (retrospective vs prospective), age, fall definition, and whether the study was conducted in a clinical group (Appendix S3, Figures 4 and 5).
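The reported I2 values can be checked against the reported χ2 statistics and degrees of freedom. The sketch below assumes the usual definition I2 = max(0, (Q − df) / Q) × 100%, where Q is the heterogeneity χ2 statistic; the function name and dictionary labels are introduced only for this example.

```python
def i_squared(q: float, df: int) -> float:
    """Return I2 (in percent) from a heterogeneity chi-square statistic Q and its df."""
    if q <= 0:
        return 0.0
    return max(0.0, (q - df) / q) * 100.0


# Chi-square statistics and degrees of freedom as reported in the text.
reported = {
    "healthy, higher-functioning": (12.6, 6),
    "mixed functioning, independent living": (50.7, 20),
    "institutional settings": (7.7, 8),
}

for label, (q, df) in reported.items():
    print(f"{label}: I2 = {i_squared(q, df):.0f}%")
```

Under this definition the three comparisons give values of about 52%, 61%, and 0%, in agreement with the figures reported above.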

The majority of studies found associations between TUG times and falls in univariate analysis, but in multivariate regression models (logistic, Poisson) adjusting for demographic or functional measures, TUG performance remained an independent predictor of falls in only six of 17 studies in independent-living older adults and two of five studies in institutional settings. A study in older women showed that being in the slowest quartile increased the probability of indoor falls (odds ratio = 2.3, P = .008).32 Of the above trials, two of five studies conducted in independent-living older people and the sole study conducted in an institutional setting found TUG performance to be associated with multiple falls.
Diagnostic Accuracy of TUG Performance
The diagnostic accuracy of TUG times in correctly classifying people as fallers was poor to moderate across studies and settings (Appendix S2, Tables 1–3 and Figure 3). Exceptions demonstrating AUCs greater than 85% were small studies (n = 20–35) investigating few falls. One study in community-dwelling older people found that the TUG was better at predicting multiple fallers than any (≥1) fallers.33 Furthermore, there is limited evidence that the TUG is better at identifying non-fallers, as indicated by high sensitivity values and low negative likelihood ratios. Depending on sample characteristics, derived cut-points for falls in independent-living samples varied between 8.1 seconds and 16 seconds for performing the TUG while walking at comfortable gait speed and between 11 seconds and 13.5 seconds while walking as fast as possible. In institutional settings, cut-points for the original version varied greatly (13–32.6 seconds). Figure 3 shows that sensitivity and specificity were often close to chance. In a frail outpatient day hospital setting, sensitivity and specificity were 75% and 67%, respectively, for a cut-point of 32.6 seconds,34 further suggesting that differences are higher in frailer samples, although the sample size in this study was small (n = 30).
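To show what a sensitivity of 75% and a specificity of 67% imply for an individual test result, the following sketch converts them into likelihood ratios and then into post-test probabilities. The pre-test probability of 0.5 is a hypothetical value chosen purely for illustration and is not reported in the source; the function names are likewise assumptions of this example.

```python
def likelihood_ratios(sensitivity: float, specificity: float) -> tuple:
    """Return (positive LR, negative LR) for a dichotomized test."""
    lr_pos = sensitivity / (1.0 - specificity)
    lr_neg = (1.0 - sensitivity) / specificity
    return lr_pos, lr_neg


def post_test_probability(pre_test_probability: float, likelihood_ratio: float) -> float:
    """Update a pre-test probability with a likelihood ratio via odds."""
    pre_odds = pre_test_probability / (1.0 - pre_test_probability)
    post_odds = pre_odds * likelihood_ratio
    return post_odds / (1.0 + post_odds)


# Sensitivity and specificity as reported for the 32.6-second cut-point.
lr_pos, lr_neg = likelihood_ratios(0.75, 0.67)

# Hypothetical pre-test probability of falling; not taken from the source.
ASSUMED_PRE_TEST = 0.5
p_after_positive = post_test_probability(ASSUMED_PRE_TEST, lr_pos)
p_after_negative = post_test_probability(ASSUMED_PRE_TEST, lr_neg)

print(f"LR+ = {lr_pos:.2f}, LR- = {lr_neg:.2f}")
print(f"Post-test probability after a positive result: {p_after_positive:.2f}")
print(f"Post-test probability after a negative result: {p_after_negative:.2f}")
```

With these inputs the positive likelihood ratio is a little above 2 and the negative likelihood ratio is a little below 0.4, so either result shifts the assumed probability of falling only modestly, which is consistent with the moderate accuracy described above.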

Discussion
Findings of this review indicate that mean TUG times of fallers and non-fallers differ significantly, with small differences in healthy older people and larger differences in frailer and less-mobile groups. Sensitivity analysis investigating the effect of study design showed only marginal changes in the pooled mean differences and will not be discussed further. However, tests of prediction and diagnostic accuracy did not support the TUG's ability to predict fallers with high certainty or to classify older people accurately into faller and non-faller groups.
The TUG was initially developed to measure functional mobility and was not operationalized or formally validated to be a test for predicting falls in older people. Although TUG performance can discriminate between faller and non-faller groups in small case–control studies,11, 13, 35, 36 this study design is likely to overestimate the test's predictive value. Because of these initial findings and its ease of administration, it has been recommended as a screen for fall risk in several guidelines.5, 6 More than 50 retrospective and prospective cohort studies have assessed the ability of the TUG to discriminate between faller and non-faller groups. These studies have used various forms of the test, including the originally proposed version, fast walking speed, and different distances. Most studies derived TUG cut-points for predicting fallers from their samples and did not undertake any external validation.
The pooled mean difference showed that fallers were significantly slower than non-fallers across settings, but this difference in TUG time was only 0.46 seconds in prospective studies of healthy older people; when considered in the context of intra- and interrater variability in measuring test time manually, this difference is not clinically meaningful. The difference between faller and non-faller TUG times increased to more than 4 seconds in institutional settings, demonstrating that the TUG discriminates better between faller groups in lower-functioning populations.
The poor predictive validity of the TUG may be due to several factors. First, falls result from multiple factors and are unlikely to be explained using a single test. The TUG indirectly measures several important risk factors, including strength, balance, and gait stability,7, 23 but may not sufficiently reflect risk factors such as poor vision, cognition, and effect of medications. Some researchers have added a secondary task to make the test more encompassing. In a sample of frail participants, one study added a manual secondary task and found that, if the time difference between this task and the standard TUG was greater than 4.5 seconds, people were more likely to fall within 6 months.10 However, one small case–control study added a manual task (carrying a cup of water) or a cognitive task (serial subtraction by threes) to the TUG and could not find any added value in predicting multiple fallers.11
Second, older people do not form a homogeneous group. Studies with wide eligibility criteria include a variety of people likely to represent different subpopulations in terms of fall risk, as the overlap in mean values between fallers and non-fallers supports. TUG performance depends on many factors, for instance, use and type of walking aids,37 age,22 and fear of falling.38 Moreover, recent research findings suggest the increasing importance of fall location.39 One study found that faster TUG times were associated with more outdoor falls and slower TUG with more indoor falls.32 Most of the studies did not control for fall location or other abovementioned factors, which may have resulted in effects being cancelled out because active outdoor fallers had faster TUG times than slow indoor fallers, diminishing the predictive ability of the TUG.
Third, there may be methodological reasons for the failure of the TUG to predict falls. Although excellent measurement properties in terms of reliability have been reported in standardized laboratory-based settings for different older populations,7, 40, 41 a population-based study that used a longer time interval between test administrations and different locations and raters could not confirm these findings (intraclass correlation coefficient = 0.56).18 Moreover, substantial measurement errors for the TUG have been shown in frail older people who were dependent in ADLs or hospitalized.41, 42 Furthermore, people at high risk of falls may be underrepresented because they are less likely to volunteer for studies, more likely to withdraw early from studies, or unable to complete the TUG.43-45 One study found that people who refused participation were more likely to be cognitively impaired, have experienced a fall in the past 12 months, use walking aids, and have difficulty performing ADLs.46
This review has several strengths. An exhaustive search of the literature was conducted, studies in multiple languages were included, additional data were collected from authors where possible (n = 21), and meta-analyses of appropriately similar population groups were performed. Nevertheless, some findings from this review should be interpreted with caution because of the limited data available. There were few high-quality trials, and the review included studies with retrospective falls data and non-randomly chosen samples, which are known to be prone to selection and recall bias and therefore compromise internal and external validity.29 Falls data collection methods used in retrospective studies are also likely to underestimate the number of falls,47 limiting the interpretability of the results from this review. Finally, the reporting of population characteristics was often not sufficient to categorize studies in terms of their population's fall risk.
Conclusions
The findings from this systematic review and meta-analysis suggest that the TUG is not useful for discriminating fallers from non-fallers in healthy, high-functioning populations of older people but is of more use in less-healthy, lower-functioning groups. There was considerable overlap in TUG times between fallers and non-fallers within and between studies, and derived cut-points varied substantially, such that the use of a cut-point from one study would classify most participants from another study as fallers, whereas the use of a cut-point from a third study would classify most of the same participants as non-fallers. Cut-points were so different between studies that it is not possible to make a recommendation regarding threshold values between fallers and non-fallers for independently living people or for those in institutional settings.
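The consequence of transferring cut-points between samples can be illustrated with a small sketch. The two cut-points are the lowest and highest values reported in this review for independent-living samples at comfortable gait speed (8.1 and 16 seconds); the TUG times are hypothetical values invented for the example, and treating a time at or above the cut-point as "at risk" is an assumption of the sketch.

```python
# Lowest and highest derived cut-points (seconds) reported for independent-living
# samples performing the TUG at comfortable gait speed.
CUT_POINTS_S = {"lowest reported": 8.1, "highest reported": 16.0}

# Hypothetical TUG times (seconds) for ten older people; not data from any study.
hypothetical_tug_times_s = [9.0, 10.5, 11.2, 12.0, 12.8, 13.5, 14.1, 15.0, 15.6, 17.2]


def flag_at_risk(times, cut_point):
    """Return True for each time at or above the cut-point (assumed 'at risk')."""
    return [t >= cut_point for t in times]


flagged = {
    label: sum(flag_at_risk(hypothetical_tug_times_s, cut_point))
    for label, cut_point in CUT_POINTS_S.items()
}

for label, count in flagged.items():
    print(f"{label} cut-point ({CUT_POINTS_S[label]} s): "
          f"{count} of {len(hypothetical_tug_times_s)} classified as at risk")
```

In this invented sample the lower cut-point classifies every person as at risk and the higher cut-point classifies only one, which is the pattern of disagreement described above.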
Health professionals should therefore not rely too heavily on TUG times in clinical practice, because taking a cut-point from one study and applying it in a clinical setting may lead to incorrect clinical decision-making.48 Clinicians may gain as much insight from observing how an older individual performs the task as from the time it takes to perform it. The TUG might therefore not be sufficient as a simple fall risk screening tool to identify people warranting more-detailed assessment of gait and balance. Quick, multifactorial fall risk screens should be considered to provide additional information about identifying older people at risk of falls.49, 50
Acknowledgments
Conflict of Interest: The editor in chief has reviewed the conflict of interest checklist provided by the authors and has determined that the authors have no financial or any other kind of personal conflicts with this paper.
Author Contributions: D. Schoene, S. R. Lord: Protocol development, literature search, data extraction, data analysis, interpretation of data, manuscript preparation. S. M. S. Wu: Protocol development, literature search, data extraction, interpretation of data, manuscript preparation. A. S. Mikolaizak: Literature search, data extraction, interpretation of data, manuscript preparation. J. C. Menant: Data extraction, interpretation of data, manuscript preparation. S. T. Smith: Protocol development, data analysis, interpretation of data, manuscript preparation. K. Delbaere: Interpretation of data, manuscript preparation.
Sponsor's Role: None.