Process and methods
6 Reviewing research evidence
Reviewing evidence is an explicit, systematic and transparent process that can be applied to both quantitative (experimental and observational) and qualitative evidence (see the chapter on developing review questions and planning the evidence review). The key aim of any review is to provide a summary of the relevant evidence to ensure that the committee can make fully informed decisions about its recommendations. This chapter describes how evidence is reviewed in the development of guidelines.
Evidence identified during literature searches and from other sources (see the chapter on identifying the evidence: literature searching and evidence submission) should be reviewed against the review protocol to identify the most appropriate information to answer the review questions. The evidence review process used to inform guidelines must be explicit and transparent and involves 6 main steps:
writing the review protocol (see the section on planning the evidence review in the chapter on developing review questions and planning the evidence review)
identifying and selecting relevant evidence
extracting and synthesising the results
assessing quality/certainty in the evidence
interpreting the results.
Any substantial deviations from these steps need to be agreed, in advance, with NICE staff with a quality assurance role.
The process of selecting relevant evidence is common to all evidence reviews; the other steps are discussed in relation to the main types of review questions. The same rigour should be applied to reviewing all data, whether fully or partially published studies or unpublished data supplied by stakeholders. Care should be taken to ensure that multiple reports of the same study are identified and ordered in full text to ensure that data extraction is as complete as possible, but study participants are not double counted in the analysis.
Titles and abstracts of the retrieved citations should be screened against the inclusion criteria defined in the review protocol, and those that do not meet these should be excluded. A percentage (at least 10%, but possibly more depending on the review question) should be screened independently by 2 reviewers (that is, titles and abstracts should be double-screened). The percentage of records to be double-screened for each review should be specified in the review protocol.
If reviewers disagree about a study's relevance, this should be resolved by discussion or by recourse to a third reviewer. If, after discussion, there is still doubt about whether or not the study meets the inclusion criteria, it should be retained. If double-screening is only done on a sample of the retrieved citations (for example, 10% of references), inter-rater reliability should be assessed against a pre-specified threshold (usually 90% agreement, unless another threshold has been agreed and documented). If agreement is lower than the pre-specified threshold, the reason should be explored and a course of action agreed to ensure a rigorous selection process. A further proportion of studies should be double-screened to validate this new process until appropriate agreement is achieved.
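As an illustration only, agreement on a double-screened sample could be checked along the following lines. The function names and the example decisions are hypothetical. Cohen's kappa is included as a supplementary, chance-corrected measure; this is an assumption of the sketch, because the text above specifies only percentage agreement against a pre-specified threshold.

```python
from typing import Sequence


def percent_agreement(a: Sequence[bool], b: Sequence[bool]) -> float:
    """Proportion of records on which both reviewers made the same decision."""
    if len(a) != len(b) or len(a) == 0:
        raise ValueError("decision lists must be non-empty and of equal length")
    return sum(x == y for x, y in zip(a, b)) / len(a)


def cohens_kappa(a: Sequence[bool], b: Sequence[bool]) -> float:
    """Chance-corrected agreement (include=True, exclude=False)."""
    p_observed = percent_agreement(a, b)  # also validates the inputs
    n = len(a)
    p_a, p_b = sum(a) / n, sum(b) / n
    # Agreement expected by chance if the reviewers decided independently.
    p_expected = p_a * p_b + (1 - p_a) * (1 - p_b)
    if p_expected == 1.0:
        return 1.0  # both reviewers made one identical decision throughout
    return (p_observed - p_expected) / (1 - p_expected)


def meets_threshold(a: Sequence[bool], b: Sequence[bool],
                    threshold: float = 0.90) -> bool:
    """Compare raw agreement with the threshold set in the review protocol."""
    return percent_agreement(a, b) >= threshold


# Hypothetical double-screened sample of 10 records.
reviewer_1 = [True, True] + [False] * 8
reviewer_2 = [True, False] + [False] * 8
print(percent_agreement(reviewer_1, reviewer_2))
print(round(cohens_kappa(reviewer_1, reviewer_2), 3))
print(meets_threshold(reviewer_1, reviewer_2))
```

When included studies are rare, raw agreement can be high even if reviewers disagree on most of the relevant records, which is why a chance-corrected measure may be a useful companion to the threshold check.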
Once the screening of titles and abstracts is complete, full versions of the selected studies should be obtained for assessment. As with title and abstract screening, a percentage of full studies should be checked independently by 2 reviewers, with any differences being resolved and additional studies being assessed by multiple reviewers if sufficient agreement is not achieved. Studies that fail to meet the inclusion criteria once the full version has been checked should be excluded at this stage.
The study selection process should be clearly documented and include full details of the inclusion and exclusion criteria. A flow chart should be used to summarise the number of papers included and excluded at each stage and this should be presented in the evidence review (see the PRISMA statement). Each study excluded after checking the full version should be listed, along with the reason for its exclusion. Reasons for study exclusion need to be sufficiently detailed (for example, 'editorial/review' or 'study population did not meet that specified in the review protocol').
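The counts that feed a selection flow chart can be kept consistent by deriving each stage from the one before. The sketch below is a minimal illustration; the class name, field names, figures and exclusion reasons are hypothetical and are not taken from the PRISMA statement.

```python
from dataclasses import dataclass, field
from typing import Dict


@dataclass
class SelectionFlow:
    """Counts at each stage of study selection (hypothetical structure)."""
    records_identified: int
    duplicates_removed: int
    excluded_on_title_abstract: int
    # Reason for exclusion at full text -> number of studies.
    full_text_exclusions: Dict[str, int] = field(default_factory=dict)

    @property
    def records_screened(self) -> int:
        return self.records_identified - self.duplicates_removed

    @property
    def full_texts_assessed(self) -> int:
        return self.records_screened - self.excluded_on_title_abstract

    @property
    def studies_included(self) -> int:
        return self.full_texts_assessed - sum(self.full_text_exclusions.values())


flow = SelectionFlow(
    records_identified=1200,
    duplicates_removed=200,
    excluded_on_title_abstract=900,
    full_text_exclusions={
        "editorial/review": 40,
        "study population did not meet that specified in the review protocol": 35,
    },
)
print(flow.records_screened, flow.full_texts_assessed, flow.studies_included)
```

Deriving the later counts, rather than entering them by hand, means that a change to any exclusion reason is reflected in the number of included studies.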
Priority screening refers to any technique that uses a machine learning algorithm to enhance the efficiency of screening. Usually, this involves taking information on previously included or excluded papers, and using this to order the unscreened papers from those most likely to be included to those least likely. This can be used to identify a higher proportion of relevant papers earlier in the screening process, or to set a cut‑off for manual screening, beyond which it is unlikely that additional relevant studies will be identified.
There is currently no published guidance on setting thresholds for stopping screening where priority screening has been used. Any methods used should be documented in the review protocol and agreed in advance with NICE staff with a quality assurance role. Any thresholds set should, at minimum, consider the following:
the number of references identified so far through the search, and how this identification rate has changed over the review (for example, how many candidate papers were found in each 1,000 screened)
the overall number of studies expected, which may be based on a previous version of the guideline (if it is an update), published systematic reviews, or the experience of the guideline committee
the ratio of relevant/irrelevant records found at the random sampling stage (if undertaken) before priority screening.
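The quantities listed above can be tracked during screening with very little code. The sketch below only computes them; it deliberately does not propose a stopping rule, since none is specified here. The function names, batch size and example data are assumptions made for illustration.

```python
from typing import List, Sequence


def includes_per_batch(decisions: Sequence[bool],
                       batch_size: int = 1000) -> List[int]:
    """Number of candidate papers found in each consecutive batch screened.

    `decisions` holds one include (True) / exclude (False) decision per
    record, in the order the records were screened.
    """
    if batch_size <= 0:
        raise ValueError("batch_size must be positive")
    return [sum(decisions[i:i + batch_size])
            for i in range(0, len(decisions), batch_size)]


def proportion_of_expected_found(found: int, expected_total: int) -> float:
    """Share of the expected number of studies identified so far."""
    if expected_total <= 0:
        raise ValueError("expected_total must be positive")
    return found / expected_total


def sample_prevalence(relevant: int, sampled: int) -> float:
    """Proportion of relevant records in the initial random sample."""
    if sampled <= 0:
        raise ValueError("sampled must be positive")
    return relevant / sampled


# Hypothetical screening history: 2,500 records screened in priority order.
history = ([True] * 30 + [False] * 970
           + [True] * 5 + [False] * 995
           + [False] * 500)
rates = includes_per_batch(history)
print(rates)
print(proportion_of_expected_found(sum(history), expected_total=40))
print(sample_prevalence(relevant=6, sampled=300))
```

A falling identification rate across batches, read alongside the expected total and the prevalence in the random sample, gives the developer documented evidence to support whatever threshold has been agreed in the review protocol.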
Regardless of the level of double-screening, and whether or not priority screening was used, additional checks should always be made to reduce the risk that relevant studies are not identified. These should include, at minimum:
checking reference lists of included systematic reviews, even if these reviews are not used as a source of primary data
checking with the guideline committee that they are not aware of any relevant studies that have been missed
looking for published papers associated with key trial registry entries or published protocols.
It may be useful to test the sensitivity of the search by checking that it picks up known studies of relevance.
Conference abstracts seldom contain enough information to allow confident judgements about the quality and results of a study, but they may be important in interpreting evidence reviews. Conference abstracts should therefore not be excluded from the search strategy. However, tracing the original studies or additional data can be very time-consuming, and the information found may not always be useful. If enough evidence has been identified from full published studies, it may be reasonable not to trace the original studies or additional data related to conference abstracts. If only limited evidence is identified from full published studies, tracing the original studies or additional data may be considered, to allow full critical appraisal of the data and to make judgements on their inclusion or exclusion from the evidence review. The investigators may be contacted if additional information is needed to complete the quality assessment.
Sometimes conference abstracts can be a good source of other information. For example, they can point to published studies that have been missed, they can indicate how much evidence has not yet been fully published (and so guide calls for evidence), and they can identify ongoing studies that are due to be published.
Relevant legislation or policies may be identified in the literature search and used to inform guidelines (such as drug safety updates from the Medicines and Healthcare products Regulatory Agency [MHRA]). Legislation and policy do not need quality assessment in the same way as other evidence, given the nature of the source. National policy or legislation can be quoted verbatim in the guideline (for example, the Health and Social Care Act), where needed.
Any unpublished data should be quality assessed in the same way as published studies (see the section on assessing quality of evidence: critical appraisal, analysis, and certainty in the findings). Ideally, if additional information is needed to complete the quality assessment, the investigators should be contacted. Similarly, if data from studies in progress are included, they should be quality assessed in the same way as published studies. Confidential information should be kept to a minimum, and a structured abstract of the study must be made available for public disclosure during consultation on the guideline.
Grey literature may be quality assessed in the same way as published literature, although because of its nature, such an assessment may be more difficult. Consideration should therefore be given to the elements of quality that are most likely to be important.
Assessing the quality of the evidence for a review question is critical. It requires a systematic process of assessing potential biases by considering both the appropriateness of the study design and the methods of the study (critical appraisal), as well as the certainty of the findings (using an approach such as GRADE).
Options for assessing the quality of the evidence should be considered by the developer. The chosen approach should be discussed and agreed with NICE staff with responsibility for quality assurance, where the approach deviates from the standard (as described below). The agreed approach should be documented in the review protocol (see the appendix on review protocol template) together with the reasons for the choice. If additional information is needed to complete the data extraction or quality assessment, study investigators may be contacted.
Every study should be appraised using a checklist appropriate for the study design (see the appendix on appraisal checklists, evidence tables, GRADE and economic profiles for checklists). If a checklist other than those listed is needed or the one recommended as first choice is not used, the planned approach should be discussed and agreed with NICE staff with responsibility for quality assurance and documented in the review protocol.
Before starting the review, the checklist criteria (some or all) that are likely to be the most important indicators of bias for the review question should be agreed. These criteria will be useful in guiding decisions about the overall risk of bias of each individual study.
Sometimes, a decision might be made to exclude certain studies or to explore any impact of bias through sensitivity analysis. If so, the approach should be specified in the review protocol and agreed with NICE staff with responsibility for quality assurance.
Criteria relating to key areas of bias may also be useful when summarising and presenting the evidence (see the section on summarising evidence). Topic-specific input (for example, from committee members) may be needed to identify the most appropriate criteria to define subgroup analyses, or to define inclusion in a review, for example, the minimum biopsy protocol for identifying the relevant population in cancer studies.
For each criterion that might be explored in sensitivity analysis, the decision on whether it has been met or not, and the information used to arrive at the decision, should be recorded in a standard template for inclusion in an evidence table (see the appendix on appraisal checklists, evidence tables, GRADE and economic profiles for examples of evidence tables).
Each study included in an evidence review should preferably be critically appraised by 1 reviewer and checked by another. Any differences in critical appraisal should be resolved by discussion or recourse to a third reviewer. Different strategies for critical appraisal may be used depending on the topic and the review question.
Characteristics of data should be extracted to a standard template for inclusion in an evidence table (see the appendix on appraisal checklists, evidence tables, GRADE and economic profiles). Care should be taken to ensure that newly identified studies are cross-checked against existing studies to avoid double-counting. This is particularly important where there may be multiple reports of the same study.
Meta-analysis may be appropriate if treatment estimates of the same outcome from more than 1 study are available. Recognised approaches to meta-analysis should be used, as described in the handbook from Cochrane, the Centre for Reviews and Dissemination (2009), in Higgins et al. (2011) and documents developed by the NICE Guidelines Technical Support Unit.
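For illustration, one recognised approach, inverse-variance fixed-effect pooling, can be sketched as follows, together with Cochran's Q and the I-squared statistic as simple indicators of heterogeneity. The choice of a fixed-effect model, the function name and the example data are assumptions of this sketch; a random-effects model may be more appropriate when studies are heterogeneous. Ratio measures (for example, risk ratios) would need to be supplied on the log scale.

```python
import math
from typing import Dict, Sequence


def fixed_effect_pool(estimates: Sequence[float],
                      std_errors: Sequence[float]) -> Dict[str, float]:
    """Inverse-variance fixed-effect pooled estimate with Q and I-squared."""
    if len(estimates) != len(std_errors) or len(estimates) < 2:
        raise ValueError("need matching estimates and standard errors "
                         "from at least 2 studies")
    if any(se <= 0 for se in std_errors):
        raise ValueError("standard errors must be positive")

    weights = [1.0 / se ** 2 for se in std_errors]
    total_weight = sum(weights)
    pooled = sum(w * y for w, y in zip(weights, estimates)) / total_weight
    se_pooled = math.sqrt(1.0 / total_weight)

    # Cochran's Q: weighted squared deviations from the pooled estimate.
    q = sum(w * (y - pooled) ** 2 for w, y in zip(weights, estimates))
    df = len(estimates) - 1
    # I-squared: share of variability attributed to heterogeneity.
    i_squared = max(0.0, (q - df) / q) if q > 0 else 0.0

    return {
        "pooled": pooled,
        "se": se_pooled,
        "ci_lower": pooled - 1.96 * se_pooled,
        "ci_upper": pooled + 1.96 * se_pooled,
        "q": q,
        "i_squared": i_squared,
    }


# Hypothetical mean differences and standard errors from 3 studies.
result = fixed_effect_pool([-0.40, -0.25, -0.10], [0.20, 0.15, 0.25])
print({name: round(value, 3) for name, value in result.items()})
```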
There are several ways of summarising and illustrating the strength and direction of quantitative evidence about the effectiveness of an intervention if a meta-analysis is not done. Forest plots can be used to show effect estimates and confidence intervals for each study (when available, or when it is possible to calculate them). They can also be used to provide a graphical representation when it is not appropriate to do a meta-analysis and present a pooled estimate. However, the homogeneity of the outcomes and measures in the studies needs to be carefully considered: a forest plot needs data derived from the same (or justifiably similar) outcomes and measures.
Head‑to‑head data that compares the effectiveness of interventions is useful for a comparison between 2 active management options. Comparative studies are usually combined in a meta-analysis where appropriate. A network meta-analysis is an analysis that can include trials that compare the interventions of interest head-to-head and also trials that allow an indirect comparison via a common third intervention.
The same principles of good practice for evidence reviews and meta-analyses should be applied when conducting network meta-analyses. The reasons for identifying and selecting the randomised controlled trials (RCTs) should be explained, including the reasons for selecting the treatment comparisons. The methods of synthesis should be described clearly in the methods section of the evidence review.
When multiple competing options are being appraised, a network meta-analysis should be considered. The data from individual trials should also be documented (usually as an appendix). If there is doubt about the inclusion of particular trials (for example, because of concerns about limitations or applicability), a sensitivity analysis in which these trials are excluded should also be presented. The level of consistency between the direct and indirect evidence on the interventions should be reported, including consideration of model fit and comparison statistics such as the total residual deviance, and the deviance information criterion (DIC). Results of further inconsistency tests, such as those based on node-splitting, should also be reported, if available. Results from direct comparisons may also be presented alongside network meta-analyses to help validate the overall effect sizes obtained; ideally this will be the results from direct pairwise comparisons.
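The idea of an indirect comparison through a common comparator, and of checking it against direct evidence, can be illustrated with a simple adjusted indirect comparison (often attributed to Bucher). This is a sketch for a single loop of 3 interventions only and is not a substitute for a full network meta-analysis. It assumes that the direct and indirect estimates are independent (no shared multi-arm trials) and that effects are on an additive scale, such as log odds ratios. All names and numbers are hypothetical.

```python
import math
from typing import Tuple


def indirect_estimate(d_ab: float, se_ab: float,
                      d_cb: float, se_cb: float) -> Tuple[float, float]:
    """Indirect effect of A versus C through the common comparator B.

    d_ab is the pooled effect of A versus B; d_cb is that of C versus B.
    """
    d_ac = d_ab - d_cb
    se_ac = math.sqrt(se_ab ** 2 + se_cb ** 2)
    return d_ac, se_ac


def inconsistency(direct: float, se_direct: float,
                  indirect: float, se_indirect: float
                  ) -> Tuple[float, float, float]:
    """Difference between direct and indirect estimates, its SE and z score."""
    diff = direct - indirect
    se_diff = math.sqrt(se_direct ** 2 + se_indirect ** 2)
    return diff, se_diff, diff / se_diff


# Hypothetical log odds ratios.
d_ac_ind, se_ac_ind = indirect_estimate(d_ab=-0.5, se_ab=0.3,
                                        d_cb=-0.2, se_cb=0.4)
diff, se_diff, z = inconsistency(direct=-0.1, se_direct=0.2,
                                 indirect=d_ac_ind, se_indirect=se_ac_ind)
print(round(d_ac_ind, 3), round(se_ac_ind, 3))
print(round(diff, 3), round(se_diff, 3), round(z, 3))
```

Because the variances add, an indirect estimate is always less precise than either of the comparisons it is built from, which is one reason direct pairwise results are worth presenting alongside the network estimates.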
When evidence is combined using network meta-analyses, trial randomisation should typically be preserved. If this is not appropriate, the planned approach should be discussed and agreed with NICE staff with responsibility for quality assurance. A comparison of the results from single treatment arms from different RCTs is not acceptable unless the data are treated as observational and appropriate steps are taken to adjust for possible bias and increased uncertainty.
Further information on complex methods for evidence synthesis is provided by the documents developed by the NICE Guidelines Technical Support Unit.
To promote transparency of health research reporting (as endorsed by the EQUATOR network), evidence from a network meta-analysis should usually be reported according to the criteria in the modified PRISMA‑NMA checklist in the appendix on network meta-analysis reporting standards.
Evidence from a network meta-analysis can be presented in a variety of ways. The network should be presented diagrammatically with the direct and indirect treatment comparisons clearly identified and the number of trials in each comparison stated. Further information on how to present the results of network meta-analyses is provided by the documents developed by the NICE Guidelines Technical Support Unit.
Several approaches for assessing the quality of, or confidence in, outputs derived from network meta-analysis have recently been published (Phillippo et al. 2017, Caldwell et al. 2016, Puhan et al. 2014, Salanti et al. 2014). The strengths and limitations of these approaches and their application to guideline development are currently being assessed.
Information on methods of presenting and synthesising results from studies of diagnostic test accuracy is being developed by the Cochrane Screening and Diagnostic Tests Methods Group and the GRADE working group. The quality of the evidence should be based on the critical appraisal criteria from QUADAS-2 (see the appendix on appraisal checklists, evidence tables, GRADE and economic profiles). If meta-analysis is not possible or appropriate, there should be a narrative summary of the results that were considered most important for the review question.
Evidence on diagnostic test accuracy may be summarised in tables or presented as Receiver Operating Characteristic curves (ROC curves). Meta-analysis of results from a number of diagnostic accuracy studies can be complex. Relevant published technical advice (such as that from Cochrane) should be used to guide reviewers.
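As a minimal sketch of the per-study figures that might appear in such a summary table, the standard accuracy measures can be derived from a 2x2 table of index test result against reference standard. The function name and the example counts are hypothetical. The sketch does not attempt meta-analysis, for which the published technical advice should be followed.

```python
from typing import Dict


def accuracy_measures(tp: int, fp: int, fn: int, tn: int) -> Dict[str, float]:
    """Accuracy measures from a single study's 2x2 table.

    tp/fn: people with the condition who test positive/negative.
    fp/tn: people without the condition who test positive/negative.
    """
    if min(tp, fp, fn, tn) < 0:
        raise ValueError("counts must be non-negative")
    if 0 in (tp + fn, tn + fp, tp + fp, tn + fn):
        raise ValueError("a row or column total is zero; "
                         "some measures are undefined")
    sensitivity = tp / (tp + fn)
    specificity = tn / (tn + fp)
    return {
        "sensitivity": sensitivity,
        "specificity": specificity,
        "ppv": tp / (tp + fp),  # positive predictive value
        "npv": tn / (tn + fn),  # negative predictive value
        # Coordinates of this study in ROC space.
        "roc_x": 1.0 - specificity,
        "roc_y": sensitivity,
    }


# Hypothetical study: 100 people with the condition, 100 without.
measures = accuracy_measures(tp=90, fp=20, fn=10, tn=80)
print({name: round(value, 3) for name, value in measures.items()})
```

Predictive values depend on the prevalence of the condition in the study sample, so they are less transferable between settings than sensitivity and specificity.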
There is currently no general consensus on approaches for synthesising evidence from studies on prognosis or prediction models. A narrative summary of the quality of the evidence should be given, based on the quality appraisal criteria from the quality assessment tool used (for example, PROBAST [for clinical prediction models], or QUIPS [for simple correlation/univariate regression analyses], see the appendix on appraisal checklists, evidence tables, GRADE and economic profiles). Characteristics of data should be extracted to a standard template for inclusion in an evidence table (see the appendix on appraisal checklists, evidence tables, GRADE and economic profiles). Methods for presenting syntheses of evidence on prognosis and prediction models are being developed by the GRADE working group.
Results may be presented as tables. Reviewers should be wary of using meta-analysis to summarise results unless the same factor has been examined across all studies and the same outcome measured. It is important to explore whether all likely confounding factors have been accounted for, and whether the metrics used to measure exposure (or outcome) are universal. When studies cannot be pooled, results should be presented consistently across studies (for example, the median and ranges of predictive values). For more information on prognostic reviews, see Collins 2015 and Moons 2015.
Qualitative evidence occurs in many forms and formats and so different methods may be used for synthesis and presentation (such as those described by the Cochrane Qualitative & Implementation Methods Group). As with all data synthesis, it is important that the method used to evaluate the evidence is easy to follow. It should be written up in clear English and any analytical decisions should be clearly justified. Critical appraisal of qualitative evidence should be based on the criteria from the Critical Appraisal Skills Programme (CASP; see the appendix on appraisal checklists, evidence tables, GRADE and economic profiles).
In most cases, the evidence should be synthesised and then summarised in GRADE-CERQual. If synthesis of the evidence is not appropriate, a narrative summary may be adequate; this should be agreed with NICE staff with responsibility for quality assurance. The approach used depends on the volume and consistency of the evidence. If the qualitative evidence is extensive, then a recognised method of synthesis is preferable. If the evidence is more disparate and sparse, a narrative summary may be appropriate.
The simplest approach to presenting qualitative data in a meaningful way is to analyse the themes (or 'meta' themes) in the evidence tables and write second level themes based on them. This 'second level' thematic analysis can be carried out if enough data are found, and the papers and research reports cover the same (or similar) factors or use similar methods. (These should be relevant to the review questions and could, for example, include intervention, age, population or setting.)
Synthesis can be carried out in a number of ways, and each may be appropriate depending on the question type, and the evidence identified. Papers reporting on the same factors can be grouped together to compare and contrast themes, focusing not just on consistency but also on any differences. The narrative should be based on these themes.
A more complex but useful approach is 'conceptual mapping' (see Johnson et al. 2000). This involves identifying the key themes and concepts across all the evidence tables and grouping them into first level (major), second level (associated) and third level (subthemes) themes. Results are presented in schematic form as a conceptual diagram and the narrative is based on the structure of the diagram.
Alternatively, themes can be identified and extracted directly from the data, using a grounded approach (Glaser and Strauss 1967). Other potential techniques include meta-ethnography (Noblit and Hare 1988) and meta-synthesis (Barroso and Powell-Cope 2000), but expertise in their use is needed.
The certainty or confidence in the findings should be presented at outcome level using GRADE or GRADE-CERQual (for individual or synthesised studies). If this is not appropriate, the planned approach should be discussed and agreed with NICE staff with responsibility for quality assurance. It should be documented in the review protocol (see the appendix on review protocol template) together with the reasons for the choice.
Before starting an evidence review, the outcomes of interest that are either 'critical' or 'important' to people using services and the public for the purpose of decision-making should be identified. The reasons for prioritising outcomes should be documented in the evidence review. Prioritisation should be completed before the review begins and kept clearly separate from discussion of the evidence, because selecting outcomes once the results are known can introduce bias (for example, by choosing only outcomes for which there were statistically significant results).
The committee discussion section should also explain how the importance of outcomes was considered when discussing the evidence. For example, the committee may have found evidence on important outcomes but none on critical outcomes. The impact of this on the final recommendation should be clear.
GRADE and GRADE-CERQual assess the certainty or confidence in the review findings by looking at features of the evidence found for each 'critical' and 'important' outcome or theme. GRADE is summarised in box 6.2, and GRADE-CERQual in box 6.3.
GRADE assesses the following features for the evidence found for each 'critical' and each 'important' outcome:
study limitations (risk of bias) – the internal validity of the evidence
inconsistency – the heterogeneity or variability in the estimates of treatment effect across studies
indirectness – the extent of differences between the population, intervention, comparator for the intervention and outcome of interest across studies
imprecision – the extent to which confidence in the effect estimate is adequate to support a particular decision
other considerations – publication bias, the degree of selective publication of studies.
GRADE-CERQual assesses the following features for the evidence found for each 'critical' and each 'important' outcome or finding:
methodological limitations – the internal validity of the evidence
relevance – the extent to which the evidence is applicable to the context in the review question
coherence – the extent of the similarities and differences within the evidence
adequacy of data – the extent of richness and quantity of the evidence.
The certainty or confidence of evidence is classified as high, moderate, low or very low. In the context of NICE guidelines, it can be interpreted as follows:
High – further research is very unlikely to change our recommendation.
Moderate – further research is likely to have an important impact on our confidence in the estimate of effect and may change the strength of our recommendation.
Low – further research is very likely to have an important impact on our confidence in the estimate of effect and is likely to change the recommendation.
Very low – any estimate of effect is very uncertain and further research will probably change the recommendation.
The approach taken by NICE differs from the standard GRADE and GRADE-CERQual system in 2 ways:
it also integrates a review of the quality of cost-effectiveness studies (see the chapter on incorporating economic evaluation)
it does not use 'overall summary' labels for the quality of the evidence across all outcomes or for the strength of a recommendation, but uses the wording of recommendations to reflect the strength of the evidence (see the chapter on writing the guideline).
In addition, although GRADE does not yet cover all types of review questions, GRADE principles can be applied and adapted to other types of questions. The GRADE working group continues to refine existing approaches and to develop new approaches. Developers should check the GRADE website for any new guidance or systems when developing the review protocol. Any substantial changes, made by the developer, to GRADE should be agreed with NICE staff with responsibility for quality assurance before use.
GRADE or GRADE-CERQual tables summarise the certainty in the evidence and data for each critical and each important outcome or theme and include a limited description of the certainty in the evidence. GRADE or GRADE-CERQual tables should be available (in an appendix) for each review question.
NICE's equality and diversity duties are expressed in a single public sector equality duty ('the equality duty', see the section on key principles for developing guidelines in the introduction chapter). The equality duty supports good decision-making by encouraging public bodies to understand how different people will be affected by their activities. For NICE, much of whose work involves developing advice for others on what to do, this includes thinking about how people will be affected by its recommendations when these are implemented (for example, by health and social care practitioners).
In addition to meeting its legal obligations, NICE is committed to going beyond compliance, particularly in terms of tackling health inequalities. Specifically, NICE considers that it should also take account of socioeconomic status in its equality considerations.
Any equalities criteria specified in the review protocol should be included in the evidence tables. At the data extraction stage, reviewers should refer to the PROGRESS-Plus criteria (including age, sex, sexual orientation, disability, ethnicity, religion, place of residence, occupation, education, socioeconomic position and social capital; Gough et al. 2012) and any other relevant protected characteristics, and record these where reported, as specified in the review protocol. Review inclusion and exclusion criteria should also take the relevant groups into account, as specified in the review protocol.
Equalities should be considered during the drafting of the reviews. Equality considerations should be included in the data extraction process and should be recorded in the committee discussion section if they were important for decision-making.
The following sections should be included in the evidence review:
an introduction to the evidence review
summary of the evidence identified, in either table or narrative format
evidence tables (usually presented in an appendix)
full GRADE or GRADE-CERQual profiles (in an appendix)
evidence statements (if GRADE [or a modified GRADE approach], or GRADE-CERQual is not used)
results from other analysis of evidence, such as forest plots, area under the curve graphs, network meta-analysis (usually presented in an appendix).
The evidence should usually be presented separately for each review question; however, alternative methods of presentation may be needed for some evidence reviews (for example, where review questions are closely linked and need to be interpreted together). In these cases, the principles of quality assessment, and data extraction and presentation should still apply.
A summary of the evidence identified should be produced. The content of this summary will depend on the type of question and the type of evidence. It should also identify and describe any gaps in the evidence.
Short summaries of the evidence should be included with the main findings. These should:
summarise the volume of information gleaned for the review question(s), that is, the number of studies identified, included, and excluded (with a link to a PRISMA selection flowchart, in an appendix)
summarise the study types, populations, interventions, settings or outcomes for each study related to a particular review question.
Evidence tables help to identify the similarities and differences between studies, including the key characteristics of the study population and interventions or outcome measures. This provides a basis for comparison.
Data from identified studies are extracted to standard templates for inclusion in evidence tables. The type of data and study information that should be included depends on the type of study and review question, and should be concise and consistently reported. The appendix on appraisal checklists, evidence tables, GRADE and economic profiles contains examples of evidence tables for quantitative studies (both experimental and observational).
The types of information that could be included are:
bibliography (authors, date)
study aim, study design (for example, RCT, case–control study) and setting (for example, country)
funding details (if known)
population (for example, source and eligibility)
intervention, if applicable (for example, content, who delivers the intervention, duration, method, dose, mode or timing of delivery)
comparator, if applicable (for example, content, who delivers the intervention, duration, method, dose, mode or timing of delivery)
method of allocation to study groups (if applicable)
outcomes (for example, primary and secondary and whether measures were objective, subjective or otherwise validated)
key findings (for example, effect sizes, confidence intervals, for all relevant outcomes, and where appropriate, other information such as numbers needed to treat and considerations of heterogeneity if summarising a systematic review/meta-analysis)
inadequately reported data, missing data or if data have been imputed (include method of imputation or if transformation is used)
overall comments on quality, based on the critical appraisal and what checklist was used to make this assessment.
If data are not being used in any further statistical analysis, or are not reported in GRADE tables, effect sizes (point estimate) with confidence intervals should be reported, or back calculated from the published evidence where possible. If confidence intervals are not reported, exact p values (whether or not significant), with the test from which they were obtained, should be included. When confidence intervals or p values are inadequately reported or not given, this should be stated. Any descriptive statistics (including any mean values and degree of spread such as ranges) indicating the direction of the difference between intervention and comparator should be presented. If no further statistical information is available, this should be clearly stated.
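Back calculation from published figures commonly relies on a normal approximation. The sketch below shows the conversions between a standard error and a 95% confidence interval, including the case of a ratio measure whose interval is symmetric on the log scale. The function names and example values are hypothetical, and the approximation (z = 1.96) is an assumption that may not hold for small studies.

```python
import math
from typing import Tuple

Z_95 = 1.96  # normal quantile for a two-sided 95% interval


def ci_from_se(estimate: float, se: float,
               z: float = Z_95) -> Tuple[float, float]:
    """Confidence interval from a point estimate and its standard error."""
    return estimate - z * se, estimate + z * se


def se_from_ci(lower: float, upper: float, z: float = Z_95) -> float:
    """Standard error from a reported interval on an additive scale."""
    if upper < lower:
        raise ValueError("upper limit must not be below lower limit")
    return (upper - lower) / (2 * z)


def log_se_from_ratio_ci(lower: float, upper: float,
                         z: float = Z_95) -> float:
    """Standard error of a log ratio from an interval on the ratio scale."""
    if lower <= 0 or upper <= 0:
        raise ValueError("ratio limits must be positive")
    return se_from_ci(math.log(lower), math.log(upper), z)


# Hypothetical mean difference of 0.5 (95% CI 0.1 to 0.9).
print(round(se_from_ci(0.1, 0.9), 4))
# Hypothetical risk ratio with 95% CI 0.5 to 2.0.
print(round(log_se_from_ratio_ci(0.5, 2.0), 4))
print(tuple(round(limit, 3) for limit in ci_from_se(0.5, 0.2)))
```

Where values have been back calculated in this way, stating the method in the evidence table keeps the derivation transparent.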
The assessment of potential biases should also be presented. When study details are inadequately reported, or absent, this should be clearly stated.
The type of data that should be included in evidence tables for qualitative studies is shown in the example in the appendix on appraisal checklists, evidence tables, GRADE and economic profiles. This could include:
bibliography (authors, date)
study aim, study design and setting (for example, country)
funding details (if known)
population or participants
theoretical perspective adopted (such as grounded theory)
key aims, objectives and research questions; methods (including analytical and data collection technique)
key themes/findings (including quotes from participants that illustrate these themes/findings, if appropriate)
gaps and limitations
overall comments on quality, based on the critical appraisal, and the checklist used to make this assessment.
Full GRADE or GRADE-CERQual tables that present both the results of the analysis and describe the confidence in the evidence should normally be provided (in an appendix).
If GRADE or GRADE-CERQual is not appropriate for the evidence review, evidence statements should be included. Examples of where evidence statements may be needed are review questions covering prognosis/clinical prediction models (where data cannot be pooled), review questions covering service delivery, or where formal consensus approaches have been taken to answer a review question.
Evidence statements should provide an aggregated summary of all of the relevant studies or analyses, regardless of their findings. They should reflect the balance of the evidence, and its strength (quality, quantity and consistency, and applicability). Evidence statements should summarise key aspects of the evidence but should also highlight where there is a lack of evidence (note that this is different to evidence for a lack of effect).
Evidence statements are structured and written to help committees formulate and prioritise recommendations. They help committees decide:
whether or not there is sufficient evidence (in terms of strength and applicability) to form a judgement
whether (on balance) the evidence demonstrates that an intervention, approach or programme is effective or ineffective, or is inconclusive
the size of effect and associated measure of uncertainty
whether the evidence is applicable to people affected by the guideline and contexts covered by the guideline.
If evidence statements are presented, one or more evidence statements are prepared for each review question or subsidiary question. (Subsidiary questions may cover a type of intervention, specific population groups, a setting or an outcome.)
Each evidence statement should stand alone as an accessible, clear summary of key information used to support the recommendations (see the section on interpreting the evidence to make recommendations in the chapter on writing the guideline). The guideline should ensure that the relationship between the recommendations and the supporting evidence statements is clear.
Evidence statements should identify the sources of evidence and their quality in brief descriptive terms and not just by symbols. Each statement should also include summary information about the:
content of the intervention, management strategy (for example, what, how, where?) and comparison, or factor of interest
population(s), number of people analysed, and setting(s) (for example, country)
outcome(s), the direction of effect (or correlation) and the size of effect (or correlation) if applicable
strength of evidence (reflecting the appropriateness of the study design to answer the question and the quality, quantity and consistency of evidence)
applicability to the question, people affected by the guideline and setting (see the section on equality and diversity considerations).
Note that the strength of the evidence is reported separately to the direction and size of the effects or correlations observed.
Where important, the evidence statement should also summarise information about:
whether the intervention has been delivered as it should be (fidelity of the intervention)
what affects the intervention achieving the outcome (mechanism of action).
An evidence statement indicating where no evidence is identified for a critical or important outcome should be included.
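The components of an evidence statement described above can be sketched as a small structure that keeps the strength of evidence separate from the direction and size of effect, and that still produces a statement when no evidence is found. This is an illustrative sketch only: the class, field names and sentence wording are assumptions, and the sources shown are placeholders.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class EvidenceStatement:
    """Structured inputs for one evidence statement (hypothetical fields)."""
    outcome: str
    sources: list[str] = field(default_factory=list)
    study_designs: str = "studies"
    strength: Optional[str] = None      # for example weak, moderate or strong
    population: Optional[str] = None
    n_analysed: Optional[int] = None
    direction: Optional[str] = None     # reported separately from strength
    effect_size: Optional[str] = None
    applicability: Optional[str] = None

    def render(self) -> str:
        # A statement is still needed when nothing was found for an outcome.
        if not self.sources:
            return f"No evidence was found for the outcome: {self.outcome}."
        cited = ", ".join(self.sources)
        n_text = f"; n={self.n_analysed}" if self.n_analysed is not None else ""
        parts = [
            f"There is {self.strength} evidence from {len(self.sources)} "
            f"{self.study_designs} ({cited}{n_text}) about {self.outcome}."
        ]
        if self.population:
            parts.append(f"Population: {self.population}.")
        if self.direction:
            parts.append(f"Direction of effect: {self.direction}.")
        if self.effect_size:
            parts.append(f"Size of effect: {self.effect_size}.")
        if self.applicability:
            parts.append(f"Applicability: {self.applicability}.")
        return " ".join(parts)


# Placeholder content for illustration only.
found = EvidenceStatement(
    outcome="placeholder outcome",
    sources=["Author A year", "Author B year"],
    study_designs="controlled before and after studies",
    strength="moderate",
    n_analysed=300,
    direction="placeholder direction",
    applicability="placeholder applicability judgement",
)
missing = EvidenceStatement(outcome="placeholder critical outcome")
print(found.render())
print(missing.render())
```

A generated sentence of this kind is only a starting point; the review team would still need to edit it into an accessible, stand-alone summary.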
A set of standardised terms for describing the strength of the evidence is given in box 6.4. However, the evidence base for each review may vary, so the developer should define how these terms have been used.
No evidence: 'No evidence was found from English-language trials published since 1990…'. (Note that no evidence is not the same as evidence of no effect.)
Weak evidence: 'There was weak evidence from 1 controlled before and after study'.
Moderate evidence: 'There was moderate evidence from 2 controlled before and after studies'.
Strong evidence: 'There was strong evidence from 2 controlled before and after studies and 1 cohort study'.
Inconsistent evidence: 'The quality of the evidence is mixed'.
Further commentary may be needed on the variability of findings in different studies, for example, when the quality of studies reporting the same outcome varies. In such cases, the review team may qualify an evidence statement with an explanatory sentence or section that gives more detail.
The terms should not be used to describe other aspects of the evidence, such as applicability or direction of effect (see below for suitable terminology).
'Vote counting' (merely reporting on the number of studies) is not an acceptable summary of the evidence.
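One reason vote counting is inadequate is that it ignores the size and precision of each study's estimate. The sketch below uses invented numbers and a fixed-effect inverse-variance pooled estimate (one common synthesis method, chosen here only for illustration) to show 3 studies that are each statistically non-significant, but whose combined estimate excludes no effect.

```python
import math


def vote_count(log_effects, standard_errors, z=1.96):
    """Number of studies whose own estimate is statistically significant."""
    return sum(
        1 for y, se in zip(log_effects, standard_errors) if abs(y / se) > z
    )


def pooled_fixed_effect(log_effects, standard_errors, z=1.96):
    """Inverse-variance fixed-effect pooled estimate with a Wald interval."""
    weights = [1 / se ** 2 for se in standard_errors]
    pooled = sum(w * y for w, y in zip(weights, log_effects)) / sum(weights)
    pooled_se = math.sqrt(1 / sum(weights))
    return pooled, pooled - z * pooled_se, pooled + z * pooled_se


# Invented log odds ratios and standard errors for 3 hypothetical studies.
log_ors = [0.40, 0.35, 0.50]
ses = [0.25, 0.30, 0.30]

significant = vote_count(log_ors, ses)
pooled, lower, upper = pooled_fixed_effect(log_ors, ses)

print(f"Studies significant on their own: {significant} of {len(log_ors)}")
print(
    f"Pooled OR {math.exp(pooled):.2f} "
    f"(95% CI {math.exp(lower):.2f} to {math.exp(upper):.2f})"
)
```

A count of 'significant' studies would suggest no effect here, whereas the pooled interval lies entirely above an odds ratio of 1. The choice of pooling model is itself an assumption that should be justified in the review.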
If appropriate, the direction of effect or association should be summarised using 1 of the following terms:
However, appropriate context/topic-specific terms (for example, 'an increase in HIV incidence', 'a reduction in injecting drug use' and 'smoking cessation') may be used.
These terms should be used consistently in each review and their definitions should be reported in the methods section.
An example of an evidence statement from a prognostic review is given in box 6.5. The example has been adapted from the original and is for illustrative purposes only:
There is moderate evidence from 3 UK cross-sectional studies (Kettle et al. 2007, Jarrett et al. 2007, Morgan et al. 2000; n=254) about the correlation between young people's communication skills around safer sex and a reduction in the number of teenage pregnancies. The evidence about the strength of this correlation is mixed. One study (Kettle et al. 2007) found that discussing condom use with new partners was associated with an increase in actual condom use at first sex (odds ratio [OR] 2.67 [95% confidence interval 1.55 to 4.57]). Another study (Morgan et al. 2000) found that not talking to a partner about protection before first sexual intercourse was associated with an increase in teenage pregnancy (OR 1.67 [1.03 to 2.72]). A third study (Jarrett et al. 2007) found small positive correlations between condom use, discussions about safer sex (r=0.072, p<0.01) and communication skills (r=0.204, p<0.01).
Evidence statements for qualitative studies or syntheses of qualitative studies do not usually report the impact of an intervention on behaviour or outcomes, and do not report statistical effects or aggregate measures of strength and effect size. Instead, statements should summarise the evidence, its context and quality, and the consistency of key findings and themes across studies (meta-themes). Areas where there is little (or no) coherence should also be summarised. An example of an evidence statement developed from qualitative data is given in box 6.6.
Two UK studies (Ellis 1999, Swann 2000) and 1 Dutch study (Nolan 2004; n=542) reported the views of teenage mothers. In 1 study (Ellis 1999) of mothers interviewed in a family planning clinic and 1 study (Swann 2000) of mothers' responses to a questionnaire at their GP surgery, the participants agreed that access to education was the thing that helped them most after they had their child. However, this was not reported as a key theme in the Dutch study (Nolan 2004) of health visitor perceptions of teenage mothers' needs.
AGREE Collaboration (2003) Development and validation of an international appraisal instrument for assessing the quality of clinical practice guidelines: the AGREE project. Quality and Safety in Health Care 12: 18–23
Altman DG (2001) Systematic reviews of evaluations of prognostic variables. British Medical Journal 323: 224–8
Barroso J, Powell-Cope GM (2000) Meta-synthesis of qualitative research on living with HIV infection. Qualitative Health Research 10: 340–53
Brouwers M, Kho ME, Browman GP et al. for the AGREE Next Steps Consortium (2010) AGREE II: advancing guideline development, reporting and evaluation in healthcare. Canadian Medical Association Journal 182: E839–42
Caldwell DM, Ades AE, Dias S et al. (2016) A threshold analysis assessed the credibility of conclusions from network meta-analysis. Journal of Clinical Epidemiology 80: 68–76
Centre for Reviews and Dissemination (2009) Systematic reviews: CRD's guidance for undertaking reviews in health care. University of York: Centre for Reviews and Dissemination
Collins GS, Reitsma JB, Altman DG et al. (2015) Transparent reporting of a multivariable prediction model for individual prognosis or diagnosis (TRIPOD): The TRIPOD Statement. Annals of Internal Medicine 162: 55–63
Egger M, Davey Smith G, Altman DG (2000) Systematic reviews in health care: meta-analysis in context. London: British Medical Journal Books
Glaser BG, Strauss AL (1967) The discovery of grounded theory: strategies for qualitative research. New York: Aldine de Gruyter
Gough D, Oliver S, Thomas J, editors (2012) An introduction to systematic reviews. London: Sage
GRADE working group (2004) Grading quality of evidence and strength of recommendations. British Medical Journal 328: 1490–4
Guyatt GH, Oxman AD, Schünemann HJ et al. (2011) GRADE guidelines: a new series of articles in the Journal of Clinical Epidemiology. Journal of Clinical Epidemiology 64: 380–2
Harbord RM, Deeks JJ, Egger M et al. (2007) A unification of models for meta-analysis of diagnostic accuracy studies. Biostatistics 8: 239–51
Higgins JPT, Green S, editors (2011) Cochrane handbook for systematic reviews of interventions, version 5.1.0 (updated March 2011)
Johnson JA, Biegel DE, Shafran R (2000) Concept mapping in mental health: uses and adaptations. Evaluation and Programme Planning 23: 67–75
Moons KG, Altman DG, Reitsma JB et al. (2015) Transparent reporting of a multivariable prediction model for individual prognosis or diagnosis (TRIPOD): explanation and elaboration. Annals of Internal Medicine 162: W1–W73
NICE Decision Support Unit Evidence synthesis TSD series [online; accessed 31 August 2018]
Noblit G, Hare RD (1988) Meta-ethnography: synthesising qualitative studies. London: Sage
Phillippo DM, Dias S, Ades AE et al. (2017) Sensitivity of treatment recommendations to bias in network meta-analysis. Journal of the Royal Statistical Society, Series A
Puhan MA, Schünemann HJ, Murad MH et al. (2014) A GRADE working group approach for rating the quality of treatment effect estimates from network meta-analysis. British Medical Journal 349: g5630
Ring N, Jepson R, Ritchie K (2011) Methods of synthesizing qualitative research studies for health technology assessment. International Journal of Technology Assessment in Health Care 27: 384–90
Salanti G, Del Giovane C, Chaimani A et al. (2014) Evaluating the quality of evidence from a network meta-analysis. PLoS ONE 9(7): e99682
Tugwell P, Petticrew M, Kristjansson E et al. (2010) Assessing equity in systematic reviews: realising the recommendations of the Commission on the Social Determinants of Health. British Medical Journal 341: c4739
Tugwell P, Knottnerus JA, McGowan J et al. (2018) Systematic Review Qualitative Methods Series reflect the increasing maturity in qualitative methods. Journal of Clinical Epidemiology 97: vii–viii
Turner RM, Spiegelhalter DJ, Smith GC et al. (2009) Bias modelling in evidence synthesis. Journal of the Royal Statistical Society, Series A (Statistics in Society) 172: 21–47
Whiting PF, Rutjes AWS, Westwood ME et al. and the QUADAS‑2 group (2011) QUADAS-2: a revised tool for the quality assessment of diagnostic accuracy studies. Annals of Internal Medicine 155: 529–36