6 Reviewing research evidence
Reviewing evidence is an explicit, systematic and transparent process that can be applied to both quantitative (experimental, observational and correlational) and qualitative evidence (see chapter 4). The key aim of any review is to provide a summary of the relevant evidence to ensure that the Committee can make fully informed decisions about its recommendations. This chapter describes how evidence is reviewed in the development of guidelines.
Evidence reviews for NICE guidelines need to summarise the evidence, notwithstanding its limitations, so that the Committee can interpret the evidence and make recommendations, even where there is uncertainty.
Studies identified during literature searches (see chapter 5) need to be reviewed to identify the most appropriate information to answer the review questions, and to ensure that the guideline recommendations are based on the best available evidence. The evidence review process used must be explicit and transparent. The process used to inform guidelines involves 6 main steps, which are described in this chapter.
Any substantial deviations from these steps need to be agreed, in advance, with NICE staff with a quality assurance role.
The process of selecting relevant evidence is common to all evidence reviews; the other steps are discussed in relation to the main types of review questions. The same rigour should be applied to reviewing fully and partially published studies, as well as unpublished data supplied by registered stakeholders.
Titles and abstracts of the retrieved citations should be screened against the inclusion criteria defined in the protocol, and those that do not meet these should be excluded. Unless agreed beforehand with NICE staff with a quality assurance role, title and abstract screening should be undertaken independently by 2 reviewers (that is, titles and abstracts should be double‑screened) using the parameters set out in the review protocol. If reviewers disagree about a study's relevance, this should be resolved by discussion or by recourse to a third reviewer. If, after discussion, there is still doubt about whether or not the study meets the inclusion criteria, it should be retained. If double‑screening is only done on a sample of the retrieved citations (for example, 10% of references), inter‑rater reliability should be assessed and reported in the guideline. If it is low, the reason for this should be explored and a course of action agreed to ensure a rigorous selection process.
However, this process is resource intensive. When deciding on the most appropriate strategy, a balance should be struck between the complexity of the topic and the potential risk of excluding studies inappropriately. Strategies could include checking with other members of the evidence review team, the topic adviser (if there is one), the Developer, and the Committee Chair or the Committee, checking of random samples, or using IT solutions such as text mining.
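Where only a sample of citations is double‑screened, inter‑rater reliability can be quantified with a simple agreement statistic such as Cohen's kappa, which corrects raw agreement for the agreement expected by chance. A minimal sketch with invented screening decisions (the data are illustrative and not part of this manual):

```python
# Cohen's kappa for 2 reviewers' include/exclude decisions on a
# double-screened sample of citations (invented, illustrative data).

def cohens_kappa(reviewer_a, reviewer_b):
    """Chance-corrected agreement between two binary raters."""
    assert len(reviewer_a) == len(reviewer_b)
    n = len(reviewer_a)
    observed = sum(a == b for a, b in zip(reviewer_a, reviewer_b)) / n
    # Chance agreement from each reviewer's marginal include/exclude rates
    p_a_inc = sum(reviewer_a) / n
    p_b_inc = sum(reviewer_b) / n
    expected = p_a_inc * p_b_inc + (1 - p_a_inc) * (1 - p_b_inc)
    return (observed - expected) / (1 - expected)

# 1 = include, 0 = exclude, for a 10% sample of screened titles and abstracts
a = [1, 1, 0, 0, 1, 0, 0, 0, 1, 0]
b = [1, 1, 0, 0, 0, 0, 0, 1, 1, 0]
kappa = cohens_kappa(a, b)
print(f"kappa = {kappa:.2f}")  # prints: kappa = 0.58
```

What counts as 'low' reliability is a judgement for the review team; the manual deliberately leaves the threshold and the agreed course of action to discussion with NICE staff with a quality assurance role.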
Once the screening of titles and abstracts is complete, full versions of the selected studies should be acquired for assessment. As with title and abstract screening, full studies should usually be checked independently by 2 reviewers, with any differences being resolved. As above, alternative strategies to ensure that studies are not excluded inappropriately can be used (such as checking with the Committee or checking of random samples). Studies that fail to meet the inclusion criteria once the full version has been checked should be excluded at this stage.
The study selection process should be clearly documented and include full details of the inclusion and exclusion criteria. A flow chart should be used to summarise the number of papers included and excluded at each stage of the process and this should be presented in the evidence review (see the PRISMA statement). Each study excluded after checking the full version should be listed, along with the reason for its exclusion.
Conference abstracts seldom contain enough information to allow confident judgements about the quality and results of a study, but they can be important in interpreting evidence reviews. Conference abstracts should therefore not be excluded from the search strategy. However, tracing the original studies or additional data can be very time-consuming, and the information found may not always be useful. If enough evidence has been identified from full published studies, it may be reasonable not to trace the original studies or additional data related to conference abstracts. But if limited evidence is identified from full published studies, tracing the original studies or additional data may be considered, to allow full critical appraisal of the data and to make judgements on their inclusion or exclusion from the evidence review. Ideally, if additional information is needed to complete the quality assessment, the investigators should be contacted.
Sometimes conference abstracts can be a good source of other information. For example, they can point to published studies that may be missed, they can help to estimate how much evidence has not been fully published (and so guide calls for evidence and judgements about publication bias), or they can identify ongoing studies that are due to be published.
Relevant legislation or policies may be identified in the literature search and used to inform guidelines. Given the nature of the source, legislation and policy do not need quality assessment in the same way as other evidence. Recommendations from national policy or legislation can be quoted verbatim in the guideline (for example, from the Health and Social Care Act), where needed.
Any unpublished data should be quality assessed in the same way as published studies (see section 6.2). Ideally, if additional information is needed to complete the quality assessment, the investigators should be contacted. Similarly, if data from studies in progress are included, they should be quality assessed in the same way as published studies. The same principles for the use of confidential data should be applied (see section 5.5) and, as a minimum, a structured abstract of the study must be made available for public disclosure during consultation on the guideline.
Grey literature may be quality assessed in the same way as published literature, although because of its nature, such an assessment may be more difficult. Consideration should therefore be given to the elements of quality that are likely to be most important.
Quality assessment is a critical stage in reviewing the evidence. It requires a systematic process of assessing bias through considering the appropriateness of the study design and the methods of the study. Every study should be assessed using an appropriate checklist. The quality is then summarised by individual study and, if using the GRADE approach, by outcome across all relevant studies. Details of methodology checklists for studies addressing different types of review question and the methods used for assessing quality are given below. Whatever the type of review question or the method used for assessing quality, critical thinking should be used to ensure that relevant biases are considered fully. The Cochrane handbook for systematic reviews of interventions gives a full description of potential biases for intervention studies and how they may be assessed. Quality assessment applies to qualitative and quantitative studies, including economic studies.
Making judgements about the overall quality of studies can be difficult. Before starting the review, an assessment should be made to determine which quality appraisal criteria from the appropriate checklist are likely to be the most important indicators of quality for the review question being addressed. These criteria will be useful in guiding decisions about the overall quality of individual studies and whether to exclude certain studies. They will also be useful when summarising and presenting the body of evidence as a whole (see section 6.4). Topic‑specific input (for example, from Committee members) may be needed to identify the most appropriate quality criteria.
Characteristics of data should be extracted to a standard template for inclusion in an evidence table (see appendix H for examples of evidence tables).
Options for quality assessment should be considered by the Developer, and the chosen approach discussed and agreed with NICE staff with responsibility for quality assurance. The approach should be documented in the review protocol (see table 4.1) together with the rationale for the choice. Each study included in an evidence review should usually be quality assessed by 1 reviewer and checked by another. Any differences in quality grading should be resolved by discussion or recourse to a third reviewer. Alternative strategies for quality assessment may be used depending on the topic and the review question. Strategies for different types of review questions are given below.
Reviews should be assessed using the methodology checklist for systematic reviews and meta‑analyses (see appendix H). If needed, high‑quality systematic reviews can be updated or their primary studies used as evidence for informing a new review. However, the original systematic review should be cited and its use acknowledged as evidence.
The Cochrane handbook for systematic reviews of interventions (Higgins and Green 2011) lists design features in tables 13.2a and 13.2b for quantitative studies with allocations to interventions at the individual and group levels respectively. Once the study design has been classified, the study should be assessed using the methodology checklist appropriate for that type of study (see appendix H). Box 13.4a of the Cochrane handbook for systematic reviews of interventions provides useful notes for completing the appropriate checklist.
The quality of a study can vary depending on which of its measured outcomes is being considered. For example, short‑term outcomes may be less susceptible to bias than long‑term outcomes because of greater loss to follow-up with the latter. It is therefore important when summarising evidence that quality is considered according to outcome.
For more information about the quality assessment of studies of cost effectiveness, see chapter 7.
Studies of diagnostic test accuracy should be assessed using the QUADAS‑2 methodology checklist (Quality Assessment of Diagnostic Accuracy Studies 2; see appendix H).
Quality assessment of studies on the views and experiences of people using services, their families and carers, the public or practitioners
Studies about the views and experiences of people are likely to be qualitative studies or cross-sectional surveys. Qualitative studies should be assessed using the methodology checklist in appendix H.
There is no well‑validated methodology checklist for the quality appraisal of cross‑sectional surveys. Such surveys should be assessed for the rigour of the process used to develop the survey questions and their relevance to the population under consideration, and for the existence of significant bias (for example, non‑response bias).
There are 2 approaches to presenting the quality assessment of the evidence: either at the whole-study level or by outcome across multiple studies. Either approach can be used, but the choice should be documented in the review protocol.
Studies are rated ('++', '+' or '−') individually to indicate their quality, based on assessment using a checklist appropriate to the study design. Quality ratings are shown in box 6.1.
Box 6.1 Quality ratings
++ All or most of the checklist criteria have been fulfilled, and where they have not been fulfilled the conclusions are very unlikely to alter.
+ Some of the checklist criteria have been fulfilled, and where they have not been fulfilled, or are not adequately described, the conclusions are unlikely to alter.
– Few or no checklist criteria have been fulfilled and the conclusions are likely or very likely to alter.
If a study is not assigned a '++' quality rating, key reasons why this is the case should be recorded, alongside the overall quality rating, and highlighted in the guideline.
The GRADE (Grading of Recommendations Assessment, Development and Evaluation) approach for review questions about interventions has been used in the development of NICE clinical guidelines since 2009. For more details about GRADE, see the Journal of Clinical Epidemiology series, appendix H and the GRADE working group website.
GRADE is a system developed by an international working group for rating the quality of evidence in systematic reviews and guidelines; it can also be used to grade the strength of recommendations in guidelines. The GRADE system is designed for use for reviews and guidelines that examine alternative management strategies or interventions, which may include no intervention or current best management. The key difference from other assessment systems is that GRADE rates the quality of evidence for a particular outcome across studies and does not rate the quality of individual studies.
In order to apply GRADE, the evidence must clearly specify the relevant setting, population, intervention, comparator(s) and outcomes.
Before starting an evidence review, an initial rating should be applied to the importance of outcomes, in order to identify which outcomes of interest are both 'critical' to decision‑making and 'important' to people using services and the public. This rating should be confirmed or, if absolutely necessary, revised after completing the evidence review and documented in the guideline, noting any changes. This should be clearly separated from discussion of the evidence, because there is potential to introduce bias if outcomes are selected on the basis of the results. An example of this would be choosing only outcomes for which there were statistically significant results. It may be important to note outcomes that were not considered important for decision‑making, and why (such as surrogate outcomes if longer‑term, more relevant outcomes are available).
The GRADE system assesses the quality of the evidence for intervention studies by looking at features of the evidence found for each 'critical' and 'important' outcome. This is summarised in box 6.2.
The GRADE system assesses the following features for the evidence found for each 'critical' and each 'important' outcome:
risk of bias (limitations in study design and execution)
inconsistency of results across studies
indirectness of the evidence
imprecision of the effect estimates
publication bias.
For observational studies the effect size, effect of all plausible confounding and evidence of a dose–response relationship are also considered.
The quality of evidence is classified as high, moderate, low or very low (see the GRADE website for more information).
High – further research is very unlikely to change our confidence in the estimate of effect.
Moderate – further research is likely to have an important impact on our confidence in the estimate of effect and may change the estimate.
Low – further research is very likely to have an important impact on our confidence in the estimate of effect and is likely to change the estimate.
Very low – any estimate of effect is very uncertain.
The approach taken by NICE differs from the standard GRADE system in 2 ways:
it also integrates a review of the quality of cost‑effectiveness studies
it does not use 'overall summary' labels for the quality of the evidence across all outcomes or for the strength of a recommendation, but uses the wording of recommendations to reflect the strength of the evidence (see chapter 9).
In addition, although GRADE does not yet cover all types of review questions, GRADE principles can be applied and adapted to other types of questions. The GRADE Working Group continues to refine existing approaches and to develop new approaches, such as GRADE-CERqual for qualitative questions. Developers should check the GRADE website for any new guidance or systems when developing the review protocol if use of GRADE is being considered. Any substantial changes to GRADE as described on the website should be agreed with the NICE staff with responsibility for quality assurance before use.
GRADEpro software can be used to prepare the GRADE profiles. These are evidence profiles that contain a 'quality assessment' section that summarises the quality of the evidence and a 'summary of findings' table that presents the outcome data for each critical and each important outcome. The 'summary of findings' table includes a limited description of the quality of the evidence and may be presented in the evidence review to help readers quickly understand the quality of the evidence base. Full GRADE profiles should also be available (for example, in an appendix).
NICE's equality and diversity duties are expressed in a single public sector equality duty ('the equality duty' see section 1.4). The equality duty supports good decision‑making by encouraging public bodies to understand how different people will be affected by their activities. For NICE, much of whose work involves developing advice for others on what to do, this includes thinking about how people will be affected by its recommendations when these are implemented (for example, by health and social care practitioners). In addition to meeting its legal obligations, NICE is committed to going beyond compliance, particularly in terms of tackling health inequalities. Specifically, NICE considers that it should also take account of socioeconomic status in its equality considerations.
Any equalities data specified in the review protocol should be included in the evidence reviews. At the data extraction stage, reviewers should refer to the PROGRESS‑Plus criteria (including age, sex, sexual orientation, disability, ethnicity, religion, place of residence, occupation, education, socioeconomic position and social capital; Gough et al. 2012) and any other relevant protected characteristics. Review inclusion and exclusion criteria should also take the relevant groups into account.
The following sections should be included in the evidence review:
summary of the evidence, including the 'summary of findings' table from the GRADE profile (if this improves readability and the GRADE system has been used)
full GRADE profiles or links to the profiles in an appendix (if GRADE has been used)
The evidence should usually be presented for each review question; however, alternative methods of presentation may be needed for some evidence reviews (for example, where review questions are closely linked and need to be interpreted together). In these cases, the principles of quality assessment, data extraction and presentation, and evidence statements should still apply.
Any substantial deviations in presentation need to be agreed, in advance, with a member of NICE staff with responsibility for quality assurance.
Evidence tables help to identify the similarities and differences between studies, including the key characteristics of the study population and interventions or outcome measures. This provides a basis for comparison.
Data from identified studies are extracted to standard templates for inclusion in evidence tables. The type of data and study information that should be included depends on the type of study and review question, and should be concise and consistently reported. Appendix H contains examples of evidence tables for quantitative studies (both experimental and observational).
The types of information that could be included are:
bibliography (authors, date)
study aim, type (for example, randomised controlled trial, case–control study) and setting (for example, country)
funding details (if known)
population (for example, source, eligible and selected)
intervention, if applicable (for example, content, who delivers the intervention, duration, method, mode or timing of delivery)
comparator, if applicable (for example, content, intervener, duration, method, mode or timing of delivery)
method of allocation to study groups (if applicable)
outcomes (for example, primary and secondary and whether measures were objective, subjective or otherwise validated)
key findings (for example, effect sizes and confidence intervals for all relevant outcomes and, where appropriate, other information such as numbers needed to treat and considerations of heterogeneity)
inadequately reported or missing data
comments on quality, based on the quality assessment.
If effect sizes with confidence intervals are not being used in any further statistical analysis or reported in GRADE tables, they should be reported, as should exact p values (whether or not significant) and the test from which they were obtained. Where p values are inadequately reported or not given, this should be stated and noted as a quality concern. Any descriptive statistics (including any mean values) indicating the direction of the difference between intervention and comparator should be presented. If no further statistical information is available, this should be clearly stated.
The quality ratings of the study should also be given. When study details are inadequately reported, absent or not applicable, this should be clearly stated.
The type of data that should be included in evidence tables for qualitative studies is shown in the example in appendix H. This could include:
bibliography (authors, date)
location (for example, UK)
funding details (if known)
population or participants
theoretical perspective adopted (such as grounded theory)
key aims, objectives and research questions; methods (including analytical and data collection technique)
key themes/findings (including quotes from participants that illustrate these themes/findings, if appropriate)
gaps and limitations
the study's quality rating.
A summary of the evidence should be produced. The content of this summary will depend on the type of question, the type of evidence included and whether GRADE is used. It should also identify and describe any gaps in the evidence.
The narrative summary places a study and its findings in context. It should highlight key factors influencing the results observed, interpret the results and give more detail than presented in the evidence tables. Each narrative summary should include:
a brief description of the study design, methodology, population, setting and research questions or outcomes (if appropriate) for all relevant studies
a summary of the key findings
a summary of the quality ratings (expanding, as appropriate, on study strengths and weaknesses), applicability issues and any other relevant contextual points.
Commentary on the scale and nature of the evidence base may also be useful.
The narrative summary should conclude with a short discussion, followed by 1 or more evidence statements. These should reflect the key findings, the quantity, quality and consistency of the evidence, and its applicability to the review question (including its applicability to people affected by the guideline).
Narrative summaries of all studies and interventions should be incorporated in the main findings of the evidence review. They should be organised by review question and could be divided into smaller subcategories, such as outcome measure, setting or subpopulation.
If GRADE is used, the narrative summary needs only to be very brief and describe key features of the included studies and any other important information that is not included in the GRADE tables. For example, applicability is included in the GRADE tables so does not need to be included in the narrative summary.
If appropriate (for example, when GRADE is used), short summary tables (based on the 'summary of findings' table from the GRADE profile or the narrative summaries) should be included with the main findings (usually before an evidence statement) or in the appendices. For example, these might:
summarise the information gleaned for different review questions
summarise the study types, populations, interventions, settings or outcomes for each study related to a particular review question
organise and summarise studies related to different outcomes.
Meta-analysis may be appropriate if treatment effect estimates from more than 1 study are available. Recognised approaches to meta‑analysis should be used, as described in the manual from the Centre for Reviews and Dissemination (2009), in Higgins and Green (2011) and in the technical support documents developed by the NICE Decision Support Unit.
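The recognised approaches share the same core arithmetic: each study's effect estimate is weighted by the inverse of its variance, and random-effects models (such as the DerSimonian and Laird method) add an estimate of between-study variance to each study's variance before pooling. A minimal sketch, using invented log risk ratios rather than data from any real review:

```python
import math

# Illustrative study-level effects: log risk ratios and their standard errors
effects = [-0.60, 0.10, -0.45]
ses = [0.15, 0.20, 0.25]

# Fixed-effect (inverse-variance) pooling
w = [1 / se**2 for se in ses]
pooled_fe = sum(wi * yi for wi, yi in zip(w, effects)) / sum(w)
se_fe = math.sqrt(1 / sum(w))

# DerSimonian-Laird estimate of between-study variance tau^2
q = sum(wi * (yi - pooled_fe)**2 for wi, yi in zip(w, effects))
df = len(effects) - 1
c = sum(w) - sum(wi**2 for wi in w) / sum(w)
tau2 = max(0.0, (q - df) / c)

# Random-effects pooling: add tau^2 to each study's variance
w_re = [1 / (se**2 + tau2) for se in ses]
pooled_re = sum(wi * yi for wi, yi in zip(w_re, effects)) / sum(w_re)
se_re = math.sqrt(1 / sum(w_re))

print(f"fixed effect:   {pooled_fe:.3f} (SE {se_fe:.3f})")
print(f"random effects: {pooled_re:.3f} (SE {se_re:.3f}), tau^2 = {tau2:.3f}")
```

In practice, reviewers would use established software rather than hand-rolled code; the sketch only shows why heterogeneous studies pull a random-effects estimate away from the fixed-effect one.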
NICE prefers data from head‑to‑head RCTs to compare the effectiveness of interventions. However, there may be situations when data from head‑to‑head studies of the options (and/or comparators) of interest are not available. In these circumstances, indirect treatment comparison analyses should be considered.
An indirect treatment comparison is an analysis in which the interventions of interest are compared indirectly, using data from a network of trials in which they are each compared with other interventions. A network meta-analysis includes both trials that compare the interventions of interest head‑to‑head and trials that compare them indirectly.
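In the simplest case, with a single common comparator, the adjusted indirect comparison (the Bucher method) illustrates how randomisation is preserved: the A‑versus‑C effect is estimated from the within-trial relative effects of A versus B and C versus B, and the variances add. A sketch with hypothetical log odds ratios (the numbers are invented for illustration):

```python
import math

def bucher_indirect(d_ab, se_ab, d_cb, se_cb):
    """Indirect estimate of A vs C via a common comparator B.

    d_ab: log odds ratio of A vs B; d_cb: log odds ratio of C vs B.
    Because the variances add, the indirect estimate is always less
    precise than either direct comparison.
    """
    d_ac = d_ab - d_cb
    se_ac = math.sqrt(se_ab**2 + se_cb**2)
    return d_ac, se_ac

# Hypothetical trial results (log odds ratios vs the common comparator B)
d_ac, se_ac = bucher_indirect(d_ab=-0.50, se_ab=0.18, d_cb=-0.20, se_cb=0.22)
lo, hi = d_ac - 1.96 * se_ac, d_ac + 1.96 * se_ac
print(f"A vs C: {d_ac:.2f} (95% CI {lo:.2f} to {hi:.2f})")
```

Full network meta-analyses generalise this to many treatments and loops of evidence, which is why the technical support documents from the NICE Decision Support Unit should be followed for anything beyond this simplest case.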
The same principles of good practice for evidence reviews and meta‑analyses should be applied when conducting indirect treatment comparisons or network meta‑analyses. The rationale for identifying and selecting the RCTs should be explained, including the rationale for selecting the treatment comparisons included. A clear description of the methods of synthesis is required. The methods and results of the individual trials should also be documented. If there is doubt about the relevance of particular trials, a sensitivity analysis in which these trials are excluded should also be presented. The heterogeneity between the results of pairwise comparisons and inconsistencies between the direct and indirect evidence on the interventions should be reported, using coherence statistics such as the deviance information criterion (DIC).
When multiple options are being appraised, a network meta‑analysis should be considered. Consideration should also be given to presenting pairwise meta‑analyses to help validate the network meta‑analysis.
When evidence is combined using indirect or network meta‑analytical frameworks, trial randomisation should be preserved. A comparison of the results from single treatment arms from different randomised trials is not acceptable unless the data are treated as observational and appropriate steps are taken to adjust for possible bias and increased uncertainty.
Analyses using indirect or network meta‑analytical frameworks may include comparator interventions (including placebo) that have not been defined in the scope of the guideline if they are relevant to the development of the network of evidence. The rationale for the inclusion and exclusion of comparator interventions should be clearly reported. Again, the principles of good practice apply.
If sufficient relevant and valid data are not available to include in meta‑analyses of head‑to‑head trials, or mixed or indirect treatment comparisons (network meta‑analysis), the analysis may have to be restricted to a qualitative overview that critically appraises individual studies and presents their results.
Further information on complex methods for evidence synthesis is provided by the technical support documents developed by the NICE Decision Support Unit.
Evidence from a network meta‑analysis can be presented in a variety of ways. The network of evidence can be presented as tables. It should also be presented diagrammatically with the direct and indirect treatment comparisons clearly identified and the number of trials in each comparison stated. Further information on how to present the results of network meta‑analyses is provided by the technical support documents developed by the NICE Decision Support Unit.
There are several ways to summarise and illustrate the strength and direction of quantitative evidence about the effectiveness of an intervention if a meta‑analysis is not done. Forest plots can be used to show effect estimates and confidence intervals for each study (when available, or when it is possible to calculate them). They can also be used to provide a graphical representation when it is not appropriate to do a meta‑analysis and present a pooled estimate. However, the homogeneity of the outcomes and measures in the studies needs to be carefully considered: the forest plot needs data derived from the same (or justifiably similar) outcomes and measures.
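For dichotomous outcomes, the per-study inputs a forest plot needs (an effect estimate and its confidence interval) can be derived from each study's 2×2 table. A hedged sketch using invented counts and log risk ratios; any plotting tool could then draw a point and horizontal interval for each row:

```python
import math

def log_risk_ratio(events_int, n_int, events_ctl, n_ctl):
    """Log risk ratio and 95% CI from a study's 2x2 table."""
    rr = (events_int / n_int) / (events_ctl / n_ctl)
    log_rr = math.log(rr)
    # Standard error of the log risk ratio
    se = math.sqrt(1 / events_int - 1 / n_int + 1 / events_ctl - 1 / n_ctl)
    return log_rr, log_rr - 1.96 * se, log_rr + 1.96 * se

# Invented studies: (events, total) in intervention and control arms
studies = {
    "Study A": (12, 100, 20, 100),
    "Study B": (30, 250, 45, 250),
    "Study C": (8, 60, 9, 55),
}
for name, (ei, ni, ec, nc) in studies.items():
    est, lo, hi = log_risk_ratio(ei, ni, ec, nc)
    print(f"{name}: log RR {est:+.2f} (95% CI {lo:+.2f} to {hi:+.2f})")
```

The caveat in the text applies here too: plotting these rows together is only meaningful if the studies report the same (or justifiably similar) outcomes and measures.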
If a forest plot is not appropriate, other graphical forms may be used (for example, a harvest plot [Ogilvie et al. 2008]).
If additional statistical analysis, such as meta‑analysis, is not possible or appropriate, a narrative summary of the evidence and its quality should be presented.
For more information on summarising and presenting results for studies of cost effectiveness, see chapter 7.
Information on methods of presenting and synthesising evidence on diagnostic test accuracy is being developed (http://srdta.cochrane.org and www.gradeworkinggroup.org). If meta‑analysis is not possible or appropriate, a narrative summary of the quality of the evidence should be based on the quality appraisal criteria from QUADAS‑2 (see appendix H) that were considered most important for the review question being addressed.
Numerical summaries of evidence on diagnostic test accuracy may be presented as tables. Meta‑analysis of numerical summaries from different studies can be complex and relevant published technical advice (such as that from the NICE Technical Support Unit or Decision Support Unit) should be used to guide reviewers.
Numerical summaries and analyses should be followed by a short evidence statement summarising what the evidence shows.
There is currently no well‑designed and validated approach for summarising evidence from studies on prognosis or prediction models. A narrative summary of the quality of the evidence should therefore be given, based on the quality appraisal criteria from appendix H that were considered most important for the review question being addressed. Characteristics of data should be extracted to a standard template for inclusion in an evidence table (see appendix H). Methods for presenting and synthesising evidence on prognosis and prediction models are being developed (www.gradeworkinggroup.org).
Results from the studies included may be presented as tables to help summarise the available evidence. Reviewers should be wary of using meta‑analysis to summarise large observational studies, because the results obtained may give unfounded confidence in the study results. However, results should be presented consistently across studies (for example, the median and ranges of predictive values across all the studies).
The narrative summary should be followed by a short evidence statement summarising what the evidence shows.
Summarising and presenting results of studies of the views and experiences of people using services, their families and carers, the public or practitioners
The quality of the evidence should be described in a narrative summary, based on the quality appraisal criteria from appendix H that were considered the most important for the review question being addressed. If appropriate, the quality of the cross‑sectional surveys included should also be summarised.
The quality assessment of included studies could be presented in tables. Methods for synthesising evidence from qualitative studies (for example, meta-ethnography) are evolving, but the routine use of such methods in guidelines is not currently recommended.
The narrative summary should be followed by a short evidence statement summarising what the evidence shows. Characteristics of data should be extracted to a standard template for inclusion in an evidence table (see appendix H).
Qualitative evidence occurs in many forms and formats and so different methods may be used to synthesise and present it. As with all data synthesis, the key is transparency. It is important that the method used can be easily followed. It should be written up in clear English and any analytical decisions should be clearly justified.
In some cases, the evidence is synthesised and then summarised. In other cases, a narrative summary may be adequate. The approach used depends on the volume and consistency of the evidence. If the qualitative evidence is extensive, then a recognised method of synthesis is preferable. If the evidence is more disparate and sparse, a narrative summary approach may be more appropriate.
Qualitative reviews may comprise relatively few papers or have an inconsistent focus (for example, they may involve different settings, populations or interventions). If the papers have little in common, it is not appropriate to synthesise them. Instead, a narrative summary of the key themes (including illustrative quotes) of each paper should be provided, as well as a full evidence table for each study (for example, the methods, the participants and the underlying rationale).
Both the narrative summary and the evidence table should identify all the main themes reported: only themes that are not relevant to the review should be left out, and these omissions should be clearly documented. As in all qualitative research, particular attention should be paid to 'outliers' (atypical themes) and views that disagree with or contradict the main body of research.
The narrative summary should be divided up under headings derived from the review question (for example, the settings of interest) unless good reasons are documented for not doing so. The narrative should be summarised into evidence statements that note areas of agreement and contradiction.
The simplest and most rigorous approach to presenting qualitative data in a meaningful way is to analyse the themes (or 'meta' themes) in the evidence tables and write a narrative based on them. This 'second level' thematic analysis can be carried out if enough data are found, and the papers and research reports cover the same (or similar) factors or use similar methods. (These should be relevant to the review questions and could, for example, include intervention, age, population or setting.)
Synthesis can be carried out in 1 of 2 ways. The simpler approach is to group together papers reporting on the same factors to compare and contrast themes, focusing not just on consistency but also on any differences. The narrative should be based on these themes.
A more complex but useful approach is 'conceptual mapping' (see Johnson et al. 2000). This involves identifying the key themes and concepts across all the evidence tables and grouping them into first level (major), second level (associated) and third level (subthemes) themes. Results are presented in schematic form as a conceptual diagram and the narrative is based on the structure of the diagram.
Alternatively, themes can be identified and extracted directly from the data, using a grounded approach (see Glaser and Strauss 1967). Other potential techniques include meta‑ethnography (see Noblit and Hare 1988) and meta‑synthesis (see Barroso and Powell‑Cope 2000), but expertise in their use is needed.
Any review or, particularly, any synthesis of qualitative data, must by its nature mask some of the variations considered important by qualitative researchers (for example, the way the researcher interacts with research participants when gathering data). Reviewers should, as far as possible, highlight any significant causes of variation noted during data extraction.
Evidence reviews for both qualitative and quantitative studies should include a narrative summary and GRADE tables where used, and should conclude with 1 or more supporting evidence statements.
The evidence statements should provide an aggregated summary of all of the relevant studies or analyses (such as economic models or network meta‑analyses), regardless of their findings. They should reflect the balance of the evidence, its strength (quality, quantity and consistency) and applicability. The evidence statements should summarise key aspects of the evidence but can also highlight where there is a lack of evidence (note that this is different to evidence for a lack of effect). In the case of intervention studies, evidence statements should reflect what is plausible, given the evidence available about what has worked in similar circumstances. This may also be supported by additional information about aspects of the evidence such as setting, applicability or methodological issues.
Evidence statements are structured and written to help Committees formulate and prioritise recommendations. They help Committees decide:
whether or not there is sufficient evidence (in terms of strength and applicability) to form a judgement
whether (on balance) the evidence demonstrates that an intervention, approach or programme can be effective or is inconclusive
the typical size of effect (where there is one) and associated measure of uncertainty
whether the evidence is applicable to people affected by the guideline and contexts covered by the guideline.
Evidence statements should be included in the final guideline.
One or more evidence statements are prepared for each review question or subsidiary question. (Subsidiary questions may cover a type of intervention, specific population groups, a setting or an outcome.)
Each evidence statement should stand alone as an accessible, clear summary of key information used to support the recommendations (see section 9.1). The guideline should ensure that the relationship between the recommendations and the supporting evidence statements is clear.
Evidence statements should refer to the sources of evidence and their quality in brief descriptive terms and not just by acronyms. Each statement should also include summary information about the:
content of the intervention, if applicable (for example, what, how, where?)
population(s) and setting(s) (for example, country), if applicable
outcome(s), the direction of effect (or correlation) and the size of effect (or correlation) if applicable
strength of evidence (reflecting the appropriateness of the study design to answer the question and the quality, quantity and consistency of evidence)
applicability to the question, people affected by the guideline and setting (see section 6.3).
Note that the strength of the evidence is reported separately from the direction and size of the effects or correlations observed (if applicable).
Where important, the evidence statement should also summarise information about:
whether the intervention has been delivered as it should be (fidelity of the intervention)
what affects the intervention achieving the outcome (mechanism of action).
Terms that describe the strength of the evidence should be used consistently and their definitions should be reported in the methodology section. A set of standardised terms is given in box 6.3. However, the evidence base for each review may vary, so the review team should define how these terms have been used.
Box 6.3 Examples of standardised terms for describing the strength of the evidence
No evidence1 'No evidence was found from English‑language trials published since 1990…'. (Be clear about the sources and inclusion criteria.)
Weak evidence 'There was weak evidence from 1 (−) RCT'.
Moderate evidence 'There was moderate evidence from 2 (+) controlled before and after studies'.
Strong evidence 'There was strong evidence from 2 (++) controlled before and after studies and 1 (+) RCT'.
Inconsistent evidence. Further commentary may be needed on the variability of findings in different studies, for example, when the results of (++) or (+) quality studies do not agree. In such cases, the review team may qualify an evidence statement with an explanatory sentence or section that gives more detail.
1 Note that no evidence is not the same as evidence of no effect.
The terms should not be used to describe other aspects of the evidence, such as applicability or size of effect (see below for suitable terminology).
'Vote counting' (merely reporting on the number of studies) is not an acceptable summary of the evidence.
If appropriate, the direction of effect (impact) or correlation should be summarised using 1 of the following terms:
However, appropriate context/topic‑specific terms (for example, 'an increase in HIV incidence', 'a reduction in injecting drug use' and 'smoking cessation') may be used.
If appropriate, the size of effect (impact) or correlation, and the degree of uncertainty involved, should be reported using the scale applied in the relevant study. For example, an odds ratio (OR) or relative risk (RR) with confidence interval (CI), or a standardised effect size and its standard error, may be quoted. Where an estimate cannot be readily interpreted, every effort should be made to relate it to interpretable criteria or conventional public health measures. If it is not possible to provide figures for each study, or if there are too many studies to make this feasible, the size of effect or correlation can be summarised using the following standardised terms:
These terms should be used consistently in each review and their definitions should be reported in the methodology section.
An example of an evidence statement about the effectiveness of an intervention is given in box 6.4 and an example of an evidence statement from a correlates review is given in box 6.5. These examples have been adapted from the originals and are for illustrative purposes only:
Box 6.4 Example of an evidence statement about the effectiveness of an intervention
There is strong evidence from 4 studies (2 UK1,2 and 2 US3,4) to suggest that educational interventions delivered by youth workers may reduce the incidence of hazardous drinking by young people. Two (++) RCTs1,2 and 1 (+) NRCT3 showed reduced risk (95% confidence interval) in the intervention group: 0.75 (0.58–0.94)1; 0.66 (0.57–0.78)2; 0.42 (0.18–0.84)3. Another (+) RCT4 showed reduced risk but was not statistically significant: 0.96 (0.84–1.09). However, 1 (−) NRCT5 found increased risk of binge drinking in the intervention group: 1.40 (1.21–1.74).
1 Huntley et al. 2009 (++).
2 Axe et al. 2008 (++).
3 Carmona et al. 2010 (+).
4 White et al. 2007 (+).
5 Kelly et al. 2006 (−).
Box 6.5 Example of an evidence statement from a correlates review
There is moderate evidence from 3 UK cross-sectional studies (2 [+]1,2 and 1 [−]3) about the correlation between young people's communication skills around safer sex and a reduction in the number of teenage pregnancies. The evidence about the strength of this correlation is mixed. One (+) study1 found that discussing condom use with new partners was associated with actual condom use at first sex (OR 2.67 [95% CI 1.55–4.57]). Another (−) study3 found that not talking to a partner about protection before first sexual intercourse was associated with teenage pregnancy (OR 1.67 [1.03–2.72]). However, another (+) study2 found only small correlations between condom use and discussions about safer sex (r=0.072, p<0.01), and between condom use and communication skills (r=0.204, p<0.01).
1 Kettle et al. 2007 (+).
2 Jarrett et al. 2007 (+).
3 Morgan et al. 2000 (−).
OR, odds ratio; CI, confidence interval.
The Committee also needs to judge the extent to which the evidence reported in the reviews is applicable to the areas for which it is developing recommendations. A body of evidence should be assessed to determine how similar the population(s), setting(s), intervention(s) and outcome(s) of the selected studies are to those outlined in the review question(s).
The following characteristics should be considered:
population – age, sex/gender, race/ethnicity, disability, sexual orientation, gender re‑assignment, religion/beliefs, pregnancy and maternity, socioeconomic status, health status (for example, severity of illness/disease), other characteristics specific to the topic area/review question(s)
setting – country, geographical context (for example, urban/rural), delivery system, legislative, policy, cultural, socioeconomic and fiscal context, other characteristics specific to the topic area/review question(s)
intervention – feasibility (for example, in terms of health and social care services/costs), practicalities (for example, experience/training needed), acceptability (for example, number of visits/adherence needed), accessibility (for example, transport/outreach needed), other characteristics specific to the topic area/review question(s)
outcomes – appropriate/relevant, follow‑up periods, important health effects.
After this assessment, the body of evidence in each evidence statement should be categorised as:
A statement detailing the category it falls into and the reasons why should appear at the end of the evidence statement. It should state: 'This evidence is (directly, partially or not) applicable because ...'. An example of an applicability statement is shown in box 6.6.
Box 6.6 Example of an applicability statement
This evidence is only partially applicable to people in the UK who inject drugs. That is because all these studies were conducted in countries in which needles are mainly sold by pharmacies (USA, Russia and France), rather than freely distributed, as is the norm in the UK1.
1 This has been adapted from the original and is for illustrative purposes only.
If the Committee is not able to judge the extent to which the evidence reported in the reviews is applicable to the areas/topics for which it is developing recommendations, it may ask for additional information on the applicability of the evidence.
Although similar issues are considered when assessing the applicability of economic data, there are some important differences (see chapter 7).
A summary of the assessment should be included when describing the link between the evidence and the recommendations (see section 9.1).
If GRADE is used, short evidence statements for outcomes should be presented after the GRADE profiles, summarising the key features of the evidence on clinical effectiveness (including adverse events as appropriate) and cost effectiveness. The evidence statements should include the number of studies and participants, the quality of the evidence and the direction of estimate of the effect (see box 6.7 for examples of evidence statements), and the importance of the effect (that is, whether the size of the effect is meaningful). An evidence statement may be needed even if no evidence is identified for a critical or important outcome.
Box 6.7 Examples of evidence statements if GRADE is used
Moderate quality evidence from 12 studies with several thousand patients showed that prostaglandin analogues are more effective than beta‑blockers in reducing IOP from baseline at 6 to 36 months follow up, but the effect size is too small to be clinically important.
One study with 126 patients presented moderate quality evidence that a 6‑week supported self‑help rehabilitation manual improved the recovery of patients' physical function 8 weeks and 6 months after ICU discharge.
Three studies with 773 children presented high quality evidence that a delayed strategy reduced the consumption of antibiotics by 63% compared with an immediate prescribing strategy.
Evidence statements developed from qualitative data do not usually report the impact of an intervention on behaviour or outcomes, and do not report statistical effects or aggregate measures of strength and effect size. They should summarise the evidence, its context and quality, and the consistency of key findings and themes across studies. Areas where there is little (or no) concurrence should also be summarised. An example of an evidence statement developed from qualitative data is given in box 6.8.
Box 6.8 Example of evidence statements developed from qualitative data
Two UK studies (1 [+]1 and 1 [++]2) and 1 (+) Dutch study3 reported on the views of teenage mothers. In 1 (+) study1 of teenage mothers interviewed in a family planning clinic and 1 (++) study2 of teenage mothers who responded to a questionnaire at their GP surgery, the participants agreed that access to education was the thing that helped them most after they had their child. However, this was not reported as a key theme in the Dutch study3 of health visitor perceptions of teenage mothers' needs.
1 Ellis 1999 (+)
2 Swann 2000 (++)
3 Nolan 2004 (+).
Six studies comprising 94 participants showed that information on the diagnosis was highly desired and should be provided as soon as possible to reduce anxiety. Information that goes beyond merely conveying facts and also directs patients and carers to practical sources of support was a common wish.
AGREE Collaboration (2003) Development and validation of an international appraisal instrument for assessing the quality of clinical practice guidelines: the AGREE project. Quality and Safety in Health Care 12: 18–23
Altman DG (2001) Systematic reviews of evaluations of prognostic variables. British Medical Journal 323: 224–8
Balshem H, Helfand M, Schünemann HJ et al. (2011) GRADE guidelines: 3. Rating the quality of evidence. Journal of Clinical Epidemiology 64: 401–6
Barroso J, Powell‑Cope GM (2000) Meta-synthesis of qualitative research on living with HIV infection. Qualitative Health Research 10: 340–53
Bowling A (2002) Research methods in health: investigating health and health services. Buckingham: Open University Press
Brouwers M, Kho ME, Browman GP et al. for the AGREE Next Steps Consortium (2010) AGREE II: advancing guideline development, reporting and evaluation in healthcare. Canadian Medical Association Journal 182: E839–42
Centre for Reviews and Dissemination (2009) Systematic reviews: CRD's guidance for undertaking reviews in health care. University of York: Centre for Reviews and Dissemination
Chiou CF, Hay JW, Wallace JF et al. (2003) Development and validation of a grading system for the quality of cost‑effectiveness studies. Medical Care 41: 32–44
Dixon‑Woods M, Agarwal S, Young B et al. (2004) Integrative approaches to qualitative and quantitative evidence. London: Health Development Agency
Drummond MF, O'Brien B, Stoddart GL et al. (1997) Critical assessment of economic evaluation. In: Methods for the economic evaluation of health care programmes, 2nd edition. Oxford: Oxford Medical Publications
Eccles M, Mason J (2001) How to develop cost-conscious guidelines. Health Technology Assessment 5: 1–69
Edwards P, Clarke M, DiGuiseppi C et al. (2002) Identification of randomized trials in systematic reviews: accuracy and reliability of screening records. Statistics in Medicine 21: 1635–40
Egger M, Davey Smith G, Altman DG (2000) Systematic reviews in health care: meta‑analysis in context. London: British Medical Journal Books
Evers SMAA, Goossens M, de Vet H et al. (2005) Criteria list for assessment of methodological quality of economic evaluations: Consensus on Health Economic Criteria. International Journal of Technology Assessment in Health Care 21: 240–5
Glaser BG, Strauss AL (1967) The discovery of grounded theory: strategies for qualitative research. New York: Aldine de Gruyter
Gough D, Oliver S, Thomas J, editors (2012) An introduction to systematic reviews. London: Sage
GRADE Working Group (2004) Grading quality of evidence and strength of recommendations. British Medical Journal 328: 1490–4
Guyatt GH, Oxman AD, Schünemann HJ et al. (2011) GRADE guidelines: a new series of articles in the Journal of Clinical Epidemiology. Journal of Clinical Epidemiology 64: 380–2
Guyatt GH, Oxman AD, Akl EA et al. (2011) GRADE guidelines: 1. Introduction – GRADE evidence profiles and summary of findings tables. Journal of Clinical Epidemiology 64: 383–94
Guyatt GH, Oxman AD, Kunz R et al. (2011) GRADE guidelines: 2. Framing the question and deciding on important outcomes. Journal of Clinical Epidemiology 64: 395–400
Guyatt GH, Oxman AD, Vist G et al. (2011) GRADE guidelines: 4. Rating the quality of evidence – study limitations (risk of bias). Journal of Clinical Epidemiology 64: 407–15
Guyatt GH, Oxman AD, Montori V et al. (2011) GRADE guidelines 5: Rating the quality of evidence – publication bias. Journal of Clinical Epidemiology 64: 1277–82
Guyatt GH, Oxman AD, Kunz R et al. (2011) GRADE guidelines 6: Rating the quality of evidence – imprecision. Journal of Clinical Epidemiology 64: 1283–93
Guyatt GH, Oxman AD, Kunz R et al. (2011) GRADE guidelines 7: Rating the quality of evidence – inconsistency. Journal of Clinical Epidemiology 64: 1294–302
Guyatt GH, Oxman AD, Kunz R et al. (2011) GRADE guidelines 8: Rating the quality of evidence – indirectness. Journal of Clinical Epidemiology 64: 1303–10
Guyatt GH, Oxman AD, Sultan S et al. (2011) GRADE guidelines 9: Rating up the quality of evidence. Journal of Clinical Epidemiology 64: 1311–6
Harbord RM, Deeks JJ, Egger M et al. (2007) A unification of models for meta‑analysis of diagnostic accuracy studies. Biostatistics 8: 239–51
Harden A, Garcia J, Oliver S et al. (2004) Applying systematic review methods to studies of people's views: an example from public health research. Journal of Epidemiology and Community Health 58: 794–800
Higgins JPT, Green S, editors (2011) Cochrane handbook for systematic reviews of interventions. Version 5.1.0 (updated March 2011)
Jackson N, Waters E for the Guidelines for Systematic Reviews of Health Promotion and Public Health Interventions Taskforce (2005) Guidelines for systematic reviews of health promotion and public health interventions. Australia: Deakin University
Johnson JA, Biegel DE, Shafran R (2000) Concept mapping in mental health: uses and adaptations. Evaluation and Programme Planning 23: 67–75
Kelly MP, Swann C, Morgan A et al. (2002) Methodological problems in constructing the evidence base in public health. London: Health Development Agency
Khan KS, Kunz R, Kleijnen J et al. (2003) Systematic reviews to support evidence‑based medicine. How to review and apply findings of healthcare research. London: Royal Society of Medicine Press
National Collaborating Centre for Methods and Tools (2011) AMSTAR: assessing methodological quality of systematic reviews. Hamilton, Ontario: McMaster University
Noblit G, Hare RD (1988) Meta‑ethnography: synthesising qualitative studies. London: Sage
Ogilvie D, Hamilton V, Egan M et al. (2005) Systematic reviews of health effects of social interventions: 1. Finding the evidence: how far should you go? Journal of Epidemiology and Community Health 59: 804–8
Ogilvie D, Egan M, Hamilton V et al. (2005) Systematic reviews of health effects of social interventions: 2. Best available evidence: how low should you go? Journal of Epidemiology and Community Health 59: 886–92
Ogilvie D, Fayter D, Petticrew M et al. (2008) The harvest plot: a method for synthesising evidence about the differential effects of interventions. BMC Medical Research Methodology 8: 8
Oxford Centre for Evidence‑Based Medicine (2009) Levels of evidence and grades of recommendation.
Oxman AD, Guyatt GH (1992) A consumer's guide to subgroup analyses. Annals of Internal Medicine 116: 78–84
Petticrew M (2003) Why certain systematic reviews reach uncertain conclusions. British Medical Journal 326: 756–8
Petticrew M, Roberts H (2003) Evidence, hierarchies, and typologies: horses for courses. Journal of Epidemiology and Community Health 57: 527–9
Philips Z, Ginnelly L, Sculpher M et al. (2004) Review of guidelines for good practice in decision-analytic modelling in health technology assessment. Health Technology Assessment 8: 1–158
Popay J, editor (2005) Moving beyond effectiveness in evidence synthesis: methodological issues in the synthesis of diverse sources of evidence. London: National Institute for Health and Clinical Excellence
Popay J, Rogers A, Williams G (1998) Rationale and standards for the systematic review of qualitative literature in health services research. Qualitative Health Research 8: 341–51
Ring N, Jepson R and Ritchie K (2011) Methods of synthesizing qualitative research studies for health technology assessment. International Journal of Technology Assessment in Health Care 27: 384–390
Rychetnik L, Frommer M, Hawe P et al. (2002) Criteria for evaluating evidence on public health interventions. Journal of Epidemiology and Community Health 56: 119
Schünemann HJ, Best D, Vist G et al. for the GRADE Working Group (2003) Letters, numbers, symbols and words: how to communicate grades of evidence and recommendations. Canadian Medical Association Journal 169: 677–80
Schünemann HJ, Oxman AD, Brozek J et al. for the GRADE Working Group (2008) Grading quality of evidence and strength of recommendations for diagnostic tests and strategies. British Medical Journal 336: 1106–10
Scottish Intercollegiate Guidelines Network (2008) SIGN 50. A guideline developer's handbook (revised edition). Edinburgh: Scottish Intercollegiate Guidelines Network
Sharp SJ, Thompson SG (2000) Analysing the relationship between treatment effect and underlying risk in meta-analysis: comparison and development of approaches. Statistics in Medicine 19: 3251–74
Sutton AJ, Jones DR, Abrams KR et al. (2000) Methods for meta‑analysis in medical research. London: John Wiley
Swann C, Falce C, Morgan A et al. (2005) HDA evidence base: process and quality standards for evidence briefings. London: Health Development Agency
Tooth L, Ware R, Bain C et al. (2005) Quality of reporting of observational longitudinal research. American Journal of Epidemiology 161: 280–8
Tugwell P, Pettigrew M, Kristjansson E et al. (2010) Assessing equity in systematic reviews: realising the recommendations of the Commission on the Social Determinants of Health. British Medical Journal 341: 4739
Turner RM, Spiegelhalter DJ, Smith GC et al. (2009) Bias modelling in evidence synthesis. Journal of the Royal Statistical Society, Series A (Statistics in Society) 172: 21–47
Victora C, Habicht J, Bryce J (2004) Evidence-based public health: moving beyond randomized trials. American Journal of Public Health 94: 400–5
Weightman A, Ellis S, Cullum A et al. (2005) Grading evidence and recommendations for public health interventions: developing and piloting a framework. London: Health Development Agency
Whiting PF, Rutjes AWS, Westwood ME et al. and the QUADAS‑2 group (2011) QUADAS-2: a revised tool for the quality assessment of diagnostic accuracy studies. Annals of Internal Medicine 155: 529–36