MERRIMACK COLLEGE MCQUADE LIBRARY

MPA 6200: Research Methods and Evaluation (Nguyen)


Writing a Literature Review

Synthesis visualization.

  • Literature Review Class Activity
  • Lit Review Matrix
  • Lit Review Organizer
  • Lit Review Worksheet 1
  • Lit Review Worksheet 2
  • Lit Review Worksheet 3
  • Lit Review Template
  • Click on the activity link above
  • Select File > Make a Copy
  • Complete the activity on YOUR COPY


Online Resources

  • Basics of a Literature Review (Merrimack College's Writing Center)
  • Library Guide to Capstone Literature Reviews: Role of the Literature Review
  • The Literature Review: A Few Tips on Writing It (University of Toronto)
  • Lit Review Matrix This one is customized for Higher Education students, but may be helpful for others.
  • Matrix Examples This page from Walden University gives examples of different types of literature review matrices. A matrix can be very helpful in taking notes and preparing sources for your literature review.
  • OWL's Literature Reviews
  • Review of Literature (UW-Madison Writing Center)
  • UNC at Chapel Hill's Literature Reviews

What Is a Literature Review? 

A literature review is a survey of scholarly articles, books, or other sources that pertain to a specific topic, area of research, or theory. The literature review offers brief descriptions, summaries, and critical evaluations of each work, and does so in the form of a well-organized essay. Scholars often write literature reviews to provide an overview of the most significant recent literature published on a topic. They also use literature reviews to trace the evolution of certain debates or intellectual problems within a field. Even if a literature review is not a formal part of a research project, students should conduct an informal one so that they know what kind of scholarly work has been done previously on the topic that they have selected. 

How Is a Literature Review Different from a Research Paper? 

An academic research paper attempts to develop a new argument and typically has a literature review as one of its parts. In a research paper, the author uses the literature review to show how his or her new insights build upon and depart from existing scholarship. A literature review by itself does not try to make a new argument based on original research but rather summarizes, synthesizes, and critiques the arguments and ideas of others, and points to gaps in the current literature. Before writing a literature review, a student should look for a model from a relevant journal or ask the instructor to point to a good example. 

Organizing a Literature Review  

A successful literature review should have three parts that break down in the following way: 

INTRODUCTION 

  • Defines and identifies the topic and establishes the reason for the literature review. 
  • Points to general trends in what has been published about the topic. 
  • Explains the criteria used in analyzing and comparing articles. 

BODY OF THE REVIEW 

  • Groups articles into thematic clusters, or subtopics. Clusters may be grouped together chronologically, thematically, or methodologically (see below for more on this).
  • Proceeds in a logical order from cluster to cluster. 
  • Emphasizes the main findings or arguments of the articles in the student’s own words. Keeps quotations from sources to an absolute minimum. 

CONCLUSION 

  • Summarizes the major themes that emerged in the review and identifies areas of controversy in the literature. 
  • Pinpoints strengths and weaknesses among the articles (innovative methods used, gaps in research, problems with theoretical frameworks, etc.). 
  • Concludes by formulating questions that need further research within the topic, and provides some insight into the relationship between that topic and the larger field of study or discipline. 

Literature review evaluation methods (image): In the four examples of student writing shown, only one demonstrates good synthesis: the fourth column, Student D.

  • Last Updated: May 9, 2024 3:32 PM
  • URL: https://libguides.merrimack.edu/MPA6200_Nguyen
Literature Review: The What, Why and How-to Guide — Introduction


What are Literature Reviews?

So, what is a literature review? "A literature review is an account of what has been published on a topic by accredited scholars and researchers. In writing the literature review, your purpose is to convey to your reader what knowledge and ideas have been established on a topic, and what their strengths and weaknesses are. As a piece of writing, the literature review must be defined by a guiding concept (e.g., your research objective, the problem or issue you are discussing, or your argumentative thesis). It is not just a descriptive list of the material available, or a set of summaries." Taylor, D. The literature review: A few tips on conducting it. University of Toronto Health Sciences Writing Centre.

Goals of Literature Reviews

What are the goals of creating a Literature Review? A literature review could be written to accomplish different aims:

  • To develop a theory or evaluate an existing theory
  • To summarize the historical or existing state of a research topic
  • To identify a problem in a field of research

Baumeister, R. F., & Leary, M. R. (1997). Writing narrative literature reviews. Review of General Psychology, 1(3), 311-320.

What kinds of writing projects require a Literature Review?

  • A research paper assigned in a course
  • A thesis or dissertation
  • A grant proposal
  • An article intended for publication in a journal

All these instances require you to collect what has been written about your research topic so that you can demonstrate how your own research sheds new light on the topic.

Types of Literature Reviews

What kinds of literature reviews are written?

Narrative review: The purpose of this type of review is to describe the current state of the research on a specific topic and to offer a critical analysis of the literature reviewed. Studies are grouped by research/theoretical categories, and themes and trends, strengths and weaknesses, and gaps are identified. The review ends with a conclusion section that summarizes the findings regarding the state of the research on the topic, identifies gaps, and, if applicable, explains how the author's research will address those gaps and expand knowledge on the topic reviewed.

  • Example: Predictors and Outcomes of U.S. Quality Maternity Leave: A Review and Conceptual Framework. doi:10.1177/08948453211037398

Systematic review: "The authors of a systematic review use a specific procedure to search the research literature, select the studies to include in their review, and critically evaluate the studies they find." (p. 139). Nelson, L. K. (2013). Research in Communication Sciences and Disorders. Plural Publishing.

  • Example: The effect of leave policies on increasing fertility: a systematic review. doi:10.1057/s41599-022-01270-w

Meta-analysis: "Meta-analysis is a method of reviewing research findings in a quantitative fashion by transforming the data from individual studies into what is called an effect size and then pooling and analyzing this information. The basic goal in meta-analysis is to explain why different outcomes have occurred in different studies." (p. 197). Roberts, M. C., & Ilardi, S. S. (2003). Handbook of Research Methods in Clinical Psychology. Blackwell Publishing.

  • Example: Employment Instability and Fertility in Europe: A Meta-Analysis. doi:10.1215/00703370-9164737

Meta-synthesis: "Qualitative meta-synthesis is a type of qualitative study that uses as data the findings from other qualitative studies linked by the same or related topic." (p. 312). Zimmer, L. (2006). Qualitative meta-synthesis: A question of dialoguing with texts. Journal of Advanced Nursing, 53(3), 311-318.

  • Example: Women’s perspectives on career successes and barriers: A qualitative meta-synthesis. doi:10.1177/05390184221113735

Literature Reviews in the Health Sciences

  • UConn Health subject guide on systematic reviews: explains the different review types used in health sciences literature and provides tools to help you find the right review type.
  • Last Updated: Sep 21, 2022 2:16 PM
  • URL: https://guides.lib.uconn.edu/literaturereview


Harvey Cushing/John Hay Whitney Medical Library


YSN Doctoral Programs: Steps in Conducting a Literature Review


What is a literature review?

A literature review is an integrated analysis -- not just a summary -- of scholarly writings and other relevant evidence related directly to your research question. That is, it represents a synthesis of the evidence that provides background information on your topic and shows an association between the evidence and your research question.

A literature review may be a stand-alone work or the introduction to a larger research paper, depending on the assignment. Rely heavily on the guidelines your instructor has given you.

Why is it important?

A literature review is important because it:

  • Explains the background of research on a topic.
  • Demonstrates why a topic is significant to a subject area.
  • Discovers relationships between research studies/ideas.
  • Identifies major themes, concepts, and researchers on a topic.
  • Identifies critical gaps and points of disagreement.
  • Discusses further research questions that logically come out of the previous studies.

APA 7 Style Resources

APA Style Blog (for those harder-to-find answers)

1. Choose a topic. Define your research question.

Your literature review should be guided by your central research question.  The literature represents background and research developments related to a specific research question, interpreted and analyzed by you in a synthesized way.

  • Make sure your research question is not too broad or too narrow.  Is it manageable?
  • Begin writing down terms that are related to your question. These will be useful for searches later.
  • If you have the opportunity, discuss your topic with your professor and your classmates.

2. Decide on the scope of your review

How many studies do you need to look at? How comprehensive should it be? How many years should it cover? 

  • This may depend on your assignment.  How many sources does the assignment require?

3. Select the databases you will use to conduct your searches.

Make a list of the databases you will search. 

Where to find databases:

  • use the tabs on this guide
  • Find other databases in the Nursing Information Resources web page
  • More on the Medical Library web page
  • ... and more on the Yale University Library web page

4. Conduct your searches to find the evidence. Keep track of your searches.

  • Use the key words in your question, as well as synonyms for those words, as terms in your search. Use the database tutorials for help.
  • Save the searches in the databases. This saves time when you want to redo, or modify, the searches. It is also helpful to use as a guide if the searches are not finding any useful results.
  • Review the abstracts of research studies carefully. This will save you time.
  • Use the bibliographies and references of research studies you find to locate others.
  • Check with your professor, or a subject expert in the field, if you are missing any key works in the field.
  • Ask your librarian for help at any time.
  • Use a citation manager, such as EndNote, as the repository for your citations. See the EndNote tutorials for help.

5. Review the literature

Some questions to help you analyze the research:

  • What was the research question of the study you are reviewing? What were the authors trying to discover?
  • Was the research funded by a source that could influence the findings?
  • What were the research methodologies? Analyze the study's literature review, the samples and variables used, the results, and the conclusions.
  • Does the research seem to be complete? Could it have been conducted more soundly? What further questions does it raise?
  • If there are conflicting studies, why do you think that is?
  • How are the authors viewed in the field? Has this study been cited? If so, how has it been analyzed?

Tips: 

  • Review the abstracts carefully.  
  • Keep careful notes so that you may track your thought processes during the research process.
  • Create a matrix of the studies for easy analysis, and synthesis, across all of the studies.
  • Last Updated: Jan 4, 2024 10:52 AM
  • URL: https://guides.library.yale.edu/YSNDoctoral

Purdue Online Writing Lab (Purdue OWL®), College of Liberal Arts

Writing a Literature Review


Copyright ©1995-2018 by The Writing Lab & The OWL at Purdue and Purdue University. All rights reserved. This material may not be published, reproduced, broadcast, rewritten, or redistributed without permission. Use of this site constitutes acceptance of our terms and conditions of fair use.

A literature review is a document or section of a document that collects key sources on a topic and discusses those sources in conversation with each other (also called synthesis). The lit review is an important genre in many disciplines, not just literature (i.e., the study of works of literature such as novels and plays). When we say “literature review” or refer to “the literature,” we are talking about the research (scholarship) in a given field. You will often see the terms “the research,” “the scholarship,” and “the literature” used mostly interchangeably.

Where, when, and why would I write a lit review?

There are a number of different situations where you might write a literature review, each with slightly different expectations; different disciplines, too, have field-specific expectations for what a literature review is and does. For instance, in the humanities, authors might include more overt argumentation and interpretation of source material in their literature reviews, whereas in the sciences, authors are more likely to report study designs and results in their literature reviews; these differences reflect these disciplines’ purposes and conventions in scholarship. You should always look at examples from your own discipline and talk to professors or mentors in your field to be sure you understand your discipline’s conventions, for literature reviews as well as for any other genre.

A literature review can be a part of a research paper or scholarly article, usually falling after the introduction and before the research methods sections. In these cases, the lit review just needs to cover scholarship that is important to the issue you are writing about; sometimes it will also cover key sources that informed your research methodology.

Lit reviews can also be standalone pieces, either as assignments in a class or as publications. In a class, a lit review may be assigned to help students familiarize themselves with a topic and with scholarship in their field, get an idea of the other researchers working on the topic they’re interested in, find gaps in existing research in order to propose new projects, and/or develop a theoretical framework and methodology for later research. As a publication, a lit review usually is meant to help make other scholars’ lives easier by collecting and summarizing, synthesizing, and analyzing existing research on a topic. This can be especially helpful for students or scholars getting into a new research area, or for directing an entire community of scholars toward questions that have not yet been answered.

What are the parts of a lit review?

Most lit reviews use a basic introduction-body-conclusion structure; if your lit review is part of a larger paper, the introduction and conclusion pieces may be just a few sentences while you focus most of your attention on the body. If your lit review is a standalone piece, the introduction and conclusion take up more space and give you a place to discuss your goals, research methods, and conclusions separately from where you discuss the literature itself.

Introduction:

  • An introductory paragraph that explains what your working topic and thesis are
  • A forecast of key topics or texts that will appear in the review
  • Potentially, a description of how you found sources and how you analyzed them for inclusion and discussion in the review (more often found in published, standalone literature reviews than in lit review sections in an article or research paper)

Body:

  • Summarize and synthesize: Give an overview of the main points of each source and combine them into a coherent whole
  • Analyze and interpret: Don’t just paraphrase other researchers – add your own interpretations where possible, discussing the significance of findings in relation to the literature as a whole
  • Critically Evaluate: Mention the strengths and weaknesses of your sources
  • Write in well-structured paragraphs: Use transition words and topic sentences to draw connections, comparisons, and contrasts.

Conclusion:

  • Summarize the key findings you have taken from the literature and emphasize their significance
  • Connect it back to your primary research question

How should I organize my lit review?

Lit reviews can take many different organizational patterns depending on what you are trying to accomplish with the review. Here are some examples:

  • Chronological : The simplest approach is to trace the development of the topic over time, which helps familiarize the audience with the topic (for instance if you are introducing something that is not commonly known in your field). If you choose this strategy, be careful to avoid simply listing and summarizing sources in order. Try to analyze the patterns, turning points, and key debates that have shaped the direction of the field. Give your interpretation of how and why certain developments occurred (as mentioned previously, this may not be appropriate in your discipline — check with a teacher or mentor if you’re unsure).
  • Thematic : If you have found some recurring central themes that you will continue working with throughout your piece, you can organize your literature review into subsections that address different aspects of the topic. For example, if you are reviewing literature about women and religion, key themes can include the role of women in churches and the religious attitude towards women.
  • Methodological : If your sources come from disciplines or fields that use a variety of research methods, you can group and compare the results and conclusions that emerge from different approaches. For example:
  • Qualitative versus quantitative research
  • Empirical versus theoretical scholarship
  • Divide the research by sociological, historical, or cultural sources
  • Theoretical : In many humanities articles, the literature review is the foundation for the theoretical framework. You can use it to discuss various theories, models, and definitions of key concepts. You can argue for the relevance of a specific theoretical approach or combine various theoretical concepts to create a framework for your research.

What are some strategies or tips I can use while writing my lit review?

Any lit review is only as good as the research it discusses; make sure your sources are well-chosen and your research is thorough. Don’t be afraid to do more research if you discover a new thread as you’re writing. More info on the research process is available in our "Conducting Research" resources.

As you’re doing your research, create an annotated bibliography (see our page on this type of document). Much of the information used in an annotated bibliography can also be used in a literature review, so you’ll be not only partially drafting your lit review as you research, but also developing your sense of the larger conversation going on among scholars, professionals, and any other stakeholders in your topic.

Usually you will need to synthesize research rather than just summarizing it. This means drawing connections between sources to create a picture of the scholarly conversation on a topic over time. Many student writers struggle to synthesize because they feel they don’t have anything to add to the scholars they are citing; here are some strategies to help you:

  • It often helps to remember that the point of these kinds of syntheses is to show your readers how you understand your research, to help them read the rest of your paper.
  • Writing teachers often say synthesis is like hosting a dinner party: imagine all your sources are together in a room, discussing your topic. What are they saying to each other?
  • Look at the in-text citations in each paragraph. Are you citing just one source for each paragraph? This usually indicates summary only. When you have multiple sources cited in a paragraph, you are more likely to be synthesizing them (not always, but often).

The most interesting literature reviews are often written as arguments (again, as mentioned at the beginning of the page, this is discipline-specific and doesn’t work for all situations). Often, the literature review is where you can establish your research as filling a particular gap or as relevant in a particular way. You have some chance to do this in your introduction in an article, but the literature review section gives a more extended opportunity to establish the conversation in the way you would like your readers to see it. You can choose the intellectual lineage you would like to be part of and whose definitions matter most to your thinking (mostly humanities-specific, but this goes for sciences as well). In addressing these points, you argue for your place in the conversation, which tends to make the lit review more compelling than a simple reporting of other sources.


Organizing Your Social Sciences Research Paper

5. The Literature Review

A literature review surveys prior research published in books, scholarly articles, and any other sources relevant to a particular issue, area of research, or theory, and by so doing, provides a description, summary, and critical evaluation of these works in relation to the research problem being investigated. Literature reviews are designed to provide an overview of sources you have used in researching a particular topic and to demonstrate to your readers how your research fits within existing scholarship about the topic.

Fink, Arlene. Conducting Research Literature Reviews: From the Internet to Paper. Fourth edition. Thousand Oaks, CA: SAGE, 2014.

Importance of a Good Literature Review

A literature review may consist of simply a summary of key sources, but in the social sciences, a literature review usually has an organizational pattern and combines both summary and synthesis, often within specific conceptual categories. A summary is a recap of the important information of the source, but a synthesis is a re-organization, or a reshuffling, of that information in a way that informs how you are planning to investigate a research problem. The analytical features of a literature review might:

  • Give a new interpretation of old material or combine new with old interpretations,
  • Trace the intellectual progression of the field, including major debates,
  • Depending on the situation, evaluate the sources and advise the reader on the most pertinent or relevant research, or
  • Usually in the conclusion of a literature review, identify where gaps exist in how a problem has been researched to date.

Given this, the purpose of a literature review is to:

  • Place each work in the context of its contribution to understanding the research problem being studied.
  • Describe the relationship of each work to the others under consideration.
  • Identify new ways to interpret prior research.
  • Reveal any gaps that exist in the literature.
  • Resolve conflicts amongst seemingly contradictory previous studies.
  • Identify areas of prior scholarship to prevent duplication of effort.
  • Point the way in fulfilling a need for additional research.
  • Locate your own research within the context of existing literature [very important].

Fink, Arlene. Conducting Research Literature Reviews: From the Internet to Paper. 2nd ed. Thousand Oaks, CA: Sage, 2005; Hart, Chris. Doing a Literature Review: Releasing the Social Science Research Imagination. Thousand Oaks, CA: Sage Publications, 1998; Jesson, Jill. Doing Your Literature Review: Traditional and Systematic Techniques. Los Angeles, CA: SAGE, 2011; Knopf, Jeffrey W. "Doing a Literature Review." PS: Political Science and Politics 39 (January 2006): 127-132; Ridley, Diana. The Literature Review: A Step-by-Step Guide for Students. 2nd ed. Los Angeles, CA: SAGE, 2012.

Types of Literature Reviews

It is important to think of knowledge in a given field as consisting of three layers. First, there are the primary studies that researchers conduct and publish. Second are the reviews of those studies that summarize and offer new interpretations built from, and often extending beyond, the primary studies. Third, there are the perceptions, conclusions, opinions, and interpretations that are shared informally among scholars and become part of the body of epistemological traditions within the field.

In composing a literature review, it is important to note that it is often this third layer of knowledge that is cited as "true" even though it often has only a loose relationship to the primary studies and secondary literature reviews. Given this, while literature reviews are designed to provide an overview and synthesis of pertinent sources you have explored, there are a number of approaches you could adopt depending upon the type of analysis underpinning your study.

Argumentative Review This form examines literature selectively in order to support or refute an argument, deeply embedded assumption, or philosophical problem already established in the literature. The purpose is to develop a body of literature that establishes a contrarian viewpoint. Given the value-laden nature of some social science research [e.g., educational reform; immigration control], argumentative approaches to analyzing the literature can be a legitimate and important form of discourse. However, note that they can also introduce problems of bias when they are used to make summary claims of the sort found in systematic reviews [see below].

Integrative Review Considered a form of research that reviews, critiques, and synthesizes representative literature on a topic in an integrated way such that new frameworks and perspectives on the topic are generated. The body of literature includes all studies that address related or identical hypotheses or research problems. A well-done integrative review meets the same standards as primary research in regard to clarity, rigor, and replication. This is the most common form of review in the social sciences.

Historical Review Few things rest in isolation from historical precedent. Historical literature reviews focus on examining research throughout a period of time, often starting with the first time an issue, concept, theory, or phenomenon emerged in the literature, then tracing its evolution within the scholarship of a discipline. The purpose is to place research in a historical context to show familiarity with state-of-the-art developments and to identify the likely directions for future research.

Methodological Review A review does not always focus on what someone said [findings], but on how they came to say it [method of analysis]. Reviewing methods of analysis provides a framework of understanding at different levels [i.e., theory, substantive fields, research approaches, and data collection and analysis techniques], showing how researchers draw upon a wide variety of knowledge, from the conceptual level to practical documents used in fieldwork, in the areas of ontological and epistemological consideration, quantitative and qualitative integration, sampling, interviewing, data collection, and data analysis. This approach helps highlight ethical issues that you should be aware of and consider as you go through your own study.

Systematic Review This form consists of an overview of existing evidence pertinent to a clearly formulated research question, which uses pre-specified and standardized methods to identify and critically appraise relevant research, and to collect, report, and analyze data from the studies that are included in the review. The goal is to deliberately document, critically evaluate, and summarize scientifically all of the research about a clearly defined research problem . Typically it focuses on a very specific empirical question, often posed in a cause-and-effect form, such as "To what extent does A contribute to B?" This type of literature review is primarily applied to examining prior research studies in clinical medicine and allied health fields, but it is increasingly being used in the social sciences.

Theoretical Review The purpose of this form is to examine the corpus of theory that has accumulated in regard to an issue, concept, theory, or phenomenon. The theoretical literature review helps to establish what theories already exist, the relationships between them, to what degree the existing theories have been investigated, and to develop new hypotheses to be tested. Often this form is used to help establish a lack of appropriate theories or reveal that current theories are inadequate for explaining new or emerging research problems. The unit of analysis can focus on a theoretical concept or a whole theory or framework.

NOTE: Most often the literature review will incorporate some combination of types. For example, a review that examines literature supporting or refuting an argument, assumption, or philosophical problem related to the research problem will also need to include writing supported by sources that establish the history of these arguments in the literature.

Baumeister, Roy F. and Mark R. Leary. "Writing Narrative Literature Reviews." Review of General Psychology 1 (September 1997): 311-320; Fink, Arlene. Conducting Research Literature Reviews: From the Internet to Paper. 2nd ed. Thousand Oaks, CA: Sage, 2005; Hart, Chris. Doing a Literature Review: Releasing the Social Science Research Imagination. Thousand Oaks, CA: Sage Publications, 1998; Kennedy, Mary M. "Defining a Literature." Educational Researcher 36 (April 2007): 139-147; Petticrew, Mark and Helen Roberts. Systematic Reviews in the Social Sciences: A Practical Guide. Malden, MA: Blackwell Publishers, 2006; Torraco, Richard. "Writing Integrative Literature Reviews: Guidelines and Examples." Human Resource Development Review 4 (September 2005): 356-367; Rocco, Tonette S. and Maria S. Plakhotnik. "Literature Reviews, Conceptual Frameworks, and Theoretical Frameworks: Terms, Functions, and Distinctions." Human Resource Development Review 8 (March 2008): 120-130; Sutton, Anthea. Systematic Approaches to a Successful Literature Review. Los Angeles, CA: Sage Publications, 2016.

Structure and Writing Style

I.  Thinking About Your Literature Review

The structure of a literature review should include the following in support of understanding the research problem:

  • An overview of the subject, issue, or theory under consideration, along with the objectives of the literature review,
  • Division of works under review into themes or categories [e.g. works that support a particular position, those against, and those offering alternative approaches entirely],
  • An explanation of how each work is similar to and how it varies from the others,
  • Conclusions as to which pieces are best considered in their argument, are most convincing of their opinions, and make the greatest contribution to the understanding and development of their area of research.

The critical evaluation of each work should consider:

  • Provenance -- what are the author's credentials? Are the author's arguments supported by evidence [e.g. primary historical material, case studies, narratives, statistics, recent scientific findings]?
  • Methodology -- were the techniques used to identify, gather, and analyze the data appropriate to addressing the research problem? Was the sample size appropriate? Were the results effectively interpreted and reported?
  • Objectivity -- is the author's perspective even-handed or prejudicial? Is contrary data considered or is certain pertinent information ignored to prove the author's point?
  • Persuasiveness -- which of the author's theses are most convincing or least convincing?
  • Validity -- are the author's arguments and conclusions convincing? Does the work ultimately contribute in any significant way to an understanding of the subject?

II.  Development of the Literature Review

Four Basic Stages of Writing

1.  Problem formulation -- which topic or field is being examined and what are its component issues?
2.  Literature search -- finding materials relevant to the subject being explored.
3.  Data evaluation -- determining which literature makes a significant contribution to the understanding of the topic.
4.  Analysis and interpretation -- discussing the findings and conclusions of pertinent literature.

Consider the following issues before writing the literature review:

Clarify
If your assignment is not specific about what form your literature review should take, seek clarification from your professor by asking these questions:

1.  Roughly how many sources would be appropriate to include?
2.  What types of sources should I review (books, journal articles, websites; scholarly versus popular sources)?
3.  Should I summarize, synthesize, or critique sources by discussing a common theme or issue?
4.  Should I evaluate the sources in any way beyond evaluating how they relate to understanding the research problem?
5.  Should I provide subheadings and other background information, such as definitions and/or a history?

Find Models
Use the exercise of reviewing the literature to examine how authors in your discipline or area of interest have composed their literature review sections. Read them to get a sense of the types of themes you might want to look for in your own research or to identify ways to organize your final review. The bibliography or reference section of sources you've already read, such as required readings in the course syllabus, are also excellent entry points into your own research.

Narrow the Topic
The narrower your topic, the easier it will be to limit the number of sources you need to read in order to obtain a good survey of relevant resources. Your professor will probably not expect you to read everything that's available about the topic, but you'll make the act of reviewing easier if you first limit the scope of the research problem. A good strategy is to begin by searching the USC Libraries Catalog for recent books about the topic and reviewing the table of contents for chapters that focus on specific issues. You can also review the indexes of books to find references to specific issues that can serve as the focus of your research. For example, a book surveying the history of the Israeli-Palestinian conflict may include a chapter on the role Egypt has played in mediating the conflict; or look in the index for the pages where Egypt is mentioned in the text.

Consider Whether Your Sources are Current
Some disciplines require that you use information that is as current as possible. This is particularly true in medicine and the sciences, where research becomes obsolete very quickly as new discoveries are made. However, when writing a review in the social sciences, a survey of the history of the literature may be required. In other words, a complete understanding of the research problem requires you to deliberately examine how knowledge and perspectives have changed over time. Sort through other current bibliographies or literature reviews in the field to get a sense of what your discipline expects. You can also use this method to explore what is considered by scholars to be a "hot topic" and what is not.

III.  Ways to Organize Your Literature Review

Chronology of Events
If your review follows the chronological method, you could write about the materials according to when they were published. This approach should only be followed if a clear path of research building on previous research can be identified and these trends follow a clear chronological order of development. An example would be a literature review that focuses on continuing research about the emergence of German economic power after the fall of the Soviet Union.

By Publication
Order your sources by publication chronology only if the order demonstrates a more important trend. For instance, you could order a review of literature on environmental studies of brown fields this way if the progression revealed, for example, a change in the soil collection practices of the researchers who wrote and/or conducted the studies.

Thematic [“conceptual categories”]
A thematic literature review is the most common approach to summarizing prior research in the social and behavioral sciences. Thematic reviews are organized around a topic or issue, rather than the progression of time, although the progression of time may still be incorporated into a thematic review. For example, a review of the Internet’s impact on American presidential politics could focus on the development of online political satire. While the study focuses on one topic, the Internet’s impact on American presidential politics, it could still be organized chronologically, reflecting technological developments in media. The difference between a "chronological" and a "thematic" approach lies in what is emphasized the most: here, themes related to the role of the Internet in presidential politics. Note that more authentic thematic reviews tend to break away from chronological order. A review organized in this manner would shift between time periods within each section according to the point being made.

Methodological
A methodological approach focuses on the methods utilized by the researcher. For the Internet in American presidential politics project, one methodological approach would be to look at cultural differences between the portrayal of American presidents on American, British, and French websites. Or the review might focus on the fundraising impact of the Internet on a particular political party. A methodological scope will influence either the types of documents in the review or the way in which these documents are discussed.

Other Sections of Your Literature Review
Once you've decided on the organizational method for your literature review, the sections you need to include in the paper should be easy to figure out because they arise from your organizational strategy. In other words, a chronological review would have subsections for each vital time period; a thematic review would have subtopics based upon factors that relate to the theme or issue. However, sometimes you may need to add sections that are necessary for your study but do not fit within the organizational strategy of the body. What other sections you include in the body is up to you. However, only include what is necessary for the reader to locate your study within the larger scholarship about the research problem.

Here are examples of other sections, usually in the form of a single paragraph, you may need to include depending on the type of review you write:

  • Current Situation : Information necessary to understand the current topic or focus of the literature review.
  • Sources Used : Describes the methods and resources [e.g., databases] you used to identify the literature you reviewed.
  • History : The chronological progression of the field, the research literature, or an idea that is necessary to understand the literature review, if the body of the literature review is not already a chronology.
  • Selection Methods : Criteria you used to select (and perhaps exclude) sources in your literature review. For instance, you might explain that your review includes only peer-reviewed [i.e., scholarly] sources.
  • Standards : Description of the way in which you present your information.
  • Questions for Further Research : What questions about the field has the review sparked? How will you further your research as a result of the review?

IV.  Writing Your Literature Review

Once you've settled on how to organize your literature review, you're ready to write each section. When writing your review, keep in mind these issues.

Use Evidence
A literature review section is, in this sense, just like any other academic research paper. Your interpretation of the available sources must be backed up with evidence [citations] that demonstrates that what you are saying is valid.

Be Selective
Select only the most important points in each source to highlight in the review. The type of information you choose to mention should relate directly to the research problem, whether it is thematic, methodological, or chronological. Related items that provide additional information, but that are not key to understanding the research problem, can be included in a list of further readings.

Use Quotes Sparingly
Some short quotes are appropriate if you want to emphasize a point, or if what an author stated cannot be easily paraphrased. Sometimes you may need to quote certain terminology that was coined by the author, is not common knowledge, or is taken directly from the study. Do not use extensive quotes as a substitute for using your own words in reviewing the literature.

Summarize and Synthesize
Remember to summarize and synthesize your sources within each thematic paragraph as well as throughout the review. Recapitulate important features of a research study, but then synthesize it by rephrasing the study's significance and relating it to your own work and the work of others.

Keep Your Own Voice
While the literature review presents others' ideas, your voice [the writer's] should remain front and center. For example, weave references to other sources into what you are writing, but maintain your own voice by starting and ending the paragraph with your own ideas and wording.

Use Caution When Paraphrasing
When paraphrasing a source that is not your own, be sure to represent the author's information or opinions accurately and in your own words. Even when paraphrasing an author’s work, you still must provide a citation to that work.

V.  Common Mistakes to Avoid

These are the most common mistakes made in reviewing social science research literature.

  • Sources in your literature review do not clearly relate to the research problem;
  • You do not take sufficient time to define and identify the most relevant sources to use in the literature review;
  • You rely exclusively on secondary analytical sources rather than including relevant primary research studies or data;
  • You uncritically accept another researcher's findings and interpretations as valid, rather than critically examining all aspects of the research design and analysis;
  • You do not describe the search procedures that were used in identifying the literature to review;
  • You report isolated statistical results rather than synthesizing them using chi-squared or meta-analytic methods; and,
  • You only include research that validates your assumptions and do not consider contrary findings and alternative interpretations found in the literature.
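One of the mistakes above concerns reporting isolated statistical results instead of synthesizing them. As a concrete illustration, here is a minimal sketch of inverse-variance (fixed-effect) pooling, one of the simplest meta-analytic methods; the effect sizes and standard errors below are invented for illustration only, not drawn from any actual study:

```python
import math

def pool_fixed_effect(effects, std_errors):
    """Inverse-variance weighted (fixed-effect) pooled estimate.

    Each study's effect is weighted by 1 / SE^2, so more precise
    studies count more. Returns the pooled effect and its standard error.
    """
    weights = [1.0 / se ** 2 for se in std_errors]
    pooled = sum(w * e for w, e in zip(weights, effects)) / sum(weights)
    pooled_se = math.sqrt(1.0 / sum(weights))
    return pooled, pooled_se

# Hypothetical effect sizes (e.g., standardized mean differences)
# from three studies, with their standard errors.
effects = [0.30, 0.10, 0.45]
std_errors = [0.10, 0.15, 0.20]
pooled, se = pool_fixed_effect(effects, std_errors)
```

A synthesis like this reports one pooled estimate with its uncertainty, rather than three disconnected results; real meta-analyses add heterogeneity checks and random-effects models on top of this basic idea.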

Cook, Kathleen E. and Elise Murowchick. “Do Literature Review Skills Transfer from One Course to Another?” Psychology Learning and Teaching 13 (March 2014): 3-11; Fink, Arlene. Conducting Research Literature Reviews: From the Internet to Paper. 2nd ed. Thousand Oaks, CA: Sage, 2005; Hart, Chris. Doing a Literature Review: Releasing the Social Science Research Imagination. Thousand Oaks, CA: Sage Publications, 1998; Jesson, Jill. Doing Your Literature Review: Traditional and Systematic Techniques. London: SAGE, 2011; Literature Review Handout. Online Writing Center. Liberty University; Literature Reviews. The Writing Center. University of North Carolina; Onwuegbuzie, Anthony J. and Rebecca Frels. Seven Steps to a Comprehensive Literature Review: A Multimodal and Cultural Approach. Los Angeles, CA: SAGE, 2016; Ridley, Diana. The Literature Review: A Step-by-Step Guide for Students. 2nd ed. Los Angeles, CA: SAGE, 2012; Randolph, Justus J. “A Guide to Writing the Dissertation Literature Review.” Practical Assessment, Research, and Evaluation 14 (June 2009); Sutton, Anthea. Systematic Approaches to a Successful Literature Review. Los Angeles, CA: Sage Publications, 2016; Taylor, Dena. The Literature Review: A Few Tips On Conducting It. University College Writing Centre. University of Toronto; Writing a Literature Review. Academic Skills Centre. University of Canberra.

Writing Tip

Break Out of Your Disciplinary Box!

Thinking interdisciplinarily about a research problem can be a rewarding exercise in applying new ideas, theories, or concepts to an old problem. For example, what might cultural anthropologists say about the continuing conflict in the Middle East? In what ways might geographers view the need for better distribution of social service agencies in large cities differently from how social workers might study the issue? You don’t want to replace a thorough review of core research literature in your discipline with studies conducted in other fields of study. However, particularly in the social sciences, thinking about research problems from multiple vectors is a key strategy for finding new solutions to a problem or gaining a new perspective. Consult with a librarian about identifying research databases in other disciplines; almost every field of study has at least one comprehensive database devoted to indexing its research literature.

Frodeman, Robert. The Oxford Handbook of Interdisciplinarity . New York: Oxford University Press, 2010.

Another Writing Tip

Don't Just Review for Content!

While conducting a review of the literature, maximize the time you devote to writing this part of your paper by thinking broadly about what you should be looking for and evaluating. Review not just what scholars are saying, but how they are saying it. Some questions to ask:

  • How are they organizing their ideas?
  • What methods have they used to study the problem?
  • What theories have been used to explain, predict, or understand their research problem?
  • What sources have they cited to support their conclusions?
  • How have they used non-textual elements [e.g., charts, graphs, figures, etc.] to illustrate key points?

When you begin to write your literature review section, you'll be glad you dug deeper into how the research was designed and constructed because it establishes a means for developing more substantial analysis and interpretation of the research problem.

Hart, Chris. Doing a Literature Review: Releasing the Social Science Research Imagination. Thousand Oaks, CA: Sage Publications, 1998.

Yet Another Writing Tip

When Do I Know I Can Stop Looking and Move On?

Here are several strategies you can utilize to assess whether you've thoroughly reviewed the literature:

  • Look for repeating patterns in the research findings. If the same thing is being said, just by different people, then this likely demonstrates that the research problem has hit a conceptual dead end. At this point consider: Does your study extend current research? Does it forge a new path? Or, does it merely add more of the same thing being said?
  • Look at the sources the authors cite in their work. If you begin to see the same researchers cited again and again, then this is often an indication that no new ideas have been generated to address the research problem.
  • Search Google Scholar to identify who has subsequently cited leading scholars already identified in your literature review [see next sub-tab]. This is called citation tracking and there are a number of sources that can help you identify who has cited whom, particularly scholars from outside of your discipline. Here again, if the same authors are being cited again and again, this may indicate no new literature has been written on the topic.
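The "same authors cited again and again" signal can even be checked mechanically once you have collected the reference lists of your sources. A minimal sketch in plain Python (the author names below are hypothetical, chosen only to illustrate the pattern):

```python
from collections import Counter

def citation_saturation(bibliographies):
    """Count how often each author appears across reference lists.

    If a large share of all citations point at a small core of
    authors, the literature may be approaching a conceptual dead end.
    bibliographies: list of reference lists, each a list of author names.
    Returns (author_counts, share of citations held by the top three).
    """
    counts = Counter(author for refs in bibliographies for author in refs)
    total = sum(counts.values())
    top_three = sum(n for _, n in counts.most_common(3))
    return counts, top_three / total if total else 0.0

# Hypothetical reference lists from three sources under review.
bibs = [
    ["Hart", "Fink", "Ridley"],
    ["Hart", "Fink", "Sutton"],
    ["Hart", "Fink", "Randolph"],
]
counts, top_share = citation_saturation(bibs)
```

A high top-author share (here roughly three quarters of all citations) suggests the field keeps circling the same core works; a low share suggests new voices are still entering the conversation.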

Onwuegbuzie, Anthony J. and Rebecca Frels. Seven Steps to a Comprehensive Literature Review: A Multimodal and Cultural Approach . Los Angeles, CA: Sage, 2016; Sutton, Anthea. Systematic Approaches to a Successful Literature Review . Los Angeles, CA: Sage Publications, 2016.

  • Last Updated: May 25, 2024 4:09 PM
  • URL: https://libguides.usc.edu/writingguide


Research Methods: Literature Reviews

  • Annotated Bibliographies
  • Literature Reviews
  • Scoping Reviews
  • Systematic Reviews
  • Scholarship of Teaching and Learning
  • Persuasive Arguments
  • Subject Specific Methodology

A literature review involves researching, reading, analyzing, evaluating, and summarizing scholarly literature (typically journals and articles) about a specific topic. The results of a literature review may be an entire report or article, or may be part of an article, thesis, dissertation, or grant proposal. A literature review helps the author learn about the history and nature of their topic, and identify research gaps and problems.

Steps & Elements

Problem formulation

  • Determine your topic and its components by asking a question
  • Research: locate literature related to your topic to identify the gap(s) that can be addressed
  • Read: read the articles or other sources of information
  • Analyze: assess the findings for relevancy
  • Evaluate: determine how the articles are relevant to your research and what the key findings are
  • Synthesize: write about the key findings and how they are relevant to your research

Elements of a Literature Review

  • Summarize subject, issue or theory under consideration, along with objectives of the review
  • Divide works under review into categories (e.g. those in support of a particular position, those against, those offering alternative theories entirely)
  • Explain how each work is similar to and how it varies from the others
  • Conclude which pieces are best considered in their argument, are most convincing of their opinions, and make the greatest contribution to the understanding and development of an area of research

Writing a Literature Review Resources

  • How to Write a Literature Review From the Wesleyan University Library
  • Write a Literature Review From the University of California Santa Cruz Library. A brief overview of a literature review; includes a list of stages for writing a lit review.
  • Literature Reviews From the University of North Carolina Writing Center. Detailed information about writing a literature review.
  • Undertaking a literature review: a step-by-step approach Cronin, P., Ryan, F., & Coughlan, M. (2008). Undertaking a literature review: A step-by-step approach. British Journal of Nursing, 17(1), 38-43.


Literature Review Tutorial

  • Last Updated: Feb 29, 2024 12:00 PM
  • URL: https://guides.auraria.edu/researchmethods


University of Texas

  • University of Texas Libraries

Literature Reviews

  • What is a literature review?
  • Steps in the Literature Review Process
  • Define your research question
  • Determine inclusion and exclusion criteria
  • Choose databases and search
  • Review Results
  • Synthesize Results
  • Analyze Results
  • Librarian Support

What is a Literature Review?

A literature or narrative review is a comprehensive review and analysis of the published literature on a specific topic or research question. The literature reviewed may include books, scholarly articles, conference proceedings, association papers, and dissertations. It covers the most pertinent studies and points to important past and current research and practices. It provides background and context, and shows how your research will contribute to the field.

A literature review should: 

  • Provide a comprehensive and updated review of the literature;
  • Explain why this review has taken place;
  • Articulate a position or hypothesis;
  • Acknowledge and account for conflicting and corroborating points of view

From Sage Research Methods

Purpose of a Literature Review

A literature review can be written as an introduction to a study to:

  • Demonstrate how a study fills a gap in research
  • Compare a study with other research that's been done

Or it can be a separate work (a research article on its own) which:

  • Organizes or describes a topic
  • Describes variables within a particular issue/problem

Limitations of a Literature Review

Some of the limitations of a literature review are:

  • It's a snapshot in time. Unlike other reviews, this one has a beginning, a middle, and an end. There may be future developments that could make your work less relevant.
  • It may be too focused. Some niche studies may miss the bigger picture.
  • It can be difficult to be comprehensive. There is no way to make sure all the literature on a topic was considered.
  • It is easy to be biased if you stick to top tier journals. There may be other places where people are publishing exemplary research. Look to open access publications and conferences to reflect a more inclusive collection. Also, make sure to include opposing views (and not just supporting evidence).

Source: Grant, Maria J., and Andrew Booth. “A Typology of Reviews: An Analysis of 14 Review Types and Associated Methodologies.” Health Information & Libraries Journal, vol. 26, no. 2, June 2009, pp. 91–108. Wiley Online Library, doi:10.1111/j.1471-1842.2009.00848.x.

Meryl Brodsky : Communication and Information Studies

Hannah Chapman Tripp : Biology, Neuroscience

Carolyn Cunningham : Human Development & Family Sciences, Psychology, Sociology

Larayne Dallas : Engineering

Janelle Hedstrom : Special Education, Curriculum & Instruction, Ed Leadership & Policy

Susan Macicak : Linguistics

Imelda Vetter : Dell Medical School

For help in other subject areas, please see the guide to library specialists by subject .

Periodically, UT Libraries runs a workshop covering the basics and library support for literature reviews. While we try to offer these once per academic year, we find providing the recording to be helpful to community members who have missed the session. Following is the most recent recording of the workshop, Conducting a Literature Review. To view the recording, a UT login is required.

  • October 26, 2022 recording
  • Last Updated: Oct 26, 2022 2:49 PM
  • URL: https://guides.lib.utexas.edu/literaturereviews


  • JMIR Form Res
  • v.3(4); Oct-Dec 2019


A Comprehensive Framework to Evaluate Websites: Literature Review and Development of GoodWeb

Rosalie Allison

1 Public Health England, Gloucester, United Kingdom

Catherine Hayes

Cliodna A. M. McNulty

Vicki Young

Associated Data

Summary of included studies, including information on the participants.

Interventions: methodologies and tools to evaluate websites.

Methods used or described in each study.

Summary of the most used website attributes evaluated.

Background

Attention is turning toward increasing the quality of websites and quality evaluation to attract new users and retain existing users.

Objective

This scoping study aimed to review and define existing worldwide methodologies and techniques to evaluate websites and provide a framework of appropriate website attributes that could be applied to any future website evaluations.

Methods

We systematically searched electronic databases and gray literature for studies of website evaluation. The results were exported to EndNote software, duplicates were removed, and eligible studies were identified. The results have been presented in narrative form.

Results

A total of 69 studies met the inclusion criteria. The extracted data included type of website, aim or purpose of the study, study populations (users and experts), sample size, setting (controlled environment and remotely assessed), website attributes evaluated, process of methodology, and process of analysis. Methods of evaluation varied and included questionnaires, observed website browsing, interviews or focus groups, and Web usage analysis. Evaluations involving both users and experts, and both controlled and remote settings, are represented. Website attributes that were examined included usability or ease of use, content, design criteria, functionality, appearance, interactivity, satisfaction, and loyalty. Website evaluation methods should be tailored to the needs of specific websites and the individual aims of evaluations. GoodWeb, a website evaluation guide, has been presented with a case scenario.

Conclusions

This scoping study supports the open debate of defining the quality of websites, and there are numerous approaches and models to evaluate it. However, as this study provides a framework of the existing literature of website evaluation, it presents a guide of options for evaluating websites, including which attributes to analyze and options for appropriate methods.

Introduction

Since its conception in the early 1990s, there has been an explosion in the use of the internet, with websites taking a central role in diverse fields such as finance, education, medicine, industry, and business. Organizations are increasingly attempting to exploit the benefits of the World Wide Web and its features as an interface for internet-enabled businesses, information provision, and promotional activities [ 1 , 2 ]. As the environment becomes more competitive and websites become more sophisticated, attention is turning toward increasing the quality of the website itself and quality evaluation to attract new and retain existing users [ 3 , 4 ]. What determines website quality has not been conclusively established, and there are many different definitions and meanings of the term quality, mainly in relation to the website’s purpose [ 5 ]. Traditionally, website evaluations have focused on usability, defined as “the extent to which a product can be used by specified users to achieve specified goals with effectiveness, efficiency and satisfaction in a specified context of use [ 6 ].” The design of websites and users’ needs go beyond pure usability, as increased engagement and pleasure experienced during interactions with websites can be more important predictors of website preference than usability [ 7 - 10 ]. Therefore, in the last decade, website evaluations have shifted their focus to users’ experience, employing various assessment techniques [ 11 ], with no universally accepted method or procedure for website evaluation.

This scoping study aimed to review and define existing worldwide methodologies and techniques to evaluate websites and provide a simple framework of appropriate website attributes, which could be applied to future website evaluations.

A scoping study is similar to a systematic review as it collects and reviews content in a field of interest. However, scoping studies cover a broader question and do not rigorously evaluate the quality of the studies included [ 12 ]. Scoping studies are commonly used in the fields of public services such as health and education, as they are more rapid to perform and less costly in terms of staff costs [ 13 ]. Scoping studies can be precursors to a systematic review or stand-alone studies to examine the range of research around a particular topic.

The following research question is based on the need to gain knowledge and insight from worldwide website evaluation to inform the future study design of website evaluations: what website evaluation methodologies can be robustly used to assess users’ experience?

To show how the framework of attributes and methods can be applied to evaluating a website, e-Bug, an international educational health website, will be used as a case scenario [ 14 ].

This scoping study followed a 5-stage framework and methodology, as outlined by Arksey and O’Malley [ 12 ], involving the following: (1) identifying the research question, as above; (2) identifying relevant studies; (3) study selection; (4) charting the data; and (5) collating, summarizing, and reporting the results.

Identifying Relevant Studies

Following the Preferred Reporting Items for Systematic Reviews and Meta-Analyses guidelines [ 15 ], studies for consideration in the review were located by searching the following electronic databases: Excerpta Medica dataBASE, PsycINFO, Cochrane, Cumulative Index to Nursing and Allied Health Literature, Scopus, ACM Digital Library, IEEE Xplore, and SPORTDiscus. The keywords used referred to the following:

  • Population: websites
  • Intervention: evaluation methodologies
  • Outcome: user’s experience.

Table 1 shows the specific search criteria for each database. These keywords were also used to search gray literature for unpublished or working documents to minimize publication bias.

Full search strategy used to search each electronic database.

a EMBASE: Excerpta Medica database.

b CINAHL: Cumulative Index to Nursing and Allied Health Literature.

c ACM: Association for Computing Machinery.

d IEEE: Institute of Electrical and Electronics Engineers.

Study Selection

Once all sources had been systematically searched, the list of citations was exported to EndNote software to identify eligible studies. Two researchers (RA and CH) scanned the title, and the abstract if necessary, of each citation and removed studies that did not fit the inclusion criteria. As abstracts are not always representative of the full study or do not capture its full scope [ 16 ], if the title and abstract did not provide sufficient information, the full manuscript was examined to ascertain whether the study met all the inclusion criteria, which were as follows: (1) studies focused on websites, (2) studies of evaluative methods (eg, use of questionnaires and task completion), (3) studies that reported outcomes that affect the user’s experience (eg, quality, satisfaction, efficiency, and effectiveness, without necessarily focusing on methodology), (4) studies carried out between 2006 and 2016, (5) studies published in English, and (6) any appropriate study design.

Exclusion criteria included (1) studies that focused on evaluations using solely experts and were not transferable to user evaluations; (2) studies that were in the form of an electronic book or were not freely available on the Web or through OpenAthens, the University of Bath library, or the University of the West of England library; (3) studies that evaluated banking, electronic commerce (e-commerce), or online library websites and did not have measures transferable to a range of other websites; (4) studies that reported exclusively on minority or special needs groups (eg, blind or deaf users); and (5) studies that did not meet all the inclusion criteria.
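As a rough illustration of this screening logic (the actual screening was performed manually in EndNote by two researchers, and the record fields and example records here are hypothetical):

```python
# Rough sketch of the inclusion/exclusion screening described above.
# Record fields and example records are hypothetical; the real screening
# was done manually by two researchers (RA and CH).
def meets_inclusion(study):
    return (study["focus"] == "website"
            and study["reports_user_outcomes"]
            and 2006 <= study["year"] <= 2016
            and study["language"] == "English")

def is_excluded(study):
    return (study.get("expert_only", False)
            or study.get("domain") in {"banking", "e-commerce", "online library"})

def screen(studies):
    return [s for s in studies if meets_inclusion(s) and not is_excluded(s)]

candidates = [
    {"focus": "website", "reports_user_outcomes": True,
     "year": 2012, "language": "English"},
    {"focus": "website", "reports_user_outcomes": True,
     "year": 2012, "language": "English", "domain": "banking"},
]
print(len(screen(candidates)))  # 1
```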

Charting the Data

The next stage involved charting key items of information obtained from studies being reviewed. Charting [ 17 ] describes a technique for synthesizing and interpreting qualitative data by sifting, charting, and sorting material according to key issues and themes. This is similar to a systematic review in which the process is called data extraction. The data extracted included general information about the study and specific information relating to, for instance, the study population or target, the type of intervention, outcome measures employed, and the study design.

The information of interest included the following: type of website, aim or purpose of the study, study populations (users and experts), sample size, setting (laboratory, real life, and remotely assessed), website attributes evaluated, process of methodology, and process of analysis.

NVivo version 10.0 software was used for this stage by 2 researchers (RA and CH) to chart the data.

Collating, Summarizing, and Reporting the Results

Although the scoping study does not seek to assess the quality of evidence, it does present an overview of all material reviewed with a narrative account of findings.

Ethics Approval and Consent to Participate

As no primary research was carried out, no ethical approval was required to undertake this scoping study. No specific reference was made to any of the participants in the individual studies, nor does this study infringe on their rights in any way.

The electronic database searches produced 6657 papers; a further 7 papers were identified through other sources. After removing duplicates (n=1058), 5606 publications remained. After titles and abstracts were examined, 784 full-text papers were read and assessed further for eligibility. Of those, 69 articles were identified as suitable by meeting all the inclusion criteria ( Figure 1 ).
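The flow counts reported above are internally consistent, as a quick arithmetic check shows:

```python
# Quick arithmetic check of the PRISMA flow counts reported above.
database_hits = 6657
other_sources = 7
duplicates = 1058
after_dedup = database_hits + other_sources - duplicates
assert after_dedup == 5606  # publications remaining after deduplication

full_text_assessed = 784
included = 69
print(full_text_assessed - included)  # 715 full texts excluded at eligibility
```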

Figure 1. Preferred Reporting Items for Systematic Reviews and Meta-Analyses flowchart of search results.

Study Characteristics

Studies referred to or used a mixture of users (72%) and experts (39%) to evaluate their websites; 54% used a controlled environment, and 26% evaluated websites remotely ( Multimedia Appendix 1 [ 2 - 4 , 11 , 18 - 85 ]). Remote usability, in its most basic form, involves working with participants who are not in the same physical location as the researcher, employing techniques such as live screen sharing or questionnaires. Advantages to remote website evaluations include the ability to evaluate using a larger number of participants as travel time and costs are not a factor, and participants are able to partake at a time that is appropriate to them, increasing the likelihood of participation and the possibility of a greater diversity of participants [ 18 ]. However, the disadvantages of remote website evaluations, in comparison with a controlled setting, are that system performance, network traffic, and the participant’s computer setup can all affect the results.

A variety of website types were included in this review, including government (9%), online news (6%), education (1%), university (12%), and sports organizations (4%). The aspects of quality considered, and their relative importance, varied according to the type of website and the goals users sought to achieve. For example, criteria such as ease of payment or security matter little for educational websites but are especially important for online shopping. In this sense, much attention must be paid, when evaluating the quality of a website, to establishing a specific context of use and purpose [ 19 ].

The context of the participants was also discussed, in relation to the generalizability of results. For example, when evaluations used potential or current users of their website, it was important that computer literacy was reflective of all users [ 20 ]. This could mean ensuring that participants with a range of computer abilities and experiences were used so that results were not biased to the most or least experienced users.

Intervention

A total of 43 evaluation methodologies were identified in the 69 studies in this review. Most of them were variations of similar methodologies, and a brief description of each is provided in Multimedia Appendix 2 . Multimedia Appendix 3 shows the methods used or described in each study.

Questionnaire

Use of questionnaires was the most common methodology referred to (37/69, 54%), including questions to rank or rate attributes and open questions to allow text feedback and suggested improvements. Questionnaires were used before usability testing, after it, or both, to assess usability and overall user experience.

Observed Browsing the Website

Browsing the website using a form of task completion with the participant, such as cognitive walkthrough, was used in 33/69 studies (48%), whereby an expert evaluator used a detailed procedure to simulate task execution and browse all particular solution paths, examining each action while determining if expected user’s goals and memory content would lead to choosing a correct option [ 30 ]. Screen capture was often used (n=6) to record participants’ navigation through the website, and eye tracking was used (n=7) to assess where the eye focuses on each page or the motion of the eye as an individual views a Web page. The think-aloud protocol was used (n=10) to encourage users to express out loud what they were looking at, thinking, doing, and feeling, as they performed tasks. This allows observers to see and understand the cognitive processes associated with task completion. Recording the time to complete tasks (n=6) and mouse movement or clicks (n=8) were used to assess the efficiency of the websites.
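Measures such as time on task and mouse clicks can feed simple efficiency metrics. A minimal sketch, with an assumed definition of efficiency as the ratio of the minimal click path to the clicks actually taken (the definition and numbers are illustrative, not taken from any of the reviewed studies):

```python
# Hypothetical efficiency metric for observed task completion.
def task_efficiency(optimal_clicks, actual_clicks):
    """1.0 means the most direct route was used; lower means extra steps."""
    if actual_clicks == 0:
        return 0.0
    return optimal_clicks / actual_clicks

print(task_efficiency(3, 5))  # 0.6
```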

Qualitative Data Collection

Several forms of qualitative data collection were used in 27/69 studies (39%). Observed browsing, interviews, and focus groups were used either before or after use of the website. Before website use, qualitative research was often used to collect details of which website attributes were important to participants or what weighting participants would give to each attribute. After evaluation, qualitative techniques were used to collate feedback on the quality of the website and any suggestions for improvement.

Automated Usability Evaluation Software

In 9/69 studies (13%), automated usability evaluation focused on developing software, tools, and techniques that speed up evaluation (rapid), reach a wider audience for usability testing (remote), or have built-in analysis features (automated). The latter can involve assessing server logs, website coding, and simulations of the user experience to assess usability [ 42 ].
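A toy example in the spirit of those built-in (automated) analyses is checking internal links against a crawled page map; the page map and URLs below are illustrative assumptions.

```python
# Toy automated check: validate internal links from a crawled page map.
# The page map and URLs are illustrative.
pages = {
    "/home": ["/games", "/about"],
    "/games": ["/home"],
    "/about": ["/home", "/missing"],
}

def broken_links(pages):
    """Return (source, target) pairs whose target page does not exist."""
    return [(src, dst) for src, links in pages.items()
            for dst in links if dst not in pages]

print(broken_links(pages))  # [('/about', '/missing')]
```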

Card Sorting

Card sorting, a technique often linked with assessing the navigability of a website, was used in 5/69 studies (7%). It is useful for discovering the logical structure of an unsorted list of statements or ideas by exploring how people group items, and for finding structures that maximize the probability of users locating items. This can assist with determining an effective website structure.

Web Usage Analysis

Of 69 studies, 3 studies used Web usage analysis or Web analytics to identify browsing patterns by analyzing the participants’ navigational behavior. This could include tracking at the widget level, that is, combining knowledge of the mouse coordinates with elements such as buttons and links, with the layout of the HTML pages, enabling complete tracking of all user activity.
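A minimal sketch of this kind of navigational analysis counts page-to-page transitions per user from a click log; the log format and entries are assumptions for illustration.

```python
# Counting page-to-page transitions per user from a click log
# (log format and entries are illustrative assumptions).
from collections import Counter

click_log = [
    ("user1", "/home"), ("user1", "/games"), ("user1", "/home"),
    ("user2", "/home"), ("user2", "/games"),
]

transitions = Counter()
last_page = {}
for user, page in click_log:
    if user in last_page:
        transitions[(last_page[user], page)] += 1
    last_page[user] = page

print(transitions.most_common(1))  # [(('/home', '/games'), 2)]
```

Real Web analytics tools extend this to the widget level by joining the click coordinates with the page's buttons and links, as described above.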

Outcomes (Attributes Used to Evaluate Websites)

Often, different terminology for website attributes was used to describe the same or similar concepts ( Multimedia Appendix 4 ). The most used website attributes that were assessed can be broken down into 8 broad categories and further subcategories:

  • Usability or ease of use is the degree to which a website can be used to achieve given goals (n=58). It includes navigation such as intuitiveness, learnability, memorability, and information architecture; effectiveness such as errors; and efficiency.
  • Content (n=41) includes completeness, accuracy, relevancy, timeliness, and understandability of the information.
  • Web design criteria (n=29) include use of media, search engines, help resources, originality of the website, site map, user interface, multilanguage, and maintainability.
  • Functionality (n=31) includes links, website speed, security, and compatibility with devices and browsers.
  • Appearance (n=26) includes layout, font, colors, and page length.
  • Interactivity (n=25) includes sense of community, such as the ability to leave feedback and comments, an email or share-with-a-friend option, or forum discussion boards; personalization; help options such as frequently asked questions or customer services; and background music.
  • Satisfaction (n=26) includes usefulness, entertainment, look and feel, and pleasure.
  • Loyalty (n=8) includes first impression of the website.

GoodWeb: Website Evaluation Guide

As such a range of methods was used, a suggested guide of options for evaluating websites, named GoodWeb, is presented below ( Figure 2 ) and applied to an evaluation of e-Bug, an international educational health website [ 14 ]. Allison et al [ 86 ] give full details of how GoodWeb has been applied and the outcomes of the e-Bug website evaluation.

Figure 2. Framework for website evaluation.

Step 1. What Are the Important Website Attributes That Affect User's Experience of the Chosen Website?

Usability or ease of use, content, Web design criteria, functionality, appearance, interactivity, satisfaction, and loyalty were the umbrella terms that encompassed the website attributes identified or evaluated in the 69 studies in this scoping study. Multimedia Appendix 4 contains a summary of the most used website attributes that have been assessed. Recent website evaluations have shifted focus from usability of websites to an overall user’s experience of website use. A decision on which website attributes to evaluate for specific websites could come from interviews or focus groups with users or experts or a literature search of attributes used in similar evaluations.

Application

In the scenario of evaluating e-Bug or similar educational health websites, the attributes chosen to assess could be the following:

  • Appearance: colors, fonts, media or graphics, page length, style consistency, and first impression
  • Content: clarity, completeness, current and timely information, relevance, reliability, and uniqueness
  • Interactivity: sense of community and modern features
  • Ease of use: home page indication, navigation, guidance, and multilanguage support
  • Technical adequacy: compatibility with other devices, load time, valid links, and limited use of special plug-ins
  • Satisfaction: loyalty

These cover the main website attributes appropriate for an educational health website. If the website does not currently have features such as a search engine, a site map, or background music, it may not be appropriate to evaluate these; it may be better to ask whether they would be suitable additions to the website, or they could be combined under the heading modern features . Furthermore, security may not be a necessary attribute to evaluate if participant-identifiable information or bank details are not needed to use the website.

Step 2. What Is the Best Way to Evaluate These Attributes?

Often, a combination of methods is suitable to evaluate a website, as 1 method may not be appropriate to assess all attributes of interest [ 29 ] (see Multimedia Appendix 3 for a summary of the most used methods for evaluating websites). For example, screen capture of task completion may be appropriate to assess the efficiency of a website but would not be the chosen method to assess loyalty. A questionnaire or qualitative interview may be more appropriate for this attribute.

In the scenario of evaluating e-Bug, a questionnaire before browsing the website would be appropriate to rank the importance of the selected website attributes chosen in step 1. It would then be appropriate to observe browsing of the website, collecting data on completion of typical task scenarios and using the screen capture function for future reference. This method could be used to evaluate effectiveness (number of tasks successfully completed), efficiency (whether the most direct route through the website was used to complete the task), and learnability (whether task completion is more efficient or effective on a second attempt). It may then be suitable to use a follow-up questionnaire to rate e-Bug against the website attributes previously ranked. The attribute rankings and ratings could then be combined to indicate where the website performs well and areas for improvement.
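The combination of pre-browse importance rankings with post-browse ratings can be sketched in the spirit of an importance-performance analysis; the attribute names, 1–5 scales, and threshold below are illustrative assumptions, not values from the study.

```python
# Sketch of combining importance rankings with performance ratings
# (attributes, scales, and threshold are illustrative assumptions).
importance = {"content": 5, "ease of use": 4, "appearance": 2}   # pre-browse
performance = {"content": 2, "ease of use": 4, "appearance": 5}  # post-browse

def improvement_areas(importance, performance, threshold=3):
    """High-importance attributes rated below threshold are priorities."""
    return [a for a in importance
            if importance[a] >= threshold and performance[a] < threshold]

print(improvement_areas(importance, performance))  # ['content']
```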

Step 3: Who Should Evaluate the Website?

Both users and experts can be used to evaluate websites. Experts are able to identify areas for improvement in relation to usability, whereas users are able to appraise quality as well as identify areas for improvement. In this respect, users can fully evaluate the user’s experience, where experts may not be able to.

For this reason, it may be more appropriate to use current or potential users of the website for the scenario of evaluating e-Bug.

Step 4: What Setting Should Be Used?

A combination of controlled and remote settings can be used, depending on the methods chosen. For example, it may be appropriate to collect data via a questionnaire, remotely, to increase sample size and reach a more diverse audience, whereas a controlled setting may be more appropriate for task completion using eye-tracking methods.

Strengths and Limitations

A scoping study differs from a systematic review in that it does not critically appraise the quality of the studies before extracting or charting the data. Therefore, this study cannot compare the effectiveness of the different methods or methodologies in evaluating the website attributes. What it does do, however, is review and summarize a large body of literature from different sources in a format that is understandable and informative for future designs of website evaluations.

Furthermore, studies that evaluated banking, e-commerce, or online library websites and did not have measures transferable to a range of other websites were excluded from this study. This decision was made to limit the number of studies that met the remaining inclusion criteria, as the website attributes for these websites were deemed too specialist and not necessarily transferable to a range of websites. Therefore, the findings of this study may not be generalizable to all types of websites. However, Multimedia Appendix 1 shows that data were extracted from a very broad range of websites where the information was deemed transferable to a range of other websites.

A robust website evaluation can identify areas for improvement to both fulfill the goals and desires of its users [ 62 ] and influence their perception of the organization and overall quality of resources [ 48 ]. An improved website could attract and retain more online users; therefore, an evidence-based website evaluation guide is essential.

This scoping study emphasizes that the debate about how to define the quality of websites remains open, and there are numerous approaches and models for evaluating it. Multimedia Appendix 2 shows existing methodologies and tools that can be used to evaluate websites. Many of these are variations on similar approaches; it is therefore not strictly necessary to use these tools at face value, although some could be used to guide analysis following data collection. By following steps 1 to 4 of GoodWeb, the framework suggested in this study, a website evaluation can be tailored to the needs of a specific website and the individual aims of the evaluation, taking into account the desired participants, setting, and evaluation methods.

Acknowledgments

This work was supported by the Primary Care Unit, Public Health England. As this study is secondary research, no further declarations apply.


Authors' Contributions: RA wrote the protocol with input from CH, CM, and VY. RA and CH conducted the scoping review. RA wrote the final manuscript with input from CH, CM, and VY. All authors reviewed and approved the final manuscript.

Conflicts of Interest: None declared.

A Systematic Literature Review on Healthcare Facility Evaluation Methods

Affiliation.

  • 1 Faculty of Architecture and Urbanism at the University of São Paulo (FAUUSP), FAU Cidade Universitária. Rua do Lago, São Paulo, Brazil.
  • PMID: 37157787
  • DOI: 10.1177/19375867231166094

To present a systematic literature review on predesign evaluation (PDE), postoccupancy evaluation (POE), and evidence-based design (EBD); to delimit the concepts and relationships of these terms and place them in the building life cycle framework to guide their application and indicate a common understanding and possible gaps. The Preferred Reporting Items for Systematic Reviews and Meta-Analyses protocol was used. Inclusion criteria covered texts that present a concept, method, procedure, or tool and use the example in healthcare services or other environments. Reports were excluded if there was no evidence of a relationship between the terms, if the terms were cited only rhetorically, if the report was a duplicate, or if an instrument was not related to at least one other term. The identification used Scopus and Web of Science and considered reports until December 2021 (search period). When extracting the evidence, formal quality criteria were observed, and sentences and other elements were collected as evidence and tabulated to segment topics of interest. The searches identified 799 reports, of which 494 were duplicates. In the selection, 53 records were selected from the 305 obtained across 14 searches. The classification extracted concepts, relationships, and frameworks. Results indicate a consistent understanding of POE and EBD and a diffuse understanding of PDE. A summary of the three concepts, including two frameworks, is proposed. Situations are contextualized where these frameworks are used in specific areas of research. One of these frameworks provides a basis for classifying building assessment methods, procedures, and tools but does not detail the classification criteria. Thus, more detailed adjustments should be considered in specific studies.

Keywords: EBD framework; design methodology; evidence-based design (EBD); postoccupancy evaluation (POE); predesign evaluation (PDE); pre–post design; project brief; research methodology; research-informed design; systematic literature review.

Publication types

  • Systematic Review
  • Delivery of Health Care*
  • Health Facilities*


Open Access

Peer-reviewed

Research Article

Frameworks for procurement, integration, monitoring, and evaluation of artificial intelligence tools in clinical settings: A systematic review

Contributed equally to this work with: Sarim Dawar Khan, Zahra Hoodbhoy

Roles Data curation, Formal analysis, Methodology, Project administration, Writing – original draft

Affiliation CITRIC Health Data Science Centre, Department of Medicine, Aga Khan University, Karachi, Pakistan

Roles Conceptualization, Methodology, Supervision, Writing – review & editing

Affiliations CITRIC Health Data Science Centre, Department of Medicine, Aga Khan University, Karachi, Pakistan, Department of Paediatrics and Child Health, Aga Khan University, Karachi, Pakistan

Roles Project administration, Writing – original draft, Writing – review & editing


Roles Methodology, Writing – review & editing

Affiliation Duke Institute for Health Innovation, Duke University School of Medicine, Durham, North Carolina, United States

Roles Data curation, Formal analysis, Methodology, Writing – review & editing

Affiliations Population Health Science Institute, Newcastle University, Newcastle upon Tyne, United Kingdom, Newcastle upon Tyne Hospitals NHS Foundation Trust, Newcastle upon Tyne, United Kingdom, Moorfields Eye Hospital NHS Foundation Trust, London, United Kingdom

Roles Data curation, Formal analysis, Project administration

Roles Data curation, Formal analysis

Roles Writing – original draft, Writing – review & editing

Roles Methodology, Visualization

Roles Methodology, Project administration, Visualization

Roles Supervision, Writing – review & editing

Affiliations Duke Clinical Research Institute, Duke University School of Medicine, Durham, North Carolina, United States, Division of Cardiology, Duke University School of Medicine, Durham, North Carolina, United States

ZS and MPS also contributed equally to this work.

Affiliations CITRIC Health Data Science Centre, Department of Medicine, Aga Khan University, Karachi, Pakistan, Department of Medicine, Aga Khan University, Karachi, Pakistan

* E-mail: [email protected]

  • Sarim Dawar Khan, 
  • Zahra Hoodbhoy, 
  • Mohummad Hassan Raza Raja, 
  • Jee Young Kim, 
  • Henry David Jeffry Hogg, 
  • Afshan Anwar Ali Manji, 
  • Freya Gulamali, 
  • Alifia Hasan, 
  • Asim Shaikh, 

PLOS

  • Published: May 29, 2024
  • https://doi.org/10.1371/journal.pdig.0000514


Research on the applications of artificial intelligence (AI) tools in medicine has increased exponentially over the last few years, but their implementation in clinical practice has not seen a commensurate increase, and there is a lack of consensus on implementing and maintaining such tools. This systematic review aims to summarize frameworks focusing on procuring, implementing, monitoring, and evaluating AI tools in clinical practice. A comprehensive literature search, following PRISMA guidelines, was performed on MEDLINE, Wiley Cochrane, Scopus, and EBSCO databases to identify articles recommending practices, frameworks, or guidelines for AI procurement, integration, monitoring, and evaluation. From the included articles, data regarding study aim, use of a framework, rationale of the framework, and details regarding AI implementation involving procurement, integration, monitoring, and evaluation were extracted. The extracted details were then mapped onto the domains of the Donabedian Plan, Do, Study, Act cycle. The search yielded 17,537 unique articles, of which 47 were evaluated for inclusion based on their full texts and 25 were included in the review. Common themes extracted included transparency, feasibility of operation within existing workflows, integration into existing workflows, validation of the tool using predefined performance indicators, and improving the algorithm and/or adjusting the tool to improve performance. Among the four domains (Plan, Do, Study, Act), the most common was Plan (84%, n = 21), followed by Study (60%, n = 15), Do (52%, n = 13), and Act (24%, n = 6). Among 172 authors, only 1 (0.6%) was from a low-income country (LIC) and 2 (1.2%) were from lower-middle-income countries (LMICs). Healthcare professionals cite the implementation of AI tools within clinical settings as challenging owing to low levels of evidence focusing on integration in the Do and Act domains. The current healthcare AI landscape calls for increased data sharing and knowledge translation to facilitate common goals and reap maximum clinical benefit.

Author summary

The use of artificial intelligence (AI) tools has seen exponential growth in multiple industries over the past few years. Despite this, the implementation of these tools in healthcare settings is lagging, with fewer than 600 AI tools approved by the United States Food and Drug Administration and fewer AI-related job postings in healthcare than in other industries. In this systematic review, we tried to organize and synthesize data and themes from published literature regarding key aspects of AI tool implementation, namely procurement, integration, monitoring, and evaluation, and to map the extracted themes onto the Plan-Do-Study-Act framework. We found that the majority of current literature on AI implementation in healthcare settings focuses on the “Plan” and “Study” domains, with considerably less emphasis on the “Do” and “Act” domains. This is perhaps the reason why experts currently cite the implementation of AI tools in healthcare settings as challenging. Furthermore, the current AI healthcare landscape has poor representation from low- and lower-middle-income countries. To ensure the healthcare industry can implement AI tools into clinical workflows across a variety of settings globally, we call for diverse and inclusive collaborations, coupled with further research targeting the under-investigated stages of AI implementation.

Citation: Khan SD, Hoodbhoy Z, Raja MHR, Kim JY, Hogg HDJ, Manji AAA, et al. (2024) Frameworks for procurement, integration, monitoring, and evaluation of artificial intelligence tools in clinical settings: A systematic review. PLOS Digit Health 3(5): e0000514. https://doi.org/10.1371/journal.pdig.0000514

Editor: Zhao Ni, Yale University, UNITED STATES

Received: September 4, 2023; Accepted: April 18, 2024; Published: May 29, 2024

Copyright: © 2024 Khan et al. This is an open access article distributed under the terms of the Creative Commons Attribution License , which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.

Data Availability: Data sharing is not applicable to this article as no datasets were generated or analyzed during the current study.

Funding: This work was supported by the Patrick J. McGovern Foundation (Grant ID 383000239 to SDK, ZH, MHR, JYK, AAAM, FG, AH, AS, ST, NSK, MRP, SB, ZS, MPS). The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript.

Competing interests: MPS is a co-inventor of intellectual property licensed by Duke University to Clinetic, Inc., KelaHealth, Inc, and Cohere-Med, Inc. MPS holds equity in Clinetic, Inc. MPS has received honorarium for a conference presentation from Roche. MPS is a board member of Machine Learning for Health Care, a non-profit that convenes an annual research conference. SB is a co-inventor of intellectual property licensed by Duke University to Clinetic, Inc. and Cohere-Med, Inc. SB holds equity in Clinetic, Inc.

Introduction

The use of Artificial Intelligence (AI) tools has been growing exponentially, with several applications in the healthcare industry and tremendous potential to improve health outcomes. While there has been a rapid increase in literature on the use of AI in healthcare, the implementation of AI tools lags in both high-income and low-income settings compared to other industries, with fewer than 600 Food and Drug Administration-approved AI algorithms, and even fewer presently used in clinical settings [ 1 – 4 ]. The development-implementation gap has been further assessed by Goldfarb et al, who used job advertisements as a surrogate marker of technology diffusion patterns and found that, among skilled healthcare job postings between 2015 and 2018, only 1 in 1250 required AI skills, a lower rate than in other skilled sectors (information technology, management, finance and insurance, manufacturing, etc) [ 5 ].

Implementation of AI tools is a multi-phase process that involves procurement, integration, monitoring, and evaluation [ 6 , 7 ]. Procurement involves the scouting process before integrating an AI tool, including the decision whether to build or buy the tool. Integration involves deploying an AI tool and incorporating it into existing clinical workflows. Monitoring and evaluation occur post-integration and entail keeping track of tool performance metrics, determining the impact of integrating the tool, and modifying it as needed to ensure it keeps functioning at its original intended level of performance. A key barrier to AI implementation in healthcare highlighted by healthcare leaders across the globe is the lack of a systematic approach to AI procurement, implementation, monitoring, and evaluation, since the majority of research on AI in healthcare does not comprehensively explore the multiple, complex steps involved in ensuring optimal implementation [ 8 – 11 ].

This systematic review aims to summarize themes arising from frameworks focusing on procuring, integrating, monitoring, and evaluating AI tools in clinical practice.

This systematic review followed the Preferred Items for Systematic Review and Meta-Analysis (PRISMA) guidelines for systematic reviews ( S1 Checklist ) [ 12 ]. This review is registered on PROSPERO (ID: CRD42022336899).

Information sources and search strategy

We searched electronic databases (MEDLINE, Wiley Cochrane, Scopus, EBSCO) until June 2022. The search string contained terms that described the technology, setting, framework, and implementation phase, including AI tool procurement, integration, monitoring, and evaluation, using standard MeSH terms. Terms that were not standard MeSH terms, such as “clinical setting”, were added following iterative discussions. To capture papers that offered methodical guidelines for AI implementation (as opposed to experiential papers), and recognizing the heterogeneous nature of “frameworks”, which range from commentaries to complex, extensively researched models, multiple terms such as “framework”, “model”, and “guidelines” were used in the search strategy without explicit definitions, on the understanding that these encompassing terms would capture all relevant literature, which would later be refined per the inclusion and exclusion criteria. The following search string was employed on MEDLINE: ("Artificial Intelligence"[Mesh] OR "Artificial Intelligence" OR "Machine Learning") AND ("clinical setting*"[tiab] OR clinic*[tiab] OR "Hospital" OR "Ambulatory Care"[Mesh] OR "Ambulatory Care Facilities"[Mesh]) AND (framework OR model OR guidelines) AND (monitoring OR evaluation OR procurement OR integration OR maintenance), without any restrictions. The search strategies used for the other databases are described in the appendix ( S1 Appendix ). All search strings were designed and adapted to each database by the lead librarian (KM) at The Aga Khan University.

Eligibility criteria

Inclusion criteria.

All studies focused on implementing AI tools in a clinical setting were included. AI implementation was broadly conceptualized to consist of procurement, integration, monitoring, and evaluation. There was no restriction on the types of articles included.

Exclusion criteria.

Studies published in any language besides English were excluded. Studies describing a single step of implementation (e.g., procurement) for a single AI tool without presenting a framework for implementation were also excluded, as were studies that discussed consumers’ experiences of using an AI tool rather than AI implementation frameworks.

Study selection

Retrieved articles from the systematic search were imported into EndNote Reference Manager (Version X9; Clarivate Analytics, Philadelphia, Pennsylvania) and duplicate articles were removed. All articles were screened in duplicate by two independent pairs of reviewers (AM and JH, FG and SDK). Full texts of articles were then comprehensively reviewed for inclusion based on the predetermined criteria. Due to the heterogeneous nature of the articles curated (including opinion pieces), a risk of bias assessment was not conducted, as no appropriate, validated tool exists for this purpose.

Data extraction

Three pairs of reviewers (SK and SG, SDK and FG, HDJH and AA) independently extracted data from the selected studies using a spreadsheet. Pairs attempted to resolve disagreements first, followed by adjudication by a third external reviewer (ZH) if needed. Data extracted comprised the following items: name of authors, year of publication, journal of publication, country of origin, World Bank region (high-income, middle-income, low-income) for the corresponding author, study aim(s), rationale, methodology, framework novelty, and framework components. Framework component categories included procurement, integration, and post-implementation monitoring and evaluation [ 6 , 7 ].

Data analysis

The qualitative data were extracted and delineated into themes based on the concepts presented in each individual study. Because no risk of bias assessment was performed, a sensitivity analysis was not conducted. Once extracted, the themes, which encompassed the four stages of implementation (procurement, integration, evaluation, and monitoring), were clustered into categories through iterative discussion and agreement within the investigator team. The study team felt that while a holistic framework for AI implementation does not yet exist, there are analogous structures that are widely used in healthcare quality improvement. One of the best-established structures for iterative quality improvement is the plan-do-study-act (PDSA) method ( S1 Fig ) [ 13 ]. PDSA is commonly used for a variety of healthcare improvement efforts [ 14 ], including patient feedback systems [ 15 ] and adherence to guideline-based practices [ 16 ]. This method has four stages: plan, do, study, and act. The ‘plan’ stage identifies a change to be improved; the ‘do’ stage tests the change; the ‘study’ stage examines the success of the change; and the ‘act’ stage identifies adaptations and next steps to inform a new cycle [ 13 ]. PDSA is well suited to serve as a foundation for implementing AI because it is well understood by healthcare leaders around the globe and offers a high level of abstraction to accommodate the great breadth of relevant use cases and implementation contexts.
Hence the PDSA framework was deductively chosen, and the themes extracted from the articles (irrespective of whether the original article(s) used the PDSA framework) were mapped onto its four domains: the ‘plan’ domain representing the steps required in procurement, the ‘do’ domain representing clinical integration, the ‘study’ domain highlighting the monitoring and evaluation processes, and the ‘act’ domain representing the actions taken after monitoring and evaluation to improve the functioning of the tool. This is displayed in S1 Table .
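The deductive mapping described above amounts to a simple lookup from each extracted theme to one PDSA domain. A minimal illustrative sketch (theme labels abbreviate those reported in the Results; this is not an artifact of the study):

```python
# One entry per extracted theme, grouped by the PDSA domain it was mapped
# to in this review: 7 Plan, 4 Do, 5 Study, 1 Act (17 themes in total).
pdsa_mapping = {
    "Plan":  ["rationale for use of AI tools", "ethical issues and bias",
              "transparency", "legal liability for harm",
              "regulatory requirements", "cost of purchase/implementation",
              "feasibility within existing workflows"],
    "Do":    ["appropriate technical expertise", "user training",
              "user acceptability", "integration into clinical workflows"],
    "Study": ["user experience", "validation against performance indicators",
              "cost evaluation", "assessment of clinical outcomes",
              "reporting adverse events"],
    "Act":   ["improvement of the tool/algorithm"],
}

def domain_of(theme, mapping=pdsa_mapping):
    """Return the PDSA domain a theme was mapped to."""
    for domain, themes in mapping.items():
        if theme in themes:
            return domain
    raise KeyError(theme)

assert sum(len(v) for v in pdsa_mapping.values()) == 17  # 17 themes total
```

Representing the mapping as data rather than prose makes the domain coverage counts reported later directly checkable.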

Baseline characteristics of included articles

A total of 17,537 unique studies were returned by the search strategy, of which 47 were retained after title and abstract screening for full-text review. Following full-text review, 25 studies were included in the systematic review. In total, 22 studies were excluded because they focused on pre-implementation processes (n = 12), evaluated the use of a single tool (n = 4), evaluated the perceptions of consumers (n = 4), or did not focus on a clinical setting (n = 2). Fig 1 shows the PRISMA diagram for this process. A range of articles, from narrative and systematic reviews to opinion pieces and letters to the editor, was included in the review.
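The screening arithmetic above is internally consistent; a minimal sketch (all counts taken from the text, variable names hypothetical) checking that the exclusion reasons at full-text review account for the gap between screened-in and included studies:

```python
# PRISMA flow counts reported in the text above.
unique_records = 17_537      # after deduplication
full_text_reviewed = 47      # retained after title/abstract screening
included = 25                # final included studies

# Reasons for exclusion at full-text review.
excluded_reasons = {
    "pre-implementation processes only": 12,
    "evaluated a single tool": 4,
    "evaluated consumer perceptions": 4,
    "not a clinical setting": 2,
}

# Exclusions must sum to the difference between reviewed and included.
assert sum(excluded_reasons.values()) == full_text_reviewed - included

print(f"{included} of {unique_records} unique records included")
```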

https://doi.org/10.1371/journal.pdig.0000514.g001

The year of publication of the included articles ranged from 2017 to 2022, with the most articles (40%, n = 10) published in 2020 and the fewest in 2017 and 2018 (4%, n = 1 each). All corresponding authors of the 25 included articles (100%) originated from high-income countries, with the most common country of author affiliation being the United States of America (52%, n = 13), followed by the United Kingdom, Canada, and Australia (24%, n = 2 each). Among 172 authors, only 1 (0.6%) was from a low-income country (LIC) (Uganda) and 2 (1.2%) were from lower-middle-income countries (LMICs) (India and Ghana) ( Table 1 ). When stated, funding organizations included institutions in the US, Canada, the European Union, and South Korea [ 17 – 24 ].


https://doi.org/10.1371/journal.pdig.0000514.t001

From the 25 included articles, a total of 17 themes were extracted and mapped to their respective domains. Table 2 shows a summary of the distribution of themes across the PDSA domains, including sample quotes from eligible articles. Fig 2 shows a Sankey diagram highlighting the overlap between themes across all articles. The extracted themes are discussed below.


https://doi.org/10.1371/journal.pdig.0000514.g002


https://doi.org/10.1371/journal.pdig.0000514.t002

Seven themes were clustered and mapped to the Plan domain. Most articles in the Plan domain focused on the theme of feasibility of operation within existing workflows (48%, n = 12), followed by transparency (32%, n = 8), ethical issues and bias (32%, n = 8), the cost of purchasing and implementing the tool (20%, n = 5), regulatory approval (20%, n = 5), rationale for use of AI tools (16%, n = 4), and legal liability for harm (12%, n = 3). Example quotes related to each theme are captured in Table 2 .

1) Rationale of use of AI tools. Frameworks highlight the need to select clinically relevant problems and identify the need for acquiring an AI tool before initiating the procurement process [ 27 , 34 – 36 ].

2) Ethical issues and bias. Frameworks noted that AI tools may be developed in the context of competitive venture capitalism, the values and ethics of which often differ from, and may be incompatible with, those of the healthcare industry. While ethical considerations should occur at all stages, it is especially important that, before any tool is implemented, it be critically analyzed in its social, legal, and economic dimensions to ensure ethical use while fulfilling its initially intended purpose [ 17 , 18 , 23 , 27 , 29 , 32 , 33 , 37 ].

3) Transparency. Transparency of AI tools is needed to increase trust in them and to ensure they fulfill their initially intended purpose. Black-box AI tools introduce implementation challenges, and teams implementing AI must balance priorities related to accuracy and interpretability. Even without model interpretability, frameworks highlight the importance of transparency about the training population, model functionality, architecture, risk factors, and outcome definition. Frameworks also recommend transparent reporting of model performance metrics as well as the test sets and methods used to derive them [ 24 , 25 , 28 , 29 , 37 – 40 ].

4) Legal liability for harm. There is emphasis on the legal liability that healthcare settings may face from implementing AI tools that potentially cause harm. There is a need to clarify the degree to which an AI tool developer or clinician user is responsible for potential adverse events. Relevant stakeholders involved in the whole implementation process need to be identified to know who is to be held accountable in case of an adverse event [ 23 , 25 , 29 ].

5) Regulatory requirements. Regulatory frameworks differ across geographies and are in flux. Regulatory decisions about AI tool adoption should be made based on proof of clinically important improvements in relevant patient outcomes [ 22 , 23 , 26 , 32 , 36 ].

6) Cost of purchasing and implementing a tool. Cost is an important factor to consider when deciding to implement an AI tool. The cost should be compared to the baseline standard of care without the tool. Organizations should avoid selecting AI tools that fail to create value for patients or clinicians [ 23 , 26 , 27 , 36 , 41 ].

7) Feasibility of AI tool implementation. A careful analysis of available computing and storage resources should be carried out to ensure sufficient resources are in place to implement a new AI tool. Some AI tools might need specialized infrastructure, particularly if they use large datasets, such as images or high-frequency streaming data. Moreover, similar efforts should be made to assess the differences between the cohort on which the AI tool was trained and the patient cohort in the implementation context. It is suggested to locally validate AI tools, develop a proper adoption plan, and provide clinician users sufficient training to increase the likelihood of success [ 20 , 25 , 26 , 28 , 29 , 33 , 35 – 38 , 40 , 41 ].

The following four themes were clustered and mapped to the Do domain. Articles that were clustered in the Do domain primarily focused on integrating into clinical workflows (44%, n = 11). User training was the second most common theme (24%, n = 6), followed by appropriate technical expertise (16%, n = 4) and user acceptability (8%, n = 2). Example quotes related to each theme are captured in Table 2 .

1) Appropriate technical expertise. Frameworks emphasized that the team responsible for implementing and evaluating a new AI tool should include people with different relevant expertise. Specific perspectives that should be included are a machine learning expert and a clinical expert (i.e., a healthcare professional with extensive knowledge, experience, and expertise in the specific clinical area in which the AI tool is being deployed). Some frameworks suggested involving individuals with expertise across clinical and technical domains who can bridge the different stakeholders. Inadequate representation on the team may lead to poor quality of the AI tool and patient harm due to incorrect information presented to clinician users [ 27 , 30 , 40 , 41 ].

2) User training. Frameworks highlighted the need to train clinician end users to get the maximum benefit from newly implemented AI tools, from understanding and interacting with the user interface to interpreting the outputs from the tool. A rigorous and comprehensive training plan should be executed to train the end-users with the required skillset so that they can handle high-risk patient situations [ 27 , 29 , 33 , 35 , 37 , 41 ].

3) User acceptability. Frameworks highlighted that AI models can be used in inappropriate ways that can potentially harm patients. Unlike drugs, AI models do not come with clear instructions to help users avoid inappropriate use that can lead to negative effects; user acceptability therefore evaluates how well end users acclimatize to using the tool [ 25 , 30 ].

4) Integrating into clinical workflows. For AI tools to have clinical impact, the healthcare delivery setting and clinician users must be equipped to effectively use the tool. Healthcare delivery settings should ensure that individual clinicians are empowered to use the tool effectively [ 17 , 20 , 25 , 27 , 28 , 30 , 31 , 33 , 35 , 37 , 41 ].

Five themes were clustered and mapped to the Study domain. Articles that were clustered in the Study domain primarily focused on validation of the tool using predefined performance indicators (40%, n = 10). Assessment of clinical outcomes was the second most common theme (24%, n = 6), followed by user experience (8%, n = 2), reporting of adverse events (4%, n = 1), and cost evaluation (4%, n = 1). Example quotes related to each theme are captured in Table 2 .

1) User experience. User experience in the study domain concerned the perception of AI system outputs from different perspectives ranging from professionals to patients. It is important to look at barriers to effective use, including trust, instructions, documentation, and user training [ 21 , 27 ].

2) Validation of the tool using predefined performance indicators. Frameworks discussed many different metrics and approaches to AI tool evaluation, including sensitivity, specificity, precision, F1 score, the area under the receiver operating characteristic (ROC) curve, and calibration plots. In addition to the metrics themselves, it is important to specify how the metrics are calculated. Frameworks also discussed the importance of evaluating AI tools on local, independent datasets and potentially fine-tuning AI tools to local settings, if needed [ 20 – 23 , 27 , 29 , 31 , 35 , 37 , 39 ].
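The review does not prescribe code, but the performance indicators it cites have standard definitions computable from a confusion matrix. A minimal self-contained sketch (the labels and predictions are hypothetical, standing in for a tool validated on a local dataset):

```python
def confusion_counts(y_true, y_pred):
    """Return (tp, fp, tn, fn) for binary labels, where 1 = positive."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    return tp, fp, tn, fn

def metrics(y_true, y_pred):
    """Sensitivity, specificity, precision, and F1 from binary predictions."""
    tp, fp, tn, fn = confusion_counts(y_true, y_pred)
    sensitivity = tp / (tp + fn)   # true positives correctly flagged
    specificity = tn / (tn + fp)   # true negatives correctly cleared
    precision = tp / (tp + fp)     # flagged cases that are truly positive
    f1 = 2 * precision * sensitivity / (precision + sensitivity)
    return {"sensitivity": sensitivity, "specificity": specificity,
            "precision": precision, "f1": f1}

# Hypothetical local validation labels vs. an AI tool's predictions.
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 0, 0, 0, 0, 1, 0]
print(metrics(y_true, y_pred))
```

As the text notes, specifying exactly how such metrics are calculated (and on which local test set) matters as much as the numbers themselves; threshold-free measures such as AUROC and calibration plots additionally require predicted probabilities rather than binary labels.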

3) Cost evaluation. Frameworks discussed the importance of accounting for costs associated with installation, use, and maintenance of AI tools. A particularly important dimension of costs is burden placed on frontline clinicians and changes in time required to complete clinical duties [ 27 ].

4) Assessment of clinical outcomes. Frameworks highlighted the importance of determining if a given AI tool leads to an improvement in clinical patient outcomes. AI tools are unlikely to improve patient outcomes unless clinician users effectively use the tool to intervene on patients. Changes to clinical decision making should be assessed to also ensure that clinician users do not over-rely on the AI tool [ 18 , 19 , 22 , 25 , 30 , 35 ].

5) Reporting adverse events. Frameworks discussed the importance of defining processes to report adverse events and system failures to relevant regulatory agencies. Healthcare settings should agree on reporting protocols with the AI tool developer. Software updates that address known problems should be categorized as low-, medium-, or high-risk to ensure stable, appropriate use at the time of updating [ 32 ].

One theme was mapped to the Act domain.

1) Improvement of the tool/algorithm to improve performance. Frameworks discussed the need for tailored guidance on the use of AI tools that continuously learn from new data, and on allowing users and sites to adjust and fine-tune model thresholds to optimize performance for local contexts. For all AI tools, continuous monitoring should be in place and there should be channels for clinician users to provide feedback to AI tool developers for necessary changes. This theme was mentioned by 6 articles (24%, n = 6), with example quotes captured in Table 2 [ 27 , 29 , 33 , 35 , 37 , 41 ].

Framework coverage of PDSA domains.

Among the four domains (Plan, Do, Study, Act), the most common was Plan (84%, n = 21), followed by Study (60%, n = 15), Do (52%, n = 13), and Act (24%, n = 6). Among the 25 included frameworks, four (16%) discussed all four domains, four (16%) discussed three domains, ten (40%) discussed two domains, and seven (28%) discussed only one domain.

Principal findings

In this systematic review, we comprehensively synthesized themes emerging from AI implementation frameworks in healthcare, with a specific focus on the different phases of implementation. To help frame the AI implementation phases, we utilized the broadly recognizable PDSA approach. We found that the current literature on AI implementation mainly focuses on the Plan and Study domains, whereas the Do and Act domains are discussed less often, and that LMICs/LICs are underrepresented. Almost all framework authors originated from high-income countries (167 of 172 authors, 97.1%), with the United States of America the most represented (68 of 172 authors, 39.5%).

Assessment of the existing frameworks

Finding that the most commonly addressed domains were Plan and Study is encouraging, as the capacity for strategic change management has been identified as a major barrier to AI implementation in healthcare [ 8 ]. Crossnohere et al. explored 14 AI frameworks in medicine and reported findings comparable to the current study, with most frameworks focusing on development and validation subthemes in each domain [ 42 ]. This focus may help mitigate potential risks from algorithm integration, such as dataset shift, accidental fitting of confounders, and differences in performance metrics owing to generalization to new populations [ 43 ]. The need for evolving, unified regulatory mechanisms, with improved understanding of the capabilities of AI, further drives the conversation towards the initial steps of implementation [ 44 ]. This could explain why researchers often focus on the Plan and Study domains much more than on other aspects of AI tool use: these steps can ensure minimal adverse effect on human outcomes before an AI tool is implemented in a wider setting, especially in healthcare, where the margin of error is minimal, if not nonexistent.

The most common themes in the Plan domain were assessing feasibility of model operation within existing workflows, transparency, and ethical issues and bias. Researchers across contexts emphasized the importance of effectively integrating AI tools into clinical workflows to enable positive impacts on clinical outcomes. Similarly, there was consensus among existing frameworks on providing transparency around how models are developed and function, through understanding the internal workings of the tool to comprehend the medical decisions stemming from its use, to help drive adoption and successful rollouts of AI tools [ 45 ]. Furthermore, there is still vast public concern surrounding the ethical issues of utilizing AI tools in clinical settings [ 46 ]. The least common themes in the Plan domain were rationale for use and legal liability for harm. Without a clear problem statement and rationale for use, adoption of AI is unlikely. Unfortunately, existing frameworks do not yet emphasize the importance of deeply understanding and articulating the problem addressed by an AI tool. Similarly, the lack of emphasis placed on legal liability for harm likely stems from variable approaches to product liability and a general lack of understanding of how to attribute responsibility and accountability for product performance.

The most common theme in the Study domain was validation against predefined performance indicators. When these tools are studied, validation and assessment of clinical outcomes against standard-of-care strategies are perhaps easier to conduct than evaluations of final implementation procedures. Although validation of a tool is vital for institutions to develop clinically trustworthy decision support systems [ 47 ], it is not the sole factor determining whether an institution commits to a tool. User experience, economic burden, and regulatory compliance are equally important, if not more so, especially in LMICs [ 48 , 49 ].

We found that the Do and Act phases were the least commonly discussed domains. That these domains receive the least attention in the medical literature may contribute to the reported difficulties of implementing AI tools into existing human processes and clinical settings [ 50 ]. Within the Do domain, implementation challenges are faced not only in clinical applications but also in other healthcare disciplines, such as the delivery of medical education, where a lack of technical knowledge is often cited as the main source of difficulty [ 51 ]. Key implementation challenges identified previously also include logistical complications and human barriers to adoption, such as ease of use and sociocultural implications [ 43 ], which remain under-evaluated. These aspects of implementation form the backbone of a practical rollout of AI tools. However, only a small number of studies focused on user acceptability, user training, and technical expertise requirements, which are key facilitators of successful integration [ 52 ]. Furthermore, perhaps owing to the emerging nature of the field, the Act domain was by far the least prevalent in eligible articles, with only six articles discussing improvement of the AI tool following integration.

Gaps in the existing frameworks

Across all articles included in the current systematic review, HICs dominate the research landscape [ 53 ]. HICs have robust and diverse funding portfolios and are home to the leading institutions specializing in all aspects of AI [ 54 ]. The role of HICs in AI development is corroborated by existing literature: for example, three systematic reviews of randomized controlled trials (RCTs) assessing AI tools were published in 2021 and 2022 [ 55 – 57 ]. In total, these reviews included 95 studies published in English and conducted across 29 countries. The most common settings were the United States, the Netherlands, Canada, Spain, and the United Kingdom (n = 3, 3%). Other than China, the Global South is barely represented, with a single study conducted in India, a single study conducted in South America, and no studies conducted in Africa. This is mirrored by qualitative research: a recent systematic review found that among 102 eligible studies, 90 (88.2%) were from countries meeting the United Nations Development Programme’s definition of “very high human development” [ 58 ].

While LICs/LMICs, with their high disease burdens, have great potential to benefit from AI tools, their lack of representation puts them at a significant disadvantage in AI adoption. Because existing frameworks were developed for resource- and capability-rich environments, they may not be generalizable or applicable to LICs/LMICs. They consider neither the severe limitations in local equipment, trained personnel, infrastructure, data protection frameworks, and public policies that these countries encounter [ 59 ] nor problems unique to these countries, such as societal acceptance [ 60 ] and physician readiness [ 61 ]. In addition, it has been argued that AI tools should be contextually relevant and designed to fit a specific setting [ 44 ]. LICs/LMICs often have poor governance frameworks, which are vital for the success of AI implementation; governance is a key, often region-specific and contextual theme that provides a clear structure for ethical oversight and implementation processes. If the development of AI is not inclusive of researchers in LICs/LMICs, these regions risk becoming slow adopters of the technology [ 62 ].

Certain themes that are important to AI use and were expected to be extracted were notably missing from the literature. That the Act domain was least discussed reveals that existing frameworks fail to address when and how AI tools should be decommissioned and what needs to be considered when upgrading existing tools. Furthermore, while there is great potential to implement AI in healthcare, there appears to be a disconnect, or missing link, between developers and end users. Crossnohere et al. found that the frameworks examined for the use of AI in medicine were least likely to offer direction on engaging relevant stakeholders and end users to facilitate the adoption of AI [ 42 ]. Successful implementation of AI requires active collaboration between developers and end users, along with “facilitators” who promote this collaboration by connecting the two parties [ 42 , 63 ]. Without these “facilitators” of AI technology, emerging AI technology may remain confined to a minority of early adopters, with very few tools gaining widespread traction.

Strengths, Limitations and future directions

This review has appreciable strengths and some limitations. It is the first study evaluating the implementation of AI tools in clinical settings across the entirety of the medical literature using a robust search strategy. A pre-established, extensively researched framework (PDSA) was also employed for domain and theme mapping. The PDSA framework has previously been applied to AI implementation procedures in the literature, but we believe the current paper takes a different approach by mapping distinct themes of AI implementation to a modified PDSA framework [ 64 ]. The current study focused on four key concepts in AI implementation, namely procurement, integration, monitoring, and evaluation. We felt these formed a comprehensive yet succinct list describing the steps of AI implementation within healthcare settings, though they are by no means exhaustive. As AI becomes more dominant in healthcare, the need to continuously appraise these tools will grow, with important implications for quality improvement. Limitations of the current review include the exclusion of studies published in other languages, which may have led to the exclusion of some relevant studies, and the lack of a risk of bias assessment, due to the lack of validated tools for opinion pieces. The term “decision support” was not used in the search strategy, since we aimed to capture frameworks and guidelines on AI implementation rather than articles referring to specific decision support tools. We recognize this may have inadvertently missed some articles; however, we felt the terms in the search strategy, formulated iteratively, adequately captured the necessary articles. A significant number of included articles had an inherently high risk of bias since they are expert opinion rather than empirical evidence.
Additionally, due to the heterogeneity in language surrounding AI implementation, conducting the literature search was difficult and some studies may not have been captured by the search strategy. Furthermore, the study searched scientific papers from only four databases, namely MEDLINE, Wiley Cochrane, Scopus, and EBSCO. The current review was also not able to compare implementation processes across different countries.

In order to develop clinically applicable strategies to tackle barriers to the implementation of AI tools, we propose that future studies evaluating specific AI tools place additional emphasis on the themes within the later stages of implementation. For future research, strategies to facilitate the implementation of AI tools may be developed by identifying subthemes within each PDSA domain. LIC and LMIC stakeholders can fill gaps in current frameworks and must be proactively and intentionally engaged in efforts to develop, integrate, monitor, and evaluate AI tools to ensure wider adoption and benefit globally.

The existing frameworks on AI implementation largely focus on the initial stages of implementation and were generated with little input from LICs/LMICs. Healthcare professionals repeatedly cite how challenging it is to implement AI in their clinical settings with little guidance on how to do so. For future adoption of AI in healthcare, it is necessary to develop a more comprehensive and inclusive framework by engaging collaborators across the globe from different socioeconomic backgrounds and to conduct additional studies that evaluate these parameters. Implementation guided by diverse and inclusive collaborations, coupled with further research targeted at under-investigated stages of AI implementation, is needed before institutions can swiftly adopt existing tools within their clinical settings.

Supporting information

S1 Checklist. PRISMA checklist.

https://doi.org/10.1371/journal.pdig.0000514.s001

S1 Fig. The PDSA cycle.

https://doi.org/10.1371/journal.pdig.0000514.s002

S1 Table. Domains of the Modified PDSA framework for AI implementation.

https://doi.org/10.1371/journal.pdig.0000514.s003

S1 Appendix. Search Strategy.

https://doi.org/10.1371/journal.pdig.0000514.s004

Acknowledgments

The authors gratefully acknowledge the role of Dr. Khwaja Mustafa, Head Librarian at the Aga Khan University for facilitating and synthesizing the initial literature search.

  • 2. Center for Devices and Radiological Health. Artificial Intelligence and machine learning (AI/ml)-enabled medical devices. Food and Drug Administration. 2022 [cited 2023 Aug 20]. Available from: https://www.fda.gov/medical-devices/software-medical-device-samd/artificial-intelligence-and-machine-learning-aiml-enabled-medical-devices#resources .
  • 5. Goldfarb A, Teodoridis F. Why is AI adoption in health care lagging? Washington DC: Brookings Institution; 2022.

A systematic literature review of empirical research on ChatGPT in education

  • Open access
  • Published: 26 May 2024
  • Volume 3, article number 60 (2024)


  • Yazid Albadarin   ORCID: orcid.org/0009-0005-8068-8902 1 ,
  • Mohammed Saqr 1 ,
  • Nicolas Pope 1 &
  • Markku Tukiainen 1  


Over the last four decades, studies have investigated the incorporation of Artificial Intelligence (AI) into education. A recent prominent AI-powered technology that has impacted the education sector is ChatGPT. This article provides a systematic review of 14 empirical studies incorporating ChatGPT into various educational settings, published between 2022 and the 10th of April 2023, the date the search process was conducted. It carefully followed the essential steps outlined in the Preferred Reporting Items for Systematic Reviews and Meta-Analyses (PRISMA 2020) guidelines, as well as Okoli’s (Okoli in Commun Assoc Inf Syst, 2015) steps for conducting a rigorous and transparent systematic review. In this review, we aimed to explore how students and teachers have utilized ChatGPT in various educational settings, as well as the primary findings of those studies. By employing Creswell’s (Creswell in Educational research: planning, conducting, and evaluating quantitative and qualitative research [Ebook], Pearson Education, London, 2015) coding techniques for data extraction and interpretation, we sought to gain insight into their initial attempts at incorporating ChatGPT into education. This approach also enabled us to extract insights and considerations that can facilitate its effective and responsible use in future educational contexts. The results of this review show that learners have utilized ChatGPT as a virtual intelligent assistant offering instant feedback, on-demand answers, and explanations of complex topics. Additionally, learners have used it to enhance their writing and language skills by generating ideas, composing essays, summarizing, translating, paraphrasing texts, or checking grammar. Moreover, learners turned to it as an aid to directed and personalized learning, with it assisting in understanding concepts and homework, providing structured learning plans, and clarifying assignments and tasks.
However, the results of specific studies (n = 3, 21.4%) show that overuse of ChatGPT may negatively impact learners’ innovative capacities and collaborative learning competencies. Educators, on the other hand, have utilized ChatGPT to create lesson plans, generate quizzes, and provide additional resources, which helped them enhance their productivity and efficiency and promote different teaching methodologies. Despite these benefits, the majority of the reviewed studies stress the importance of structured training, support, and clear guidelines for both learners and educators to mitigate the drawbacks. This includes developing critical evaluation skills to assess the accuracy and relevance of information provided by ChatGPT, as well as strategies for integrating human interaction and collaboration into learning activities that involve AI tools. Furthermore, they recommend ongoing research and proactive dialogue with policymakers, stakeholders, and educational practitioners to refine and enhance the use of AI in learning environments. This review could serve as an insightful resource for practitioners who seek to integrate ChatGPT into education and stimulate further research in the field.


1 Introduction

Educational technology, a rapidly evolving field, plays a crucial role in reshaping the landscape of teaching and learning [ 82 ]. One of the most transformative technological innovations of our era that has influenced the field of education is Artificial Intelligence (AI) [ 50 ]. Over the last four decades, AI in education (AIEd) has gained remarkable attention for its potential to make significant advancements in learning, instructional methods, and administrative tasks within educational settings [ 11 ]. In particular, a large language model (LLM), a type of AI algorithm that applies artificial neural networks (ANNs) and uses massive data sets to understand, summarize, generate, and predict new content that is often difficult to differentiate from human creations [ 79 ], has opened up novel possibilities for enhancing various aspects of education, from content creation to personalized instruction [ 35 ]. Chatbots that leverage the capabilities of LLMs to understand and generate human-like responses have also presented the capacity to enhance student learning and educational outcomes by engaging students, offering timely support, and fostering interactive learning experiences [ 46 ].

The ongoing and remarkable technological advancements in chatbots have made their use more convenient, increasingly natural and effortless, and have expanded their potential for deployment across various domains [ 70 ]. One prominent example of chatbot applications is the Chat Generative Pre-Trained Transformer, known as ChatGPT, which was introduced by OpenAI, a leading AI research lab, on November 30th, 2022. ChatGPT employs a variety of deep learning techniques to generate human-like text, with a particular focus on recurrent neural networks (RNNs). Long short-term memory (LSTM) allows it to grasp the context of the text being processed and retain information from previous inputs. Also, the transformer architecture, a neural network architecture based on the self-attention mechanism, allows it to analyze specific parts of the input, thereby enabling it to produce more natural-sounding and coherent output. Additionally, the unsupervised generative pre-training and the fine-tuning methods allow ChatGPT to generate more relevant and accurate text for specific tasks [ 31 , 62 ]. Furthermore, reinforcement learning from human feedback (RLHF), a machine learning approach that combines reinforcement learning techniques with human-provided feedback, has helped improve ChatGPT’s model by accelerating the learning process and making it significantly more efficient.
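The self-attention mechanism mentioned above can be illustrated with a minimal single-head sketch. This is a deliberately simplified, pure-Python toy with arbitrary values, not OpenAI's implementation: each query row is compared against all key rows, the scaled dot-product scores are normalized with softmax, and the output is the resulting weighted mix of the value rows.

```python
import math

def softmax(xs):
    # Numerically stable softmax over a list of scores.
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    s = sum(exps)
    return [e / s for e in exps]

def attention(Q, K, V):
    # Scaled dot-product attention for one head.
    # Q, K, V are lists of row vectors; output has one row per query.
    d = len(K[0])
    out = []
    for q in Q:
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d) for k in K]
        weights = softmax(scores)
        out.append([sum(w * v[j] for w, v in zip(weights, V))
                    for j in range(len(V[0]))])
    return out

# Toy input: one query that matches the first key more strongly,
# so the output is biased toward the first value row.
Q = [[1.0, 0.0]]
K = [[1.0, 0.0], [0.0, 1.0]]
V = [[1.0, 2.0], [3.0, 4.0]]
ctx = attention(Q, K, V)
```

The same weighting idea, applied across every token pair in parallel and stacked over many layers and heads, is what lets a transformer attend to specific parts of its input.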

This cutting-edge natural language processing (NLP) tool is widely recognized as one of today's most advanced LLMs-based chatbots [ 70 ], allowing users to ask questions and receive detailed, coherent, systematic, personalized, convincing, and informative human-like responses [ 55 ], even within complex and ambiguous contexts [ 63 , 77 ]. ChatGPT is considered the fastest-growing technology in history: in just three months following its public launch, it amassed an estimated 120 million monthly active users [ 16 ] with an estimated 13 million daily queries [ 49 ], surpassing all other applications [ 64 ]. This remarkable growth can be attributed to the unique features and user-friendly interface that ChatGPT offers. Its intuitive design allows users to interact seamlessly with the technology, making it accessible to a diverse range of individuals, regardless of their technical expertise [ 78 ]. Additionally, its exceptional performance, which results from a combination of advanced algorithms, continuous enhancements, and extensive training on a diverse dataset that includes various text sources such as books, articles, websites, and online forums [ 63 ], has contributed to a more engaging and satisfying user experience [ 62 ]. These factors collectively explain its remarkable global growth and set it apart from predecessors like Bard, Bing Chat, ERNIE, and others.

In this context, several studies have explored the technological advancements of chatbots. One noteworthy recent research effort, conducted by Schöbel et al. [ 70 ], stands out for its comprehensive analysis of more than 5,000 studies on communication agents. This study offered a comprehensive overview of the historical progression and future prospects of communication agents, including ChatGPT. Moreover, other studies have focused on making comparisons, particularly between ChatGPT and alternative chatbots like Bard, Bing Chat, ERNIE, LaMDA, BlenderBot, and various others. For example, O’Leary [ 53 ] compared two chatbots, LaMDA and BlenderBot, with ChatGPT and revealed that ChatGPT outperformed both. This superiority arises from ChatGPT’s capacity to handle a wider range of questions and generate slightly varied perspectives within specific contexts. Similarly, ChatGPT exhibited an impressive ability to formulate interpretable responses that were easily understood when compared with Google's feature snippet [ 34 ]. Additionally, ChatGPT was compared to other LLMs-based chatbots, including Bard and BERT, as well as ERNIE. The findings indicated that ChatGPT exhibited strong performance in the given tasks, often outperforming the other models [ 59 ].

Furthermore, in the education context, a comprehensive study systematically compared a range of the most promising chatbots, including Bard, Bing Chat, ChatGPT, and Ernie across a multidisciplinary test that required higher-order thinking. The study revealed that ChatGPT achieved the highest score, surpassing Bing Chat and Bard [ 64 ]. Similarly, a comparative analysis was conducted to compare ChatGPT with Bard in answering a set of 30 mathematical questions and logic problems, grouped into two question sets. Set (A) is unavailable online, while Set (B) is available online. The results revealed ChatGPT's superiority in Set (A) over Bard. Nevertheless, Bard's advantage emerged in Set (B) due to its capacity to access the internet directly and retrieve answers, a capability that ChatGPT does not possess [ 57 ]. However, through these varied assessments, ChatGPT consistently highlights its exceptional prowess compared to various alternatives in the ever-evolving chatbot technology.

The widespread adoption of chatbots, especially ChatGPT, by millions of students and educators, has sparked extensive discussions regarding its incorporation into the education sector [ 64 ]. Accordingly, many scholars have contributed to the discourse, expressing both optimism and pessimism regarding the incorporation of ChatGPT into education. For example, ChatGPT has been highlighted for its capabilities in enriching the learning and teaching experience through its ability to support different learning approaches, including adaptive learning, personalized learning, and self-directed learning [ 58 , 60 , 91 ], deliver summative and formative feedback to students and provide real-time responses to questions, increase the accessibility of information [ 22 , 40 , 43 ], foster students’ performance, engagement and motivation [ 14 , 44 , 58 ], and enhance teaching practices [ 17 , 18 , 64 , 74 ].

On the other hand, concerns have been also raised regarding its potential negative effects on learning and teaching. These include the dissemination of false information and references [ 12 , 23 , 61 , 85 ], biased reinforcement [ 47 , 50 ], compromised academic integrity [ 18 , 40 , 66 , 74 ], and the potential decline in students' skills [ 43 , 61 , 64 , 74 ]. As a result, ChatGPT has been banned in multiple countries, including Russia, China, Venezuela, Belarus, and Iran, as well as in various educational institutions in India, Italy, Western Australia, France, and the United States [ 52 , 90 ].

Clearly, the advent of chatbots, especially ChatGPT, has provoked significant controversy due to their potential impact on learning and teaching. This indicates the necessity for further exploration to gain a deeper understanding of this technology and carefully evaluate its potential benefits, limitations, challenges, and threats to education [ 79 ]. Therefore, conducting a systematic literature review will provide valuable insights into the potential prospects and obstacles linked to its incorporation into education. This systematic literature review will primarily focus on ChatGPT, driven by the key factors outlined above.

However, the existing literature lacks a systematic literature review of empirical studies. Thus, this systematic literature review aims to address this gap by synthesizing the existing empirical studies conducted on chatbots, particularly ChatGPT, in the field of education, highlighting how ChatGPT has been utilized in educational settings, and identifying any existing gaps. This review may be particularly useful for researchers in the field and educators who are contemplating the integration of ChatGPT or any chatbot into education. The following research questions will guide this study:

What are students' and teachers' initial attempts at utilizing ChatGPT in education?

What are the main findings derived from empirical studies that have incorporated ChatGPT into learning and teaching?

2 Methodology

To conduct this study, the authors followed the essential steps of the Preferred Reporting Items for Systematic Reviews and Meta-Analyses (PRISMA 2020) and Okoli’s [ 54 ] steps for conducting a systematic review. These included identifying the study’s purpose, drafting a protocol, applying a practical screening process, searching the literature, extracting relevant data, evaluating the quality of the included studies, synthesizing the studies, and ultimately writing the review. The subsequent section provides an extensive explanation of how these steps were carried out in this study.

2.1 Identify the purpose

Given the widespread adoption of ChatGPT by students and teachers for various educational purposes, often without a thorough understanding of responsible and effective use or a clear recognition of its potential impact on learning and teaching, the authors recognized the need for further exploration of ChatGPT's impact on education in this early stage. Therefore, they have chosen to conduct a systematic literature review of existing empirical studies that incorporate ChatGPT into educational settings. Despite the limited number of empirical studies due to the novelty of the topic, their goal is to gain a deeper understanding of this technology and proactively evaluate its potential benefits, limitations, challenges, and threats to education. This effort could help to understand initial reactions and attempts at incorporating ChatGPT into education and bring out insights and considerations that can inform the future development of education.

2.2 Draft the protocol

The next step is formulating the protocol. This protocol serves to outline the study process in a rigorous and transparent manner, mitigating researcher bias in study selection and data extraction [ 88 ]. The protocol will include the following steps: generating the research question, predefining a literature search strategy, identifying search locations, establishing selection criteria, assessing the studies, developing a data extraction strategy, and creating a timeline.

2.3 Apply practical screen

The screening step aims to accurately filter the articles resulting from the searching step and select the empirical studies that have incorporated ChatGPT into educational contexts, which will guide us in answering the research questions and achieving the objectives of this study. To ensure the rigorous execution of this step, our inclusion and exclusion criteria were determined based on the authors' experience and informed by previous successful systematic reviews [ 21 ]. Table 1 summarizes the inclusion and exclusion criteria for study selection.

2.4 Literature search

We conducted a thorough literature search to identify articles that explored, examined, and addressed the use of ChatGPT in educational contexts. We utilized two research databases: Dimensions.ai, which provides access to a large number of research publications, and lens.org, which offers access to over 300 million articles, patents, and other research outputs from diverse sources. Additionally, we included three databases, Scopus, Web of Science, and ERIC, which contain relevant research on the topic that addresses our research questions. To browse and identify relevant articles, we used the following search formula: ("ChatGPT" AND "Education"), which included the Boolean operator "AND" to get more specific results. The subject area in the Scopus and ERIC databases was narrowed to the "ChatGPT" and "Education" keywords, and in the WoS database was limited to the "Education" category. The search was conducted between the 3rd and 10th of April 2023, which resulted in 276 articles from all selected databases (111 articles from Dimensions.ai, 65 from Scopus, 28 from Web of Science, 14 from ERIC, and 58 from Lens.org). These articles were imported into the Rayyan web-based system for analysis. The duplicates were identified automatically by the system. Subsequently, the first author manually reviewed the duplicated articles, confirmed that they had the same content, and then removed them, leaving us with 135 unique articles. Afterward, the titles, abstracts, and keywords of the first 40 manuscripts were scanned and reviewed by the first author and were discussed with the second and third authors to resolve any disagreements. Subsequently, the first author proceeded with the filtering process for all articles and carefully applied the inclusion and exclusion criteria as presented in Table 1. Articles that met any one of the exclusion criteria were eliminated, resulting in 26 articles. Afterward, the authors met to carefully scan and discuss them.
The authors agreed to eliminate any empirical studies solely focused on checking ChatGPT capabilities, as these studies do not guide us in addressing the research questions and achieving the study's objectives. This resulted in 16 articles eligible for quality appraisal.
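The retrieval and deduplication arithmetic described above can be checked with a short tally. The per-database counts are those reported in the review; the dictionary structure and variable names are our own illustration, not part of the review's tooling.

```python
# Records retrieved per database during the April 2023 search (Sect. 2.4).
per_database = {
    "Dimensions.ai": 111,
    "Scopus": 65,
    "Web of Science": 28,
    "ERIC": 14,
    "Lens.org": 58,
}

# Total records before screening, and the deduplication performed in Rayyan.
total_retrieved = sum(per_database.values())   # 276 articles in total
unique_articles = 135                          # after duplicate removal
duplicates_removed = total_retrieved - unique_articles

print(total_retrieved, duplicates_removed)     # 276 141
```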

2.5 Quality appraisal

The examination and evaluation of the quality of the extracted articles is a vital step [ 9 ]. Therefore, the extracted articles were carefully evaluated for quality using Fink’s [ 24 ] standards, which emphasize the necessity for detailed descriptions of methodology, results, conclusions, strengths, and limitations. The process began with a thorough assessment of each study's design, data collection, and analysis methods to ensure their appropriateness and comprehensive execution. The clarity, consistency, and logical progression from data to results and conclusions were also critically examined. Potential biases and recognized limitations within the studies were also scrutinized. Ultimately, two articles were excluded for failing to meet Fink’s criteria, particularly in providing sufficient detail on methodology, results, conclusions, strengths, or limitations. The review process is illustrated in Fig.  1 .

figure 1

The study selection process

2.6 Data extraction

The next step is data extraction, the process of capturing the key information and categories from the included studies. To improve efficiency, reduce variation among authors, and minimize errors in data analysis, the coding categories were constructed using Creswell's [ 15 ] coding techniques for data extraction and interpretation. The coding process involves three sequential steps. The initial stage encompasses open coding , where the researcher examines the data, generates codes to describe and categorize it, and gains a deeper understanding without preconceived ideas. Following open coding is axial coding , where the interrelationships between codes from open coding are analyzed to establish more comprehensive categories or themes. The process concludes with selective coding , refining and integrating categories or themes to identify core concepts emerging from the data. The first coder performed the coding process, then engaged in discussions with the second and third authors to finalize the coding categories for the first five articles. The first coder then proceeded to code all studies and engaged again in discussions with the other authors to ensure the finalization of the coding process. After a comprehensive analysis and capturing of the key information from the included studies, the data extraction and interpretation process yielded several themes. These themes have been categorized and are presented in Table  2 . It is important to note that open coding results were removed from Table  2 for aesthetic reasons, as it included many generic aspects, such as words, short phrases, or sentences mentioned in the studies.
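The three sequential coding passes described above can be sketched as simple data transformations. Every excerpt, code, and theme in this sketch is invented purely for illustration; none of it is the review's actual extracted data.

```python
# Toy illustration of Creswell's open -> axial -> selective coding passes.
# All excerpts, codes, and themes below are hypothetical examples.

# 1. Open coding: raw excerpts are tagged with short descriptive codes.
open_codes = {
    "gives instant feedback on my answers": "feedback",
    "explains hard concepts step by step": "explanation",
    "suggests ideas for my essay": "idea generation",
    "fixes my grammar mistakes": "language support",
}

# 2. Axial coding: related open codes are grouped into broader categories.
axial_categories = {
    "virtual intelligent assistant": ["feedback", "explanation"],
    "writing and language assistant": ["idea generation", "language support"],
}

# 3. Selective coding: categories are integrated under one core theme.
core_theme = {"ChatGPT as a learning aid": sorted(axial_categories)}

# Sanity check: axial coding should account for every open code.
flattened = sorted(c for codes in axial_categories.values() for c in codes)
assert flattened == sorted(set(open_codes.values()))
```

In practice the passes are iterative and interpretive rather than mechanical, but the sketch shows how each pass compresses the previous one toward a small set of core concepts.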

2.7 Synthesize studies

In this stage, we gathered, discussed, and analyzed the key findings that emerged from the selected studies. The synthesis stage is considered a transition from an author-centric to a concept-centric focus, enabling us to map all the provided information to achieve the most effective evaluation of the data [ 87 ]. Initially, the authors extracted data that included general information about the selected studies, including the author(s)' names, study titles, years of publication, educational levels, research methodologies, sample sizes, participants, main aims or objectives, raw data sources, and analysis methods. Following that, all key information and significant results from the selected studies were compiled using Creswell’s [ 15 ] coding techniques for data extraction and interpretation to identify core concepts and themes emerging from the data, focusing on those that directly contributed to our research questions and objectives, such as the initial utilization of ChatGPT in learning and teaching, learners' and educators' familiarity with ChatGPT, and the main findings of each study. Finally, the data related to each selected study were extracted into an Excel spreadsheet for data processing. The Excel spreadsheet was reviewed by the authors, including a series of discussions to ensure the finalization of this process and prepare it for further analysis. Afterward, the final results were analyzed and presented in various types of charts and graphs. Table 4 presents the extracted data from the selected studies, with each study labeled with a capital 'S' followed by a number.

3 Results

This section consists of two main parts. The first part provides a descriptive analysis of the data compiled from the reviewed studies. The second part presents the answers to the research questions and the main findings of these studies.

3.1 Part 1: descriptive analysis

This section provides a descriptive analysis of the reviewed studies, including educational levels and fields, participant distribution, country contribution, research methodologies, study sample size, study population, publication year, list of journals, familiarity with ChatGPT, source of data, and the main aims and objectives of the studies. Table 4 presents a comprehensive overview of the extracted data from the selected studies.

3.1.1 The number of the reviewed studies and publication years

The total number of reviewed studies was 14. All were empirical studies published in different journals focusing on Education and Technology. One study was published in 2022 [S1], while the remaining were published in 2023 [S2]-[S14]. Table 3 illustrates the year of publication, the names of the journals, and the number of reviewed studies published in each journal.

3.1.2 Educational levels and fields

The majority of the reviewed studies, 11 studies, were conducted in higher education institutions [S1]-[S10] and [S13]. Two studies did not specify the educational level of the population [S12] and [S14], while one study focused on elementary education [S11]. However, the reviewed studies covered various fields of education. Three studies focused on Arts and Humanities Education [S8], [S11], and [S14], specifically English Education. Two studies focused on Engineering Education, with one in Computer Engineering [S2] and the other in Construction Education [S3]. Two studies focused on Mathematics Education [S5] and [S12]. One study focused on Social Science Education [S13]. One study focused on Early Education [S4]. One study focused on Journalism Education [S9]. Finally, three studies did not specify the field of education [S1], [S6], and [S7]. Figure  2 represents the educational levels in the reviewed studies, while Fig.  3 represents the context of the reviewed studies.

figure 2

Educational levels in the reviewed studies

figure 3

Context of the reviewed studies

3.1.3 Participant distribution and country contribution

The reviewed studies have been conducted across different geographic regions, providing a diverse representation of the studies. The majority of the studies, 10 in total, [S1]-[S3], [S5]-[S9], [S11], and [S14], primarily focused on participants from single countries such as Pakistan, the United Arab Emirates, China, Indonesia, Poland, Saudi Arabia, South Korea, Spain, Tajikistan, and the United States. In contrast, four studies, [S4], [S10], [S12], and [S13], involved participants from multiple countries, including China and the United States [S4]; China, the United Kingdom, and the United States [S10]; the United Arab Emirates, Oman, Saudi Arabia, and Jordan [S12]; and Turkey, Sweden, Canada, and Australia [S13]. Figures  4 and 5 illustrate the distribution of participants, whether from single or multiple countries, and the contribution of each country in the reviewed studies, respectively.

figure 4

The reviewed studies conducted in single or multiple countries

figure 5

The Contribution of each country in the studies

3.1.4 Study population and sample size

Four study populations were included: university students, university teachers, university teachers and students, and elementary school teachers. Six studies involved university students [S2], [S3], and [S5]-[S8]. Three studies focused on university teachers [S1], [S4], and [S6], while one study specifically targeted elementary school teachers [S11]. Additionally, four studies included both university teachers and students [S10] and [S12]-[S14], and among them, study [S13] specifically included postgraduate students. In terms of the sample size of the reviewed studies, nine studies included a small sample size of less than 50 participants [S1], [S3], [S6], [S8], and [S10]-[S13]. Three studies had 50–100 participants [S2], [S9], and [S14]. Only one study had more than 100 participants [S7]. It is worth mentioning that study [S4] adopted a mixed methods approach, including 10 participants for qualitative analysis and 110 participants for quantitative analysis.

3.1.5 Participants’ familiarity with using ChatGPT

The reviewed studies recruited a diverse range of participants with varying levels of familiarity with ChatGPT. Five studies [S2], [S4], [S6], [S8], and [S12] involved participants already familiar with ChatGPT, while eight studies [S1], [S3], [S5], [S7], [S9], [S10], [S13] and [S14] included individuals with differing levels of familiarity. Notably, one study [S11] had participants who were entirely unfamiliar with ChatGPT. It is important to note that four studies [S3], [S5], [S9], and [S11] provided training or guidance to their participants before conducting their studies, while ten studies [S1], [S2], [S4], [S6]-[S8], [S10], and [S12]-[S14] did not provide training due to the participants' existing familiarity with ChatGPT.

3.1.6 Research methodology approaches and source(s) of data

The reviewed studies adopted various research methodology approaches. Seven studies adopted qualitative research methodology [S1], [S4], [S6], [S8], [S10], [S11], and [S12], while three studies adopted quantitative research methodology [S3], [S7], and [S14], and four studies employed mixed-methods, which involved a combination of both the strengths of qualitative and quantitative methods [S2], [S5], [S9], and [S13].

In terms of the source(s) of data, the reviewed studies obtained their data from various sources, such as interviews, questionnaires, and pre-and post-tests. Six studies relied on interviews as their primary source of data collection [S1], [S4], [S6], [S10], [S11], and [S12], four studies relied on questionnaires [S2], [S7], [S13], and [S14], two studies combined the use of pre-and post-tests and questionnaires for data collection [S3] and [S9], while two studies combined the use of questionnaires and interviews to obtain the data [S5] and [S8]. It is important to note that six of the reviewed studies were quasi-experimental [S3], [S5], [S8], [S9], [S12], and [S14], while the remaining ones were experimental studies [S1], [S2], [S4], [S6], [S7], [S10], [S11], and [S13]. Figures  6 and 7 illustrate the research methodologies and the source(s) of data used in the reviewed studies, respectively.
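The methodology split reported above (7 qualitative, 3 quantitative, and 4 mixed-methods studies out of 14) can be tallied into the percentages typically shown in such charts. The counts come from the review; the percentage calculation is our illustration.

```python
# Methodology distribution across the 14 reviewed studies (Sect. 3.1.6).
methodologies = {"qualitative": 7, "quantitative": 3, "mixed-methods": 4}

total = sum(methodologies.values())
shares = {m: round(100 * n / total, 1) for m, n in methodologies.items()}

print(total)   # 14
print(shares)  # {'qualitative': 50.0, 'quantitative': 21.4, 'mixed-methods': 28.6}
```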

figure 6

Research methodologies in the reviewed studies

figure 7

Source of data in the reviewed studies

3.1.7 The aim and objectives of the studies

The reviewed studies encompassed a diverse set of aims, with several of them incorporating multiple primary objectives. Six studies [S3], [S6], [S7], [S8], [S11], and [S12] examined the integration of ChatGPT in educational contexts, and four studies [S4], [S5], [S13], and [S14] investigated the various implications of its use in education, while three studies [S2], [S9], and [S10] aimed to explore both its integration and implications in education. Additionally, seven studies explicitly explored attitudes and perceptions of students [S2] and [S3], educators [S1] and [S6], or both [S10], [S12], and [S13] regarding the utilization of ChatGPT in educational settings.

3.2 Part 2: research questions and main findings of the reviewed studies

This part presents the answers to the research questions and the main findings of the reviewed studies, classified into two main categories (learning and teaching) according to the AI Education classification by [ 36 ]. Figure  8 summarizes the main findings of the reviewed studies in a visually informative diagram. Table 4 provides a detailed list of the key information extracted from the selected studies that led to generating these themes.

figure 8

The main findings in the reviewed studies

4 Students' initial attempts at utilizing ChatGPT in learning and main findings from students' perspective

4.1 Virtual intelligent assistant

Nine studies demonstrated that ChatGPT has been utilized by students as an intelligent assistant to enhance and support their learning. Students employed it for various purposes, such as answering on-demand questions [S2]-[S5], [S8], [S10], and [S12], providing valuable information and learning resources [S2]-[S5], [S6], and [S8], as well as receiving immediate feedback [S2], [S4], [S9], [S10], and [S12]. In this regard, students were generally confident in the accuracy of ChatGPT's responses, considering them relevant, reliable, and detailed [S3], [S4], [S5], and [S8]. However, some students indicated the need for improvement, as they found that answers are not always accurate [S2], and that misleading information may have been provided or that it may not always align with their expectations [S6] and [S10]. It was also observed by the students that the accuracy of ChatGPT is dependent on several factors, including the quality and specificity of the user's input, the complexity of the question or topic, and the scope and relevance of its training data [S12]. Many students also felt that ChatGPT's answers were not always accurate, and most believed that using it effectively requires good background knowledge.

4.2 Writing and language proficiency assistant

Six of the reviewed studies highlighted that ChatGPT has been utilized by students as a valuable assistant tool to improve their academic writing skills and language proficiency. Among these studies, three mainly focused on English education, demonstrating that students showed sufficient mastery in using ChatGPT for generating ideas, summarizing, paraphrasing texts, and completing writing essays [S8], [S11], and [S14]. Furthermore, ChatGPT helped them in writing by making students active investigators rather than passive knowledge recipients and facilitated the development of their writing skills [S11] and [S14]. Similarly, ChatGPT allowed students to generate unique ideas and perspectives, leading to deeper analysis and reflection on their journalism writing [S9]. In terms of language proficiency, ChatGPT allowed participants to translate content into their home languages, making it more accessible and relevant to their context [S4]. It also enabled them to request changes in linguistic tones or flavors [S8]. Moreover, participants used it to check grammar or as a dictionary [S11].

4.3 Valuable resource for learning approaches

Five studies demonstrated that students used ChatGPT as a valuable complementary resource for self-directed learning. It provided learning resources and guidance on diverse educational topics and created a supportive home learning environment [S2] and [S4]. Moreover, it offered step-by-step guidance to grasp concepts at their own pace and enhance their understanding [S5], streamlined task and project completion carried out independently [S7], provided comprehensive and easy-to-understand explanations on various subjects [S10], and assisted in studying geometry operations, thereby empowering them to explore geometry operations at their own pace [S12]. Three studies showed that students used ChatGPT as a valuable learning resource for personalized learning. It delivered age-appropriate conversations and tailored teaching based on a child's interests [S4], acted as a personalized learning assistant, adapted to their needs and pace, which assisted them in understanding mathematical concepts [S12], and enabled personalized learning experiences in social sciences by adapting to students' needs and learning styles [S13]. On the other hand, it is important to note that, according to one study [S5], students suggested that using ChatGPT may negatively affect collaborative learning competencies between students.

4.4 Enhancing students' competencies

Six of the reviewed studies have shown that ChatGPT is a valuable tool for improving a wide range of skills among students. Two studies provided evidence that ChatGPT improved students' critical thinking, reasoning skills, and hazard recognition competencies by engaging them in interactive conversations or activities and providing responses related to their disciplines in journalism [S5] and construction education [S9]. Furthermore, two studies focused on mathematics education showed the positive impact of ChatGPT on students' problem-solving abilities, both in unraveling problem-solving questions [S12] and in enhancing students' understanding of the problem-solving process [S5]. Lastly, one study indicated that ChatGPT effectively contributed to the enhancement of conversational social skills [S4].

4.5 Supporting students' academic success

Seven of the reviewed studies highlighted that students found ChatGPT beneficial for learning, as it enhanced learning efficiency and improved the learning experience. It improved students' efficiency in computer engineering studies by providing well-structured responses and good explanations [S2]. Students also found it extremely useful for hazard reporting [S3], and it enhanced their efficiency and capabilities in solving mathematics problems [S5] and [S12]. Furthermore, by finding information, generating ideas, translating texts, and providing alternative questions, ChatGPT helped students deepen their understanding of various subjects [S6]. It contributed to an increase in students' overall productivity [S7] and improved their efficiency in composing written tasks [S8]. Regarding learning experiences, ChatGPT was instrumental in helping students identify hazards that they might otherwise have overlooked [S3]. It also improved students' learning experiences in solving mathematics problems and developing the related abilities [S5] and [S12]. Moreover, it increased students' successful completion of important tasks in their studies [S7], particularly writing tasks of average difficulty [S8]. Additionally, ChatGPT increased the chances of educational success by providing students with baseline knowledge on various topics [S10].

5 Teachers' initial attempts at utilizing ChatGPT in teaching and main findings from teachers' perspective

5.1 Valuable resource for teaching

The reviewed studies showed that teachers have employed ChatGPT to recommend, modify, and generate diverse, creative, organized, and engaging educational content, teaching materials, and testing resources more rapidly [S4], [S6], [S10], and [S11]. Additionally, teachers experienced increased productivity, as ChatGPT facilitated quick and accurate responses to questions, fact-checking, and information searches [S1]. It also proved valuable in constructing new knowledge [S6] and providing timely answers to students' questions in classrooms [S11]. Moreover, ChatGPT enhanced teachers' efficiency by generating new ideas for activities and preplanning activities for their students [S4] and [S6], including serving as an interactive language game partner [S11].

5.2 Improving productivity and efficiency

The reviewed studies showed that participants' productivity and work efficiency were significantly enhanced by using ChatGPT, as it enabled them to allocate more time to other tasks and reduced their overall workloads [S6], [S10], [S11], [S13], and [S14]. However, three studies [S1], [S4], and [S11] indicated a negative perception and attitude among teachers toward using ChatGPT. This negativity stemmed from a lack of the skills necessary to use it effectively [S1], limited familiarity with it [S4], and occasional inaccuracies in the content it provided [S10].

5.3 Catalyzing new teaching methodologies

Five of the reviewed studies highlighted that educators recognized the necessity of redefining their teaching profession with the assistance of ChatGPT [S11], developing new, effective learning strategies [S4], and adapting teaching strategies and methodologies to ensure the development of essential skills for future engineers [S5]. They also emphasized the importance of adopting new educational philosophies and approaches that can evolve with the introduction of ChatGPT into the classroom [S12]. Furthermore, updating curricula to focus on strengthening distinctly human capacities, such as emotional intelligence, creativity, and philosophical perspectives [S13], was found to be essential.

5.4 Effective utilization of ChatGPT in teaching

According to the reviewed studies, effective utilization of ChatGPT in education requires providing teachers with well-structured training, support, and an adequate background on how to use it responsibly [S1], [S3], [S11], and [S12]. Establishing clear rules and regulations regarding its usage is essential to ensure that it positively impacts teaching and learning processes, including students' skills [S1], [S4], [S5], [S8], [S9], and [S11]-[S14]. Moreover, conducting further research and engaging in discussions with policymakers and stakeholders is crucial for the successful integration of ChatGPT in education and for maximizing the benefits for both educators and students [S1], [S6]-[S10], and [S12]-[S14].

6 Discussion

The purpose of this review was to systematically examine empirical studies that have explored the utilization of ChatGPT, one of today's most advanced LLM-based chatbots, in education. The findings of the reviewed studies revealed several ways in which ChatGPT has been utilized in different learning and teaching practices, and they provided insights and considerations that can facilitate its effective and responsible use in future educational contexts. The reviewed studies came from diverse fields of education, which helped us avoid a review biased toward a specific field. Similarly, they were conducted across different geographic regions, and this variety in geographic representation enriched the findings of this review.

In response to RQ1, "What are students' and teachers' initial attempts at utilizing ChatGPT in education?", the findings from this review provide comprehensive insights. Chatbots, including ChatGPT, play a crucial role in supporting student learning, enhancing their learning experiences, and facilitating diverse learning approaches [42, 43]. This review found that ChatGPT has been instrumental in enhancing students' learning experiences by serving as a virtual intelligent assistant, providing immediate feedback and on-demand answers, and engaging in educational conversations. Additionally, students have benefited from ChatGPT's ability to generate ideas and compose essays, and to perform tasks such as summarizing, translating, paraphrasing, and checking grammar, thereby enhancing their writing and language competencies. Furthermore, students have turned to ChatGPT for assistance in understanding concepts and homework, for structured learning plans, and for clarification of assignments and tasks. This fosters a supportive home learning environment, allowing students to take responsibility for their own learning and to cultivate the skills and approaches essential for independent learning [26, 27, 28]. This finding aligns with the studies of Saqr et al. [68, 69], which highlighted that when students actively engage in their own learning process, it yields additional advantages, such as heightened motivation, enhanced achievement, and the cultivation of enthusiasm, turning them into advocates for their own learning.

Moreover, students have utilized ChatGPT for tailored teaching and step-by-step guidance on diverse educational topics, for streamlining task and project completion, and for generating and recommending educational content. This personalization enhances the learning environment, leading to increased academic success. The finding aligns with other recent studies [26, 27, 28, 60, 66], which revealed that ChatGPT has the potential to offer personalized learning experiences and support an effective learning process by providing students with customized feedback and explanations tailored to their needs and abilities, ultimately fostering students' performance, engagement, and motivation and thereby increasing their academic success [14, 44, 58]. This outcome is in line with the findings of Saqr et al. [68, 69], which emphasized that learning strategies are important catalysts of students' learning: students who utilize effective learning strategies are more likely to achieve better academic results.

Teachers, too, have capitalized on ChatGPT's capabilities to enhance productivity and efficiency, using it for creating lesson plans, generating quizzes, providing additional resources, generating and preplanning new ideas for activities, and aiding in answering students' questions. This adoption of technology introduces new opportunities to support teaching and learning practices, enhancing teacher productivity. This finding aligns with those of Day [17], De Castro [18], and Su and Yang [74], as well as with those of Valtonen et al. [82], who revealed that emerging technological advancements have opened up novel opportunities and means to support teaching and learning practices and enhance teachers' productivity.

In response to RQ2, "What are the main findings derived from empirical studies that have incorporated ChatGPT into learning and teaching?", the findings from this review provide profound insights and raise significant concerns. Starting with the insights, chatbots, including ChatGPT, have demonstrated the potential to reshape and revolutionize education, creating novel opportunities for enhancing the learning process and outcomes [83], facilitating different learning approaches, and offering a range of pedagogical benefits [19, 43, 72]. In this context, this review found that ChatGPT could open avenues for educators to adopt or develop new, effective learning and teaching strategies that can evolve with its introduction into the classroom. Nonetheless, there is an evident lack of research on the potential impact of generative machine learning models within diverse educational settings [83]. This necessitates that teachers attain a high level of proficiency in incorporating chatbots, such as ChatGPT, into their classrooms in order to create inventive, well-structured, and captivating learning strategies. In the same vein, the review also found that teachers who lacked the requisite skills to utilize ChatGPT found that it did not contribute positively to their work and could potentially have adverse effects [37]. This concern could lead to inequity of access to the benefits of chatbots, including ChatGPT, as individuals who lack the necessary expertise may not be able to harness their full potential, resulting in disparities in educational outcomes and opportunities. Therefore, immediate action is needed to address these potential issues.
A potential solution is offering training, support, and competency development for teachers to ensure that all of them can leverage chatbots, including ChatGPT, effectively and equitably in their educational practices [ 5 , 28 , 80 ], which could enhance accessibility and inclusivity, and potentially result in innovative outcomes [ 82 , 83 ].

Additionally, chatbots, including ChatGPT, have the potential to significantly impact students' thinking abilities, including retention, reasoning, and analysis skills [19, 45], and to foster innovation and creativity [83]. This review found that ChatGPT could contribute to improving a wide range of skills among students. However, it also found that frequent use of ChatGPT may result in decreased innovative capacities, collaborative skills, and cognitive capacities, reduce students' motivation to attend classes, and lead to diminished higher-order thinking skills [22, 29]. Therefore, immediate action is needed to carefully examine the long-term impact of chatbots such as ChatGPT on learning outcomes, and to explore how they can be incorporated into educational settings as supportive tools without compromising students' cognitive development and critical thinking abilities. In the same vein, the review found it challenging to draw a consistent conclusion regarding the potential of ChatGPT to aid self-directed learning. This finding aligns with the recent study of Baskara [8]. Therefore, further research is needed to explore the potential of ChatGPT for self-directed learning. One potential solution involves utilizing learning analytics as a novel approach to examine various aspects of students' learning and support them in their individual endeavors [32]. This approach can bridge the gap by facilitating an in-depth analysis of how learners engage with ChatGPT, identifying trends in self-directed learning behavior, and assessing its influence on learning outcomes.

Turning to the significant concerns, a fundamental challenge with LLM-based chatbots, including ChatGPT, is the accuracy and quality of the information and responses they provide, as they can present false information as truth, a phenomenon often referred to as "hallucination" [3, 49]. In this context, this review found that the information provided was not entirely satisfactory. Consequently, the utilization of chatbots presents potential concerns, such as generating inaccurate or misleading information, especially for students who rely on it to support their learning. This finding aligns with other findings [6, 30, 35, 40], which revealed that incorporating chatbots such as ChatGPT into education presents challenges related to accuracy and reliability, both because the models are trained on a large corpus of data that may contain inaccuracies and because of the way users formulate their prompts. Therefore, immediate action is needed to address these potential issues. One possible solution is to equip students with the necessary skills and competencies, including a background understanding of how to use these tools effectively and the ability to assess and evaluate the information they generate, as the accuracy and quality of the output depend on the input, its complexity, the topic, and the relevance of the training data [28, 49, 86]. However, it is also essential to examine how learners can be educated about how these models operate, the data used in their training, and how to recognize their limitations, challenges, and issues [79].

Furthermore, chatbots present a substantial challenge to maintaining academic integrity [20, 56] and avoiding copyright violations [83], both significant concerns in education. The review found that the potential misuse of ChatGPT might foster cheating, facilitate plagiarism, and threaten academic integrity. This issue is affirmed by the research of Basic et al. [7], who presented evidence that students who utilized ChatGPT in their writing assignments had more plagiarism cases than those who did not. These findings align with the conclusions drawn by Cotton et al. [13], Hisan and Amri [33], and Sullivan et al. [75], who revealed that the integration of chatbots such as ChatGPT into education poses a significant challenge to the preservation of academic integrity. Moreover, chatbots, including ChatGPT, have increased the difficulty of identifying plagiarism [47, 67, 76]. Findings from previous studies [1, 84] indicate that AI-generated text often went undetected by plagiarism software such as Turnitin. Turnitin and similar detection tools, such as ZeroGPT, GPTZero, and Copyleaks, have since evolved, incorporating enhanced techniques to detect AI-generated text. Nevertheless, several studies have found that these tools remain prone to false positives and are not yet fully ready to identify AI-generated text accurately and reliably [10, 51], and novel detection methods may need to be created and implemented [4]. This issue leads to a further concern: the difficulty of accurately evaluating student performance when students use chatbots such as ChatGPT to assist with their assignments. Consequently, most LLM-driven chatbots present a substantial challenge to traditional assessment [64].
Findings from previous studies indicate the importance of rethinking, improving, and redesigning innovative assessment methods in the era of chatbots [14, 20, 64, 75]. These methods should prioritize evaluating students' ability to apply knowledge to complex cases and demonstrate comprehension, rather than focusing solely on the final product. Therefore, immediate action is needed to address these potential issues. One possible solution is the development of clear guidelines, regulatory policies, and pedagogical guidance. These measures would help regulate the proper and ethical utilization of chatbots, such as ChatGPT, and must be established before their introduction to students [35, 38, 39, 41, 89].

In summary, our review has delved into the utilization of ChatGPT, a prominent example of chatbots, in education, addressing the question of how ChatGPT has been utilized in education. However, there remain significant gaps, which necessitate further research to shed light on this area.

7 Conclusions

This systematic review has shed light on the varied initial attempts at incorporating ChatGPT into education by both learners and educators, while also offering insights and considerations that can facilitate its effective and responsible use in future educational contexts. The analysis of the 14 selected studies revealed the dual-edged impact of ChatGPT in educational settings. On the positive side, ChatGPT significantly aided the learning process in various ways. Learners have used it as a virtual intelligent assistant, benefiting from its ability to provide immediate feedback, on-demand answers, and easy access to educational resources. Additionally, it was clear that learners have used it to enhance their writing and language skills, engaging in practices such as generating ideas, composing essays, and performing tasks like summarizing, translating, paraphrasing, and checking grammar. Importantly, other learners have utilized it to support and facilitate their self-directed and personalized learning on a broad range of educational topics, assisting in understanding concepts and homework, providing structured learning plans, and clarifying assignments and tasks. Educators, on the other hand, found ChatGPT beneficial for enhancing productivity and efficiency. They used it for creating lesson plans, generating quizzes, providing additional resources, and answering learners' questions, which saved time and allowed for more dynamic and engaging teaching strategies and methodologies.

However, the review also pointed out negative impacts. The results revealed that overuse of ChatGPT could decrease innovative capacities and collaborative learning among learners. Specifically, relying too much on ChatGPT for quick answers can inhibit learners' critical thinking and problem-solving skills. Learners might not engage deeply with the material or consider multiple solutions to a problem. This tendency was particularly evident in group projects, where learners preferred consulting ChatGPT individually for solutions over brainstorming and collaborating with peers, which negatively affected their teamwork abilities. On a broader level, integrating ChatGPT into education has also raised several concerns, including the potential for providing inaccurate or misleading information, issues of inequity in access, challenges related to academic integrity, and the possibility of misusing the technology.

Accordingly, this review emphasizes the urgency of developing clear rules, policies, and regulations to ensure ChatGPT's effective and responsible use in educational settings, alongside other chatbots, by both learners and educators. This requires providing well-structured training to educate them on responsible usage and understanding its limitations, along with offering sufficient background information. Moreover, it highlights the importance of rethinking, improving, and redesigning innovative teaching and assessment methods in the era of ChatGPT. Furthermore, conducting further research and engaging in discussions with policymakers and stakeholders are essential steps to maximize the benefits for both educators and learners and ensure academic integrity.

It is important to acknowledge that this review has certain limitations. Firstly, the limited number of reviewed studies can be attributed to several factors, including the novelty of the technology, as new technologies often face initial skepticism and cautious adoption; the lack of clear guidelines or best practices for leveraging this technology for educational purposes; and institutional or governmental policies affecting its utilization in educational contexts. These factors, in turn, reduced the number of studies available for review. Secondly, the reviewed studies used the original version of ChatGPT, based on GPT-3 or GPT-3.5, which implies that new studies utilizing the updated version, GPT-4, may lead to different findings. Therefore, conducting follow-up systematic reviews is essential once more empirical studies on ChatGPT are published. Additionally, long-term studies are necessary to thoroughly examine and assess the impact of ChatGPT on various educational practices.

Despite these limitations, this systematic review has highlighted the transformative potential of ChatGPT in education, revealing its diverse utilization by learners and educators alike, summarizing the benefits of incorporating it into education, and outlining the critical concerns and challenges that must be addressed to facilitate its effective and responsible use in future educational contexts. This review can serve as an insightful resource for practitioners who seek to integrate ChatGPT into education and may stimulate further research in the field.

Data availability

The data supporting our findings are available upon request.

Abbreviations

  • AI: Artificial intelligence
  • AIEd: AI in education
  • LLM: Large language model
  • ANN: Artificial neural networks
  • ChatGPT: Chat Generative Pre-Trained Transformer
  • RNN: Recurrent neural networks
  • LSTM: Long short-term memory
  • RLHF: Reinforcement learning from human feedback
  • NLP: Natural language processing
  • PRISMA: Preferred Reporting Items for Systematic Reviews and Meta-Analyses

AlAfnan MA, Dishari S, Jovic M, Lomidze K. ChatGPT as an educational tool: opportunities, challenges, and recommendations for communication, business writing, and composition courses. J Artif Intell Technol. 2023. https://doi.org/10.37965/jait.2023.0184 .


Ali JKM, Shamsan MAA, Hezam TA, Mohammed AAQ. Impact of ChatGPT on learning motivation. J Engl Stud Arabia Felix. 2023;2(1):41–9. https://doi.org/10.56540/jesaf.v2i1.51 .

Alkaissi H, McFarlane SI. Artificial hallucinations in ChatGPT: implications in scientific writing. Cureus. 2023. https://doi.org/10.7759/cureus.35179 .

Anderson N, Belavý DL, Perle SM, Hendricks S, Hespanhol L, Verhagen E, Memon AR. AI did not write this manuscript, or did it? Can we trick the AI text detector into generated texts? The potential future of ChatGPT and AI in sports & exercise medicine manuscript generation. BMJ Open Sport Exerc Med. 2023;9(1): e001568. https://doi.org/10.1136/bmjsem-2023-001568 .

Ausat AMA, Massang B, Efendi M, Nofirman N, Riady Y. Can chat GPT replace the role of the teacher in the classroom: a fundamental analysis. J Educ. 2023;5(4):16100–6.


Baidoo-Anu D, Ansah L. Education in the Era of generative artificial intelligence (AI): understanding the potential benefits of ChatGPT in promoting teaching and learning. Soc Sci Res Netw. 2023. https://doi.org/10.2139/ssrn.4337484 .

Basic Z, Banovac A, Kruzic I, Jerkovic I. Better by you, better than me, chatgpt3 as writing assistance in students essays. arXiv preprint arXiv:2302.04536. 2023.

Baskara FR. The promises and pitfalls of using chat GPT for self-determined learning in higher education: an argumentative review. Prosiding Seminar Nasional Fakultas Tarbiyah dan Ilmu Keguruan IAIM Sinjai. 2023;2:95–101. https://doi.org/10.47435/sentikjar.v2i0.1825 .

Behera RK, Bala PK, Dhir A. The emerging role of cognitive computing in healthcare: a systematic literature review. Int J Med Inform. 2019;129:154–66. https://doi.org/10.1016/j.ijmedinf.2019.04.024 .

Chaka C. Detecting AI content in responses generated by ChatGPT, YouChat, and Chatsonic: the case of five AI content detection tools. J Appl Learn Teach. 2023. https://doi.org/10.37074/jalt.2023.6.2.12 .

Chiu TKF, Xia Q, Zhou X, Chai CS, Cheng M. Systematic literature review on opportunities, challenges, and future research recommendations of artificial intelligence in education. Comput Educ Artif Intell. 2023;4:100118. https://doi.org/10.1016/j.caeai.2022.100118 .

Choi EPH, Lee JJ, Ho M, Kwok JYY, Lok KYW. Chatting or cheating? The impacts of ChatGPT and other artificial intelligence language models on nurse education. Nurse Educ Today. 2023;125:105796. https://doi.org/10.1016/j.nedt.2023.105796 .

Cotton D, Cotton PA, Shipway JR. Chatting and cheating: ensuring academic integrity in the era of ChatGPT. Innov Educ Teach Int. 2023. https://doi.org/10.1080/14703297.2023.2190148 .

Crawford J, Cowling M, Allen K. Leadership is needed for ethical ChatGPT: Character, assessment, and learning using artificial intelligence (AI). J Univ Teach Learn Pract. 2023. https://doi.org/10.53761/1.20.3.02 .

Creswell JW. Educational research: planning, conducting, and evaluating quantitative and qualitative research [Ebook]. 4th ed. London: Pearson Education; 2015.

Curry D. ChatGPT Revenue and Usage Statistics (2023)—Business of Apps. 2023. https://www.businessofapps.com/data/chatgpt-statistics/

Day T. A preliminary investigation of fake peer-reviewed citations and references generated by ChatGPT. Prof Geogr. 2023. https://doi.org/10.1080/00330124.2023.2190373 .

De Castro CA. A Discussion about the Impact of ChatGPT in education: benefits and concerns. J Bus Theor Pract. 2023;11(2):p28. https://doi.org/10.22158/jbtp.v11n2p28 .

Deng X, Yu Z. A meta-analysis and systematic review of the effect of Chatbot technology use in sustainable education. Sustainability. 2023;15(4):2940. https://doi.org/10.3390/su15042940 .

Eke DO. ChatGPT and the rise of generative AI: threat to academic integrity? J Responsib Technol. 2023;13:100060. https://doi.org/10.1016/j.jrt.2023.100060 .

Elmoazen R, Saqr M, Tedre M, Hirsto L. A systematic literature review of empirical research on epistemic network analysis in education. IEEE Access. 2022;10:17330–48. https://doi.org/10.1109/access.2022.3149812 .

Farrokhnia M, Banihashem SK, Noroozi O, Wals AEJ. A SWOT analysis of ChatGPT: implications for educational practice and research. Innov Educ Teach Int. 2023. https://doi.org/10.1080/14703297.2023.2195846 .

Fergus S, Botha M, Ostovar M. Evaluating academic answers generated using ChatGPT. J Chem Educ. 2023;100(4):1672–5. https://doi.org/10.1021/acs.jchemed.3c00087 .

Fink A. Conducting research literature reviews: from the Internet to Paper. Incorporated: SAGE Publications; 2010.

Firaina R, Sulisworo D. Exploring the usage of ChatGPT in higher education: frequency and impact on productivity. Buletin Edukasi Indonesia (BEI). 2023;2(01):39–46. https://doi.org/10.56741/bei.v2i01.310 .

Firat M. How chat GPT can transform autodidactic experiences and open education. Department of Distance Education, Open Education Faculty, Anadolu Unive. 2023. https://orcid.org/0000-0001-8707-5918

Firat M. What ChatGPT means for universities: perceptions of scholars and students. J Appl Learn Teach. 2023. https://doi.org/10.37074/jalt.2023.6.1.22 .

Fuchs K. Exploring the opportunities and challenges of NLP models in higher education: is Chat GPT a blessing or a curse? Front Educ. 2023. https://doi.org/10.3389/feduc.2023.1166682 .

García-Peñalvo FJ. La percepción de la inteligencia artificial en contextos educativos tras el lanzamiento de ChatGPT: disrupción o pánico. Educ Knowl Soc. 2023;24: e31279. https://doi.org/10.14201/eks.31279 .

Gilson A, Safranek CW, Huang T, Socrates V, Chi L, Taylor A, Chartash D. How does ChatGPT perform on the United States medical Licensing examination? The implications of large language models for medical education and knowledge assessment. JMIR Med Educ. 2023;9: e45312. https://doi.org/10.2196/45312 .

Hashana AJ, Brundha P, Ayoobkhan MUA, Fazila S. Deep learning in ChatGPT—a survey. In: 2023 7th international conference on trends in electronics and informatics (ICOEI). IEEE; 2023. p. 1001–5. https://doi.org/10.1109/icoei56765.2023.10125852

Hirsto L, Saqr M, López-Pernas S, Valtonen T. A systematic narrative review of learning analytics research in K-12 and schools. Proceedings. 2022. https://ceur-ws.org/Vol-3383/FLAIEC22_paper_9536.pdf

Hisan UK, Amri MM. ChatGPT and medical education: a double-edged sword. J Pedag Educ Sci. 2023;2(01):71–89. https://doi.org/10.13140/RG.2.2.31280.23043/1 .

Hopkins AM, Logan JM, Kichenadasse G, Sorich MJ. Artificial intelligence chatbots will revolutionize how cancer patients access information: ChatGPT represents a paradigm-shift. JNCI Cancer Spectr. 2023. https://doi.org/10.1093/jncics/pkad010 .

Househ M, AlSaad R, Alhuwail D, Ahmed A, Healy MG, Latifi S, Sheikh J. Large Language models in medical education: opportunities, challenges, and future directions. JMIR Med Educ. 2023;9: e48291. https://doi.org/10.2196/48291 .

Ilkka T. The impact of artificial intelligence on learning, teaching, and education. Minist de Educ. 2018. https://doi.org/10.2760/12297 .

Iqbal N, Ahmed H, Azhar KA. Exploring teachers’ attitudes towards using CHATGPT. Globa J Manag Adm Sci. 2022;3(4):97–111. https://doi.org/10.46568/gjmas.v3i4.163 .

Irfan M, Murray L, Ali S. Integration of Artificial intelligence in academia: a case study of critical teaching and learning in Higher education. Globa Soc Sci Rev. 2023;8(1):352–64. https://doi.org/10.31703/gssr.2023(viii-i).32 .

Jeon JH, Lee S. Large language models in education: a focus on the complementary relationship between human teachers and ChatGPT. Educ Inf Technol. 2023. https://doi.org/10.1007/s10639-023-11834-1 .

Khan RA, Jawaid M, Khan AR, Sajjad M. ChatGPT—Reshaping medical education and clinical management. Pak J Med Sci. 2023. https://doi.org/10.12669/pjms.39.2.7653 .

King MR. A conversation on artificial intelligence, Chatbots, and plagiarism in higher education. Cell Mol Bioeng. 2023;16(1):1–2. https://doi.org/10.1007/s12195-022-00754-8 .

Kooli C. Chatbots in education and research: a critical examination of ethical implications and solutions. Sustainability. 2023;15(7):5614. https://doi.org/10.3390/su15075614 .

Kuhail MA, Alturki N, Alramlawi S, Alhejori K. Interacting with educational chatbots: a systematic review. Educ Inf Technol. 2022;28(1):973–1018. https://doi.org/10.1007/s10639-022-11177-3 .

Lee H. The rise of ChatGPT: exploring its potential in medical education. Anat Sci Educ. 2023. https://doi.org/10.1002/ase.2270 .

Li L, Subbareddy R, Raghavendra CG. AI intelligence Chatbot to improve students learning in the higher education platform. J Interconnect Netw. 2022. https://doi.org/10.1142/s0219265921430325 .

Limna P. A Review of Artificial Intelligence (AI) in Education during the Digital Era. 2022. https://ssrn.com/abstract=4160798

Lo CK. What is the impact of ChatGPT on education? A rapid review of the literature. Educ Sci. 2023;13(4):410. https://doi.org/10.3390/educsci13040410 .

Luo W, He H, Liu J, Berson IR, Berson MJ, Zhou Y, Li H. Aladdin’s genie or Pandora’s box for early childhood education? Experts chat on the roles, challenges, and developments of ChatGPT. Early Educ Dev. 2023. https://doi.org/10.1080/10409289.2023.2214181 .

Meyer JG, Urbanowicz RJ, Martin P, O’Connor K, Li R, Peng P, Moore JH. ChatGPT and large language models in academia: opportunities and challenges. Biodata Min. 2023. https://doi.org/10.1186/s13040-023-00339-9 .

Mhlanga D. Open AI in education, the responsible and ethical use of ChatGPT towards lifelong learning. Soc Sci Res Netw. 2023. https://doi.org/10.2139/ssrn.4354422 .

Neumann M, Rauschenberger M, Schön EM. “We need to talk about ChatGPT”: the future of AI and higher education. 2023. https://doi.org/10.1109/seeng59157.2023.00010 .

Nolan B. Here are the schools and colleges that have banned the use of ChatGPT over plagiarism and misinformation fears. Business Insider. 2023. https://www.businessinsider.com .

O’Leary DE. An analysis of three chatbots: BlenderBot, ChatGPT and LaMDA. Int J Intell Syst Account, Financ Manag. 2023;30(1):41–54. https://doi.org/10.1002/isaf.1531 .

Okoli C. A guide to conducting a standalone systematic literature review. Commun Assoc Inf Syst. 2015. https://doi.org/10.17705/1cais.03743 .

OpenAI. ChatGPT. 2023. https://openai.com/blog/chatgpt .

Perkins M. Academic integrity considerations of AI large language models in the post-pandemic era: ChatGPT and beyond. J Univ Teach Learn Pract. 2023. https://doi.org/10.53761/1.20.02.07 .

Plevris V, Papazafeiropoulos G, Rios AJ. Chatbots put to the test in math and logic problems: a preliminary comparison and assessment of ChatGPT-3.5, ChatGPT-4, and Google Bard. arXiv. 2023. https://doi.org/10.48550/arxiv.2305.18618 .

Rahman MM, Watanobe Y. ChatGPT for education and research: opportunities, threats, and strategies. Appl Sci. 2023;13(9):5783. https://doi.org/10.3390/app13095783 .

Ram B, Verma P. Artificial intelligence AI-based Chatbot study of ChatGPT, google AI bard and baidu AI. World J Adv Eng Technol Sci. 2023;8(1):258–61. https://doi.org/10.30574/wjaets.2023.8.1.0045 .

Rasul T, Nair S, Kalendra D, Robin M, de Oliveira Santini F, Ladeira WJ, Heathcote L. The role of ChatGPT in higher education: benefits, challenges, and future research directions. J Appl Learn Teach. 2023. https://doi.org/10.37074/jalt.2023.6.1.29 .

Ratnam M, Sharm B, Tomer A. ChatGPT: educational artificial intelligence. Int J Adv Trends Comput Sci Eng. 2023;12(2):84–91. https://doi.org/10.30534/ijatcse/2023/091222023 .

Ray PP. ChatGPT: a comprehensive review on background, applications, key challenges, bias, ethics, limitations and future scope. Internet Things Cyber-Phys Syst. 2023;3:121–54. https://doi.org/10.1016/j.iotcps.2023.04.003 .

Roumeliotis KI, Tselikas ND. ChatGPT and Open-AI models: a preliminary review. Future Internet. 2023;15(6):192. https://doi.org/10.3390/fi15060192 .

Rudolph J, Tan S, Tan S. War of the chatbots: Bard, Bing Chat, ChatGPT, Ernie and beyond. The new AI gold rush and its impact on higher education. J Appl Learn Teach. 2023. https://doi.org/10.37074/jalt.2023.6.1.23 .

Ruiz LMS, Moll-López S, Nuñez-Pérez A, Moraño J, Vega-Fleitas E. ChatGPT challenges blended learning methodologies in engineering education: a case study in mathematics. Appl Sci. 2023;13(10):6039. https://doi.org/10.3390/app13106039 .

Sallam M, Salim NA, Barakat M, Al-Tammemi AB. ChatGPT applications in medical, dental, pharmacy, and public health education: a descriptive study highlighting the advantages and limitations. Narra J. 2023;3(1):e103. https://doi.org/10.52225/narra.v3i1.103 .

Salvagno M, Taccone FS, Gerli AG. Can artificial intelligence help for scientific writing? Crit Care. 2023. https://doi.org/10.1186/s13054-023-04380-2 .

Saqr M, López-Pernas S, Helske S, Hrastinski S. The longitudinal association between engagement and achievement varies by time, students’ profiles, and achievement state: a full program study. Comput Educ. 2023;199:104787. https://doi.org/10.1016/j.compedu.2023.104787 .

Saqr M, Matcha W, Uzir N, Jovanović J, Gašević D, López-Pernas S. Transferring effective learning strategies across learning contexts matters: a study in problem-based learning. Australas J Educ Technol. 2023;39(3):9.

Schöbel S, Schmitt A, Benner D, Saqr M, Janson A, Leimeister JM. Charting the evolution and future of conversational agents: a research agenda along five waves and new frontiers. Inf Syst Front. 2023. https://doi.org/10.1007/s10796-023-10375-9 .

Shoufan A. Exploring students’ perceptions of ChatGPT: thematic analysis and follow-up survey. IEEE Access. 2023. https://doi.org/10.1109/access.2023.3268224 .

Sonderegger S, Seufert S. Chatbot-mediated learning: conceptual framework for the design of chatbot use cases in education. St. Gallen: Institute for Educational Management and Technologies, University of St. Gallen; 2022. https://doi.org/10.5220/0010999200003182 .


Strzelecki A. To use or not to use ChatGPT in higher education? A study of students’ acceptance and use of technology. Interact Learn Environ. 2023. https://doi.org/10.1080/10494820.2023.2209881 .

Su J, Yang W. Unlocking the power of ChatGPT: a framework for applying generative AI in education. ECNU Rev Educ. 2023. https://doi.org/10.1177/20965311231168423 .

Sullivan M, Kelly A, McLaughlan P. ChatGPT in higher education: considerations for academic integrity and student learning. J Appl Learn Teach. 2023;6(1):1–10. https://doi.org/10.37074/jalt.2023.6.1.17 .

Szabo A. ChatGPT is a breakthrough in science and education but fails a test in sports and exercise psychology. Balt J Sport Health Sci. 2023;1(128):25–40. https://doi.org/10.33607/bjshs.v127i4.1233 .

Taecharungroj V. “What can ChatGPT do?” analyzing early reactions to the innovative AI chatbot on Twitter. Big Data Cognit Comput. 2023;7(1):35. https://doi.org/10.3390/bdcc7010035 .

Tam S, Said RB. User preferences for ChatGPT-powered conversational interfaces versus traditional methods. Biomed Eng Soc. 2023. https://doi.org/10.58496/mjcsc/2023/004 .

Tedre M, Kahila J, Vartiainen H. Exploration on how co-designing with AI facilitates critical evaluation of ethics of AI in craft education. In: Langran E, Christensen P, Sanson J, editors. Proceedings of Society for Information Technology and Teacher Education International Conference. 2023. pp. 2289–96.

Tlili A, Shehata B, Adarkwah MA, Bozkurt A, Hickey DT, Huang R, Agyemang B. What if the devil is my guardian angel: ChatGPT as a case study of using chatbots in education. Smart Learn Environ. 2023. https://doi.org/10.1186/s40561-023-00237-x .

Uddin SMJ, Albert A, Ovid A, Alsharef A. Leveraging ChatGPT to aid construction hazard recognition and support safety education and training. Sustainability. 2023;15(9):7121. https://doi.org/10.3390/su15097121 .

Valtonen T, López-Pernas S, Saqr M, Vartiainen H, Sointu E, Tedre M. The nature and building blocks of educational technology research. Comput Hum Behav. 2022;128:107123. https://doi.org/10.1016/j.chb.2021.107123 .

Vartiainen H, Tedre M. Using artificial intelligence in craft education: crafting with text-to-image generative models. Digit Creat. 2023;34(1):1–21. https://doi.org/10.1080/14626268.2023.2174557 .

Ventayen RJM. OpenAI ChatGPT generated results: similarity index of artificial intelligence-based contents. Soc Sci Res Netw. 2023. https://doi.org/10.2139/ssrn.4332664 .

Wagner MW, Ertl-Wagner BB. Accuracy of information and references using ChatGPT-3 for retrieval of clinical radiological information. Can Assoc Radiol J. 2023. https://doi.org/10.1177/08465371231171125 .

Wardat Y, Tashtoush MA, AlAli R, Jarrah AM. ChatGPT: a revolutionary tool for teaching and learning mathematics. Eurasia J Math, Sci Technol Educ. 2023;19(7):em2286. https://doi.org/10.29333/ejmste/13272 .

Webster J, Watson RT. Analyzing the past to prepare for the future: writing a literature review. Manag Inf Syst Quart. 2002;26(2):3.

Xiao Y, Watson ME. Guidance on conducting a systematic literature review. J Plan Educ Res. 2017;39(1):93–112. https://doi.org/10.1177/0739456x17723971 .

Yan D. Impact of ChatGPT on learners in a L2 writing practicum: an exploratory investigation. Educ Inf Technol. 2023. https://doi.org/10.1007/s10639-023-11742-4 .

Yu H. Reflection on whether Chat GPT should be banned by academia from the perspective of education and teaching. Front Psychol. 2023;14:1181712. https://doi.org/10.3389/fpsyg.2023.1181712 .

Zhu C, Sun M, Luo J, Li T, Wang M. How to harness the potential of ChatGPT in education? Knowl Manag ELearn. 2023;15(2):133–52. https://doi.org/10.34105/j.kmel.2023.15.008 .


Funding

This paper is co-funded by the Academy of Finland (Suomen Akatemia), Research Council for Natural Sciences and Engineering, for the project Towards Precision Education: Idiographic Learning Analytics (TOPEILA), decision number 350560.

Author information

Authors and Affiliations

School of Computing, University of Eastern Finland, 80100, Joensuu, Finland

Yazid Albadarin, Mohammed Saqr, Nicolas Pope & Markku Tukiainen


Contributions

YA contributed to the literature search, data analysis, discussion, and conclusion, as well as to the manuscript’s writing, editing, and finalization. MS contributed to the study’s design, conceptualization, funding acquisition, project administration, resource allocation, supervision, validation, literature search, and analysis of results, and to writing, revising, and approving the manuscript in its finalized state. NP contributed to the results and discussion, provided supervision, and contributed to the writing process, revisions, and final approval of the manuscript. MT contributed to the study’s conceptualization, resource management, supervision, and to writing, revising, and approving the manuscript.

Corresponding author

Correspondence to Yazid Albadarin.

Ethics declarations

Competing interests

The authors declare no competing interests.

Additional information

Publisher’s Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Appendix

See Table 4.

The data presented in Table 4 were synthesized by first identifying relevant studies through a search of five databases (ERIC, Scopus, Web of Knowledge, Dimensions.ai, and lens.org) using the keywords “ChatGPT” and “education”. Inclusion/exclusion criteria were then applied, and data extraction was performed using Creswell’s [15] coding techniques to capture key information and identify common themes across the included studies.
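The screening described above was performed manually, but its logic (keyword match, then inclusion/exclusion filtering) can be sketched programmatically. The record fields and the two criteria below are hypothetical illustrations, not the review's actual coding scheme:

```python
# Illustrative sketch of a keyword-plus-criteria screening step.
# Record structure and criteria are hypothetical examples.
records = [
    {"title": "ChatGPT in higher education", "venue": "journal", "empirical": True},
    {"title": "A history of chalkboards", "venue": "journal", "empirical": True},
    {"title": "ChatGPT and education opinion piece", "venue": "blog", "empirical": False},
]

def matches_keywords(rec: dict) -> bool:
    # Mirrors the search keywords "ChatGPT" and "education"
    text = rec["title"].lower()
    return "chatgpt" in text and "education" in text

def meets_inclusion_criteria(rec: dict) -> bool:
    # Hypothetical stand-ins for the review's inclusion/exclusion criteria:
    # peer-reviewed venue and an empirical design
    return rec["venue"] == "journal" and rec["empirical"]

included = [r for r in records if matches_keywords(r) and meets_inclusion_criteria(r)]
print([r["title"] for r in included])  # ['ChatGPT in higher education']
```

In practice each surviving record would then be coded for themes; the point of the sketch is only that keyword matching and eligibility checks are separate, sequential filters.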

Rights and permissions

Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/ .


About this article

Albadarin, Y., Saqr, M., Pope, N. et al. A systematic literature review of empirical research on ChatGPT in education. Discov Educ 3, 60 (2024). https://doi.org/10.1007/s44217-024-00138-2


Received: 22 October 2023

Accepted: 10 May 2024

Published: 26 May 2024

DOI: https://doi.org/10.1007/s44217-024-00138-2


Keywords

  • Large language models
  • Educational technology
  • Systematic review

