Machine learning models for abstract screening task - A systematic literature review application for health economics and outcome research

  Jingcheng Du
  Ekin Soysal
  Dong Wang
  Long He
  Bin Lin
  Jingqi Wang
  Frank J. Manion
  Yeran Li
  Elise Wu
  Lixia Yao  

Systematic literature reviews (SLRs) are critical for life-science research. However, the manual selection and retrieval of relevant publications can be a time-consuming process. This study aims to (1) develop two disease-specific annotated corpora, one for human papillomavirus (HPV) associated diseases and the other for pneumococcal-associated pediatric diseases (PAPD), and (2) optimize machine- and deep-learning models to facilitate automation of the SLR abstract screening.

This study constructed two disease-specific SLR screening corpora for HPV and PAPD, which contained citation metadata and corresponding abstracts. Performance was evaluated using precision, recall, accuracy, and F1-score of multiple combinations of machine- and deep-learning algorithms and features such as keywords and MeSH terms.

Results and conclusions

The HPV corpus contained 1697 entries, with 538 relevant and 1159 irrelevant articles. The PAPD corpus included 2865 entries, with 711 relevant and 2154 irrelevant articles. Adding features beyond title and abstract improved the accuracy of machine learning models by 3% for the HPV corpus and 2% for the PAPD corpus. Transformer-based deep learning models consistently outperformed conventional machine learning algorithms, highlighting the strength of domain-specific pre-trained language models for SLR abstract screening. This study provides a foundation for the development of more intelligent SLR systems.

Systematic literature reviews (SLRs) are an essential tool in many areas of health sciences, enabling researchers to understand the current knowledge around a topic and identify future research and development directions. In the field of health economics and outcomes research (HEOR), SLRs play a crucial role in synthesizing evidence around unmet medical needs, comparing treatment options, and preparing the design and execution of future real-world evidence studies. SLRs provide a comprehensive and transparent analysis of available evidence, allowing researchers to make informed decisions and improve patient outcomes.

Conducting an SLR involves synthesizing high-quality evidence from biomedical literature in a transparent and reproducible manner. An SLR seeks to include all available evidence on a given research question and provides some assessment of the quality of that evidence [ 1 , 2 ]. To conduct an SLR, one or more bibliographic databases are queried based on a given research question and a corresponding set of inclusion and exclusion criteria, resulting in the selection of a relevant set of abstracts. The abstracts are reviewed, further refining the set of articles that are used to address the research question. Finally, appropriate data is systematically extracted from the articles and summarized [ 1 , 3 ].

The current approach to conducting an SLR is manual review, with data collection and summary done by domain experts against pre-specified eligibility criteria. This is time-consuming, labor-intensive, expensive, and non-scalable given the current more-than-linear growth of the biomedical literature [ 4 ]. Michelson and Reuter estimate that each SLR costs approximately $141,194.80 and that, on average, major pharmaceutical companies conduct 23.36 SLRs and major academic centers 177.32 SLRs per year, though the cost may vary based on the scope of different reviews [ 4 ]. Clearly, automated methods are needed, both from a cost and time savings perspective and for the ability to effectively scan and identify increasing amounts of literature, thereby allowing domain experts to spend more time analyzing the data and gleaning insights.

One major task of an SLR project that involves large amounts of manual effort is abstract screening. For this task, selection criteria are developed and the citation metadata and abstract for articles tentatively meeting these criteria are retrieved from one or more bibliographic databases (e.g., PubMed). The abstracts are then examined in more detail to determine if they are relevant to the research question(s) and should be included or excluded from further consideration. Consequently, the task of determining whether articles are relevant or not based on their titles, abstracts and metadata can be treated as a binary classification task, which can be addressed by natural language processing (NLP). NLP involves recognizing entities and relationships expressed in text and leverages machine-learning (ML) and deep-learning (DL) algorithms together with computational semantics to extract information. The past decade has witnessed significant advances in these areas for biomedical literature mining. A comprehensive review of how NLP techniques in particular are being applied for automatic mining and knowledge extraction from biomedical literature can be found in Zhao et al. [ 5 ].

Materials and methods

The aims of this study were to: (1) identify and develop two disease-specific corpora, one for human papillomavirus (HPV) associated diseases and the other for pneumococcal-associated pediatric diseases suitable for training the ML and DL models underlying the necessary NLP functions; (2) investigate and optimize the performance of the ML and DL models using different sets of features (e.g., keywords, Medical Subject Heading (MeSH) terms [ 6 ]) to facilitate automation of the abstract screening tasks necessary to construct a SLR. Note that these screening corpora can be used as training data to build different NLP models. We intend to freely share these two corpora with the entire scientific community so they can serve as benchmark corpora for future NLP model development in this area.

SLR corpora preparation

Two completed disease-specific SLR studies by Merck & Co., Inc., Rahway, NJ, USA were used as the basis to construct corpora for abstract-level screening. Both SLR studies were relevant to health economics and outcome research: one for human papillomavirus (HPV) associated diseases (referred to as the HPV corpus), and one for pneumococcal-associated pediatric diseases (referred to as the PAPD corpus). Both of the original SLR studies contained literature from PubMed/MEDLINE and EMBASE. Since we intended for the screening corpora to be released to the community, we only kept citations found in PubMed/MEDLINE in the finalized corpora. Because the original SLR studies did not contain the PubMed ID (PMID) for each article, we matched each article’s citation information (if available) against PubMed and then collected metadata such as authors, journals, keywords, MeSH terms, and publication types using the PubMed Entrez Programming Utilities (E-utilities) Application Programming Interface (API). A detailed description of the two corpora can be seen in Table 1. Both of the resulting corpora are publicly available at https://github.com/Merck/NLP-SLR-corpora.
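As a rough illustration of the metadata retrieval step, the sketch below builds an E-utilities efetch request for a list of PMIDs and extracts the journal, MeSH terms, keywords, and publication types from a PubMed XML response. The helper names, the placeholder PMID, and the abbreviated hand-written XML record are hypothetical; the article does not describe the authors' actual retrieval code, so this should be read as one plausible approach only.

```python
import xml.etree.ElementTree as ET
from urllib.parse import urlencode

# Public base URL of the NCBI E-utilities efetch endpoint.
EFETCH_URL = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi"


def build_efetch_url(pmids):
    """Return an efetch URL requesting PubMed records as XML."""
    params = {"db": "pubmed", "id": ",".join(pmids), "retmode": "xml"}
    return EFETCH_URL + "?" + urlencode(params)


def parse_metadata(xml_text):
    """Extract a few metadata fields per article from a PubMed XML response."""
    records = {}
    root = ET.fromstring(xml_text.strip())
    for article in root.iter("PubmedArticle"):
        pmid = article.findtext("MedlineCitation/PMID")
        records[pmid] = {
            "journal": article.findtext(".//Journal/Title"),
            "mesh": [d.text for d in article.iter("DescriptorName")],
            "keywords": [k.text for k in article.iter("Keyword")],
            "publication_types": [p.text for p in article.iter("PublicationType")],
        }
    return records


# Abbreviated, hand-written record (hypothetical) used only to exercise the parser.
SAMPLE_XML = """
<PubmedArticleSet>
  <PubmedArticle>
    <MedlineCitation>
      <PMID>00000000</PMID>
      <Article>
        <Journal><Title>Example Journal</Title></Journal>
        <PublicationTypeList>
          <PublicationType>Journal Article</PublicationType>
        </PublicationTypeList>
      </Article>
      <MeshHeadingList>
        <MeshHeading>
          <DescriptorName>Papillomavirus Infections</DescriptorName>
        </MeshHeading>
      </MeshHeadingList>
      <KeywordList><Keyword>HPV</Keyword></KeywordList>
    </MedlineCitation>
  </PubmedArticle>
</PubmedArticleSet>
"""

url = build_efetch_url(["00000000"])
metadata = parse_metadata(SAMPLE_XML)
```

No network request is made here; in practice the URL would be fetched and the response passed to the parser, subject to NCBI's usage guidelines.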

Machine learning algorithms

Although deep learning algorithms have demonstrated superior performance on many NLP tasks, conventional machine learning algorithms have certain advantages, such as low computation costs and faster training and prediction speed.

We evaluated four traditional ML-based document classification algorithms, XGBoost [ 7 ], Support Vector Machines (SVM) [ 8 ], Logistic regression (LR) [ 9 ], and Random Forest [ 10 ] on the binary inclusion/exclusion classification task for abstract screening. Salient characteristics of these models are as follows:

XGBoost: Short for “eXtreme Gradient Boosting”, XGBoost is a boosting-based ensemble of algorithms that turn weak learners into strong learners by focusing on where the individual models went wrong. In Gradient Boosting, individual weak models train upon the difference between the prediction and the actual results [ 7 ]. We set max_depth at 3, n_estimators at 150 and learning rate at 0.7.

Support vector machine (SVM): SVM is one of the most robust prediction methods based on statistical learning frameworks. It aims to find a hyperplane in an N-dimensional space (where N = the number of features) that distinctly classifies the data points [ 8 ]. We set C at 100, gamma at 0.005 and kernel as radial basis function.

Logistic regression (LR): LR is a classic statistical model that in its basic form uses a logistic function to model a binary dependent variable [ 9 ]. We set C at 5 and penalty as l2.

Random forest (RF): RF is a machine learning technique that utilizes ensemble learning to combine many decision trees classifiers through bagging or bootstrap aggregating [ 10 ]. We set n_estimators at 100 and max_depth at 14.

These four algorithms were trained for both the HPV screening task and the PAPD screening task using the corresponding training corpus.
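The reported settings for the four algorithms can be collected in one place, as in the sketch below. It assumes scikit-learn for SVM, LR, and RF, and the scikit-learn-style wrapper from the xgboost package for XGBoost; the latter is an assumption, since the article names only scikit-learn. Any hyperparameter not mentioned in the text is left at the library default, which may differ from what the authors used.

```python
# Hyperparameters as reported in the text; all others stay at library defaults.
HYPERPARAMS = {
    "xgboost": {"max_depth": 3, "n_estimators": 150, "learning_rate": 0.7},
    "svm": {"C": 100, "gamma": 0.005, "kernel": "rbf"},
    "logistic_regression": {"C": 5, "penalty": "l2"},
    "random_forest": {"n_estimators": 100, "max_depth": 14},
}


def build_models():
    """Instantiate the four classifiers (requires scikit-learn and xgboost)."""
    # Imports are deferred so the settings above can be inspected
    # even where these libraries are not installed.
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.linear_model import LogisticRegression
    from sklearn.svm import SVC
    from xgboost import XGBClassifier

    return {
        "xgboost": XGBClassifier(**HYPERPARAMS["xgboost"]),
        "svm": SVC(**HYPERPARAMS["svm"]),
        "logistic_regression": LogisticRegression(**HYPERPARAMS["logistic_regression"]),
        "random_forest": RandomForestClassifier(**HYPERPARAMS["random_forest"]),
    }
```

Each returned model exposes the usual fit and predict methods, so the same training loop can be reused for both the HPV and the PAPD screening task.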

For each of the four algorithms, we examined performance using (1) only the baseline features (title and abstract of each article), and (2) the baseline features plus five additional metadata features (MeSH terms, authors, keywords, journal, and publication types) retrieved for each article using the PubMed E-utilities API. Conventionally, title and abstract are the first information a human reviewer depends on when deciding whether to include or exclude an article. Consequently, we used title and abstract as the baseline features to classify whether an article should be included at the abstract screening stage. For baseline evaluation, we concatenated the titles and abstracts and extracted the TF-IDF (term frequency-inverse document frequency) vector for the corpus. TF-IDF evaluates how relevant a word is to a document in a collection of documents. For the additional features, we extracted a TF-IDF vector for each feature separately and then concatenated the extracted vectors with the title and abstract vector. XGBoost was selected for the feature evaluation process due to its relatively quick computational running time and robust performance.
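To make the per-feature vectorization concrete, the following is a minimal pure-Python sketch that computes a TF-IDF block for each field and concatenates the blocks into one feature vector per article. It uses the textbook weighting tf × log(N/df), whereas scikit-learn's TfidfVectorizer applies a smoothed IDF and L2 normalization by default, so the numbers would not match the study's. The function names and toy records are hypothetical.

```python
import math
from collections import Counter


def fit_idf(docs):
    """Compute inverse document frequency for every term in tokenized docs."""
    n_docs = len(docs)
    doc_freq = Counter(term for doc in docs for term in set(doc))
    return {term: math.log(n_docs / df) for term, df in doc_freq.items()}


def tfidf_vector(doc, idf, vocab):
    """Return the TF-IDF vector of one tokenized document over a fixed vocabulary."""
    counts = Counter(doc)
    total = len(doc) or 1  # avoid division by zero for empty fields
    return [counts[term] / total * idf[term] for term in vocab]


def featurize(records, fields):
    """Build one TF-IDF block per field and concatenate the blocks per record."""
    rows = [[] for _ in records]
    for field in fields:
        docs = [rec.get(field, "").lower().split() for rec in records]
        idf = fit_idf(docs)
        vocab = sorted(idf)
        for row, doc in zip(rows, docs):
            row.extend(tfidf_vector(doc, idf, vocab))
    return rows


# Hypothetical toy records; "text" stands for the concatenated title and abstract.
records = [
    {"text": "hpv vaccine cost effectiveness", "mesh": "papillomavirus vaccines"},
    {"text": "pneumococcal vaccine burden children", "mesh": "pneumococcal infections"},
]

baseline = featurize(records, ["text"])           # title and abstract only
with_mesh = featurize(records, ["text", "mesh"])  # baseline plus one extra feature
```

In this toy corpus the word "vaccine" occurs in both documents, so its weight is zero, which reflects the idea that terms common to every document carry no discriminating information.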

Deep learning algorithms

Conventional ML methods rely heavily on manually designed features and suffer from the challenges of data sparsity and poor transportability when applied to new use cases. Deep learning (DL) is a set of machine learning algorithms based on deep neural networks that has advanced the performance of text classification along with many other NLP tasks. Transformer-based deep learning models, such as BERT (Bidirectional Encoder Representations from Transformers), have achieved state-of-the-art performance in many NLP tasks [ 11 ]. A Transformer is a deep learning architecture designed to handle sequential input data such as natural language by adopting attention mechanisms that differentially weigh the significance of each part of the input data [ 12 ]. The BERT model and its variants (which use the Transformer as a basic unit) leverage the power of transfer learning by first pre-training models with hundreds of millions of parameters on large volumes of unlabeled textual data. The resulting model is then fine-tuned for a particular downstream NLP application, such as text classification, named entity recognition, or relation extraction. The following three BERT models were evaluated on both the HPV and PAPD corpora using two sets of features (title and abstract only versus title and abstract with all additional features added to the text). For all BERT models, we used the Adam optimizer with weight decay. We set the learning rate at 1e-5, the batch size at 8, and the number of epochs at 20.
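One way to picture "adding all additional features into the text" is sketched below: metadata fields are appended to the title and abstract as plain text before tokenization. The exact field order, labels, and separators used in the study are not given in the article, so the format here is purely an assumption, as are the function name and the sample record. The settings dictionary simply restates the reported fine-tuning values.

```python
# Fine-tuning settings as reported in the text.
FINE_TUNING = {
    "optimizer": "Adam with weight decay",
    "learning_rate": 1e-5,
    "batch_size": 8,
    "epochs": 20,
}


def build_input_text(record, extra_fields=()):
    """Join title and abstract, optionally appending metadata fields as text."""
    parts = [record["title"], record["abstract"]]
    for field in extra_fields:
        value = record.get(field)
        if not value:
            continue  # skip fields missing from this citation
        if isinstance(value, (list, tuple)):
            # Lists such as MeSH terms are flattened to a delimited string.
            value = "; ".join(value)
        parts.append(f"{field}: {value}")
    return " ".join(parts)


# Hypothetical citation record.
record = {
    "title": "HPV vaccination costs.",
    "abstract": "We estimate costs.",
    "mesh": ["Papillomavirus Vaccines", "Cost-Benefit Analysis"],
    "journal": "Example Journal",
}

baseline_text = build_input_text(record)
full_text = build_input_text(record, extra_fields=("mesh", "journal"))
```

Because BERT-style models accept a limited number of input tokens, appended metadata competes with the abstract for space, which is one practical consideration when choosing which fields to add.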

BERT base: this is the original BERT model released by Google. The BERT base model was pre-trained on textual data in the general domain, i.e., BooksCorpus (800 M words) and English Wikipedia (2500 M words) [ 11 ].

BioBERT base: as biomedical language differs from general language, BERT models trained on general textual data may not work well on biomedical NLP tasks. BioBERT was further pre-trained (starting from the original BERT model) on large-scale biomedical corpora, including PubMed abstracts (4.5B words) and PubMed Central full-text articles (13.5B words) [ 13 ].

PubMedBERT: PubMedBERT was pre-trained from scratch using abstracts from PubMed. This model has achieved state-of-the-art performance on several biomedical NLP tasks on Biomedical Language Understanding and Reasoning Benchmark [ 14 ].

Text pre-processing and libraries that were used

We removed special characters and common English words as part of text pre-processing. The default tokenizer from scikit-learn was adopted for tokenization. Scikit-learn was also used for TF-IDF feature extraction and for the implementation of the machine learning algorithms. The Transformers library from Hugging Face was used to implement the deep learning algorithms.
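The pre-processing described above can be approximated with a few lines of standard-library Python. This is a sketch only: the stop-word list is a short illustrative sample rather than the list used in the study, and tokens are split on whitespace instead of using scikit-learn's default tokenizer.

```python
import re

# Small illustrative stop-word list; a real pipeline would use a fuller one.
STOP_WORDS = {"a", "an", "and", "are", "for", "in", "is", "of", "on", "the", "to", "with"}


def preprocess(text):
    """Lowercase, replace special characters with spaces, and drop stop words."""
    cleaned = re.sub(r"[^a-z0-9\s]", " ", text.lower())
    return [token for token in cleaned.split() if token not in STOP_WORDS]


tokens = preprocess("Cost-effectiveness of the 9-valent HPV vaccine: a review.")
# tokens == ["cost", "effectiveness", "9", "valent", "hpv", "vaccine", "review"]
```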

Evaluation datasets were constructed from the HPV and PAPD corpora and were split into training, validation and testing sets with a ratio of 8:1:1 for the two evaluation tasks: (1) ML algorithms performance assessment; and (2) DL algorithms performance assessment. Models were fitted on the training sets, model hyperparameters were optimized on the validation sets, and performance was evaluated on the testing sets. The major metrics follow their standard definitions:

\[ \text{Precision} = \frac{TP}{TP + FP} \]

\[ \text{Recall} = \frac{TP}{TP + FN} \]

\[ \text{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN} \]

\[ \text{F1} = \frac{2 \times \text{Precision} \times \text{Recall}}{\text{Precision} + \text{Recall}} \]

Here TP, TN, FP, and FN denote the numbers of true positives, true negatives, false positives, and false negatives. A true positive is an outcome where the model correctly predicts the positive class (e.g., “included” in our tasks). Similarly, a true negative is an outcome where the model correctly predicts the negative class (e.g., “excluded” in our tasks). A false positive is an outcome where the model incorrectly predicts the positive class, and a false negative is an outcome where the model incorrectly predicts the negative class. We repeated all experiments five times and report the mean scores with standard deviation.
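The evaluation protocol can be sketched with the standard library alone: an 8:1:1 split, the four metrics computed from true/false positive and negative counts, and a mean with standard deviation over repeated runs. Several details are assumptions, since the article does not state them: the split is a simple random shuffle (not stratified), the seed is arbitrary, the standard deviation is the sample version, and the five accuracy values are invented for illustration.

```python
import random
import statistics


def split_8_1_1(items, seed=0):
    """Shuffle and split items into training, validation, and test sets (8:1:1)."""
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)
    n_train = int(0.8 * len(shuffled))
    n_val = int(0.1 * len(shuffled))
    return (
        shuffled[:n_train],
        shuffled[n_train:n_train + n_val],
        shuffled[n_train + n_val:],  # remainder goes to the test set
    )


def screening_metrics(y_true, y_pred, positive=1):
    """Compute precision, recall, accuracy, and F1 for the 'included' class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)
    tn = len(pairs) - tp - fp - fn
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    accuracy = (tp + tn) / len(pairs) if pairs else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {"precision": precision, "recall": recall, "accuracy": accuracy, "f1": f1}


# Split sizes for a corpus the size of the HPV corpus (1697 entries).
train, val, test = split_8_1_1(range(1697))

# Toy labels: 1 = included, 0 = excluded.
y_true = [1, 1, 1, 0, 0, 0, 0, 0]
y_pred = [1, 1, 0, 1, 0, 0, 0, 0]
scores = screening_metrics(y_true, y_pred)

# Hypothetical accuracies from five repeated runs.
run_accuracies = [0.85, 0.86, 0.87, 0.86, 0.86]
mean_accuracy = statistics.mean(run_accuracies)
std_accuracy = statistics.stdev(run_accuracies)
```

In the toy example there are two true positives, one false positive, one false negative, and four true negatives, giving an accuracy of 0.75 and precision, recall, and F1 of 2/3 each.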

Table 2 shows the baseline comparison using different feature combinations for the SLR text classification tasks using XGBoost. As noted, adding features beyond title and abstract further improved the classification accuracy. Specifically, using all available features for the HPV classification increased accuracy by approximately 3% and F1 score by approximately 3%; using all available features for the PAPD classification increased accuracy by approximately 2% and F1 score by approximately 4%. The additional features provided a stronger boost in precision, which contributed to the overall performance improvement.

The comparison of the four machine learning algorithms on the article inclusion/exclusion classification task with all features is shown in Table 3. XGBoost achieved the highest accuracy and F1 scores in both tasks. Table 4 shows the comparison between XGBoost and deep learning algorithms on the classification tasks for each disease. Both XGBoost and deep learning models consistently achieved higher accuracy scores when using all features as input. Among all models, BioBERT achieved the highest accuracy at 0.88, compared with XGBoost at 0.86. XGBoost had the highest F1 score at 0.8 and the highest recall score at 0.9 for inclusion prediction.

Discussions and conclusions

Abstract screening is a crucial step in conducting a systematic literature review (SLR), as it helps to identify relevant citations and reduces the effort required for full-text screening and data element extraction. However, screening thousands of abstracts can be a time-consuming and burdensome task for scientific reviewers. In this study, we systematically investigated the use of various machine learning and deep learning algorithms, using different sets of features, to automate abstract screening tasks. We evaluated these algorithms using disease-focused SLR corpora, including one for human papillomavirus (HPV) associated diseases and another for pneumococcal-associated pediatric diseases (PAPD). The publicly available corpora used in this study can be used by the scientific community for advanced algorithm development and evaluation. Our findings suggest that machine learning and deep learning algorithms can effectively automate abstract screening tasks, saving valuable time and effort in the SLR process.

Although machine learning and deep learning algorithms trained on the two SLR corpora showed some variations in performance, there were also some consistencies. Firstly, adding additional citation features significantly improved the performance of conventional machine learning algorithms, although the improvement was not as strong in transformer-based deep learning models. This may be because transformer models were mostly pre-trained on abstracts, which do not include additional citation information like MeSH terms, keywords, and journal names. Secondly, when using only title and abstract as input, transformer models consistently outperformed conventional machine learning algorithms, highlighting the strength of subject domain-specific pre-trained language models. When all citation features were combined as input, conventional machine learning algorithms showed comparable performance to deep learning models. Given the much lower computation costs and faster training and prediction time, XGBoost or support vector machines with all citation features could be an excellent choice for developing an abstract screening system.

Some limitations remain for this study. Although we evaluated cutting-edge machine learning and deep learning algorithms on two SLR corpora, we did not conduct much task-specific customization of the learning algorithms, such as task-specific feature engineering and rule-based post-processing, which could offer additional performance benefits. As the focus of this study is to provide generalizable strategies for applying machine learning to abstract screening tasks, we leave task-specific customization to future work. The corpora we evaluated in this study mainly focus on health economics and outcome research; the generalizability of the learning algorithms to other domains will benefit from formal examination.

Extensive studies have shown the superiority of transformer-based deep learning models for many NLP tasks [ 11 , 13 , 14 , 15 , 16 ]. Based on our experiments, however, adding features to the pre-trained language models that have not seen these features before may not significantly boost their performance. It would be interesting to find a better way of encoding additional features to these pre-trained language models to maximize their performance. In addition, transfer learning has proven to be an effective technique to improve the performance on a target task by leveraging annotation data from a source task [ 17 , 18 , 19 ]. Thus, for a new SLR abstract screening task, it would be worthwhile to investigate the use of transfer learning by adapting our (publicly available) corpora to the new target task.

When labeled data is available, supervised machine learning algorithms can be very effective and efficient for article screening. However, as there is an increasing need for explainability and transparency in NLP-assisted SLR workflows, supervised machine learning algorithms face challenges in explaining why certain papers fail to fulfill the criteria. Recent large language models (LLMs), such as ChatGPT [ 20 ] and Gemini [ 21 ], show remarkable performance on NLP tasks and good potential for explainability. Although there are some concerns about the bias and hallucinations that LLMs could bring, it would be worthwhile to evaluate further how LLMs could be applied to SLR tasks, and in particular how well they perform when taking free-text article screening criteria as input and providing explanations for article screening decisions.

The annotated corpora underlying this article are available at https://github.com/Merck/NLP-SLR-corpora

This research was supported by Merck Sharp & Dohme LLC, a subsidiary of Merck & Co., Inc., Rahway, NJ, USA.

Study concept and design: JD and LY. Corpus preparation: DW, YL and LY. Experiments: JD and ES. Draft of the manuscript: JD, DW, FJM and LY. Acquisition, analysis, or interpretation of data: JD, ES, DW and LY. Critical revision of the manuscript for important intellectual content: JD, ES, DW, LH, BL, JW, FJM, YL, EW, LY. Study supervision: LY.

The content is the sole responsibility of the authors and does not necessarily represent the official views of Merck & Co., Inc., Rahway, NJ, USA or Intelligent Medical Objects.

DW is an employee of Merck Sharp & Dohme LLC, a subsidiary of Merck & Co., Inc., Rahway, NJ, USA. EW, YL, and LY were employees of Merck Sharp & Dohme LLC, a subsidiary of Merck & Co., Inc., Rahway, NJ, USA for this work. JD, LH, JW, and FJM are employees of Intelligent Medical Objects. ES was an employee of Intelligent Medical Objects during his contributions, and is currently an employee of EBSCO Information Services.

Received: 19 May 2023

Accepted: 18 April 2024

Published: 09 May 2024

Communication in Primary Healthcare: A State-of-the-Art Literature Review of Conversation-Analytic Research


  Nuffield Department of Primary Care Health Sciences, University of Oxford, Radcliffe Observatory Quarter, U.K.
  School of Primary Care, Population Sciences and Medical Education, University of Southampton, Aldermoor Health Centre, U.K.
We report the first state-of-the-art review of conversation-analytic (CA) research on communication in primary healthcare. We conducted a systematic search across multiple bibliographic databases and specialist sources and employed backward and forward citation tracking. We included 177 empirical studies spanning four decades of research and 16 different countries/health systems, with data in 17 languages. The majority of studies originated in the United States and the United Kingdom and focused on medical visits between physicians and adult patients. We generated three broad research themes in order to synthesize the study findings: managing agendas, managing participation, and managing authority. We characterize the state-of-the-art for each theme, illustrating the progression of the work and making comparisons across different languages and health systems, where possible. We consider practical applications of the findings, reflect on the state of current knowledge, and suggest some directions for future research. Data reported are in multiple languages.

© 2024 The Author(s). Published with license by Taylor & Francis Group, LLC.

Biodiesel supply chain network design: a comprehensive review with qualitative and quantitative insights

  Sourena Rahmani
  Alireza Goli
  Ali Zackery  

The global community is actively pursuing alternative energy sources to mitigate environmental concerns and decrease dependence on fossil fuels. Biodiesel, recognized as a clean and eco-friendly fuel with advantages over petroleum-based alternatives, has been identified as a viable substitute. However, its commercialization encounters challenges due to costly production processes. Establishing a more efficient supply chain for mass production and distribution could surmount these obstacles, rendering biodiesel a cost-effective solution. Despite numerous review articles across various renewable energy supply chain domains, there remains a gap in the literature specifically addressing the biodiesel supply chain network design. This research entails a comprehensive systematic literature review (SLR) focusing on the design of biodiesel supply chain networks. The primary objective is to formulate an economically, environmentally, and socially optimized supply chain framework. The review also seeks to offer a holistic overview of pertinent technical terms and key activities involved in these supply chains. Through this SLR, a thorough examination and synthesis of existing literature will yield valuable insights into the design and optimization of biodiesel supply chains. Additionally, it will identify critical research gaps in the field, proposing the exploration of fourth-generation feedstocks, integration of multi-channel chains, and the incorporation of sustainability and resilience aspects into the supply chain network design. These proposed areas aim to address existing knowledge gaps and enhance the overall effectiveness of biodiesel supply chain networks.

Department of Industrial Engineering and Futures Studies, Faculty of Engineering, University of Isfahan, Isfahan, Iran

Sourena Rahmani, Alireza Goli & Ali Zackery

All authors contributed to the study's conception and design. Material preparation, data collection, and analysis were performed by Sourena Rahmani, Alireza Goli, and Ali Zackery. The first draft of the manuscript was written by Sourena Rahmani. Alireza Goli and Ali Zackery commented on previous versions of the manuscript.

Responsible Editor: Ta Yeong Wu

Rahmani, S., Goli, A. & Zackery, A. Biodiesel supply chain network design: a comprehensive review with qualitative and quantitative insights. Environ Sci Pollut Res (2024). https://doi.org/10.1007/s11356-024-33392-w

Received: 09 August 2023

Accepted: 16 April 2024

Published: 11 May 2024

Road sediment, an underutilized material in environmental science research: A review of perspectives on United States studies with international context

Road sediment is a pervasive environmental medium that acts as both source and sink for a variety of natural and anthropogenic particles and often is enriched in heavy metals. Road sediment is generally understudied in the United States (U.S.) relative to other environmental media and compared to countries such as China and the United Kingdom (U.K.). However, the U.S. is an ideal target for these studies due to the diverse climates and wealth of geochemical, socioeconomic, demographic, and health data. This review outlines the existing U.S. road sediment literature while also providing key international perspectives and context. Furthermore, the most comprehensive table of U.S. road sediment studies to date is presented, which includes elemental concentrations, sample size, size fraction, collection and analytical methods, as well as digestion procedure. Overall, there were observed differences in studies by sampling time period for elemental concentrations, but not necessarily by climate in the U.S. Other key concepts addressed in this road sediment review include the processes controlling its distribution, the variety of nomenclature used, anthropogenic enrichment of heavy metals, electron microscopy, health risk assessments, remediation, and future directions of road sediment investigations. Going forward, it is recommended that studies with a higher geographic diversity are performed that consider smaller cities and rural areas. Furthermore, environmental justice must be a focus as community science studies of road sediment can elucidate pollution issues impacting areas of high need. Finally, this review calls for consistency in sampling, data reporting, and nomenclature to effectively expand work on understudied elements, particles, and background sediments.


