The use of large language models (LLMs) and generative AI in evidence synthesis and literature review is a rapidly evolving research area, and comprehensive, widely accepted guidelines are still under development. Below are a few useful references on the application of LLMs in evidence synthesis tasks such as systematic reviews.
Sandner, E., Hu, B., Simiceanu, A., Fontana, L., Jakovljevic, I., Henriques, A., Wagner, A., & Gütl, C. (2024). Screening Automation for Systematic Reviews: A 5-Tier Prompting Approach Meeting Cochrane’s Sensitivity Requirement. 2024 2nd International Conference on Foundation and Large Language Models (FLLM), 150–159.
Sysrev has a built-in generative AI auto-label feature that uses OpenAI's GPT-4o model to automate the labeling process. Sysrev also generates an auto-label report that lets users compare auto-label results to human labeling, so that labels can be assessed, improved, and optimized to maximize accuracy while keeping the assessment process transparent.
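Sysrev's internal pipeline is not public, but conceptually the auto-labeler sends each record's text together with your label question to the model and records the model's answer. The sketch below illustrates this idea using the OpenAI Python SDK; the prompt wording, function name, and system message are assumptions for illustration, not Sysrev's actual implementation.

```python
# Conceptual sketch only -- NOT Sysrev's actual auto-label pipeline.
# Assumes the OpenAI Python SDK (pip install openai) and an OPENAI_API_KEY in the environment.
from openai import OpenAI

client = OpenAI()

def auto_label(label_question: str, title: str, abstract: str) -> str:
    """Ask the model to answer one label question for one record (citation metadata only)."""
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[
            {"role": "system", "content": "You are screening records for a literature review. Answer concisely."},
            {"role": "user", "content": f"{label_question}\n\nTitle: {title}\nAbstract: {abstract}"},
        ],
    )
    return response.choices[0].message.content

# Hypothetical usage:
# print(auto_label("Does this study report health impacts of wildfires?",
#                  "Smoke exposure and asthma admissions", "..."))
```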
There are a few important things to know about the auto-label feature:
There are some useful settings that you can apply to control how the auto-labeler runs, including:
These features and settings are covered in more detail below.
Your label question acts as the generative AI prompt that is used by the auto-labeler to apply label answers to each record. For categorical labels, the auto-labeler will also access the Categories to retrieve its answers. To set up your labels for auto-labeling, go to Manage -> Label Definitions. Click on an existing label to edit it or create a new one by clicking on the label type at the bottom of the page.
In the following example, we have created a categorical label and want the auto-labeler to answer the question "What types of wildfire impacts are covered by this review?", providing one or more of the following answers: health, environmental, ecological, economic, or social. We could also include a 'none of these' option, but if we don't, the auto-labeler will simply not provide an answer.
Note that we have selected the following: the checkbox for Auto-label Extraction, to turn on this label for the auto-labeler; the Citations only option under Full Text, to indicate that we want the auto-labeler to look only at the metadata and not any attached PDFs; and No for Probability and Reasoning (more about this below).
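To see why the wording of the question and categories matters, it can help to picture how they might be combined into a single prompt. The sketch below is purely illustrative; Sysrev's actual prompt template is not documented here.

```python
# Hypothetical illustration of how a categorical label definition could become a prompt.
# The template wording is an assumption, not Sysrev's actual prompt.
label_question = "What types of wildfire impacts are covered by this review?"
categories = ["health", "environmental", "ecological", "economic", "social"]

prompt = (
    f"{label_question}\n"
    f"Choose one or more of the following categories, or give no answer if none apply: "
    f"{', '.join(categories)}."
)
print(prompt)
```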
After clicking Save, we can go back to the Articles tab to select the articles we want to auto-label. In this example, we filtered for only records that have been Included by two reviewers. We set Max Articles (the number of records we want the auto-labeler to label) to 20. You can see that the auto-labeler has estimated the cost of this run at $0.05.
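As a rough sanity check before scaling up, you can turn the estimate into a per-record cost and extrapolate. This is a minimal back-of-the-envelope sketch using only the figures shown above; actual costs depend on record length, model pricing, and which labels are enabled.

```python
# Back-of-the-envelope check on the auto-label cost estimate shown above.
estimated_run_cost = 0.05   # dollars for this run, as estimated by Sysrev
n_records = 20              # Max Articles for this run

cost_per_record = estimated_run_cost / n_records
print(f"~${cost_per_record:.4f} per record")                 # ~$0.0025 per record
print(f"~${cost_per_record * 1000:.2f} per 1,000 records")   # rough extrapolation, assumes similar record length
```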
When you are ready, and your selection does not exceed the budget in your account, click Run auto label. If successful, you should see the message "Last run just now: success" at the bottom of the auto-labeler box. To view auto-label answers, click on one of the labeled articles from the list and scroll down. Below the article abstract you will see the auto-label answers. In this example, you can see that the auto-label identified ecological and environmental impacts in this study record. It also included this record with 80% certainty.
Once you run the auto-labeler, you will see auto-label answers at the bottom of each record as shown in the image above. If you enabled "Probability and Reasoning", you can view the auto-labeler's reasoning process by clicking on the dropdown arrow next to the auto-label answer.
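If you plan to export or post-process results, it can help to think of each auto-label answer as a small structured record containing the answer, its probability, and the reasoning text. The field names and values below are purely illustrative and do not reflect Sysrev's actual export schema.

```python
# Purely illustrative structure for one auto-label answer -- NOT Sysrev's actual data format.
auto_label_answer = {
    "label": "Wildfire impact types",
    "answer": ["ecological", "environmental"],
    "include_probability": 0.80,   # the "80% certainty" shown for the Include decision above
    "reasoning": "Hypothetical example: the abstract describes effects on forest ecosystems and air quality.",
}
```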
You will now also see an auto-label report, located at the bottom left of your project's Overview page. This report provides detailed analytics comparing auto-label answers to reviewer answers. More information about the auto-label report can be found in the box below.
In general, before running the auto-labeler across the entirety of your project documents, you should test, assess and optimize your auto-labels on a small random sample of records. Here is a recommended workflow for optimizing and then using the auto-labeler for a literature or document review project.
Once the auto-labeler has been run, an Auto Label report will be generated. The report assesses the accuracy of the auto-labeler by comparing its answers against user-generated answers: if both a user and the auto-labeler have reviewed an article, the label analysis will appear in the report.
Note that the report will only be generated if the auto-labeler was run on labels and records for which there has already been some human reviewer activity.
The Auto Label report can be found at the very bottom of the project Overview page. A donut chart visualization provides a quick snapshot of the performance of the auto-labeler compared to user labels. Clicking on the See the full report link will take you to the full report.
The graphs at the top of the report provide a snapshot of the performance of the auto-labeler compared to user labels. In addition to the donut chart, bar graphs show the number of agreements and disagreements between each user and the auto-labeler.
You can view the results for the most recent run of the auto-labeler (i.e., showing results for only the records and labels used in the latest run) or for all auto-labeler runs (i.e., the most recent results for every label and record on which the auto-labeler has been used) by clicking the appropriate box under Report scope. During prompt engineering, it is most useful to look at the last run only.
This section provides details of true and false positives and negatives for each user, as well as a number of performance metrics:
The numbers of true and false positives and negatives are hyperlinks that open a new window showing the corresponding records. Reviewing these records will help you determine what adjustments might improve auto-labeler accuracy (or human reviewing, as the auto-labeler can sometimes reveal systematic errors by human reviewers).
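If you want to recompute or extend the performance metrics outside Sysrev, they follow from the confusion-matrix counts shown in the report. Below is a minimal sketch using the standard definitions; the counts are placeholders, and the exact set of metrics Sysrev reports may differ.

```python
# Standard confusion-matrix metrics computed from the counts in the Auto Label report.
# The example counts are placeholders; substitute the values from your own report.
tp, fp, fn, tn = 14, 2, 1, 3

sensitivity = tp / (tp + fn)                   # recall: share of true positives the auto-labeler caught
specificity = tn / (tn + fp)                   # share of true negatives correctly identified
precision   = tp / (tp + fp)                   # share of auto-label positives that were correct
accuracy    = (tp + tn) / (tp + fp + fn + tn)  # overall agreement with the reviewers

print(f"sensitivity={sensitivity:.2f} specificity={specificity:.2f} "
      f"precision={precision:.2f} accuracy={accuracy:.2f}")
```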
This section summarizes the true and false positives and negatives for each label, as well as the performance metrics described above, at the label level for all user answers combined.
This section provides a summary at the article level of user and auto-label answers. Click on the arrows in the bottom right corner to scroll through these articles.
There are a number of techniques you can use to improve the accuracy of your auto-labels. Here are a few things to try:
Answer true or false for the following questions about this article.
1. Is this a systematic review or meta-analysis?
2. Does this focus on the impacts of wildfires?
3. Does this include at least one health, environmental or economic impact of wildfires?
If all of the answers are true, include this article. If any of the answers are false, exclude the article.
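One benefit of decomposing the inclusion decision into simple true/false questions is that the combination rule becomes trivial to check programmatically. The sketch below assumes the model's answers have already been parsed into booleans; the dictionary keys are hypothetical names for the three questions above.

```python
# Combine the three screening questions exactly as the prompt instructs:
# include only if every answer is true.
def include_article(answers: dict[str, bool]) -> bool:
    return all(answers.values())

answers = {
    "systematic_review_or_meta_analysis": True,
    "focuses_on_wildfire_impacts": True,
    "health_environmental_or_economic_impact": False,
}
print("Include" if include_article(answers) else "Exclude")   # Exclude
```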