The use of large language models (LLMs) and generative AI in evidence synthesis and literature review is a rapidly evolving research area, and comprehensive, widely accepted guidelines are still under development. Below are a few useful references on the application of LLMs in evidence synthesis tasks such as systematic reviews.
Sandner, E., Hu, B., Simiceanu, A., Fontana, L., Jakovljevic, I., Henriques, A., Wagner, A., & Gütl, C. (2024). Screening Automation for Systematic Reviews: A 5-Tier Prompting Approach Meeting Cochrane’s Sensitivity Requirement. 2024 2nd International Conference on Foundation and Large Language Models (FLLM), 150–159.
Sysrev has a built-in generative AI auto-label feature that uses OpenAI's GPT-4o model to automate the labeling process. Sysrev also generates an auto-label report that lets users compare auto-label results to human labeling, so that labels can be assessed, improved, and optimized to maximize accuracy while keeping the assessment process transparent.
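Sysrev's internal pipeline is not public, but conceptually the auto-labeler sends each record's text together with your label question to the model and records the model's answer. The sketch below illustrates this idea using the OpenAI Python SDK; the prompt wording, function name, and system message are assumptions for illustration, not Sysrev's actual implementation.

```python
# Conceptual sketch only -- NOT Sysrev's actual auto-label pipeline.
# Assumes the OpenAI Python SDK (pip install openai) and an OPENAI_API_KEY in the environment.
from openai import OpenAI

client = OpenAI()

def auto_label(label_question: str, title: str, abstract: str) -> str:
    """Ask the model to answer one label question for one record (citation metadata only)."""
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[
            {"role": "system", "content": "You are screening records for a literature review. Answer concisely."},
            {"role": "user", "content": f"{label_question}\n\nTitle: {title}\nAbstract: {abstract}"},
        ],
    )
    return response.choices[0].message.content

# Hypothetical usage:
# print(auto_label("Does this study report health impacts of wildfires?",
#                  "Smoke exposure and asthma admissions", "..."))
```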
There are a few important things to know about the auto-label feature:
There are some useful settings that you can apply to control how the auto-labeler runs, including:
These features and settings are covered in more detail below.
Your label question acts as the generative AI prompt that is used by the auto-labeler to apply label answers to each record. For categorical labels, the auto-labeler will also access the Categories to retrieve its answers. To set up your labels for auto-labeling, go to Manage -> Label Definitions. Click on an existing label to edit it or create a new one by clicking on the label type at the bottom of the page.
In the following example, we have created a categorical label and want the auto-labeler to answer the question "What types of wildfire impacts are covered by this review?", providing one or more of the following answers: health, environmental, ecological, economic, or social. We could also include a 'none of these' option, but if we don't, the auto-labeler will simply not provide an answer.
Note that we have selected the following: the checkbox for Auto-label Extraction, to turn on this label for the auto-labeler; the Citations only option under Full Text, to indicate that we want the auto-labeler to look only at the metadata and not any attached PDFs; and No for Probability and Reasoning (more about this below).
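To see why the wording of the question and categories matters, it can help to picture how they might be combined into a single prompt. The sketch below is purely illustrative; Sysrev's actual prompt template is not documented here.

```python
# Hypothetical illustration of how a categorical label definition could become a prompt.
# The template wording is an assumption, not Sysrev's actual prompt.
label_question = "What types of wildfire impacts are covered by this review?"
categories = ["health", "environmental", "ecological", "economic", "social"]

prompt = (
    f"{label_question}\n"
    f"Choose one or more of the following categories, or give no answer if none apply: "
    f"{', '.join(categories)}."
)
print(prompt)
```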
After clicking Save, we can go back to the Articles tab to select the articles we want to auto-label. In this example, we filtered for only records that have been Included by two reviewers. We set Max Articles (the number of records we want the auto-labeler to label) to 20. You can see that the auto-labeler has estimated the cost of this run at $0.05.
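As a rough sanity check before scaling up, you can turn the estimate into a per-record cost and extrapolate. This is a minimal back-of-the-envelope sketch using only the figures shown above; actual costs depend on record length, model pricing, and which labels are enabled.

```python
# Back-of-the-envelope check on the auto-label cost estimate shown above.
estimated_run_cost = 0.05   # dollars for this run, as estimated by Sysrev
n_records = 20              # Max Articles for this run

cost_per_record = estimated_run_cost / n_records
print(f"~${cost_per_record:.4f} per record")                 # ~$0.0025 per record
print(f"~${cost_per_record * 1000:.2f} per 1,000 records")   # rough extrapolation, assumes similar record length
```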
When you are ready, and your selection does not exceed the budget in your account, click Run auto label. If successful, you should see the message "Last run just now: success" at the bottom of the auto-labeler box. To view auto-label answers, click on one of the labeled articles from the list and scroll down. Below the article abstract you will see the auto-label answers. In this example, you can see that the auto-label identified ecological and environmental impacts in this study record. It also included this record with 80% certainty.
Once you run the auto-labeler, you will see auto-label answers at the bottom of each record as shown in the image above. If you enabled "Probability and Reasoning", you can view the auto-labeler's reasoning process by clicking on the dropdown arrow next to the auto-label answer.
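If you plan to export or post-process results, it can help to think of each auto-label answer as a small structured record containing the answer, its probability, and the reasoning text. The field names and values below are purely illustrative and do not reflect Sysrev's actual export schema.

```python
# Purely illustrative structure for one auto-label answer -- NOT Sysrev's actual data format.
auto_label_answer = {
    "label": "Wildfire impact types",
    "answer": ["ecological", "environmental"],
    "include_probability": 0.80,   # the "80% certainty" shown for the Include decision above
    "reasoning": "Hypothetical example: the abstract describes effects on forest ecosystems and air quality.",
}
```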
You will now also see an auto-label report, located at the bottom left of your project's Overview page. This report provides detailed analytics comparing auto-label answers to reviewer answers. More information about the auto-label report can be found in the box below.
In general, before running the auto-labeler across the entirety of your project documents, you should test, assess and optimize your auto-labels on a small random sample of records. Here is a recommended workflow for optimizing and then using the auto-labeler for a literature or document review project.
Once the auto-labeler has been run, an Auto Label report will be generated. The report assesses the accuracy of the auto-labeler by comparing its answers against user-generated answers: if both a user and the auto-labeler have reviewed an article, the label analysis will appear in the report.
Note that the report will only be generated if the auto-labeler was run on labels and records for which there has already been some human reviewer activity.
The Auto Label report can be found at the very bottom of the project Overview page. A donut chart visualization provides a quick snapshot of the performance of the auto-labeler compared to user labels. Clicking on the See the full report link will take you to the full report.
The graphs at the top of the report provide a snapshot of the performance of the auto-labeler compared to user labels. In addition to the donut chart, bar graphs show the number of agreements and disagreements between each user and the auto-labeler.
You can view the results for the most recent run of the auto-labeler (i.e., showing results for only the records and labels used in the latest run) or for all auto-labeler runs (i.e., the most recent results for every label and record on which the auto-labeler has been used) by clicking the appropriate box under Report scope. During prompt engineering, it is most useful to look at the last run only.
This section provides details of true and false positives and negatives for each user, as well as a number of performance metrics:
The numbers of true and false positives and negatives are hyperlinks that open a new window showing the corresponding records. Reviewing these records will help you determine what adjustments might improve auto-labeler accuracy (or human reviewing, as the auto-labeler can sometimes reveal systematic errors by human reviewers).
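If you want to recompute or extend the performance metrics outside Sysrev, they follow from the confusion-matrix counts shown in the report. Below is a minimal sketch using the standard definitions; the counts are placeholders, and the exact set of metrics Sysrev reports may differ.

```python
# Standard confusion-matrix metrics computed from the counts in the Auto Label report.
# The example counts are placeholders; substitute the values from your own report.
tp, fp, fn, tn = 14, 2, 1, 3

sensitivity = tp / (tp + fn)                   # recall: share of true positives the auto-labeler caught
specificity = tn / (tn + fp)                   # share of true negatives correctly identified
precision   = tp / (tp + fp)                   # share of auto-label positives that were correct
accuracy    = (tp + tn) / (tp + fp + fn + tn)  # overall agreement with the reviewers

print(f"sensitivity={sensitivity:.2f} specificity={specificity:.2f} "
      f"precision={precision:.2f} accuracy={accuracy:.2f}")
```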
This section summarizes the true and false positives and negatives for each label, as well as the performance metrics described above, at the label level for all user answers combined.
This section provides a summary at the article level of user and auto-label answers. Click on the arrows in the bottom right corner to scroll through these articles.
There are a number of techniques you can use to improve the accuracy of your auto-labels. Here are a few things to try:
Answer true or false for the following questions about this article.
1. Is this a systematic review or meta-analysis?
2. Does this focus on the impacts of wildfires?
3. Does this include at least one health, environmental or economic impact of wildfires?
If all of the answers are true, include this article. If any of the answers are false, exclude the article.
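One benefit of decomposing the inclusion decision into simple true/false questions is that the combination rule becomes trivial to check programmatically. The sketch below assumes the model's answers have already been parsed into booleans; the dictionary keys are hypothetical names for the three questions above.

```python
# Combine the three screening questions exactly as the prompt instructs:
# include only if every answer is true.
def include_article(answers: dict[str, bool]) -> bool:
    return all(answers.values())

answers = {
    "systematic_review_or_meta_analysis": True,
    "focuses_on_wildfire_impacts": True,
    "health_environmental_or_economic_impact": False,
}
print("Include" if include_article(answers) else "Exclude")   # Exclude
```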