Differences

This shows you the differences between two versions of the page.

--- wiki:autolit:screening:inclusionpredictionmodel [2024/07/24 16:17]
jthurnham [Screening Model]
+++ wiki:autolit:screening:inclusionpredictionmodel [2024/09/23 11:59] (current)
jthurnham [Interpreting Robot Screener]
@@ Line 3: / Line 3: @@
 The Screening Model uses AI to learn from screening decisions within a specific nest, generating inclusion probabilities based on configuration.
-You may use the screening model in two ways:
+You may use the screening model in **two ways:**
-  * **Generate inclusion probabilities** to be displayed on records and assist in your own manual screening.
+  * [[wiki:autolit:screening:model|Generate inclusion probabilities]] to be displayed on records and assist in your own manual screening.
   * //Dual modes only:// Turn on **Robot Screener** to replace an expert reviewer, which makes decisions based on these probabilities.
@@ Line 10: / Line 10: @@
 The guidance below is specifically for using Robot Screener. For information on training the model for probability generation only, and for general information on how the model works, see [[wiki:autolit:screening:model|Using and Interpreting the Screening Model.]]
-===== Robot Screener =====
-The Screening Model can be used to power AI-assisted screening, replacing one expert in Dual Screening processes:
-{{youtube>9bsA4DMF4aE}}
+----
-When selecting a mode, note that in most cases, when employing Dual Two Pass Mode, **the Robot Screener should replace an expert reviewer only for the Abstract stage of screening **, as the model itself is trained on and screens based on Abstract content. Using the model in this way provides Advancement probabilities (in effect, relevancy scores) to each record.
+===== Robot Screener =====
-When used in "Robot Screener" mode, this serves as an AI alternative to a second reviewer in Dual Screening modes.
+The Screening Model can be used to power AI-assisted screening, replacing one expert in Dual Screening processes.
-===== Robot Screener Validation Studies =====
+**Robot Screener may only be turned on in Dual Screening modes**, and it's important to note at which stage its decisions are made and which term is used for the underlying probabilities:
+  * Dual Standard Mode: Robot Screener replaces a reviewer in the single round of Screening based on //Inclusion Probabilities.//
+  * Dual Two Pass Mode: Robot Screener replaces a reviewer in the Title/Abstract round of Screening only, based on //Advancement Probabilities.//
-Robot Screener has been validated in several published studies assessing its decisions in comparison to human decisions across multiple reviews and review types.
+{{youtube>9bsA4DMF4aE}}
-  * [[https://www.ispor.org/heor-resources/presentations-database/presentation/intl2024-3900/136092|Internal]] validation: Nested Knowledge assessed the Recall and Precision of Robot Screener in 19 projects with over 100,000 cumulative decisions, finding significantly lower Precision than humans (that is, humans correctly exclude studies more often) but significantly higher Recall-- meaning that the Robot Screener misses fewer includable records. In this analysis, Robot Screener was found to have 97.1% Recall.
-  * [[https://www.ispor.org/heor-resources/presentations-database/presentation/intl2024-3896/139709|External]] validation: Cichewicz et al. assessed diagnostic accuracy across many metrics in 15 projects, finding Robot Screener had significantly lower Precision than humans and no statistical differences in Recall between Robot Screener and humans. Robot Screener had fewer overall False Negatives, but no significant differences were found.
-  * Estimates of [[https://about.nested-knowledge.com/2023/11/10/the-data-is-in-deciding-when-to-automate-screening-in-your-slr/|time savings]] using different modes of Robot Screener have been previously published online.
-You can see a deeper summary of the Validation Studies and their implications [[https://about.nested-knowledge.com/2024/05/28/validation-summary-robot-screeners-performance-in-screening-records-for-systematic-literature-review-and-health-technology-assessment/|here]].
+----
-===== User Guide =====
-==== Running the Screening Model ====
-To learn about configuration settings, which enable you to toggle Manual updating vs. Automatic and Displayed vs. Hidden, see the [[:wiki:autolit:admin:configure#inclusion_prediction_model|Settings page]].
-In its default setting, the Screening Model must be run manually. To do so, click "Train Screening Model" on the Screening panel:
-{{  :undefined:4screen.png?nolink&  }}
-Once the modal opens, click "Train New Model."
+===== User Guide =====
-<WRAP center round important 60%>
+==== Settings ====
-To provide the model with sufficient information to begin understanding your review, we require **50 total adjudicated screening decisions with 10 advancements or inclusions**  before the model can be trained. If there is insufficient evidence to train the model, complete more adjudicated screening (2 reviewers and 1 adjudicator) until the "Train New Model" button becomes available.
-</WRAP>
+To turn on Robot Screener, head to Nest Settings --> Screening Model and toggle on Robot Screener.
-It may take a minute to train, after which it will populate a histogram on the left. From then on, each record will show a probability of inclusion or advancement:
+{{ :undefined:screenshot_2024-07-24_at_17.53.25.png?nolink |}}
-{{  :undefined:2screen.png?nolink&  }}
+Not displayed? You must be in a Dual Screening mode to use Robot Screener.
-==== Interpreting the Model ====
+Want to use Automatic Training for use in manual screening instead? [[wiki:autolit:screening:model|Learn more here.]]
-Once the Model is trained, you should see a graph where Included or Advanced, Excluded, and Unscreened records are represented by green, red, and purple curves, respectively:
+----
-{{  :undefined:model.png?nolink&  }}
+==== Meeting the Threshold ====
-Odds of inclusion/advancement are presented on the x-axis (ranging from 0 to 1). Since the Model is trained on a nest-by-nest basis, its accuracy ranges based on how many records it can train on and how many patterns it can find in inclusion activities.
+When toggling Robot Screener, you'll be presented with an instructional modal:
-You can see the accuracy in the modal after the model is trained. In the Cross Validation tab, several statistics are shown. Scores of Recall and Accuracy can be used to interpret how the model will perform on the remaining records. High recall (0.7/70%+) indicates that the model will less frequently exclude relevant records, meaning higher performance. Similarly, accuracy indicates how correct the model's decisions are compared to already screened records, and thus how it is likely to fare on upcoming records. See below for an example of a relatively well trained model:
+{{ :undefined:screenshot_2024-07-24_at_17.57.44.png?nolink |}}
-{{  :undefined:mod.png?nolink&  }}
+Highlighted in red are the requirements for training the screening model, and the actual numbers based on the progress made in your nest. You will not be able to turn on Robot Screener until these minimum requirements are met.
-==== Implications for Screening ====
+<WRAP center round important 60%>
+Before Robot Screener can be turned on, at least 50 adjudicated screening decisions, with at least 10 advancements/inclusions, must be made. Once this threshold is met and the Robot is turned on, it will continue to train on further adjudicated screening decisions.
+</WRAP>
-Inclusion Probability generated from the Screening model is also available as a filter in [[:wiki:autolit:utilities:inspector|Inspector]], which can assist with finding records based on their chance of inclusion/advancement. [[:wiki:autolit:utilities:inspector:bulk_actions#bulk_screening_status|Bulk Actions]] can also be taken at your discretion, but ensure that you are careful in excluding studies if you have not reviewed their Abstracts at least!
+Once trained and turned on, the Robot assigns both inclusion probabilities and actual screening decisions to the remaining records in the queue. Currently, Robot Screener does not assign exclusion reasons, so decisions are displayed as "Advance"/"Include" or "Robot Excluded". Each record that Robot Screener makes a decision on still needs a human to screen it as the second reviewer, and a human adjudicator to make the final decision. This means that each record will always have two pairs of eyes on it.
-===== Model Performance =====
+----
-==== Our Philosophy ====
-Screening is a complex task that relies on human expertise. Our model may stumble due to:
+==== Interpreting Robot Screener ====
-  * Insufficient training examples (usually included/advanced records) to learn from
+At any time, you may wish to view how the screening model is performing. To view the model performance, navigate to Nest Settings --> Screening Model --> View Screening Model.
-  * Data not available to the model (e.g. screening with a full text article, missing abstract)
-  * Weak signal amongst available predictors against protocol
-**For these reasons, we recommend using the model to augment your screening workflow, not fully automate it. **
+{{ :undefined:screenshot_2024-07-24_at_18.13.41.png?nolink |}}
-How can it augment your screening?
+This will display a histogram under the "Predictions" tab, a table of Cross Validation statistics showing the history of previous trainings of the model, and an explanation of how to interpret these values. Note: the history of trained models is displayed for informational purposes only; these are not versions that can be reverted to. Retraining the model does not guarantee improved statistics and performance.
-  * Excluding clearly low-relevancy records
+You can also view the Robot Screener recommendations in the Screening model modal. Select "Advance"/"Include" to view studies the Robot has advanced/included, or "Exclude" for excluded studies-- these are both shortcuts that take you to Study Inspector to show you the corresponding subsets of studies. Otherwise, the filter can be manually added from Study Inspector. From this modal, you can also delete the model if you wish to start again from scratch.
-  * Raising high relevancy records to reviewers
-**Our model errs towards including/advancing irrelevant records over excluding relevant records.**  In statistical terminology, the model aims to achieve high recall. In a review, it is far more costly to exclude a relevant study. Once excluded, reviewers are unlikely to reconsider a record. In contrast, an included/advanced study will be revisited multiple times later in the review, more readily allowing an incorrect include/advance decision to be corrected.
+With Predictions toggled:
-==== Testing out the model ====
+{{ :undefined:screenshot_2024-09-23_at_12.56.51.png?nolink |}}
-In an internal study, Nested Knowledge ran the model across several hundred SLR projects, finding the following cumulative accuracy statistics:
+With Cross Validation toggled:
-=== Standard Screening ===
+{{ :undefined:screenshot_2024-09-23_at_12.57.00.png?nolink |}}
-  * Area Under the [Receiver Operating Characteristic] Curve (AUC): 0.88
+[[wiki:autolit:screening:model#interpreting_the_model|Learn more about interpreting the model, its performance and how it works here.]]
-  * Classification Accuracy: 0.92
-  * Recall: 0.76
-  * Precision: 0.40
-  * F1: 0.51
-=== Two Pass Screening ===
+----
-In two pass screening, the model predicts advancement of a record from abstract screening to full text screening. Given that advancement rates are typically higher than inclusion rates, the model has more positive training examples, and demonstrates improved recall.
-  * AUC: 0.88
+==== Improving Robot Screener ====
-  * Classification Accuracy: 0.93
-  * Recall: 0.81
-  * Precision: 0.44
-  * F1: 0.56
-Following our philosophy, recall is relatively higher than precision: the model suggests inclusion/advancement of a larger amount of relevant records, at the cost of suggesting inclusion of some irrelevant records. Due to class imbalance, the model scores a 90%+ classification accuracy, predominantly consisting of correct exclusion suggestions.
+The best way to improve Robot Screener is to adjudicate records, since these are the decisions it trains on. For best model performance, we recommend that, if you can, you have your adjudicator make their final decisions on the Adjudicate Screening page after every 50 studies are screened. For reference, the following is what adjudicators will see for records that have one human and one Robot Screener decision applied:
-For comparison purposes, our study found human reviewer recall (relative to the adjudicated decision) was 85% in the average nest. Our models are within 4 & 9 points of human performance on this most critical measure.
+{{ :undefined:screenshot_2024-07-24_at_18.18.39.png?nolink |}}
-==== Analyzing Your Nest ====
+----
+===== Robot Screener Validation Studies =====
-When you train a new model, we generate k-fold cross validation performance measures using the same model hyperparameters the final model is trained with. These performance measures typically provide a lower bound on the performance you can expect from the model on records not yet screened in your nest. High recall (70%+) suggests that your review is less likely to be missing relevant records at the end of screening. High AUC (.8+) suggests that your model is effectively discerning between included and excluded records.
+Robot Screener has been validated in several published studies assessing its decisions in comparison to human decisions across multiple reviews and review types.
-While we cannot guarantee performance improvement, below is some rough empirical data for how you might expect performance measures to improve as you screen more records in your nest.
+  * [[https://www.ispor.org/heor-resources/presentations-database/presentation/intl2024-3900/136092|Internal]] validation: Nested Knowledge assessed the Recall and Precision of Robot Screener in 19 projects with over 100,000 cumulative decisions, finding significantly lower Precision than humans (that is, humans correctly exclude studies more often) but significantly higher Recall-- meaning that the Robot Screener misses fewer includable records. In this analysis, Robot Screener was found to have 97.1% Recall.
+  * [[https://www.ispor.org/heor-resources/presentations-database/presentation/intl2024-3896/139709|External]] validation: Cichewicz et al. assessed diagnostic accuracy across many metrics in 15 projects, finding Robot Screener had significantly lower Precision than humans and no statistical differences in Recall between Robot Screener and humans. Robot Screener had fewer overall False Negatives, but no significant differences were found.
+  * Estimates of [[https://about.nested-knowledge.com/2023/11/10/the-data-is-in-deciding-when-to-automate-screening-in-your-slr/|time savings]] using different modes of Robot Screener have been previously published online.
-=== Timing of Model Training ===
+You can see a deeper summary of the Validation Studies and their implications [[https://about.nested-knowledge.com/2024/05/28/validation-summary-robot-screeners-performance-in-screening-records-for-systematic-literature-review-and-health-technology-assessment/|here]].
-In general, as you screen more records, the better the model will perform. Of course, you want to use the model before you’ve screened every record!
-To provide the model with sufficient information to begin understanding your review, we require 50 total screens and 10 inclusions/advancements. At that point, we recommend checking model performance (see above) to evaluate performance.
-As the graph below shows, AUC and recall can grow on a relatively sharp curve early in your review. The curve begins to flatten around 20-30% of records screened, which is where we typically begin to recommend the use of Robot Screener in Dual screening modes.
-{{  :undefined:auc.png?nolink&  }}
-===== How the Screening Model Works =====
-At a high level, the model is a Decision Tree- a series of Yes/No questions about characteristics of records that lead to different probabilities of inclusion/advancement.
-In more detail, the model is a gradient-boosted decision tree ensemble. Its hyperparameters, particularly around model complexity (number of trees, tree depth) are optimized using a cross validation grid search. The model produces posterior probabilities and is optimized on logistic loss. SMOTE oversampling is employed as a correction to highly imbalanced classes frequently seen in screening.
-==== What data does the model use? ====
-The model uses the following data from your records as inputs:
-  * Bibliographic data
-      * Time since publication of the record
-      * Page count
-      * Keywords/Descriptors
-  * Abstract Content
-      * N-grams
-      * OpenAI text embedding (ada-002)
-  * Citation Counts from Scite, accessed using the DOI
-      * Number of citing publications
-      * Number of supporting citation statements
-      * Number of contrasting citation statements
-Often some of this data will be missing for records; it is imputed as if the record is approximately typical to other records in the nest.
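
The minimum requirement described under "Meeting the Threshold" (50 adjudicated screening decisions with 10 advancements/inclusions) can be sketched as a simple check. This is only an illustration of the rule as worded on the page, not Nested Knowledge's implementation: the names ''NestProgress'', ''can_enable_robot_screener'' and ''shortfall'' are hypothetical, and it assumes the 10 advancements/inclusions are counted among the 50 adjudicated decisions.

```python
from dataclasses import dataclass

# Thresholds quoted in the page text. The constant names are hypothetical.
MIN_ADJUDICATED = 50   # total adjudicated screening decisions
MIN_POSITIVE = 10      # adjudicated advancements or inclusions


@dataclass
class NestProgress:
    """Hypothetical summary of adjudication progress in a nest."""
    adjudicated: int   # all adjudicated screening decisions so far
    positive: int      # those adjudicated as advance/include


def can_enable_robot_screener(progress: NestProgress) -> bool:
    """Return True when both minimum requirements are met."""
    return (progress.adjudicated >= MIN_ADJUDICATED
            and progress.positive >= MIN_POSITIVE)


def shortfall(progress: NestProgress) -> tuple[int, int]:
    """Return how many (adjudicated, positive) decisions are still missing."""
    return (max(0, MIN_ADJUDICATED - progress.adjudicated),
            max(0, MIN_POSITIVE - progress.positive))


if __name__ == "__main__":
    nest = NestProgress(adjudicated=42, positive=3)
    print(can_enable_robot_screener(nest))
    print(shortfall(nest))
```

A nest with many adjudicated exclusions but fewer than 10 advancements/inclusions still fails the check, which matches the page's point that both numbers must be reached before the toggle becomes available.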
