Identify diagnostic variants in children with rare disease from the Rare Genomes Project (RGP)

Challenge: RGP-vcf

Genome data: encrypted, for registered users only

Special considerations: institutional signature required for participation. See below for instructions.

Last updated: 20 October 2025

This challenge preview is posted below. The challenge closes on February 28, 2026. 

How to participate in CAGI7?

Download data & submit predictions on Synapse

Make sure you understand our Data Use Agreement and Anonymity Policy

Summary 

The Rare Genomes Project (RGP) is a direct-to-participant research study on the utility of genome sequencing for rare disease diagnosis and gene discovery, led by genomics experts and clinicians at the Broad Institute of MIT and Harvard. In this challenge, variants from short-read genome sequencing data and phenotype data from a subset of the solved and unsolved RGP families are provided. Participants in the challenge (predictors) will try to identify the causal variant(s) in each proband. For the unsolved probands, prioritized variants from the participating teams will be examined to see if additional genetic diagnoses can be made.

Background 

One major obstacle facing rare disease patients is simply obtaining a genetic diagnosis (Rehm, 2017). The average “diagnostic odyssey” for rare disease families lasts more than five years, and over 50% of rare disease patients still lack a genetic diagnosis (Wojcik et al., 2024). One well-recognized obstacle to diagnosis is technical: the limitations of the testing method determine whether the disease-causing variants can be detected at all. Here, we present short-read genome sequencing data, which cover coding and noncoding regions and allow better variant calling for frameshifts and variants in high-GC-content regions than exome sequencing does. This challenge addresses the analytical obstacle to diagnosis and seeks to evaluate the ability of participants to recognize pathogenic variation within a sea of benign variation. Current standards for variant interpretation have been defined by the American College of Medical Genetics and Genomics and the Association for Molecular Pathology (ACMG/AMP; Richards et al., 2015) and refined by ClinGen.

The Rare Genomes Project 

The Rare Genomes Project (RGP) is a direct-to-participant research study launched to discover the impact of genome sequencing on rare disease diagnosis, improve access to genomic research, contribute to novel gene discovery, and assess the impact of a genetic diagnosis. The RGP team consists of molecular diagnosticians, genetic counselors, clinicians, genomic analysts, computational biologists, software engineers, and project managers. Families in the United States with undiagnosed, but suspected, Mendelian/monogenic diseases apply directly online to the study. Applications are reviewed by the clinical team to confirm a reasonable suspicion for monogenic disease. Research subjects are consented for genomic sequencing and the sharing of their sequence and phenotype information with researchers working to understand the molecular causes of rare disease. When variants of clear or potential diagnostic relevance are identified, the candidate variants are clinically validated and returned to participants via their local physicians.

Prediction challenge

The prediction challenge involves 64 families split into three sets: an Example Set (14 families), a Test Set (30 families), and a Discovery Set (20 families). Descriptions of these sets are included below. The majority of families are trios or quads, which consist of a proband, both biological parents, and possibly an affected sibling. This challenge also includes several duos or proband-only families. The clinical phenotype for each proband is provided in the form of Human Phenotype Ontology (HPO) nomenclature (Robinson & Mundlos, 2010). A machine-readable compilation of HPO terms and basic patient information is provided in GA4GH phenopacket format (Jacobsen et al., 2022). The diversity of phenotypes in the dataset represents the range of clinical presentations routinely seen in patients referred for genetic testing, and most of the individuals who participate in RGP have had prior negative genetic testing. Participants in this CAGI challenge are asked to provide a genetic diagnosis for as many probands as possible; that is, to identify one or more causal variants for each proband.

  • The Example Set of 14 families in the challenge consists of “solved” probands and available family members, where causal variants are provided. This set can be used to assess pipeline performance in preparing for making predictions on the Test and Discovery Sets. 
  • The Test Set of 30 families in the challenge consists of a mix of “solved” and “unsolved” probands as determined by the Broad Institute’s RGP team and, when possible, also confirmed by the local clinician. The CAGI organizers are not disclosing the numbers of solved and unsolved families to allow participants to perform the task in a manner that accurately reflects real clinical case analysis. The “solved” probands in the Test Set will be used to assess the performance of each challenge participant. Causal variants span a range of difficulty and both dominant and recessive inheritance patterns. 
  • The Discovery Set of 20 families in the challenge consists of “unsolved” probands that have been analyzed by the Broad Institute’s RGP team. Assessing this set is optional; its purpose is to identify new diagnoses.

Unsolved families have been included in the challenge with the objective of enabling the CAGI community to identify new potentially causal variants. Top variant candidates in both the Test Set and Discovery Set are likely to undergo further experimental and clinical evaluation, potentially leading to results that will be returned to the patients. This process led to establishing genetic diagnoses for two “unsolved” cases in the RGP challenge of CAGI6 (Stenton et al., 2024).  

The genomic data were obtained by sequencing DNA purified from blood. Sequencing was performed by the Broad Institute Genomics Platform on an Illumina sequencer to ~30x depth. Raw sequence reads were aligned to a reference genome (GRCh38) and variant calling was completed with DRAGEN v3.7.8. Sequence results consist of variant calls in the form of single nucleotide variants (SNVs) and small insertions/deletions (indels) within a joint variant call file (VCF). Structural variants, mitochondrial genome variants, and tandem repeat expansions are not included in this dataset/challenge.

Prediction submission format 

The prediction submission is a tab-delimited text file. For each proband, rank the proposed causal variant(s) (CVs) responsible for the phenotype, one variant or pair of proposed compound heterozygous variants per line. Each CV should be associated with an estimated probability of causal relationship (EPCR) value, a real number greater than 0 and at most 1 indicating the degree of certainty in the causality of the variant(s). For each proband, variants should be ranked in descending order of provided EPCR values.

Each line should be in the following format: PROBANDID:CHROM:POS:REF:ALT:EPCR for single causative variants (including homozygous variants) or PROBANDID:CHROM:POS:REF:ALT;CHROM:POS:REF:ALT:EPCR for proposed compound heterozygous recessive etiologies, where PROBANDID = proband identifier that includes family ID. 

The EPCR values provided are important for assessment (see section Assessment below). The EPCR scores should reflect, as closely as possible, the probability that the variant or variants are causal, and should be in the (0,1] range, meaning greater than zero and less than or equal to one.
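As an illustration of the line format described above, the following minimal sketch builds one submission line for either a single variant or a compound heterozygous pair. It assumes that the colon- and semicolon-separated layout quoted above is literal; the function name, the proband identifier, and the variants are hypothetical, and the example submission file provided with the challenge remains the reference for the exact layout.

```python
def format_cv_line(proband_id, variants, epcr):
    """Build one submission line (layout assumed from the challenge text).

    variants: a list with one (CHROM, POS, REF, ALT) tuple for a single
    causal variant, or two tuples for a proposed compound heterozygous pair.
    """
    if len(variants) not in (1, 2):
        raise ValueError("expected one variant or a compound heterozygous pair")
    if not 0.0 < epcr <= 1.0:
        raise ValueError("EPCR must be greater than 0 and at most 1")
    # Variants in a pair are joined with ';', fields within a variant with ':'.
    variant_part = ";".join(f"{c}:{p}:{r}:{a}" for c, p, r, a in variants)
    return f"{proband_id}:{variant_part}:{epcr}"


# Hypothetical identifiers and coordinates, for illustration only.
single = format_cv_line("FAM1_1", [("chr1", 12345, "A", "G")], 0.9)
pair = format_cv_line(
    "FAM1_1", [("chr2", 100, "C", "T"), ("chr2", 200, "G", "GA")], 0.5
)
```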

Each variant CHROM:POS:REF:ALT must be listed using the human genome build GRCh38. Up to 100 CV candidates (lines) can be listed for each proband from the Test Set and Discovery Set, provided in a combined submission file for each model or tool used by the predictors. Only the Test Set CV candidates will be used for assessment. An example submission file and a validation script are provided for the predictors to check the correctness of the format before submitting predictions.
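Before running the provided validation script, a team may wish to run a quick local check of the constraints stated above. The sketch below is not the official validation script; it is a hypothetical helper that assumes the colon-separated line layout quoted earlier and checks only three things: that each line parses, that each EPCR lies in (0,1], and that each proband has at most 100 lines in descending EPCR order.

```python
from collections import defaultdict

MAX_LINES_PER_PROBAND = 100  # limit stated in the challenge text


def check_submission(lines):
    """Return a list of human-readable problems found in submission lines."""
    problems = []
    epcrs = defaultdict(list)
    for n, line in enumerate(lines, start=1):
        line = line.strip()
        if not line:
            continue
        try:
            # EPCR is the last field; the proband ID is the first.
            body, epcr_text = line.rsplit(":", 1)
            proband_id, variant_part = body.split(":", 1)
            epcr = float(epcr_text)
        except ValueError:
            problems.append(f"line {n}: cannot parse")
            continue
        variants = variant_part.split(";")
        if len(variants) > 2 or any(len(v.split(":")) != 4 for v in variants):
            problems.append(f"line {n}: expected CHROM:POS:REF:ALT, optionally paired")
        if not 0.0 < epcr <= 1.0:
            problems.append(f"line {n}: EPCR {epcr} outside (0, 1]")
        epcrs[proband_id].append(epcr)
    for proband_id, values in epcrs.items():
        if len(values) > MAX_LINES_PER_PROBAND:
            problems.append(f"{proband_id}: more than {MAX_LINES_PER_PROBAND} lines")
        if values != sorted(values, reverse=True):
            problems.append(f"{proband_id}: EPCR values are not in descending order")
    return problems
```

An empty list means that none of these particular checks failed; the organizers' validation script should still be treated as authoritative.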

File naming

CAGI allows submission of up to six models per team, of which model number 1 is considered primary. You can upload predictions for each model multiple times during the submission window; the last submission before the deadline will be evaluated for each model.

Use the following format for your submissions: <teamname>_model_(1|2|3|4|5|6).(tsv|txt)

Example: if your team’s name is “bestincagi” and you are submitting predictions for your model number 3, your filename should be bestincagi_model_3.txt.
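The naming pattern above can be checked mechanically. The following sketch is one possible check; it assumes that a team name may be any non-empty run of characters without path separators, which the challenge text does not specify, and the function name is hypothetical.

```python
import re

# Assumption: the team name is any non-empty text without '/' or '\'.
SUBMISSION_NAME = re.compile(
    r"^(?P<team>[^/\\]+)_model_(?P<model>[1-6])\.(?P<ext>tsv|txt)$"
)


def parse_submission_name(filename):
    """Return (team name, model number), or None if the name does not match."""
    match = SUBMISSION_NAME.match(filename)
    if match is None:
        return None
    return match.group("team"), int(match.group("model"))
```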

Each model must include a detailed description of your method(s). Results for a model will not be assessed without adequate information on how the model works. Please use the following filename for model descriptions: <teamname>_desc_model_(1|2|3|4|5|6).* If applicable, include any publicly available databases used (with version or access date) and selection thresholds (e.g., for variant quality, frequency, functional consequence, and so on). 

Additionally, at the time of submission, predictors will be asked to complete a brief survey about their method(s) and submission file(s). This will include: (i) reporting whether the submission file is the automated output of the computational model (preferred) or has undergone downstream manual review and curation (allowed, but must be reported); (ii) reporting whether training of the model used only publicly available data (preferred) or also used proprietary data (allowed, but must be reported; if proprietary data were used, please also include, where possible, a submission file from your model when trained on only publicly available data); (iii) reporting whether the model is able to output proposed pairs of compound heterozygous CV candidates or only single CV candidates (this will be taken into consideration in the assessment of performance in detecting true positive recessive diagnoses); (iv) reporting an estimate of run time and cost for the model (if possible); and (v) reporting the model/platform support for different inputs and analyses. Additional data may be requested concerning the method or features of the tools used in developing predictions.

The results from this challenge must remain confidential until the challenge is completed. Please do not mark these cases as solved in any systems that share variant data, to avoid introducing information leakage across teams and bias in the challenge assessment.

Example set 

Short-read genome sequence data with phenotype data and genetic diagnosis from 14 solved families from the RGP CAGI6 challenge are provided for training purposes (joint VCF and metadata csv file or GA4GH phenopacket). The correct CVs can be found in the Example Set metadata or the example submission file. Take note that this example dataset will have a higher proportion of publicly accessible causal variants (e.g., variants in ClinVar) due to the extended time since discovery.
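For teams reading the phenopackets programmatically, the sketch below extracts the observed HPO terms from one phenopacket. It assumes the JSON serialization of the GA4GH Phenopacket schema version 2, in which phenotypes are listed under phenotypicFeatures and absent phenotypes are flagged with excluded; the files provided with the challenge may differ in version or layout. The function name and the toy record are hypothetical.

```python
import json


def observed_hpo_terms(phenopacket_json):
    """Return (subject ID, [(HPO ID, label), ...]) for observed phenotypes."""
    packet = json.loads(phenopacket_json)
    terms = []
    for feature in packet.get("phenotypicFeatures", []):
        # Features marked as excluded describe phenotypes that were NOT observed.
        if feature.get("excluded", False):
            continue
        term = feature.get("type", {})
        terms.append((term.get("id"), term.get("label")))
    return packet.get("subject", {}).get("id"), terms


# Toy record for illustration only; not taken from the challenge data.
toy = json.dumps({
    "id": "example-packet",
    "subject": {"id": "FAM1_1"},
    "phenotypicFeatures": [
        {"type": {"id": "HP:0001250", "label": "Seizure"}},
        {"type": {"id": "HP:0000252", "label": "Microcephaly"}, "excluded": True},
    ],
})
subject_id, hpo_terms = observed_hpo_terms(toy)
```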

Test set 

Short-read genome sequencing data with phenotype data from a mix of 30 solved and unsolved RGP families are provided. The prediction submission file on these 30 families will be used for assessment.

Discovery set 

Short-read genome sequence data with phenotype data from 20 unsolved rare disease families from RGP are provided as an optional challenge. CVs submitted on this set may be assessed for diagnostic fit and returned to the family. Providing a prediction submission on this set will improve access to genomic research for many undiagnosed families.

Handling VCF files 

A number of tools and libraries exist for navigating and manipulating VCF files. To split joint-called VCF files into individual samples, options include the bcftools view or GATK SelectVariants commands.
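To make concrete what splitting a joint VCF by sample involves, here is a minimal pure-Python sketch. It is an illustration rather than a replacement for bcftools or GATK: it assumes an uncompressed, tab-delimited VCF with GT as the first FORMAT field, keeps only the requested sample column, and drops sites where that sample carries no alternate allele. It does not trim unused ALT alleles or recompute INFO fields. The function name and sample names are hypothetical.

```python
def extract_sample(vcf_lines, sample):
    """Yield VCF lines restricted to one sample's non-reference genotypes."""
    col = None
    for line in vcf_lines:
        line = line.rstrip("\n")
        if line.startswith("##"):
            yield line  # meta-information lines pass through unchanged
        elif line.startswith("#CHROM"):
            header = line.split("\t")
            col = header.index(sample)  # raises ValueError if sample is absent
            yield "\t".join(header[:9] + [sample])
        elif line:
            fields = line.split("\t")
            genotype = fields[col].split(":")[0]  # GT is assumed to come first
            alleles = genotype.replace("|", "/").split("/")
            # Keep the site only if at least one allele is a called ALT allele.
            if any(a not in ("0", ".") for a in alleles):
                yield "\t".join(fields[:9] + [fields[col]])


toy_vcf = [
    "##fileformat=VCFv4.2",
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\tFORMAT\tP1\tMOM",
    "chr1\t100\t.\tA\tG\t50\tPASS\t.\tGT\t0/1\t0/0",
    "chr1\t200\t.\tC\tT\t50\tPASS\t.\tGT\t0/0\t0/1",
]
proband_only = list(extract_sample(toy_vcf, "P1"))
```

For the full challenge files, the dedicated tools named above are the safer choice because they handle compression, indexing, and allele trimming.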

Assessment 

Evaluation of the Test Set submission file will resemble that from the CAGI6 Rare Genomes Project Challenge (Stenton et al., 2024), but additional strategies will also evaluate the quality of calibrated EPCR values in each submission.

Predictions will be assessed by independent assessors, blinded to the identity of the teams and the methods. It is important to mention that causal variants in the solved probands may not be known with absolute certainty. The answer key used in our assessment therefore reflects the best of our team’s abilities to identify causal variants by applying available evidence and following current clinical field standards. 

For previously identified CVs (true positives in the Test Set), assessors will review how often these were the top-ranked variant(s). A weighted score based on these rank positions will be used as an indicator of model performance. The EPCR values will also be used to assess how often a model identifies true positive CVs versus false positive CVs across a set of EPCR thresholds, to provide a measure of model sensitivity, specificity, and positive predictive value. Partial credit will be given for probands where only one of two causal variants in a gene is identified. Additional assessments will be made in an informed phase in order to appropriately weigh rankings with the information collected in the methods survey. The combined assessments will determine the top-performing teams. For the top-performing teams, genomic analysts on the RGP team will re-review the variants with calibrated EPCR values ≥0.1 for undiagnosed probands in the Test Set and, when provided, the Discovery Set, to see if they are diagnostic or merit further evaluation.
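Teams may find it useful to compute similar quantities on the Example Set while developing a pipeline. The sketch below is not the assessors' procedure: the rank weighting, the thresholds, and the handling of partial credit are not specified in this description, so the code only counts how often the known CV is ranked first and tallies true and false positives at a few arbitrary EPCR thresholds. Specificity is omitted because it requires a definition of true negatives. All names and variant keys are hypothetical.

```python
def top1_rate(predictions, truth):
    """Fraction of solved probands whose known CV has the highest EPCR.

    predictions: {proband: [(variant_key, epcr), ...]}
    truth: {proband: variant_key} for solved probands only
    """
    hits = 0
    for proband, causal in truth.items():
        calls = predictions.get(proband, [])
        if calls and max(calls, key=lambda c: c[1])[0] == causal:
            hits += 1
    return hits / len(truth) if truth else 0.0


def threshold_metrics(predictions, truth, thresholds=(0.1, 0.5, 0.9)):
    """Count true and false positives among calls at or above each threshold."""
    rows = []
    for t in thresholds:
        tp = fp = 0
        for proband, causal in truth.items():
            for variant, epcr in predictions.get(proband, []):
                if epcr >= t:
                    if variant == causal:
                        tp += 1
                    else:
                        fp += 1
        rows.append({
            "threshold": t,
            "tp": tp,
            "fp": fp,
            "sensitivity": tp / len(truth) if truth else 0.0,
            "ppv": tp / (tp + fp) if (tp + fp) else 0.0,
        })
    return rows
```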

The participating teams may want to ensure that at least one of their submissions normalizes the output scores over the listed variants (per proband) to sum to a value no greater than 1. Such predictions are likely to contain a smaller number of potentially causal variants and may be of interest for both the assessment and follow-up by the RGP team.
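One simple way to meet this per-proband constraint is to rescale the scores only when their sum exceeds 1, as in the minimal sketch below. This is one possible normalization among several, not a prescribed one; dividing by the total preserves the ranking and keeps every value in (0,1]. The function name and variant keys are hypothetical.

```python
def normalize_epcr(calls):
    """Scale one proband's EPCR values so that they sum to at most 1.

    calls: list of (variant_key, epcr) pairs for a single proband.
    """
    total = sum(epcr for _, epcr in calls)
    if total <= 1.0:
        return list(calls)  # already within the limit; leave unchanged
    return [(variant, epcr / total) for variant, epcr in calls]


scaled = normalize_epcr([("chr1:100:A:G", 0.9), ("chr2:5:C:T", 0.6)])
```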

Ethical considerations 

The data in this challenge are derived from individuals with rare, potentially undiagnosed, diseases and their close biological relatives, and may include families who are medically underserved. Participants sign up for RGP because they are interested in research to improve rare disease diagnosis, and activities like CAGI challenges are consistent with this goal. Identification of putative pathogenic variants (i.e., causal with respect to the clinical phenotype under investigation) may, if confirmed, be important for tailoring clinical interventions and obtaining services. Families have consented that data will be analyzed for findings related to the stated phenotype and that no active searching is conducted for variants unrelated to the rare condition in the family. Predictors are reminded to only submit causal variant candidates considered to be of relevance to the proband’s provided disease phenotype (i.e., do not submit secondary findings). We also remind all teams to ensure the ethical stewardship of challenge data by limiting the use of these data to CAGI challenge participants only. These data should be destroyed after the challenge, and no efforts shall be made by any parties to try to identify the source of the data.

Data Use Agreement: special considerations 

All teams participating in any CAGI challenge are required to sign and adhere to the CAGI Data Use Agreement. In this challenge, an institutional signature must additionally be provided in order to obtain individual-level de-identified data. All communication for providing institutional signatures must come from valid institutional email addresses, and the signed form must be emailed by a person who attests to having the authority to provide such a signature on behalf of the institution. To do so, please complete this form and email it to cagi@genomeinterpretation.org from a valid institutional email address (gmail, hotmail, etc. cannot be accepted). Finally, CAGI must follow United States export control laws when providing data to entities outside of the United States of America or those that are majority-owned by countries other than the United States.

Due to the listed special considerations, challenge data are provided in an encrypted form to approved participants. We expect to provide decryption passwords to approved participating teams using encrypted email communication (e.g., PGP), via end-to-end encrypted apps (e.g., Signal), or verbally by phone. Please allow extended time to complete institutional signatures and obtain approvals from CAGI organizers. In previous CAGI challenges, this process took anywhere from a few days to several weeks, depending on the preparedness of an institution to handle such requests.

Data provided by the RGP team including

Heidi Rehm, PhD, FACMG, RGP Principal investigator; Anne O’Donnell-Luria, MD, PhD, RGP Medical Director; Melanie O’Leary, CGC, RGP Project Lead; Stephanie DiTroia, PhD, Principal Genomic Analyst 

References 

Jacobsen JOB, et al. The GA4GH Phenopacket schema defines a computable representation of clinical data. Nat Biotechnol (2022) 40(6):817-820.

Rehm HL. Evolving health care through personal genomics. Nat Rev Genet (2017) 18(4):259-267.

Richards S, et al. Standards and guidelines for the interpretation of sequence variants: a joint consensus recommendation of the American College of Medical Genetics and Genomics and the Association for Molecular Pathology. Genet Med (2015) 17(5):405-424.

Robinson PN, Mundlos S. The human phenotype ontology. Clin Genet (2010) 77(6):525-534.

Stenton SL, et al. Critical assessment of variant prioritization methods for rare disease diagnosis within the rare genomes project. Hum Genomics (2024) 18(1):44.

Wojcik MH, et al. Genome sequencing for diagnosing rare diseases. N Engl J Med (2024) 390(21):1985-1997.

Revision history 

20 October 2025: challenge preview posted