Lookup Table Generation for FANTASIA
This document describes the available pipelines to populate the functional lookup table used in FANTASIA. Two complementary pathways are supported:
UniProt-based standard import, which pulls accessions and metadata using APIs or curated CSVs.
Custom annotation ingestion, useful for local or third-party datasets with manually curated annotations and FASTA sequences.
Standard Accession Import
The AccessionManager class provides two standard methods to initialize protein accessions:
AccessionManager(conf).fetch_accessions_from_api()
Fetches accessions directly from UniProtKB using the search_criteria defined in the YAML configuration. The query must be a valid UniProt search string. For example:
search_criteria: '(structure_3d:true)'
To restrict results to experimentally validated proteins with GO annotations, use:
search_criteria: '(go_exp:* OR go_ida:* OR go_ipi:* OR go_imp:* OR go_igi:* OR go_iep:* OR go_tas:* OR go_ic:*)'
Alternatively, accessions can be loaded from a user-provided CSV:
AccessionManager(conf).load_accessions_from_csv()
This requires the configuration to define the file path and the relevant column:
load_accesion_csv: ../data/sample.csv
load_accesion_column: uniprot_id
This mode is recommended for predefined accession lists or curated datasets.
Post-processing Steps
Once accessions are available, metadata and protein representations are generated using:
UniProtExtractor(conf).start()
SequenceEmbeddingManager(conf).start()
These modules:
Download protein sequence and metadata from UniProt.
Generate embeddings using selected protein language models.
Embedding Model Selection
Available embedding models are defined under embedding.types in the YAML configuration:
embedding:
types:
- 1 # ESM: Evolutionary Scale Modeling (Meta AI)
- 2 # ProSTT5: Structural Transformer T5-based (Ana Rojas Lab)
- 3 # ProtT5: Protein Transformer T5-based (EMBL/UniProt)
- 4 # Ankh3: Contextual residue embedding model (Ankh v3)
Multiple models may be activated simultaneously. Batch sizes for queueing and inference are controlled via:
batch_size: 1
batch_size_embedding: 1
Annotation Filtering by Evidence
FANTASIA supports filtering GO annotations based on UniProt evidence codes. To retain only experimentally supported annotations:
allowed_evidences:
- EXP # Inferred from Experiment
- IDA # Inferred from Direct Assay
- IPI # Inferred from Physical Interaction
- IMP # Inferred from Mutant Phenotype
- IGI # Inferred from Genetic Interaction
- IEP # Inferred from Expression Pattern
- TAS # Traceable Author Statement
- IC # Inferred by Curator
If the list is left empty ([]), all annotations will be imported regardless of quality.
Custom Annotation via GOAnnotationsQueueProcessor
FANTASIA also supports local datasets or third-party annotations via the GOAnnotationsQueueProcessor class.
Requirements:
A tab-separated annotation file (goa_annotations_file) with format:
PROT_ID_001 GO:0008150,GO:0003674,GO:0005575
Execution:
GOAnnotationsQueueProcessor(conf).start()
This module performs the following steps internally:
Parses each protein entry and its GO terms.
Retrieves the protein sequence from UniProt.
Stores or updates the protein, sequence, GO terms, and assigns a default evidence code (“UNKNOWN”).
Configuration Summary
Depending on the selected mode, the YAML configuration must include the appropriate keys. Only one mode should be active per execution.
# --- Mode 1: Standard UniProt Search (API query) ---
# Triggered by: AccessionManager(conf).fetch_accessions_from_api()
search_criteria: '(go_exp:* OR go_ida:* OR go_ipi:* OR go_imp:*)'
tag: HUMAN_SEARCH
allowed_evidences:
- EXP
- IDA
- IPI
- IMP
embedding:
types: [3, 4] # e.g. ProtT5, Ankh3
batch_size: 1
# --- Mode 2: CSV-based Custom Dataset ---
# Triggered by: AccessionManager(conf).load_accessions_from_csv()
load_accesion_csv: ../data/sample.csv
load_accesion_column: uniprot_id
fasta_path: ../data/sequences.fasta
tag: CUSTOM_DATASET
allowed_evidences: [EXP, IDA, IPI, IMP]
embedding:
types: [3, 4]
batch_size: 1
# --- Mode 3: GOA File with Local Annotations ---
# Triggered by: GOAnnotationsQueueProcessor(conf).start()
goa_annotations_file: ../data/custom_go_annotations.tsv
limit_execution: 1000 # Optional
Execution Flow
The following illustrates the high-level execution logic, depending on the selected mode:
# --- Mode 1 ---
AccessionManager(conf).fetch_accessions_from_api()
UniProtExtractor(conf).start()
# --- Mode 2 ---
AccessionManager(conf).load_accessions_from_csv()
UniProtExtractor(conf).start()
# --- Mode 3 ---
GOAnnotationsQueueProcessor(conf).start()
# Common to all modes
SequenceEmbeddingManager(conf).start()
Each configuration block must be properly defined in your YAML file. Do not mix multiple modes in a single execution context.