Lookup Table Generation for FANTASIA

This document describes the pipelines available for populating the functional lookup table used in FANTASIA. Two complementary pathways are supported:

  1. UniProt-based standard import, which pulls accessions and metadata using APIs or curated CSVs.

  2. Custom annotation ingestion, useful for local or third-party datasets with manually curated annotations and FASTA sequences.

Standard Accession Import

The AccessionManager class provides two standard methods to initialize protein accessions:

AccessionManager(conf).fetch_accessions_from_api()

Fetches accessions directly from UniProtKB using the search_criteria defined in the YAML configuration. The query must be a valid UniProt search string. For example:

search_criteria: '(structure_3d:true)'

To restrict results to experimentally validated proteins with GO annotations, use:

search_criteria: '(go_exp:* OR go_ida:* OR go_ipi:* OR go_imp:* OR go_igi:* OR go_iep:* OR go_tas:* OR go_ic:*)'
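A query like the one above can also be assembled programmatically, so that it stays in sync with a list of evidence codes. The following is a minimal sketch; the helper name build_go_evidence_query is hypothetical and not part of FANTASIA, and the result would still need to be written into the YAML configuration as search_criteria.

```python
# Hypothetical helper (not part of FANTASIA): builds a search string with
# one `go_<code>:*` clause per evidence code, joined with OR.
EXPERIMENTAL_CODES = ["EXP", "IDA", "IPI", "IMP", "IGI", "IEP", "TAS", "IC"]


def build_go_evidence_query(codes):
    """Return a parenthesised OR-query over the given GO evidence codes."""
    if not codes:
        raise ValueError("at least one evidence code is required")
    clauses = [f"go_{code.lower()}:*" for code in codes]
    return "(" + " OR ".join(clauses) + ")"


search_criteria = build_go_evidence_query(EXPERIMENTAL_CODES)
print(search_criteria)
```

Passing a shorter list, for example ["EXP", "IDA", "IPI", "IMP"], yields a narrower query with the same structure.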

Alternatively, accessions can be loaded from a user-provided CSV:

AccessionManager(conf).load_accessions_from_csv()

This requires the configuration to define the file path and the relevant column:

load_accesion_csv: ../data/sample.csv
load_accesion_column: uniprot_id

This mode is recommended for predefined accession lists or curated datasets.
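Before running the import, it can help to check that the CSV actually contains the configured column. The sketch below uses only the Python standard library; read_accessions is a hypothetical helper, and the de-duplication it performs is an assumption for illustration, not a description of what AccessionManager does internally.

```python
import csv
import io


def read_accessions(csv_text, column):
    """Return the non-empty, de-duplicated values of `column`, in file order."""
    reader = csv.DictReader(io.StringIO(csv_text))
    if reader.fieldnames is None or column not in reader.fieldnames:
        raise KeyError(f"column {column!r} not found in CSV header")
    seen = set()
    accessions = []
    for row in reader:
        # Short rows yield None for missing fields, so fall back to "".
        value = (row[column] or "").strip()
        if value and value not in seen:
            seen.add(value)
            accessions.append(value)
    return accessions


# Inline sample standing in for the file referenced by load_accesion_csv.
sample = "uniprot_id,name\nP12345,first\nQ67890,second\nP12345,duplicate\n"
accessions = read_accessions(sample, "uniprot_id")
print(accessions)
```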

Post-processing Steps

Once accessions are available, metadata and protein representations are generated using:

UniProtExtractor(conf).start()
SequenceEmbeddingManager(conf).start()

These modules:

  • Download protein sequences and metadata from UniProt.

  • Generate embeddings using selected protein language models.

Embedding Model Selection

Available embedding models are defined under embedding.types in the YAML configuration:

embedding:
  types:
    - 1  # ESM: Evolutionary Scale Modeling (Meta AI)
    - 2  # ProstT5: Structural Transformer T5-based (Rostlab)
    - 3  # ProtT5: Protein Transformer T5-based (Rostlab)
    - 4  # Ankh3: Contextual residue embedding model (Ankh v3)

Multiple models may be activated simultaneously. Batch sizes for queueing and inference are controlled via:

batch_size: 1
batch_size_embedding: 1
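As an illustration of how these settings might be interpreted, the sketch below resolves the numeric type IDs to the model names listed above and splits a list of items into batches. Both helpers are hypothetical, and the batching shown is an assumption about what a batch size means here, not FANTASIA's actual queueing logic.

```python
# Model IDs and names as listed under embedding.types above.
EMBEDDING_MODELS = {1: "ESM", 2: "ProstT5", 3: "ProtT5", 4: "Ankh3"}


def resolve_embedding_types(type_ids):
    """Map configured type IDs to model names, rejecting unknown IDs."""
    unknown = [t for t in type_ids if t not in EMBEDDING_MODELS]
    if unknown:
        raise ValueError(f"unknown embedding type IDs: {unknown}")
    return [EMBEDDING_MODELS[t] for t in type_ids]


def chunk(items, batch_size):
    """Yield consecutive slices of `items` with at most `batch_size` elements."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]


models = resolve_embedding_types([3, 4])
batches = list(chunk(["seq_a", "seq_b", "seq_c"], 2))
print(models, batches)
```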

Annotation Filtering by Evidence

FANTASIA supports filtering GO annotations by their GO evidence codes, as recorded in UniProt. To retain only experimentally supported annotations:

allowed_evidences:
  - EXP  # Inferred from Experiment
  - IDA  # Inferred from Direct Assay
  - IPI  # Inferred from Physical Interaction
  - IMP  # Inferred from Mutant Phenotype
  - IGI  # Inferred from Genetic Interaction
  - IEP  # Inferred from Expression Pattern
  - TAS  # Traceable Author Statement
  - IC   # Inferred by Curator

If the list is left empty ([]), all annotations are imported regardless of their evidence code.
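The filtering rule, including the empty-list case, can be sketched as follows. The function filter_by_evidence and the (GO term, evidence code) tuple layout are assumptions made for illustration; they do not reflect FANTASIA's internal data model.

```python
def filter_by_evidence(annotations, allowed_evidences):
    """Keep (go_id, evidence_code) pairs whose code is in the allow-list.

    An empty allow-list keeps every annotation.
    """
    if not allowed_evidences:
        return list(annotations)
    allowed = set(allowed_evidences)
    return [(go_id, code) for go_id, code in annotations if code in allowed]


annotations = [
    ("GO:0008150", "IDA"),
    ("GO:0003674", "IEA"),  # electronic annotation, not in the list above
    ("GO:0005575", "TAS"),
]
experimental_only = filter_by_evidence(annotations, ["EXP", "IDA", "IPI", "IMP"])
everything = filter_by_evidence(annotations, [])
print(experimental_only)
```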

Custom Annotation via GOAnnotationsQueueProcessor

FANTASIA also supports local datasets or third-party annotations via the GOAnnotationsQueueProcessor class.

Requirements:

  • A tab-separated annotation file (goa_annotations_file) with one protein per line: the protein identifier, a tab, and a comma-separated list of GO terms:

    PROT_ID_001    GO:0008150,GO:0003674,GO:0005575
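A line in this format can be parsed with a few lines of Python. This is a minimal sketch, assuming exactly one tab between the identifier and the GO term list; parse_annotation_line is a hypothetical helper, not the parser used by GOAnnotationsQueueProcessor.

```python
def parse_annotation_line(line):
    """Split 'PROT_ID<TAB>GO:...,GO:...' into (protein_id, [go_terms])."""
    protein_id, _, go_field = line.strip().partition("\t")
    if not protein_id or not go_field:
        raise ValueError(f"malformed annotation line: {line!r}")
    go_terms = [term.strip() for term in go_field.split(",") if term.strip()]
    return protein_id, go_terms


protein_id, go_terms = parse_annotation_line(
    "PROT_ID_001\tGO:0008150,GO:0003674,GO:0005575\n"
)
print(protein_id, go_terms)
```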
    

Execution:

GOAnnotationsQueueProcessor(conf).start()

This module performs the following steps internally:

  1. Parses each protein entry and its GO terms.

  2. Retrieves the protein sequence from UniProt.

  3. Stores or updates the protein, its sequence, and its GO terms, assigning a default evidence code (“UNKNOWN”) to each annotation.

Configuration Summary

Depending on the selected mode, the YAML configuration must include the appropriate keys. Only one mode should be active per execution.

# --- Mode 1: Standard UniProt Search (API query) ---
# Triggered by: AccessionManager(conf).fetch_accessions_from_api()
search_criteria: '(go_exp:* OR go_ida:* OR go_ipi:* OR go_imp:*)'
tag: HUMAN_SEARCH
allowed_evidences:
  - EXP
  - IDA
  - IPI
  - IMP
embedding:
  types: [3, 4]     # e.g. ProtT5, Ankh3
  batch_size: 1

# --- Mode 2: CSV-based Custom Dataset ---
# Triggered by: AccessionManager(conf).load_accessions_from_csv()
load_accesion_csv: ../data/sample.csv
load_accesion_column: uniprot_id
fasta_path: ../data/sequences.fasta
tag: CUSTOM_DATASET
allowed_evidences: [EXP, IDA, IPI, IMP]
embedding:
  types: [3, 4]
  batch_size: 1

# --- Mode 3: GOA File with Local Annotations ---
# Triggered by: GOAnnotationsQueueProcessor(conf).start()
goa_annotations_file: ../data/custom_go_annotations.tsv
limit_execution: 1000  # Optional
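Because only one mode should be active per execution, a small pre-flight check on the loaded configuration can catch mixed setups early. The sketch below assumes the configuration has already been loaded into a plain dict and that each mode is identified by the key shown in the blocks above; detect_mode and the mode labels are hypothetical.

```python
# Assumed mapping from an illustrative mode label to the key that marks it.
MODE_KEYS = {
    "api_search": "search_criteria",
    "csv_import": "load_accesion_csv",
    "goa_file": "goa_annotations_file",
}


def detect_mode(conf):
    """Return the single active mode, or raise if zero or several are set."""
    active = [mode for mode, key in MODE_KEYS.items() if conf.get(key)]
    if len(active) != 1:
        raise ValueError(f"expected exactly one active mode, found: {active}")
    return active[0]


conf = {"goa_annotations_file": "../data/custom_go_annotations.tsv",
        "limit_execution": 1000}
mode = detect_mode(conf)
print(mode)
```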

Execution Flow

The following illustrates the high-level execution logic, depending on the selected mode:

# --- Mode 1 ---
AccessionManager(conf).fetch_accessions_from_api()
UniProtExtractor(conf).start()

# --- Mode 2 ---
AccessionManager(conf).load_accessions_from_csv()
UniProtExtractor(conf).start()

# --- Mode 3 ---
GOAnnotationsQueueProcessor(conf).start()

# Common to all modes
SequenceEmbeddingManager(conf).start()

Define the keys for the selected mode in your YAML file before running, and do not mix multiple modes in a single execution context.
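The per-mode flow above can be summarised as a lookup from mode number to an ordered list of steps. The sketch below only returns the step names as strings and does not import or call any FANTASIA classes; plan_steps is a hypothetical helper for illustration.

```python
# Mode-specific steps, followed by the step common to all modes.
PIPELINES = {
    1: ["AccessionManager.fetch_accessions_from_api", "UniProtExtractor.start"],
    2: ["AccessionManager.load_accessions_from_csv", "UniProtExtractor.start"],
    3: ["GOAnnotationsQueueProcessor.start"],
}
COMMON_STEPS = ["SequenceEmbeddingManager.start"]


def plan_steps(mode):
    """Return the ordered step names for the given mode number."""
    if mode not in PIPELINES:
        raise ValueError(f"unknown mode: {mode!r}")
    return PIPELINES[mode] + COMMON_STEPS


for step in plan_steps(1):
    print(step)
```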