Adding Embedding Models

This document explains how to add a new sequence embedding model to the FANTASIA system. The system is modular: adding a model requires only a new entry in the constants file and a compatible Python module, which is then loaded dynamically.

Overview

Sequence embedding models are declared in the sequence_embedding_types section of the constants.yaml file. The system automatically syncs this configuration with the database table sequence_embedding_type, so you do not need to modify the database manually.

At runtime, the SequenceEmbeddingManager dynamically loads all declared models using Python’s importlib and the task_name defined for each embedding model.

Integration Steps

1. Declare the Model in constants.yaml

Edit the file config/constants.yaml and add a new entry under sequence_embedding_types:

sequence_embedding_types:
  - name: "MyModel"
    description: "Description of the new model"
    task_name: "mymodel"
    model_name: "my-huggingface-org/model"

name: Friendly name used in logs or interfaces.
description: Human-readable explanation.
task_name: Must match the Python filename (without .py) that implements the logic.
model_name: Passed to the from_pretrained(…) method of the relevant transformers class, or to your custom loader.
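
Before the entry reaches the database sync, it can help to check it locally. The following is a minimal sketch and not part of FANTASIA: the helper validate_embedding_type, and the rule that all four fields must be non-empty strings, are assumptions made for illustration. The entry is written as the dictionary a YAML parser would produce for the example above.

```python
REQUIRED_FIELDS = ("name", "description", "task_name", "model_name")


def validate_embedding_type(entry):
    """Return a list of problems found in one sequence_embedding_types entry."""
    problems = []
    for field in REQUIRED_FIELDS:
        value = entry.get(field)
        if not isinstance(value, str) or not value.strip():
            problems.append(f"missing or empty field: {field}")
    # task_name becomes a module name, so it must be a valid Python identifier.
    task_name = entry.get("task_name")
    if isinstance(task_name, str) and task_name.strip() and not task_name.isidentifier():
        problems.append(f"task_name is not a valid module name: {task_name!r}")
    return problems


entry = {
    "name": "MyModel",
    "description": "Description of the new model",
    "task_name": "mymodel",
    "model_name": "my-huggingface-org/model",
}
print(validate_embedding_type(entry))  # prints []
```

A task_name such as "my-model" would be reported, because a hyphenated name cannot be imported as a Python module.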

2. Implement the Python Module

Create the module file at:

protein_information_system/operation/embedding/proccess/sequence/mymodel.py

This module must implement the following three functions:

def load_model(model_name, conf):
    # Load and return the model
    ...

def load_tokenizer(model_name):
    # Load and return the tokenizer
    ...

def embedding_task(batch, model, tokenizer, device, batch_size=8, embedding_type_id=None):
    # Apply model and return a list of dictionaries
    ...

Each item returned by embedding_task() must be a dictionary with the following keys:

{
    "sequence_id": <int>,
    "embedding_type_id": <int>,
    "sequence": <str>,
    "embedding": <np.ndarray>,
    "shape": <tuple>
}
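
A quick structural check can catch malformed records before they are stored. This is a sketch under stated assumptions, not the system's own validation: check_record is a hypothetical helper, and it assumes that shape is expected to equal the embedding array's own shape attribute. A SimpleNamespace stands in for the NumPy array so the example runs without NumPy installed.

```python
from types import SimpleNamespace

REQUIRED_KEYS = ("sequence_id", "embedding_type_id", "sequence", "embedding", "shape")


def check_record(record):
    """Raise if a record returned by embedding_task() looks malformed (hypothetical helper)."""
    missing = [key for key in REQUIRED_KEYS if key not in record]
    if missing:
        raise KeyError(f"record is missing keys: {missing}")
    for key in ("sequence_id", "embedding_type_id"):
        if not isinstance(record[key], int):
            raise TypeError(f"{key} must be an int, got {type(record[key]).__name__}")
    if not isinstance(record["sequence"], str):
        raise TypeError("sequence must be a str")
    # Assumption: "shape" mirrors the shape attribute of the embedding array.
    actual_shape = tuple(getattr(record["embedding"], "shape", ()))
    if actual_shape != tuple(record["shape"]):
        raise ValueError(
            f"shape {record['shape']} does not match embedding shape {actual_shape}"
        )
    return record


# Stand-in for an np.ndarray of shape (1024,); only the shape attribute is inspected.
fake_embedding = SimpleNamespace(shape=(1024,))
record = {
    "sequence_id": 42,
    "embedding_type_id": 5,
    "sequence": "MKTAYIAKQR",
    "embedding": fake_embedding,
    "shape": (1024,),
}
check_record(record)  # passes silently for a well-formed record
```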

3. Add the Model ID to the Execution Configuration

In your main config file (e.g., config.yaml), register the model ID under embedding.types. For example, if the new model was assigned ID 5:

embedding:
  types: [1, 2, 3, 4, 5]
  batch_size: 1
  batch_size_embedding: 1
  device: cuda

4. Execute the Pipeline

Run your main pipeline (e.g., main.py):

python main.py

This will trigger:

Loading and checking services.
Accession fetching (API or CSV).
UniProt data extraction.
Dynamic embedding model loading.
Embedding generation and storage.
Structure 3Di processing.

Every model registered in constants.yaml and activated via embedding.types is applied to all sequences that do not yet have an embedding of that type.
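
The rule that only missing embeddings are computed amounts to a set difference over (sequence, type) pairs. A minimal sketch, assuming that existing embeddings can be represented as a set of (sequence_id, embedding_type_id) tuples; the function name pending_pairs is hypothetical, and the real system presumably resolves this with a database query:

```python
def pending_pairs(sequence_ids, active_type_ids, existing):
    """Return the (sequence_id, embedding_type_id) pairs that still need an embedding."""
    return [
        (sequence_id, type_id)
        for type_id in active_type_ids
        for sequence_id in sequence_ids
        if (sequence_id, type_id) not in existing
    ]


# Types 1 and 5 are active; type 1 is fully embedded, type 5 only for sequence 101.
existing = {(101, 1), (102, 1), (103, 1), (101, 5)}
todo = pending_pairs([101, 102, 103], [1, 5], existing)
print(todo)  # prints [(102, 5), (103, 5)]
```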

Execution Context

The logic is orchestrated from:

SequenceEmbeddingManager(conf).start()

This class loads all active embedding_type_ids and executes:

enqueue() → batching and task creation
process() → model inference
store_entry() → database insertion
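
The three stages can be pictured as a simple loop. The sketch below is illustrative only and runs everything in one process: make_batches, run_stages, and the stand-in embed and store callables are hypothetical names, and the real manager may distribute these stages differently.

```python
def make_batches(items, batch_size):
    """Split items into consecutive lists of at most batch_size elements."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    return [items[i:i + batch_size] for i in range(0, len(items), batch_size)]


def run_stages(sequences, embedding_type_id, batch_size, embed, store):
    """Mimic enqueue -> process -> store_entry for one embedding type."""
    stored = 0
    for batch in make_batches(sequences, batch_size):  # enqueue: one task per batch
        records = embed(batch, embedding_type_id)      # process: model inference
        for record in records:                         # store_entry: persist each record
            store(record)
            stored += 1
    return stored


# Stand-ins: "inference" just records the sequence length, "storage" is a list.
def fake_embed(batch, embedding_type_id):
    return [
        {
            "sequence_id": item["sequence_id"],
            "embedding_type_id": embedding_type_id,
            "length": len(item["sequence"]),
        }
        for item in batch
    ]


database = []
sequences = [{"sequence_id": i, "sequence": "M" * i} for i in range(1, 6)]
count = run_stages(sequences, 5, batch_size=2, embed=fake_embed, store=database.append)
print(count)  # prints 5
```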

The embedding modules are dynamically imported from:

importlib.import_module("...embedding.proccess.sequence.<task_name>")
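
A dynamic import of this kind, together with a check that the module exposes the three required functions, might look as follows. This is a sketch rather than the manager's actual code: load_embedding_module is a hypothetical helper, the package prefix is passed in as a parameter because the full path is abbreviated above, and a stand-in module is registered in sys.modules so the example runs without the real package.

```python
import importlib
import sys
import types

REQUIRED_FUNCTIONS = ("load_model", "load_tokenizer", "embedding_task")


def load_embedding_module(base_package, task_name):
    """Import <base_package>.<task_name> and verify it exposes the required functions."""
    module = importlib.import_module(f"{base_package}.{task_name}")
    missing = [
        name for name in REQUIRED_FUNCTIONS
        if not callable(getattr(module, name, None))
    ]
    if missing:
        raise AttributeError(f"{module.__name__} is missing: {', '.join(missing)}")
    return module


# Stand-in package and module so the import succeeds without the real code base.
package = types.ModuleType("demo_embeddings")
package.__path__ = []  # marks the stand-in as a package
stub = types.ModuleType("demo_embeddings.mymodel")
stub.load_model = lambda model_name, conf: "model"
stub.load_tokenizer = lambda model_name: "tokenizer"
stub.embedding_task = lambda batch, model, tokenizer, device, **kwargs: []
sys.modules["demo_embeddings"] = package
sys.modules["demo_embeddings.mymodel"] = stub

module = load_embedding_module("demo_embeddings", "mymodel")
print(module.__name__)  # prints demo_embeddings.mymodel
```

Failing early with a clear message is the point of the check: a module that lacks one of the three functions is rejected at load time instead of partway through a run.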

Summary Table

The following table summarizes the elements required to add a new embedding model:

Field / Method	Description
`name`	Display name for the model (for humans)
`task_name`	Name of the Python module to import dynamically
`model_name`	Model path or identifier (e.g. HuggingFace model name)
`load_model()`	Loads the actual model (Torch/Transformers)
`load_tokenizer()`	Loads tokenizer compatible with the model
`embedding_task()`	Performs inference and returns embedding records
`embedding.types`	List of model IDs to execute, defined in `config.yaml`

Notes

Embeddings are computed only for sequences that do not already have one for the target model.
Models are isolated and modular: changes to one model do not affect the others.
For debugging, set limit_execution in the config to restrict the number of sequences processed.