Adding Embedding Models

This document explains how to add a new sequence embedding model to the FANTASIA system. The system is modular and dynamically integrates new models by simply updating a constants file and implementing a compatible module.

Overview

Sequence embedding models are declared in the sequence_embedding_types section of the constants.yaml file. The system automatically syncs this configuration with the database table sequence_embedding_type, so you do not need to modify the database manually.

At runtime, the SequenceEmbeddingManager dynamically loads all declared models using Python’s importlib and the task_name defined for each embedding model.

Integration Steps

1. Declare the Model in constants.yaml

Edit the file config/constants.yaml and add a new entry under sequence_embedding_types:

sequence_embedding_types:
  - name: "MyModel"
    description: "Description of the new model"
    task_name: "mymodel"
    model_name: "my-huggingface-org/model"
  • name: Friendly name used in logs or interfaces.

  • description: Human-readable explanation.

  • task_name: Must match the Python filename (without .py) that implements the logic.

  • model_name: Passed to transformers.from_pretrained(…) or your custom loader.

2. Implement the Python Module

Create a file in:

protein_information_system/operation/embedding/proccess/sequence/mymodel.py

This module must implement the following three functions:

def load_model(model_name, conf):
    # Load and return the model
    ...

def load_tokenizer(model_name):
    # Load and return the tokenizer
    ...

def embedding_task(batch, model, tokenizer, device, batch_size=8, embedding_type_id=None):
    # Apply model and return a list of dictionaries
    ...

Each item returned by embedding_task() must be a dictionary with the following keys:

{
    "sequence_id": <int>,
    "embedding_type_id": <int>,
    "sequence": <str>,
    "embedding": <np.ndarray>,
    "shape": <tuple>
}

3. Add the Model ID to the Execution Configuration

In your main config file (e.g., config.yaml), register the model ID under embedding.types. For example, if the new model was assigned ID 5:

embedding:
  types: [1, 2, 3, 4, 5]
  batch_size: 1
  batch_size_embedding: 1
  device: cuda

4. Execute the Pipeline

Run your main pipeline (e.g., main.py):

python main.py

This will trigger:

  • Loading and checking services.

  • Accession fetching (API or CSV).

  • UniProt data extraction.

  • Dynamic embedding model loading.

  • Embedding generation and storage.

  • Structure 3Di processing.

All models registered in the constants.yaml and activated via embedding.types will be applied to all sequences not yet embedded for that type.

Execution Context

The logic is orchestrated from:

SequenceEmbeddingManager(conf).start()

This class loads all active embedding_type_ids and executes:

  • enqueue() → batching and task creation

  • process() → model inference

  • store_entry() → database insertion

The embedding modules are dynamically imported from:

importlib.import_module("...embedding.proccess.sequence.<task_name>")

Summary Table

The following table summarizes the required elements to add a new embedding model:

Field / Method

Description

name

Display name for the model (for humans)

task_name

Name of the Python module to import dynamically

model_name

Model path or identifier (e.g. HuggingFace model name)

load_model()

Loads the actual model (Torch/Transformers)

load_tokenizer()

Loads tokenizer compatible with the model

embedding_task()

Performs inference and returns embedding records

embedding.types

List of model IDs to execute, defined in config.yaml

Notes

  • The system ensures embeddings are only computed if they do not already exist for the target model.

  • Models are isolated and modular: changes to one model do not affect others.

  • For debugging, you can set limit_execution in the config to restrict the number of sequences.