Sequence Embeddings Module

class protein_information_system.operation.embedding.sequence_embedding.SequenceEmbeddingManager(conf)

Bases: GPUTaskInitializer

Manages the sequence embedding process, including model loading, task enqueuing, and result storing.

This class initializes GPU tasks, retrieves model configuration, and processes batches of sequences for embedding generation.

reference_attribute

Name of the attribute used as the reference for embedding (default: ‘sequence’).

Type:

str

model_instances

Dictionary of loaded models keyed by embedding type ID.

Type:

dict

tokenizer_instances

Dictionary of loaded tokenizers keyed by embedding type ID.

Type:

dict

base_module_path

Base module path for dynamic imports of embedding tasks.

Type:

str

batch_size

Number of sequences processed per batch. Defaults to 40.

Type:

int

types

Configuration dictionary for embedding types.

Type:

dict

enqueue()

Enqueue sequence-embedding tasks for all models, requesting only the missing layers.

Behavior

For each (sequence, embedding model type):
  1. Read the desired layer indices from configuration (e.g., [0, 1, 2]).

  2. Query the database for already-present layers for (sequence_id, embedding_type_id).

  3. Compute the set difference → ‘missing_layers’.

  4. If any layers are missing, publish a single task payload for that sequence/model that includes only those missing layer indices.
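The four steps above can be sketched as follows. This is a minimal illustration, not the module's actual code: the helper names (missing_layers, build_task) and the payload keys are hypothetical, and the database lookup of step 2 is replaced by a plain list passed in by the caller.

```python
def missing_layers(desired, present):
    """Return the desired layer indices that are not stored yet, sorted."""
    return sorted(set(desired) - set(present))


def build_task(sequence_id, embedding_type_id, desired, present):
    """Build one payload for a (sequence, model) pair.

    Returns None when every desired layer is already present, so the
    caller publishes nothing for that pair. The payload keys used here
    are assumptions made for illustration.
    """
    missing = missing_layers(desired, present)
    if not missing:
        return None
    return {
        "sequence_id": sequence_id,
        "embedding_type_id": embedding_type_id,
        "layer_index": missing,
    }


# Layers 0..2 are configured and layer 1 is already stored,
# so the task requests only layers 0 and 2.
task = build_task(sequence_id=1, embedding_type_id=2,
                  desired=[0, 1, 2], present=[1])
```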

Batching

Sequences are chunked into batches of size self.queue_batch_size to control memory and message size. For each batch, messages are grouped per model (backend) to minimize queue traffic.
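A rough sketch of the chunking and per-model grouping described above. The helpers chunked and group_by_model are hypothetical names, and the task dictionaries only assume an embedding_type_id key; the real method may organise its messages differently.

```python
from collections import defaultdict


def chunked(items, size):
    """Yield consecutive slices of `items` with at most `size` elements."""
    if size <= 0:
        raise ValueError("size must be a positive integer")
    for start in range(0, len(items), size):
        yield items[start:start + size]


def group_by_model(tasks):
    """Group task payloads by embedding_type_id, one list per model."""
    grouped = defaultdict(list)
    for task in tasks:
        grouped[task["embedding_type_id"]].append(task)
    return dict(grouped)


tasks = [
    {"sequence_id": 1, "embedding_type_id": 2},
    {"sequence_id": 2, "embedding_type_id": 3},
    {"sequence_id": 3, "embedding_type_id": 2},
]

# One message per model and per batch, instead of one message per task.
for batch in chunked(tasks, size=2):
    for model_id, model_tasks in group_by_model(batch).items():
        pass  # publish `model_tasks` to the queue for `model_id` here
```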

Notes

  • This method assumes that the sequence_embeddings table has a layer_index column and that downstream storage (store_entry) writes this value when inserting.

  • It is recommended to add a UNIQUE constraint on

    (sequence_id, embedding_type_id, layer_index)

    to prevent duplicates in concurrent/parallel workers.
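The effect of such a constraint can be tried in isolation. The sketch below uses an in-memory SQLite database purely so that it runs standalone; the project's real database engine, the full column list of sequence_embeddings, and the insert_layer helper are assumptions made for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    """
    CREATE TABLE sequence_embeddings (
        id INTEGER PRIMARY KEY,
        sequence_id INTEGER NOT NULL,
        embedding_type_id INTEGER NOT NULL,
        layer_index INTEGER NOT NULL,
        UNIQUE (sequence_id, embedding_type_id, layer_index)
    )
    """
)


def insert_layer(conn, sequence_id, embedding_type_id, layer_index):
    """Insert one row; return True if stored, False if it was a duplicate."""
    try:
        with conn:  # commits on success, rolls back on error
            conn.execute(
                "INSERT INTO sequence_embeddings "
                "(sequence_id, embedding_type_id, layer_index) "
                "VALUES (?, ?, ?)",
                (sequence_id, embedding_type_id, layer_index),
            )
        return True
    except sqlite3.IntegrityError:
        # A second worker tried to store the same layer; the constraint
        # rejects it instead of creating a duplicate row.
        return False


first = insert_layer(conn, 1, 2, 0)
second = insert_layer(conn, 1, 2, 0)
```

With the constraint in place, a worker that loses the race gets an integrity error it can handle, rather than silently writing a second copy of the same layer.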

Raises:

Exception – Any unexpected error is logged and then re-raised.

process(batch_data)

Processes a batch of sequences to generate embeddings.

Parameters:

batch_data (list[dict]) – List of dictionaries, each containing sequence data.

Returns:

List of dictionaries with embedding results.

Return type:

list[dict]

Raises:

Exception – If there’s an error during embedding generation.

Example

>>> batch_data = [{"sequence": "ATCG", "sequence_id": 1, "embedding_type_id": 2}]
>>> results = manager.process(batch_data)

store_entry(records)

Abstract method to store processed entries. Must be overridden by subclasses.