Adding Embedding Models
========================

This document explains how to add a new sequence embedding model to the FANTASIA system. The system is modular and integrates new models dynamically: you only need to update a constants file and implement a compatible module.

Overview
--------

Sequence embedding models are declared in the `sequence_embedding_types` section of the `constants.yaml` file. The system automatically syncs this configuration with the database table `sequence_embedding_type`, so **you do not need to modify the database manually**.

At runtime, the `SequenceEmbeddingManager` dynamically loads all declared models using Python's `importlib` and the `task_name` defined for each embedding model.

Integration Steps
-----------------

1. Declare the Model in `constants.yaml`
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

Edit the file `config/constants.yaml` and add a new entry under `sequence_embedding_types`:

.. code-block:: yaml

   sequence_embedding_types:
     - name: "MyModel"
       description: "Description of the new model"
       task_name: "mymodel"
       model_name: "my-huggingface-org/model"

- `name`: Friendly name used in logs or interfaces.
- `description`: Human-readable explanation.
- `task_name`: Must match the Python filename (without `.py`) that implements the logic.
- `model_name`: Passed to a `transformers` `from_pretrained(...)` call or to your custom loader.

2. Implement the Python Module
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

Create a file in:

.. code-block:: text

   protein_information_system/operation/embedding/proccess/sequence/mymodel.py

This module must implement the following three functions:

.. code-block:: python

   def load_model(model_name, conf):
       # Load and return the model
       ...

   def load_tokenizer(model_name):
       # Load and return the tokenizer
       ...

   def embedding_task(batch, model, tokenizer, device, batch_size=8, embedding_type_id=None):
       # Apply model and return a list of dictionaries
       ...

Each item returned by `embedding_task()` must be a dictionary with the following keys:

.. code-block:: python

   {
       "sequence_id": ,
       "embedding_type_id": ,
       "sequence": ,
       "embedding": ,
       "shape":
   }

3. Add the Model ID to the Execution Configuration
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

In your main config file (e.g., `config.yaml`), register the model ID under `embedding.types`. For example, if the new model was assigned ID 5:

.. code-block:: yaml

   embedding:
     types: [1, 2, 3, 4, 5]
     batch_size: 1
     batch_size_embedding: 1
     device: cuda

4. Execute the Pipeline
^^^^^^^^^^^^^^^^^^^^^^^

Run your main pipeline (e.g., `main.py`):

.. code-block:: bash

   python main.py

This will trigger:

- Loading and checking services.
- Accession fetching (API or CSV).
- UniProt data extraction.
- **Dynamic embedding model loading**.
- Embedding generation and storage.
- Structure 3Di processing.

Every model registered in `constants.yaml` and activated via `embedding.types` is applied to all sequences that do not yet have an embedding of that type.

Execution Context
-----------------

The logic is orchestrated from:

.. code-block:: python

   SequenceEmbeddingManager(conf).start()

This class loads all active ``embedding_type_ids`` and executes:

- `enqueue()` → batching and task creation
- `process()` → model inference
- `store_entry()` → database insertion

The embedding modules are dynamically imported from:

.. code-block:: python

   importlib.import_module("...embedding.proccess.sequence.")

Summary Table
-------------

The following table summarizes the required elements to add a new embedding model:

.. list-table::
   :widths: 25 75
   :header-rows: 1

   * - Field / Method
     - Description
   * - ``name``
     - Display name for the model (for humans)
   * - ``task_name``
     - Name of the Python module to import dynamically
   * - ``model_name``
     - Model path or identifier (e.g. HuggingFace model name)
   * - ``load_model()``
     - Loads the actual model (Torch/Transformers)
   * - ``load_tokenizer()``
     - Loads a tokenizer compatible with the model
   * - ``embedding_task()``
     - Performs inference and returns embedding records
   * - ``embedding.types``
     - List of model IDs to execute, defined in ``config.yaml``

Notes
-----

- The system computes embeddings only if they do not already exist for the target model.
- Models are isolated and modular: changes to one model do not affect others.
- For debugging, you can set `limit_execution` in the config to restrict the number of sequences.
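
When debugging a new model module, it may help to check that it exposes the three required functions, and that its records carry the required keys, before running the full pipeline. The following is a minimal sketch and not part of FANTASIA: `missing_entry_points` and `missing_record_keys` are hypothetical helpers, and the `demo` module is a stand-in for a module that would normally be obtained through `importlib.import_module`.

.. code-block:: python

   import types

   # Names taken from the integration steps described in this document.
   REQUIRED_FUNCTIONS = ("load_model", "load_tokenizer", "embedding_task")
   REQUIRED_KEYS = {"sequence_id", "embedding_type_id", "sequence", "embedding", "shape"}


   def missing_entry_points(module):
       """Return the required function names the module does not define as callables."""
       return [name for name in REQUIRED_FUNCTIONS
               if not callable(getattr(module, name, None))]


   def missing_record_keys(record):
       """Return the required keys absent from one embedding_task() record."""
       return sorted(REQUIRED_KEYS - set(record))


   # Stand-in module (hypothetical); it deliberately lacks embedding_task().
   demo = types.ModuleType("mymodel")
   demo.load_model = lambda model_name, conf: None
   demo.load_tokenizer = lambda model_name: None

   print(missing_entry_points(demo))
   # ['embedding_task']
   print(missing_record_keys({"sequence_id": 1, "embedding": [0.1]}))
   # ['embedding_type_id', 'sequence', 'shape']

An empty list from both helpers suggests the module follows the contract described here. It does not verify that the model itself loads or produces sensible embeddings.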