Protein annotation files
GO Annotations Queue Processor
This module defines GOAnnotationsQueueProcessor, a queue-integrated
component of the Protein Information System (PIS) that:
Parses CAFA-formatted GO annotation files (TSV).
Loads and indexes protein sequences from a FASTA file.
Publishes per-protein tasks to the internal queue.
Persists proteins, accessions, sequences, GO terms (with category), and protein–GO associations into the relational database.
The implementation relies on ORM entities (Protein, Accession, Sequence,
GOTerm, ProteinGOTermAnnotation) and a task-queue base class
(QueueTaskInitializer).
Notes
CAFA TSV format expected per line: UniProtKB:ID<TAB>GO:XXXXXXX<TAB>Category. The category is optional in the file, but the DB model requires a category to create a new GO term; GO terms with a missing category are therefore skipped.
FASTA headers are expected to follow the UniProt convention, like sp|P12345|... or tr|Q8XYZ1|...; the accession is extracted from the second field when splitting on '|'. If it is not present, the full record ID is used.
Evidence codes for associations default to "UNKNOWN" unless provided by upstream sources.
- class protein_information_system.operation.extraction.protein_annotations_file.GOAnnotationsQueueProcessor(conf: dict)
Bases: QueueTaskInitializer
Queue processor for Gene Ontology annotations.
- Parameters:
conf (dict) – Configuration mapping. Requires the following keys:
- goa_annotations_file: Path to the CAFA-formatted TSV file.
- goa_sequences_fasta: Path to the FASTA file with protein sequences.
- limit_execution (optional): Integer limit for the number of TSV lines to process.
Side effects: Loads all sequences from the FASTA file into an in-memory dictionary (self.sequences) at initialization to enable fast lookups during processing.
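As an illustration of the keys listed above, here is a minimal sketch of a configuration mapping. The file paths and the limit value are placeholders, and the `validate_conf` helper is hypothetical; it is not part of the module.

```python
# Hypothetical configuration; the paths and the limit are placeholders.
conf = {
    "goa_annotations_file": "data/annotations.tsv",
    "goa_sequences_fasta": "data/sequences.fasta",
    "limit_execution": 1000,  # optional key
}

# Keys the documentation lists as required.
REQUIRED_KEYS = ("goa_annotations_file", "goa_sequences_fasta")


def validate_conf(conf: dict) -> list[str]:
    """Return the required keys missing from conf (hypothetical helper)."""
    return [key for key in REQUIRED_KEYS if key not in conf]
```

An empty list from `validate_conf(conf)` means both required keys are present.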
- enqueue() None
Enqueue per-protein tasks parsed from a CAFA-formatted TSV file.
Expected TSV columns per line:
UniProtKB:ID<TAB>GO:XXXXXXX<TAB>Category
Blank lines and lines beginning with # are ignored.
The third column (Category) is optional in the file; if absent or blank, "UNKNOWN" is assigned. Unknown categories are allowed at enqueue time but may later cause GO term creation to be skipped during storage.
Entries are grouped by protein accession before publishing to the queue.
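The parsing and grouping rules above can be sketched as follows. This is an illustration, not the module's code: the function name `group_cafa_lines` is hypothetical, and two details are assumptions, namely that the `UniProtKB:` prefix is stripped from the first column and that the limit counts only data lines (not comments or blank lines).

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Optional, Tuple


def group_cafa_lines(
    lines: Iterable[str], limit: Optional[int] = None
) -> Dict[str, List[Tuple[str, str]]]:
    """Group (go_id, category) pairs by protein accession (hypothetical helper)."""
    grouped: Dict[str, List[Tuple[str, str]]] = defaultdict(list)
    processed = 0
    for raw in lines:
        line = raw.strip()
        # Skip blank lines and comments.
        if not line or line.startswith("#"):
            continue
        # Assumption: the limit applies to data lines only.
        if limit is not None and processed >= limit:
            break
        processed += 1
        fields = line.split("\t")
        if len(fields) < 2:
            continue  # malformed line: no GO ID column
        # Assumption: drop a leading "UniProtKB:" style prefix if present.
        accession = fields[0].split(":", 1)[-1].strip()
        go_id = fields[1].strip()
        # A missing or blank third column becomes "UNKNOWN".
        has_category = len(fields) > 2 and fields[2].strip()
        category = fields[2].strip() if has_category else "UNKNOWN"
        grouped[accession].append((go_id, category))
    return dict(grouped)
```

Each key of the returned mapping would correspond to one task published to the queue.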
- get_or_create_accession(code: str, protein_id: str, primary: bool = True, tag: str | None = None) Accession
Create or retrieve a UniProt Accession linked to a protein.
- Parameters:
code (str) – Accession code (e.g., P12345) used as the primary key in Accession.
protein_id (str) – Identifier of the linked Protein (should match the UniProt accession).
primary (bool, optional) – Whether this accession is the primary one for the protein (default True).
tag (Optional[str], optional) – Optional qualifier/tag for the accession.
- Returns:
ORM instance corresponding to the accession.
- Return type:
Accession
- get_or_create_association(protein_id: str, go_id: str, evidence_code: str = 'UNKNOWN') ProteinGOTermAnnotation | None
Create or retrieve a protein–GO association.
- Parameters:
protein_id (str) – Protein identifier (UniProt accession).
go_id (str) – GO term identifier (e.g., GO:0008150).
evidence_code (str, optional) – Evidence code for the association (default "UNKNOWN").
- Returns:
The existing ORM association if found; otherwise None. When no association exists, a new one is added to the session but not yet flushed.
- Return type:
Optional[ProteinGOTermAnnotation]
- get_or_create_go_term(go_id: str, category: str) GOTerm
Create or update a GOTerm with its category.
- Parameters:
go_id (str) – GO term identifier (e.g., GO:0008150).
category (str) – GO category label (BP, MF, or CC). The DB schema enforces non-null constraints; empty/None categories are rejected.
- Returns:
ORM instance corresponding to the GO term.
- Return type:
GOTerm
- Raises:
ValueError – If category is empty or None.
- get_or_create_protein(protein_entry_id: str) Protein
Create or retrieve a Protein by UniProt accession.
- Parameters:
protein_entry_id (str) – UniProt accession used as the primary key in the Protein table.
- Returns:
ORM instance for the protein.
- Return type:
Protein
- get_or_create_sequence(sequence: str) Sequence
Create or retrieve a Sequence entity by raw sequence value.
- Parameters:
sequence (str) – Amino acid sequence string.
- Returns:
ORM instance corresponding to the stored sequence.
- Return type:
Sequence
- Raises:
ValueError – If sequence is empty or None.
- get_sequence_from_external_source(protein_entry_id: str) str | None
Retrieve the sequence for a protein from the in-memory FASTA index.
- Parameters:
protein_entry_id (str) – UniProt accession for which to retrieve the sequence.
- Returns:
The amino acid sequence if found; otherwise None.
- Return type:
Optional[str]
- load_sequences() Dict[str, str]
Load sequences from the configured FASTA file into memory.
FASTA records are indexed by UniProt accession (preferred) or by the raw record ID when an accession cannot be parsed.
- Returns:
Mapping {uniprot_accession: amino_acid_sequence}.
- Return type:
dict
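The indexing rule described above (second '|'-separated field of the header, falling back to the full record ID) can be sketched in plain Python. The function names are hypothetical, and the module may well use a FASTA library instead; this sketch only illustrates the keying logic.

```python
from typing import Dict, Iterable, List, Optional


def accession_from_header(header: str) -> str:
    """Extract the accession from a FASTA header line (hypothetical helper)."""
    parts = header.lstrip(">").split()
    record_id = parts[0] if parts else ""
    fields = record_id.split("|")
    # UniProt style: db|ACCESSION|ENTRY_NAME, so the accession is field 2.
    if len(fields) >= 2 and fields[1]:
        return fields[1]
    # Fallback: use the full record ID.
    return record_id


def index_fasta(lines: Iterable[str]) -> Dict[str, str]:
    """Build {accession: sequence} from FASTA lines (hypothetical helper)."""
    sequences: Dict[str, str] = {}
    current: Optional[str] = None
    chunks: List[str] = []
    for raw in lines:
        line = raw.strip()
        if not line:
            continue
        if line.startswith(">"):
            # Close the previous record before starting a new one.
            if current is not None:
                sequences[current] = "".join(chunks)
            current = accession_from_header(line)
            chunks = []
        elif current is not None:
            chunks.append(line)
    if current is not None:
        sequences[current] = "".join(chunks)
    return sequences
```

Passing an open file handle as `lines` works because file objects iterate line by line.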
- process(data: dict) dict
Resolve sequence and return a normalized task result.
- Parameters:
data (dict) – Task payload with keys:
- protein_entry_id (str): Protein accession.
- go_terms (list[tuple[str, str]]): GO ID and category pairs.
- Returns:
Result payload with keys: protein, go_terms, sequence.
- Return type:
dict
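To make the payload shapes concrete, the following sketch maps a task payload to a result payload using an in-memory sequence index. The function name `build_result` is hypothetical, and the sketch assumes that a protein missing from the index yields a `sequence` of `None`.

```python
from typing import Dict


def build_result(data: dict, sequences: Dict[str, str]) -> dict:
    """Map a task payload to a result payload (hypothetical helper)."""
    accession = data["protein_entry_id"]
    return {
        "protein": accession,
        # Copy the list so the result does not alias the input payload.
        "go_terms": list(data.get("go_terms", [])),
        # Assumption: None when the accession is not in the FASTA index.
        "sequence": sequences.get(accession),
    }
```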
- store_entry(data: dict) None
Persist a processed protein entry into the database.
This method performs the following steps in a transactional manner:
1. Ensure the existence of the Protein and its corresponding Accession record.
2. Link the Protein to a Sequence if available, creating the sequence record on demand.
3. Ensure the existence of all referenced GO terms (one by one).
4. Collect the set of GO associations (protein_id, go_id) to be created.
5. Query in bulk which associations already exist for this protein.
6. Insert only the missing associations in a single bulk statement (multi-values INSERT).
7. Commit all changes once at the end.
Compared to the previous row-by-row approach, this implementation eliminates the N+1 query pattern and reduces overhead by performing association inserts in bulk. This significantly improves throughput when handling proteins with a large number of GO annotations.
- Parameters:
data (dict) – Parsed entry with at least:
- "protein": str, protein identifier
- "sequence": Optional[str], raw protein sequence
- "go_terms": List[Tuple[str, str]], list of (go_id, category)
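The bulk association steps of store_entry (collect, query existing, insert only the missing rows in one multi-values INSERT, commit once) can be illustrated with the standard-library `sqlite3` module. This is a sketch under stated assumptions, not the module's ORM code: the table name, column names, and the function `insert_missing_associations` are all hypothetical.

```python
import sqlite3
from typing import Iterable

# Hypothetical schema; the real table is defined by the ORM entities.
SCHEMA = """
CREATE TABLE IF NOT EXISTS protein_go_term_annotation (
    protein_id TEXT NOT NULL,
    go_id TEXT NOT NULL,
    evidence_code TEXT NOT NULL,
    PRIMARY KEY (protein_id, go_id)
)
"""


def insert_missing_associations(
    conn: sqlite3.Connection,
    protein_id: str,
    go_ids: Iterable[str],
    evidence_code: str = "UNKNOWN",
) -> int:
    """Insert associations that do not exist yet; return how many were added."""
    wanted = sorted(set(go_ids))  # de-duplicate the requested GO IDs
    # One query fetches every association already stored for this protein.
    rows = conn.execute(
        "SELECT go_id FROM protein_go_term_annotation WHERE protein_id = ?",
        (protein_id,),
    )
    existing = {row[0] for row in rows}
    missing = [go_id for go_id in wanted if go_id not in existing]
    if missing:
        # Build a single multi-values INSERT for all missing rows.
        placeholders = ", ".join(["(?, ?, ?)"] * len(missing))
        params = [
            value
            for go_id in missing
            for value in (protein_id, go_id, evidence_code)
        ]
        conn.execute(
            "INSERT INTO protein_go_term_annotation "
            f"(protein_id, go_id, evidence_code) VALUES {placeholders}",
            params,
        )
    conn.commit()  # commit once at the end
    return len(missing)
```

For proteins with very many annotations, a single statement can exceed the database's bound-parameter limit, so a real implementation would likely insert in chunks.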