Welcome to highentDCA Documentation

highentDCA is a specialized Python package for training entropy-decimated Direct Coupling Analysis (edDCA) models on biological sequence data. It extends the adabmDCA framework to train sparse Potts models efficiently while tracking how their entropy evolves during the decimation process.

About this Documentation

This documentation provides a comprehensive guide to using highentDCA for training sparse DCA models with entropy tracking. It complements the adabmDCA documentation and focuses specifically on the entropy decimation features.

What is highentDCA?

highentDCA implements the entropy-decimated DCA (edDCA) algorithm, a method that:

  • Progressively prunes couplings from a fully-connected Boltzmann Machine (bmDCA)
  • Maintains model accuracy while reducing coupling density to target levels
  • Computes entropy at key decimation checkpoints using thermodynamic integration
  • Provides insights into the relationship between model complexity and information content

This approach is particularly valuable for:

  • Understanding which interactions are essential for capturing sequence statistics
  • Building interpretable sparse models for protein families
  • Studying the thermodynamics of statistical models
  • Reducing computational requirements for downstream applications

Core Features

🔬 Entropy Decimation (edDCA)

The main feature of highentDCA is the ability to train edDCA models that:

  • Start from a converged bmDCA model (or train one automatically)
  • Iteratively remove the least important couplings based on empirical statistics
  • Re-equilibrate after each decimation step to maintain accuracy
  • Track coupling density, Pearson correlation, and model entropy throughout the process

📊 Thermodynamic Integration

At pre-defined density checkpoints, highentDCA computes model entropy using thermodynamic integration:

  • Introduces a bias parameter θ towards a target sequence
  • Integrates average sequence identity from θ=0 to θ_max
  • Uses careful equilibration to ensure accurate entropy estimates
  • Saves entropy values for downstream analysis
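The integral above can be evaluated numerically once the average identity ⟨q⟩_θ has been sampled on a grid of θ values. A minimal sketch assuming β = 1, a sequence length L, the energy e_ref of the target sequence, and the mean energy of the unbiased model; none of these argument names come from the highentDCA API:

```python
import numpy as np

def entropy_via_ti(theta, mean_q, L, e_ref, mean_e):
    """Entropy of the unbiased model from thermodynamic integration.

    Uses log Z(theta_max) - log Z(0) = ∫ L * <q>_theta dtheta, and the
    fact that for large theta_max the biased measure concentrates on the
    target sequence, so log Z(theta_max) ≈ -e_ref + theta_max * L.
    The entropy then follows from S = log Z(0) + <E>_0 (beta = 1).
    """
    x = np.asarray(theta, dtype=float)
    y = L * np.asarray(mean_q, dtype=float)
    # trapezoid rule over the theta grid
    delta_logz = 0.5 * np.sum((y[1:] + y[:-1]) * np.diff(x))
    logz0 = (-e_ref + x[-1] * L) - delta_logz
    return logz0 + mean_e
```

As a sanity check: if the chains are already frozen on the target sequence (⟨q⟩ = 1 at every θ and ⟨E⟩₀ = e_ref), the formula returns an entropy of exactly zero.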

💾 Flexible Checkpointing

Multiple checkpoint strategies for saving model state:

  • Linear checkpointing: Save at regular intervals
  • Density-based checkpointing: Save at specific coupling densities
  • Automatic saving of model parameters, chains, and statistics
  • Optional integration with Weights & Biases for experiment tracking
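The two schedules can be sketched as follows. A hypothetical illustration — the function names and default values are ours, not the package's:

```python
import numpy as np

def linear_checkpoints(n_updates, every=100):
    """Linear schedule: save model state every `every` gradient updates."""
    return list(range(every, n_updates + 1, every))

def density_checkpoints(start=1.0, target=0.02, n_points=10):
    """Density schedule: log-spaced coupling densities between the fully
    connected model (density 1.0) and the target density."""
    return np.geomspace(start, target, n_points)
```

A log-spaced density schedule is a natural choice here because most of the interesting entropy change happens at low densities, where linear spacing would place too few checkpoints.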

🚀 GPU Acceleration

Built on PyTorch for efficient computation:

  • GPU-accelerated training and sampling
  • Efficient parallel Markov chain Monte Carlo
  • Automatic device management (CUDA/CPU)
  • Support for mixed precision (float32/float64)

How edDCA Works

The entropy decimation algorithm follows these steps:

  1. Initialization: Start with a converged bmDCA model or train one
  2. Decimation: Remove a fraction of couplings with smallest empirical two-point correlations
  3. Equilibration: Run MCMC to equilibrate chains on the decimated graph
  4. Re-convergence: Perform gradient descent to match data statistics
  5. Entropy Computation: At checkpoints, compute entropy via thermodynamic integration
  6. Iteration: Repeat steps 2-5 until target density is reached
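Steps 2 and 6 amount to a pruning loop. A toy NumPy sketch, with the equilibration and re-convergence steps (3–4) left as comments; the function and its arguments are illustrative, not the highentDCA API:

```python
import numpy as np

def decimation_schedule(corr, drop_frac=0.1, target_density=0.05):
    """Prune the active couplings with the smallest empirical two-point
    correlations until the target density is reached (toy version)."""
    mask = np.ones_like(corr, dtype=bool)   # True = coupling still active
    n_total = mask.size
    while mask.sum() / n_total > target_density:
        active = np.flatnonzero(mask)
        n_drop = max(1, int(drop_frac * active.size))
        # step 2: rank active couplings by |empirical correlation|
        order = np.argsort(np.abs(corr.flat[active]))
        mask.flat[active[order[:n_drop]]] = False   # remove the weakest
        # steps 3-4 would go here: re-equilibrate the MCMC chains on the
        # decimated graph, then run gradient descent to re-match the data
        # statistics before the next pruning round
    return mask
```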

[Figure: the decimation process]

Key Advantages

Sparsity with Accuracy

edDCA achieves high sparsity while maintaining model accuracy:

  • Typical models retain only 2-5% of couplings
  • Pearson correlation with data statistics remains >0.95
  • Essential interactions are preserved

Entropy Tracking

Understanding model information content:

  • Entropy decreases as couplings are removed
  • Rate of decrease reveals importance of interactions
  • Provides thermodynamic insights into model complexity

Computational Efficiency

Sparse models are faster for downstream applications:

  • Reduced memory footprint
  • Faster sampling and energy computation
  • Easier interpretation and visualization

Use Cases

Protein Contact Prediction

edDCA models can identify essential residue-residue contacts:

highentDCA train \
    --data protein_family.fasta \
    --model edDCA \
    --density 0.05 \
    --output contact_prediction

Mutational Effect Prediction

Sparse models capture key constraints for sequence function:

highentDCA train \
    --data enzyme_family.fasta \
    --model edDCA \
    --density 0.03 \
    --alphabet protein
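Once trained, a sparse Potts model scores a point mutation by the energy change it induces, using the usual convention E(x) = −Σᵢ hᵢ(xᵢ) − Σᵢ<ⱼ Jᵢⱼ(xᵢ, xⱼ). The helper below is an illustrative sketch, not part of the highentDCA API:

```python
import numpy as np

def mutation_score(h, J, seq, pos, new_aa):
    """Energy change dE = E(mutant) - E(wild-type) for a single point
    mutation at `pos`; more negative = more favourable. `h` has shape
    (L, q), `J` has shape (L, L, q, q), `seq` is a list of symbol indices."""
    old = seq[pos]
    dE = h[pos, old] - h[pos, new_aa]          # field contribution
    for j in range(len(seq)):                  # coupling contributions
        if j != pos:
            dE += J[pos, j, old, seq[j]] - J[pos, j, new_aa, seq[j]]
    return dE
```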

Thermodynamic Analysis

Study entropy evolution during decimation:

highentDCA train \
    --data sequence_data.fasta \
    --model edDCA \
    --theta_max 5.0 \
    --nsteps 100

Getting Started

New to highentDCA? Follow these steps:

  1. Installation: Set up the package and dependencies
  2. Quick Start: Train your first edDCA model
  3. CLI Reference: Explore all available options
  4. API Documentation: Use highentDCA in Python scripts
  5. Examples: Learn from practical examples

Comparison with bmDCA

Feature             bmDCA                     edDCA (highentDCA)
------------------  ------------------------  ----------------------------
Coupling density    100% (fully connected)    2-10% (sparse)
Training time       Standard                  Longer (includes decimation)
Memory usage        High                      Lower (sparse parameters)
Interpretability    Complex                   Better (fewer interactions)
Entropy tracking    No                        Yes
Use case            General-purpose           Sparsity & thermodynamics

Integration with adabmDCA

highentDCA is built on top of adabmDCA and shares:

  • Data formats: Compatible FASTA input and parameter formats
  • Sampling methods: Same Gibbs and Metropolis samplers
  • Statistics functions: Identical frequency and correlation computations
  • Utilities: Common helper functions for encoding, I/O, etc.

You can use adabmDCA models as starting points for edDCA training, and edDCA outputs are compatible with adabmDCA analysis tools.

Technical Requirements

Software Dependencies

  • Python ≥ 3.10
  • PyTorch ≥ 2.1.0 (with CUDA recommended)
  • adabmDCA == 0.5.0
  • NumPy, Pandas, Matplotlib, BioPython

Hardware Recommendations

  • GPU: NVIDIA GPU with CUDA support (recommended)
  • Minimum 4GB VRAM for small datasets
  • 8GB+ VRAM for large protein families
  • CPU: Multi-core processor for data preprocessing
  • RAM: 8GB+ depending on dataset size

Dataset Requirements

  • Multiple sequence alignment in FASTA format
  • Minimum ~1000 sequences (more is better)
  • Quality-controlled alignment (gaps, truncations handled)
  • Compatible alphabets: protein, RNA, DNA, or custom
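These requirements are easy to check before launching a training run. A pure-stdlib sketch — the helper name is ours, not part of the package, which loads alignments through its own I/O routines:

```python
def check_msa(path, min_seqs=1000):
    """Parse a FASTA alignment and verify it is usable for DCA training:
    all sequences the same length, and ideally at least `min_seqs` of them."""
    seqs, name, buf = {}, None, []
    with open(path) as fh:
        for line in fh:
            line = line.strip()
            if not line:
                continue
            if line.startswith(">"):
                if name is not None:
                    seqs[name] = "".join(buf)
                name, buf = line[1:], []
            else:
                buf.append(line)
        if name is not None:
            seqs[name] = "".join(buf)
    lengths = {len(s) for s in seqs.values()}
    if len(lengths) != 1:
        raise ValueError(f"unaligned input: lengths {sorted(lengths)}")
    if len(seqs) < min_seqs:
        print(f"warning: only {len(seqs)} sequences (~1000+ recommended)")
    return seqs
```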

Support and Community

Getting Help

  • Documentation: Read the full documentation in the docs/ folder
  • Issues: Report bugs on GitHub Issues
  • Questions: Contact robertonetti3@gmail.com

Contributing

Contributions are welcome! Areas for improvement:

  • Additional checkpoint strategies
  • Alternative decimation algorithms
  • Visualization tools for entropy analysis
  • Extended entropy computation methods
  • Documentation and examples

License

highentDCA is released under the Apache License 2.0. See LICENSE for details.

Citation

If you use highentDCA in your research, please cite:

@software{highentDCA2024,
  author = {Netti, Roberto and Weigt, Martin},
  title = {highentDCA: Entropy-decimated Direct Coupling Analysis},
  year = {2024},
  url = {https://github.com/robertonetti/highentropyDCA}
}

Ready to get started? Head to the Installation Guide →