Welcome to highentDCA Documentation
highentDCA is a specialized Python package for training entropy-decimated Direct Coupling Analysis (edDCA) models on biological sequence data. It extends the adabmDCA framework to enable efficient training of sparse Potts models while tracking their entropy throughout the decimation process.
About this Documentation
This documentation provides a comprehensive guide to using highentDCA for training sparse DCA models with entropy tracking. It complements the adabmDCA documentation and focuses specifically on the entropy decimation features.
What is highentDCA?
highentDCA implements the entropy-decimated DCA (edDCA) algorithm, a method that:
- Progressively prunes couplings from a fully-connected Boltzmann Machine (bmDCA)
- Maintains model accuracy while reducing coupling density to target levels
- Computes entropy at key decimation checkpoints using thermodynamic integration
- Provides insights into the relationship between model complexity and information content
This approach is particularly valuable for:
- Understanding which interactions are essential for capturing sequence statistics
- Building interpretable sparse models for protein families
- Studying the thermodynamics of statistical models
- Reducing computational requirements for downstream applications
Core Features
🔬 Entropy Decimation (edDCA)
The main feature of highentDCA is the ability to train edDCA models that:
- Start from a converged bmDCA model (or train one automatically)
- Iteratively remove the least important couplings based on empirical statistics
- Re-equilibrate after each decimation step to maintain accuracy
- Track coupling density, Pearson correlation, and model entropy throughout the process
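As a rough illustration, one decimation step can be sketched in plain Python. The function and argument names here are illustrative only, not the actual highentDCA API:

```python
def decimate_step(couplings, active, empirical_c2, fraction=0.01):
    """Zero out the `fraction` of active couplings whose empirical
    two-point correlation has the smallest magnitude."""
    # Rank active edges by |empirical correlation|, weakest first.
    ranked = sorted(active, key=lambda edge: abs(empirical_c2[edge]))
    n_remove = max(1, int(fraction * len(active)))
    for edge in ranked[:n_remove]:
        couplings[edge] = 0.0   # prune the coupling
        active.discard(edge)    # drop the edge from the active graph
    return couplings, active
```

In the real algorithm this is followed by re-equilibration and re-convergence, so the remaining couplings can compensate for the removed ones.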
📊 Thermodynamic Integration
At pre-defined density checkpoints, highentDCA computes model entropy using thermodynamic integration:
- Introduces a bias parameter θ towards a target sequence
- Integrates average sequence identity from θ=0 to θ_max
- Uses careful equilibration to ensure accurate entropy estimates
- Saves entropy values for downstream analysis
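Once the average identity has been sampled at each bias value, the integration itself reduces to one-dimensional quadrature over the θ grid. A minimal trapezoid-rule sketch, assuming the identity values have already been estimated by MCMC at each θ:

```python
def integrate_identity(thetas, mean_identity):
    """Trapezoid-rule integral of <identity>(theta) over the bias grid,
    as used in the thermodynamic-integration entropy estimate."""
    total = 0.0
    for i in range(len(thetas) - 1):
        dt = thetas[i + 1] - thetas[i]
        total += 0.5 * (mean_identity[i] + mean_identity[i + 1]) * dt
    return total
```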
💾 Flexible Checkpointing
Multiple checkpoint strategies for saving model state:
- Linear checkpointing: Save at regular intervals
- Density-based checkpointing: Save at specific coupling densities
- Automatic saving of model parameters, chains, and statistics
- Optional integration with Weights & Biases for experiment tracking
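The density-based strategy can be pictured as a small bookkeeping helper: since decimation only ever lowers the density, a target is "due" the first time the current density drops to or below it. This is a hypothetical sketch, not the package's internal checkpoint logic:

```python
def crossed_checkpoints(density, targets, saved):
    """Return target densities just crossed from above and not yet saved.

    `saved` is mutated so each target triggers a save exactly once.
    """
    due = [t for t in targets if density <= t and t not in saved]
    saved.update(due)
    return due
```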
🚀 GPU Acceleration
Built on PyTorch for efficient computation:
- GPU-accelerated training and sampling
- Efficient parallel Markov chain Monte Carlo
- Automatic device management (CUDA/CPU)
- Support for mixed precision (float32/float64)
How edDCA Works
The entropy decimation algorithm follows these steps:
1. Initialization: Start with a converged bmDCA model or train one
2. Decimation: Remove a fraction of couplings with the smallest empirical two-point correlations
3. Equilibration: Run MCMC to equilibrate chains on the decimated graph
4. Re-convergence: Perform gradient descent to match data statistics
5. Entropy Computation: At checkpoints, compute entropy via thermodynamic integration
6. Iteration: Repeat steps 2-5 until the target density is reached
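The steps above can be sketched as a single outer loop. Everything here is a hypothetical skeleton: the heavy lifting (decimate, equilibrate, reconverge, compute_entropy) is passed in as callables, since those routines are internal to the package:

```python
def eddca_loop(model, target_density, step_fraction, checkpoints,
               decimate, equilibrate, reconverge, compute_entropy):
    """Skeleton of the edDCA outer loop (steps 2-6)."""
    entropies = {}
    checkpoints = sorted(checkpoints, reverse=True)
    while model["density"] > target_density:
        decimate(model, step_fraction)   # step 2: prune weakest couplings
        equilibrate(model)               # step 3: MCMC on the pruned graph
        reconverge(model)                # step 4: refit to data statistics
        while checkpoints and model["density"] <= checkpoints[0]:
            entropies[checkpoints.pop(0)] = compute_entropy(model)  # step 5
    return entropies                     # step 6: loop until target density
```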

Key Advantages
Sparsity with Accuracy
edDCA achieves high sparsity while maintaining model accuracy:
- Typical models retain only 2-5% of couplings
- Pearson correlation with data statistics remains >0.95
- Essential interactions are preserved
Entropy Tracking
Understanding model information content:
- Entropy decreases as couplings are removed
- Rate of decrease reveals importance of interactions
- Provides thermodynamic insights into model complexity
Computational Efficiency
Sparse models are faster for downstream applications:
- Reduced memory footprint
- Faster sampling and energy computation
- Easier interpretation and visualization
Use Cases
Protein Contact Prediction
edDCA models can identify essential residue-residue contacts:
```bash
highentDCA train \
    --data protein_family.fasta \
    --model edDCA \
    --density 0.05 \
    --output contact_prediction
```
Mutational Effect Prediction
Sparse models capture key constraints for sequence function:
```bash
highentDCA train \
    --data enzyme_family.fasta \
    --model edDCA \
    --density 0.03 \
    --alphabet protein
```
Thermodynamic Analysis
Study entropy evolution during decimation:
```bash
highentDCA train \
    --data sequence_data.fasta \
    --model edDCA \
    --theta_max 5.0 \
    --nsteps 100
```
Getting Started
New to highentDCA? Follow these steps:
1. Installation: Set up the package and dependencies
2. Quick Start: Train your first edDCA model
3. CLI Reference: Explore all available options
4. API Documentation: Use highentDCA in Python scripts
5. Examples: Learn from practical examples
Comparison with bmDCA
| Feature | bmDCA | edDCA (highentDCA) |
|---|---|---|
| Coupling density | 100% (fully connected) | 2-10% (sparse) |
| Training time | Standard | Longer (includes decimation) |
| Memory usage | High | Lower (sparse parameters) |
| Interpretability | Complex | Better (fewer interactions) |
| Entropy tracking | No | Yes |
| Use case | General-purpose | Sparsity & thermodynamics |
Integration with adabmDCA
highentDCA is built on top of adabmDCA and shares:
- Data formats: Compatible FASTA input and parameter formats
- Sampling methods: Same Gibbs and Metropolis samplers
- Statistics functions: Identical frequency and correlation computations
- Utilities: Common helper functions for encoding, I/O, etc.
You can use adabmDCA models as starting points for edDCA training, and edDCA outputs are compatible with adabmDCA analysis tools.
Technical Requirements
Software Dependencies
- Python ≥ 3.10
- PyTorch ≥ 2.1.0 (with CUDA recommended)
- adabmDCA == 0.5.0
- NumPy, Pandas, Matplotlib, BioPython
Hardware Recommendations
- GPU: NVIDIA GPU with CUDA support (recommended)
- Minimum 4GB VRAM for small datasets
- 8GB+ VRAM for large protein families
- CPU: Multi-core processor for data preprocessing
- RAM: 8GB+ depending on dataset size
Dataset Requirements
- Multiple sequence alignment in FASTA format
- Minimum ~1000 sequences (more is better)
- Quality-controlled alignment (gaps, truncations handled)
- Compatible alphabets: protein, RNA, DNA, or custom
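A basic alignment sanity check can be done in a few lines of plain Python before training. This is a generic FASTA sketch for illustration, not part of the highentDCA API:

```python
def read_msa(lines):
    """Minimal FASTA parser for an aligned MSA.

    Returns (headers, sequences) and verifies that every sequence
    shares the same alignment length.
    """
    headers, seqs, buf = [], [], []
    for line in lines:
        line = line.strip()
        if line.startswith(">"):
            if buf:                      # flush the previous record
                seqs.append("".join(buf))
                buf = []
            headers.append(line[1:])
        elif line:
            buf.append(line)
    if buf:
        seqs.append("".join(buf))
    if len({len(s) for s in seqs}) != 1:
        raise ValueError("sequences do not share one alignment length")
    return headers, seqs
```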
Support and Community
Getting Help
- Documentation: Read the full documentation in the docs/ folder
- Issues: Report bugs on GitHub Issues
- Questions: Contact robertonetti3@gmail.com
Contributing
Contributions are welcome! Areas for improvement:
- Additional checkpoint strategies
- Alternative decimation algorithms
- Visualization tools for entropy analysis
- Extended entropy computation methods
- Documentation and examples
Related Resources
Papers
- Entropy Decimation: Barrat-Charlaix et al., 2021
- Adaptive bmDCA: Muntoni et al., 2021
- DCA for Contact Prediction: Ekeberg et al., 2013
Software
- adabmDCApy: Python implementation
- adabmDCA.jl: Julia implementation
- adabmDCAc: C++ implementation
License
highentDCA is released under the Apache License 2.0. See LICENSE for details.
Citation
If you use highentDCA in your research, please cite:
```bibtex
@software{highentDCA2024,
  author = {Netti, Roberto and Weigt, Martin},
  title  = {highentDCA: Entropy-decimated Direct Coupling Analysis},
  year   = {2024},
  url    = {https://github.com/robertonetti/highentropyDCA}
}
```
Ready to get started? Head to the Installation Guide →