Welcome to highentDCA Documentation

highentDCA is a specialized Python package for training entropy-decimated Direct Coupling Analysis (edDCA) models on biological sequence data. It extends the adabmDCA framework to train sparse Potts models efficiently while tracking how their entropy evolves during the decimation process.

About this Documentation

This documentation provides a comprehensive guide to using highentDCA for training sparse DCA models with entropy tracking. It complements the adabmDCA documentation and focuses specifically on the entropy decimation features.

What is highentDCA?

highentDCA implements the entropy-decimated DCA (edDCA) algorithm, a method that:

  • Progressively prunes couplings from a fully-connected Boltzmann Machine (bmDCA)
  • Maintains model accuracy while reducing coupling density to target levels
  • Computes entropy at key decimation checkpoints using thermodynamic integration
  • Provides insights into the relationship between model complexity and information content

This approach is particularly valuable for:

  • Understanding which interactions are essential for capturing sequence statistics
  • Building interpretable sparse models for protein families
  • Studying the thermodynamics of statistical models
  • Reducing computational requirements for downstream applications

Core Features

🔬 Entropy Decimation (edDCA)

The main feature of highentDCA is the ability to train edDCA models that:

  • Start from a converged bmDCA model (or train one automatically)
  • Iteratively remove the least important couplings based on empirical statistics
  • Re-equilibrate after each decimation step to maintain accuracy
  • Track coupling density, Pearson correlation, and model entropy throughout the process

📊 Thermodynamic Integration

At pre-defined density checkpoints, highentDCA computes model entropy using thermodynamic integration:

  • Introduces a bias parameter θ towards a target sequence
  • Integrates average sequence identity from θ=0 to θ_max
  • Uses careful equilibration to ensure accurate entropy estimates
  • Saves entropy values for downstream analysis
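The integral above can be evaluated numerically once the average identity ⟨q⟩_θ has been sampled on a grid of θ values. A minimal sketch assuming β = 1, a sequence length L, the energy e_ref of the target sequence, and the mean energy of the unbiased model; none of these argument names come from the highentDCA API:

```python
import numpy as np

def entropy_via_ti(theta, mean_q, L, e_ref, mean_e):
    """Entropy of the unbiased model from thermodynamic integration.

    Uses log Z(theta_max) - log Z(0) = ∫ L * <q>_theta dtheta, and the
    fact that for large theta_max the biased measure concentrates on the
    target sequence, so log Z(theta_max) ≈ -e_ref + theta_max * L.
    The entropy then follows from S = log Z(0) + <E>_0 (beta = 1).
    """
    x = np.asarray(theta, dtype=float)
    y = L * np.asarray(mean_q, dtype=float)
    # trapezoid rule over the theta grid
    delta_logz = 0.5 * np.sum((y[1:] + y[:-1]) * np.diff(x))
    logz0 = (-e_ref + x[-1] * L) - delta_logz
    return logz0 + mean_e
```

As a sanity check: if the chains are already frozen on the target sequence (⟨q⟩ = 1 at every θ and ⟨E⟩₀ = e_ref), the formula returns an entropy of exactly zero.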

💾 Flexible Checkpointing

Multiple checkpoint strategies for saving model state:

  • Linear checkpointing: Save at regular intervals
  • Density-based checkpointing: Save at specific coupling densities
  • Automatic saving of model parameters, chains, and statistics
  • Optional integration with Weights & Biases for experiment tracking
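The two schedules can be sketched as follows. A hypothetical illustration — the function names and default values are ours, not the package's:

```python
import numpy as np

def linear_checkpoints(n_updates, every=100):
    """Linear schedule: save model state every `every` gradient updates."""
    return list(range(every, n_updates + 1, every))

def density_checkpoints(start=1.0, target=0.02, n_points=10):
    """Density schedule: log-spaced coupling densities between the fully
    connected model (density 1.0) and the target density."""
    return np.geomspace(start, target, n_points)
```

A log-spaced density schedule is a natural choice here because most of the interesting entropy change happens at low densities, where linear spacing would place too few checkpoints.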

🚀 GPU Acceleration

Built on PyTorch for efficient computation:

  • GPU-accelerated training and sampling
  • Efficient parallel Markov chain Monte Carlo
  • Automatic device management (CUDA/CPU)
  • Support for mixed precision (float32/float64)

How edDCA Works

The entropy decimation algorithm follows these steps:

  1. Initialization: Start with a converged bmDCA model or train one
  2. Decimation: Remove a fraction of couplings with smallest empirical two-point correlations
  3. Equilibration: Run MCMC to equilibrate chains on the decimated graph
  4. Re-convergence: Perform gradient descent to match data statistics
  5. Entropy Computation: At checkpoints, compute entropy via thermodynamic integration
  6. Iteration: Repeat steps 2-5 until target density is reached
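Steps 2 and 6 amount to a pruning loop. A toy NumPy sketch, with the equilibration and re-convergence steps (3–4) left as comments; the function and its arguments are illustrative, not the highentDCA API:

```python
import numpy as np

def decimation_schedule(corr, drop_frac=0.1, target_density=0.05):
    """Prune the active couplings with the smallest empirical two-point
    correlations until the target density is reached (toy version)."""
    mask = np.ones_like(corr, dtype=bool)   # True = coupling still active
    n_total = mask.size
    while mask.sum() / n_total > target_density:
        active = np.flatnonzero(mask)
        n_drop = max(1, int(drop_frac * active.size))
        # step 2: rank active couplings by |empirical correlation|
        order = np.argsort(np.abs(corr.flat[active]))
        mask.flat[active[order[:n_drop]]] = False   # remove the weakest
        # steps 3-4 would go here: re-equilibrate the MCMC chains on the
        # decimated graph, then run gradient descent to re-match the data
        # statistics before the next pruning round
    return mask
```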

[Figure: the decimation process]

Key Advantages

Sparsity with Accuracy

edDCA achieves high sparsity while maintaining model accuracy:

  • Typical models retain only 2-5% of couplings
  • Pearson correlation with data statistics remains >0.95
  • Essential interactions are preserved

Entropy Tracking

Understanding model information content:

  • Entropy decreases as couplings are removed
  • Rate of decrease reveals importance of interactions
  • Provides thermodynamic insights into model complexity

Computational Efficiency

Sparse models are faster for downstream applications:

  • Reduced memory footprint
  • Faster sampling and energy computation
  • Easier interpretation and visualization

Use Cases

Protein Contact Prediction

edDCA models can identify essential residue-residue contacts:

highentDCA train \
    --data protein_family.fasta \
    --model edDCA \
    --density 0.05 \
    --output contact_prediction

Mutational Effect Prediction

Sparse models capture key constraints for sequence function:

highentDCA train \
    --data enzyme_family.fasta \
    --model edDCA \
    --density 0.03 \
    --alphabet protein
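Once trained, a sparse Potts model scores a point mutation by the energy change it induces, using the usual convention E(x) = −Σᵢ hᵢ(xᵢ) − Σᵢ<ⱼ Jᵢⱼ(xᵢ, xⱼ). The helper below is an illustrative sketch, not part of the highentDCA API:

```python
import numpy as np

def mutation_score(h, J, seq, pos, new_aa):
    """Energy change dE = E(mutant) - E(wild-type) for a single point
    mutation at `pos`; more negative = more favourable. `h` has shape
    (L, q), `J` has shape (L, L, q, q), `seq` is a list of symbol indices."""
    old = seq[pos]
    dE = h[pos, old] - h[pos, new_aa]          # field contribution
    for j in range(len(seq)):                  # coupling contributions
        if j != pos:
            dE += J[pos, j, old, seq[j]] - J[pos, j, new_aa, seq[j]]
    return dE
```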

Thermodynamic Analysis

Study entropy evolution during decimation:

highentDCA train \
    --data sequence_data.fasta \
    --model edDCA \
    --theta_max 5.0 \
    --nsteps 100

Getting Started

New to highentDCA? Follow these steps:

  1. Installation: Set up the package and dependencies
  2. Quick Start: Train your first edDCA model
  3. CLI Reference: Explore all available options
  4. API Documentation: Use highentDCA in Python scripts
  5. Examples: Learn from practical examples

Comparison with bmDCA

Feature             bmDCA                     edDCA (highentDCA)
------------------  ------------------------  ----------------------------
Coupling density    100% (fully connected)    2-10% (sparse)
Training time       Standard                  Longer (includes decimation)
Memory usage        High                      Lower (sparse parameters)
Interpretability    Complex                   Better (fewer interactions)
Entropy tracking    No                        Yes
Use case            General-purpose           Sparsity & thermodynamics

Integration with adabmDCA

highentDCA is built on top of adabmDCA and shares:

  • Data formats: Compatible FASTA input and parameter formats
  • Sampling methods: Same Gibbs and Metropolis samplers
  • Statistics functions: Identical frequency and correlation computations
  • Utilities: Common helper functions for encoding, I/O, etc.

You can use adabmDCA models as starting points for edDCA training, and edDCA outputs are compatible with adabmDCA analysis tools.

Technical Requirements

Software Dependencies

  • Python ≥ 3.10
  • PyTorch ≥ 2.1.0 (with CUDA recommended)
  • adabmDCA == 0.5.0
  • NumPy, Pandas, Matplotlib, BioPython

Hardware Recommendations

  • GPU: NVIDIA GPU with CUDA support (recommended)
  • Minimum 4GB VRAM for small datasets
  • 8GB+ VRAM for large protein families
  • CPU: Multi-core processor for data preprocessing
  • RAM: 8GB+ depending on dataset size

Dataset Requirements

  • Multiple sequence alignment in FASTA format
  • Minimum ~1000 sequences (more is better)
  • Quality-controlled alignment (gaps, truncations handled)
  • Compatible alphabets: protein, RNA, DNA, or custom
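These requirements are easy to check before launching a training run. A pure-stdlib sketch — the helper name is ours, not part of the package, which loads alignments through its own I/O routines:

```python
def check_msa(path, min_seqs=1000):
    """Parse a FASTA alignment and verify it is usable for DCA training:
    all sequences the same length, and ideally at least `min_seqs` of them."""
    seqs, name, buf = {}, None, []
    with open(path) as fh:
        for line in fh:
            line = line.strip()
            if not line:
                continue
            if line.startswith(">"):
                if name is not None:
                    seqs[name] = "".join(buf)
                name, buf = line[1:], []
            else:
                buf.append(line)
        if name is not None:
            seqs[name] = "".join(buf)
    lengths = {len(s) for s in seqs.values()}
    if len(lengths) != 1:
        raise ValueError(f"unaligned input: lengths {sorted(lengths)}")
    if len(seqs) < min_seqs:
        print(f"warning: only {len(seqs)} sequences (~1000+ recommended)")
    return seqs
```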

Support and Community

Getting Help

  • Documentation: Read the full documentation in the docs/ folder
  • Issues: Report bugs on GitHub Issues
  • Questions: Contact robertonetti3@gmail.com

Contributing

Contributions are welcome! Areas for improvement:

  • Additional checkpoint strategies
  • Alternative decimation algorithms
  • Visualization tools for entropy analysis
  • Extended entropy computation methods
  • Documentation and examples

License

highentDCA is released under the Apache License 2.0. See LICENSE for details.

Citation

If you use highentDCA in your research, please cite:

@software{highentDCA2024,
  author = {Netti, Roberto and Weigt, Martin},
  title = {highentDCA: Entropy-decimated Direct Coupling Analysis},
  year = {2024},
  url = {https://github.com/robertonetti/highentropyDCA}
}

Ready to get started? Head to the Installation Guide →