Usage Guide

This guide covers how to use highentDCA for training entropy-decimated DCA models, including command-line interface usage, parameter tuning, and practical examples.

Quick Start

The simplest way to train an edDCA model:

highentDCA train \
    --data your_alignment.fasta \
    --output results \
    --model edDCA \
    --density 0.02

This command trains a sparse edDCA model with 2% coupling density using default parameters.

Command-Line Interface

highentDCA provides a command-line interface through the highentDCA command. Currently, the main command is train.

Basic Syntax

highentDCA train [OPTIONS]

Required Arguments

`--data` / `-d`

Path to the input multiple sequence alignment (MSA) in FASTA format.

--data example_data/PF00072.fasta

Requirements: - FASTA format with aligned sequences - Minimum ~1000 sequences recommended - All sequences must have the same length - Quality-controlled alignment (remove fragments, handle gaps)

Model Selection

`--model` / `-m`

Choose the type of DCA model to train. For highentDCA, use edDCA.

--model edDCA

Options: - bmDCA: Fully-connected Boltzmann Machine (standard DCA) - eaDCA: Edge-adding DCA (progressive sparsity) - edDCA: Entropy-decimated DCA (progressive decimation with entropy tracking)

Output Options

`--output` / `-o`

Directory where model outputs will be saved. Default: DCA_model

--output my_results

Output structure:

my_results/
├── params.dat                  # Final model parameters
├── chains.fasta               # Final Markov chains
├── adabmDCA_highent.log      # Training log
├── entropy_decimation/        # Checkpoints at different densities
│   ├── density_0.980.fasta
│   ├── density_0.587.fasta
│   └── ...
└── entropy_values.txt         # Entropy vs density data

`--label` / `-l`

Add a custom label to output files. Optional.

--label PF00072_run1

Output files will be named: PF00072_run1_params.dat, PF00072_run1_chains.fasta, etc.

edDCA-Specific Parameters

Decimation Parameters

`--density`

Target coupling density to reach (fraction of couplings to keep). Default: 0.02 (2%)

--density 0.05  # Keep 5% of couplings

Typical values: - 0.02 - Very sparse (2% of couplings) - 0.05 - Sparse (5% of couplings) - 0.10 - Moderately sparse (10% of couplings)

Guidelines: - Lower density = sparser model, faster inference - Too low may lose important information - Protein families: 2-5% usually sufficient - RNA families: 5-10% may be needed

`--drate`

Decimation rate: fraction of remaining couplings to prune at each step. Default: 0.01 (1%)

--drate 0.02  # Prune 2% of remaining couplings per step

Trade-offs: - Smaller drate (e.g., 0.005): Slower but more gradual decimation - Larger drate (e.g., 0.05): Faster but less refined - Recommended: 0.01-0.02 for most applications

`--nsweeps_dec`

Number of Monte Carlo sweeps per gradient update during decimation. Default: 10

--nsweeps_dec 20

Guidelines: - Increase for better equilibration (slower training) - Decrease for faster training (may reduce accuracy) - Typical range: 5-50

Entropy Computation Parameters

At pre-defined density checkpoints, highentDCA computes model entropy using thermodynamic integration.

`--theta_max`

Maximum integration strength for thermodynamic integration. Default: 5.0

--theta_max 10.0

Higher values provide better integration range but require more sampling.

`--nsteps`

Number of integration steps for entropy computation. Default: 100

--nsteps 200

More steps = more accurate entropy estimate (but slower).

`--nsweeps_step`

Number of MC sweeps per integration step. Default: 100

--nsweeps_step 50

`--nsweeps_theta`

Number of sweeps to equilibrate at θ_max. Default: 100

--nsweeps_theta 200

`--nsweeps_zero`

Number of sweeps to equilibrate at θ=0. Default: 100

--nsweeps_zero 200

Entropy computation tips: - Increase all nsweeps values for better accuracy - Decrease for faster (but less accurate) entropy estimates - Default values are usually sufficient

General Training Parameters

These parameters control the overall training process.

Convergence Criteria

`--target` / `-t`

Target Pearson correlation between model and data statistics. Default: 0.95

--target 0.98  # Stricter convergence

Guidelines: - 0.90-0.95: Standard convergence - 0.95-0.98: High accuracy (slower training) - >0.98: Very strict (may not converge for complex families)

`--nepochs`

Maximum number of training epochs. Default: 50000

--nepochs 100000

Training stops when either --target or --nepochs is reached.

Sampling Parameters

`--sampler`

MCMC sampling method. Default: gibbs

--sampler metropolis

Options: - gibbs: Gibbs sampling (default, usually faster) - metropolis: Metropolis-Hastings sampling

`--nsweeps`

Number of MC sweeps per gradient update (before decimation starts). Default: 10

--nsweeps 20

More sweeps = better gradient estimates but slower training.

`--nchains`

Number of parallel Markov chains for sampling. Default: 10000

--nchains 5000   # Use fewer chains (less memory)
--nchains 20000  # Use more chains (better statistics)

Guidelines: - More chains = better statistics, more GPU memory - Fewer chains = faster, less memory - Typical range: 5000-20000

Optimization Parameters

`--lr`

Learning rate for gradient descent. Default: 0.01

--lr 0.005  # Slower, more stable
--lr 0.02   # Faster, may be less stable

Guidelines: - Start with 0.01 - Decrease if training is unstable - Increase for faster convergence (if stable)

Regularization

`--pseudocount`

Pseudocount for smoothing empirical frequencies. Default: None (automatic: 1/Meff)

--pseudocount 0.5

Acts as regularization to prevent overfitting.

Sequence Processing

Alphabet

`--alphabet`

Sequence alphabet/encoding. Default: protein

--alphabet protein  # Standard 20 amino acids + gap
--alphabet rna      # RNA: ACGU + gap
--alphabet dna      # DNA: ACGT + gap
--alphabet "ACDEFG" # Custom alphabet

Built-in alphabets: - protein: ACDEFGHIKLMNPQRSTVWY- - rna: ACGU- - dna: ACGT-

Custom alphabets must include all characters present in the alignment.

Sequence Reweighting

Sequence reweighting reduces phylogenetic bias in the dataset.

`--weights` / `-w`

Path to pre-computed sequence weights file. Optional.

--weights sequence_weights.txt

File format: one weight per line, same order as sequences in FASTA.

`--clustering_seqid`

Sequence identity threshold for automatic reweighting. Default: 0.8 (80%)

--clustering_seqid 0.9  # Cluster at 90% identity

Sequences with identity ≥ threshold are clustered, and cluster members share weight.

`--no_reweighting`

Disable automatic sequence reweighting.

--no_reweighting

Use if your alignment is already diversity-corrected or unweighted analysis is desired.

Checkpoint Options

Checkpoint Strategy

`--checkpoints`

Checkpoint strategy for saving model state. Default: linear

--checkpoints linear      # Save every N epochs
--checkpoints acceptance  # Save when acceptance rate changes

For edDCA, checkpoints are also triggered at pre-defined density thresholds for entropy computation.

`--target_acc_rate`

Target acceptance rate for acceptance-based checkpoints. Default: 0.5

--target_acc_rate 0.6

Only used when --checkpoints acceptance.

Experiment Tracking

Weights & Biases

`--wandb`

Enable Weights & Biases logging for experiment tracking.

--wandb

Requires W&B account and login (wandb login).

Logs: - Training metrics (Pearson, entropy, density) - System metrics (GPU usage, time) - Model parameters and outputs

Computational Settings

Device Selection

`--device`

Computation device. Default: cuda

--device cuda  # Use GPU
--device cpu   # Use CPU

GPU is strongly recommended for large datasets.

Data Type

`--dtype`

Numerical precision. Default: float32

--dtype float32  # Standard precision (faster)
--dtype float64  # Double precision (more accurate)

float32 is usually sufficient and faster.

Advanced Options

Restoration and Continuation

`--path_params` / `-p`

Path to existing model parameters to restore training.

--path_params previous_run/params.dat

`--path_chains` / `-c`

Path to existing chains for restoration.

--path_chains previous_run/chains.fasta

Use case: Continue training from a checkpoint or use a pre-trained bmDCA as starting point.

Test Set Evaluation

`--test`

Path to test set MSA for evaluation during training.

--test test_sequences.fasta

Test log-likelihood will be computed and logged (requires additional computation).

Random Seed

`--seed`

Random seed for reproducibility. Default: 0

--seed 42

Use different seeds for multiple independent runs.

Complete Example Commands

Example 1: Basic Training

Train a sparse edDCA model with default settings:

highentDCA train \
    --data protein_family.fasta \
    --output results/protein_edDCA \
    --model edDCA \
    --density 0.02 \
    --seed 42

Example 2: High-Accuracy Training

Train with stricter convergence and more sampling:

highentDCA train \
    --data protein_family.fasta \
    --output results/high_accuracy \
    --model edDCA \
    --density 0.03 \
    --target 0.98 \
    --nchains 20000 \
    --nsweeps 20 \
    --nsweeps_dec 20 \
    --lr 0.005 \
    --seed 42

Example 3: Fast Exploration

Quick training for exploratory analysis:

highentDCA train \
    --data protein_family.fasta \
    --output results/fast_run \
    --model edDCA \
    --density 0.05 \
    --drate 0.02 \
    --nchains 5000 \
    --nsweeps 5 \
    --nsweeps_dec 5 \
    --target 0.90 \
    --nsteps 50 \
    --nsweeps_step 50

Example 4: RNA Family

Train on RNA alignment with custom parameters:

highentDCA train \
    --data rna_family.fasta \
    --output results/rna_edDCA \
    --model edDCA \
    --alphabet rna \
    --density 0.08 \
    --drate 0.01 \
    --clustering_seqid 0.85 \
    --seed 123

Example 5: With Weights & Biases Tracking

Track experiments with W&B:

highentDCA train \
    --data protein_family.fasta \
    --output results/wandb_run \
    --model edDCA \
    --density 0.02 \
    --label experiment_001 \
    --wandb \
    --seed 42

Example 6: Continue from bmDCA

Start from pre-trained bmDCA model:

# First train bmDCA (or use existing)
highentDCA train \
    --data protein_family.fasta \
    --output bmdca_model \
    --model bmDCA \
    --target 0.95

# Then decimate it
highentDCA train \
    --data protein_family.fasta \
    --output eddca_model \
    --model edDCA \
    --path_params bmdca_model/params.dat \
    --path_chains bmdca_model/chains.fasta \
    --density 0.02

Analyzing Results

Reading Training Logs

The log file (adabmDCA_highent.log) contains training progress:

cat results/adabmDCA_highent.log

Example output:

Epochs     Pearson    Entropy    Density    Time      
0          0.950      125.456    1.000      0.000
50         0.955      120.123    0.587      120.450
100        0.953      115.678    0.359      250.890

Extracting Entropy Values

Entropy vs. density data is saved in entropy_values.txt:

import pandas as pd
import matplotlib.pyplot as plt

# Load data
df = pd.read_csv('results/entropy_values.txt', sep='\t')

# Plot
plt.figure(figsize=(10, 6))
plt.plot(df['Density'], df['Entropy'], 'o-', linewidth=2, markersize=8)
plt.xlabel('Coupling Density', fontsize=14)
plt.ylabel('Model Entropy', fontsize=14)
plt.title('Entropy Evolution During Decimation', fontsize=16)
plt.grid(True, alpha=0.3)
plt.tight_layout()
plt.savefig('entropy_evolution.png', dpi=300)
plt.show()

Loading Model Parameters

Load trained parameters for analysis:

from adabmDCA.io import load_params
import torch

# Load parameters
params, tokens, L, q = load_params('results/params.dat')

# Access fields and couplings
fields = params['bias']  # Shape: (L, q)
couplings = params['coupling_matrix']  # Shape: (L, q, L, q)

print(f"Sequence length: {L}")
print(f"Alphabet size: {q}")
print(f"Number of non-zero couplings: {(couplings != 0).sum().item()}")

Visualizing Coupling Matrix

Plot the sparse coupling matrix:

import numpy as np
import matplotlib.pyplot as plt
from adabmDCA.io import load_params

# Load parameters
params, tokens, L, q = load_params('results/params.dat')
couplings = params['coupling_matrix'].cpu().numpy()

# Compute Frobenius norm for each coupling
coupling_strength = np.linalg.norm(couplings.reshape(L, q, L, q), axis=(1, 3))

# Plot
plt.figure(figsize=(10, 10))
plt.imshow(coupling_strength, cmap='viridis', interpolation='nearest')
plt.colorbar(label='Coupling Strength')
plt.xlabel('Position j')
plt.ylabel('Position i')
plt.title('Sparse Coupling Matrix (edDCA)')
plt.tight_layout()
plt.savefig('coupling_matrix.png', dpi=300)
plt.show()

Troubleshooting

Training doesn't converge

Solutions: - Decrease learning rate: --lr 0.005 - Increase sweeps: --nsweeps 20 --nsweeps_dec 20 - Relax target: --target 0.93 - Check data quality: remove fragments, check alignment

Out of memory errors

Solutions: - Reduce number of chains: --nchains 5000 - Use float32: --dtype float32 (default) - Use smaller batch size for GPU - Monitor with: nvidia-smi -l 1

Training is too slow

Solutions: - Ensure GPU is being used: check --device cuda - Reduce accuracy requirements: --target 0.93 - Use fewer chains: --nchains 8000 - Increase decimation rate: --drate 0.02 - Reduce entropy computation accuracy

Entropy computation fails

Solutions: - Increase equilibration: --nsweeps_zero 100 --nsweeps_theta 100 - Reduce theta_max: --theta_max 3.0 - Check data quality and convergence

Best Practices

Start with defaults: Begin with default parameters and adjust based on results
Monitor convergence: Check Pearson correlation in logs
Use appropriate density: 2-5% for most protein families
Save checkpoints: Use --label to organize multiple runs
Validate results: Check entropy evolution makes sense
Use test sets: Evaluate generalization with --test
Set seeds: Use --seed for reproducibility
Track experiments: Use --wandb for complex parameter searches

Next Steps

API Reference: Use highentDCA in Python scripts
Checkpoint Documentation: Understand checkpoint strategies
edDCA Model Documentation: Deep dive into the algorithm

Getting Help

If you encounter issues:

Check the log file for error messages
Review GitHub Issues
Contact: robertonetti3@gmail.com