xmipp3.protocols.protocol_screen_deepConsensus module
Deep Consensus picking protocol
- class xmipp3.protocols.protocol_screen_deepConsensus.XmippProtDeepConsSubSet(**args)[source]
Bases: ProtUserSubSet
Create subsets from the GUI for the Deep Consensus protocol. This protocol is mainly executed by calling the script ‘pw_create_image_subsets.py’ from the ShowJ GUI. The enabled/disabled changes are stored in a temporary SQLite file, which is then read to create the new subset.
- class xmipp3.protocols.protocol_screen_deepConsensus.XmippProtScreenDeepConsensus(**args)[source]
Bases: ProtParticlePicking, XmippProtocol
Protocol to compute a smart consensus between different particle picking algorithms. The protocol takes several sets of coordinates calculated by different programs and/or different parameter settings; call these N independent pickings. A neural network is then trained using different subsets of picked and non-picked coordinates. Finally, a coordinate is considered a correct particle according to the neural network predictions. In streaming mode, the network is trained and used for prediction in batches. Training continues until the configured number of particles is reached; meanwhile, a preliminary output is generated. Once that number is reached, the final output is produced in batches.
AI Generated
## Overview
The Deep Consensus Picking protocol combines several particle-picking results using a deep-learning classifier.
Different particle pickers, or the same picker with different parameters, may produce different coordinate sets. Some particles are selected by all pickers, some are selected by only one picker, and some false positives appear only in specific picking results. Classical consensus picking keeps coordinates based on voting rules, but Deep Consensus goes one step further: it uses the agreement and disagreement between coordinate sets to train a neural network that scores candidate particles.
The protocol first creates candidate coordinates from the union of the input picking results. It also creates highly reliable positive examples from coordinates supported by multiple pickers, and negative examples from background/noise regions. These examples are used to train a convolutional neural network. The trained network then assigns a score between 0 and 1 to candidate particles, and the protocol keeps those above the selected threshold.
The main outputs are filtered particle coordinates and the corresponding extracted particles.
## Inputs and General Workflow
The main input is a list of coordinate sets.
The protocol uses the micrographs associated with those coordinate sets. It preprocesses the micrographs, extracts candidate particles at a fixed internal size of 128 × 128 pixels, trains or loads a neural-network model, scores the candidate particles, and creates final outputs.
The workflow can be summarized as follows:
1. Collect input coordinate sets.
2. Build consensus coordinate groups.
3. Preprocess the associated micrographs.
4. Extract particles from candidate coordinates.
5. Generate positive and negative training examples.
6. Train, continue, or load a deep-learning model.
7. Score candidate particles.
8. Keep particles whose score passes the selected threshold.
9. Output both coordinates and particles.
The protocol also supports streaming. In streaming mode, training and prediction are performed in batches as new micrographs and coordinates arrive.
## Input Coordinates
The Input coordinates parameter contains the coordinate sets to be combined and screened.
These coordinate sets may come from different picking protocols, different algorithms, different parameter settings, or different manual/automatic strategies.
The protocol assumes that the coordinate sets refer to the same micrographs or to overlapping micrograph sets. It uses the first coordinate set as the main source for box-size and micrograph information.
For training a new model, more than one coordinate set is normally required. If only one coordinate set is provided and training is requested, the protocol reports a validation error, because it cannot derive a meaningful internal consensus between pickers.
## Model Type
The Select model type parameter controls how the neural network is initialized.
There are three options:
- New: starts from a randomly initialized model and trains it from the current input data.
- Pretrained: starts from a pretrained Deep Consensus model.
- PreviousRun: reuses a model trained in a previous Deep Consensus run within the same Scipion project.
The best choice depends on the amount and quality of available training data. A new model is appropriate when there are enough reliable positive and negative examples. A pretrained or previous model is useful when the user wants to reuse prior training or score coordinates directly.
## Previous Run
When PreviousRun is selected, the Select previous run parameter defines which earlier Deep Consensus run provides the model.
This allows the user to continue from a previous model or apply a model already trained in the same project.
This option is useful when a dataset is processed in several stages, when training was already performed in a previous run, or when the same picking behavior should be applied consistently to new micrographs.
## Skip Training
The “Skip training and score directly with pretrained model?” option is available when a pretrained or previous model is used.
If enabled, the protocol does not train the network again. It directly scores the candidate particles using the selected model.
This is useful when the model is already considered appropriate for the data. If disabled, the protocol continues training using the current training data.
Users should skip training only when the input data are compatible with the model being reused. A model trained on a different specimen, contrast convention, box size, or preprocessing strategy may not score particles reliably.
## Relative Radius
The Relative Radius parameter defines how close coordinates from different input sets must be to be considered the same particle.
The value is expressed as a fraction of the particle size. For example, a value of 0.1 means that coordinates within 10% of the particle box size are treated as corresponding to the same candidate particle.
This radius is used when creating consensus coordinate groups. If the radius is too small, coordinates that correspond to the same particle may fail to merge. If it is too large, nearby distinct particles may be merged incorrectly.
The default value is intended to capture small picker-to-picker differences without merging clearly separate particles.
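The grouping described above can be illustrated with a minimal sketch, assuming a simple greedy merge. The function name `consensus_groups`, the tuple-based coordinates, and the running-centroid update are illustrative assumptions, not the protocol's actual implementation.

```python
from math import hypot


def consensus_groups(coord_sets, box_size, relative_radius=0.1):
    """Greedily merge picks from several pickers that fall within the radius.

    coord_sets: one list of (x, y) tuples per picker (hypothetical input format).
    Returns a list of groups with a centroid and the set of supporting pickers.
    """
    radius = relative_radius * box_size  # radius in pixels
    groups = []
    for picker_id, coords in enumerate(coord_sets):
        for x, y in coords:
            for g in groups:
                if hypot(g["x"] - x, g["y"] - y) <= radius:
                    # Update the running centroid and record this picker's vote.
                    g["x"] = (g["x"] * g["n"] + x) / (g["n"] + 1)
                    g["y"] = (g["y"] * g["n"] + y) / (g["n"] + 1)
                    g["n"] += 1
                    g["pickers"].add(picker_id)
                    break
            else:
                # No existing group is close enough: start a new one.
                groups.append({"x": float(x), "y": float(y),
                               "pickers": {picker_id}, "n": 1})
    return groups


picker_a = [(100, 100), (300, 300)]
picker_b = [(104, 103), (600, 600)]
groups = consensus_groups([picker_a, picker_b], box_size=200, relative_radius=0.1)
```

In this sketch, the two picks near (100, 100) merge into one group supported by both pickers, while a much smaller radius would leave them as separate candidates.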
## Tolerance Threshold
The Tolerance threshold parameter defines the neural-network score required for a candidate particle to be accepted.
The network assigns each candidate particle a score between 0 and 1. A score near 1 indicates that the network considers the candidate more likely to be a good particle. A score near 0 indicates that it is more likely to be a bad particle or false positive.
Particles with scores above the threshold are included in the final outputs.
If the threshold is set to -1, all scored particles are allowed to pass. This is useful when the user wants to inspect the scores manually or create a subset later using the analysis tools.
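The thresholding rule can be sketched as follows. The function name and the `(coordinate, score)` pair format are hypothetical, and the strict `>` comparison is an assumption based on the wording "above the threshold".

```python
def filter_by_score(scored_candidates, threshold):
    """Keep (coordinate, score) pairs that pass the threshold.

    A threshold of -1 is treated as "keep everything", mirroring the
    behavior described in the text.
    """
    if threshold == -1:
        return list(scored_candidates)
    return [(coord, score) for coord, score in scored_candidates
            if score > threshold]


scored = [((10, 10), 0.95), ((20, 20), 0.40), ((30, 30), 0.05)]
accepted = filter_by_score(scored, threshold=0.5)   # only the first candidate
everything = filter_by_score(scored, threshold=-1)  # all three candidates
```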
## Micrograph Preprocessing
Before extracting particles for training and prediction, the protocol preprocesses the micrographs internally.
The preprocessing is designed to make the extracted particle boxes compatible with the Deep Consensus neural network. It includes:
- downsampling micrographs so that extracted particles become 128 × 128 pixels;
- normalizing micrographs to approximately zero mean and unit standard deviation;
- inverting contrast when needed so that particles are white;
- optionally applying CTF phase flipping;
- extracting particle boxes.
This internal preprocessing is important because the neural network expects a standardized particle representation.
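Two of these steps are easy to illustrate numerically. The sketch below assumes that the downsampling factor is simply the input box size divided by 128 and that normalization is a plain z-score over pixel values; the helper names are hypothetical and the real preprocessing is performed by Xmipp programs on whole micrographs.

```python
from statistics import mean, pstdev

NETWORK_BOX = 128  # internal particle size stated in the text


def downsampling_factor(box_size):
    """Factor by which a micrograph would be shrunk so boxes become 128 px."""
    if box_size < NETWORK_BOX:
        raise ValueError("box size must be at least 128 pixels")
    return box_size / NETWORK_BOX


def normalize(pixels, invert=False):
    """Z-score a flat list of pixel values; optionally flip the contrast."""
    mu = mean(pixels)
    sigma = pstdev(pixels)
    if sigma == 0:
        raise ValueError("cannot normalize a constant image")
    sign = -1.0 if invert else 1.0
    return [sign * (p - mu) / sigma for p in pixels]


factor = downsampling_factor(256)  # a 256 px box is downsampled by a factor of 2
white_particles = normalize([1.0, 2.0, 3.0, 4.0], invert=True)
```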
## Contrast Inversion
The “Did you invert the micrographs contrast?” option tells the protocol whether the input micrographs have already been contrast-inverted.
Deep Consensus expects particles to be white on a darker background.
If the micrographs have not already been inverted, the protocol can invert the contrast during preprocessing. If they have already been inverted, the user should indicate this so that the protocol does not invert them again.
Using the wrong contrast convention can seriously affect the neural-network scores.
## Ignore CTF
The Ignore CTF option controls whether CTF information is used during particle preprocessing.
If CTF is ignored, particles are extracted without phase flipping.
If CTF is not ignored, the user must provide a CTF estimation relation. The protocol uses the CTF information to perform phase flipping during preprocessing.
Phase flipping can make particle images more consistent, but it should only be used when reliable CTF estimates are available and when this preprocessing is appropriate for the intended workflow.
## CTF Estimation
The CTF estimation parameter is required when Ignore CTF is disabled.
It provides the CTF information associated with the input micrographs. The protocol converts this information into Xmipp CTF-parameter files and uses it during micrograph preprocessing and particle extraction.
If CTF correction is requested but no CTF relation is provided, the protocol reports a validation error.
## Training Examples
Deep Consensus builds internal training examples from the input coordinate sets.
Positive examples are obtained from strict or high-confidence consensus coordinates, such as coordinates supported by several pickers. Negative examples are generated by selecting noise coordinates away from the candidate particle coordinates.
The protocol also creates a broader OR set containing coordinates selected by at least one picker. These OR candidates are the particles that are later scored by the neural network.
This strategy allows the protocol to learn from the agreement and disagreement between picking methods.
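As an illustration of the negative-example idea, the sketch below draws random positions and keeps only those at least one box size away from every candidate. The function name, the rejection-sampling approach, and the one-box-size exclusion distance are assumptions for illustration; the protocol's own noise picker may use different rules.

```python
import random
from math import hypot


def pick_noise(candidates, mic_width, mic_height, box_size,
               n_wanted, seed=0, max_tries=10000):
    """Return up to n_wanted (x, y) positions far from all candidate picks."""
    half = box_size // 2
    if mic_width < box_size or mic_height < box_size:
        raise ValueError("micrograph is smaller than the box size")
    rng = random.Random(seed)  # seeded so the sketch is reproducible
    noise = []
    for _ in range(max_tries):
        if len(noise) >= n_wanted:
            break
        # Keep the whole box inside the micrograph.
        x = rng.randint(half, mic_width - half)
        y = rng.randint(half, mic_height - half)
        if all(hypot(x - cx, y - cy) >= box_size for cx, cy in candidates):
            noise.append((x, y))
    return noise


negatives = pick_noise([(500, 500)], 1000, 1000, box_size=100, n_wanted=5)
```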
## Additional Training Data
The Additional training data parameter allows the user to supplement the internal training examples.
There are three options:
- None: uses only the internal positive and negative examples derived from the input coordinate sets.
- Precompiled: adds a precompiled negative training set distributed with the Deep Consensus model resources.
- Custom: allows the user to provide additional positive and negative training data.
Additional data can improve training when the internal examples are limited, but they must be compatible with the current preprocessing and specimen.
## Custom Additional Training Data
When Custom is selected for the additional training data, the user can provide either particles or coordinates.
If particles are provided, they must already be preprocessed in the format expected by the network: 128 × 128 pixels, white particles, and optionally CTF corrected in the same way as the protocol.
If coordinates are provided, they should come from the same micrographs as the input coordinates. The protocol will preprocess and extract them internally.
The user can provide positive and negative custom examples and assign weights to control how much they contribute during training.
## Positive and Negative Weights
The “Weight of positive additional train data” and “Weight of negative additional train data” parameters control the relative contribution of custom training examples.
A weight of 1 means that additional examples are weighted similarly to internal examples.
If the weight is set to -1, the protocol estimates a weight so that the additional data contribute approximately as much as the internal particles.
These weights are useful when the custom training set is much larger or much smaller than the internally generated training set.
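One plausible reading of the automatic weight is a ratio of set sizes, so that the additional set as a whole counts about as much as the internal set. The sketch below encodes that reading; the formula and the function name are assumptions, not taken from the protocol code.

```python
def additional_weight(user_weight, n_internal, n_additional):
    """Resolve the per-example weight applied to additional training data.

    Assumption: -1 means "balance the two sets", modeled here as
    n_internal / n_additional.
    """
    if n_additional <= 0:
        raise ValueError("no additional training examples were provided")
    if user_weight == -1:
        return n_internal / n_additional
    return float(user_weight)


# 4000 additional examples versus 1000 internal ones: each additional
# example is down-weighted so both sets contribute similarly overall.
balanced = additional_weight(-1, n_internal=1000, n_additional=4000)
```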
## Number of Epochs
The Number of epochs parameter defines how many training epochs are used for the neural network.
More epochs allow the model to learn longer from the training data but increase runtime and may increase overfitting if the dataset is small or biased.
The default value is intended as a practical starting point. Training can also stop automatically when convergence is detected, depending on the auto-stopping option.
## Learning Rate
The Learning rate controls how strongly the neural network weights are updated during training.
A larger learning rate may train faster but can become unstable. A smaller learning rate is more conservative but may train more slowly.
Most users should keep the default value unless they have experience tuning deep-learning training.
## Auto Stopping
The “Auto stop training when convergence is detected?” option enables automatic stopping based on validation behavior.
When enabled, the protocol can reduce the learning rate if improvement stops and eventually stop training if the learning rate becomes too small. It can also stop if the validation accuracy reaches the selected threshold.
This option is generally useful, but it may stop too early in very small training sets. The protocol help notes that it is not recommended for very small datasets with fewer than about 100 true particles.
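The stopping logic described above resembles a reduce-on-plateau schedule combined with an accuracy cutoff. The following sketch simulates it on a list of per-epoch validation accuracies; the parameter names (`patience`, `factor`, `min_lr`) and the per-epoch comparison are illustrative assumptions rather than the protocol's actual settings.

```python
def auto_stop(val_accs, lr, patience, factor, min_lr, acc_threshold):
    """Return (epoch, reason) at which training would stop for a given trace."""
    best = float("-inf")
    stale = 0  # epochs since the last improvement
    for epoch, acc in enumerate(val_accs, start=1):
        if acc > acc_threshold:
            return epoch, "accuracy threshold reached"
        if acc > best:
            best, stale = acc, 0
            continue
        stale += 1
        if stale >= patience:
            lr *= factor  # reduce the learning rate on a plateau
            stale = 0
            if lr < min_lr:
                return epoch, "learning rate too small"
    return len(val_accs), "epoch budget exhausted"


stop = auto_stop([0.80, 0.90, 0.99], lr=1.0, patience=2,
                 factor=0.5, min_lr=0.2, acc_threshold=0.98)
```

With very few true particles the validation accuracy is noisy, so a schedule like this can hit a plateau by chance, which is consistent with the warning about small datasets.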
## Training Accuracy Threshold
The “Training mean val_acc threshold” parameter defines a validation-accuracy level at which training can stop.
If the mean validation accuracy surpasses this threshold, the protocol considers the training sufficiently good and stops further training.
The default value is high, reflecting the fact that the network should separate positive and negative examples clearly before being used for final scoring.
## Regularization Strength
The Regularization strength parameter controls L2 regularization of the neural-network weights.
Regularization helps reduce overfitting. If the training accuracy improves but validation accuracy decreases, increasing regularization may help.
Typical values span several orders of magnitude. This is an advanced parameter and should normally be left at its default unless overfitting is observed.
## Number of Models for Ensemble
The Number of models for ensemble parameter controls how many neural network models are trained and combined.
Training several models can make the prediction more robust, because the final score benefits from an ensemble rather than a single network. However, runtime increases approximately linearly with the number of models.
Typical values are between 1 and 5. The default provides a compromise between robustness and computation time.
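A common way to combine an ensemble is to average the per-particle scores of the individual models. The sketch below assumes plain averaging; how the protocol actually combines its models is not stated here, and the function name is hypothetical.

```python
def ensemble_scores(per_model_scores):
    """Average the scores that several models assign to the same particles.

    per_model_scores: one list of scores per model, all of the same length.
    """
    lengths = {len(scores) for scores in per_model_scores}
    if len(lengths) != 1:
        raise ValueError("every model must score the same particles")
    # zip(*...) regroups the scores particle by particle.
    return [sum(col) / len(col) for col in zip(*per_model_scores)]


# Two models scoring the same two particles.
combined = ensemble_scores([[0.9, 0.2], [0.7, 0.4]])
```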
## Expected Number of Particles for Training
The “Expected number of particles to use for training” parameter controls how many positive particles are used before training is considered complete.
If the value is -1, the protocol uses all particles found for training.
This parameter also affects the effective network size used by the protocol. The code distinguishes small, medium, and large training regimes according to the number of training examples.
Larger training sets generally improve robustness, but they require more time and memory.
## Testing After Training
The “Perform testing after training?” option allows the user to provide independent positive and negative test particle sets.
If enabled, the protocol scores these test sets after training. This can help assess whether the trained model generalizes beyond the internal training examples.
The test particles must be preprocessed in the same expected format: 128 × 128 pixels and compatible contrast and CTF treatment.
## Streaming Behavior
The protocol is designed for streaming workflows.
As micrographs and coordinate sets arrive, the protocol preprocesses micrographs, computes consensus coordinates, extracts particles, trains the network in batches, and predicts candidate particles in batches.
The relevant streaming parameters are:
- Extraction batch size;
- Training batch size;
- Perform preliminar predictions with on training CNN.
During streaming, preliminary outputs can be produced while the network is still being trained. After training is complete, the final network is used to produce final scored outputs.
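Batch selection in a streaming loop can be sketched as "take the micrographs that are available but not yet processed, up to the batch size". The helper below is a hypothetical illustration of that bookkeeping, not the protocol's scheduling code.

```python
def next_batch(available, already_done, batch_size):
    """Return up to batch_size micrograph names that still need processing."""
    done = set(already_done)
    pending = [name for name in available if name not in done]
    return pending[:batch_size]


arrived = ["mic_001.mrc", "mic_002.mrc", "mic_003.mrc", "mic_004.mrc"]
batch = next_batch(arrived, already_done=["mic_001.mrc"], batch_size=2)
```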
## Preliminary Predictions
The “Perform preliminar predictions with on training CNN” option enables temporary predictions before the final model is fully trained.
These preliminary predictions are stored in separate preliminary output sets. They are useful in streaming workflows where the user wants early feedback before all training data have arrived.
Preliminary outputs should be interpreted cautiously because the network is still being trained. Final outputs should be preferred for downstream processing.
## Output Coordinates
The main coordinate output is outputCoordinates.
This set contains candidate coordinates whose Deep Consensus score passes the selected threshold. The coordinates are scaled back to the original micrograph coordinate system and annotated with the deep-learning score.
The score is stored as an Xmipp attribute corresponding to zScoreDeepLearning1.
These coordinates can be used for particle extraction or subset selection in later workflows.
## Output Particles
The protocol also creates outputParticles.
These are the extracted particle images corresponding to the accepted coordinates. The particles carry the Deep Consensus score and are scaled to the appropriate sampling rate after internal preprocessing.
This output can be inspected directly or used as a starting point for downstream classification and cleaning.
## Preliminary Outputs
When preliminary prediction is enabled, the protocol may also produce:
- preliminarOutputCoordinates;
- preliminarOutputParticles.
These outputs are generated while training is still ongoing. They can provide early information in streaming workflows but should not be considered as final screening results.
## Validation and Requirements
The protocol performs several validation checks.
The input coordinate box size must be at least 128 pixels, because the internal Deep Consensus particle size is 128 × 128 pixels.
If CTF phase flipping is requested, CTF information must be provided.
Additional training or testing particle sets must also have 128-pixel box size.
If only one coordinate set is provided and training is requested, the protocol reports an error. In that case, the user should use a pretrained or previous model for direct scoring, or provide additional coordinate sets.
The protocol also checks that the required deep-learning toolkit and model resources are available.
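The checks listed above can be summarized in a small sketch that collects error messages. The function signature and message texts are hypothetical; the toolkit and model-resource checks are omitted because they depend on the local installation.

```python
def validate(box_size, n_coord_sets, do_training, ignore_ctf, has_ctf_relation):
    """Return a list of human-readable validation errors (empty if all pass)."""
    errors = []
    if box_size < 128:
        errors.append("Input coordinate box size must be at least 128 pixels.")
    if do_training and n_coord_sets < 2:
        errors.append("Training requires more than one input coordinate set.")
    if not ignore_ctf and not has_ctf_relation:
        errors.append("CTF phase flipping requested but no CTF relation given.")
    return errors


problems = validate(box_size=100, n_coord_sets=1, do_training=True,
                    ignore_ctf=False, has_ctf_relation=False)
```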
## Practical Recommendations
Use several complementary coordinate sets as input. Deep Consensus is most useful when different pickers provide partially overlapping but not identical results.
Make sure the input coordinate box size is at least 128 pixels.
Use a new model when enough data are available for training. Use a pretrained or previous model when the current data are similar to previous training data or when only one coordinate set is available.
Keep the default threshold at first. Lower it to retain more candidate particles; raise it to be stricter. Use -1 if you want to keep all candidates and inspect the scores later.
Check the contrast convention carefully. The network expects white particles.
Use CTF phase flipping only when reliable CTF estimates are available and the workflow expects phase-flipped particles.
Inspect both accepted and rejected particles before committing to downstream classification or reconstruction.
In streaming workflows, treat preliminary outputs as provisional and use final outputs once training is complete.
## Final Perspective
Deep Consensus Picking is a neural-network-based particle-screening protocol built on top of multiple picking results.
For biological users, its main value is that it converts agreement between pickers into a learned particle-quality score. It can keep particles that look convincing to the network even if not all pickers agree, and it can reject false positives that appear in the broad union of picks.
The protocol is especially useful when several picking strategies are available, when manual inspection of all candidates is impractical, or when a streaming workflow needs progressively improved particle selection.
As with any learned screening method, the result should be checked visually and validated downstream by 2D classification, particle cleaning, and final reconstruction behavior.
- ADD_DATA_TRAIN_CUST = 2
- ADD_DATA_TRAIN_CUSTOM_OPT = ['Particles', 'Coordinates']
- ADD_DATA_TRAIN_CUSTOM_OPT_COORS = 1
- ADD_DATA_TRAIN_CUSTOM_OPT_PARTS = 0
- ADD_DATA_TRAIN_NONE = 0
- ADD_DATA_TRAIN_PRECOMP = 1
- ADD_DATA_TRAIN_TYPES = ['None', 'Precompiled', 'Custom']
- ADD_MODEL_TRAIN_NEW = 0
- ADD_MODEL_TRAIN_PRETRAIN = 1
- ADD_MODEL_TRAIN_PREVRUN = 2
- ADD_MODEL_TRAIN_TYPES = ['New', 'Pretrained', 'PreviousRun']
- CONSENSUS_COOR_PATH_TEMPLATE = 'consensus_coords_%s'
- CONSENSUS_PARTS_PATH_TEMPLATE = 'consensus_parts_%s'
- ENDED = False
- EXTRACTING = {'AND': False, 'NOISE': False, 'OR': False}
- LAST_ROUND = False
- NET_TEMPLATE = 'nnetData{}'
- PARTICLES_TEMPLATE = 'particles{}.xmd'
- PREDICTING = False
- PREDICT_BATCH_MAX = 20
- PREPROCESSING = False
- PREPROCESS_BATCH_MAX = 200
- PRE_PROC_MICs_PATH = 'preProcMics'
- TO_EXTRACT_MICFNS = {'ADDITIONAL_COORDS_FALSE': [], 'ADDITIONAL_COORDS_TRUE': [], 'AND': [], 'NOISE': [], 'OR': []}
- TO_TRAIN_MICFNS = []
- TRAINED_PARAMS_PATH = 'trainedParams.pickle'
- TRAINING = False
- TRAIN_BATCH_MAX = 20
- USING_INPUT_COORDS = False
- USING_INPUT_MICS = False
- allFree()[source]
A kind of “traffic light” that reports whether no extraction, training, or prediction is in progress, any of which would alter the state of the protocol
- calculateCoorConsensusStep(outCoordsDataPath, mode)[source]
Calculates the consensus coordinates from micrographs whose particles haven’t been extracted yet in “mode”
- checkIfParentsFinished()[source]
Check the streamState of the input coordinates to determine whether the parent protocols have finished
- counter = 0
- doTraining()[source]
Prepares the positive (AND) and negative (NOISE) coordinates for the training and executes it
- extractParticles(mode)[source]
Extract the particles from a set of micrographs with their corresponding coordinates
- getAllCoordsInputMicrographs(shared=False)[source]
Returns a dict {micFn: mic} with the input micrographs associated with the input coordinate sets. If shared, the dict contains only those micrographs present in all input coordinate sets; otherwise it contains every micrograph present in any set (intersection vs. union). Do not create a set, because of concurrency in the database
- getExtractedMicFns(mode)[source]
Return the list of extracted micrograph filenames (micrographs where particles of type “mode” have been extracted)
- getMicrographFnsWithCoordinates(shared=True)[source]
Return a list with the filenames of those micrographs that already have coordinates associated in the input sets. If shared, the micrograph must be in all the sets; if not shared, in at least one
- getPredictedMicFns()[source]
Return the list of micrographs whose particles have been used for prediction
- getTrainedMicFns()[source]
Return the list of micrographs whose particles have been used for training
- insertCaculateConsensusSteps(mode, prerequisites)[source]
Insert the steps necessary for calculating the consensus coordinates of type “mode”
- insertExtractPartSteps(mode, prerequisites)[source]
Inserts the steps necessary for extracting the particles from the micrographs
- joinSetOfParticlesStep(mode, micFns='', trainingPass='', clean=False)[source]
Stores the particles extracted from a set of micrographs in a images.xmd metadata file
- lastRoundStep()[source]
Starts the last round of training and predictions with the remaining micrographs when all the inputs have arrived
- loadTrainedParams()[source]
Load the dictionary, stored in pickle format, that holds the trained parameters. Creates an initial one if it does not exist yet
- networkReadyToPredict()[source]
Returns true if the CNN is trained or the user specified it does not need to be trained
- pickNoise()[source]
Find noise coordinates from micrographs in order to use them as negatives in the training process
- predictCNN()[source]
Predict the particles from the micrographs and score the consensus coordinates
- predictionsOn()[source]
Return a boolean indicating whether to perform a prediction. True if a preliminary prediction is due or if the training process has finished (trainingPass == '')
- readyPreliminarPrediction()[source]
Return a boolean indicating whether to perform a preliminary prediction. True if the user enabled it and the currently trained network has not been used yet
- readyToExtractMicFns(mode)[source]
Return the list of micrograph filenames that are ready to be extracted (preprocessed and common to all inputs) and that have not been extracted and are not currently being extracted
- readyToPredictMicFns()[source]
Return the list of micrograph filenames that are ready to be used for prediction and that have not been predicted and are not currently being predicted
- readyToPreprocessMics(shared)[source]
Return the list of micrograph filenames which are ready to be preprocessed and have not been preprocessed yet
- readyToTrainMicFns()[source]
Return the list of micrograph filenames that are ready to be used for training and that have not been used for training and are not currently being trained on
- retrievePreviousPassModel(trPass, lastTrPass='')[source]
Retrieves a previous CNN model and copies its folders so that the network can be used in a new location
- retrievePreviousRunModel(prevProt, trPass='')[source]
Retrieves a CNN model from another protocol and copies its folders so that the network can be used in a new location
- retrieveTrainSets()[source]
Retrieve, link and return a setOfParticles corresponding to the NegativeTrain DeepConsensus training set with certain extraction conditions (phaseFlip/invContrast)
- trainCNN(toTrainMicFns)[source]
Trains the CNN with the particles from the ready to train micrographs
- trainingOn()[source]
Return a boolean indicating whether to perform training. True if training must not be skipped and the criterion for finishing training has not been reached yet