Project definition

Introduction

Scipion is an image processing framework for obtaining 3D models of macromolecular complexes using Electron Microscopy (3DEM). It is designed to integrate several software packages in the field and present a unified interface for both biologists and developers. Scipion allows the execution of workflows combining different software tools, while taking care of formats and conversions. Additionally, all steps are tracked and can be reproduced later on.

Image Processing in Structural Biology

State of the art in 3D Electron Microscopy Image Processing

Structural biology aims at the visualization of microscopic biological structures with the ultimate goal of understanding the molecular mechanisms taking place in the healthy cell as well as in the pathological cell. This spatial information is crucial to correctly put in context the biochemical information obtained by other experimental means. The Transmission Electron Microscope (TEM) is a powerful tool for acquiring three-dimensional structural information at two different structural levels: the macromolecular level (visualization of proteins or protein complexes [Frank2006a]) and the subcellular level (visualization of thin cellular sections; [Frank2006b]). The resolution (understood as the smallest reliable detail in the volume) at the macromolecular level ranges between 4 and 20 Å, while in the second case between 20 and 50 Å for sections thinner than 0.5 μm.

The three-dimensional structure of a wide range of biological specimens can be computed from images collected by transmission electron microscopy (TEM). This information, integrated with structural data obtained with other techniques (e.g., X-ray crystallography, nuclear magnetic resonance), helps biologists to understand the function of macromolecular complexes and organelles within cells. This kind of analysis has made it possible to visualize experimentally the trafficking of molecules through the nuclear pore complex (the communication channel between the cell nucleus and the cell cytoplasm) and the structure of each protein involved [Beck2004, Robinson2007], the mechanism of cell infection by certain viruses [Zhang2010], the capsid structure of the dengue virus [Pokidysheva2006], and the structure of several proteins at resolutions below 4 Å [Ludtke2008, Zhang2008].

Some of the most common uses of the TEM are: the visualization of macromolecular complexes, using a technique known as single-particle analysis (SPA); the visualization of cellular components (or sections) up to 0.5-1 micron thick, using electron tomography (ET); the determination of the arrangement of atoms in solids, using electron crystallography, complemented by X-ray crystallography; and two special cases of SPA in which symmetry has a crucial impact, namely helical structures and icosahedral viruses. This project focuses primarily on the first two techniques, although it will be designed so that the remaining techniques can be easily included in the near future.

  • Single-particle analysis (SPA) is a 3D-TEM method used to study macromolecules and macromolecular assemblies whose structure and dynamic interactions can be analysed in vitro (e.g., proteins, ribosomes, viruses) [Sorzano2007, Jonic2008]. This method is very powerful and under continuous development. It is complementary to nuclear magnetic resonance, since it allows computing a global structure of large macromolecular assemblies (size > 10 nm, molecular mass > 200 kDa). It is also complementary to X-ray crystallography, since it allows the study of non-crystalline matter. Generally speaking, SPA uses a large number of random “snapshots” of individual molecules to reconstruct the 3D structure. In practice, the analysis starts with two-dimensional (2D) TEM images of thousands of individual molecules of the same type taken in random orientations. Complex image processing algorithms and huge computing power are then required to reconstruct the 3D structural features of the macromolecules. When the single-particle orientations are evenly distributed on the specimen grid, and the population is structurally and conformationally homogeneous, standard image processing strategies allow computing a 3D average structure of the studied particles at near-atomic resolution (4-10 Å) [Ludtke2008, Zhang2008].

  • Electron tomography (ET) is a relatively recent technique that allows the visualization of thin cellular sections (up to 0.5 μm thick, while the whole cell may be 10 to 30 times thicker) with a resolution around 30-40 Å [Lucic2006, Jonic2008]. This is achieved by tilting the specimen within the electron microscope and acquiring projection images, from which the volume is reconstructed using a standard tomographic algorithm. In this way, the macromolecules observed by SPA can be seen in their native context (see Fig. 3). This has given rise to the emerging field of Visual Proteomics [Nickell2006]. Image processing algorithms have been devised for the identification and location of proteins in the electron tomogram [Frangakis2002].

  • Electron Crystallography, in which the complexes are regularly arranged in a two-dimensional crystal lattice.

  • Helical reconstruction, in which the macromolecules are regularly arranged in a helical, tubular structure.

  • Icosahedral virus reconstruction, in which the molecules to study are located in an icosahedral virus capsid. Although the orientation of each virus in each image is random, the high internal symmetry of the virus capsid allows the achievement of high resolution.

  • X-ray microscopy allows the visualization of the whole cell in a near-native state at a resolution around 150 Å. The energy of the X-ray photons lies within the water window (from the absorption edge of carbon at 284 eV to the absorption edge of oxygen at 543 eV). Within this energy region, the biological structures (lipids and proteins) absorb 10 times more than the surrounding water or ice. In this way, the sample is visualized in a near-native state without the need for any dye, which might modify the internal structure of the cell. Moreover, the high penetration power of X-ray photons allows imaging of samples up to 15 μm thick, i.e., the whole thickness of a single cell [Schneider1998, LeGros2005]. Photons in this energy region are called soft X-rays, as opposed to hard X-rays (e.g., those used in medical radiographs), which range between 20 and 150 keV. Using a synchrotron as the source of the soft X-rays provides an extremely bright light source compared to other X-ray emitting devices. The design of the microscope is highly tuned to the X-ray photon energy window, and currently there are only two microscopes in the world specifically designed for the water window: one is at the synchrotron of the Lawrence Berkeley Natl. Lab. (USA) and the other is at the BESSY Synchrotron in Berlin (Germany). Interestingly, Spain is in a very favorable position in this area in the global scientific context, since Mistral, the X-ray Tomography Microscope of the Spanish synchrotron ALBA, is expected to start operations during 2011. In fact, it was our group that made the case to the ALBA Scientific Advisory Committee for the creation of this beamline at the synchrotron, and we are currently participating in its design and construction.

State of the art in data, package and workflow integration

Several attempts have already been made toward data, package, and workflow integration, as discussed in the following paragraphs.

  • Data integration: The problem of data integration has been addressed from different perspectives:

    • Databases:

      • EMDataBank (http://www.emdatabank.org) is an initiative first started in Europe under the EU BioImage Project (coordinated by the BCU of the National Center for Biotechnology) and then continued within TEMBLOR (coordinated by the EBI-EMBL). It provides access to the reconstructed maps as well as the conditions in which the sample was prepared and the data were recorded. Currently the center is operated jointly by the Protein Data Bank in Europe (PDBe) at the European Bioinformatics Institute (http://www.ebi.ac.uk/pdbe/emdb), the Research Collaboratory for Structural Bioinformatics (RCSB, http://www.rcsb.org/pdb/home/home.do), and the National Center for Macromolecular Imaging (NCMI, http://ncmi.bcm.edu/ncmi). The database provides a basic description of the biological experiment and sample preparation as well as the final results obtained after the image analysis and their relationships to other databases (for instance, the fittings of domain structures known at atomic resolution in the Worldwide Protein Data Bank) [Henrick2003, Tagari2002]. Since 2002, more than 1000 such maps have been deposited by researchers using a web-based interface and are freely available to the public through a search interface and an FTP server. A measure of its success is that many journals request that the authors of 3DEM publications deposit their 3D maps in the database before publication. Unfortunately, the lack of adherence to a set of common conventions is hampering the usefulness of this database. Furthermore, small but important differences in the exact meanings of parameters among packages handicap users who try to work with more than one package.

      • Worldwide Protein Data Bank (wwPDB, http://www.wwpdb.org) is a publicly available resource that stores data on biological macromolecular structures and provides integrated tools for working with them. It was founded by the PDBe (http://www.ebi.ac.uk/pdbe), RCSB PDB (http://www.rcsb.org/pdb/home/home.do) and PDBj (http://www.pdbj.org).

    • PIMS: The Protein Information Management System (PIMS, https://www.pims-lims.org) is a project that aims to fully describe the sample preparation steps for crystallography. It provides a description of the experiment in a much more detailed manner than the one stored at EMDB.

    • Conventions: 3DEM-CWS (http://conventions.cnb.csic.es) is a web site that documents whether the map files produced in a particular package are consistent with the 3DEM conventions [Heymann2005]. It has been recognized that one of the most difficult obstacles to data sharing is the differing definitions of geometrical conventions (directions of the axes, rotations, shifts, etc.). To solve these problems, this work set up a centralized resource where convention-related information is being collected, stored and tested.

  • Package integration:

    • Image Processing Library and Toolkit (IPLT, http://www.iplt.org) is a project written in C++ and exposed to Python that aims at providing a comprehensive library for the EM community [Philippsen2007]. In principle, this Python layer would allow a user to call the different packages in a single Python script. However, to the best of our knowledge this possibility has not been exploited to actually build scripts using different packages.

    • 2dx (http://www.2dx.unibas.ch) is a software package that aims at providing a “wrapper” for the MRC software for electron crystallography. It contains a friendly graphical interface and workflow support. Final and intermediate outcomes can be reviewed using the incorporated visualization tools. The program provides useful functionality for interactive use; however, the lack of a flexible and customizable workflow may be one of its main disadvantages. It provides a good software entry point for further integrative efforts.

    • SPIDER Reconstruction Engine (SPIRE, http://www.wadsworth.org/spider_doc/spider/spire/doc) is a framework that provides a graphical user interface to the SPIDER modules [Baxter2007]. It also contains a database that aims at providing some level of traceability for the processing workflow, as well as further modules such as Jweb, an interface for interacting with SPIDER images. SPIRE provides a configuration file that, in principle, enables the user to execute any external program inside its environment, although it has not been extensively used for this purpose.

    • Single Particle Analysis for Resolution eXtension (SPARX, http://sparx-em.org/sparxwiki) is a Python framework and a core library of fundamental C++ image processing functions that includes a user interface that has been built around EMAN2 [Hohn2007]. It also provides a data/process-flow infrastructure. Like IPLT, it is a low-level approach to package integration, and it requires programming skills from the user that are far from the standard background of structural biologists.

  • Workflow integration: Currently, the only platform in the 3DEM field allowing a real integration of different software packages is Appion [Lander2009]. Appion (http://www.appion.org) is a web-based application and Python scripting system allowing the analysis of single particles with several software packages (Eman, Imagic, Spider, and Xmipp). It is integrated with Leginon (http://ami.scripps.edu/redmine/projects/leginon/wiki), a system designed for automated collection of images from a transmission electron microscope [Suloway2005]. Appion can be seen as a “pipeline” with registered input and output data that provides user guidance through the whole process. The program grew from the needs, practices and customs of a specific laboratory, and it may be difficult to generalize it to fully cover several packages, different and reusable workflows, etc. Low performance and the lack of a friendly interface and installation process are some of the weak points of its current implementation. However, Appion has an important pioneering value and, furthermore, concrete plans exist to extend and generalize its functionalities. We acknowledge the key value of their current experience as well as its inspirational role.

Motivation

Three-dimensional electron microscopy (3DEM) allows imaging of large biological macromolecules nearly in their native state. Different techniques have been developed within the last forty years to process the EM data and, not surprisingly, success in EM particle analysis is highly correlated with the continuous development of various image processing packages. Still, a severe lack of image and volume processing standards is preventing 3DEM from becoming a high throughput (HT), “-omics-like” technique, capable of generating rich biological databases on which a new step of discovery through data mining could be attempted. The workflows proposed by the different EM reconstruction packages are similar and, therefore, it should be easy to exploit the different strengths of each package. In practice, however, relatively small differences in formats and conventions, due to the lack of standardization in the field, heavily penalize users. This proposal aims at establishing a firm basis for solving some of the main burdens related to 3DEM data processing, such as the interoperability between different software packages and the current steep image processing learning curve, key issues that are preventing a more widespread use of 3DEM in biology at large. Furthermore, this work will open many new standardization issues such as the definition of interchange formats as well as processing workflows involving several EM packages, together with the development of new image processing ontologies. We will fully exploit the possibility of defining standardized workflows and running them in the ecosystem of high-performance computational infrastructures that has been developed in recent years in Europe, thereby minimizing the image analysis time and maximizing the throughput capabilities of 3DEM. Ultimately, the combination of efficient EM workflows with utility computing will create completely new scientific collaborative environments and thus enable new science.

Each technique (we will mainly focus on electron tomography, electron crystallography, helical reconstruction, icosahedral virus reconstruction, and X-ray tomography) has a set of software packages allowing the reconstruction of the three-dimensional structures from two-dimensional images (to mention the most widespread ones: Spider [Frank1996], Imagic [vanHeel1996], MRC package [Crowther1996], IMOD [Kremer1996], Brandeis helical package [Owen1996], Eman [Ludtke1999], Bsoft [Heymann2001], Xmipp [Sorzano2004], Tom Toolbox [Nickell2005], 2dx [Gipson2007], and TomoJ [Messaoudi2007]). Some packages cover several techniques at the same time (Spider, Imagic, Bsoft, Xmipp), while others are specific to a single one (MRC package, IMOD, Brandeis helical package, Eman, Tom Toolbox, TomoJ). Current software suites for the analysis of these kinds of images are actually composed of hundreds of small programs, each one performing an “atomic” task, which must be assembled into a script constituting the image processing pipeline. Inexperienced users get lost during the reconstruction process, so the development of standard and controlled sequences of actions will be of great help to the community; we will refer to these sequences as “workflows”. Workflows should be defined at the logical level and should be independent of any particular piece of software. In general, users understand the workflows but get confused by the programs that implement them. It may therefore seem a good idea to hide the workflow implementation whenever possible, but to process the data properly it is necessary to understand what is being done, so the implementation cannot be fully hidden. Still, many of the user’s troubles appear because software developed in scientific environments lacks well-defined interfaces and is therefore not easy to use outside the place and moment in which it was created.

Moreover, the programs are usually called through UNIX shells, which imposes another important entry barrier for the novice. Learning to use each program suite takes a few months, and experimentalists are reluctant to change from one package to the next due to the steep learning curve of each one.
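
As an illustration of what "defined at the logical level" could mean in practice, the following minimal Python sketch separates the logical step names from the package that implements each one. It is not the Scipion design: every name in it (REGISTRY, register, Step, run_workflow, package_a, package_b, and the step names) is hypothetical, and the "implementations" are placeholders that only tag the data.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List

# Hypothetical registry: logical step name -> {package name -> implementation}.
Implementation = Callable[[dict], dict]
REGISTRY: Dict[str, Dict[str, Implementation]] = {}


def register(step: str, package: str) -> Callable[[Implementation], Implementation]:
    """Register a function as the implementation of a logical step in a package."""
    def decorator(func: Implementation) -> Implementation:
        REGISTRY.setdefault(step, {})[package] = func
        return func
    return decorator


# Placeholder implementations: a real wrapper would call the external
# program and convert file formats and conventions on the way in and out.
@register("align_particles", "package_a")
def align_with_a(data: dict) -> dict:
    return {**data, "aligned_by": "package_a"}


@register("align_particles", "package_b")
def align_with_b(data: dict) -> dict:
    return {**data, "aligned_by": "package_b"}


@register("reconstruct_volume", "package_a")
def reconstruct_with_a(data: dict) -> dict:
    return {**data, "volume_by": "package_a"}


@dataclass(frozen=True)
class Step:
    name: str     # logical step, independent of any package
    package: str  # package chosen to implement it


def run_workflow(steps: List[Step], data: dict) -> dict:
    """Run the logical steps in order, dispatching each to its chosen package."""
    for step in steps:
        data = REGISTRY[step.name][step.package](data)
    return data


# The same logical workflow can mix packages step by step.
workflow = [Step("align_particles", "package_b"),
            Step("reconstruct_volume", "package_a")]
result = run_workflow(workflow, {"particles": 1000})
print(result)
```

Under these assumptions, swapping the package used for one step means changing a single field of a Step, while the logical sequence that the user reasons about stays the same.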

These scripts are usually run several times, varying the parameters among the different executions until the best reconstruction is obtained. The traceability of the process relies purely on the user's lab notebooks and good practices, and sometimes the reproducibility of the final reconstruction is seriously compromised by weak notebook annotations. As a result, we have noticed that a few months after completing a project, the researcher is usually unable to reproduce their own results without a major effort. Not only is accurate data bookkeeping needed, but also the possibility of automatically repeating a given pipeline.
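
The bookkeeping and automatic repetition described above can be sketched as a pipeline that records every step together with its parameters and can later be replayed from that record. This is only a toy sketch under stated assumptions: TrackedPipeline, replay, and the two processing functions (scale, clip) are hypothetical stand-ins, not part of any existing package.

```python
import json
from typing import Callable, Dict, List


def scale(values: List[float], factor: float) -> List[float]:
    """Toy processing step: multiply every value by a factor."""
    return [v * factor for v in values]


def clip(values: List[float], maximum: float) -> List[float]:
    """Toy processing step: cap every value at a maximum."""
    return [min(v, maximum) for v in values]


# Hypothetical table of available steps, addressed by name.
FUNCTIONS: Dict[str, Callable[..., List[float]]] = {"scale": scale, "clip": clip}


class TrackedPipeline:
    """Runs named steps and logs each one with the parameters used."""

    def __init__(self, functions: Dict[str, Callable[..., List[float]]]) -> None:
        self.functions = functions
        self.log: List[dict] = []

    def run(self, name: str, values: List[float], **params: float) -> List[float]:
        result = self.functions[name](values, **params)
        self.log.append({"step": name, "params": params})
        return result

    def export(self) -> str:
        """Serialize the log so it can be stored next to the results."""
        return json.dumps(self.log, indent=2)


def replay(functions: Dict[str, Callable[..., List[float]]],
           serialized_log: str, values: List[float]) -> List[float]:
    """Repeat a recorded pipeline on the given input."""
    for entry in json.loads(serialized_log):
        values = functions[entry["step"]](values, **entry["params"])
    return values


pipeline = TrackedPipeline(FUNCTIONS)
data = [1.0, 2.0, 3.0]
out = pipeline.run("scale", data, factor=2.0)
out = pipeline.run("clip", out, maximum=5.0)

notebook = pipeline.export()             # machine-readable "lab notebook"
assert replay(FUNCTIONS, notebook, data) == out
```

The design choice illustrated here is that the log is written by the pipeline itself rather than by the user, so the record of parameters cannot drift from what was actually executed.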

Additionally, image processing pipelines usually lack internal quality controls, and only the final results are usually controlled through some measure of resolution. This implies that some of the steps in the whole process may not be optimal, degrading the overall performance of the whole pipeline, but there are currently no easy ways to identify these steps. Only advanced users are able to check the intermediate results of each step and take the required actions.

Objectives

We propose to further develop image processing standards, leading to the design of a software platform for structural biologists who use the transmission electron microscope, such that the process of obtaining the three-dimensional structure of the biological sample will be greatly simplified. This platform will integrate tools from the main image processing software packages and will be built upon a consensus of the structural biology community. Scipion will keep track of all operations performed in an image processing LIMS (Laboratory Information Management System) and will allow access to large computational facilities (HPC and cloud computing) so that researchers do not need to manage their own computer cluster or execute their image analysis in a supercomputing centre (although these possibilities will also be explicitly considered in the design, as they are common practice among advanced users). The platform will also be connected to other data integration initiatives and will be easily extensible to other techniques and packages.

References

  • [Baxter2007] Baxter WT, Leith A, Frank J. SPIRE: the SPIDER reconstruction engine. J. Struct. Biol., 2007, 157 (1): 56–63

  • [Beck2004] Beck, M.; Förster, F.; Ecke, M.; Plitzko, J.; Melchior, F.; Gerisch, G.; Baumeister, W. & Medalia, O. Science, 2004, 306, 1387-1390

  • [Crowther1996] Crowther, R. A.; Henderson, R. & Smith, J. M. MRC image processing programs J. Structural Biology, 1996, 116, 9-16

  • [Frangakis2002] Frangakis, A.; Böhm, J.; Förster, F.; Nickell, S.; Nicastro, D.; Typke, D.; Hegerl, R. & Baumeister, W. Proc. Natl. Acad. Sci. USA, 2002, 99, 14153-14158

  • [Frank1996] Frank, J.; Radermacher, M.; Penczek, P.; Zhu, J.; Li, Y.; Ladjadj, M. & Leith, A. SPIDER and WEB: Processing and visualization of images in 3D electron microscopy and related fields. J. Structural Biology, 1996, 116, 190-9

  • [Frank2006a] Frank, J. Oxford Univ. Press, 2006

  • [Frank2006b] Frank, J. Springer, 2006

  • [Fu2004] Fu, X.; Bultan, T. & Su, J. Proc. 13th Intl. Conf. on World Wide Web, 2004, 621-630

  • [Gipson2007] Gipson, B.; Zeng, X.; Zhang, Z. Y. & Stahlberg, H. 2dx–user-friendly image processing for 2D crystals. J. Structural Biology, 2007, 157, 64-72

  • [Harrison2007] Harrison, A.; Kelley, I.; Mueller, K.; Shields, M. & Taylor, I. Proc. of the UK e-Science All Hands Meeting, 2007

  • [Henrick2003] Henrick, K., R. Newman, M. Tagari, and M. Chagoyen, 2003. EMDep: a web-based system for the deposition and validation of high-resolution electron microscopy. J Struct Biol, 144, 228-37.

  • [Heymann2001] Heymann, J. B. Bsoft: Image and molecular processing in electron microscopy J. Structural Biology, 2001, 133, 156-169

  • [Heymann2005] Heymann, J. B.; Chagoyen, M. & Belnap, D. M. Common conventions for interchange and archiving of three-dimensional electron microscopy information in structural biology J. Structural Biology, 2005, 151, 196-207

  • [Hohn2007] Hohn, M.; Tang, G.; Goodyear, G.; Baldwin, P. R.; Huang, Z.; Penczek, P. A.; Yang, C.; Glaeser, R. M.; Adams, P. D. & Ludtke, S. J. SPARX, a new environment for Cryo-EM image processing. J. Structural Biology, 2007, 157, 47-55

  • [Hull2006] Hull, D.; Wolstencroft, K.; Stevens, R.; Goble, C.; Pocock, M.; Li, P. & Oinn, T. Nucleic Acids Res, 2006, 34, 729-732

  • [Jonic2008] Jonic, S.; Sorzano, C. O. S. & Boisset, N. J. Microscopy, 2008, 232, 562-579

  • [Kremer1996] Kremer, J. R.; Mastronarde, D. N. & McIntosh, J. R. Computer visualization of three-dimensional image data using IMOD J. Structural Biology, 1996, 71-7

  • [Krishnan2002] Krishnan, S.; Wagstrom, P. & von Laszewski, G. Argonne Natl. Laboratory, Illinois, 2002

  • [Lander2009] Lander, G. C.; Stagg, S. M.; Voss, N. R.; Cheng, A.; Fellmann, D.; Pulokas, J.; Yoshioka, C.; Irving, C.; Mulder, A.; Lau, P.; Lyumkis, D.; Potter, C. S. & Carragher, B. Appion: an integrated, database-driven pipeline to facilitate EM image processing. J. Structural Biology, 2009, 166, 95-102

  • [Lucic2006] Lucic, V.; Forster, F. & Baumeister, W. Ann. Rev. Biochemistry, 2006, 74, 833-865

  • [Ludtke1999] Ludtke, S. J.; Baldwin, P. R. & Chiu, W. EMAN: Semiautomated software for high-resolution single-particle reconstructions J. Structural Biology, 1999, 128, 82-97

  • [Ludtke2008] Ludtke, S. J., Baker, M. L., Chen, D. H., Song, J. L., Chuang, D. T. & Chiu, W. Structure, 2008, 16, 441-448.

  • [Menager2006] Menager, H. & Lacroix, Z. Proc. 22nd International Conference on Data Engineering Workshops, 2006, 68-68

  • [Messaoudi2007] Messaoudi, C.; Boudier, T.; Sorzano, C. O. S. & Marco, S. TomoJ: software for multiple-axis tomography BMC Bioinformatics, 2007, 8, 288

  • [Nickell2005] Nickell, S.; Förster, F.; Linaroudis, A.; Del Net, W.; Beck, F.; Hegerl, R.; Baumeister, W. & Plitzko, J. M. TOM software toolbox: acquisition and analysis for electron tomography J. Structural Biology, 2005, 149, 227-234

  • [Nickell2006] Nickell, S.; Kofler, C.; Leis, A. P. & Baumeister, W. Nature Reviews on Molecular Cell Biology, 2006, 7, 225-230

  • [Owen1996] Owen CH, Morgan DG, DeRosier DJ. Image analysis of helical objects: the Brandeis Helical Package. J. Struct. Biol, 1996, 116, 167–175.

  • [Philippsen2007] Philippsen, A.; Schenk, A. D.; Signorell, G. A.; Mariani, V.; Berneche, S. & Engel, A. Collaborative EM image processing with the IPLT image processing library and toolbox J. Structural Biology, 2007, 157, 28-37

  • [Pokidysheva2006] Pokidysheva, E.; Zhang, Y.; Battisti, A. J.; Bator-Kelly, C. M.; Chipman, P. R.; Xiao, C.; Gregorio, G. G.; Hendrickson, W. A.; Kuhn, R. J. & Rossmann, M. G. Cell, 2006, 124, 485-493

  • [Robinson2007] Robinson, C. V.; Sali, A. & Baumeister, W. Nature, 2007, 450, 973-982

  • [Schneider1998] Schneider, G. Ultramicroscopy, 1998, 75, 85-104

  • [Sorzano2004] Sorzano, C. O. S.; Marabini, R.; Velázquez-Muriel, J.; Bilbao-Castro, J. R.; Scheres, S. H. W.; Carazo, J. M. & Pascual-Montano, A. XMIPP: A new generation of an open-source image processing package for Electron Microscopy J. Structural Biology, 2004, 148, 194-204

  • [Sorzano2007] Sorzano, C. O. S.; Jonic, S.; Cottevieille, M.; Larquet, E.; Boisset, N. & Marco, S. European Biophysics Journal, 2007, 36, 995-1013

  • [Suloway2005] C Suloway, J Pulokas, D Fellmann, A Cheng, F Guerra, J Quispe, S Stagg, CS Potter and B Carragher. Automated Molecular Microscopy: The New Leginon System, Journal of Structural Biology, 2005, 151 (1), 41-60.

  • [Tagari2002] M. Tagari, R. Newman, M. Chagoyen, J.-M. Carazo, K. Henrick. New Electron Microscopy Database and Deposition System. Trends in Biochemical Sciences, 2002, 27(11), 589.

  • [Thatte2001] Thatte, S. Microsoft Corp., 2001

  • [vanHeel1996] van Heel, M.; Harauz, G.; Orlova, E. V.; Schmidt, R. & Schatz, M. A new generation of the IMAGIC image processing system J. Structural Biology, 1996, 116, 17-24

  • [Zeng2007] Zeng, J.; Du, Z.; Hu, C. & Huai, J. Lecture Notes in Computer Science, 2007, 4782, 249-259

  • [Zhang2008] Zhang, X.; Settembre, E.; Xu, C.; Dormitzer, P. R.; Bellamy, R.; Harrison, S. C. & Grigorieff, N. Proc Natl Acad Sci USA, 2008, 105, 1867-1872

  • [Zhang2010] Zhang, X.; Jin, L.; Fang, Q.; Hui, W. H. & Zhou, Z. H. Cell, 2010, 141, 472-482