14. Import sequence protocol

Protocol designed to import aminoacid or nucleotide sequences in Scipion from four possible origins (plain text, atomic structures from PDB database or a file in your computer, text file of the user’s computer, and UniProtKB/ GeneBank databases).

Requirements to run this protocol and visualize results:
- Scipion plugin: scipion-em
Scipion menu: Model building -> Imports (Fig. 14.5 (A))

Fig. 14.5 Protocol import sequence. A: Protocol location in Scipion menu. B: Protocol form.
Protocol form parameters (Fig. 14.5 (B)):
- Sequence ID: Optional short name to identify your sequence (acronym or number, e. g. Q05769). If no ID is assigned by the user, and the sequence has been downloaded from GeneBank/UniProtKB/PDB database, the database ID will be selected as Sequence ID (Read Help section (question mark) to see some examples). Otherwise, Sequence name will be set as Sequence ID. The Sequence ID will be included in sequence alignments in ChimeraX to identify the sequence.
- Sequence name: Compulsory short name to identify your sequence (example: PGH2_MOUSE). Names with certain meaning are recommended. The Sequence name will appear in the summary box of the protocol.
- Sequence description: Optional description of your sequence. It can include functionality, organism, size, etc. (example: Prostaglandin G/H synthase 2). If no description is assigned by the user, and the sequence has been downloaded from GeneBank/UniProtKB/PDB database, the database description will be selected as Sequence description. Otherwise, no description will be included.
- Import sequence of: Selection parameter to choose between aminoacids and nucleotides. After selecting one of them, a new selection menu will be opened:
  - aminoacids: Parameter to select one of these four options:
    - plain text: Select this option if you want to introduce your own single letter aminoacid sequence. Since your sequence will be cleaned according to the standard protein alphabet of 20 aminoacids (Protein) or to an extended alphabet that includes 6 additional aminoacids or aminoacid groups (Extended Protein), you have to select one of these IUPAC Protein alphabets. Read Help section (question mark) to know the aminoacids included in each alphabet. Not only non-canonical aminoacids will be cleaned, but also wildcard characters such as *, #, ?, -, etc… Write your sequence here indicates the place where your single letter aminoacid sequence has to be written or paste.
    - atomic structure: Select this option if you want to download the sequence from an atomic structure (Fig. 14.6 (A)). Select id to download your sequence from PDB database. Then, write the PDB ID (Atomic structure ID) and select the chain sequence of your preference (Chain). Use the wizard on the right side of Chain parameter to select that chain. Follow an analogous process to download the sequence from an atomic structure that you already have in your computer. This time, the File path will replace the Atomic structure ID. By pressing the folder symbol, a browser will help you to find the structure file.
    - file: Select this option if your sequence is written in a text file that you already have in your computer (Fig. 14.6 (B)). By pressing the folder symbol, a browser will help you to find the sequence file.
    - UniProtID: Select this option if you want to download the sequence from UniProtKB database (Fig. 14.6 (C)). Write the name/ID of the respective sequence in the parameter box UniProt name/ID. An error message appears in case you introduce a wrong ID.
      
      Fig. 14.6 Protocol import sequence. Protocol form to import aminoacid sequences from the PDB database by indicating its respective ID (A), from a file (B), or from UniProtKB by writing the database ID/name (C).
  - nucleotides: Analogously to aminoacids parameter, select one of these four options:
    - plain text: Parameter to introduce your own single letter nucleotide sequence (Fig. 14.7 (A)). Since your sequence will be cleaned according to the standard nucleic acid alphabet, you have to select one of the next five alphabets. The first three are DNA alphabets and the last two ones are RNA alphabets. Read Help section (question mark) to understand each alphabet. The most restricted ones are Unambiguous DNA (“A, C, G, T”) and Unambiguous RNA (“A, C, G, U”) for DNA and RNA, respectively. The cleaning process also involves wildcard characters such as *, #, ?, -, etc…
    - atomic structure: Information described for aminoacids is valid for nucleotides (Fig. 14.7 (B)).
    - file: Information described for aminoacids is valid for nucleotides (Fig. 14.7 (C)).
    - GeneBank: Information described for aminoacids is valid for nucleotides, this time replacing UniProtKB by GeneBank (Fig. 14.7 (D)).
      
      Fig. 14.7 Protocol import sequence. Protocol form to write (A) or import nucleotide sequences from PDB database by indicating its respective ID (B), from a file (C), or from GeneBank by writing the database accession number (D).
Protocol execution:

Adding specific sequence label is recommended in Run name section, at the form top. To add the label, open the protocol form, press the pencil symbol at the right side of Run name box, complete the label in the new opened window, press OK and, finally, close the protocol. This label will be shown in the output summary content (see below). If you want to run again this protocol, do not forget to set to Restart the Run mode.

Press the Execute red button at the form bottom.
Visualization of protocol results:

After executing the protocol, press Analyze Results and a text editor will be opened in which you can read the sequence in fasta format. Sequence ID and Sequence description are included in the header.
Summary content:
- Protocol output (below Scipion framework):
  
  pwem - import sequence -> ouputSequence;
  
  Sequence name
- SUMMARY box:
  
  Sequence of aminoacids/ nucleotides:
  
  Sequence Sequence name imported from plain text/ atomic structure/ file/ UniProt ID.