Protein Structures
==================

Use ``Protein`` when a workflow starts from PDB or mmCIF structural data and
needs protein-chain, residue, and atom traversal.

Read a PDB file directly:

.. code-block:: python

   from cosmolkit import Protein, ResidueCode

   protein = Protein.from_pdb("1crn.pdb")

   print(protein.num_models())
   print(protein.num_chains())
   print(protein.num_residues())
   print(protein.num_atoms())

Read PDB text that is already in memory:

.. code-block:: python

   protein = Protein.from_pdb_str(pdb_text)

Read mmCIF input with the same high-level protein projection:

.. code-block:: python

   protein = Protein.from_mmcif("1crn.cif")
   protein = Protein.from_mmcif_str(cif_text, path="1crn.cif")

``Protein`` keeps amino-acid residues and excludes ligands, nucleic acids, and
waters by default. Use it for protein-focused traversal rather than low-level
mixed structural tables.

Chains, Residues, and Atoms
---------------------------

``Protein`` behaves like a chain collection. ``len(protein)`` returns the
number of protein chains, and ``protein[i]`` returns a ``ProteinChain``.

.. code-block:: python

   first_chain = protein[0]
   print(first_chain.index(), first_chain.kind(), len(first_chain))

   for chain in protein.chains():
       for residue in chain.residues():
           if residue.code() == ResidueCode.MET:
               print("methionine", residue.index(), residue.fasta_code())
           print(residue.index(), residue.name(), residue.code(), len(residue))

           for atom in residue.atoms():
               print(atom.index(), atom.name(), atom.element(), atom.position())

``atom.position()`` returns ``None`` when the atom has no Cartesian coordinate
in the selected structure data; otherwise it returns ``(x, y, z)``.
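
Because ``atom.position()`` may be ``None``, code that aggregates coordinates
has to skip missing entries. The following is a minimal sketch, not part of
``cosmolkit``: ``centroid`` is a hypothetical helper, and it assumes positions
arrive as plain ``(x, y, z)`` tuples of floats or ``None``, as described above.

.. code-block:: python

   from typing import Iterable, Optional, Tuple

   # Hypothetical alias: a coordinate triple, or None when no coordinate exists.
   Position = Optional[Tuple[float, float, float]]


   def centroid(positions: Iterable[Position]) -> Position:
       """Average the available coordinates, ignoring ``None`` entries."""
       total_x = total_y = total_z = 0.0
       count = 0
       for position in positions:
           if position is None:
               # Atom without a Cartesian coordinate: skip it.
               continue
           x, y, z = position
           total_x += x
           total_y += y
           total_z += z
           count += 1
       if count == 0:
           # No atom carried a coordinate, so there is nothing to average.
           return None
       return (total_x / count, total_y / count, total_z / count)

For a residue from the traversal above, a call such as
``centroid(atom.position() for atom in residue.atoms())`` would then return
either a coordinate triple or ``None`` when no atom has a position.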

Residue Information
-------------------

``ProteinResidue.name()`` returns the raw residue name from the structure.
Use ``ProteinResidue.code()`` for enum matching against Gemmi's tabulated
residue vocabulary, and ``ProteinResidue.info()`` when you need the
source-derived classification fields. Sequence expansion follows Gemmi's
``expand_one_letter`` and ``expand_one_letter_sequence`` residue tables.

.. code-block:: python

   from cosmolkit import (
       ResidueCode,
       ResidueInfoKind,
       expand_one_letter_sequence,
       find_tabulated_residue,
   )

   info = find_tabulated_residue("MSE")
   assert info.code() == ResidueCode.MSE
   assert info.kind() == ResidueInfoKind.AA
   assert info.fasta_code() == "X"
   assert expand_one_letter_sequence("ACD(MSE)", ResidueInfoKind.AA) == [
       "ALA",
       "CYS",
       "ASP",
       "MSE",
   ]
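
To make the expansion rule concrete, the sketch below re-implements the idea
in plain Python: single letters map to three-letter names, and a parenthesized
name such as ``(MSE)`` is passed through unchanged. It is an illustration
only, not the Gemmi or ``cosmolkit`` implementation. The name
``expand_amino_acids`` and the ``THREE_LETTER`` table are hypothetical, and the
table covers only the 20 standard amino acids.

.. code-block:: python

   # Assumed lookup table: the 20 standard amino acids only.
   THREE_LETTER = {
       "A": "ALA", "R": "ARG", "N": "ASN", "D": "ASP", "C": "CYS",
       "Q": "GLN", "E": "GLU", "G": "GLY", "H": "HIS", "I": "ILE",
       "L": "LEU", "K": "LYS", "M": "MET", "F": "PHE", "P": "PRO",
       "S": "SER", "T": "THR", "W": "TRP", "Y": "TYR", "V": "VAL",
   }


   def expand_amino_acids(sequence: str) -> list[str]:
       """Expand a one-letter sequence, keeping ``(XXX)`` names verbatim."""
       names = []
       i = 0
       while i < len(sequence):
           char = sequence[i]
           if char == "(":
               # Parenthesized residue name, e.g. "(MSE)": copy it as written.
               close = sequence.find(")", i + 1)
               if close == -1:
                   raise ValueError("unmatched '(' in sequence")
               names.append(sequence[i + 1:close])
               i = close + 1
           elif char.isspace():
               # Whitespace carries no residue; skip it.
               i += 1
           else:
               if char not in THREE_LETTER:
                   raise ValueError(f"unexpected letter in sequence: {char!r}")
               names.append(THREE_LETTER[char])
               i += 1
       return names

With this sketch, ``expand_amino_acids("ACD(MSE)")`` yields the same four
names as the assertion above.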

Protein vs Molecule PDB APIs
----------------------------

Use ``Protein.from_pdb()`` or ``Protein.from_pdb_str()`` when the desired
object is a protein structural view:

.. code-block:: python

   protein = Protein.from_pdb("input.pdb")

Use ``Molecule.from_pdb_block()`` only when the desired object is an
RDKit-compatible molecule conversion from PDB text:

.. code-block:: python

   from cosmolkit import Molecule

   mol = Molecule.from_pdb_block(
       pdb_text,
       sanitize=True,
       remove_hs=True,
       proximity_bonding=True,
   )

The molecule conversion path suits cheminformatics-style molecule operations.
The ``Protein`` path is the more ergonomic choice for protein chain, residue,
and atom access.