File IO and Arrays

SDF Files

COSMolKit exposes multiple SDF reading styles because small files, large seekable files, and stream inputs have different memory and access patterns.

Use Molecule.read_sdf() when you only need the first record:

from cosmolkit import Molecule

mol = Molecule.read_sdf("input.sdf", coordinate_dim="auto")

Use MoleculeBatch.read_sdf() when you intentionally want the entire file as one in-memory batch:

from cosmolkit import MoleculeBatch

batch = MoleculeBatch.read_sdf(
    "input.sdf",
    coordinate_dim="auto",
    errors="keep",
    progress_bar=True,
)

With progress_bar=True, the reader first builds a lightweight record index so the progress bar has an accurate total. The result is still one all-in-memory batch, so for very large files use SdfDataset.batches() when bounded memory matters.

Use SdfDataset for large seekable files when you want cheap len(), random access, metadata inspection, or chunked processing:

from cosmolkit import SdfDataset

dataset = SdfDataset.open("large.sdf", coordinate_dim="auto")

print(len(dataset))
print(dataset.metadata(0).title())

record = dataset[12345]
print(record.title(), record.data_field("supplier_id"))

head = dataset[:1024]
print(head.valid_mask())

for batch in dataset.batches(size=1024, errors="keep", progress_bar=True):
    fingerprints = batch.fingerprint_morgan_list()

dataset[i] returns one SdfRecord with SDF metadata and data fields. Slices, integer index lists, and boolean masks return MoleculeBatch objects.
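
Cheap len() and random access over a seekable SDF file are commonly achieved by recording the byte offset at which each record starts. The following is a minimal standard-library sketch of that idea, not COSMolKit's implementation: build_sdf_index and read_record are hypothetical helper names, and the sketch assumes every record is terminated by a line containing only $$$$.

```python
import io


def build_sdf_index(stream):
    """Return the byte offset at which each SDF record starts.

    Assumes every record ends with a line containing only "$$$$".
    """
    offsets = []
    start = stream.tell()
    has_content = False
    while True:
        line = stream.readline()
        if not line:
            break
        if line.strip() == b"$$$$":
            offsets.append(start)
            start = stream.tell()
            has_content = False
        elif line.strip():
            has_content = True
    if has_content:
        # Trailing record without a final "$$$$" separator.
        offsets.append(start)
    return offsets


def read_record(stream, offsets, i):
    """Seek to record i and return its raw bytes, without the separator."""
    stream.seek(offsets[i])
    lines = []
    for line in iter(stream.readline, b""):
        if line.strip() == b"$$$$":
            break
        lines.append(line)
    return b"".join(lines)


# Hypothetical three-record payload; real records would hold full molfiles.
sdf_bytes = b"mol_a\nbody\n$$$$\nmol_b\nbody\n$$$$\nmol_c\nbody\n$$$$\n"
stream = io.BytesIO(sdf_bytes)
offsets = build_sdf_index(stream)
second_title = read_record(stream, offsets, 1).splitlines()[0]
```

Once the offsets exist, the record count is just len(offsets), and any record can be read with a single seek. No molecule has to be parsed until it is requested.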

Use SdfReader for one-pass stream-style processing where random access is not needed:

from cosmolkit import SdfReader

for batch in SdfReader.open("large.sdf").batches(size=1024, errors="keep"):
    smiles = batch.to_smiles_list(canonical=True)

SdfReader does not know the final record count without pre-indexing the file, so accurate record-count progress belongs to SdfDataset.
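
The one-pass pattern can be illustrated with plain generators. This is a sketch under stated assumptions, not how SdfReader is implemented: iter_records and iter_batches are hypothetical names, records are assumed to end with a $$$$ line, and the stream is consumed exactly once with no knowledge of the total count.

```python
import io
from itertools import islice


def iter_records(stream):
    """Yield one raw SDF record (as text) at a time from a text stream."""
    lines = []
    for line in stream:
        if line.strip() == "$$$$":
            yield "".join(lines)
            lines = []
        else:
            lines.append(line)
    if any(part.strip() for part in lines):
        # Trailing record without a final "$$$$" separator.
        yield "".join(lines)


def iter_batches(stream, size):
    """Group records into lists of at most `size`, one batch in memory at a time."""
    records = iter_records(stream)
    while True:
        batch = list(islice(records, size))
        if not batch:
            return
        yield batch


# Hypothetical five-record input; real records would hold full molfiles.
sdf_text = "".join(f"mol_{i}\nbody\n$$$$\n" for i in range(5))
batch_sizes = [len(batch) for batch in iter_batches(io.StringIO(sdf_text), size=2)]
```

Because the generator only learns that the input has ended when the stream is exhausted, a progress bar over this loop can count batches processed but cannot show a meaningful total.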

Read a single-record MDL molfile with the same CTAB parser:

mol = Molecule.read_mol("input.mol", coordinate_dim="auto")
mol = Molecule.read_mol_from_str(mol_text, coordinate_dim="2d")

Molecule.read_mol() and Molecule.read_mol_from_str() follow the boundaries of RDKit's MolFromMolBlock: they parse the molfile CTAB through the first M END line and ignore any trailing text, including SDF data fields and $$$$ record separators. Use Molecule.read_sdf(), SdfDataset, or SdfReader when the SDF data fields are part of the input you need.
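
The molfile boundary can be made concrete with a few lines of plain Python. This sketch only illustrates where the molfile text ends; molblock_prefix is a hypothetical helper, the sample record is illustrative, and the terminator is matched as "M  END" with two spaces, as the CTAB format writes it.

```python
def molblock_prefix(text):
    """Return the text up to and including the first "M  END" line."""
    kept = []
    for line in text.splitlines(keepends=True):
        kept.append(line)
        if line.rstrip() == "M  END":
            return "".join(kept)
    raise ValueError("no 'M  END' line found")


# Illustrative SDF record: a one-atom molfile followed by a data field.
record = (
    "methane\n"
    "  sketch\n"
    "\n"
    "  1  0  0  0  0  0  0  0  0  0999 V2000\n"
    "    0.0000    0.0000    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0\n"
    "M  END\n"
    "> <supplier_id>\n"
    "ABC-123\n"
    "\n"
    "$$$$\n"
)
molfile_only = molblock_prefix(record)
```

Everything after the returned prefix, here the supplier_id field and the $$$$ separator, is the part a molfile-only reader never looks at.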

Write a molecule to SDF. SDF writing is explicit about the coordinate source: write_sdf() and to_2d_sdf_string() export 2D coordinates, generating them when needed; to_3d_sdf_string() exports an existing 3D conformer and raises if the molecule has no 3D coordinates.

mol = Molecule.from_smiles("CCO").with_2d_coordinates()
mol.write_sdf(
    "python/examples/output/ethanol.sdf",
    format="v2000",
    include_stereo=True,
    kekulize=True,
)

SDF Strings

Round-trip a molecule through an SDF string:

text = mol.to_2d_sdf_string(format="v2000", include_stereo=True, kekulize=True)
restored = Molecule.read_sdf_from_str(text, coordinate_dim="2d")

Molecule.read_sdf_from_str() uses the SDF record parser and therefore validates and parses data fields after M END. For molfile-only string input where trailing SDF text should be ignored, use Molecule.read_mol_from_str().
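
To show what "data fields after M END" means in practice, here is a minimal sketch of the SDF data-item layout: a "> <name>" header line, one or more value lines, then a blank line. It is not COSMolKit's parser and performs no validation; parse_data_fields is a hypothetical helper and the record text is illustrative.

```python
def parse_data_fields(record_text):
    """Collect SDF data items that follow the "M  END" line."""
    fields = {}
    name = None
    seen_end = False
    for line in record_text.splitlines():
        if not seen_end:
            seen_end = line.rstrip() == "M  END"
            continue
        if line.strip() == "$$$$":
            break
        if line.startswith(">"):
            # The field name sits between the first "<" and the last ">".
            lt, gt = line.find("<"), line.rfind(">")
            name = line[lt + 1:gt] if 0 <= lt < gt else None
            if name is not None:
                fields[name] = []
        elif name is not None:
            if line.strip():
                fields[name].append(line)
            else:
                # A blank line closes the current data item.
                name = None
    return {key: "\n".join(values) for key, values in fields.items()}


# Illustrative record: the CTAB body is abbreviated to a placeholder line.
record = (
    "ethanol\n"
    "(ctab lines omitted)\n"
    "M  END\n"
    "> <supplier_id>\n"
    "ABC-123\n"
    "\n"
    "> <logp>\n"
    "1.5\n"
    "\n"
    "$$$$\n"
)
fields = parse_data_fields(record)
```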

Use to_3d_sdf_string() when you explicitly want to export an existing 3D conformer. The molecule must already have 3D coordinates.

The format argument accepts "v2000", "v3000", or None for automatic selection. include_stereo=False and kekulize=False are exposed as RDKit parity parameters; branches that are not implemented yet fail with a clear unsupported-path error instead of silently changing behavior.
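
The rule behind automatic selection is not spelled out here. A V2000 counts line stores the atom and bond counts in three-character fields, so it cannot describe more than 999 atoms or 999 bonds. The sketch below assumes, without confirmation for COSMolKit, that format=None falls back to V3000 only when one of those limits is exceeded; choose_ctab_format is a hypothetical helper.

```python
V2000_MAX_COUNT = 999  # three-character count fields in the V2000 counts line


def choose_ctab_format(num_atoms, num_bonds, format=None):
    """Resolve format=None to "v2000" or "v3000" under the assumed size rule."""
    fits_v2000 = num_atoms <= V2000_MAX_COUNT and num_bonds <= V2000_MAX_COUNT
    if format is None:
        return "v2000" if fits_v2000 else "v3000"
    if format not in ("v2000", "v3000"):
        raise ValueError(f"unsupported format: {format!r}")
    if format == "v2000" and not fits_v2000:
        raise ValueError("molecule exceeds V2000 count limits")
    return format


small = choose_ctab_format(9, 8)
large = choose_ctab_format(1200, 1300)
```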

For multi-record strings, use the batch API:

batch = MoleculeBatch.read_sdf_records_from_str(sdf_text, coordinate_dim="auto")

MOL2 Files

Read Tripos MOL2 input with the source-ported RDKit Mol2FileToMol and Mol2BlockToMol profile:

mol = Molecule.read_mol2("input.mol2")
mol = Molecule.read_mol2_from_str(
    mol2_text,
    sanitize=True,
    remove_hs=True,
    variant="corina",
    cleanup_substructures=True,
)

The MOL2 reader exposes RDKit’s Mol2ParserParams controls. variant is currently limited to "corina", the only variant present in RDKit’s public MOL2 enum for this parser.

PDB and mmCIF Blocks

Use Protein.from_pdb() or Protein.from_pdb_str() when you want to read PDB data as a protein structural view:

from cosmolkit import Protein

protein = Protein.from_pdb("input.pdb")
protein = Protein.from_pdb_str(pdb_text)

For mmCIF, use Protein.from_mmcif() or Protein.from_mmcif_str():

protein = Protein.from_mmcif("input.cif")
protein = Protein.from_mmcif_str(cif_text, path="input.cif")

Use Molecule.from_pdb_block() when you want a molecule state comparable to RDKit Chem.MolFromPDBBlock for the modeled conversion profile:

from cosmolkit import Molecule

pdb_mol = Molecule.from_pdb_block(
    pdb_text,
    sanitize=True,
    remove_hs=True,
    proximity_bonding=True,
)

Use Molecule.from_mmcif_block() for mmCIF structural text. COSMolKit reads the mmCIF structure and then applies the same molecule-conversion profile used by from_pdb_block():

cif_mol = Molecule.from_mmcif_block(
    cif_text,
    sanitize=True,
    remove_hs=True,
    proximity_bonding=True,
)

XYZ Blocks

Use Molecule.from_xyz_block() when you want to read atom identities and Cartesian coordinates from XYZ text:

xyz_mol = Molecule.from_xyz_block(xyz_text)

XYZ input contains coordinates but no bond table, so the returned molecule has atoms and one 3D conformer without inferred bonds. This matches COSMolKit’s RDKit-aligned MolFromXYZBlock profile. Because the molecule is coordinate-only, topology-dependent operations such as adding hydrogens or ETKDG conformer generation require constructing a trusted bond graph first.
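
The reason no bonds are available is visible in the format itself. The sketch below parses the common XYZ layout (atom count, comment line, then one "symbol x y z" line per atom) with the standard library only. It is an illustration, not COSMolKit's reader, and parse_xyz_block is a hypothetical helper.

```python
def parse_xyz_block(text):
    """Parse XYZ text into (symbols, coordinates, comment); no bonds are inferred."""
    lines = text.strip("\n").splitlines()
    count = int(lines[0].split()[0])
    comment = lines[1] if len(lines) > 1 else ""
    atom_lines = lines[2:2 + count]
    if len(atom_lines) != count:
        raise ValueError(f"expected {count} atom lines, found {len(atom_lines)}")
    symbols, coords = [], []
    for line in atom_lines:
        symbol, x, y, z = line.split()[:4]
        symbols.append(symbol)
        coords.append((float(x), float(y), float(z)))
    return symbols, coords, comment


# Illustrative water geometry; the values are approximate.
xyz_text = """3
water
O  0.000  0.000  0.117
H  0.000  0.757 -0.469
H  0.000 -0.757 -0.469
"""
symbols, coords, comment = parse_xyz_block(xyz_text)
```

The parsed result holds element symbols and positions and nothing else, which is why any bond graph has to come from a separate, trusted source.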

Coordinate Arrays

coordinates_2d(), coordinates_3d(), and dg_bounds_matrix() return NumPy arrays. coordinates_2d() has shape (num_atoms, 3) with a zero-filled z column; coordinates_3d() has shape (num_atoms, 3) for the selected 3D conformer:

mol = Molecule.from_smiles("c1ccccc1O").with_2d_coordinates()

coords = mol.coordinates_2d()
bounds = mol.dg_bounds_matrix()

print(coords.shape)
print(bounds.shape)

Depiction Files

Write a 2D depiction of a molecule to an SVG or PNG file:

mol = Molecule.from_smiles("c1ccccc1O").with_2d_coordinates()

mol.write_svg("python/examples/output/phenol.svg", width=400, height=300)
mol.write_png("python/examples/output/phenol.png", width=400, height=300)