alff.gdata ¤

Modules:

convert_mpchgnet_to_xyz ¤

Functions:

Attributes:

info_keys = ['uncorrected_total_energy', 'corrected_total_energy', 'energy_per_atom', 'ef_per_atom', 'e_per_atom_relaxed', 'ef_per_atom_relaxed', 'magmom', 'bandgap', 'mp_id'] module-attribute ¤

chgnet_to_ase_atoms(datum: dict[str, dict[str, Any]]) -> list[Atoms] ¤

run_convert() ¤

gendata ¤

Data generation workflow implementation.

Classes:

  • WorkflowGendata

    Workflow for generating initial data for training ML models.

Functions:

WorkflowGendata(params_file: str, machines_file: str) ¤

Bases: Workflow

Workflow for generating initial data for training ML models.

Methods:

  • run

    The main function to run the workflow. This default implementation works for simple workflows,

Attributes:

stage_map = {'make_structure': make_structure, 'optimize_structure': optimize_structure, 'sampling_space': sampling_space, 'run_dft': run_dft, 'collect_data': collect_data} instance-attribute ¤
wf_name = 'DATA GENERATION' instance-attribute ¤
params_file = params_file instance-attribute ¤
machines_file = machines_file instance-attribute ¤
schema_file = schema_file instance-attribute ¤
multi_mdicts = config_machine.multi_mdicts instance-attribute ¤
pdict = Config.loadconfig(self.params_file) instance-attribute ¤
stage_list = self._load_stage_list() instance-attribute ¤
run() ¤

The main function to run the workflow. This default implementation works for simple workflows; more complex workflows (e.g., those with iteration, such as active learning) need to reimplement this .run() function.

make_structure(pdict, mdict) ¤

Build structures based on input parameters.

optimize_structure(pdict, mdict) ¤

Optimize the structures.

sampling_space(pdict, mdict) ¤

Explore the sampling space.

Sampling space includes:
  • Range of strains (in x, y, z directions) + range of temperatures
  • Range of temperatures + range of stresses

Note
  • Structure paths are saved into 2 lists: original and sampling structure paths.

run_dft(pdict, mdict) ¤

Run DFT calculations.

collect_data(pdict, mdict) ¤

Collect data from DFT simulations.

copy_labeled_structure(src_dir: str, dest_dir: str) ¤

Copy labeled structures.
  • First, try to copy the labeled structure if it exists.
  • If there is no labeled structure, copy the unlabeled structure.

strain_dim(struct_files: list[str], strain_list: list[float], dim: int) -> list[str] ¤

Scale a single spatial dimension of the structures.

Parameters:

  • struct_files (list[str]) –

    List of structure file paths.

  • strain_list (list[float]) –

    Strain values to apply.

  • dim (int) –

    Dimension index to strain (0=x, 1=y, 2=z).
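The per-dimension scaling can be sketched on a bare 3x3 cell matrix (a minimal illustration; apply_strain_dim is a hypothetical helper, not part of the module, and the real function operates on structure files):

```python
import numpy as np

def apply_strain_dim(cell: np.ndarray, strain: float, dim: int) -> np.ndarray:
    """Scale a single lattice vector of a 3x3 cell matrix by (1 + strain).

    Hypothetical helper illustrating the idea behind strain_dim():
    dim=0 scales the x lattice vector, dim=1 the y vector, dim=2 the z vector.
    """
    strained = cell.astype(float).copy()
    strained[dim] *= 1.0 + strain   # e.g. strain=0.02 -> 2 % elongation
    return strained

# A cubic 10 Angstrom cell stretched 2 % along x
cell = np.eye(3) * 10.0
strained = apply_strain_dim(cell, 0.02, dim=0)
```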

strain_x_dim(struct_files: list[str], strain_x_list: list[float]) -> list[str] ¤

Scale the x dimension of the structures.

strain_y_dim(struct_files: list[str], strain_y_list: list[float]) -> list[str] ¤

Scale the y dimension of the structures.

strain_z_dim(struct_files: list[str], strain_z_list: list[float]) -> list[str] ¤

Scale the z dimension of the structures.

perturb_structure(struct_files: list, perturb_num: int, perturb_disp: float) ¤

Perturb the structures.

libgen_gpaw ¤

Classes:

OperGendataGpawOptimize(work_dir, pdict, multi_mdict, mdict_prefix='gpaw') ¤

Bases: RemoteOperation

This class does GPAW optimization for a list of structures in task_dirs.

Methods:

  • prepare

    Prepare the operation.

  • postprocess

    This function does:

  • run

    Function to submit jobs to remote machines.

Attributes:

op_name = 'GPAW optimize' instance-attribute ¤
task_filter = {'has_files': [K.FILE_FRAME_UNLABEL], 'no_files': [K.FILE_FRAME_LABEL]} instance-attribute ¤
work_dir = work_dir instance-attribute ¤
pdict = pdict instance-attribute ¤
mdict_list = self._select_machines(multi_mdicts, mdict_prefix) instance-attribute ¤
task_dirs = self._load_task_dirs() instance-attribute ¤
commandlist_list: list[list[str]] instance-attribute ¤
forward_files: list[str] instance-attribute ¤
backward_files: list[str] instance-attribute ¤
forward_common_files: list[str] instance-attribute ¤
backward_common_files: list[str] = [] instance-attribute ¤
prepare() ¤

Prepare the operation.

Includes:
  • Prepare ase_args for GPAW and gpaw_run_file. Note: pdict.dft.calc_args.gpaw{} must be defined for this function.
  • Prepare the task_list.
  • Prepare forward & backward files.
  • Prepare commandlist_list for multi-remote submission.

postprocess() ¤

This function does:
  • Remove unlabeled .extxyz files, keeping only the labeled ones.

run() ¤

Function to submit jobs to remote machines.

Note
  • Original task_dirs is relative to run_dir and should not be changed. But the submission function needs task_dirs relative to work_dir, so we make a temporary change here.

OperGendataGpawSinglepoint(work_dir, pdict, multi_mdict, mdict_prefix='gpaw') ¤

Bases: OperGendataGpawOptimize

Methods:

Attributes:

op_name = 'GPAW singlepoint' instance-attribute ¤
work_dir = work_dir instance-attribute ¤
pdict = pdict instance-attribute ¤
mdict_list = self._select_machines(multi_mdicts, mdict_prefix) instance-attribute ¤
task_dirs = self._load_task_dirs() instance-attribute ¤
task_filter = {'has_files': [K.FILE_FRAME_UNLABEL], 'no_files': [K.FILE_FRAME_LABEL]} instance-attribute ¤
commandlist_list: list[list[str]] instance-attribute ¤
forward_files: list[str] instance-attribute ¤
backward_files: list[str] instance-attribute ¤
forward_common_files: list[str] instance-attribute ¤
backward_common_files: list[str] = [] instance-attribute ¤
prepare() ¤
postprocess() ¤

This function does:
  • Remove unlabeled .extxyz files, keeping only the labeled ones.

run() ¤

Function to submit jobs to remote machines.

Note
  • Original task_dirs is relative to run_dir and should not be changed. But the submission function needs task_dirs relative to work_dir, so we make a temporary change here.

OperGendataGpawAIMD(work_dir, pdict, multi_mdict, mdict_prefix='gpaw') ¤

Bases: RemoteOperation

See class OperGendataGpawOptimize for more details.

Methods:

  • prepare

    Refer to the pregen_gpaw_optimize() function.

  • postprocess

    Refer to the postgen_gpaw_optimize() function.

  • run

    Function to submit jobs to remote machines.

Attributes:

op_name = 'GPAW aimd' instance-attribute ¤
task_filter = {'has_files': [K.FILE_FRAME_UNLABEL], 'no_files': [K.FILE_TRAJ_LABEL]} instance-attribute ¤
work_dir = work_dir instance-attribute ¤
pdict = pdict instance-attribute ¤
mdict_list = self._select_machines(multi_mdicts, mdict_prefix) instance-attribute ¤
task_dirs = self._load_task_dirs() instance-attribute ¤
commandlist_list: list[list[str]] instance-attribute ¤
forward_files: list[str] instance-attribute ¤
backward_files: list[str] instance-attribute ¤
forward_common_files: list[str] instance-attribute ¤
backward_common_files: list[str] = [] instance-attribute ¤
prepare() ¤

Refer to the pregen_gpaw_optimize() function.

Note
  • This function differs from OperGendataGpawOptimize.prepare() in that ase_args is now in task_dirs (not in work_dir), so the forward files and commandlist_list are different.
  • structure_dirs: contains the optimized structures without scaling.
  • strain_structure_dirs: contains the scaled structures.

postprocess() ¤

Refer to the postgen_gpaw_optimize() function.

run() ¤

Function to submit jobs to remote machines.

Note
  • Original task_dirs is relative to run_dir and should not be changed. But the submission function needs task_dirs relative to work_dir, so we make a temporary change here.

OperAlGpawSinglepoint(work_dir, pdict, multi_mdict, mdict_prefix='gpaw') ¤

Bases: OperGendataGpawOptimize

Methods:

Attributes:

op_name = 'GPAW singlepoint' instance-attribute ¤
work_dir = work_dir instance-attribute ¤
pdict = pdict instance-attribute ¤
mdict_list = self._select_machines(multi_mdicts, mdict_prefix) instance-attribute ¤
task_dirs = self._load_task_dirs() instance-attribute ¤
task_filter = {'has_files': [K.FILE_FRAME_UNLABEL], 'no_files': [K.FILE_FRAME_LABEL]} instance-attribute ¤
commandlist_list: list[list[str]] instance-attribute ¤
forward_files: list[str] instance-attribute ¤
backward_files: list[str] instance-attribute ¤
forward_common_files: list[str] instance-attribute ¤
backward_common_files: list[str] = [] instance-attribute ¤
prepare() ¤
postprocess() ¤

Do post DFT tasks.

run() ¤

Function to submit jobs to remote machines.

Note
  • Original task_dirs is relative to run_dir and should not be changed. But the submission function needs task_dirs relative to work_dir, so we make a temporary change here.

util_dataset ¤

Utility functions for handling dataset files.

Functions:

split_extxyz_dataset(extxyz_files: list[str], train_ratio: float = 0.9, valid_ratio: float = 0.1, seed: int | None = None, outfile_prefix: str = 'dataset') ¤

Split a dataset into training, validation, and test sets.

If train_ratio + valid_ratio < 1, the remaining data is used as the test set.

Parameters:

  • extxyz_files (list[str]) –

    List of file paths in EXTXYZ format.

  • train_ratio (float, default: 0.9 ) –

    Ratio of training set. Defaults to 0.9.

  • valid_ratio (float, default: 0.1 ) –

    Ratio of validation set. Defaults to 0.1.

  • seed (Optional[int], default: None ) –

    Random seed. Defaults to None.

  • outfile_prefix (str, default: 'dataset' ) –

    Prefix for output file names. Defaults to "dataset".

read_list_extxyz(extxyz_files: list[str]) -> list[Atoms] ¤

Read a list of EXTXYZ files and return a list of ASE Atoms objects.

merge_extxyz_files(extxyz_files: list[str], outfile: str, sort_natoms: bool = False, sort_composition: bool = False, sort_pbc_len: bool = False) ¤

Unify multiple EXTXYZ files into a single file.

Parameters:

  • extxyz_files (list[str]) –

    List of EXTXYZ file paths.

  • outfile (str) –

    Output file path.

  • sort_natoms (bool, default: False ) –

    Sort by number of atoms. Defaults to False.

  • sort_composition (bool, default: False ) –

    Sort by chemical composition. Defaults to False.

  • sort_pbc_len (bool, default: False ) –

    Sort by periodic length. Defaults to False.

Note
  • np.lexsort is used to sort by multiple criteria. np.argsort is used to sort by a single criterion.
  • np.lexsort does not support descending order, so we reverse the sorted indices using idx[::-1].
  • If multiple sorting criteria are provided, they are applied from the last key to the first key (i.e., the last key in the tuple is the primary sort key). Example: np.lexsort((key1, key2)) sorts by key2 first, then by key1.
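For example, the key ordering described above can be reproduced directly:

```python
import numpy as np

# np.lexsort sorts by the LAST key first: here natoms is the primary key
# and composition is the secondary (tie-breaking) key.
natoms      = np.array([8, 2, 8, 2])
composition = np.array([1, 3, 0, 2])     # any comparable secondary key
idx = np.lexsort((composition, natoms))  # primary: natoms, then composition

# Descending order needs a manual reversal, since lexsort is ascending only:
idx_desc = idx[::-1]
```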

change_key_in_extxyz(extxyz_file: str, key_pairs: dict[str, str]) ¤

Change keys in extxyz file.

Parameters:

  • extxyz_file (str) –

    Path to the extxyz file.

  • key_pairs (dict) –

    Dictionary of key pairs {"old_key": "new_key"} to change. Example: {"energy": "ref_energy", "forces": "ref_forces", "stress": "ref_stress"}

Note
  • If Atoms contains internal keys (e.g., energy, forces, stress, momenta, free_energy, ...), a SinglePointCalculator object is attached to the Atoms, and these keys are stored in the dict atoms.calc.results or can be accessed via the getter methods (e.g., atoms.get_forces()).
  • These internal keys are not stored in atoms.arrays or atoms.info. If we want to store (and access) these properties in atoms.arrays or atoms.info, we need to rename these internal keys to custom keys (e.g., ref_energy, ref_forces, ref_stress, ref_momenta, ref_free_energy, ...).
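A text-level illustration of the renaming idea (hypothetical; the actual function presumably goes through ASE's reader and writer rather than raw string substitution on the extxyz comment line):

```python
import re

def rename_keys_in_comment(comment: str, key_pairs: dict[str, str]) -> str:
    """Rename key=... entries on a raw extxyz comment line.

    Hypothetical sketch: whole-word matching avoids renaming 'energy'
    inside 'free_energy', since '_' counts as a word character.
    """
    for old, new in key_pairs.items():
        comment = re.sub(rf"\b{re.escape(old)}=", f"{new}=", comment)
    return comment

line = 'Lattice="10 0 0 0 10 0 0 0 10" energy=-1.23 pbc="T T T"'
renamed = rename_keys_in_comment(line, {"energy": "ref_energy"})
```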

remove_key_in_extxyz(extxyz_file: str, key_list: list[str]) ¤

Remove unwanted keys from extxyz file to keep it clean.

select_structs_from_extxyz(extxyz_file: str, has_symbols: list | None = None, only_symbols: list | None = None, exact_symbols: list | None = None, has_properties: list | None = None, only_properties: list | None = None, has_columns: list | None = None, only_columns: list | None = None, natoms: int | None = None, tol: float = 1e-06) ¤

Choose frames from an extxyz trajectory file based on given criteria.

Parameters:

  • extxyz_file (str) –

    Path to the extxyz file.

  • has_symbols (list, default: None ) –

    List of symbols; each frame must contain at least one of them.

  • only_symbols (list, default: None ) –

    List of symbols; each frame may contain only these symbols.

  • exact_symbols (list, default: None ) –

    List of symbols; each frame must contain exactly these symbols.

  • has_properties (list, default: None ) –

    List of properties; each frame must contain at least one of them.

  • only_properties (list, default: None ) –

    List of properties; each frame may contain only these properties.

  • has_columns (list, default: None ) –

    List of columns; each frame must contain at least one of them.

  • only_columns (list, default: None ) –

    List of columns; each frame may contain only these columns.

  • natoms (int, default: None ) –

    Total number of atoms required in each frame.

  • tol (float, default: 1e-06 ) –

    Tolerance for comparing floating point numbers.

sort_atoms_by_position(struct: Atoms) -> Atoms ¤

Sorts the atoms in an Atoms object based on their Cartesian positions.

are_structs_identical(input_struct1: Atoms, input_struct2: Atoms, tol=1e-06) -> bool ¤

Checks if two Atoms objects are identical by first sorting them and then comparing their attributes.

Parameters:

  • input_struct1 (Atoms) –

    First Atoms object.

  • input_struct2 (Atoms) –

    Second Atoms object.

  • tol (float, default: 1e-06 ) –

    Tolerance for position comparison.

Returns:

  • bool ( bool ) –

    True if the structures are identical, False otherwise.
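The sort-then-compare idea can be sketched for bare position arrays (a simplified version; the real function also compares cells, species, and other attributes):

```python
import numpy as np

def positions_identical(pos1: np.ndarray, pos2: np.ndarray, tol: float = 1e-6) -> bool:
    """Compare two (N, 3) position arrays irrespective of atom ordering.

    Sketch of the sort-then-compare idea behind are_structs_identical():
    sort both arrays lexicographically by (x, y, z), then compare elementwise.
    """
    if pos1.shape != pos2.shape:
        return False
    # lexsort: the last key is primary, so pass (z, y, x) to sort by x, then y, then z
    order1 = np.lexsort((pos1[:, 2], pos1[:, 1], pos1[:, 0]))
    order2 = np.lexsort((pos2[:, 2], pos2[:, 1], pos2[:, 0]))
    return bool(np.allclose(pos1[order1], pos2[order2], atol=tol))

a = np.array([[0.0, 0.0, 0.0], [1.0, 1.0, 1.0]])
b = a[::-1].copy()                    # same atoms, reversed order
```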

are_structs_equivalent(struct1: Atoms, struct2: Atoms) -> bool ¤

Check if two Atoms objects are equivalent using ase.utils.structure_comparator.SymmetryEquivalenceCheck.compare().

Parameters:

  • struct1 (Atoms) –

    First Atoms object.

  • struct2 (Atoms) –

    Second Atoms object.

Returns:

  • bool ( bool ) –

    True if the structures are equivalent, False otherwise.

Note
  • It is not entirely clear how "equivalent" is defined by this check.

remove_duplicate_structs_serial(extxyz_file: str, tol=1e-06) -> None ¤

Remove duplicate structures from an extxyz file (serial pairwise check).

Parameters:

  • extxyz_file (str) –

    Path to the extxyz file.

  • tol (float, default: 1e-06 ) –

    Tolerance for comparing atomic positions. Defaults to 1e-6.

Returns:

  • None

    extxyz_file without duplicate structs.

remove_duplicate_structs_hash(extxyz_file: str, seen_extxyz: str | None = None, tol: float = 1e-06, backup: bool = True) -> None ¤

Remove duplicate structures using hashing (very fast).

  • Much less memory overhead compared to pairwise are_structs_identical calls.
  • This reduces duplicate checking to O(N) instead of O(N²), so no parallelism is needed.
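The hashing trick can be sketched as follows (a simplified illustration; struct_hash is hypothetical, quantisation near a grid boundary can still split near-identical structures into different buckets, and the real function presumably also accounts for cells and species):

```python
import hashlib
import numpy as np

def struct_hash(positions: np.ndarray, tol: float = 1e-6) -> str:
    """Hash an (N, 3) position array after rounding onto a tol-spaced grid,
    so positions that differ by much less than tol produce the same digest."""
    quantised = np.round(positions / tol).astype(np.int64)
    return hashlib.sha256(quantised.tobytes()).hexdigest()

# One pass over the frames: O(N) membership tests against a set of digests.
frames = [np.zeros((2, 3)), np.zeros((2, 3)) + 1e-8, np.ones((2, 3))]
seen: set[str] = set()
unique = []
for frame in frames:
    digest = struct_hash(frame)
    if digest not in seen:          # duplicate frames hash to a seen digest
        seen.add(digest)
        unique.append(frame)
```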

Parameters:

  • extxyz_file (str) –

    Path to the extxyz file.

  • seen_extxyz (str | None, default: None ) –

    Optional path to an extxyz file to be included into the set of seen structures. Defaults to None.

  • tol (float, default: 1e-06 ) –

    Tolerance for comparing atomic positions. Defaults to 1e-6.

  • backup (bool, default: True ) –

    Whether to create a backup of the original file. Defaults to True.

Note
  • Using reversed() does not modify the original list, and it is memory-efficient (no copy is made).