alff.gdata
¤
Modules:
- convert_mpchgnet_to_xyz
- gendata – Data generation workflow implementation.
- libgen_gpaw
- util_dataset – Utility functions for handling dataset files.
convert_mpchgnet_to_xyz
¤
Functions:
Attributes:
gendata
¤
Data generation workflow implementation.
Classes:
- WorkflowGendata – Workflow for generating initial data for training ML models.
Functions:
- make_structure – Build structures based on input parameters.
- optimize_structure – Optimize the structures.
- sampling_space – Explore the sampling space.
- run_dft – Run DFT calculations.
- collect_data – Collect data from DFT simulations.
- copy_labeled_structure – Copy labeled structures.
- strain_dim – Scale a single spatial dimension of the structures.
- strain_x_dim – Scale the x dimension of the structures.
- strain_y_dim – Scale the y dimension of the structures.
- strain_z_dim – Scale the z dimension of the structures.
- perturb_structure – Perturb the structures.
WorkflowGendata(params_file: str, machines_file: str)
¤
Bases: Workflow
Workflow for generating initial data for training ML models.
Methods:
- run – The main function to run the workflow.
Attributes:
- stage_map
- wf_name
- params_file
- machines_file
- schema_file
- multi_mdicts
- pdict
- stage_list
stage_map = {'make_structure': make_structure, 'optimize_structure': optimize_structure, 'sampling_space': sampling_space, 'run_dft': run_dft, 'collect_data': collect_data}
instance-attribute
¤
wf_name = 'DATA GENERATION'
instance-attribute
¤
params_file = params_file
instance-attribute
¤
machines_file = machines_file
instance-attribute
¤
schema_file = schema_file
instance-attribute
¤
multi_mdicts = config_machine.multi_mdicts
instance-attribute
¤
pdict = Config.loadconfig(self.params_file)
instance-attribute
¤
stage_list = self._load_stage_list()
instance-attribute
¤
run()
¤
The main function to run the workflow. This default implementation works for simple workflows; more complex workflows (e.g. with iteration, such as active learning) need to reimplement this .run() function.
make_structure(pdict, mdict)
¤
Build structures based on input parameters.
optimize_structure(pdict, mdict)
¤
Optimize the structures.
sampling_space(pdict, mdict)
¤
Explore the sampling space.
Sampling space includes:
- Range of strains (in x, y, z directions) + range of temperatures
- Range of temperatures + range of stresses
Notes
- Structure paths are saved into 2 lists: original and sampling structure paths.
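The strain + temperature sampling grid described above can be sketched as a Cartesian product (the values below are hypothetical; the real workflow reads its ranges from the params file):

```python
from itertools import product

# Hypothetical sampling ranges; the real workflow reads these from pdict.
strains = [0.98, 1.00, 1.02]     # scale factors per direction
temperatures = [300, 600, 900]   # K

# Every (strain, temperature) pair defines one sampling point.
sampling_points = list(product(strains, temperatures))
print(len(sampling_points))  # 9
```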
run_dft(pdict, mdict)
¤
Run DFT calculations.
collect_data(pdict, mdict)
¤
Collect data from DFT simulations.
copy_labeled_structure(src_dir: str, dest_dir: str)
¤
Copy labeled structures.
- First, try to copy the labeled structure if it exists.
- If there is no labeled structure, copy the unlabeled structure.
strain_dim(struct_files: list[str], strain_list: list[float], dim: int) -> list[str]
¤
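A minimal sketch of scaling a single cell dimension, assuming fractional coordinates stay fixed while one lattice vector is scaled (strain_one_dim is a hypothetical helper, not the package function, which operates on structure files):

```python
import numpy as np

def strain_one_dim(cell, frac_positions, strain, dim):
    """Scale one lattice vector by `strain`; fractional coordinates stay
    fixed, so Cartesian positions scale along with the cell."""
    new_cell = np.array(cell, dtype=float)
    new_cell[dim] *= strain                 # scale a single lattice vector
    new_cart = frac_positions @ new_cell    # rebuild Cartesian positions
    return new_cell, new_cart

# Cubic 2 Angstrom cell with one atom at the body center
cell = np.eye(3) * 2.0
frac = np.array([[0.5, 0.5, 0.5]])
new_cell, new_cart = strain_one_dim(cell, frac, strain=1.1, dim=0)
# new_cell[0] is now ~[2.2, 0, 0]; the atom moves to ~[1.1, 1.0, 1.0]
```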
strain_x_dim(struct_files: list[str], strain_x_list: list[float]) -> list[str]
¤
Scale the x dimension of the structures.
strain_y_dim(struct_files: list[str], strain_y_list: list[float]) -> list[str]
¤
Scale the y dimension of the structures.
strain_z_dim(struct_files: list[str], strain_z_list: list[float]) -> list[str]
¤
Scale the z dimension of the structures.
perturb_structure(struct_files: list, perturb_num: int, perturb_disp: float)
¤
Perturb the structures.
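A minimal sketch of such a perturbation, assuming uniform random displacements bounded by perturb_disp (perturb_positions is a hypothetical helper; the package function operates on structure files):

```python
import numpy as np

def perturb_positions(positions, perturb_num, perturb_disp, seed=None):
    """Return `perturb_num` perturbed copies of `positions`, each with a
    uniform random displacement of at most `perturb_disp` per coordinate."""
    rng = np.random.default_rng(seed)
    return [positions + rng.uniform(-perturb_disp, perturb_disp,
                                    size=positions.shape)
            for _ in range(perturb_num)]

pos = np.zeros((4, 3))
copies = perturb_positions(pos, perturb_num=5, perturb_disp=0.05, seed=0)
# 5 copies, every displacement bounded by 0.05
```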
libgen_gpaw
¤
Classes:
- OperGendataGpawOptimize – This class does GPAW optimization for a list of structures in task_dirs.
- OperGendataGpawSinglepoint
- OperGendataGpawAIMD – See class OperGendataGpawOptimize for more details.
- OperAlGpawSinglepoint
OperGendataGpawOptimize(work_dir, pdict, multi_mdict, mdict_prefix='gpaw')
¤
Bases: RemoteOperation
This class does GPAW optimization for a list of structures in task_dirs.
Methods:
- prepare – Prepare the operation.
- postprocess – Remove unlabeled .extxyz files, keeping only the labeled ones.
- run – Function to submit jobs to remote machines.
Attributes:
- op_name
- task_filter
- work_dir
- pdict
- mdict_list
- task_dirs
- commandlist_list (list[list[str]])
- forward_files (list[str])
- backward_files (list[str])
- forward_common_files (list[str])
- backward_common_files (list[str])
op_name = 'GPAW optimize'
instance-attribute
¤
task_filter = {'has_files': [K.FILE_FRAME_UNLABEL], 'no_files': [K.FILE_FRAME_LABEL]}
instance-attribute
¤
work_dir = work_dir
instance-attribute
¤
pdict = pdict
instance-attribute
¤
mdict_list = self._select_machines(multi_mdicts, mdict_prefix)
instance-attribute
¤
task_dirs = self._load_task_dirs()
instance-attribute
¤
commandlist_list: list[list[str]]
instance-attribute
¤
forward_files: list[str]
instance-attribute
¤
backward_files: list[str]
instance-attribute
¤
forward_common_files: list[str]
instance-attribute
¤
backward_common_files: list[str] = []
instance-attribute
¤
prepare()
¤
Prepare the operation.
Includes:
- Prepare ase_args for GPAW and gpaw_run_file. Note: pdict.dft.calc_args.gpaw{} must be defined for this function.
- Prepare the task_list
- Prepare forward & backward files
- Prepare commandlist_list for multi-remote submission
postprocess()
¤
This function does:
- Remove unlabeled .extxyz files, keeping only the labeled ones.
run()
¤
Function to submit jobs to remote machines.
Note
- The original task_dirs is relative to run_dir and should not be changed. But the submission function needs task_dirs relative to work_dir, so we make a temporary change here.
OperGendataGpawSinglepoint(work_dir, pdict, multi_mdict, mdict_prefix='gpaw')
¤
Bases: OperGendataGpawOptimize
Methods:
- prepare
- postprocess – Remove unlabeled .extxyz files, keeping only the labeled ones.
- run – Function to submit jobs to remote machines.
Attributes:
- op_name
- work_dir
- pdict
- mdict_list
- task_dirs
- task_filter
- commandlist_list (list[list[str]])
- forward_files (list[str])
- backward_files (list[str])
- forward_common_files (list[str])
- backward_common_files (list[str])
op_name = 'GPAW singlepoint'
instance-attribute
¤
work_dir = work_dir
instance-attribute
¤
pdict = pdict
instance-attribute
¤
mdict_list = self._select_machines(multi_mdicts, mdict_prefix)
instance-attribute
¤
task_dirs = self._load_task_dirs()
instance-attribute
¤
task_filter = {'has_files': [K.FILE_FRAME_UNLABEL], 'no_files': [K.FILE_FRAME_LABEL]}
instance-attribute
¤
commandlist_list: list[list[str]]
instance-attribute
¤
forward_files: list[str]
instance-attribute
¤
backward_files: list[str]
instance-attribute
¤
forward_common_files: list[str]
instance-attribute
¤
backward_common_files: list[str] = []
instance-attribute
¤
prepare()
¤
postprocess()
¤
This function does:
- Remove unlabeled .extxyz files, keeping only the labeled ones.
run()
¤
Function to submit jobs to remote machines.
Note
- The original task_dirs is relative to run_dir and should not be changed. But the submission function needs task_dirs relative to work_dir, so we make a temporary change here.
OperGendataGpawAIMD(work_dir, pdict, multi_mdict, mdict_prefix='gpaw')
¤
Bases: RemoteOperation
See class OperGendataGpawOptimize for more details.
Methods:
- prepare – Refer to the pregen_gpaw_optimize() function.
- postprocess – Refer to the postgen_gpaw_optimize() function.
- run – Function to submit jobs to remote machines.
Attributes:
- op_name
- task_filter
- work_dir
- pdict
- mdict_list
- task_dirs
- commandlist_list (list[list[str]])
- forward_files (list[str])
- backward_files (list[str])
- forward_common_files (list[str])
- backward_common_files (list[str])
op_name = 'GPAW aimd'
instance-attribute
¤
task_filter = {'has_files': [K.FILE_FRAME_UNLABEL], 'no_files': [K.FILE_TRAJ_LABEL]}
instance-attribute
¤
work_dir = work_dir
instance-attribute
¤
pdict = pdict
instance-attribute
¤
mdict_list = self._select_machines(multi_mdicts, mdict_prefix)
instance-attribute
¤
task_dirs = self._load_task_dirs()
instance-attribute
¤
commandlist_list: list[list[str]]
instance-attribute
¤
forward_files: list[str]
instance-attribute
¤
backward_files: list[str]
instance-attribute
¤
forward_common_files: list[str]
instance-attribute
¤
backward_common_files: list[str] = []
instance-attribute
¤
prepare()
¤
Refer to the pregen_gpaw_optimize() function.
Note:
- This function differs from OperGendataGpawOptimize.prepare() in that ase_args is now in task_dirs (not in work_dir), so the forward files and commandlist_list are different.
- structure_dirs: contains the optimized structures without scaling.
- strain_structure_dirs: contains the scaled structures.
postprocess()
¤
Refer to the postgen_gpaw_optimize() function.
run()
¤
Function to submit jobs to remote machines.
Note
- The original task_dirs is relative to run_dir and should not be changed. But the submission function needs task_dirs relative to work_dir, so we make a temporary change here.
OperAlGpawSinglepoint(work_dir, pdict, multi_mdict, mdict_prefix='gpaw')
¤
Bases: OperGendataGpawOptimize
Methods:
- prepare
- postprocess – Do post-DFT tasks.
- run – Function to submit jobs to remote machines.
Attributes:
- op_name
- work_dir
- pdict
- mdict_list
- task_dirs
- task_filter
- commandlist_list (list[list[str]])
- forward_files (list[str])
- backward_files (list[str])
- forward_common_files (list[str])
- backward_common_files (list[str])
op_name = 'GPAW singlepoint'
instance-attribute
¤
work_dir = work_dir
instance-attribute
¤
pdict = pdict
instance-attribute
¤
mdict_list = self._select_machines(multi_mdicts, mdict_prefix)
instance-attribute
¤
task_dirs = self._load_task_dirs()
instance-attribute
¤
task_filter = {'has_files': [K.FILE_FRAME_UNLABEL], 'no_files': [K.FILE_FRAME_LABEL]}
instance-attribute
¤
commandlist_list: list[list[str]]
instance-attribute
¤
forward_files: list[str]
instance-attribute
¤
backward_files: list[str]
instance-attribute
¤
forward_common_files: list[str]
instance-attribute
¤
backward_common_files: list[str] = []
instance-attribute
¤
prepare()
¤
postprocess()
¤
Do post-DFT tasks.
run()
¤
Function to submit jobs to remote machines.
Note
- The original task_dirs is relative to run_dir and should not be changed. But the submission function needs task_dirs relative to work_dir, so we make a temporary change here.
util_dataset
¤
Utility functions for handling dataset files.
Functions:
- split_extxyz_dataset – Split a dataset into training, validation, and test sets.
- read_list_extxyz – Read a list of EXTXYZ files and return a list of ASE Atoms objects.
- merge_extxyz_files – Unify multiple EXTXYZ files into a single file.
- change_key_in_extxyz – Change keys in an extxyz file.
- remove_key_in_extxyz – Remove unwanted keys from an extxyz file to keep it clean.
- select_structs_from_extxyz – Choose frames from an extxyz trajectory file, based on some criteria.
- sort_atoms_by_position – Sort the atoms in an Atoms object based on their Cartesian positions.
- are_structs_identical – Check if two Atoms objects are identical by first sorting them and then comparing their attributes.
- are_structs_equivalent – Check if two Atoms objects are equivalent using ase.utils.structure_comparator.SymmetryEquivalenceCheck.compare().
- remove_duplicate_structs_serial – Check for duplicate structs in an extxyz file.
- remove_duplicate_structs_hash – Remove duplicate structures using hashing (very fast).
split_extxyz_dataset(extxyz_files: list[str], train_ratio: float = 0.9, valid_ratio: float = 0.1, seed: int | None = None, outfile_prefix: str = 'dataset')
¤
Split a dataset into training, validation, and test sets.
If train_ratio + valid_ratio < 1, the remaining data is used as the test set.
Parameters:
- extxyz_files (list[str]) – List of file paths in EXTXYZ format.
- train_ratio (float, default: 0.9) – Ratio of the training set. Defaults to 0.9.
- valid_ratio (float, default: 0.1) – Ratio of the validation set. Defaults to 0.1.
- seed (int | None, default: None) – Random seed. Defaults to None.
- outfile_prefix (str, default: 'dataset') – Prefix for output file names. Defaults to "dataset".
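The split logic can be sketched with plain index arithmetic (split_indices is a hypothetical helper; the actual function reads and writes EXTXYZ files):

```python
import numpy as np

def split_indices(n, train_ratio=0.9, valid_ratio=0.1, seed=None):
    """Shuffle indices 0..n-1 and cut them into train/valid/test arrays.
    Whatever remains after train + valid becomes the test set."""
    rng = np.random.default_rng(seed)
    idx = rng.permutation(n)
    n_train = int(n * train_ratio)
    n_valid = int(n * valid_ratio)
    return (idx[:n_train],
            idx[n_train:n_train + n_valid],
            idx[n_train + n_valid:])

train, valid, test = split_indices(100, train_ratio=0.8, valid_ratio=0.1,
                                   seed=42)
print(len(train), len(valid), len(test))  # 80 10 10
```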
read_list_extxyz(extxyz_files: list[str]) -> list[Atoms]
¤
Read a list of EXTXYZ files and return a list of ASE Atoms objects.
merge_extxyz_files(extxyz_files: list[str], outfile: str, sort_natoms: bool = False, sort_composition: bool = False, sort_pbc_len: bool = False)
¤
Unify multiple EXTXYZ files into a single file.
Parameters:
- extxyz_files (list[str]) – List of EXTXYZ file paths.
- outfile (str) – Output file path.
- sort_natoms (bool, default: False) – Sort by number of atoms. Defaults to False.
- sort_composition (bool, default: False) – Sort by chemical composition. Defaults to False.
- sort_pbc_len (bool, default: False) – Sort by periodic length. Defaults to False.
Note
- np.lexsort is used to sort by multiple criteria; np.argsort is used to sort by a single criterion.
- np.lexsort does not support descending order, so we reverse the sorted indices using idx[::-1].
- If multiple sorting criteria are provided, they are applied from the last key to the first key (i.e., the last key in the list is the primary sort key). Example: np.lexsort((key1, key2)) sorts by key2 first, then by key1.
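The key-ordering behavior of np.lexsort described above can be checked directly (the arrays below are made-up sort keys, standing in for per-frame properties):

```python
import numpy as np

# Sort frames by natoms first, then by a composition key within equal natoms.
natoms      = np.array([8, 4, 8, 4])
composition = np.array([2, 1, 1, 2])

# The LAST key in the tuple is the primary sort key.
order = np.lexsort((composition, natoms))
print(order)  # [1 3 2 0]

# np.lexsort has no descending option; reverse the indices instead.
desc = order[::-1]
print(desc)  # [0 2 3 1]
```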
change_key_in_extxyz(extxyz_file: str, key_pairs: dict[str, str])
¤
Change keys in extxyz file.
Parameters:
- extxyz_file (str) – Path to the extxyz file.
- key_pairs (dict) – Dictionary of key pairs {"old_key": "new_key"} to change. Example: {"energy": "ref_energy", "forces": "ref_forces", "stress": "ref_stress"}
Note
- If Atoms contains internal keys (e.g., energy, forces, stress, momenta, free_energy, ...), a SinglePointCalculator object is attached to the Atoms; these keys are stored in the dict atoms.calc.results, or can be accessed using the .get_*() methods.
- These internal keys are not stored in atoms.arrays or atoms.info. If we want to store (and access) these properties in atoms.arrays or atoms.info, we need to change these internal keys to custom keys (e.g., ref_energy, ref_forces, ref_stress, ref_momenta, ref_free_energy, ...).
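The renaming itself can be sketched over plain dicts (here results stands in for atoms.calc.results and info for atoms.info; the real function operates on the extxyz file via ASE):

```python
# Hypothetical sketch of moving calculator results into custom-keyed storage.
key_pairs = {"energy": "ref_energy", "forces": "ref_forces"}

results = {"energy": -1.23, "forces": [[0.0, 0.0, 0.1]], "stress": None}
info = {}

for old_key, new_key in key_pairs.items():
    if old_key in results:
        info[new_key] = results.pop(old_key)   # rename while moving

print(sorted(info))  # ['ref_energy', 'ref_forces']
```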
remove_key_in_extxyz(extxyz_file: str, key_list: list[str])
¤
Remove unwanted keys from extxyz file to keep it clean.
select_structs_from_extxyz(extxyz_file: str, has_symbols: list | None = None, only_symbols: list | None = None, exact_symbols: list | None = None, has_properties: list | None = None, only_properties: list | None = None, has_columns: list | None = None, only_columns: list | None = None, natoms: int | None = None, tol: float = 1e-06)
¤
Choose frames from a extxyz trajectory file, based on some criteria.
Parameters:
- extxyz_file (str) – Path to the extxyz file.
- has_symbols (list, default: None) – List of symbols; each frame must contain at least one of them.
- only_symbols (list, default: None) – List of symbols; each frame must contain only these symbols.
- exact_symbols (list, default: None) – List of symbols; each frame must contain exactly these symbols.
- has_properties (list, default: None) – List of properties; each frame must have at least one of them.
- only_properties (list, default: None) – List of properties; each frame must have only these properties.
- has_columns (list, default: None) – List of columns; each frame must have at least one of them.
- only_columns (list, default: None) – List of columns; each frame must have only these columns.
- natoms (int, default: None) – Total number of atoms in a frame.
- tol (float, default: 1e-06) – Tolerance for comparing floating point numbers.
sort_atoms_by_position(struct: Atoms) -> Atoms
¤
Sorts the atoms in an Atoms object based on their Cartesian positions.
are_structs_identical(input_struct1: Atoms, input_struct2: Atoms, tol=1e-06) -> bool
¤
Checks if two Atoms objects are identical by first sorting them and then comparing their attributes.
Parameters:
- input_struct1 (Atoms) – First Atoms object.
- input_struct2 (Atoms) – Second Atoms object.
- tol (float, default: 1e-06) – Tolerance for position comparison.
Returns:
- bool – True if the structures are identical, False otherwise.
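A minimal order-independent comparison in the same spirit (sort_by_position and structs_identical are hypothetical helpers operating on bare symbol/position arrays, not ASE Atoms objects):

```python
import numpy as np

def sort_by_position(symbols, positions):
    """Sort atoms lexicographically by (x, y, z)."""
    positions = np.asarray(positions, dtype=float)
    order = np.lexsort((positions[:, 2], positions[:, 1], positions[:, 0]))
    return [symbols[i] for i in order], positions[order]

def structs_identical(sym1, pos1, sym2, pos2, tol=1e-6):
    """Order-independent comparison of two structures."""
    if len(sym1) != len(sym2):
        return False
    s1, p1 = sort_by_position(sym1, pos1)
    s2, p2 = sort_by_position(sym2, pos2)
    return s1 == s2 and np.allclose(p1, p2, atol=tol)

a = (["O", "H"], [[0.0, 0.0, 0.0], [1.0, 0.0, 0.0]])
b = (["H", "O"], [[1.0, 0.0, 0.0], [0.0, 0.0, 0.0]])  # same atoms, reordered
print(structs_identical(a[0], a[1], b[0], b[1]))  # True
```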
are_structs_equivalent(struct1: Atoms, struct2: Atoms) -> bool
¤
Check if two Atoms objects are equivalent using ase.utils.structure_comparator.SymmetryEquivalenceCheck.compare().
Parameters:
- struct1 (Atoms) – First Atoms object.
- struct2 (Atoms) – Second Atoms object.
Returns:
- bool – True if the structures are equivalent, False otherwise.
Note
- It is not clear what exactly counts as "equivalent".
remove_duplicate_structs_serial(extxyz_file: str, tol=1e-06) -> None
¤
Check for duplicate structs in an extxyz file.
remove_duplicate_structs_hash(extxyz_file: str, seen_extxyz: str | None = None, tol: float = 1e-06, backup: bool = True) -> None
¤
Remove duplicate structures using hashing (very fast).
- Much less memory overhead compared to pairwise are_structs_identical calls.
- This reduces duplicate checking to O(N) instead of O(N²). No parallelism is needed; it is already O(N).
Parameters:
- extxyz_file (str) – Path to the extxyz file.
- seen_extxyz (str | None, default: None) – Optional path to an extxyz file to be included in the set of seen structures. Defaults to None.
- tol (float, default: 1e-06) – Tolerance for comparing atomic positions. Defaults to 1e-6.
- backup (bool, default: True) – Whether to create a backup of the original file. Defaults to True.
Note
- Using reversed() does not modify the original list, and it is memory-efficient (no copy).
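The hashing idea can be sketched as follows: round positions onto the tolerance grid, canonicalize the atom order, and keep a set of seen hashes (struct_hash and dedupe are hypothetical helpers, not the package functions):

```python
import numpy as np

def struct_hash(symbols, positions, tol=1e-6):
    """Round positions onto a `tol` grid so nearly identical structures
    collide, canonicalize the atom order, then hash."""
    pos = np.round(np.asarray(positions, dtype=float) / tol).astype(np.int64)
    order = np.lexsort((pos[:, 2], pos[:, 1], pos[:, 0]))
    return hash((tuple(symbols[i] for i in order), pos[order].tobytes()))

def dedupe(structs, tol=1e-6):
    seen, unique = set(), []
    for symbols, positions in structs:
        h = struct_hash(symbols, positions, tol)
        if h not in seen:            # O(1) set lookup -> O(N) overall
            seen.add(h)
            unique.append((symbols, positions))
    return unique

frames = [
    (["H"], [[0.0, 0.0, 0.0]]),
    (["H"], [[0.0, 0.0, 1e-9]]),   # duplicate within tol
    (["H"], [[0.5, 0.0, 0.0]]),
]
print(len(dedupe(frames)))  # 2
```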