API

thkit

A Python package of general utilities.

Developed and maintained by C.Thang Nguyen

Modules:

Attributes:

THKIT_ROOT = Path(__file__).parent module-attribute

__author__ = 'C.Thang Nguyen' module-attribute

__contact__ = 'http://thangckt.github.io/email' module-attribute

_version

Attributes:

__all__ = ['__version__', '__version_tuple__', 'version', 'version_tuple', '__commit_id__', 'commit_id'] module-attribute

TYPE_CHECKING = False module-attribute

VERSION_TUPLE = Tuple[Union[int, str], ...] module-attribute

COMMIT_ID = Union[str, None] module-attribute

version: str = '0.1.1.dev1' module-attribute

__version__: str = '0.1.1.dev1' module-attribute

__version_tuple__: VERSION_TUPLE = (0, 1, 1, 'dev1') module-attribute

version_tuple: VERSION_TUPLE = (0, 1, 1, 'dev1') module-attribute

commit_id: COMMIT_ID = 'g858759d9c' module-attribute

__commit_id__: COMMIT_ID = 'g858759d9c' module-attribute

config

Functions:

validate_config(config_dict=None, config_file=None, schema_dict=None, schema_file=None, allow_unknown=False, require_all=False)

Validate the config file with the schema file.

Parameters:

  • config_dict (dict, default: None ) –

    config dictionary. Defaults to None.

  • config_file (str, default: None ) –

    path to the YAML config file, will override config_dict. Defaults to None.

  • schema_dict (dict, default: None ) –

    schema dictionary. Defaults to None.

  • schema_file (str, default: None ) –

    path to the YAML schema file, will override schema_dict. Defaults to None.

  • allow_unknown (bool, default: False ) –

    whether to allow unknown fields in the config file. Defaults to False.

  • require_all (bool, default: False ) –

    whether to require all fields in the schema file to be present in the config file. Defaults to False.

Raises:

  • ValueError

    if the config file does not match the schema
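As a rough illustration of the kind of checks involved, here is a toy validator with cerberus-style `type`/`required` rules. The function name mirrors the API above, but the body is a simplified sketch, not the package's actual implementation (which delegates to a schema engine):

```python
def validate_config(config_dict, schema_dict, allow_unknown=False, require_all=False):
    """Toy validator: check types, required fields, and unknown fields."""
    errors = []
    types = {"integer": int, "string": str, "float": float, "boolean": bool}
    for key, rules in schema_dict.items():
        if key not in config_dict:
            if require_all or rules.get("required", False):
                errors.append(f"missing field: {key}")
            continue
        if not isinstance(config_dict[key], types[rules["type"]]):
            errors.append(f"{key}: expected {rules['type']}")
    if not allow_unknown:
        errors += [f"unknown field: {k}" for k in config_dict if k not in schema_dict]
    if errors:
        raise ValueError("; ".join(errors))

validate_config({"host": "localhost", "port": 80},
                {"host": {"type": "string"}, "port": {"type": "integer"}})
print("config is valid")
```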

loadconfig(filename: str | Path) -> dict

Load data from a JSON or YAML file.

Args: filename (Union[str, Path]): The filename to load data from, whose suffix should be .json, .jsonc, .yml, or .yaml

Returns:

  • jdata ( dict ) –

The data loaded from the file

Notes
  • The YAML file can contain variable interpolation, which will be processed by OmegaConf. Example input YAML file:

    server:
      host: localhost
      port: 80

    client:
      url: http://${server.host}:${server.port}/
      server_port: ${server.port}
      # relative interpolation
      description: Client of ${.url}
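As a sketch of what such interpolation does, here is a toy resolver for absolute ${a.b} references only. The real work is done by OmegaConf, which also supports relative forms like ${.url}:

```python
import re

def resolve(cfg: dict) -> dict:
    """Toy resolver for ${a.b} absolute interpolations in a nested dict."""
    def lookup(path: str):
        node = cfg
        for key in path.split("."):
            node = node[key]
        return node
    def walk(node):
        if isinstance(node, dict):
            return {k: walk(v) for k, v in node.items()}
        if isinstance(node, str):
            # replace each ${...} with the value found at that dotted path
            return re.sub(r"\$\{([^}]+)\}", lambda m: str(lookup(m.group(1))), node)
        return node
    return walk(cfg)

cfg = {"server": {"host": "localhost", "port": 80},
       "client": {"url": "http://${server.host}:${server.port}/"}}
print(resolve(cfg)["client"]["url"])  # http://localhost:80/
```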
    

load_jsonc(filename: str) -> dict

Load data from a JSON file that allows comments.

get_default_args(func: callable) -> dict

Get a dict of the default values of a function's arguments. Args: func (callable): function to inspect

argdict_to_schemadict(func: callable) -> dict

Convert a function's type-annotated arguments into a cerberus schema dict.

Handles
  • Single types
  • Union types (as list of types)
  • Nullable types (None in Union)
  • Only checks top-level types (no recursion into list[int], dict[str, float], etc.)
  • Supports multiple types in cerberus (e.g. {"type": ["integer", "string"]}) when a Union is given.

Parameters:

  • func (callable) –

    function to inspect

Returns:

  • schemadict ( dict ) –

    cerberus schema dictionary
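A rough sketch of how such a conversion can work using typing introspection. The helper name and type map here are illustrative, not the package's code:

```python
import inspect
import typing

# cerberus-style names for common Python types
TYPE_MAP = {int: "integer", str: "string", float: "float", bool: "boolean",
            list: "list", dict: "dict"}

def to_schemadict(func) -> dict:
    """Map a function's top-level annotations to cerberus-style rules."""
    schema = {}
    for name, param in inspect.signature(func).parameters.items():
        ann = param.annotation
        if ann is inspect.Parameter.empty:
            continue
        if typing.get_origin(ann) is typing.Union:  # Union[...] / Optional[...]
            args = typing.get_args(ann)
            non_none = [a for a in args if a is not type(None)]
            names = [TYPE_MAP[typing.get_origin(a) or a] for a in non_none]
            schema[name] = {"type": names if len(names) > 1 else names[0],
                            "nullable": len(non_none) < len(args)}
        else:  # single type; only the top level is inspected (no recursion)
            schema[name] = {"type": TYPE_MAP[typing.get_origin(ann) or ann]}
    return schema

def f(a: int, b: typing.Union[int, str], c: typing.Optional[float] = None):
    pass

print(to_schemadict(f))
```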

io

Classes:

  • DotDict

    Dictionary supporting dot notation (attribute access) as well as standard dictionary access.

Functions:

  • write_yaml

    Write data to a YAML file.

  • read_yaml

    Read data from a YAML file.

  • combine_text_files

Combine text files into a single file in a memory-efficient way. Read and write in chunks to avoid loading large files into memory.

  • unpack_dict

    Unpack one level of nested dictionary.

  • download_rawtext

    Download raw text from a URL.

  • txt2str

    Convert a text file to a string

  • str2txt

    Convert a string to a text file

  • txt2list

    Convert a text file to a list of lines (without newline characters)

  • list2txt

    Convert a list of lines to a text file

  • float2str

    convert float number to str

DotDict(dct=None)

Bases: dict

Dictionary supporting dot notation (attribute access) as well as standard dictionary access. Nested dicts and sequences (list/tuple/set) are converted recursively.

Parameters:

  • dct (dict, default: None ) –

    Initial dictionary to populate the DotDict. Defaults to empty dict.

Usage

    d = DotDict({'a': 1, 'b': {'c': 2, 'd': [3, {'e': 4}]}})
    print(d.b.c)        # 2
    print(d['b']['c'])  # 2
    d.b.d[1].e = 42
    print(d.b.d[1].e)   # 42
    print(d.to_dict())  # plain dict

Methods:

  • __setitem__

    Set item using dot notation or standard dict syntax.

  • __setattr__
  • to_dict

    Recursively convert DotDict back to plain dict.

Attributes:

__getattr__ = dict.__getitem__ class-attribute instance-attribute
__delattr__ = dict.__delitem__ class-attribute instance-attribute
__setitem__(key, value)

Set item using dot notation or standard dict syntax.

__setattr__(key, value)
_wrap(value)
to_dict()

Recursively convert DotDict back to plain dict.
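The mechanics above (attribute access mapped onto dict access, with recursive wrapping) can be condensed into a minimal re-implementation of the described behavior, not the package's exact code:

```python
class DotDict(dict):
    """Minimal sketch: dict with attribute access; nested values wrapped recursively."""
    __getattr__ = dict.__getitem__
    __delattr__ = dict.__delitem__

    def __init__(self, dct=None):
        super().__init__()
        for k, v in (dct or {}).items():
            self[k] = v

    def __setitem__(self, key, value):
        # wrap on insertion so nested access stays dotted
        super().__setitem__(key, self._wrap(value))

    __setattr__ = __setitem__

    @classmethod
    def _wrap(cls, value):
        if isinstance(value, dict):
            return cls(value)
        if isinstance(value, (list, tuple, set)):
            return type(value)(cls._wrap(v) for v in value)
        return value

d = DotDict({"a": 1, "b": {"c": 2}})
print(d.b.c)  # 2
```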

write_yaml(jdata: dict, filename: str | Path)

Write data to a YAML file.

read_yaml(filename: str | Path) -> dict

Read data from a YAML file.

combine_text_files(files: list[str], output_file: str, chunk_size: int = 1024)

Combine text files into a single file in a memory-efficient way. Read and write in chunks to avoid loading large files into memory.

Parameters:

  • files (list[str]) –

    List of file paths to combine.

  • output_file (str) –

    Path to the output file.

  • chunk_size (int, default: 1024 ) –

    Size of each chunk in KB to read/write. Defaults to 1024 KB.
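A minimal sketch of this chunked read/write pattern (the helper name and exact loop are illustrative, not the package's code):

```python
import tempfile
from pathlib import Path

def combine_chunked(files, output_file, chunk_size_kb=1024):
    """Stream each input file into the output in fixed-size chunks."""
    chunk_bytes = chunk_size_kb * 1024
    with open(output_file, "w") as out:
        for path in files:
            with open(path) as f:
                # read at most chunk_bytes at a time; never the whole file
                while chunk := f.read(chunk_bytes):
                    out.write(chunk)

tmp = Path(tempfile.mkdtemp())
(tmp / "a.txt").write_text("hello\n")
(tmp / "b.txt").write_text("world\n")
combine_chunked([tmp / "a.txt", tmp / "b.txt"], tmp / "out.txt")
print((tmp / "out.txt").read_text())
```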

unpack_dict(nested_dict: dict) -> dict

Unpack one level of nested dictionary.

download_rawtext(url: str, outfile: str = None) -> str

Download raw text from a URL.

txt2str(file_path: str | Path) -> str

Convert a text file to a string

str2txt(text: str, file_path: str | Path) -> None

Convert a string to a text file

txt2list(file_path: str | Path) -> list[str]

Convert a text file to a list of lines (without newline characters)

list2txt(lines: list[str], file_path: str | Path) -> None

Convert a list of lines to a text file

float2str(number: float, decimals=6)

Convert a float number to a string. REF: https://stackoverflow.com/questions/2440692/formatting-floats-without-trailing-zeros

Parameters:

  • number (float) –

    float number

  • decimals (int, default: 6 ) –

    number of decimal places in the output string

Returns:

  • s ( str ) –

    string of the float number
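The trick from the referenced answer (format, then strip trailing zeros) can be sketched as follows. This is a guess at the behavior implied by the signature, not the package's exact code:

```python
def float2str(number: float, decimals: int = 6) -> str:
    # format with fixed decimals, then drop trailing zeros and a dangling point
    return f"{number:.{decimals}f}".rstrip("0").rstrip(".")

print(float2str(1.500000))   # 1.5
print(float2str(2.0))        # 2
print(float2str(0.1234567))  # 0.123457
```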

jobman

Modules:

helper

Functions:

  • change_logpath_dispatcher

    Change the logfile of dpdispatcher.

  • validate_machine_config

Validate a YAML file containing multiple machine configs. This function is used to validate machine configs at the very beginning of a program to avoid later errors.

  • loadconfig_multi_machines

Load and validate a YAML file containing multiple machine configs. This function loads machine configs for general-purpose usage.

_COLOR_MAP = {0: 'blue', 1: 'green', 2: 'yellow', 3: 'magenta', 4: 'cyan', 5: 'red', 6: 'white', 7: 'white', 8: 'white', 9: 'white', 10: 'white'} module-attribute
change_logpath_dispatcher(newlogfile: str)

Change the logfile of dpdispatcher.

_init_jobman_logger(logfile: str | None = None)

Initialize the default logger under log/ if no logfile is provided

_info_current_dispatch(num_tasks: int, num_tasks_current_chunk: int, job_limit, chunk_index, old_time=None, new_time=None, machine_index=0) -> str

Return the information of the current chunk of tasks.

_remote_info(machine_dict) -> str

Return the remote machine information. Args: machine_dict (dict): the machine dictionary

validate_machine_config(machine_file: str)

Validate a YAML file containing multiple machine configs. This function is used to validate machine configs at the very beginning of a program to avoid later errors.

Notes
  • To specify multiple remote machines for the same purpose, the top-level keys in the machine config file should start with the same prefix. Example:
    • train_1, train_2,... for training jobs
    • lammps_1, lammps_2,... for lammps jobs
    • gpaw_1, gpaw_2,... for gpaw jobs
_parse_multi_mdict(multi_mdict: dict, mdict_prefix: str = '') -> list[dict]

Parse multiple machine dicts from a multi-machine dict based on the prefix.

Parameters:

  • multi_mdict (dict) –

the big dict containing multiple machine configs

  • mdict_prefix (str, default: '' ) –

    the prefix to select remote machines for the same purpose. Example: 'dft', 'md', 'train'.

Returns:

  • list[dict]

    list[dict]: list of machine dicts
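The prefix-based selection can be sketched as below. This is illustrative only; the real function also validates each config:

```python
def parse_multi_mdict(multi_mdict: dict, mdict_prefix: str = "") -> list[dict]:
    # keep only top-level entries whose key starts with the given prefix
    return [mdict for key, mdict in multi_mdict.items()
            if key.startswith(mdict_prefix)]

machines = {
    "train_1": {"host": "gpu-a"},
    "train_2": {"host": "gpu-b"},
    "lammps_1": {"host": "cpu-a"},
}
print(parse_multi_mdict(machines, "train"))  # the two train_* configs
```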

loadconfig_multi_machines(machine_file: str, mdict_prefix: str = '') -> list[dict]

Load and validate a YAML file containing multiple machine configs. This function loads machine configs for general-purpose usage.

Parameters:

  • machine_file (str) –

    the path of the machine config file

Returns:

  • dict ( list[dict] ) –

the list of machine dicts

submit

Functions:

_machine_locks = {} module-attribute
_prepare_submission(mdict: dict, work_dir: str, task_list: list[Task], forward_common_files: list[str] = [], backward_common_files: list[str] = [], create_remote_path: bool = True) -> Submission

Function to simplify the preparation of the Submission object for dispatching jobs.

Parameters:

  • mdict (dict) –

a dictionary containing settings of the remote machine. The parameters are described here

  • create_remote_path (bool, default: True ) –

    whether to create the remote path if it does not exist.

submit_job_chunk(mdict: dict, work_dir: str, task_list: list[Task], forward_common_files: list[str] = [], backward_common_files: list[str] = [], machine_index: int = 0, logger: object = None)

Function to submit jobs to the remote machine. The function will:

  • Prepare the task list
  • Make the submission of jobs to remote machines
  • Wait for the jobs to finish and download the results to the local machine

Parameters:

  • mdict (dict) –

a dictionary containing settings of the remote machine. The parameters are described in the remote machine schema. This dictionary defines the login information, resources, execution command, etc. on the remote machine.

  • task_list (list[Task]) –

    a list of Task objects. Each task object contains the command to be executed on the remote machine, and the files to be copied to and from the remote machine. The dirs of each task must be relative to the work_dir.

  • forward_common_files (list[str], default: [] ) –

common files used for all tasks. These files are in the work_dir.

  • backward_common_files (list[str], default: [] ) –

    common files to download from the remote machine when the jobs are finished.

  • machine_index (int, default: 0 ) –

    index of the machine in the list of machines.

  • logger (object, default: None ) –

    the logger object to be used for logging.

Note
  • Split the task_list into chunks to control the number of jobs submitted at once.
  • Do not use the Local context: it will interfere with the current shell environment, which leads to unexpected behavior on the local machine. Instead, use another account to connect to the local machine with an SSH context.
_get_machine_lock(machine_index)
_run_submission_wrapper(submission, logger, check_interval=30, machine_index=0) async

Ensure only one instance of submission.run_submission runs at a time per machine. If one global lock were used for all machines, it would prevent concurrent execution of submissions on different machines. Therefore, each machine must have its own lock, so different machines can process jobs in parallel.

async_submit_job_chunk(mdict: dict, work_dir: str, task_list: list[Task], forward_common_files: list[str] = [], backward_common_files: list[str] = [], machine_index: int = 0, logger: object = None) async

Convert submit_job_chunk() into an async function that only needs to await the completion of the entire loop, without worrying about the specifics of each operation inside the loop.

Note
  • An async function normally contains an await ... statement to be awaited (which yields control to the event loop)
  • If the event loop is blocked by a synchronous function (one that never yields control to the event loop), the async function will wait for the completion of that synchronous function, so it will not run asynchronously. Use await asyncio.to_thread() to run the synchronous function in a separate thread, so that the event loop is not blocked.
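The asyncio.to_thread() pattern mentioned in the note, shown in isolation (the function names are stand-ins, not the package's code):

```python
import asyncio
import time

def blocking_submission(tag: str) -> str:
    time.sleep(0.05)  # stand-in for a synchronous run_submission call
    return f"{tag} done"

async def main():
    # run both blocking calls in worker threads so the event loop stays free
    return await asyncio.gather(
        asyncio.to_thread(blocking_submission, "m0"),
        asyncio.to_thread(blocking_submission, "m1"),
    )

print(asyncio.run(main()))  # ['m0 done', 'm1 done']
```

Because each blocking call runs in its own thread, the two submissions overlap in time instead of running back to back.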
_alff_prepare_task_list(command_list: list[str], task_dirs: list[str], forward_files: list[str], backward_files: list[str], outlog: str, errlog: str) -> list[Task]

Prepare the task list for alff package.

Jobs in the alff package all share the same command_list, forward_files, and backward_files, so this function prepares the list of Task objects for the alff package. For general usage, the task list should be prepared manually.

Parameters:

  • command_list (list[str]) –

    the list of commands to be executed on the remote machine.

  • task_dirs (list[str]) –

    the list of directories for each task. They must be relative to the work_dir in function _prepare_submission

  • forward_files (list[str]) –

the list of files to be copied to the remote machine. These files must exist in each task_dir.

  • backward_files (list[str]) –

    the list of files to be copied back from the remote machine.

  • outlog (str) –

    the name of the output log file.

  • errlog (str) –

    the name of the error log file.

  • # delay_fail_report (bool) –

whether to delay the failure report until all tasks are done. This is useful when there are many tasks and we want to wait for all tasks to finish instead of having the controller interrupt if one task fails.

Returns:

  • list[Task]

    list[Task]: a list of Task objects.

_divide_workload(mdict_list: list[dict]) -> list[float]

Revise the workload ratios among multiple machines based on their workloads.

_divide_task_dirs(mdict_list: list[dict], task_dirs: list[str]) -> list[list[str]]

Distribute task_dirs among multiple machines based on their workload ratios.
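The proportional split can be sketched as below. The helper and the round-robin handling of leftovers are assumptions, not the package's exact algorithm:

```python
def divide_task_dirs(ratios, task_dirs):
    """Split task_dirs into per-machine lists, proportional to each ratio."""
    total = sum(ratios)
    counts = [int(len(task_dirs) * r / total) for r in ratios]
    for i in range(len(task_dirs) - sum(counts)):  # hand leftovers out round-robin
        counts[i % len(counts)] += 1
    chunks, start = [], 0
    for c in counts:
        chunks.append(task_dirs[start:start + c])
        start += c
    return chunks

print(divide_task_dirs([2, 1], [f"t{i}" for i in range(6)]))
```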

alff_submit_job_multi_remotes(multi_mdict: dict, prepare_command_list: callable, work_dir: str, task_dirs: list[str], forward_files: list[str], backward_files: list[str], forward_common_files: list[str] = [], backward_common_files: list[str] = [], mdict_prefix: str = 'dft', logger: object = None) async

Submit jobs to multiple machines asynchronously.

Update 2025Sep20: Use alff_submit_job_multi_remotes_new() instead, which is more flexible. This function is kept for backward compatibility.

Parameters:

  • multi_mdict (dict) –

the big dict containing multiple mdicts. Each mdict contains the parameters of one remote machine, as described in the remote machine schema.

  • prepare_command_list (callable) –

    a function to prepare the command list based on each remote machine.

  • mdict_prefix (str, default: 'dft' ) –

    the prefix to select remote machines for the same purpose. Example: 'dft', 'md', 'train'.

alff_submit_job_multi_remotes_new(mdict_list: list[dict], commandlist_list: list[list[str]], work_dir: str, task_dirs: list[str], forward_files: list[str], backward_files: list[str], forward_common_files: list[str] = [], backward_common_files: list[str] = [], logger: object = None) async

Submit jobs to multiple machines asynchronously.

Parameters:

  • mdict_list (list[dict]) –

list of multiple mdicts. Each mdict contains the parameters of one remote machine, as described in the remote machine schema.

  • commandlist_list (list[list[str]]) –

list of command_lists, one for each remote machine. These must be prepared by the caller.

path

Functions:

  • make_dir

    Create a directory with a backup option.

  • make_dir_ask_backup

    Make a directory and ask for backup if the directory already exists.

  • ask_user_action

    Ask user for one of: yes / no / backup, and return normalized choice.

  • list_paths

    List all files/folders in given directories and their subdirectories that match the given patterns.

  • collect_files

    Collect files from a list of paths (files/folders). Will search files in folders and their subdirectories.

  • change_pathname

    change path names

  • remove_files

    Remove files from a given list of file paths.

  • remove_dirs

    Remove a list of directories.

  • remove_files_in_paths

    Remove files in the files list in the paths list.

  • remove_dirs_in_paths

    Remove directories in the dirs list in the paths list.

  • copy_file

    Copy a file/folder from the source path to the destination path. It will create the destination directory if it does not exist.

  • move_file

    Move a file/folder from the source path to the destination path.

  • filter_dirs

    Return directories containing has_files and none of no_files.

make_dir(path: str, backup: bool = True)

Create a directory with a backup option.

make_dir_ask_backup(dir_path: str)

Make a directory and ask for backup if the directory already exists.

ask_user_action(question: str) -> str

Ask user for one of: yes / no / backup, and return normalized choice.

list_paths(paths: list[str], patterns: list[str], recursive=True) -> list[str]

List all files/folders in given directories and their subdirectories that match the given patterns.

Parameters:

  • paths (list[str]) –

    The list of paths to search for files/folders.

  • patterns (list[str]) –

    The list of patterns to apply to the files. Each pattern can be a file extension or a glob pattern.

Returns:

List[str]: A list of matching paths.

Example:
    folders = ["path1", "path2", "path3"]
    patterns = ["*.ext1", "*.ext2", "something*.ext3", "*folder/"]
    files = list_paths(folders, patterns)
Note:
  • glob() does not list hidden files by default. To include hidden files, use glob(".*", recursive=True).
  • When using recursive=True, you must include ** in the pattern to search subdirectories.
    • glob("*", recursive=True) will search all FILES & FOLDERS in the CURRENT directory.
    • glob("*/", recursive=True) will search all FOLDERS in the CURRENT directory.
    • glob("**", recursive=True) will search all FILES & FOLDERS in the CURRENT directory & SUBdirectories.
    • glob("**/", recursive=True) will search all FOLDERS in the CURRENT directory & SUBdirectories.
    • "**/*" is equivalent to "**".
    • "**/*/" is equivalent to "**/".
  • IMPORTANT: "**/**" will replicate the behavior of "**", then give unexpected results.
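The recursive-glob behavior described in the note can be checked directly with the standard library:

```python
import glob
import os
import tempfile

tmp = tempfile.mkdtemp()
os.makedirs(os.path.join(tmp, "sub"))
open(os.path.join(tmp, "a.txt"), "w").close()
open(os.path.join(tmp, "sub", "b.txt"), "w").close()

# without "**" in the pattern, recursive=True does not descend into subdirectories
flat = glob.glob(os.path.join(tmp, "*.txt"), recursive=True)
# "**" matches zero or more directories, so this finds both files
deep = glob.glob(os.path.join(tmp, "**", "*.txt"), recursive=True)
print(len(flat), len(deep))  # 1 2
```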

collect_files(paths: list[str], patterns: list[str]) -> list[str]

Collect files from a list of paths (files/folders). Will search files in folders and their subdirectories.

Parameters:

  • paths (list[str]) –

    The list of paths to collect files from.

  • patterns (list[str]) –

    The list of patterns to apply to the files. Each pattern can be a file extension or a glob pattern.

Returns:

List[str]: A list of paths matching files.

change_pathname(paths: list[str], old_string: str, new_string: str, replace: bool = False) -> None

change path names

Parameters:

  • paths (list[str]) –

    paths to the files/dirs

  • old_string (str) –

    old string in path name

  • new_string (str) –

    new string in path name

  • replace (bool, default: False ) –

    replace the old path name if the new one exists. Defaults to False.

remove_files(files: list[str]) -> None

Remove files from a given list of file paths.

Parameters:

  • files (list[str]) –

    list of file paths

remove_dirs(dirs: list[str]) -> None

Remove a list of directories.

Parameters:

  • dirs (list[str]) –

    list of directories to remove.

remove_files_in_paths(files: list, paths: list) -> None

Remove files in the files list in the paths list.

remove_dirs_in_paths(dirs: list, paths: list) -> None

Remove directories in the dirs list in the paths list.

copy_file(src_path: str, dest_path: str)

Copy a file/folder from the source path to the destination path. It will create the destination directory if it does not exist.

move_file(src_path: str, dest_path: str)

Move a file/folder from the source path to the destination path.

filter_dirs(dirs: list[str], has_files: list[str] = None, no_files: list[str] = None) -> list[str]

Return directories containing has_files and none of no_files.

Parameters:

  • dirs (list[str]) –

    List of directory paths to scan.

  • has_files (list[str], default: None ) –

    Files that must exist in the directory. Defaults to [].

  • no_files (list[str], default: None ) –

    Files that must not exist in the directory. Defaults to [].

Returns:

  • list[str]

    List of directory paths meeting the conditions.
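A minimal sketch of this filter (the marker file names below are illustrative, not anything the package requires):

```python
import tempfile
from pathlib import Path

def filter_dirs(dirs, has_files=None, no_files=None):
    """Keep dirs containing every has_files entry and none of no_files."""
    has_files, no_files = has_files or [], no_files or []
    return [d for d in dirs
            if all((Path(d) / f).exists() for f in has_files)
            and not any((Path(d) / f).exists() for f in no_files)]

tmp = Path(tempfile.mkdtemp())
done, running = tmp / "job1", tmp / "job2"
done.mkdir(); running.mkdir()
(done / "OUTCAR").touch()      # finished marker (illustrative name)
(running / "RUNNING").touch()  # in-progress marker (illustrative name)
print(filter_dirs([done, running], has_files=["OUTCAR"], no_files=["RUNNING"]))
```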

pkg

Functions:

strip_ansi_codes(msg: str) -> str

Strip ANSI codes for color formatting from a string.

create_logger(logger_name: str = None, log_file: str = None, level: str = 'INFO', level_logfile: str = None) -> logging.Logger

Create and configure a logger with console and optional file handlers.

check_package(package_name: str, auto_install: bool = False, git_repo: str = None, conda_channel: str = None)

Check if the required packages are installed

install_package(package_name: str, git_repo: str | None = None, conda_channel: str | None = None) -> None

Install the required package
  • Default using: pip install -U {package_name}
  • If git_repo is provided: pip install -U git+{git_repo}
  • If conda_channel is provided: conda install -c {conda_channel} {package_name}

Parameters:

  • package_name (str) –

    package name

  • git_repo (str, default: None ) –

    git path for the package. Default: None. E.g., http://something.git

  • conda_channel (str, default: None ) –

    conda channel for the package. Default: None. E.g., conda-forge
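The command-selection rules above can be sketched as follows, building the command list only rather than executing it. The precedence between git_repo and conda_channel is an assumption:

```python
def build_install_command(package_name, git_repo=None, conda_channel=None):
    # conda channel takes precedence in this sketch
    if conda_channel is not None:
        return ["conda", "install", "-c", conda_channel, package_name]
    if git_repo is not None:
        return ["pip", "install", "-U", f"git+{git_repo}"]
    return ["pip", "install", "-U", package_name]

print(build_install_command("numpy"))
print(build_install_command("mypkg", git_repo="https://example.com/mypkg.git"))
print(build_install_command("mypkg", conda_channel="conda-forge"))
```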

dependency_info(packages=['numpy', 'polars', 'thkit', 'ase']) -> str

Get the dependency information

Note

Use importlib instead of __import__ for clarity.

range

Functions:

  • range_inclusive

    Generate evenly spaced points including the endpoint (within tolerance).

  • composite_range

    A custom parser that allows defining composite ranges. This is needed for defining input parameters in YAML files.

  • composite_index

    Allows defining composite index ranges.

  • composite_strain_points

    Generate composite spacing points from multiple ranges with tolerance-based uniqueness.

range_inclusive(start: float, end: float, step: float, tol: float = 1e-06) -> list[float]

Generate evenly spaced points including the endpoint (within tolerance).

composite_range(list_inputs: list[int | float | str], tol=1e-06) -> list[float]

A custom parser that allows defining composite ranges. This is needed for defining input parameters in YAML files.

Parameters:

  • list_inputs (list[int | float | str]) –

    Accepts numbers or strings with special form 'start:end[:step]' (inclusive).

  • tol (float, default: 1e-06 ) –

    Tolerance for including the endpoint.

Examples: ["-3.1:-1", 0.1, 2, "3.1:5.2", "6.0:10.1:0.5"]
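A sketch of such a parser, expanding numbers and 'start:end[:step]' strings. The expansion logic is inferred from the examples; the real function also deduplicates points using the tolerance:

```python
def composite_range(list_inputs, tol=1e-6):
    """Expand numbers and 'start:end[:step]' strings (inclusive; step defaults to 1)."""
    points = []
    for item in list_inputs:
        if isinstance(item, str):
            parts = [float(p) for p in item.split(":")]
            start, end = parts[0], parts[1]
            step = parts[2] if len(parts) > 2 else 1.0
            x, i = start, 0
            while x <= end + tol:  # include the endpoint within tolerance
                points.append(x)
                i += 1
                x = start + i * step
        else:
            points.append(float(item))
    return points

print(composite_range(["0:2", 3.5, "5:6:0.5"]))  # [0.0, 1.0, 2.0, 3.5, 5.0, 5.5, 6.0]
```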

composite_index(list_inputs: list[int | str]) -> list[int]

Allows defining composite index ranges.

Parameters:

  • list_inputs (list[int | str]) –

    Accepts ints or strings with special form 'start:end[:step]' (inclusive).

Examples: [1, 2, "3-5", "7-10:2"] -> [1, 2, 3, 4, 5, 7, 9, 10]

composite_strain_points(list_inputs: list[int | float | str], tol=1e-06) -> list[float]

Generate composite spacing points from multiple ranges with tolerance-based uniqueness.

Notes:
  • np.round(np.array(all_points) / tol).astype(int) is a trick to avoid floating-point issues when comparing points with a certain tolerance.

stuff

Functions:

chunk_list(input_list: list, n: int) -> Generator

Yield successive n-sized chunks from input_list.
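This is the classic slicing generator; a minimal version consistent with the signature above:

```python
from typing import Generator

def chunk_list(input_list: list, n: int) -> Generator:
    """Yield successive n-sized chunks from input_list."""
    for i in range(0, len(input_list), n):
        yield input_list[i:i + n]

print(list(chunk_list([1, 2, 3, 4, 5], 2)))  # [[1, 2], [3, 4], [5]]
```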

text_fill_center(input_text='example', fill='-', length=60)

Create a line with centered text.

text_fill_left(input_text='example', margin=15, fill_left='-', fill_right=' ', length=60)

Create a line with left-aligned text.

text_fill_box(input_text: str = '', fill: str = ' ', sp: str = 'ǁ', length: int = 60) -> str

Return a string centered in a box with side delimiters.

Example

text_fill_box("hello", fill="-", sp="|", length=20) '|-------hello-------|'

Notes:
  • To input unicode characters, use the unicode escape sequence (e.g., "ǁ" for a specific character). See https://symbl.cc/en/unicode-table/ for more details.
    • ║ (double vertical bar, u2551)
    • ‖ (double vertical line, u2016)
    • ǁ (Latin letter lateral click, u01C1)

text_repeat(input_str: str, length: int) -> str

Repeat the input string to a specified length.

text_color(text: str, color: str = 'blue') -> str

Color the text using ANSI escape codes.

time_uuid() -> str

simple_uuid()

Generate a simple random UUID of 4 digits.