util
thml.util
Modules:
- check_installation
- convert_doc
- cookie_tool
- show_text_diffs_in_jupyter – Show the differences between two texts side by side in a Jupyter Notebook.
- util_rag – This module contains support functions for the RAG system.
check_installation
Functions:
convert_doc
Functions:
- pdf2md – Convert a PDF file to a markdown file.
pdf2md(input_pdf: str, output_md: str, num_threads: int = 8, device: str = 'auto', do_ocr: bool = False, ocr_engine: str = 'easyocr', ocr_lang: list[str] = ['vi', 'en']) -> None
Convert a PDF file to a markdown file.
Parameters:
- input_pdf (str) – The local path or URL of the PDF or other document.
- output_md (str) – The path to the output markdown file.
- num_threads (int, default: 8) – The number of threads to use.
- device (str, default: 'auto') – The accelerator device to use.
- do_ocr (bool, default: False) – Whether to perform OCR on the PDF.
- ocr_engine (str, default: 'easyocr') – The OCR engine to use. Choices: "easyocr" or "rapidocr".
- ocr_lang (list[str], default: ['vi', 'en']) – The list of languages to use for OCR.
Note
- See the docling examples and supported formats.
- To use the OCR feature, you need to install an OCR engine; see the installation guide.
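A minimal usage sketch for the signature above. It assumes the function is importable as `thml.util.convert_doc.pdf2md` (inferred from the module listing, not confirmed) and uses placeholder file names. The helpers `pdf2md_kwargs` and `convert_with_ocr` are hypothetical; the first only checks the documented `ocr_engine` choices before the call.

```python
from typing import Any

# Choices listed in the parameter description above.
OCR_ENGINES = ("easyocr", "rapidocr")


def pdf2md_kwargs(input_pdf: str, output_md: str, do_ocr: bool = False,
                  ocr_engine: str = "easyocr") -> dict[str, Any]:
    """Hypothetical helper: validate the OCR engine and build pdf2md keyword arguments."""
    if ocr_engine not in OCR_ENGINES:
        raise ValueError(f"ocr_engine must be one of {OCR_ENGINES}, got {ocr_engine!r}")
    return {
        "input_pdf": input_pdf,
        "output_md": output_md,
        "num_threads": 8,          # documented default
        "device": "auto",          # documented default
        "do_ocr": do_ocr,
        "ocr_engine": ocr_engine,
        "ocr_lang": ["vi", "en"],  # documented default
    }


def convert_with_ocr(input_pdf: str, output_md: str) -> None:
    """Hypothetical wrapper: run pdf2md with OCR enabled via rapidocr."""
    # Assumed import path; adjust it if the package layout differs.
    from thml.util.convert_doc import pdf2md

    pdf2md(**pdf2md_kwargs(input_pdf, output_md, do_ocr=True, ocr_engine="rapidocr"))
```

Calling `convert_with_ocr("paper.pdf", "paper.md")` would then write the markdown output, provided the chosen OCR engine is installed.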
cookie_tool
Functions:
- search_cookie_files – Search all cookie files based on the search string.
- first_cookie_file – Select the first cookie file that matches the patterns.
- read_cookies – Read all cookie files and select cookies by name.
show_text_diffs_in_jupyter
Show the differences between two texts side by side in a Jupyter Notebook, following this article: https://skeptric.com/python-diffs/
Functions to create the diffs:
- Escape any HTML characters so that they display properly in HTML
- Align the texts at the sentence level
- Mark up the differences between the tokens in each pair of aligned sentences
- Output the marked-up and aligned sentences as side-by-side HTML
REF:
- Showing Side-by-Side Diffs in Jupyter
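The escape, mark-up, and output steps can be sketched with the standard library alone. This is an illustrative outline, not the module's implementation: the helper names (`tokenize`, `mark_span`, `markup_pair`, `side_by_side`) and the `del`/`ins` CSS class names are assumptions, and sentence alignment is taken as already done.

```python
import difflib
import html
import re

whitespace = re.compile(r"\s+")


def tokenize(text: str) -> list[str]:
    """Split on whitespace, dropping empty tokens."""
    return [t for t in whitespace.split(text) if t]


def mark_span(tokens: list[str], css_class: str) -> str:
    """Escape the tokens, then wrap them in a span if a class is given."""
    body = html.escape(" ".join(tokens))
    return f'<span class="{css_class}">{body}</span>' if css_class else body


def markup_pair(a: str, b: str) -> tuple[str, str]:
    """Mark up token-level differences between two aligned sentences."""
    ta, tb = tokenize(a), tokenize(b)
    out_a: list[str] = []
    out_b: list[str] = []
    matcher = difflib.SequenceMatcher(a=ta, b=tb, autojunk=False)
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag == "equal":
            out_a.append(mark_span(ta[i1:i2], ""))
            out_b.append(mark_span(tb[j1:j2], ""))
            continue
        if i1 < i2:  # tokens present only on the left side
            out_a.append(mark_span(ta[i1:i2], "del"))
        if j1 < j2:  # tokens present only on the right side
            out_b.append(mark_span(tb[j1:j2], "ins"))
    return " ".join(out_a), " ".join(out_b)


def side_by_side(pairs: list[tuple[str, str]]) -> str:
    """Render aligned sentence pairs as a two-column HTML table."""
    rows = []
    for left, right in pairs:
        marked_left, marked_right = markup_pair(left, right)
        rows.append(f"<tr><td>{marked_left}</td><td>{marked_right}</td></tr>")
    return "<table>" + "".join(rows) + "</table>"
```

Escaping happens inside `mark_span`, before the span tags are added, so characters such as `<` in the input text cannot break the generated markup.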
Functions:
- html_diffs – Return the side-by-side HTML of the differences between text_a and text_b.
- align_sentences – Align the sentences between two lists of sentences of text.
- display_diffs_jupyter – Display the differences between text_a and text_b in Jupyter Notebook.
Attributes:
- Token
- TokenList
- whitespace
- end_sentence
Token = str
module-attribute
TokenList = list[Token]
module-attribute
whitespace = re.compile('\\s+')
module-attribute
end_sentence = re.compile('(?<!\\w\\.\\w.)(?<![A-Z][a-z]\\.)(?<=\\.|\\?|\\!\\:)\\s+')
module-attribute
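To illustrate how patterns like these are typically used, the sketch below splits text into sentences and then into tokens. It deliberately uses a simplified stand-in, `simple_end_sentence`, rather than the module's `end_sentence` pattern: the stand-in breaks on whitespace after `.`, `?`, or `!` and has none of the abbreviation guards that the negative lookbehinds above appear intended to provide. The function names are hypothetical.

```python
import re

Token = str
TokenList = list[Token]

whitespace = re.compile(r"\s+")
# Simplified stand-in (an assumption, not the module's pattern):
# split on whitespace that directly follows ".", "?" or "!".
simple_end_sentence = re.compile(r"(?<=[.?!])\s+")


def split_sentences(text: str) -> list[str]:
    """Split text into sentences at sentence-ending punctuation."""
    return [s for s in simple_end_sentence.split(text.strip()) if s]


def tokenize(sentence: str) -> TokenList:
    """Split a sentence into whitespace-separated tokens."""
    return [t for t in whitespace.split(sentence) if t]


sentences = split_sentences("Hello world. How are you? Fine!")
# sentences == ["Hello world.", "How are you?", "Fine!"]
tokens = tokenize(sentences[0])
# tokens == ["Hello", "world."]
```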
html_diffs(a: str, b: str) -> str
align_sentences(list1: list[str], list2: list[str]) -> Tuple[list[str], list[str]]
Align the sentences between two lists of sentences of text.
- Use a similarity score to compare sentences between the two lists.
- Align them when the similarity score exceeds a certain threshold.
- Insert empty sentences where necessary, without changing the order of the sentences in the original lists.
Parameters:
- list1 (list[str]) – The first list of sentences of text.
- list2 (list[str]) – The second list of sentences of text.
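Returns:
- Tuple[list[str], list[str]] – The two aligned lists of sentences, padded with empty sentences where needed.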
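One way to picture the alignment described above is a greedy, order-preserving pass. The sketch below is an assumption about the general approach, not the module's code: it scores similarity with `difflib.SequenceMatcher.ratio()`, uses a hypothetical threshold of 0.6, and the names `similarity` and `align` are invented for illustration.

```python
from difflib import SequenceMatcher


def similarity(a: str, b: str) -> float:
    """Character-level similarity in [0, 1]."""
    return SequenceMatcher(None, a, b).ratio()


def align(list1: list[str], list2: list[str],
          threshold: float = 0.6) -> tuple[list[str], list[str]]:
    """Greedily pair sentences in order, padding with "" where nothing matches."""
    out1: list[str] = []
    out2: list[str] = []
    j = 0  # next unconsumed position in list2
    for s1 in list1:
        # First remaining sentence in list2 that is similar enough to s1.
        match = next(
            (k for k in range(j, len(list2)) if similarity(s1, list2[k]) > threshold),
            None,
        )
        if match is None:
            out1.append(s1)
            out2.append("")
            continue
        # Sentences skipped in list2 get an empty partner on the left.
        for s2 in list2[j:match]:
            out1.append("")
            out2.append(s2)
        out1.append(s1)
        out2.append(list2[match])
        j = match + 1
    # Leftover sentences in list2 have no partner on the left.
    for s2 in list2[j:]:
        out1.append("")
        out2.append(s2)
    return out1, out2


left, right = align(
    ["The cat sat.", "It was happy."],
    ["A new intro.", "The cat sat down.", "It was happy."],
)
# left  == ["", "The cat sat.", "It was happy."]
# right == ["A new intro.", "The cat sat down.", "It was happy."]
```

Both outputs always have the same length, and dropping the empty strings from either one recovers the original list in its original order.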
display_diffs_jupyter(a: str, b: str)
Display the differences between text_a and text_b in Jupyter Notebook.
Parameters:
- a (str) – The first string.
- b (str) – The second string.
util_rag
This module contains support functions for the RAG system.
Classes:
- Cookie – Convenience class for Bing Cookie files, data, and configuration.
Attributes:
- log
log = Log.BingChat.debug
module-attribute
Cookie
Convenience class for Bing Cookie files, data, and configuration. This class is updated dynamically by the Query class to allow cycling through more than one cookie/credentials file, e.g. when daily request limits (currently 200 per account per day) are exceeded.
Methods:
- files – Return a sorted list of all cookie files matching .search_pattern in cls.dir_path, plus any supplied files, minus any ignored files.
- import_data – Read the active cookie file and populate the .current_file_path, .current_data, and .image_token attributes.
- import_next – Cycle through to the next cookie file, then import it.
Attributes:
- current_file_index
- dir_path
- current_file_path
- search_pattern
- ignore_files
- request_count
- supplied_files
- rotate_cookies
current_file_index = 0
class-attribute
instance-attribute
dir_path = Path.home().resolve() / 'bing_cookies'
class-attribute
instance-attribute
current_file_path = dir_path
class-attribute
instance-attribute
search_pattern = 'bing_cookies_*.json'
class-attribute
instance-attribute
ignore_files = set()
class-attribute
instance-attribute
request_count = {}
class-attribute
instance-attribute
supplied_files = set()
class-attribute
instance-attribute
rotate_cookies = True
class-attribute
instance-attribute
files() -> list[Path]
classmethod
Return a sorted list of all cookie files matching .search_pattern in cls.dir_path, plus any supplied files, minus any ignored files.
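The rule described here (glob matches, plus supplied files, minus ignored files, sorted) can be sketched as a set expression. The standalone function `list_cookie_files` below is hypothetical and only mirrors the documented behaviour; the real method is a classmethod that reads these values from `Cookie` attributes.

```python
import tempfile
from pathlib import Path
from typing import Iterable


def list_cookie_files(dir_path: Path,
                      search_pattern: str = "bing_cookies_*.json",
                      supplied_files: Iterable[Path] = (),
                      ignore_files: Iterable[Path] = ()) -> list[Path]:
    """Sorted glob matches in dir_path, plus supplied files, minus ignored files."""
    found = set(Path(dir_path).glob(search_pattern))
    return sorted((found | set(supplied_files)) - set(ignore_files))


# Small demonstration in a throwaway directory.
with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    for name in ("bing_cookies_a.json", "bing_cookies_b.json", "other.json"):
        (root / name).touch()
    extra = root / "extra.json"
    extra.touch()
    result = list_cookie_files(
        root,
        supplied_files={extra},
        ignore_files={root / "bing_cookies_b.json"},
    )
    names = [p.name for p in result]
    # names == ["bing_cookies_a.json", "extra.json"]
```

`other.json` is excluded because it does not match the pattern, and `bing_cookies_b.json` because it is in the ignore set.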
import_data() -> None
classmethod
Read the active cookie file and populate the following attributes:
- .current_file_path
- .current_data
- .image_token
import_next(discard: bool = False) -> None
classmethod
Cycle through to the next cookie file, then import it.
Parameters:
- discard (bool, default: False) – If True, mark the previous file to be ignored for the remainder of the current session. Otherwise, cycle through all available cookie files (sharing the workload and 'resting' each file when not in use).
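The rotation behaviour can be pictured with a small stand-in class. `CookieRotation` below is hypothetical and keeps only the index and ignore-set bookkeeping; it does not read any cookie data, and the way the real class handles an exhausted file list is not documented here, so raising `RuntimeError` is an assumption.

```python
from pathlib import Path


class CookieRotation:
    """Hypothetical stand-in for the rotation state described above."""

    def __init__(self, files: list[Path]) -> None:
        self.all_files = sorted(files)
        self.ignore_files: set[Path] = set()
        self.current_file_index = 0

    def files(self) -> list[Path]:
        """Available files: everything not marked as ignored."""
        return [f for f in self.all_files if f not in self.ignore_files]

    def next(self, discard: bool = False) -> Path:
        """Move to the next file; optionally drop the current one for this session."""
        available = self.files()
        if not available:
            raise RuntimeError("no cookie files available")
        current = available[self.current_file_index % len(available)]
        if discard:
            self.ignore_files.add(current)
            available = self.files()
            if not available:
                raise RuntimeError("no cookie files left after discarding")
            # The list shrank, so the same index now points at the following file.
            self.current_file_index %= len(available)
        else:
            self.current_file_index = (self.current_file_index + 1) % len(available)
        return available[self.current_file_index]


rotation = CookieRotation([Path("a.json"), Path("b.json"), Path("c.json")])
first = rotation.next()               # a.json -> b.json
second = rotation.next(discard=True)  # b.json is ignored, moves on to c.json
third = rotation.next()               # wraps around to a.json
```

Without `discard`, every file stays in the pool and the index simply wraps around, which matches the 'resting' behaviour described above.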