PDF Loader

class PDFLoader(max_chunk_words: int = 500)

Bases: DocumentLoader

A class to load and process PDF documents.

Inherits from DocumentLoader and specializes in handling PDF files. It extracts text blocks from PDF pages, cleans them, and chunks them into smaller segments.

Args:

max_chunk_words (int, optional): The maximum number of words per text chunk. Defaults to 500.
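
To picture what max_chunk_words controls, here is a minimal sketch of word-based chunking. It is not the PDFLoader implementation: chunk_words is a hypothetical helper, and it assumes that words are delimited by whitespace and that chunks do not overlap.

```python
def chunk_words(text: str, max_chunk_words: int = 500) -> list[str]:
    """Split text into chunks of at most max_chunk_words words each.

    Hypothetical helper for illustration only; the real loader may
    chunk on paragraph or block boundaries instead.
    """
    if max_chunk_words < 1:
        raise ValueError("max_chunk_words must be at least 1")
    # Assumption: a "word" is any whitespace-delimited token.
    words = text.split()
    # Step through the word list in strides of max_chunk_words.
    return [
        " ".join(words[start:start + max_chunk_words])
        for start in range(0, len(words), max_chunk_words)
    ]


chunks = chunk_words("one two three four five", max_chunk_words=2)
```

With a limit of 2, the five-word input above yields three chunks, the last of which holds the single leftover word. A larger limit produces fewer, longer chunks.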

Example:

from eazyml_genai.components import PDFLoader


# Initialize the PDFLoader with a custom max_chunk_words
pdf_loader = PDFLoader(max_chunk_words=800)

# Load the PDF as JSON-style documents with keys such as title and
# content, paths to images of the formulas, tables, and pictures
# extracted from the PDF, and meta information such as page number,
# paragraph, and bounding boxes.
documents = pdf_loader.load(file_path='YOUR FILE PATH')

load(file_path: str, pages: int | list | str | None = None)

Loads content from a PDF file, optionally restricted to specific pages, cleans it, chunks it, and converts it into a list of document dictionaries.

Args:

file_path (str): The path to the PDF file to load.
pages (int | list | str | None, optional): The page number, or list of page numbers, to load. Defaults to None, which loads all pages.

Returns:

(List[dict]): A list of dictionaries, where each dictionary represents a chunk of text from the PDF. Each dictionary typically contains keys such as ‘content’ and ‘metadata’ (the latter including page number, file path, etc.).
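
As a rough illustration of how the returned list might be consumed, the sketch below groups chunk text by page. The sample data is made up, and the metadata key names page_number and file_path are assumptions; check the actual keys in your own output before relying on them.

```python
from collections import defaultdict


def group_content_by_page(documents, page_key="page_number"):
    """Map each page to the list of chunk texts found on it.

    page_key is an assumed metadata key name, not a documented one.
    """
    pages = defaultdict(list)
    for doc in documents:
        # Use .get so a chunk without metadata does not raise KeyError.
        metadata = doc.get("metadata", {})
        pages[metadata.get(page_key)].append(doc.get("content", ""))
    return dict(pages)


# Hand-written stand-in for the output of pdf_loader.load(...).
sample_documents = [
    {"content": "Intro text.", "metadata": {"page_number": 1, "file_path": "a.pdf"}},
    {"content": "More intro.", "metadata": {"page_number": 1, "file_path": "a.pdf"}},
    {"content": "Methods.", "metadata": {"page_number": 2, "file_path": "a.pdf"}},
]

by_page = group_content_by_page(sample_documents)
```

Chunks that lack the assumed key are collected under None, which makes missing metadata easy to spot.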

Raises:

Exception: If pages is less than 1.
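
The page check described above could be approximated before calling load, as in the sketch below. check_pages is a hypothetical helper, not part of the library. It assumes that page numbers start at 1 and covers only the int, list, and None forms, because the format of the str form is not documented here. It raises ValueError, a subclass of Exception, whereas the loader itself is only documented to raise a generic Exception.

```python
def check_pages(pages):
    """Return pages as a list of ints, or None to mean all pages.

    Hypothetical pre-validation helper for illustration only.
    """
    if pages is None:
        return None  # None means "load every page"
    if isinstance(pages, int):
        candidates = [pages]
    elif isinstance(pages, list):
        candidates = list(pages)
    else:
        # The str form accepted by load() is not handled in this sketch.
        raise TypeError("this sketch supports only int, list, or None")
    for page in candidates:
        if not isinstance(page, int) or page < 1:
            raise ValueError(f"invalid page number: {page!r}")
    return candidates


selected = check_pages([1, 3])
```

Validating up front gives a specific error message, which can be easier to act on than a generic exception raised inside the loader.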