Files Parser
Parsing and Chunking files using Cosmos.
About Files Parser
Welcome to the Cosmos Universal File Parser and Chunker.
This guide walks you through parsing files into smaller blocks of text using CosmosAPI. This is useful for processing large documents, extracting specific sections, or managing content in a more structured way. For example, you might want to extract specific pages of a document, or split the text into blocks of a standardized size as a first step in a Retrieval-Augmented Generation (RAG) architecture.
Storage
⚠️ Warning: This parsed version of your file will not be stored permanently and will not be accessible once the session is closed. For persistent storage of processed files, consider using the Files Index functions.
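If you only need to keep a parsed result on your own machine between sessions, one option is to write it to disk yourself. The sketch below is a local workaround, not a Cosmos feature: it assumes the parse response is a JSON-serializable dictionary, and `save_parsed` and `load_parsed` are hypothetical helper names introduced here for illustration.

```python
import json
import tempfile
from pathlib import Path


def save_parsed(parsed, destination):
    """Write a parsed result to a JSON file, creating parent folders if needed."""
    path = Path(destination)
    path.parent.mkdir(parents=True, exist_ok=True)
    path.write_text(json.dumps(parsed, ensure_ascii=False, indent=2), encoding="utf-8")
    return path


def load_parsed(source):
    """Read back a result previously written by save_parsed."""
    return json.loads(Path(source).read_text(encoding="utf-8"))


# Round trip with a stand-in payload (assumed shape, not a real API response).
parsed = {"data": {"chunks": [{"content": "Example content", "page": 1}]}}
with tempfile.TemporaryDirectory() as tmp_dir:
    saved_path = save_parsed(parsed, Path(tmp_dir) / "parsed" / "file.json")
    restored = load_parsed(saved_path)

print(restored == parsed)
```

In real use you would pass a permanent destination instead of a temporary directory.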
Supported File Formats
CosmosAPI supports a wide range of document formats for chunking, including:
- Documents: .pdf, .docx, .doc, .odt, .txt, .md
- Spreadsheets: .xlsx, .xls, .ods, .csv
- Presentations: .pptx, .ppt, .odp
- Notebooks: .ipynb
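A quick local check can catch unsupported files before a request is sent. The following is a minimal sketch that copies the extensions from the list above; `SUPPORTED_EXTENSIONS` and `file_category` are hypothetical names used only for this example, and the API itself may apply different or additional checks.

```python
from pathlib import Path

# Extensions copied from the list above, grouped the same way.
SUPPORTED_EXTENSIONS = {
    "documents": {".pdf", ".docx", ".doc", ".odt", ".txt", ".md"},
    "spreadsheets": {".xlsx", ".xls", ".ods", ".csv"},
    "presentations": {".pptx", ".ppt", ".odp"},
    "notebooks": {".ipynb"},
}


def file_category(filepath):
    """Return the format group for a path, or None if the extension is not listed."""
    suffix = Path(filepath).suffix.lower()  # compare case-insensitively
    for category, extensions in SUPPORTED_EXTENSIONS.items():
        if suffix in extensions:
            return category
    return None


print(file_category("reports/Q3.PDF"))
print(file_category("archive.zip"))
```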
Files Parsing
The file parsing endpoint (available at https://platform.cosmos-suite.ai/api/v1/parse) allows you to break down a file into smaller, manageable pieces.
You can chunk files into:
- Chunks: Sections within size limits.
- Pages: Selected pages of the document.
- File: The whole file content.
Before starting, make sure you've completed the Quickstart Guide to set up your account and obtain an API key.
Step-by-step Tutorial
In this guide:
- Section 1: Prerequisites.
- Section 2: Setup Cosmos Python Client.
- Section 3: Parse a file. Parameters and examples.
- Section 4: Handle Errors.
1. Prerequisites
Before you begin, ensure you have:
- An active CosmosPlatform account
- API key from the API keys dashboard
2. Setup Cosmos Python Client
The Cosmos Python client provides a convenient way to perform API requests.
2.1. Install Cosmos Python Client:
Install the Cosmos Python client with pip:
pip install delos-cosmos
2.2. Authenticate Requests:
Initialize the client with your API key:
    from cosmos import CosmosClient

    # Replace the placeholder with your own API key.
    cosmos_client = CosmosClient(api_key="your-cosmos-api-key")
2.3. Call API:
You can now start invoking any Cosmos endpoint. For example, let's try the /health endpoint to check the validity of your API key and the availability of the client services:

    response = cosmos_client.status_health()
    print(response)
3. Parse a file
Let's see an example of a file parsing request using the Python client (/files/parse endpoint):

    from cosmos import CosmosClient

    cosmos_client = CosmosClient(api_key="your-cosmos-api-key")

    # The file path must be passed as a string.
    response = cosmos_client.files_parse(
        filepath="path/to/file.pdf",
        extract_type="pages",
        read_images=True,
    )
    print(response)
If your request is successful, you will receive a response with the chunked content. Here’s an example response for parsing a document into chunks:
    {
      "data": {
        "chunks": [
          {
            "content": "Example content",
            "size": 17,
            "page": 1,
            "position": {}
          }
          // More chunks...
        ]
      },
      "metadata": {
        "total_tokens": 573,
        "request_cost": 0.003,
        "request_id": "3e6c184e-f6a3-4adc-b4a9-ab307d4decc5",
        "timestamp": "2024-10-09T09:58:59.522391+00:00"
      }
    }
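Once a response arrives, you will usually want to pull the text and the usage figures out of it. The sketch below works on a dictionary copied from the example response and groups chunk contents by page. It assumes the client returns (or can be converted to) a plain dictionary with this shape; `summarize` is a hypothetical helper, not part of the Cosmos client.

```python
# Sample payload copied from the example response (a single chunk shown).
response = {
    "data": {
        "chunks": [
            {"content": "Example content", "size": 17, "page": 1, "position": {}},
        ]
    },
    "metadata": {
        "total_tokens": 573,
        "request_cost": 0.003,
        "request_id": "3e6c184e-f6a3-4adc-b4a9-ab307d4decc5",
        "timestamp": "2024-10-09T09:58:59.522391+00:00",
    },
}


def summarize(parsed):
    """Group chunk contents by page and pick out the usage metadata."""
    pages = {}
    for chunk in parsed.get("data", {}).get("chunks", []):
        pages.setdefault(chunk["page"], []).append(chunk["content"])
    metadata = parsed.get("metadata", {})
    return {
        "pages": pages,
        "total_tokens": metadata.get("total_tokens"),
        "request_cost": metadata.get("request_cost"),
    }


summary = summarize(response)
print(summary["pages"])
print(summary["total_tokens"], summary["request_cost"])
```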
3.1. Parameters:
- filepath: expects the local path to your file.
- extract_type: determines the type of blocks obtained from the document. Three types of text parsing are available:
  - chunks: Blocks of text within size limits.
  - pages: Selected pages of the document.
  - file: The whole file content. It is the default behavior.
- read_images: boolean option that applies image scanning methods in order to read the images in the document.
- k_min and k_max: control the size, in tokens, of the chunks (only if the chunks extraction type is selected).
- overlap: allows consecutive chunks to overlap by up to the specified number of tokens.
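To build intuition for how a maximum chunk size and an overlap interact, here is a small sliding-window sketch. It is not the algorithm CosmosAPI uses: it treats whitespace-separated words as stand-ins for tokens, ignores k_min entirely, and `chunk_tokens` is a hypothetical function written only for this illustration.

```python
def chunk_tokens(tokens, k_max, overlap):
    """Split tokens into windows of at most k_max, each sharing `overlap` tokens
    with the previous window."""
    if k_max <= 0:
        raise ValueError("k_max must be positive")
    if not 0 <= overlap < k_max:
        raise ValueError("overlap must be in the range [0, k_max)")
    step = k_max - overlap  # how far each new window advances
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + k_max])
        if start + k_max >= len(tokens):
            break  # this window already reached the end of the text
    return chunks


words = "one two three four five six seven eight nine ten".split()
for chunk in chunk_tokens(words, k_max=4, overlap=1):
    print(chunk)
```

With k_max=4 and overlap=1, each window starts three words after the previous one, so the last word of one chunk reappears as the first word of the next.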
3.2. Examples:
Here are all the parameters for the file parsing request:
Parameter | Description | Example |
---|---|---|
filepath | The path to the file to be parsed | /path/to/file.pdf |
extract_type | The type of extraction to perform: chunks (default), pages, or file | chunks |
read_images (optional) | Whether to scan images. By default, images are not scanned. | False |
k_min (optional) | Minimum amount of tokens in a chunk (only used in chunks extraction type) | 500 |
k_max (optional) | Maximum amount of tokens in a chunk (only used in chunks extraction type) | 1000 |
overlap (optional) | The overlap between chunks (only used in chunks extraction type) | 10 |
filter_pages (optional) | Pages to be selected | [1, 2, 3] |
This request sets every parameter except filter_pages:
    from cosmos import CosmosClient

    cosmos_client = CosmosClient(api_key="your-cosmos-api-key")

    # The file path must be passed as a string.
    response = cosmos_client.files_parse(
        filepath="path/to/file.pdf",
        extract_type="chunks",
        read_images=False,
        k_min=500,
        k_max=1000,
        overlap=10,
    )
    print(response)
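Checking the arguments locally before sending a request can save a round trip. The sketch below collects the parameters from the table into a dictionary that could then be unpacked into `files_parse`. The constraints it enforces (k_min not larger than k_max, non-negative overlap) are assumptions made for this example rather than documented API rules, and `build_parse_kwargs` is a hypothetical helper.

```python
VALID_EXTRACT_TYPES = {"chunks", "pages", "file"}


def build_parse_kwargs(filepath, extract_type, read_images=False,
                       k_min=None, k_max=None, overlap=None, filter_pages=None):
    """Validate the arguments and return only the ones that were set."""
    if extract_type not in VALID_EXTRACT_TYPES:
        raise ValueError(f"extract_type must be one of {sorted(VALID_EXTRACT_TYPES)}")
    # Assumed constraints, not confirmed by the API documentation.
    if k_min is not None and k_max is not None and k_min > k_max:
        raise ValueError("k_min cannot be larger than k_max")
    if overlap is not None and overlap < 0:
        raise ValueError("overlap cannot be negative")

    kwargs = {"filepath": filepath, "extract_type": extract_type,
              "read_images": read_images}
    optional = {"k_min": k_min, "k_max": k_max, "overlap": overlap,
                "filter_pages": filter_pages}
    # Leave unset options out so the service defaults apply.
    kwargs.update({name: value for name, value in optional.items()
                   if value is not None})
    return kwargs


kwargs = build_parse_kwargs("path/to/file.pdf", "chunks",
                            k_min=500, k_max=1000, overlap=10)
print(kwargs)
# With a configured client: response = cosmos_client.files_parse(**kwargs)
```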
4. Handle Errors
If there is an issue with your request, you will receive an error response. For example, if you don't provide a file, you will get a 400 Bad Request error:
    {
      "error_code": "400",
      "error_message": "Invalid request parameters",
      "details": "No file provided for the chunker"
    }
If you encounter any errors:
- Ensure that you have provided all necessary parameters.
- Verify that your API key is correct. You can find it in your API keys dashboard. Double-check that it is Active and copy it. To rule out a key issue, send a request to the health service:

      from cosmos import CosmosClient

      # Use the key you want to verify.
      cosmos_client = CosmosClient(api_key="your-cosmos-api-key")
      response = cosmos_client.status_health()
      print(response)
- Make sure the file path is correct and the file exists.
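The checks above can be partly automated. The following sketch verifies that the file exists before a request is made and raises an exception when a payload looks like the error response shown earlier. It assumes errors come back as a dictionary with an "error_code" key; the real client may raise its own exceptions instead. `ParseRequestError`, `check_file`, and `raise_for_error` are hypothetical names for this example.

```python
from pathlib import Path


class ParseRequestError(Exception):
    """Hypothetical exception type used only in this sketch."""


def check_file(filepath):
    """Fail early if the path does not point to an existing file."""
    path = Path(filepath)
    if not path.is_file():
        raise FileNotFoundError(f"No file found at {path}")
    return path


def raise_for_error(response):
    """Raise if the payload has the shape of the error response; else return it."""
    if isinstance(response, dict) and "error_code" in response:
        raise ParseRequestError(
            f"{response['error_code']}: {response.get('error_message')}"
            f" ({response.get('details')})"
        )
    return response


# Error payload copied from the example in this section.
error_payload = {
    "error_code": "400",
    "error_message": "Invalid request parameters",
    "details": "No file provided for the chunker",
}

try:
    raise_for_error(error_payload)
except ParseRequestError as exc:
    print(exc)
```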
Conclusion
This step-by-step guide has walked you through the process of parsing and chunking files using CosmosAPI. For more detailed information and advanced usage, refer to the API Reference.
Happy parsing!