Files Parser
Parsing and Chunking files using Cosmos.
About Files Parser
Welcome to the Cosmos Universal File Parser and Chunker.
This guide walks you through parsing files into smaller blocks of text using CosmosPlatform's API. This can be useful for processing large documents, extracting specific sections, or managing content in a more structured way. For example, you might want to extract specific pages of a document or split the text into uniformly sized blocks as a first step in a Retrieval-Augmented Generation (RAG) architecture.
Storage
⚠️ Warning: The parsed version of your file is not stored permanently and will not be accessible once the session is closed. For persistent storage of processed files, consider using the Files Index functions.
Supported File Formats
CosmosPlatform supports a wide range of document formats for chunking, including:
- Documents: .pdf, .docx, .doc, .odt, .txt, .md
- Spreadsheets: .xlsx, .xls, .ods, .csv
- Presentations: .pptx, .ppt, .odp
- Notebooks: .ipynb
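If you want to catch an unsupported file before sending it, a small local check against the list above can help. This is a minimal sketch: `SUPPORTED_EXTENSIONS` and `file_category` are hypothetical names, not part of the Cosmos client, and since the list says "including", the service may accept formats that are not listed here.

```python
from pathlib import Path

# Extensions copied from the list above, grouped the same way.
SUPPORTED_EXTENSIONS = {
    "documents": {".pdf", ".docx", ".doc", ".odt", ".txt", ".md"},
    "spreadsheets": {".xlsx", ".xls", ".ods", ".csv"},
    "presentations": {".pptx", ".ppt", ".odp"},
    "notebooks": {".ipynb"},
}


def file_category(filepath):
    """Return the format group for a path, or None if the extension is not listed."""
    suffix = Path(filepath).suffix.lower()  # compare case-insensitively
    for category, extensions in SUPPORTED_EXTENSIONS.items():
        if suffix in extensions:
            return category
    return None
```

For example, `file_category("slides.pptx")` returns `"presentations"`, while a path ending in `.zip` returns `None`.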
Files Parsing
The file parsing endpoint (available at https://platform.cosmos-suite.ai/api/v1/parse) allows you to break down a file into smaller, manageable pieces.
You can chunk files into:
- Chunks: Sections within size limits.
- Pages: Selected pages of the document.
- File: The whole file content.
Before starting, make sure you've completed the Quickstart Guide to set up your account and obtain an API key.
Step-by-step Tutorial
1. Prerequisites
Before you begin, ensure you have:
- An active CosmosPlatform account
- API key from the API keys dashboard
2. Setup Cosmos Python Client
The Cosmos Python client gives you a convenient way to perform API requests.
2.1. Install Cosmos Python Client:
To install the Python client, add it to your project with:
poetry add delos-cosmos
Or install it in your virtual environment with:
pip install delos-cosmos
2.2. Authenticate Requests:
- Initialize the client with your API key:
from cosmos import CosmosClient

cosmos_client = CosmosClient(apikey="your-cosmos-api-key")
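Rather than pasting the key into your source code, you may prefer to read it from the environment. The sketch below assumes an environment variable named `COSMOS_API_KEY`. That name, and the helpers `load_api_key` and `build_client`, are illustrative choices and not something the client requires.

```python
import os


def load_api_key(env_var="COSMOS_API_KEY"):
    """Read the API key from the environment; fail early if it is missing."""
    api_key = os.environ.get(env_var, "").strip()
    if not api_key:
        raise RuntimeError(f"Set the {env_var} environment variable to your API key")
    return api_key


def build_client():
    """Create a CosmosClient using the key found in the environment."""
    # Imported here so load_api_key can be used without the package installed.
    from cosmos import CosmosClient

    return CosmosClient(apikey=load_api_key())
```

With the variable set in your shell, `cosmos_client = build_client()` replaces the hard-coded initialization.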
2.3. Call API:
- You can now invoke any Cosmos endpoint. For example, let's try the /health endpoint to check the validity of your API key and the availability of the client services:
response = cosmos_client.status_health_request()
print(response)
3. Parse a file
Let's see an example of a file parsing request using the Python client:
from cosmos import CosmosClient

cosmos_client = CosmosClient(apikey="your-cosmos-api-key")
response = cosmos_client.files_parse_request(
    filepath="path/to/file.pdf",
    extract_type="pages",
    read_images=True,
)
print(response)
There are three types of file text parsing available:
- chunks: Blocks of text within size limits.
- pages: Selected pages of the document.
- file: The whole file content. It is the default behavior.
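As an illustration of the pages type, the following sketch wraps the request in a small function that takes an already initialized client. `parse_pages` is a hypothetical helper. It relies only on `files_parse_request` and the `filter_pages` parameter from the parameter table, and it passes `extract_type` explicitly instead of depending on a default. Whether page numbers start at 0 or 1 is not stated in this guide, so check the API Reference before relying on specific values.

```python
def parse_pages(client, filepath, pages):
    """Request only the given pages of a file through an existing client."""
    if not pages:
        raise ValueError("pages must contain at least one page number")
    return client.files_parse_request(
        filepath=filepath,
        extract_type="pages",      # set explicitly rather than relying on a default
        filter_pages=list(pages),  # accept any iterable, send a plain list
    )
```

A call such as `parse_pages(cosmos_client, "path/to/file.pdf", [1, 2, 3])` would then request the three pages used as the example value in the table.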
The parameters for the file parsing request are:
Parameter | Description | Example |
---|---|---|
filepath | The path to the file to be parsed | /path/to/file.pdf |
extract_type | The type of extraction to perform: chunks (default), pages, or file | chunks |
read_images (optional) | Whether to scan images. Images are not scanned by default. | False |
k_min (optional) | Minimum number of tokens in a chunk (only used in chunks extraction type) | 500 |
k_max (optional) | Maximum number of tokens in a chunk (only used in chunks extraction type) | 1000 |
overlap (optional) | The overlap between chunks (only used in chunks extraction type) | 10 |
filter_pages (optional) | Pages to be selected | [1, 2, 3] |
The boolean option read_images applies image-scanning methods to read the images contained in the file.
Another example, this time providing the optional parameters:
from cosmos import CosmosClient

cosmos_client = CosmosClient(apikey="your-cosmos-api-key")
response = cosmos_client.files_parse_request(
    filepath="path/to/file.pdf",
    extract_type="chunks",
    read_images=False,
    k_min=500,
    k_max=1000,
    overlap=10,
)
print(response)
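Since several parameters only apply to one extraction type, it can be handy to assemble and check the arguments locally before calling the API. The helper below is a sketch under stated assumptions: `build_parse_kwargs` and `VALID_EXTRACT_TYPES` are hypothetical names, and the checks reflect only what the parameter table says. The service may apply further validation of its own.

```python
VALID_EXTRACT_TYPES = {"chunks", "pages", "file"}


def build_parse_kwargs(filepath, extract_type, read_images=False,
                       k_min=None, k_max=None, overlap=None, filter_pages=None):
    """Assemble keyword arguments for files_parse_request, rejecting obvious mistakes."""
    if extract_type not in VALID_EXTRACT_TYPES:
        raise ValueError(f"extract_type must be one of {sorted(VALID_EXTRACT_TYPES)}")
    if k_min is not None and k_max is not None and k_min > k_max:
        raise ValueError("k_min must not exceed k_max")

    kwargs = {"filepath": filepath, "extract_type": extract_type, "read_images": read_images}
    if extract_type == "chunks":
        # The table marks these three as used only for the chunks extraction type,
        # so they are left out of the request for the other types.
        chunk_options = {"k_min": k_min, "k_max": k_max, "overlap": overlap}
        kwargs.update({key: value for key, value in chunk_options.items() if value is not None})
    if filter_pages is not None:
        kwargs["filter_pages"] = list(filter_pages)
    return kwargs
```

The result can be passed straight to the client, for example `cosmos_client.files_parse_request(**build_parse_kwargs("path/to/file.pdf", "chunks", k_min=500, k_max=1000, overlap=10))`.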
If your request is successful, you will receive a response with the chunked content. Here’s an example response for parsing a document into chunks:
{
  "data": {
    "chunks": [
      {
        "content": "Example content",
        "size": 17,
        "page": 1,
        "position": {}
      }
      // More chunks...
    ]
  },
  "metadata": {
    "total_tokens": 573,
    "request_cost": 0.003,
    "request_id": "3e6c184e-f6a3-4adc-b4a9-ab307d4decc5",
    "timestamp": "2024-10-09T09:58:59.522391+00:00"
  }
}
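To work with a response like this one, you could walk the chunk list and read the reported usage. The sketch assumes the response is available as a Python dictionary with the same keys as the JSON above. This guide does not state what type the Python client returns, so you may need to convert it first. `iter_chunks` and `summarize` are hypothetical helper names.

```python
def iter_chunks(response):
    """Yield (page, size, content) for each chunk in a chunks-type response."""
    for chunk in response.get("data", {}).get("chunks", []):
        yield chunk.get("page"), chunk.get("size"), chunk.get("content", "")


def summarize(response):
    """Return the chunk count plus the token total and cost reported in metadata."""
    metadata = response.get("metadata", {})
    return {
        "chunks": sum(1 for _ in iter_chunks(response)),
        "total_tokens": metadata.get("total_tokens"),
        "request_cost": metadata.get("request_cost"),
    }
```

Both helpers use `.get` with fallbacks, so a response without a `chunks` list simply produces no items instead of raising a `KeyError`.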
4. Handle Errors
If there is an issue with your request, you will receive an error response. For example, if you don't provide a file, you will get a 400 Bad Request error:
{
  "error_code": "400",
  "error_message": "Invalid request parameters",
  "details": "No file provided for the chunker"
}
If you encounter any errors:
- Ensure that you have provided all necessary parameters.
- Verify that your API key is correct. You can find it in your API keys dashboard. Double-check that it is Active and copy it. To rule out a key issue, try a request to the health service:
from cosmos import CosmosClient

cosmos_client = CosmosClient(apikey="false-api-key")
response = cosmos_client.status_health_request()
print(response)
- Make sure the file path is correct and the file exists.
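Two of the checks above can be automated: confirming the file exists before the request, and spotting an error payload afterwards. This is a sketch only. `ParseError`, `ensure_file` and `raise_for_error` are hypothetical names, and it assumes an error reaches your code as a dictionary shaped like the 400 example above. The client may instead raise its own exception, in which case only `ensure_file` applies.

```python
from pathlib import Path


class ParseError(Exception):
    """Raised when a response carries the error fields shown above."""


def ensure_file(filepath):
    """Fail before sending the request if the path does not point to a file."""
    path = Path(filepath)
    if not path.is_file():
        raise FileNotFoundError(f"No file found at {path}")
    return path


def raise_for_error(response):
    """Return the response unchanged, or raise ParseError if it looks like an error."""
    if isinstance(response, dict) and "error_code" in response:
        message = response.get("error_message", "Unknown error")
        details = response.get("details", "")
        raise ParseError(f"{response['error_code']}: {message} ({details})")
    return response
```

A typical flow would call `ensure_file(...)` first, send the request, and then pass the result through `raise_for_error(...)` before reading any chunks.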
Conclusion
This step-by-step guide has walked you through the process of parsing and chunking files using CosmosPlatform's API. For more detailed information and advanced usage, refer to the API Reference.
Happy parsing!