Files Parser

Parsing and Chunking files using Cosmos.

About Files Parser

Welcome to the Cosmos Universal File Parser and Chunker.

This guide walks you through parsing files into smaller blocks of text using CosmosAPI. This is useful for processing large documents, extracting specific sections, or managing content in a more structured way. For example, you might want to extract specific pages of a document, or split the text into blocks of a standardized size as a first step in a Retrieval-Augmented Generation (RAG) architecture.

Storage

⚠️ Warning: This parsed version of your file will not be stored permanently and will not be accessible once the session is closed. For persistent storage of processed files, consider using the Files Index functions.
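Because the parsed output is not kept after the session ends, one option is to save the response locally as soon as it arrives. The following is a minimal sketch, assuming the response is a JSON-serializable dictionary shaped like the example response shown later in this guide. The helper names `save_parse_response` and `load_parse_response` are hypothetical and not part of the Cosmos client.

```python
import json
from pathlib import Path


def save_parse_response(response: dict, out_path: str) -> Path:
    """Write a parse response to disk as JSON and return the path written."""
    path = Path(out_path)
    # Create missing parent folders so the first save does not fail.
    path.parent.mkdir(parents=True, exist_ok=True)
    path.write_text(
        json.dumps(response, ensure_ascii=False, indent=2),
        encoding="utf-8",
    )
    return path


def load_parse_response(in_path: str) -> dict:
    """Read back a response previously written by save_parse_response."""
    return json.loads(Path(in_path).read_text(encoding="utf-8"))
```

This only keeps a local copy; it is not a substitute for the Files Index functions when the content needs to remain available on the platform.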

Supported File Formats

CosmosAPI supports a wide range of document formats for chunking, including:

Documents: .pdf, .docx, .doc, .odt, .txt, .md
Spreadsheets: .xlsx, .xls, .ods, .csv
Presentations: .pptx, .ppt, .odp
Notebooks: .ipynb
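A quick local check of the file extension can catch unsupported files before a request is sent. This sketch treats the list above as exhaustive, which is an assumption (the list is introduced with "including"), and the helper name `is_supported` is hypothetical.

```python
from pathlib import Path

# Extensions copied from the list above. Assumption: no other formats are accepted.
SUPPORTED_EXTENSIONS = {
    ".pdf", ".docx", ".doc", ".odt", ".txt", ".md",  # documents
    ".xlsx", ".xls", ".ods", ".csv",                 # spreadsheets
    ".pptx", ".ppt", ".odp",                         # presentations
    ".ipynb",                                        # notebooks
}


def is_supported(filepath: str) -> bool:
    """Return True if the file's final extension is in the supported list."""
    return Path(filepath).suffix.lower() in SUPPORTED_EXTENSIONS
```

The comparison is case-insensitive, so `report.PDF` is accepted, and only the last extension is considered.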

Files Parsing

The file parsing endpoint (available at https://platform.cosmos-suite.ai/api/v1/parse) lets you break a file down into smaller, manageable pieces.

You can chunk files into:

Chunks: Sections within size limits.
Pages: Selected pages of the document.
File: The whole file content.

Before starting, make sure you've completed the Quickstart Guide to set up your account and obtain an API key.

Step-by-step Tutorial

In this guide:
Section 1: Prerequisites.
Section 2: Setup Cosmos Python Client.
Section 3: Parse a file. Parameters and examples.
Section 4: Handle Errors.

1. Prerequisites

Before you begin, ensure you have:

An active CosmosPlatform account
API key from the API keys dashboard

2. Setup Cosmos Python Client

The Cosmos Python client provides a convenient way to send requests to the API.

2.1. Install Cosmos Python Client:

Install the Cosmos Python client with pip:

pip install delos-cosmos

2.2. Authenticate Requests:

Initialize the client with your API key:

from cosmos import CosmosClient

cosmos_client = CosmosClient(api_key="your-cosmos-api-key")
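To keep the key out of source files, it can be read from an environment variable instead. This is a sketch under assumptions: the variable name `COSMOS_API_KEY` and the helper `load_api_key` are illustrative choices, not something the Cosmos client is known to read on its own.

```python
import os


def load_api_key(env_var: str = "COSMOS_API_KEY") -> str:
    """Return the API key stored in the given environment variable.

    The variable name is an assumption made for this example.
    """
    api_key = os.environ.get(env_var, "").strip()
    if not api_key:
        raise RuntimeError(
            f"Set the {env_var} environment variable to your Cosmos API key."
        )
    return api_key


# Usage (requires the Cosmos Python client to be installed):
# from cosmos import CosmosClient
# cosmos_client = CosmosClient(api_key=load_api_key())
```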

2.3. Call API:

You can start invoking any Cosmos endpoints. For example, let's try the /health endpoint to check the validity of your API key and the availability of the client services:

response = cosmos_client.status_health()
print(response)

3. Parse a file

Here is an example of a file parsing request using the Python client (/files/parse endpoint):

from cosmos import CosmosClient

cosmos_client = CosmosClient(api_key="your-cosmos-api-key")
response = cosmos_client.files_parse(filepath="path/to/file.pdf", extract_type="pages", read_images=True)
print(response)

If your request is successful, you will receive a response with the parsed content. Here’s an example response for a document parsed into chunks:

{
  "data": {
    "chunks": [
      { "content": "Example content", "size": 17, "page": 1, "position": {} }
      // More chunks...
    ]
  },
  "metadata": {
    "total_tokens": 573,
    "request_cost": 0.003,
    "request_id": "3e6c184e-f6a3-4adc-b4a9-ab307d4decc5",
    "timestamp": "2024-10-09T09:58:59.522391+00:00"
  }
}
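Once the response arrives, the chunks and the request metadata can be read with ordinary dictionary access. This is a minimal sketch, assuming the client hands back the response as a Python dictionary with the shape shown above; the helper `iter_chunks` is hypothetical.

```python
def iter_chunks(response: dict):
    """Yield (page, content) pairs for every chunk in a parse response.

    Missing keys are tolerated so an unexpected payload yields nothing
    instead of raising KeyError.
    """
    for chunk in response.get("data", {}).get("chunks", []):
        yield chunk.get("page"), chunk.get("content", "")


# Sample payload copied from the example response above.
sample_response = {
    "data": {
        "chunks": [
            {"content": "Example content", "size": 17, "page": 1, "position": {}},
        ]
    },
    "metadata": {
        "total_tokens": 573,
        "request_cost": 0.003,
        "request_id": "3e6c184e-f6a3-4adc-b4a9-ab307d4decc5",
        "timestamp": "2024-10-09T09:58:59.522391+00:00",
    },
}

for page, content in iter_chunks(sample_response):
    print(f"page {page}: {content}")

print("total tokens:", sample_response["metadata"]["total_tokens"])
```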

3.1. Parameters:

The filepath expects the local path to your file.
The extract_type determines the type of blocks that will be obtained from the document. There are three types of file text parsing available:
- chunks: Blocks of text within size limits.
- pages: Selected pages of the document.
- file: The whole file content. It is the default behavior.
The boolean read_images option applies image-scanning methods to read the content of images in the file.
The k_min and k_max parameters set the minimum and maximum chunk size in tokens (used only when the chunks extraction type is selected). The overlap parameter lets consecutive chunks overlap by up to the given number of tokens.
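To build intuition for what an overlap between consecutive chunks means, here is a simplified local illustration. It is not the Cosmos chunking algorithm: it uses a fixed window of `k_max` tokens, a fixed overlap, ignores `k_min`, and works on an already tokenized list. The function name `sliding_window_chunks` is hypothetical.

```python
def sliding_window_chunks(tokens: list, k_max: int, overlap: int) -> list:
    """Split tokens into windows of at most k_max tokens.

    Each window starts (k_max - overlap) tokens after the previous one,
    so neighbouring windows share `overlap` tokens.
    """
    if k_max <= 0:
        raise ValueError("k_max must be positive")
    if not 0 <= overlap < k_max:
        raise ValueError("overlap must satisfy 0 <= overlap < k_max")

    step = k_max - overlap
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + k_max])
        # Stop once the current window has reached the end of the input.
        if start + k_max >= len(tokens):
            break
    return chunks


words = "a b c d e".split()
print(sliding_window_chunks(words, k_max=3, overlap=1))
# [['a', 'b', 'c'], ['c', 'd', 'e']]
```

In the output, the token `c` closes the first chunk and opens the second, which is the kind of shared context that overlap provides between neighbouring chunks.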

3.2. Examples:

Here are all the parameters for the file parsing request:

Parameter	Description	Example
`filepath`	The path to the file to be parsed	`/path/to/file.pdf`
`extract_type`	The type of extraction to perform: `chunks` (default), `pages`, or `file`	`chunks`
`read_images` (optional)	Whether to scan images for content. Image scanning is off by default.	`False`
`k_min` (optional)	Minimum amount of tokens in a chunk (only used in `chunks` extraction type)	`500`
`k_max` (optional)	Maximum amount of tokens in a chunk (only used in `chunks` extraction type)	`1000`
`overlap` (optional)	The overlap between chunks (only used in `chunks` extraction type)	`10`
`filter_pages` (optional)	Pages to be selected	`[1, 2, 3]`

This is a request that sets the chunking parameters explicitly:

from cosmos import CosmosClient

cosmos_client = CosmosClient(api_key="your-cosmos-api-key")
response = cosmos_client.files_parse(
    filepath="path/to/file.pdf",
    extract_type="chunks",
    read_images=False,
    k_min=500,
    k_max=1000,
    overlap=10,
)
print(response)
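Since `k_min`, `k_max` and `overlap` only apply to the `chunks` extraction type, it can help to assemble and validate the arguments in one place before calling the client. This is a sketch under assumptions: the helper `build_parse_kwargs` is hypothetical, the validation rules are reasonable guesses rather than documented API behavior, and passing `filter_pages` as a keyword argument to `files_parse` is assumed from the parameter table.

```python
VALID_EXTRACT_TYPES = {"chunks", "pages", "file"}


def build_parse_kwargs(filepath, extract_type, read_images=False,
                       k_min=None, k_max=None, overlap=None,
                       filter_pages=None) -> dict:
    """Assemble keyword arguments for a file parsing request."""
    if extract_type not in VALID_EXTRACT_TYPES:
        raise ValueError(f"extract_type must be one of {sorted(VALID_EXTRACT_TYPES)}")

    kwargs = {
        "filepath": filepath,
        "extract_type": extract_type,
        "read_images": read_images,
    }

    # Size parameters are only meaningful when chunking.
    if extract_type == "chunks":
        if k_min is not None and k_max is not None and k_min > k_max:
            raise ValueError("k_min must not be greater than k_max")
        for name, value in (("k_min", k_min), ("k_max", k_max), ("overlap", overlap)):
            if value is not None:
                kwargs[name] = value

    if filter_pages is not None:
        kwargs["filter_pages"] = list(filter_pages)

    return kwargs


# Usage (requires the Cosmos Python client and a valid API key):
# response = cosmos_client.files_parse(
#     **build_parse_kwargs("path/to/file.pdf", "chunks", k_min=500, k_max=1000, overlap=10)
# )
```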

4. Handle Errors

If there is an issue with your request, you will receive an error response. For example, if you don't provide a file, you will get a 400 Bad Request error:

{
  "error_code": "400",
  "error_message": "Invalid request parameters",
  "details": "No file provided for the chunker"
}
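An error payload like the one above can be turned into a Python exception so that failures are not silently treated as parsed content. This guide does not state whether the Python client returns this payload as a dictionary or raises its own exception, so the sketch below assumes a dictionary. The names `CosmosRequestError` and `raise_for_error` are hypothetical.

```python
class CosmosRequestError(Exception):
    """Raised when a response payload carries an error_code field."""

    def __init__(self, code: str, message: str, details: str = ""):
        self.code = code
        self.message = message
        self.details = details
        suffix = f": {details}" if details else ""
        super().__init__(f"[{code}] {message}{suffix}")


def raise_for_error(response: dict) -> dict:
    """Return the response unchanged, or raise if it looks like an error."""
    if isinstance(response, dict) and "error_code" in response:
        raise CosmosRequestError(
            str(response["error_code"]),
            response.get("error_message", "Unknown error"),
            response.get("details", ""),
        )
    return response
```

Wrapping each call as `raise_for_error(response)` keeps the success path free of repeated checks for an `error_code` key.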

If you encounter any errors:

Ensure that you have provided all necessary parameters.
Verify that your API key is correct. You can find it in your API keys dashboard: check that it is marked Active, then copy it again. To rule out a key problem, send a request to the health service:

from cosmos import CosmosClient

cosmos_client = CosmosClient(api_key="your-cosmos-api-key")
response = cosmos_client.status_health()
print(response)

Make sure the file path is correct and the file exists.
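The file path check can be done locally before sending the request. This is a minimal sketch using only the standard library; the helper name `check_input_file` is hypothetical, and treating an empty file as a problem is an assumption about what the parser accepts.

```python
from pathlib import Path


def check_input_file(filepath: str) -> list:
    """Return a list of human-readable problems found with the input path.

    An empty list means no problem was detected locally.
    """
    problems = []
    path = Path(filepath).expanduser()
    if not path.exists():
        problems.append(f"File not found: {path}")
    elif not path.is_file():
        problems.append(f"Not a regular file: {path}")
    elif path.stat().st_size == 0:
        # Assumption: an empty file is unlikely to parse into useful content.
        problems.append(f"File is empty: {path}")
    return problems
```

If the returned list is not empty, the request can be skipped and the problems reported to the user instead.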

Conclusion

This step-by-step guide has walked you through the process of parsing and chunking files using CosmosAPI. For more detailed information and advanced usage, refer to the API Reference.

Happy parsing!