Files Parser

Parsing and Chunking files using Cosmos.

About Files Parser

Welcome to the Cosmos Universal File Parser and Chunker.

This guide walks you through parsing files into smaller blocks of text using the CosmosPlatform API. This is useful for processing large documents, extracting specific sections, or managing content in a more structured way. For example, you might want to extract specific pages of a document, or split the text into blocks of a standardized size as a first step in a Retrieval-Augmented Generation (RAG) architecture.

Storage

⚠️ Warning: The parsed version of your file is not stored permanently and is no longer accessible once the session is closed. For persistent storage of processed files, consider using the Files Index functions.

Supported File Formats

CosmosPlatform supports a wide range of document formats for chunking, including:

  • Documents: .pdf, .docx, .doc, .odt, .txt, .md
  • Spreadsheets: .xlsx, .xls, .ods, .csv
  • Presentations: .pptx, .ppt, .odp
  • Notebooks: .ipynb
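
Before sending a file, it can help to check its extension locally. The following is a minimal sketch that only mirrors the list above; the helper name format_category is hypothetical and is not part of the Cosmos client. Since the list may not be exhaustive, treat a None result as a hint rather than a definitive rejection.

```python
from pathlib import Path
from typing import Optional

# Extensions copied from the supported formats list above, grouped by category.
SUPPORTED_FORMATS = {
    "documents": {".pdf", ".docx", ".doc", ".odt", ".txt", ".md"},
    "spreadsheets": {".xlsx", ".xls", ".ods", ".csv"},
    "presentations": {".pptx", ".ppt", ".odp"},
    "notebooks": {".ipynb"},
}


def format_category(filepath: str) -> Optional[str]:
    """Return the category of a listed file format, or None if it is not listed."""
    suffix = Path(filepath).suffix.lower()  # case-insensitive match on the extension
    for category, extensions in SUPPORTED_FORMATS.items():
        if suffix in extensions:
            return category
    return None
```

For example, format_category("reports/q3.PDF") falls under "documents", while a path such as "archive.zip" is not in the list and yields None.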

Files Parsing

The file parsing endpoint (available at https://platform.cosmos-suite.ai/api/v1/parse) allows you to break down a file into smaller, manageable pieces.

You can parse files into:

  • Chunks: Sections within size limits.
  • Pages: Selected pages of the document.
  • File: The whole file content.

Before starting, make sure you've completed the Quickstart Guide to set up your account and obtain an API key.


Step-by-step Tutorial

1. Prerequisites

Before you begin, ensure you have completed the Quickstart Guide and obtained an API key.

2. Setup Cosmos Python Client

The Cosmos Python client offers a convenient way to make API requests.

2.1. Install Cosmos Python Client:

To install the Python client, add it to your project with Poetry:

poetry add delos-cosmos

Or install it in your virtual environment with:

pip install delos-cosmos

2.2. Authenticate Requests:

  • Initialize the client with your API key:
from cosmos import CosmosClient

cosmos_client = CosmosClient(apikey="your-cosmos-api-key")
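
To avoid hardcoding the key in source files, you may prefer to read it from an environment variable. This is a sketch under assumptions: the variable name COSMOS_API_KEY and the helper load_api_key are hypothetical, not something the Cosmos client defines.

```python
import os


def load_api_key(env_var: str = "COSMOS_API_KEY") -> str:
    """Read the API key from an environment variable (the name is an assumption)."""
    api_key = os.environ.get(env_var, "").strip()
    if not api_key:
        raise RuntimeError(f"Set the {env_var} environment variable to your API key.")
    return api_key


# Usage, assuming the delos-cosmos package is installed:
# from cosmos import CosmosClient
# cosmos_client = CosmosClient(apikey=load_api_key())
```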

2.3. Call API:

  • You can now call any Cosmos endpoint. For example, try the /health endpoint to check that your API key is valid and that the client services are available:
response = cosmos_client.status_health_request()
print(response)

3. Parse a file

Here is an example of a file parsing request using the Python client:

from cosmos import CosmosClient

cosmos_client = CosmosClient(apikey="your-cosmos-api-key")
response = cosmos_client.files_parse_request(filepath="path/to/file.pdf", extract_type="pages", read_images=True)
print(response)

There are three types of file text parsing available:

  • chunks: Blocks of text within size limits.
  • pages: Selected pages of the document.
  • file: The whole file content. It is the default behavior.

The parameters for the file parsing request are:

Parameter | Description | Example
--- | --- | ---
filepath | The path to the file to be parsed | /path/to/file.pdf
extract_type | The type of extraction to perform: chunks (default), pages, or file | chunks
read_images (optional) | Whether to scan images or not (default). | False
k_min (optional) | Minimum amount of tokens in a chunk (only used in chunks extraction type) | 500
k_max (optional) | Maximum amount of tokens in a chunk (only used in chunks extraction type) | 1000
overlap (optional) | The overlap between chunks (only used in chunks extraction type) | 10
filter_pages (optional) | Pages to be selected | [1, 2, 3]

The boolean option read_images applies image scanning methods to read the images in the file.
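
Because several parameters only apply to one extraction type, it can be convenient to assemble the keyword arguments in one place. The sketch below uses a hypothetical helper, build_parse_kwargs, which is not part of the Cosmos client. It assumes that filter_pages is only meaningful for the pages extraction type, which the parameter list above does not state explicitly.

```python
from typing import Any, Dict, List, Optional

EXTRACT_TYPES = {"chunks", "pages", "file"}


def build_parse_kwargs(
    filepath: str,
    extract_type: str,
    read_images: bool = False,
    k_min: Optional[int] = None,
    k_max: Optional[int] = None,
    overlap: Optional[int] = None,
    filter_pages: Optional[List[int]] = None,
) -> Dict[str, Any]:
    """Keep only the parameters that apply to the chosen extraction type."""
    if extract_type not in EXTRACT_TYPES:
        raise ValueError(f"extract_type must be one of {sorted(EXTRACT_TYPES)}")
    kwargs: Dict[str, Any] = {
        "filepath": filepath,
        "extract_type": extract_type,
        "read_images": read_images,
    }
    if extract_type == "chunks":
        # k_min, k_max and overlap are documented as chunks-only.
        chunk_options = {"k_min": k_min, "k_max": k_max, "overlap": overlap}
        kwargs.update({k: v for k, v in chunk_options.items() if v is not None})
    if extract_type == "pages" and filter_pages is not None:
        # Assumption: page selection is only used by the pages extraction type.
        kwargs["filter_pages"] = list(filter_pages)
    return kwargs


# Usage, assuming an initialized client:
# response = cosmos_client.files_parse_request(**build_parse_kwargs("path/to/file.pdf", "pages", filter_pages=[1, 2, 3]))
```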

Another example providing the optional parameters:

from cosmos import CosmosClient

cosmos_client = CosmosClient(apikey="your-cosmos-api-key")
response = cosmos_client.files_parse_request(
    filepath="path/to/file.pdf",
    extract_type="chunks",
    read_images=False,
    k_min=500,
    k_max=1000,
    overlap=10,
)
print(response)
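
To build intuition for how k_max and overlap interact, here is a small local illustration of overlapping windows. It is not the algorithm the Cosmos service uses: it treats whitespace-separated words as tokens, assumes overlap is a token count, and does not model k_min, since this guide does not describe how the service enforces it.

```python
from typing import List


def chunk_tokens(tokens: List[str], k_max: int, overlap: int) -> List[List[str]]:
    """Split tokens into windows of at most k_max tokens.

    Each window after the first starts with the last `overlap` tokens
    of the previous window.
    """
    if not 0 <= overlap < k_max:
        raise ValueError("overlap must satisfy 0 <= overlap < k_max")
    step = k_max - overlap  # how far each window advances
    chunks: List[List[str]] = []
    start = 0
    while start < len(tokens):
        chunks.append(tokens[start:start + k_max])
        if start + k_max >= len(tokens):
            break  # this window already reached the end of the input
        start += step
    return chunks


words = "one two three four five six seven eight nine ten".split()
for chunk in chunk_tokens(words, k_max=4, overlap=1):
    print(chunk)
```

With ten words, k_max=4 and overlap=1, each window advances by three words, so consecutive windows share exactly one word.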

If your request is successful, you will receive a response with the chunked content. Here’s an example response for parsing a document into chunks:

{
  "data": {
    "chunks": [
      { "content": "Example content", "size": 17, "page": 1, "position": {} }
      // More chunks...
    ]
  },
  "metadata": {
    "total_tokens": 573,
    "request_cost": 0.003,
    "request_id": "3e6c184e-f6a3-4adc-b4a9-ab307d4decc5",
    "timestamp": "2024-10-09T09:58:59.522391+00:00"
  }
}
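
Once you have a response, you will typically iterate over the chunks and read the request metadata. The sketch below works on a dictionary shaped like the example response above; it assumes the Python client returns the parsed JSON as a dict, which this guide does not confirm. The helper summarize_chunks is hypothetical.

```python
from datetime import datetime
from typing import Any, Dict, List, Tuple

# Sample data copied from the example response above.
response: Dict[str, Any] = {
    "data": {
        "chunks": [
            {"content": "Example content", "size": 17, "page": 1, "position": {}},
        ]
    },
    "metadata": {
        "total_tokens": 573,
        "request_cost": 0.003,
        "request_id": "3e6c184e-f6a3-4adc-b4a9-ab307d4decc5",
        "timestamp": "2024-10-09T09:58:59.522391+00:00",
    },
}


def summarize_chunks(resp: Dict[str, Any], preview: int = 40) -> List[Tuple[Any, Any, str]]:
    """Return (page, size, content preview) for every chunk in the response."""
    chunks = resp.get("data", {}).get("chunks", [])
    return [(c.get("page"), c.get("size"), c.get("content", "")[:preview]) for c in chunks]


# The timestamp is ISO 8601 with a UTC offset, so it parses into an aware datetime.
timestamp = datetime.fromisoformat(response["metadata"]["timestamp"])

for page, size, text in summarize_chunks(response):
    print(f"page {page} ({size}): {text}")
print("tokens used:", response["metadata"]["total_tokens"])
```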

4. Handle Errors

If there is an issue with your request, you will receive an error response. For example, if you don't provide a file, you will get a 400 Bad Request error:

{
  "error_code": "400",
  "error_message": "Invalid request parameters",
  "details": "No file provided for the chunker"
}
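
If you want failures to surface as exceptions in your own code, you can wrap the response check in a small helper. This is a sketch under assumptions: it supposes the client hands back the error body as a dict like the one above rather than raising by itself, and the names CosmosRequestError and raise_for_error are hypothetical.

```python
from typing import Any, Dict


class CosmosRequestError(Exception):
    """Hypothetical exception type for error responses (not part of the client)."""


def raise_for_error(response: Dict[str, Any]) -> Dict[str, Any]:
    """Raise if the response looks like an error body; otherwise return it unchanged."""
    if "error_code" in response:
        code = response["error_code"]
        message = response.get("error_message", "")
        details = response.get("details", "")
        raise CosmosRequestError(f"{code}: {message} ({details})")
    return response


error_body = {
    "error_code": "400",
    "error_message": "Invalid request parameters",
    "details": "No file provided for the chunker",
}

try:
    raise_for_error(error_body)
except CosmosRequestError as exc:
    print("Request failed:", exc)
```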

If you encounter any errors:

  1. Ensure that you have provided all necessary parameters.
  2. Verify that your API key is correct. You can find it in your API keys dashboard. Check that it is marked as Active and copy it again. To rule out a key issue, send a request to the health service:
from cosmos import CosmosClient

cosmos_client = CosmosClient(apikey="your-cosmos-api-key")
response = cosmos_client.status_health_request()
print(response)
  3. Make sure the file path is correct and the file exists.
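
Some of the checks above can be run locally before sending the request. The following is a minimal sketch; the helper preflight_problems is hypothetical, and the k_min/k_max consistency check is an added assumption rather than a documented rule of the API.

```python
from pathlib import Path
from typing import List, Optional


def preflight_problems(
    api_key: str,
    filepath: str,
    k_min: Optional[int] = None,
    k_max: Optional[int] = None,
) -> List[str]:
    """Return a list of human-readable problems; an empty list means no issue was found."""
    problems: List[str] = []
    if not api_key or not api_key.strip():
        problems.append("API key is empty.")
    path = Path(filepath)
    if not path.is_file():
        problems.append(f"File not found: {path}")
    # Assumption: a minimum chunk size larger than the maximum cannot be satisfied.
    if k_min is not None and k_max is not None and k_min > k_max:
        problems.append(f"k_min ({k_min}) is greater than k_max ({k_max}).")
    return problems


for problem in preflight_problems("your-cosmos-api-key", "path/to/file.pdf", k_min=500, k_max=1000):
    print("Check:", problem)
```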

Conclusion

This step-by-step guide has walked you through the process of parsing and chunking files using CosmosPlatform's API. For more detailed information and advanced usage, refer to the API Reference.

Happy parsing!