Files Index

Query and analyze file indexes using Cosmos.

About Index

An index is a set of documents that are processed and analyzed together, so that you can ask questions and retrieve information contained in its files.

Each index is identified by a unique index_uuid and has a name of type string. The API Reference has the full details; for a quick overview, these are the index properties:

Index attribute | Description | Example
index_uuid | Unique identifier for the index | your-index-uuid
name | Name of the index | Financial Reports 2023
status | Index is active or in countdown (scheduled for deletion) | active
vectorized | File contents are embedded and ready for querying | false
created_at | Creation timestamp | 2024-11-15T15:03:00.219676+00:00
updated_at | Update timestamp | 2024-11-15T15:03:00.219681+00:00
expires_at | Expiration date of the index, if scheduled for deletion | None
files | Files linked to the index, and their storage | 3 files: 147086 bytes
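As an illustration, the properties above can be mirrored in a small local data structure. This is only a sketch: the IndexInfo class and the helper names are hypothetical and not part of the Cosmos client, and it assumes timestamps arrive as ISO 8601 strings, with a missing expiry given as None or the string "None".

```python
from dataclasses import dataclass
from datetime import datetime
from typing import Optional


@dataclass
class IndexInfo:
    """Local, illustrative mirror of the index properties (not a client class)."""
    index_uuid: str
    name: str
    status: str                     # "active" or "countdown"
    vectorized: bool
    created_at: datetime
    updated_at: datetime
    expires_at: Optional[datetime]  # None unless deletion is scheduled


def parse_timestamp(value):
    """Return a datetime, or None when the value is missing (None or "None")."""
    if value is None or value == "None":
        return None
    return datetime.fromisoformat(value)


def index_info_from_dict(data: dict) -> IndexInfo:
    """Build an IndexInfo from a dictionary shaped like the properties table."""
    return IndexInfo(
        index_uuid=data["index_uuid"],
        name=data["name"],
        status=data["status"],
        vectorized=data["vectorized"],
        created_at=parse_timestamp(data["created_at"]),
        updated_at=parse_timestamp(data["updated_at"]),
        expires_at=parse_timestamp(data.get("expires_at")),
    )


example = index_info_from_dict({
    "index_uuid": "your-index-uuid",
    "name": "Financial Reports 2023",
    "status": "active",
    "vectorized": False,
    "created_at": "2024-11-15T15:03:00.219676+00:00",
    "updated_at": "2024-11-15T15:03:00.219681+00:00",
    "expires_at": "None",
})
print(example.name, example.status, example.expires_at)
```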

Storage Limits

⚠️ Warning: A single storage limit applies to the total size of files across all indexes. Once it is reached, no new files can be uploaded to existing indexes and no new indexes can be created. Existing indexes can still be queried or deleted, and files can be removed from them to free space. You can also manage storage through the Dashboard.
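Before uploading, it can help to check locally whether a batch of files would fit in the remaining quota. The sketch below is an assumption-laden illustration: the helper names are hypothetical, the limit is passed in by the caller (100 is only an example value), and whether the service counts 1 MB as 1024 * 1024 or 10**6 bytes is not stated in this guide.

```python
# Assumption: 1 MB = 1024 * 1024 bytes; the service may use 10**6 instead.
BYTES_PER_MB = 1024 * 1024


def remaining_bytes(used_bytes: int, limit_mb: float) -> int:
    """Bytes still available under the limit (never negative)."""
    return max(0, int(limit_mb * BYTES_PER_MB) - used_bytes)


def fits_in_quota(file_sizes, used_bytes: int, limit_mb: float) -> bool:
    """True if all files together fit in the remaining storage."""
    return sum(file_sizes) <= remaining_bytes(used_bytes, limit_mb)


# Example: one 147086-byte file, with 289172 bytes already used.
print(fits_in_quota([147086], used_bytes=289172, limit_mb=100))  # True
```

In practice, file sizes could be read with os.path.getsize before calling the upload request.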

Supported File Formats

  • Documents: .pdf, .docx, .doc, .odt, .txt, .md
  • Spreadsheets: .xlsx, .xls, .ods, .csv
  • Presentations: .pptx, .ppt, .odp
  • Notebooks: .ipynb
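A local pre-check against this list can catch unsupported files before they are sent. The following is a minimal sketch; split_by_support is a hypothetical helper, and matching extensions case-insensitively is an assumption about how the service treats them.

```python
from pathlib import Path

SUPPORTED_EXTENSIONS = {
    ".pdf", ".docx", ".doc", ".odt", ".txt", ".md",  # documents
    ".xlsx", ".xls", ".ods", ".csv",                 # spreadsheets
    ".pptx", ".ppt", ".odp",                         # presentations
    ".ipynb",                                        # notebooks
}


def split_by_support(filepaths):
    """Split paths into (supported, rejected) based on their final extension."""
    supported, rejected = [], []
    for path in filepaths:
        if Path(path).suffix.lower() in SUPPORTED_EXTENSIONS:
            supported.append(path)
        else:
            rejected.append(path)
    return supported, rejected


ok, skipped = split_by_support(["report.PDF", "notes.md", "photo.png", "archive.tar.gz"])
print(ok)       # ['report.PDF', 'notes.md']
print(skipped)  # ['photo.png', 'archive.tar.gz']
```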

Index for File Analysis

Let's create an index to work with several files. We first send the files to create the index, then process them, after which the index is ready for queries.

These are the index requests available in the Python cosmos_client:

Group | Client method | Used for
Index Management | .files_index_create_request | Create a new file index
Index Management | .files_index_delete_request | Delete a specific file index
Index Management | .files_index_restore_request | Restore a specific deleted file index
Index Contents | .files_index_add_files_request | Add files to an existing file index
Index Contents | .files_index_delete_files_request | Delete files from an existing file index
Index Contents | .files_index_rename_request | Rename an existing file index
Index Details | .files_index_list_request | List all file indexes
Index Details | .files_index_details_request | See the details of an index
Index Querying | .files_index_embed_request | Embed the files in the specified index
Index Querying | .files_index_ask_request | Query the files in a specific index

Step-by-step Tutorial

1. Prerequisites

Before you begin, ensure you have:

2. Setup Cosmos Python Client

The Cosmos Python client lets you perform API requests conveniently.

2.1. Install Cosmos Python Client:

To install the Python client, add it to your project with:

poetry add delos-cosmos

Or install it in your virtual environment with:

pip install delos-cosmos

2.2. Authenticate Requests:

  • Initialize the client with your API key:
from cosmos import CosmosClient

cosmos_client = CosmosClient(apikey="your-cosmos-api-key")
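To avoid hard-coding the key in source files, one option is to read it from an environment variable. This is a sketch only: load_api_key and the variable name COSMOS_API_KEY are hypothetical choices for illustration, not something the client requires.

```python
import os


def load_api_key(env_var: str = "COSMOS_API_KEY") -> str:
    """Read the API key from the environment (variable name is an assumption)."""
    key = os.environ.get(env_var, "").strip()
    if not key:
        raise RuntimeError(f"Set the {env_var} environment variable to your Cosmos API key")
    return key


# Usage (commented out so the snippet runs without the client installed):
# from cosmos import CosmosClient
# cosmos_client = CosmosClient(apikey=load_api_key())
```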

2.3. Call API:

  • You can now invoke any Cosmos endpoint. For example, let's try the /health endpoint to check the validity of your API key and the availability of the client services:
response = cosmos_client.status_health_request()
print(response)

3. Index Operations

An index names a group of documents that are analyzed and processed together. They may concern the same topic or share a common structure. When you ask the index a question, the model processes these documents together and retrieves the most relevant information to answer it.

3.1. NEW INDEX + DETAILS

1. Create new Index:

To create a new index, which will be shared with your team:

response = cosmos_client.files_index_create_request(
    filepaths=['/path/to/document1.pdf', '/path/to/document2.docx'],
    name="my_new_index",
    read_images=False)
print(response)

The read_images parameter enables or disables scanning of images and graphic elements while the file contents are processed. It is disabled by default (read_images=False). Enabling it consumes more resources because it requires more complex processing.

Expected response:

{
  "status": "success",
  "message": "Index created successfully",
  "data": {
    "index_uuid": "your-index-uuid"
    // Additional index metadata
  }
}
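The index_uuid from this response is needed for every later request, so it is worth extracting it once. The sketch below assumes the client returns the response as a Python dictionary shaped like the JSON above; extract_index_uuid is a hypothetical helper.

```python
def extract_index_uuid(response: dict) -> str:
    """Return data.index_uuid, or raise if the response does not report success."""
    if response.get("status") != "success":
        raise RuntimeError(f"Index creation failed: {response.get('message')}")
    return response["data"]["index_uuid"]


# Sample dictionary standing in for a real response.
sample = {
    "status": "success",
    "message": "Index created successfully",
    "data": {"index_uuid": "your-index-uuid"},
}
index_uuid = extract_index_uuid(sample)
print(index_uuid)  # your-index-uuid
```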

2. See Index details:

You can retrieve the details of your created index by providing the index_uuid:

response = cosmos_client.files_index_details_request(index_uuid=index_uuid)
print(response)

The response will be similar to the following:

{
  "status": "success",
  "message": "Index details retrieved",
  "data": {
    "index_uuid": "your-index-uuid",
    "name": "my_new_index",
    "vectorized": false,
    "status": "active",
    "expires_at": "None",
    "created_at": "2024-11-15T15:03:00.219676+00:00",
    "updated_at": "2024-11-15T15:03:00.219681+00:00",
    "storage": {
      "size_bytes": 147086,
      "size_mb": 0.01,
      "num_files": 2
    },
    // Additional index metadata
  },
  "error": null,
  "timestamp": "2024-11-15T16:11:50.398764Z"
}

The details are handy for checking which files are in each index and how much storage they use. Storage is limited to 100 MB per organization (across all indexes).

Your organization's storage can also be managed through the Dashboard.

Using the index_uuid, you can add or remove files instantly from the index.

An index can also be deleted completely, but the deletion takes effect 2 hours after the request, which leaves time to reverse the operation in case of error.

3.2. MODIFY INDEX

You may want to rename an index or modify the set of files that an index contains.

1. Rename Index

You can rename an index by using the /rename_index endpoint:

response = cosmos_client.files_index_rename_request(
    new_index_uuid,
    "New name",
)
print(response)

Expected response:

{
  "status_code": 200,
  "status": "success",
  "message": "Index name 'Old name' changed to 'New name'.",
  "data": {
    "index_uuid": "your-index-uuid",
    "old_name": "Old name",
    "new_name": "New name",
    "updated_at": "2024-11-15T15:03:00.219676+00:00"
  }
}

2. Add new Files to Index

To add new files, use the /add_files_to_index endpoint. Enable the read_images parameter if graphic content is relevant to your processing (it is disabled by default):


response = cosmos_client.files_index_add_files_request(
    index_uuid="your-index-uuid",
    filepaths=["path/to/document3.pdf",
               "path/to/document4.txt"],
    read_images=True,
)
print(response)

Expected response:

{
  "status": "success",
  "message": "Index created successfully",
  "data": {
    "index_uuid": "your-index-uuid"
    // Additional index metadata
  }
}

3. Remove Files from Index

To remove one or more files from the index, provide their file hashes (these can be retrieved from the index details):

files_hashes = ["32348860fbb4c700eed9067261fc340"]

response = cosmos_client.files_index_delete_files_request(
    index_uuid="your-index-uuid",
    files_hashes=files_hashes
)
print(response)

Expected response:

{
  "status": "success",
  "message": "File(s) deleted from index successfully",
  "data": {
    "index_uuid": "your-index-uuid",
    "remaining_files": [
      "26a79e1e7233ef12c763c4a0e6b3221ddba54357d4",
      "f66d2345d7c64ea7e4428f87d537927b567d8eba00"
    ]
  }
}
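The remaining_files list makes it possible to confirm that the requested hashes are really gone. A minimal sketch, assuming the response is a dictionary shaped like the JSON above; still_present is a hypothetical helper.

```python
def still_present(response: dict, requested_hashes):
    """Return the requested hashes that still appear in data.remaining_files."""
    remaining = set(response["data"]["remaining_files"])
    return [h for h in requested_hashes if h in remaining]


# Sample dictionary standing in for a real response.
sample = {
    "status": "success",
    "data": {
        "index_uuid": "your-index-uuid",
        "remaining_files": [
            "26a79e1e7233ef12c763c4a0e6b3221ddba54357d4",
            "f66d2345d7c64ea7e4428f87d537927b567d8eba00",
        ],
    },
}
leftover = still_present(sample, ["32348860fbb4c700eed9067261fc340"])
print(leftover)  # []
```

An empty list means every requested file was removed.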

You can request the index details again in order to make sure the files were correctly added or removed (see section 3.1 for Index details).

You can also see how much of your storage quota the files in the index use; the quota is limited to 100 MB per organization (across all indexes).

3.3. EMBEDDING & QUERYING

1. Embed Index

In order to perform vectorized searches, you need to embed the index. This operation will calculate the embeddings of files belonging to the index:

response = cosmos_client.files_index_embed_request(index_uuid="your-index-uuid")
print(response)

Expected response:

{
  "status": "success",
  "message": "Index `your-index-uuid` successfully vectorized.",
  "data": {
    "index_uuid": "your-index-uuid"
  }
}
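Because querying depends on the index being vectorized, a small guard can check the vectorized flag and embed only when needed. This sketch uses the two client methods named in this guide but assumes they return dictionaries shaped like the responses shown here; ensure_vectorized and FakeClient are hypothetical, the latter a stand-in so the example runs without network access.

```python
def ensure_vectorized(client, index_uuid: str) -> bool:
    """Embed the index only if its details report vectorized == False.

    Returns True if an embed request was sent, False if none was needed.
    """
    details = client.files_index_details_request(index_uuid=index_uuid)
    if details["data"]["vectorized"]:
        return False
    client.files_index_embed_request(index_uuid=index_uuid)
    return True


class FakeClient:
    """Offline stand-in that mimics the two methods used above."""

    def __init__(self):
        self.vectorized = False
        self.embed_calls = 0

    def files_index_details_request(self, index_uuid):
        return {"status": "success",
                "data": {"index_uuid": index_uuid, "vectorized": self.vectorized}}

    def files_index_embed_request(self, index_uuid):
        self.embed_calls += 1
        self.vectorized = True
        return {"status": "success", "data": {"index_uuid": index_uuid}}


fake = FakeClient()
print(ensure_vectorized(fake, "your-index-uuid"))  # True: embed request sent
print(ensure_vectorized(fake, "your-index-uuid"))  # False: already vectorized
```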

2. Query Index

Now that the index is vectorized, you can query it. The answer is based on the embeddings of the files belonging to the index:

response = cosmos_client.files_index_ask_request(
    index_uuid=new_index_uuid,
    question="Where is the bridge mentioned in these articles located?",
    output_language="en",
)
print(response)

You can specify one or more file hashes to limit which files the index analyzes when answering your question:

response = cosmos_client.files_index_ask_request(
    index_uuid=new_index_uuid,
    question="Where is the bridge mentioned in these articles located?",
    output_language="en",
    active_files_hashes=[
        "26a79e1e7233ef12c763c4a0e6b3221ddba54357d4",
        "f66d2345d7c64ea7e4428f87d537927b567d8eba00",
    ],
)
print(response)

The response contains an answer to the question, along with the sources (index file and page) that support it.

Expected response:

{
  "status": "success",
  "message": "Query processed successfully",
  "data": {
    "answer": "The article discusses the bridge of Brooklyn 'FILE:1 PAGE:2'.",
    "sources": {
      "1": "efcb3858b45a3edb306c9d3457820e41ec78e3492f833ad189254c02886df260"
    }
  }
}
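The citation markers in the answer can be mapped back to file hashes through the sources dictionary. The sketch below infers the marker format ('FILE:<n> PAGE:<n>') from the single example above, so treat the pattern as an assumption; resolve_citations is a hypothetical helper.

```python
import re

# Assumed marker format, inferred from the example answer: FILE:<n> PAGE:<n>
CITATION = re.compile(r"FILE:(\d+)\s+PAGE:(\d+)")


def resolve_citations(data: dict):
    """Return one {"file_hash", "page"} entry per citation marker in the answer."""
    sources = data.get("sources", {})
    return [
        {"file_hash": sources.get(file_no), "page": int(page)}
        for file_no, page in CITATION.findall(data["answer"])
    ]


# Sample dictionary standing in for the "data" field of a real response.
sample_data = {
    "answer": "The article discusses the bridge of Brooklyn 'FILE:1 PAGE:2'.",
    "sources": {
        "1": "efcb3858b45a3edb306c9d3457820e41ec78e3492f833ad189254c02886df260"
    },
}
citations = resolve_citations(sample_data)
print(citations[0]["page"])  # 2
```

A file number with no entry in sources yields a file_hash of None rather than an error.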

3.4. INDEX MANAGEMENT

1. List All Indexes

You can list all the indexes in your team that are in active or countdown (scheduled for deletion) status:

response = cosmos_client.files_index_list_request()
print(response)

Expected response:

{
  "status": "success",
  "message": "Retrieved 2 index.",
  "data": {
    "index": [
      {
        "index_uuid": "your-index-uuid",
        "name": "my_new_index",
        "status": "active",
        "vectorized": true,
        "created_at": "2024-11-15T15:03:00.219676+00:00",
        "updated_at": "2024-11-15T15:03:00.219681+00:00",
        "expires_at": "2024-11-16T15:03:00.219682+00:00",
        "storage": {
          "size_bytes": 147086,
          "size_mb": 0.01,
          "num_files": 2
        }
      },
      {
        "index_uuid": "another-index-uuid",
        "name": "2024 Sales results",
        "status": "active",
        "vectorized": false,
        "created_at": "2024-11-15T15:03:00.219676+00:00",
        "updated_at": "2024-11-15T15:03:00.219681+00:00",
        "expires_at": "2024-11-16T15:03:00.219682+00:00",
        "storage": {
          "size_bytes": 147086,
          "size_mb": 0.01,
          "num_files": 2
        }
      }
    ],
    "total_storage": {
      "bytes": 289172,
      "mb": 0.028,
      "limit_mb": 100,
      "usage_percentage": 2.8
    }
  },
  "error": null,
  "timestamp": "2024-11-15T16:11:50.398764Z"
}
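A listing like this can be condensed into a quick overview, for example to spot indexes that still need embedding. A minimal sketch, assuming the response is a dictionary shaped like the JSON above; summarize_listing is a hypothetical helper.

```python
def summarize_listing(response: dict):
    """Group index names by status and collect those not yet vectorized."""
    by_status = {}
    not_vectorized = []
    for index in response["data"]["index"]:
        by_status.setdefault(index["status"], []).append(index["name"])
        if not index["vectorized"]:
            not_vectorized.append(index["name"])
    return by_status, not_vectorized


# Trimmed sample standing in for a real response.
sample = {
    "status": "success",
    "data": {
        "index": [
            {"index_uuid": "your-index-uuid", "name": "my_new_index",
             "status": "active", "vectorized": True},
            {"index_uuid": "another-index-uuid", "name": "2024 Sales results",
             "status": "active", "vectorized": False},
        ]
    },
}
by_status, pending = summarize_listing(sample)
print(by_status)  # {'active': ['my_new_index', '2024 Sales results']}
print(pending)    # ['2024 Sales results']
```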

2. Delete an Index (⚠️ Warning: delayed operation)

You can delete an index when you no longer need it. Unlike the other endpoints, which take effect immediately, this one applies a safety margin: the index is deleted 2 hours after the request, which leaves time to reverse the operation in case of error. Once the expiry date is set, an index marked for deletion has the status "countdown" instead of "active".

response = cosmos_client.files_index_delete_request(new_index_uuid)
print(response)

Expected response:

{
  "status": "success",
  "message": "Index deleted successfully",
  "data": {
    "index_uuid": "your-index-uuid"
  }
}

3. Restore an Index Scheduled for Deletion

After an index is marked for deletion, but before its expiry date, you can restore it, reverting the operation in case of error. Restoring sets the status back to "active" and cancels the scheduled deletion. This is only possible within the 2-hour window (while the index status is "countdown").

response = cosmos_client.files_index_restore_request(new_index_uuid)
print(response)

Expected response:

{
  "status": "success",
  "message": "Index restored successfully",
  "data": {
    "index_uuid": "your-index-uuid"
  }
}
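Whether an index can still be restored can be estimated locally from its status and expires_at fields. This is a sketch under assumptions: can_restore is a hypothetical helper, it expects expires_at as an ISO 8601 string with a UTC offset (as in the examples in this guide), and the server remains the authority on whether a restore succeeds.

```python
from datetime import datetime, timezone


def can_restore(index: dict, now=None) -> bool:
    """True if the index is in countdown and its expiry date has not passed."""
    if index.get("status") != "countdown":
        return False
    expires_at = index.get("expires_at")
    if not expires_at or expires_at == "None":
        return False
    now = now or datetime.now(timezone.utc)
    return now < datetime.fromisoformat(expires_at)


# Hypothetical index entry scheduled for deletion at 17:03 UTC.
scheduled = {"status": "countdown", "expires_at": "2024-11-15T17:03:00+00:00"}
before = datetime(2024, 11, 15, 16, 0, tzinfo=timezone.utc)
after = datetime(2024, 11, 15, 18, 0, tzinfo=timezone.utc)
print(can_restore(scheduled, now=before))  # True
print(can_restore(scheduled, now=after))   # False
```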