Files Index
Query and analyze Index of Files using Cosmos.
About Index
An Index is a set of documents that are processed and analyzed together, so you can ask questions and retrieve information contained in its files.
Each Index is referenced by a unique index_uuid and has a name of type string. Full details are in the API Reference; for a quick overview, these are the Index properties:
Index attributes | Description | Example |
---|---|---|
index_uuid | Unique identifier for the index | your-index-uuid |
name | Name of the index | Financial Reports 2023 |
status | Index is active or in countdown (scheduled for deletion) | active |
vectorized | File contents are embedded and ready for querying | false |
created_at | Creation timestamp | 2024-11-15T15:03:00.219676+00:00 |
updated_at | Update timestamp | 2024-11-15T15:03:00.219681+00:00 |
expires_at | Expiration date of the index, if scheduled for deletion | None |
files | Files linked to the index, and their storage. | 3 files: 147086 bytes |
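As a rough illustration of how these attributes could be handled on the client side, the sketch below maps an index record onto a small dataclass. It assumes the record is a plain dict shaped like the sample responses in this guide, and that a missing expiry is reported either as null or as the string "None". The names IndexInfo, parse_timestamp and parse_index are hypothetical helpers, not part of the Cosmos client.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import Optional


@dataclass
class IndexInfo:
    """Hypothetical container for the index attributes listed above."""
    index_uuid: str
    name: str
    status: str
    vectorized: bool
    created_at: datetime
    updated_at: datetime
    expires_at: Optional[datetime]


def parse_timestamp(value) -> Optional[datetime]:
    """Return a datetime, or None when the value is missing or the string "None"."""
    if value is None or value == "None":
        return None
    return datetime.fromisoformat(value)


def parse_index(data: dict) -> IndexInfo:
    """Build an IndexInfo from a dict shaped like the sample responses (assumption)."""
    return IndexInfo(
        index_uuid=data["index_uuid"],
        name=data["name"],
        status=data["status"],
        vectorized=bool(data["vectorized"]),
        created_at=parse_timestamp(data["created_at"]),
        updated_at=parse_timestamp(data["updated_at"]),
        expires_at=parse_timestamp(data.get("expires_at")),
    )
```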
Storage Limits
⚠️ Warning: A total storage limit applies to the files across all indexes. Once it is reached, no new files can be uploaded to existing indexes and no new indexes can be created. Existing indexes can still be queried or deleted, and files can be removed from them to free space. You can also manage storage through the Dashboard.
Supported File Formats
- Documents: .pdf, .docx, .doc, .odt, .txt, .md
- Spreadsheets: .xlsx, .xls, .ods, .csv
- Presentations: .pptx, .ppt, .odp
- Notebooks: .ipynb
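Before uploading, it can help to separate the files that match this list from those that do not. The following is a minimal sketch using only the standard library. The helper name split_by_support is hypothetical, and matching extensions case-insensitively is an assumption about how the platform treats them.

```python
from pathlib import Path

# Extensions copied from the list of supported file formats above.
SUPPORTED_EXTENSIONS = {
    ".pdf", ".docx", ".doc", ".odt", ".txt", ".md",  # documents
    ".xlsx", ".xls", ".ods", ".csv",                 # spreadsheets
    ".pptx", ".ppt", ".odp",                         # presentations
    ".ipynb",                                        # notebooks
}


def split_by_support(filepaths):
    """Return (supported, unsupported) lists, preserving the input order."""
    supported, unsupported = [], []
    for path in filepaths:
        # Assumption: extension matching is case-insensitive.
        if Path(path).suffix.lower() in SUPPORTED_EXTENSIONS:
            supported.append(path)
        else:
            unsupported.append(path)
    return supported, unsupported
```

The supported list can then be passed as the filepaths argument of the index requests described later in this guide.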
Index for File Analysis
Let's create an Index to work with several files. We first send the files to create the Index, then process them, after which the Index is ready for queries.
These are the Index requests available in the Python cosmos_client:
Group | Client method | Used for |
---|---|---|
Index Management | .files_index_create_request | Create a new file index |
Index Management | .files_index_delete_request | Delete a specific file index |
Index Management | .files_index_restore_request | Restore a specific deleted file index |
Index Contents | .files_index_add_files_request | Add files to an existing file index |
Index Contents | .files_index_delete_files_request | Delete files from an existing file index |
Index Contents | .files_index_rename_request | Rename an existing file index |
Index Details | .files_index_list_request | List all file indexes |
Index Details | .files_index_details_request | See the details of an index |
Index Querying | .files_index_embed_request | Embed the files in the specified index |
Index Querying | .files_index_ask_request | Query the files in a specific index |
Step-by-step Tutorial
1. Prerequisites
Before you begin, ensure you have:
- An active CosmosPlatform account
- API key from the API keys dashboard
2. Setup Cosmos Python Client
The Cosmos Python client lets you perform API requests conveniently.
2.1. Install Cosmos Python Client:
To install the Python client, add it to your project with:
poetry add delos-cosmos
Or install it in your virtual environment with:
pip install delos-cosmos
2.2. Authenticate Requests:
- Initialize the client with your API key:
from cosmos import CosmosClient

cosmos_client = CosmosClient(apikey="your-cosmos-api-key")
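Rather than hard-coding the key in source files, one common option is to read it from an environment variable and pass the result as the apikey argument. The sketch below shows that pattern. The variable name COSMOS_API_KEY and the helper load_api_key are assumptions made for illustration, not names defined by the Cosmos client.

```python
import os


def load_api_key(env_var: str = "COSMOS_API_KEY") -> str:
    """Read the API key from the environment (the variable name is hypothetical)."""
    key = os.environ.get(env_var, "").strip()
    if not key:
        raise RuntimeError(
            f"Set the {env_var} environment variable to your Cosmos API key"
        )
    return key
```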
2.3. Call API:
- You can now call any Cosmos endpoint. For example, try the /health endpoint to check the validity of your API key and the availability of the client services:
response = cosmos_client.status_health_request()
print(response)
3. Index Operations
An Index is a named group of documents that are analyzed and processed together. They may concern the same topic or share a common structure. When you ask the Index a question, the model processes these documents together and retrieves the most relevant information to answer it.
3.1. NEW INDEX + DETAILS
1. Create new Index:
To create a new index, which will be shared with your team:
response = cosmos_client.files_index_create_request(
    filepaths=["/path/to/document1.pdf", "/path/to/document2.docx"],
    name="my_new_index",
    read_images=False,
)
print(response)
The read_images parameter enables or disables the scanning of images and graphic elements while the file contents are processed. It is disabled by default (read_images=False). Enabling it increases consumption, since it requires more complex processing.
Expected response:
{
  "status": "success",
  "message": "Index created successfully",
  "data": {
    "index_uuid": "my_new_index"
    // Additional index metadata
  }
}
2. See Index details:
You can retrieve the details of the index you created by providing its index_uuid:
response = cosmos_client.files_index_details_request(index_uuid=index_uuid)
print(response)
The response will be similar to the following:
{
  "status": "success",
  "message": "Index details retrieved",
  "data": {
    "index_uuid": "your-index-uuid",
    "name": "my_new_index",
    "vectorized": false,
    "status": "active",
    "expires_at": "None",
    "created_at": "2024-11-15T15:03:00.219676+00:00",
    "updated_at": "2024-11-15T15:03:00.219681+00:00",
    "storage": {
      "size_bytes": 147086,
      "size_mb": 0.01,
      "num_files": 2
    }
    // Additional index metadata
  },
  "error": null,
  "timestamp": "2024-11-15T16:11:50.398764Z"
}
The details are handy for checking which files are in each index and how much storage they use. Storage is limited to 100 MB per organization (across all indexes).
Your organization's storage is also manageable through the Dashboard.
Using the index_uuid, you can add or remove files from the index instantly.
You can also delete an index completely, but the deletion takes effect 2 hours after the request, leaving time to reverse the operation in case of error.
3.2. MODIFY INDEX
You may want to rename an index or modify the set of files that an index contains.
1. Rename Index
You can rename an index with the /rename_index endpoint:
response = cosmos_client.files_index_rename_request(
    new_index_uuid,  # UUID of the index to rename
    "New name",
)
print(response)
Expected response:
{
  "status_code": 200,
  "status": "success",
  "message": "Index name 'Old name' changed to 'New name'.",
  "data": {
    "index_uuid": "your-index-uuid",
    "old_name": "Old name",
    "new_name": "New name",
    "updated_at": "2024-11-15T15:03:00.219676+00:00"
  }
}
2. Add new Files to Index
To add new files, use the /add_files_to_index endpoint. You can enable the read_images parameter if graphic content is relevant to your processing (it is disabled by default):
response = cosmos_client.files_index_add_files_request(
    index_uuid="your-index-uuid",
    filepaths=["path/to/document3.pdf", "path/to/document4.txt"],
    read_images=True,
)
print(response)
Expected response:
{
  "status": "success",
  "message": "Index created successfully",
  "data": {
    "index_uuid": "your-index-uuid"
    // Additional index metadata
  }
}
3. Remove Files from Index
To remove one or more files from the index, provide their file hashes (they can be retrieved from the index details):
files_hashes = ["32348860fbb4c700eed9067261fc340"]
response = cosmos_client.files_index_delete_files_request(
    index_uuid="your-index-uuid",
    files_hashes=files_hashes,
)
print(response)
Expected response:
{
  "status": "success",
  "message": "File(s) deleted from index successfully",
  "data": {
    "index_uuid": "your-index-uuid",
    "remaining_files": [
      "26a79e1e7233ef12c763c4a0e6b3221ddba54357d4",
      "f66d2345d7c64ea7e4428f87d537927b567d8eba00"
    ]
  }
}
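A quick way to double-check a removal is to compare the hashes you asked to delete against remaining_files in the response. The sketch below assumes the response is available as a dict shaped like the sample above. The helper name still_present is hypothetical.

```python
def still_present(response: dict, deleted_hashes) -> list:
    """Return the requested hashes that still appear in remaining_files."""
    # Assumption: the response is a dict shaped like the sample above.
    remaining = set(response["data"]["remaining_files"])
    return [file_hash for file_hash in deleted_hashes if file_hash in remaining]
```

An empty list means none of the requested hashes is reported as remaining in the index.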
You can request the index details again to confirm that the files were added or removed correctly (see section 3.1 for Index details).
The details also show how much of your storage quota the files in the index take up. The quota is limited to 100 MB per organization (across all indexes).
3.3. EMBEDDING & QUERYING
1. Embed Index
To perform vectorized searches, you first need to embed the index. This operation computes the embeddings of the files belonging to the index:
response = cosmos_client.files_index_embed_request(index_uuid="your-index-uuid")
print(response)
Expected response:
{
  "status": "success",
  "message": "Index `your-index-uuid` successfully vectorized.",
  "data": {
    "index_uuid": "your-index-uuid"
  }
}
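To avoid re-embedding an index that is already vectorized, you could check the vectorized flag from the index details first. The sketch below takes the client as a parameter and uses only the two client methods shown in this guide. It assumes the details response can be read as a dict shaped like the sample in section 3.1. The helper name ensure_embedded is hypothetical.

```python
def ensure_embedded(client, index_uuid: str) -> bool:
    """Embed the index only if it is not vectorized yet.

    Returns True when an embed request was sent, False when it was skipped.
    """
    details = client.files_index_details_request(index_uuid=index_uuid)
    # Assumption: the response is a dict with data.vectorized, as in the samples.
    if details["data"]["vectorized"]:
        return False
    client.files_index_embed_request(index_uuid=index_uuid)
    return True
```

It would be called as ensure_embedded(cosmos_client, "your-index-uuid") before querying.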
2. Query Index
Now that the index is vectorized, you can query it. The answer is based on the embeddings of the files belonging to the index:
response = cosmos_client.files_index_ask_request(
    index_uuid=new_index_uuid,
    question="Where is located the bridge these articles mention?",
    output_language="en",
)
You can specify one or more file hashes to limit the files the Index analyzes when answering your question:
response = cosmos_client.files_index_ask_request(
    index_uuid=new_index_uuid,
    question="Where is located the bridge these articles mention?",
    output_language="en",
    active_files_hashes=[
        "26a79e1e7233ef12c763c4a0e6b3221ddba54357d4",
        "f66d2345d7c64ea7e4428f87d537927b567d8eba00",
    ],
)
The response contains an answer to the question, together with its sources: the index files and pages that contain the information the answer is based on.
Expected response:
{
  "status": "success",
  "message": "Query processed successfully",
  "data": {
    "answer": "The article discusses the bridge of Brooklyn 'FILE:1 PAGE:2'.",
    "sources": {
      "1": "efcb3858b45a3edb306c9d3457820e41ec78e3492f833ad189254c02886df260"
    }
  }
}
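The FILE and PAGE markers in the answer can be matched back to the file hashes in sources. The following is a minimal sketch that assumes citations always follow the 'FILE:n PAGE:m' pattern seen in the sample above and that the keys of sources are the file numbers as strings. The helper name extract_citations is hypothetical.

```python
import re

# Assumption: citations always look like FILE:<number> PAGE:<number>.
CITATION_PATTERN = re.compile(r"FILE:(\d+)\s+PAGE:(\d+)")


def extract_citations(data: dict) -> list:
    """Map each citation in data["answer"] to the file hash in data["sources"]."""
    citations = []
    for file_number, page in CITATION_PATTERN.findall(data["answer"]):
        citations.append({
            "file_number": int(file_number),
            "page": int(page),
            # None when the answer cites a file number missing from sources.
            "filehash": data["sources"].get(file_number),
        })
    return citations
```

Here data stands for the "data" object of the response.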
3.4. INDEX MANAGEMENT
1. List All Indexes
You can list all the indexes in your team that are in active or countdown (scheduled for deletion) status:
response = cosmos_client.files_index_list_request()
print(response)
Expected response:
{
  "status": "success",
  "message": "Retrieved 2 index.",
  "data": {
    "index": [
      {
        "index_uuid": "your-index-uuid",
        "name": "my_new_index",
        "status": "active",
        "vectorized": true,
        "created_at": "2024-11-15T15:03:00.219676+00:00",
        "updated_at": "2024-11-15T15:03:00.219681+00:00",
        "expires_at": "2024-11-16T15:03:00.219682+00:00",
        "storage": {
          "size_bytes": 147086,
          "size_mb": 0.01,
          "num_files": 2
        }
      },
      {
        "index_uuid": "another-index-uuid",
        "name": "2024 Sales results",
        "status": "active",
        "vectorized": false,
        "created_at": "2024-11-15T15:03:00.219676+00:00",
        "updated_at": "2024-11-15T15:03:00.219681+00:00",
        "expires_at": "2024-11-16T15:03:00.219682+00:00",
        "storage": {
          "size_bytes": 147086,
          "size_mb": 0.01,
          "num_files": 2
        }
      }
    ],
    "total_storage": {
      "bytes": 289172,
      "mb": 0.028,
      "limit_mb": 100,
      "usage_percentage": 2.8
    }
  },
  "error": null,
  "timestamp": "2024-11-15T16:11:50.398764Z"
}
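The total_storage block can be used to estimate whether an upload would still fit under the organization limit. The sketch below works on a dict shaped like the sample above. How the API converts bytes to megabytes is not stated in this guide, so the 1024 * 1024 factor is an assumption, and the helper names remaining_bytes and fits_in_quota are hypothetical.

```python
BYTES_PER_MB = 1024 * 1024  # assumption: the limit is expressed in binary megabytes


def remaining_bytes(total_storage: dict) -> int:
    """Bytes still available under the organization limit (never negative)."""
    limit_bytes = int(total_storage["limit_mb"] * BYTES_PER_MB)
    return max(limit_bytes - total_storage["bytes"], 0)


def fits_in_quota(total_storage: dict, new_file_sizes) -> bool:
    """True if files with the given sizes (in bytes) fit in the remaining space."""
    return sum(new_file_sizes) <= remaining_bytes(total_storage)
```

The sizes of local files can be obtained with os.path.getsize before calling fits_in_quota.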
2. Delete an Index (⚠️ warning: delayed operation)
You can delete an index you no longer need. Unlike the other endpoints, which take effect immediately, this endpoint applies a safety margin: the index is deleted 2 hours after the request, leaving time to reverse the operation in case of error. Once the expiry date is set, an index marked for deletion has the status "countdown" instead of "active".
response = cosmos_client.files_index_delete_request(new_index_uuid)
print(response)
Expected response:
{
  "status": "success",
  "message": "Index deleted successfully",
  "data": {
    "index_uuid": "your-index-uuid"
  }
}
3. Restore an Index scheduled for deletion
After an index is marked for deletion, but before its expiry date, you can restore it. Restoring returns the index to the "active" status and cancels the scheduled deletion, so you can revert the operation in case of error. This is only possible within the 2-hour window (while the index status is "countdown").
response = cosmos_client.files_index_restore_request(new_index_uuid)
print(response)
Expected response:
{
  "status": "success",
  "message": "Index restored successfully",
  "data": {
    "index_uuid": "your-index-uuid"
  }
}
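To see how much of the restore window is left, you could compare the expires_at value of an index in "countdown" status with the current time. The sketch below assumes an index record shaped like the entries returned when listing indexes, with expires_at as an ISO 8601 timestamp that includes a UTC offset. The helper name time_left_to_restore is hypothetical.

```python
from datetime import datetime, timedelta, timezone


def time_left_to_restore(index: dict, now=None):
    """Return the time left before deletion, or None if no deletion is scheduled."""
    if index["status"] != "countdown":
        return None
    # Assumption: expires_at is an ISO 8601 string with a UTC offset.
    expires_at = datetime.fromisoformat(index["expires_at"])
    now = now or datetime.now(timezone.utc)
    # Clamp at zero once the expiry date has passed.
    return max(expires_at - now, timedelta(0))
```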