Tutorial: Hands-on with Rakam Systems

This tutorial will guide you through using Rakam Systems to extract content from documents, create and manage vector stores, and perform various operations like searching, adding, and deleting content.

Prerequisites

Before starting, make sure you have installed the necessary dependencies and set up your environment. If you haven’t already, check the installation guide to install the library.

In this tutorial, we will:

1. Extract content from documents (PDFs).
2. Create a vector store from the extracted content.
3. Add new files to the vector store.
4. Perform searches within the vector store.
5. Add and delete nodes.
6. Reload the vector store and verify operations.

Make sure you have the following directory structure (or modify the paths as necessary):

- A folder containing PDF documents for testing.
- A folder to store the FAISS indices.
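As a quick sanity check before you start, both folders can be verified with the standard library. This is a minimal sketch and not part of Rakam Systems: the helper name prepare_directories is hypothetical, and the demo uses a temporary directory in place of your real paths.

```python
from pathlib import Path
import tempfile


def prepare_directories(documents_dir, indices_dir):
    """Check that the documents folder exists, create the index folder if
    it is missing, and return the PDF files found (hypothetical helper)."""
    documents = Path(documents_dir)
    indices = Path(indices_dir)
    if not documents.is_dir():
        raise FileNotFoundError(f"Documents folder not found: {documents}")
    indices.mkdir(parents=True, exist_ok=True)
    # Only files with a .pdf extension are treated as test documents here.
    return sorted(p for p in documents.iterdir() if p.suffix.lower() == ".pdf")


# Demo with throwaway folders; replace these with your own paths.
with tempfile.TemporaryDirectory() as tmp:
    docs = Path(tmp) / "documents"
    docs.mkdir()
    (docs / "sample.pdf").write_bytes(b"%PDF-1.4\n")
    (docs / "notes.txt").write_text("not a pdf", encoding="utf-8")
    found = prepare_directories(docs, Path(tmp) / "indices")
    found_names = [p.name for p in found]
    indices_created = (Path(tmp) / "indices").is_dir()

print(found_names, indices_created)
```

Running a check like this first makes a wrong path fail early, before any extraction or indexing work starts.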

Step 1: Setting Up the Environment

First, initialize a VectorStores object with a base directory for storing FAISS indices and an embedding model for generating vector embeddings. You can use any SentenceTransformer model, but in this tutorial, we’ll use all-MiniLM-L6-v2.

Create a Python file named vector_store_example.py and include the following setup:

import os
from rakam_systems.vector_store import VectorStores
from rakam_systems.ingestion.data_processor import DataProcessor

# Define the folder path for documents and the base index path for the vector store
folder_path = "/path/to/your/documents"
base_index_path = "/path/to/store/indices"
embedding_model = "sentence-transformers/all-MiniLM-L6-v2"

# Initialize the VectorStores object
vector_store = VectorStores(base_index_path=base_index_path, embedding_model=embedding_model)

Step 2: Extracting Content from Documents

Next, use the DataProcessor class to extract content from the documents in the specified folder. The PDFContentExtractor extracts the content from PDF files and converts it into nodes for indexing in the vector store.

Add the following to your vector_store_example.py:

from rakam_systems.ingestion.data_processor import DataProcessor

# Initialize the DataProcessor and extract content from the folder
processor = DataProcessor()
vs_files = processor.process_files_from_directory(folder_path)

print(f"Extracted content from {len(vs_files)} files.")

Step 3: Creating a Vector Store from Extracted Content

Once the content is extracted, create a vector store and index the extracted nodes with the create_from_files method. The method takes a dictionary that maps each store name to its list of extracted files, builds a FAISS index for the documents, and stores the embeddings for efficient search.

# Create a vector store from the extracted VSFiles
store_name = "my_vector_store"
store_files = {store_name: vs_files}

vector_store.create_from_files(store_files)

print(f"Vector store '{store_name}' created successfully.")
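To make the idea of "store name mapped to content, content mapped to vectors" concrete, here is a toy sketch in plain Python. It is only an illustration under loud assumptions: the word-count embed function stands in for a real SentenceTransformer model, the nested dictionary stands in for a FAISS index, and plain strings stand in for the extracted files. None of these names come from Rakam Systems.

```python
from collections import Counter
import math


def embed(text, vocabulary):
    """Toy embedding: L2-normalised word counts over a fixed vocabulary.
    A real embedding model produces dense semantic vectors instead."""
    counts = Counter(text.lower().split())
    vector = [float(counts[word]) for word in vocabulary]
    norm = math.sqrt(sum(v * v for v in vector))
    return [v / norm for v in vector] if norm else vector


def build_index(store_files):
    """Map each store name to a list of (node_id, text, vector) entries."""
    all_texts = [text for texts in store_files.values() for text in texts]
    vocabulary = sorted({w for text in all_texts for w in text.lower().split()})
    index = {}
    for store_name, texts in store_files.items():
        index[store_name] = [
            (node_id, text, embed(text, vocabulary))
            for node_id, text in enumerate(texts)
        ]
    return index, vocabulary


# Same shape as the dictionary above: {store_name: list_of_items}.
toy_files = {"my_vector_store": ["faiss stores vectors", "pdf text extraction"]}
toy_index, toy_vocabulary = build_index(toy_files)
print(len(toy_index["my_vector_store"]), "entries,", len(toy_vocabulary), "dimensions")
```

The point of the sketch is the data flow: every piece of text gets an ID and a vector, and the vectors are what a later search compares against.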

Step 4: Adding New Files to the Vector Store

You can add additional files to the vector store using the add_files method. In this step, we’ll extract content from a new PDF file and add it to the existing vector store.

from rakam_systems.ingestion.content_extractors import PDFContentExtractor

# Path to the additional PDF file
additional_file_path = "/path/to/new.pdf"
additional_pdf_extractor = PDFContentExtractor(parser_name="SimplePDFParser")

# Extract content from the new file and add it to the store
additional_vs_files = additional_pdf_extractor.extract_content(additional_file_path)
vector_store.add_files(store_name, additional_vs_files)

print(f"New file '{additional_file_path}' added to the store.")
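Before handing a path to an extractor, it can help to confirm that the file exists and really is a PDF. The sketch below relies on the fact that PDF files normally begin with the bytes %PDF-. The helper name looks_like_pdf is hypothetical and the check is independent of Rakam Systems.

```python
import tempfile
from pathlib import Path


def looks_like_pdf(path):
    """Return True if the file exists and starts with the PDF header bytes."""
    file_path = Path(path)
    if not file_path.is_file():
        return False
    with file_path.open("rb") as handle:
        return handle.read(5) == b"%PDF-"


# Demo with throwaway files: a real header, a mislabelled text file,
# and a path that does not exist.
with tempfile.TemporaryDirectory() as tmp:
    good = Path(tmp) / "new.pdf"
    good.write_bytes(b"%PDF-1.7\n")
    bad = Path(tmp) / "renamed.pdf"
    bad.write_text("plain text", encoding="utf-8")
    checks = (
        looks_like_pdf(good),
        looks_like_pdf(bad),
        looks_like_pdf(Path(tmp) / "missing.pdf"),
    )

print(checks)
```

A guard like this could run just before the extraction call, so that a missing or mislabelled file is reported clearly instead of surfacing as a parser error.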

Step 5: Searching the Vector Store

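You can search the vector store using a query. The search method finds the stored content whose embeddings are most similar to the query. In the call below, distance_type="cosine" selects cosine similarity as the metric and number=5 limits the output to the five closest matches.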

# Perform a search in the vector store
query = "Your search query here"
results, suggested_nodes = vector_store.search(
    store_name, query, distance_type="cosine", number=5
)

print(f"Search results: {results}")
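Conceptually, a cosine-based search ranks stored vectors by the angle between each of them and the query vector, then keeps the top few. The following is a minimal pure-Python sketch of that ranking. It is not how the library or FAISS implements search, and the names cosine_similarity and top_k are illustrative only.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # Treat an all-zero vector as similar to nothing.
    return dot / (norm_a * norm_b)


def top_k(query_vector, vectors_by_id, k=5):
    """Return the k (node_id, score) pairs with the highest similarity."""
    scored = [
        (node_id, cosine_similarity(query_vector, vector))
        for node_id, vector in vectors_by_id.items()
    ]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]


# Three toy 2-D "embeddings" keyed by node ID.
toy_vectors = {
    0: [1.0, 0.0],
    1: [0.7, 0.7],
    2: [0.0, 1.0],
}
ranking = top_k([1.0, 0.1], toy_vectors, k=2)
print([node_id for node_id, _ in ranking])
```

Because cosine similarity ignores vector length, two texts can match closely even when one is much longer than the other, which is one reason it is a common choice for text embeddings.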

Step 6: Deleting Files and Nodes

To remove specific files or nodes from the vector store, use the delete_files and delete_nodes methods. In this step, we delete the file added in Step 4 and then remove individual nodes by ID.

# Deleting the added file from the vector store
vector_store.delete_files(store_name, additional_vs_files)
print(f"File '{additional_file_path}' deleted from the store.")

# Deleting specific nodes from the vector store
node_ids_to_delete = [0]  # Replace with actual node IDs to delete
vector_store.delete_nodes(store_name, node_ids_to_delete)

print(f"Node IDs {node_ids_to_delete} deleted from the store.")
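The difference between the two kinds of deletion can be sketched with a toy list of nodes. The sketch assumes that deleting a file amounts to removing every node whose metadata points at that file, and that deleting nodes removes only the listed IDs. That is an assumption about the behaviour, not a description of the library's internals, and the dictionary layout and function names are hypothetical.

```python
def delete_file(nodes, source_file_uuid):
    """Drop every node that came from the given source file."""
    return [n for n in nodes if n["source_file_uuid"] != source_file_uuid]


def delete_nodes(nodes, node_ids):
    """Drop the nodes with the given IDs and report IDs that were not found."""
    wanted = set(node_ids)
    existing = {n["id"] for n in nodes}
    remaining = [n for n in nodes if n["id"] not in wanted]
    missing = sorted(wanted - existing)
    return remaining, missing


toy_nodes = [
    {"id": 0, "source_file_uuid": "file-a", "content": "intro"},
    {"id": 1, "source_file_uuid": "file-a", "content": "details"},
    {"id": 2, "source_file_uuid": "file-b", "content": "appendix"},
]

# Removing file-b takes node 2 with it; then node 0 is removed by ID.
after_file_delete = delete_file(toy_nodes, "file-b")
after_node_delete, missing_ids = delete_nodes(after_file_delete, [0, 7])
remaining_ids = [n["id"] for n in after_node_delete]
print(remaining_ids, missing_ids)
```

Reporting unknown IDs, as the toy delete_nodes does, is a useful habit when the IDs come from an earlier search result that may already be stale.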

Step 7: Adding New Nodes to the Vector Store

You can also manually create and add new nodes to the vector store using the add_nodes method. Each node must contain content and metadata.

from rakam_systems.core import Node, NodeMetadata

# Create a new node with content and metadata
new_node_metadata = NodeMetadata(source_file_uuid="new_file_uuid", position=1)
new_node = Node(content="This is some new content", metadata=new_node_metadata)

# Add the new node to the vector store
vector_store.add_nodes(store_name, [new_node])

print("New node added to the store.")
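When adding more than one node by hand, the content usually comes from splitting a longer text into chunks. The sketch below prepares plain dictionaries holding the same three fields used above (content, source_file_uuid, position). The chunking rule, the 1-based positions, and the helper names are assumptions made for illustration.

```python
def chunk_text(text, max_words=5):
    """Split text into consecutive chunks of at most max_words words."""
    words = text.split()
    return [
        " ".join(words[i:i + max_words])
        for i in range(0, len(words), max_words)
    ]


def make_node_records(text, source_file_uuid, max_words=5):
    """Build one record per chunk, numbering positions from 1 (assumed)."""
    return [
        {
            "content": chunk,
            "source_file_uuid": source_file_uuid,
            "position": position,
        }
        for position, chunk in enumerate(chunk_text(text, max_words), start=1)
    ]


records = make_node_records(
    "one two three four five six seven", "new_file_uuid", max_words=3
)
for record in records:
    print(record["position"], record["content"])
```

Each record could then be turned into a Node with its NodeMetadata, following the constructor calls shown in this step, and the resulting list passed to add_nodes in a single call.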

Step 8: Reloading the Vector Store

You can reload all stored vector stores and check their integrity using the load_all_stores method. This is useful after performing several operations to ensure that the store is consistent.

# Reload all stores
vector_store.load_all_stores()
print("Vector stores reloaded successfully.")
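One way to verify operations after a reload is to compare a small summary of the store taken before saving with the same summary taken after loading. The sketch below shows that pattern with a JSON file as a stand-in for persisted FAISS indices. The functions save_store, load_store, and snapshot are hypothetical and do not exist in Rakam Systems.

```python
import json
import tempfile
from pathlib import Path


def save_store(path, nodes):
    """Persist the toy store as JSON (stand-in for writing an index)."""
    Path(path).write_text(json.dumps(nodes, sort_keys=True), encoding="utf-8")


def load_store(path):
    """Read the toy store back from disk."""
    return json.loads(Path(path).read_text(encoding="utf-8"))


def snapshot(nodes):
    """Summarise a store as (node count, sorted node IDs) for comparison."""
    ids = sorted(node["id"] for node in nodes)
    return len(ids), ids


toy_nodes = [
    {"id": 1, "content": "details"},
    {"id": 3, "content": "This is some new content"},
]

with tempfile.TemporaryDirectory() as tmp:
    store_path = Path(tmp) / "my_vector_store.json"
    before = snapshot(toy_nodes)
    save_store(store_path, toy_nodes)
    after = snapshot(load_store(store_path))

consistent = before == after
print(consistent)
```

With the real store, a comparable check would be to run the same query before and after load_all_stores and confirm that the results match.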

Conclusion

In this tutorial, you learned how to:

- Extract content from PDF files using the DataProcessor.
- Create and manage a vector store using VectorStores.
- Perform searches and manage files and nodes within the vector store.

Explore additional features like retrieval-augmented generation (RAG) and query classification by referring to the relevant sections in the documentation.

Happy coding!