Tutorial: Hands-on with Rakam Systems
=====================================

This tutorial guides you through using **Rakam Systems** to extract content from documents, create and manage vector stores, and perform operations such as searching, adding, and deleting content.

Prerequisites
-------------

Before starting, make sure you have installed the necessary dependencies and set up your environment. If you haven't already, follow the :doc:`installation` guide to install the library.

In this tutorial, we will:

1. Extract content from documents (PDFs).
2. Create a vector store from the extracted content.
3. Add new files to the vector store.
4. Perform searches within the vector store.
5. Delete files and nodes, and add new nodes.
6. Reload the vector store and verify operations.

Make sure you have the following directory structure (or modify the paths as necessary):

- A folder containing PDF documents for testing.
- A folder to store the FAISS indices.

Step 1: Setting Up the Environment
----------------------------------

First, initialize a **VectorStores** object with a base directory for storing FAISS indices and an embedding model for generating vector embeddings. You can use any SentenceTransformer model; this tutorial uses ``all-MiniLM-L6-v2``.

Create a Python file named ``vector_store_example.py`` and include the following setup:

.. code-block:: python

    import os

    from rakam_systems.vector_store import VectorStores
    from rakam_systems.ingestion.data_processor import DataProcessor

    # Define the folder path for documents and the base index path for the vector store
    folder_path = "/path/to/your/documents"
    base_index_path = "/path/to/store/indices"
    embedding_model = "sentence-transformers/all-MiniLM-L6-v2"

    # Initialize the VectorStores object
    vector_store = VectorStores(base_index_path=base_index_path, embedding_model=embedding_model)

Step 2: Extracting Content from Documents
-----------------------------------------

Next, use the **DataProcessor** class to extract content from the documents in the specified folder. The **PDFContentExtractor** extracts the content from PDF files and converts it into nodes for indexing in the vector store.

Add the following to your ``vector_store_example.py``:

.. code-block:: python

    # DataProcessor was already imported in Step 1.
    # Initialize the DataProcessor and extract content from the folder
    processor = DataProcessor()
    vs_files = processor.process_files_from_directory(folder_path)

    print(f"Extracted content from {len(vs_files)} files.")

Step 3: Creating a Vector Store from Extracted Content
------------------------------------------------------

Once the content is extracted, you can create a vector store and index the extracted nodes using the **create_from_files** method. This method creates a FAISS index for the documents and stores the embeddings for efficient search.

.. code-block:: python

    # Create a vector store from the extracted VSFiles
    store_name = "my_vector_store"
    store_files = {store_name: vs_files}
    vector_store.create_from_files(store_files)

    print(f"Vector store '{store_name}' created successfully.")

Step 4: Adding New Files to the Vector Store
--------------------------------------------

You can add additional files to the vector store using the **add_files** method. In this step, we'll extract content from a new PDF file and add it to the existing vector store.

.. code-block:: python

    from rakam_systems.ingestion.content_extractors import PDFContentExtractor

    # Path to the additional PDF file
    additional_file_path = "/path/to/new.pdf"
    additional_pdf_extractor = PDFContentExtractor(parser_name="SimplePDFParser")

    # Extract content from the new file and add it to the store
    additional_vs_files = additional_pdf_extractor.extract_content(additional_file_path)
    vector_store.add_files(store_name, additional_vs_files)

    print(f"New file '{additional_file_path}' added to the store.")

Step 5: Searching the Vector Store
----------------------------------

You can search the vector store using a query. The **search** method finds the stored content most similar to the query, based on the embeddings generated by the vector store.

.. code-block:: python

    # Perform a search in the vector store
    query = "Your search query here"
    results, suggested_nodes = vector_store.search(
        store_name, query, distance_type="cosine", number=5
    )

    print(f"Search results: {results}")

Step 6: Deleting Files and Nodes
--------------------------------

To remove specific files or nodes from the vector store, use the **delete_files** and **delete_nodes** methods. In this step, we delete the file added in Step 4 and some nodes.

.. code-block:: python

    # Delete the added file from the vector store
    vector_store.delete_files(store_name, additional_vs_files)
    print(f"File '{additional_file_path}' deleted from the store.")

    # Delete specific nodes from the vector store
    node_ids_to_delete = [0]  # Replace with actual node IDs to delete
    vector_store.delete_nodes(store_name, node_ids_to_delete)
    print(f"Node IDs {node_ids_to_delete} deleted from the store.")

Step 7: Adding New Nodes to the Vector Store
--------------------------------------------

You can also manually create and add new nodes to the vector store using the **add_nodes** method. Each node must contain content and metadata.

.. code-block:: python

    from rakam_systems.core import Node, NodeMetadata

    # Create a new node with content and metadata
    new_node_metadata = NodeMetadata(source_file_uuid="new_file_uuid", position=1)
    new_node = Node(content="This is some new content", metadata=new_node_metadata)

    # Add the new node to the vector store
    vector_store.add_nodes(store_name, [new_node])

    print("New node added to the store.")

Step 8: Reloading the Vector Store
----------------------------------

You can reload all stored vector stores and check their integrity using the **load_all_stores** method. This is useful after performing several operations to ensure that the store is consistent.

.. code-block:: python

    # Reload all stores
    vector_store.load_all_stores()

    print("Vector stores reloaded successfully.")

Conclusion
----------

In this tutorial, you learned how to:

- Extract content from PDF files using the **DataProcessor**.
- Create and manage a vector store using **VectorStores**.
- Perform searches and manage files and nodes within the vector store.

Explore additional features like retrieval-augmented generation (RAG) and query classification by referring to the relevant sections in the documentation.

Happy coding!
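
For intuition about the cosine-based search used in this tutorial, the sketch below ranks a few toy vectors by cosine similarity to a query vector, using only the Python standard library. It is an illustration of the general idea, not Rakam Systems' or FAISS's implementation. The helper names ``cosine_similarity`` and ``top_k`` and the three-dimensional toy vectors are hypothetical, introduced only for this example.

```python
import math


def cosine_similarity(a, b):
    """Return the cosine similarity of two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        # Convention chosen for this sketch: a zero vector matches nothing
        return 0.0
    return dot / (norm_a * norm_b)


def top_k(query, vectors, k=5):
    """Return (index, score) pairs for the k vectors most similar to query."""
    scored = [(i, cosine_similarity(query, v)) for i, v in enumerate(vectors)]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]


# Hypothetical toy "embeddings"; real sentence embeddings have far more dimensions
toy_vectors = [
    [1.0, 0.0, 0.0],
    [0.9, 0.1, 0.0],
    [0.0, 1.0, 0.0],
    [0.0, 0.0, 1.0],
]
toy_query = [0.2, 1.0, 0.0]

top_matches = top_k(toy_query, toy_vectors, k=2)
for index, score in top_matches:
    print(f"vector {index}: similarity {score:.3f}")
```

A real vector index avoids this exhaustive scan or accelerates it, but the ranking principle is the same: vectors pointing in a direction close to the query's score highest.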