Hosted by Trustii.io, sponsored by Understand.Tech
Introduction
Are you passionate about open-source Large Language Models (LLMs), data privacy, and versatile AI applications? Join our exciting challenge to push the boundaries of what's possible with offline RAG systems!
Understand.Tech offers an innovative AI platform that allows users to create AI assistants in seconds from any content—whether it's URLs, PDF documents, CSV files, code snippets, or GitHub repositories. This platform transforms your content into an AI assistant with a chat interface, which can be shared via a chat widget or a unique URL.
While we leverage powerful models like GPT-4 and OpenAI embeddings, some use cases require a fully offline solution to protect sensitive data and ensure compliance.
We invite the data science community on Trustii.io to help us create a complete offline RAG system for embeddings creation and chat/retrieval, leveraging the power of open-source technologies. The results of this challenge will be made open source under the MIT license, contributing to the broader community.
Competition Details
Powered by: Understand.Tech
Key Dates:
Start Date: Thursday, October 10th, 2024, at 8AM CET
Submission Deadline: November 10th, 2024, at 11:59PM CET.
Winner Announcement: November 21st, 2024
Objectives
Build a Flexible Local RAG System: Develop a RAG system that generates embeddings using an open-source LLM and runs entirely locally, without relying on external API calls. The system should be flexible enough to handle various types of text data, including but not limited to Q&A datasets, websites, code snippets, and documentation.
Create a Versatile Local Chat Interface: Build a chat interface that interacts with the locally stored vector store generated from text embeddings. This interface should allow users to query the embeddings and retrieve relevant information to generate responses through a locally executed LLM. To demonstrate the system's flexibility, the interface should handle different content types, code snippets, and multiple languages, specifically queries in English and French.
Technical Requirements
Local Execution: The RAG system must run entirely locally without making external API calls to LLMs (e.g., GPT-3, GPT-4). Use open-source models, such as LLaMA, for generating embeddings and for text retrieval.
Vector Store Implementation: Use a vector database (must be Faiss) to build a vector store for efficient text retrieval. Generate embeddings with open-source models (such as LLaMA, Sentence Transformers, or Mistral) to populate the vector store.
Use of LangChain: Build a chat interface using the LangChain API, allowing users to interact with the vector store and generate responses based on retrieved documents. The interface should demonstrate flexibility by handling various content types (text, code, websites, etc.). The chat interface must support multiple languages, specifically English and French, for both text retrieval and response generation. The LLM must be capable of processing and generating code snippets and technical language. A minimal end-to-end sketch follows this list.
Programming Language: The system must be developed in Python.
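As an illustration only, here is a minimal sketch of how such a pipeline could be wired together with LangChain, Faiss, and a locally executed open-source model. It is not a reference implementation: the import paths assume the langchain / langchain-community packages (adjust to the LangChain version you install), and the embedding model name and GGUF model path are placeholders to swap for your own choices.

# Minimal local RAG sketch: sentence-transformers embeddings + Faiss + a local LLM.
# Model names and file paths below are placeholders.
from langchain_community.embeddings import HuggingFaceEmbeddings
from langchain_community.vectorstores import FAISS
from langchain_community.llms import LlamaCpp
from langchain.chains import RetrievalQA

# 1. Embed documents locally (no external API calls).
embeddings = HuggingFaceEmbeddings(model_name="sentence-transformers/all-MiniLM-L6-v2")

documents = [
    "LangChain is a framework for building applications around LLMs.",
    "def sort_list(lst):\n    return sorted(lst)",
    "RAG combines information retrieval with text generation.",
]

# 2. Build the Faiss vector store from the embedded texts and persist it on disk.
vector_store = FAISS.from_texts(documents, embeddings)
vector_store.save_local("faiss_index")

# 3. Load a local, open-source LLM (here via llama-cpp-python; any local backend works).
llm = LlamaCpp(model_path="models/llama-3-8b-instruct.Q4_K_M.gguf", n_ctx=4096)

# 4. Wire retrieval and generation together with LangChain.
qa = RetrievalQA.from_chain_type(
    llm=llm,
    retriever=vector_store.as_retriever(search_kwargs={"k": 3}),
)

print(qa.invoke({"query": "What is LangChain?"})["result"])
print(qa.invoke({"query": "Expliquez le concept de RAG."})["result"])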
Dataset
The dataset is a text dataset with two columns: Query and Response. It includes a variety of content types, such as general text, technical documentation, and code snippets.
Data Split:
You'll receive train.csv and test.csv. The train.csv contains 70% of the dataset, and test.csv contains the remaining 30%, including only the trustii_id and Query columns.
Your Task:
Predict the Response for each query in the test.csv and submit a CSV file with your generated responses.
Example test.csv:
trustii_id,Query
12345,"What is LangChain?"
67890,"Expliquez le concept de RAG."
54321,"How does retrieval work in RAG?"
98765,"Provide a Python function to sort a list."
24680,"Qu'est-ce que le polymorphisme en programmation orientée objet?"
Submission Format
CSV Submission: Submit a CSV file containing the trustii_id, the original Query from test.csv, and your generated Response. A short sketch for producing this file follows the example below.
Example Submission CSV:
trustii_id,Query,Response
12345,"What is LangChain?","LangChain is a framework for integrating LLMs with external data."
67890,"Expliquez le concept de RAG.","RAG signifie génération augmentée par la récupération, combinant la récupération d'informations avec la génération de texte."
54321,"How does retrieval work in RAG?","In RAG, relevant documents are retrieved before text generation."
98765,"Provide a Python function to sort a list.","def sort_list(lst):\n return sorted(lst)"
24680,"Qu'est-ce que le polymorphisme en programmation orientée objet?","Le polymorphisme permet aux objets de différents types de répondre à la même interface ou méthode."
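As a starting point, the submission file could be assembled with pandas along these lines. Here generate_answer is a placeholder standing in for a call to your local RAG pipeline.

# Sketch of producing the submission file from test.csv with pandas.
import pandas as pd

def generate_answer(query: str) -> str:
    # Replace with a call to your local RAG system (retrieval + generation).
    raise NotImplementedError

test_df = pd.read_csv("test.csv")  # columns: trustii_id, Query
test_df["Response"] = test_df["Query"].apply(generate_answer)

# Keep the expected column order: trustii_id, Query, Response.
test_df[["trustii_id", "Query", "Response"]].to_csv("submission.csv", index=False)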
Code Submission:
At the end of the competition, we'll ask the top solutions eligible for prizes to submit their source code as a GitHub repository.
Your submission must include:
A README with setup instructions to build the project.
Necessary scripts to run the project and test its functionality (creating a RAG on various text types and interacting with it).
Documentation on the retrieval and generation techniques used, including memory and computational resources utilized.
License Requirement:
The submitted code must be made open source under the MIT license.
Baseline Generation Using Understand.Tech
Participants can generate baseline answers using Understand.Tech (https://app.understand.tech). Your model should produce responses that are at least similar in quality and content to these baseline answers. This will help you gauge the expected performance level and ensure your system meets the challenge's standards.
Evaluation Criteria
The evaluation will focus on three main aspects:
System Accuracy (40%):
Evaluated using BERTScore based on the semantic similarity of your responses to the ground truth (a minimal self-check sketch is shown after this section).
The system's ability to handle various content types (text, code, technical documentation) will be tested.
Reproducibility (20%):
How straightforward it is to install and run your solution.
Clear instructions and scripts to set up the environment and dependencies.
Assessment of code readability, modularity, and documentation.
Use of best practices in coding and project structure.
Cost of Deployment (40%):
How efficiently the system runs on local hardware with minimum memory and CPU/GPU usage.
Optimization techniques used to reduce computational overhead.
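To get a rough sense of the accuracy metric before submitting, you can self-evaluate on a held-out slice of train.csv with the open-source bert-score package. The snippet below is only a local sanity check under assumed file names (holdout_predictions.csv, holdout_references.csv) and an assumed model choice; the organizers' exact scoring configuration may differ.

# Rough self-check of semantic similarity with BERTScore (pip install bert-score).
from bert_score import score
import pandas as pd

# Assumed file names: your predictions for a held-out slice of train.csv and its references.
preds = pd.read_csv("holdout_predictions.csv")  # trustii_id, Query, Response
refs = pd.read_csv("holdout_references.csv")    # trustii_id, Query, Response

# Align predictions and references by trustii_id.
merged = preds.merge(refs, on="trustii_id", suffixes=("_pred", "_ref"))

# A multilingual model is a reasonable choice since queries are in English and French.
P, R, F1 = score(
    merged["Response_pred"].tolist(),
    merged["Response_ref"].tolist(),
    model_type="bert-base-multilingual-cased",
)
print(f"Mean BERTScore F1: {F1.mean().item():.4f}")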
Prizes: a total of $11,000 in prizes provided by Understand.Tech
1st Prize: $4,000 cash and a 1-year free team subscription to Understand.Tech.
2nd Prize: $1,000 cash and a 1-year free team subscription to Understand.Tech.
3rd, 4th, and 5th Prizes: A 1-year free team subscription to Understand.Tech.
All participants will have access to a free account on Understand.Tech (https://app.understand.tech) to create baseline answers.
Get Started!
Access the challenge at https://app.trustii.io on Thursday, October 10th, at 8AM CET.
Conclusion
This challenge is a fantastic opportunity to contribute to the development of secure, flexible, and offline RAG systems using open-source technologies. By participating, you'll help create solutions that prioritize data privacy and security, capable of handling a variety of content types including text, code, and technical documentation. Your work will be shared with the community under the MIT license.
We can't wait to see your innovative solutions!
For any questions or further information, please contact us at challenges@trustii.io.