05-orchestration
Data Preparation in RAG

Getting started

  1. Clone the repository
  2. Run `./scripts/start.sh`

```shell
git clone https://github.com/mage-ai/rag-project
cd rag-project
./scripts/start.sh
```

Once started, go to http://localhost:6789/

For more setup information, refer to these instructions.

0. Module overview

1. Ingest

In this section, we cover the ingestion of documents from a single data source.

TODO: what if we only have custom code? How do we edit it?
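As a minimal sketch of an ingestion step (the in-memory JSON source below stands in for a real file or API, and the field names are illustrative, not this project's schema):

```python
import io
import json

def ingest_documents(fp):
    """Read raw documents from a JSON source; one dict per document."""
    docs = json.load(fp)
    # Keep only the fields the downstream steps need.
    return [{"id": d["id"], "text": d["text"]} for d in docs]

# In-memory source standing in for a real file or HTTP response.
source = io.StringIO(json.dumps([
    {"id": "faq-1", "text": "How do I run the pipeline?", "extra": "ignored"},
]))
docs = ingest_documents(source)
print(docs[0]["id"])  # → faq-1
```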

2. Chunk

Once data is ingested, we break it into manageable chunks. This section explains the importance of chunking data and various techniques.

Why chunking? Embedding models and LLM context windows accept a limited number of tokens, and retrieval works best when each stored unit covers one focused idea. Chunking keeps pieces small enough to embed and precise enough to retrieve.
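As an illustration (the sizes are arbitrary, and real pipelines often chunk by tokens, sentences, or document structure rather than raw characters), a fixed-size chunker with overlap can be sketched as:

```python
def chunk_text(text, size=200, overlap=50):
    """Split text into overlapping character windows.

    Overlap preserves context that would otherwise be cut at chunk borders.
    """
    if size <= overlap:
        raise ValueError("size must exceed overlap")
    chunks = []
    start = 0
    while start < len(text):
        chunks.append(text[start:start + size])
        start += size - overlap
    return chunks

chunks = chunk_text("a" * 450, size=200, overlap=50)
print(len(chunks))  # → 3
```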

3. Tokenization

Tokenization is a crucial step in text processing: it splits text into the units a model actually consumes, and token counts determine how large each chunk can be and what embedding calls cost.
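Production pipelines typically use the tokenizer that matches the target model (for example, tiktoken for OpenAI models). As a rough stand-in, a naive word-level tokenizer shows the idea of counting tokens to gate chunk sizes:

```python
import re

def tokenize(text):
    """Naive word-level tokenizer: lowercase runs of letters and digits.
    A real pipeline would use the model's own tokenizer instead."""
    return re.findall(r"[a-z0-9]+", text.lower())

def token_count(text):
    return len(tokenize(text))

print(token_count("Data Preparation in RAG"))  # → 4
```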

4. Embed

Embedding data translates text into numerical vectors that capture semantic similarity, so related texts end up close together in vector space and can be compared by models.
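In a real pipeline this step calls a trained embedding model (a sentence-transformer or a hosted embedding API). Purely to show the shape of the data flowing through the step, here is a toy deterministic "embedding" that hashes words into a fixed-size unit vector; the function and dimension are illustrative only:

```python
import hashlib
import math

def embed(text, dim=8):
    """Toy stand-in for an embedding model: hash words into a unit vector.
    Deterministic, but carries no semantics; a real pipeline calls a model."""
    vec = [0.0] * dim
    for word in text.lower().split():
        bucket = int(hashlib.sha256(word.encode()).hexdigest(), 16) % dim
        vec[bucket] += 1.0
    norm = math.sqrt(sum(v * v for v in vec)) or 1.0
    return [v / norm for v in vec]

v = embed("chunking splits documents")
print(len(v))  # → 8
```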

5. Export

After processing, the chunks and their embeddings are exported to storage (a vector store) so they can later be retrieved to add context to user queries.
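As a minimal, storage-agnostic sketch (the file name and record fields are illustrative; in practice the destination is a vector database), the export step can be pictured as writing one JSON record per chunk:

```python
import json
import os
import tempfile

def export_embeddings(records, path):
    """Write one {id, text, vector} record per line (JSON Lines)."""
    with open(path, "w", encoding="utf-8") as f:
        for r in records:
            f.write(json.dumps(r) + "\n")

records = [
    {"id": "faq-1#0", "text": "How do I run the pipeline?", "vector": [0.0, 1.0]},
]
path = os.path.join(tempfile.mkdtemp(), "chunks.jsonl")
export_embeddings(records, path)
```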

6. Test Vector Search Query

After exporting the chunks and embeddings, we can run sample queries against the vector store to verify that relevant documents are retrieved.
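A minimal sketch of what such a test does under the hood, assuming an in-memory index of {id, vector} records and cosine similarity as the ranking function:

```python
import math

def cosine(a, b):
    """Cosine similarity between two vectors of equal length."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb) if na and nb else 0.0

def search(query_vec, index, top_k=3):
    """Rank stored chunks by cosine similarity to the query vector."""
    scored = [(cosine(query_vec, r["vector"]), r) for r in index]
    scored.sort(key=lambda s: s[0], reverse=True)
    return [r for _, r in scored[:top_k]]

index = [
    {"id": "a", "vector": [1.0, 0.0]},
    {"id": "b", "vector": [0.0, 1.0]},
]
hits = search([0.9, 0.1], index, top_k=1)
print(hits[0]["id"])  # → a
```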

7. Trigger Daily Runs

Automation is key to maintaining and updating your system. This section demonstrates how to schedule and trigger daily runs for your data pipelines, ensuring up-to-date and consistent data processing.
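In Mage, daily runs are configured as pipeline triggers. Purely to illustrate the cadence (the script path and name below are hypothetical, not part of this project), the same daily schedule expressed as a crontab entry would be:

```cron
# Run the data-preparation pipeline every day at 02:00
0 2 * * * cd /path/to/rag-project && python run_pipeline.py
```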

Homework

TBA

Notes

  • First link goes here
  • Did you take notes? Add them above this line (Send a PR with links to your notes)