
Scaling NLP processing pipelines with Dask and PySpark, utilising Apache Kafka real-time data streaming, for optimal LLM training


Daniel-Elston/real-time-reddit-scalable-processing


Goal

Build scalable text data processing pipelines for efficient model training with Dask and PySpark, utilising Apache Kafka for real-time data streaming.


Results

  1. Results can be found in the reports/result.xlsx file.
  2. The sample shown is partition 2 of the 4 partitions produced by the Dask processing.
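To illustrate how records end up spread across the 4 Dask partitions mentioned above, here is a minimal pure-Python sketch of even partitioning (the `partition` helper is hypothetical, not part of this repository; Dask's own chunking logic is more involved):

```python
from typing import Any

def partition(records: list[Any], npartitions: int = 4) -> list[list[Any]]:
    """Split records into npartitions roughly equal chunks,
    mimicking how an in-memory list is divided across partitions."""
    size, rem = divmod(len(records), npartitions)
    chunks, start = [], 0
    for i in range(npartitions):
        # The first `rem` partitions each take one extra record.
        end = start + size + (1 if i < rem else 0)
        chunks.append(records[start:end])
        start = end
    return chunks

comments = [f"comment_{i}" for i in range(10)]
parts = partition(comments, 4)
print([len(p) for p in parts])  # -> [3, 3, 2, 2]
```

With `dask.bag` or `dask.dataframe`, the equivalent is setting `npartitions=4` when constructing the collection; each partition is then processed in parallel.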

QRG (Quick Reference Guide):

  1. Start the Docker application
  2. Open 2 separate WSL terminals (T1 and T2)
  3. In T1, run docker-compose up --build
  4. Open config/settings and set the Config to either 'extract', 'transform' or 'results'
  5. Once all images are running, run python main.py in T2
  6. Data is streamed to the terminal and also saved to data/temp/reddit_comments.json
  7. Sample results of the PySpark and Dask processing can be found as SDOs in data/results/*.xlsx
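Step 6 above saves the streamed comments alongside printing them. A minimal sketch of such a sink, assuming comments arrive as dicts and are appended as JSON lines (the `append_comment` helper and the exact on-disk format are assumptions, not the repository's actual implementation):

```python
import json
from pathlib import Path

def append_comment(comment: dict, path: str = "data/temp/reddit_comments.json") -> None:
    """Append one streamed comment to the file as a JSON line.
    Hypothetical helper: the project's real writer may differ."""
    p = Path(path)
    p.parent.mkdir(parents=True, exist_ok=True)  # create data/temp/ if missing
    with p.open("a", encoding="utf-8") as f:
        f.write(json.dumps(comment) + "\n")

# Example: persist one comment while it is also echoed to the terminal.
comment = {"subreddit": "python", "body": "hello"}
print(comment)
append_comment(comment, "/tmp/reddit_comments_demo.json")
```

Appending one JSON object per line keeps the file safe to tail while the Kafka consumer is still writing, and both Dask and PySpark can read the JSON-lines format directly.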

Requirements:

  • WSL2
  • Ubuntu 24.04
  • Python 3.12.*

Ensure the Java Runtime Environment is installed:

```shell
sudo apt update
sudo apt install openjdk-11-jdk
readlink -f $(which java)
export JAVA_HOME=/usr/lib/jvm/java-11-openjdk-amd64
```

Add the following lines to your shell rc file (~/.bashrc, or ~/.zshrc if you use zsh):

```shell
echo 'export JAVA_HOME=/usr/lib/jvm/java-11-openjdk-amd64' >> ~/.bashrc
echo 'export PATH=$JAVA_HOME/bin:$PATH' >> ~/.bashrc
```
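PySpark launches a JVM under the hood, so a missing or misconfigured JAVA_HOME is a common failure mode. A small, illustrative sanity check you could run before starting the pipeline (the `check_java` helper is an assumption, not part of this repository):

```python
import os
import shutil

def check_java(expected_home: str = "/usr/lib/jvm/java-11-openjdk-amd64") -> bool:
    """Return True if JAVA_HOME matches the expected install path
    and a `java` binary is resolvable on PATH. Illustrative only."""
    java_home = os.environ.get("JAVA_HOME", "")
    has_java = shutil.which("java") is not None
    return java_home == expected_home and has_java

# False if JAVA_HOME is unset/incorrect or java is missing from PATH.
print(check_java())
```

If this returns False, re-run the export commands above (or restart the terminal so the rc file is re-sourced) before launching python main.py.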


Apologies & Disclaimer

This project streams real-time Reddit comments, which are generated by users across various subreddits. As a result, some content may include offensive, inappropriate, or controversial language. Please note that I do not endorse or control the content of these comments.

I sincerely apologise for any offensive material that may appear during the data stream. If you come across content that is particularly concerning, I encourage you to report it directly to Reddit through their moderation tools.

Thank you for your understanding.
