
Scaling NLP processing pipelines with Dask and PySpark, utilising Apache Kafka real-time data streaming, for optimal LLM training


Daniel-Elston/real-time-reddit-scalable-processing


Goal

Build scalable text data processing pipelines for efficient model training with Dask and PySpark, utilising Apache Kafka for real-time data streaming.


Results

  1. Results can be found in the reports/result.xlsx file.
  2. The sample shown is partition 2 of the 4 partitions produced by the Dask processing.
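To illustrate how records end up spread across the 4 Dask partitions mentioned above, here is a minimal pure-Python sketch of even partitioning (the `partition` helper is hypothetical, not part of this repository; Dask's own chunking logic is more involved):

```python
from typing import Any

def partition(records: list[Any], npartitions: int = 4) -> list[list[Any]]:
    """Split records into npartitions roughly equal chunks,
    mimicking how an in-memory list is divided across partitions."""
    size, rem = divmod(len(records), npartitions)
    chunks, start = [], 0
    for i in range(npartitions):
        # The first `rem` partitions each take one extra record.
        end = start + size + (1 if i < rem else 0)
        chunks.append(records[start:end])
        start = end
    return chunks

comments = [f"comment_{i}" for i in range(10)]
parts = partition(comments, 4)
print([len(p) for p in parts])  # -> [3, 3, 2, 2]
```

With `dask.bag` or `dask.dataframe`, the equivalent is setting `npartitions=4` when constructing the collection; each partition is then processed in parallel.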

QRG (Quick Reference Guide):

  1. Start the Docker application
  2. Open 2 separate WSL terminals (T1 and T2)
  3. In T1, run docker-compose up --build
  4. Open config/settings and set the Config to either 'extract', 'transform' or 'results'
  5. Once all images are running, run python main.py in T2
  6. Data is streamed to the terminal and also saved to data/temp/reddit_comments.json
  7. Sample results of the PySpark and Dask processing can be found as SDOs in data/results/*.xlsx
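Step 6 above saves the streamed comments alongside printing them. A minimal sketch of such a sink, assuming comments arrive as dicts and are appended as JSON lines (the `append_comment` helper and the exact on-disk format are assumptions, not the repository's actual implementation):

```python
import json
from pathlib import Path

def append_comment(comment: dict, path: str = "data/temp/reddit_comments.json") -> None:
    """Append one streamed comment to the file as a JSON line.
    Hypothetical helper: the project's real writer may differ."""
    p = Path(path)
    p.parent.mkdir(parents=True, exist_ok=True)  # create data/temp/ if missing
    with p.open("a", encoding="utf-8") as f:
        f.write(json.dumps(comment) + "\n")

# Example: persist one comment while it is also echoed to the terminal.
comment = {"subreddit": "python", "body": "hello"}
print(comment)
append_comment(comment, "/tmp/reddit_comments_demo.json")
```

Appending one JSON object per line keeps the file safe to tail while the Kafka consumer is still writing, and both Dask and PySpark can read the JSON-lines format directly.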

Requirements:

  • WSL2
  • Ubuntu 24.04
  • Python 3.12.*

Ensure the Java Runtime Environment is installed:

```shell
sudo apt update
sudo apt install openjdk-11-jdk
readlink -f $(which java)
export JAVA_HOME=/usr/lib/jvm/java-11-openjdk-amd64
```

Add the following lines to your shell rc file (~/.bashrc, or ~/.zshrc if you use zsh):

```shell
echo 'export JAVA_HOME=/usr/lib/jvm/java-11-openjdk-amd64' >> ~/.bashrc
echo 'export PATH=$JAVA_HOME/bin:$PATH' >> ~/.bashrc
```
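PySpark launches a JVM under the hood, so a missing or misconfigured JAVA_HOME is a common failure mode. A small, illustrative sanity check you could run before starting the pipeline (the `check_java` helper is an assumption, not part of this repository):

```python
import os
import shutil

def check_java(expected_home: str = "/usr/lib/jvm/java-11-openjdk-amd64") -> bool:
    """Return True if JAVA_HOME matches the expected install path
    and a `java` binary is resolvable on PATH. Illustrative only."""
    java_home = os.environ.get("JAVA_HOME", "")
    has_java = shutil.which("java") is not None
    return java_home == expected_home and has_java

# False if JAVA_HOME is unset/incorrect or java is missing from PATH.
print(check_java())
```

If this returns False, re-run the export commands above (or restart the terminal so the rc file is re-sourced) before launching python main.py.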


Apologies & Disclaimer

This project streams real-time Reddit comments, which are generated by users across various subreddits. As a result, some content may include offensive, inappropriate, or controversial language. Please note that I do not endorse or control the content of these comments.

I sincerely apologise for any offensive material that may appear during the data stream. If you come across content that is particularly concerning, I encourage you to report it directly to Reddit through their moderation tools.

Thank you for your understanding.
