Build scalable text data processing pipelines for efficient model training with Dask and PySpark, utilising Apache Kafka for real-time data streaming.
- Results can be found in the `reports/result.xlsx` file.
- This file is partition number 2 of the 4 partitions produced by the Dask processing (see the sketch below).
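For context, the sketch below shows one way the 4 Dask partitions mentioned above could each be written to its own Excel file. It is a minimal illustration rather than the project's actual code: the input path comes from this README, but the output file names are assumptions, and writing `.xlsx` files requires `openpyxl` to be installed.

```python
import dask.dataframe as dd

# Read the streamed comments (path taken from this README; line-delimited JSON assumed).
ddf = dd.read_json("data/temp/reddit_comments.json", lines=True)

# Split the data into 4 partitions, matching the 4 result partitions mentioned above.
ddf = ddf.repartition(npartitions=4)

# Dask has no native Excel writer, so each partition is computed to pandas
# and written out individually (file names are illustrative).
for i in range(ddf.npartitions):
    pdf = ddf.get_partition(i).compute()
    pdf.to_excel(f"data/results/dask_partition_{i}.xlsx", index=False)
```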
- Start the Docker app.
- Open 2 separate WSL terminals (T1 and T2).
- In T1, run `docker-compose up --build`.
- Open the `config/settings` file and adjust the Config to either 'extract', 'transform' or 'results'.
- Once all images are running, in T2, run `python main.py`.
- Data is streamed in the terminal and also saved to `data/temp/reddit_comments.json` (a consumer sketch follows this list).
- Sample results of the PySpark and Dask processing can be found as SDOs in `data/results/*.xlsx`.
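The streaming step works roughly as follows: comments are consumed from Kafka, echoed to the terminal, and appended to the temp file used by the later batch steps. The sketch below illustrates that flow under stated assumptions; the topic name, broker address, and use of `kafka-python` are assumptions rather than necessarily what `main.py` does.

```python
import json
from kafka import KafkaConsumer  # kafka-python

# Topic name and broker address are assumptions for illustration only.
consumer = KafkaConsumer(
    "reddit_comments",
    bootstrap_servers="localhost:9092",
    value_deserializer=lambda raw: json.loads(raw.decode("utf-8")),
)

# Append each comment to the temp file referenced in this README,
# one JSON object per line, while also echoing it to the terminal.
with open("data/temp/reddit_comments.json", "a", encoding="utf-8") as sink:
    for message in consumer:
        comment = message.value
        print(comment)
        sink.write(json.dumps(comment) + "\n")
        sink.flush()
```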
- WSL2
- Ubuntu 24.04
- Python 3.12.*
```bash
sudo apt update
sudo apt install openjdk-11-jdk

# Confirm where the JDK was installed
readlink -f $(which java)

export JAVA_HOME=/usr/lib/jvm/java-11-openjdk-amd64
```

Add the following lines to your shell profile (e.g. `~/.bashrc`):

```bash
echo 'export JAVA_HOME=/usr/lib/jvm/java-11-openjdk-amd64' >> ~/.bashrc
echo 'export PATH=$JAVA_HOME/bin:$PATH' >> ~/.bashrc
```
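After appending these lines, open a new terminal (or run `source ~/.bashrc`) so the variables take effect; `java -version` should then report OpenJDK 11. As a quick smoke test that PySpark can start against this JDK, something like the following should work; the app name is arbitrary and the JSON path is the temp file from this README.

```python
from pyspark.sql import SparkSession

# Start a local Spark session; this fails early if JAVA_HOME is not set correctly.
spark = (
    SparkSession.builder
    .appName("reddit-pipeline-smoke-test")
    .getOrCreate()
)

# Read the streamed comments and show a few rows (assumes the file already exists).
df = spark.read.json("data/temp/reddit_comments.json")
df.show(5, truncate=False)

spark.stop()
```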
This project streams real-time Reddit comments, which are generated by users across various subreddits. As a result, some content may include offensive, inappropriate, or controversial language. Please note that I do not endorse or control the content of these comments.
I sincerely apologise for any offensive material that may appear during the data stream. If you come across content that is particularly concerning, I encourage you to report it directly to Reddit through their moderation tools.
Thank you for your understanding.