Big Twitter Clustering & Classification

Streamed tweets into AWS S3 with Kinesis Firehose and combined it with a larger 55 mil.-tweet dataset (Not covered in this repo)
Utilized PySpark in DataBricks to build custom PySpark transformers, label sentiment with SparkNLP/VADER, explore SparkML RandomForest and Logistic Regression classifiers, and to perform Latent Drichlet Allocation topic modelling
Visualized results in AWS QuickSight through an Athena pipeline

Sales Analysis & Demand Forecasting