Threat Signal Pipeline
Lightweight pipeline: RSS → Vertex AI (Gemini) → BigQuery → Slack.
Google Cloud Functions · Vertex AI (Gemini) · BigQuery · Python · Slack API
Problem
Security teams are overwhelmed by the sheer volume of threat intelligence and cybersecurity news feeds. Reading, deduplicating, and manually triaging hundreds of RSS items every day leads to alert fatigue and missed critical updates.
Approach
I built a lightweight, serverless Python pipeline to automate the ingestion, relevance filtering, and summarization of threat intelligence:
- Ingestion: Fetches items from configured RSS feeds and deduplicates them using feed IDs and semantic similarity checks.
- AI Processing: Calls Google Cloud Vertex AI (Gemini) to assess relevance, generate intelligent summaries, and create text embeddings.
- Storage: Writes the processed news items to a Google BigQuery dataset.
- Notification: Optionally posts high-relevance summaries directly to a Slack channel via an incoming webhook.
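The stages above can be sketched as a small composition of plain functions. This is an illustrative outline, not the actual implementation: `NewsItem`, `dedupe_by_id`, and `run_pipeline` are hypothetical names, and the `assess`, `store`, and `notify` callables stand in for the Gemini, BigQuery, and Slack integrations.

```python
# Illustrative sketch of the pipeline stages; each stage is a plain
# function so it can be tested and swapped independently.
from dataclasses import dataclass

@dataclass
class NewsItem:
    item_id: str
    title: str
    summary: str = ""
    relevant: bool = False

def dedupe_by_id(items):
    """Drop items whose feed ID was already seen (first pass of dedup)."""
    seen, unique = set(), []
    for item in items:
        if item.item_id not in seen:
            seen.add(item.item_id)
            unique.append(item)
    return unique

def run_pipeline(items, assess, store, notify):
    """Ingest -> dedupe -> AI assessment -> storage -> notification."""
    for item in dedupe_by_id(items):
        item.relevant, item.summary = assess(item)  # e.g. a Gemini call
        store(item)                                 # e.g. a BigQuery insert
        if item.relevant:
            notify(item)                            # e.g. a Slack webhook post
```

Keeping the side-effecting integrations behind plain callables makes the control flow easy to exercise locally with stubs before wiring in the real GCP clients.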
Architecture & Deployment
The pipeline is designed to run in Google Cloud:
- Compute: Deployed as an HTTP-triggered Google Cloud Function (or containerized in Cloud Run).
- Triggers: Accepts HTTP POST payloads with different modes (e.g., `daily_run` or `summarize_pending`).
- Data: Reads configuration (such as `feeds.csv`) locally or from Google Cloud Storage (GCS).
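A minimal sketch of how the entry point might route those modes. The `dispatch` helper is hypothetical and deliberately framework-free so the routing logic is testable without GCP; the commented wrapper shows how it would plug into the Functions Framework target used in the local test below.

```python
# Hedged sketch: route the POST body's "mode" field to a pipeline action.
# Mode names match the payloads described above; the messages are placeholders.
def dispatch(payload: dict) -> str:
    mode = payload.get("mode", "daily_run")
    lookback = int(payload.get("lookback_days", 1))
    if mode == "daily_run":
        return f"ingesting feeds from the last {lookback} day(s)"
    if mode == "summarize_pending":
        return "summarizing stored items that lack a summary"
    raise ValueError(f"unknown mode: {mode}")

# With the Functions Framework, the HTTP target would wrap this helper:
#
# import functions_framework
#
# @functions_framework.http
# def run_news_job(request):
#     return dispatch(request.get_json(silent=True) or {})
```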
To test locally:

```shell
# Create and activate a virtual environment (Windows PowerShell shown)
python -m venv .venv
.\.venv\Scripts\Activate
pip install -r requirements.txt

# Set local environment variables for GCP, then start the local emulator
functions-framework --target run_news_job --port 8080

# Trigger via cURL
curl -X POST http://localhost:8080/ -H "Content-Type: application/json" -d '{"mode":"daily_run","lookback_days":1}'
```
What I Learned
- LLM Orchestration: Designing prompts so Gemini reliably performs both relevance classification and concise summarization in a single pass.
- Data Deduplication: Combining standard ID checking with semantic checks prevents flooding the database with minor updates to the same news story.
- Serverless Automation: Using the Google Cloud Functions Framework to emulate the cloud environment locally before deployment.
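The semantic-dedup check mentioned above can be sketched as a cosine-similarity comparison against already-stored embeddings. In the pipeline the vectors come from Vertex AI text embeddings; here they are plain lists of floats, and the 0.95 threshold is illustrative.

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity between two embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def is_semantic_duplicate(embedding, stored_embeddings, threshold=0.95):
    """Treat an item as a near-duplicate of an already-stored story if its
    embedding is very close to any stored one (threshold is a guess)."""
    return any(cosine_similarity(embedding, e) >= threshold
               for e in stored_embeddings)
```

Running this check before the BigQuery insert is what keeps minor rewrites of the same story from landing as separate rows.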