
Time Series Analysis for Reddit Data: Trends, Seasonality, and Forecasting

By @timeseries_analyst | February 18, 2026 | 23 min read

Reddit data exhibits rich temporal patterns: daily posting cycles, weekly engagement rhythms, and event-driven spikes. This guide covers time series techniques for trend detection, seasonality decomposition, anomaly identification, and forecasting Reddit metrics.

What You'll Learn

Time series preparation for Reddit data, seasonal decomposition, trend analysis, anomaly detection, and forecasting with Prophet and statistical methods.

Reddit Temporal Patterns

Pattern Type | Frequency | Reddit Example
Daily seasonality | 24-hour cycle | Peak posting at 10am-2pm EST
Weekly seasonality | 7-day cycle | Higher engagement weekdays vs weekends
Annual seasonality | Yearly cycle | Holiday spikes, summer slowdowns
Trend | Long-term | Subreddit growth/decline over months
Event-driven spikes | Irregular | Product launches, news events
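
If you want to sanity-check these patterns in your own data before modeling, a few groupby calls go a long way. This is a minimal sketch assuming the same posts_df with a Unix-timestamp created_utc column used throughout this guide.

import pandas as pd

# Quick look at hour-of-day and day-of-week posting patterns (sketch only).
dt = pd.to_datetime(posts_df['created_utc'], unit='s', utc=True)

hourly_profile = posts_df.groupby(dt.dt.hour).size()         # posts per hour of day (UTC)
weekday_profile = posts_df.groupby(dt.dt.day_name()).size()  # posts per weekday

print(f"Busiest hour (UTC): {hourly_profile.idxmax()}")
print(weekday_profile.sort_values(ascending=False).head(3))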

Data Preparation

$ pip install pandas statsmodels prophet pyod
Successfully installed statsmodels-0.14.0 prophet-1.1.5
import pandas as pd
import numpy as np
from datetime import datetime, timedelta
from typing import Optional

class RedditTimeSeriesPrep:
    """Prepare Reddit data for time series analysis."""

    def __init__(self, df: pd.DataFrame):
        self.df = df.copy()
        self._prepare_datetime()

    def _prepare_datetime(self):
        """Convert timestamps to datetime index."""
        if 'created_utc' in self.df.columns:
            self.df['datetime'] = pd.to_datetime(
                self.df['created_utc'],
                unit='s',
                utc=True
            )

    def aggregate_time_series(
        self,
        freq: str = 'H',
        agg_column: Optional[str] = None,
        agg_func: str = 'count'
    ) -> pd.Series:
        """
        Aggregate data into time series.

        Args:
            freq: Resampling frequency ('H'=hourly, 'D'=daily, 'W'=weekly)
            agg_column: Column to aggregate (None for count)
            agg_func: 'count', 'sum', 'mean', 'median'
        """
        ts = self.df.set_index('datetime')

        if agg_column is None or agg_func == 'count':
            series = ts.resample(freq).size()
        else:
            series = ts[agg_column].resample(freq).agg(agg_func)

        # Fill missing periods with 0
        series = series.fillna(0)

        return series

    def create_post_volume_series(self, freq: str = 'H') -> pd.Series:
        """Create post volume time series."""
        return self.aggregate_time_series(freq=freq)

    def create_engagement_series(self, freq: str = 'D') -> pd.DataFrame:
        """Create multi-metric engagement series."""
        ts = self.df.set_index('datetime')

        engagement_df = pd.DataFrame({
            'post_count': ts.resample(freq).size(),
            'total_score': ts['score'].resample(freq).sum(),
            'avg_score': ts['score'].resample(freq).mean(),
            'total_comments': ts['num_comments'].resample(freq).sum(),
            'avg_comments': ts['num_comments'].resample(freq).mean()
        }).fillna(0)

        return engagement_df

    def create_sentiment_series(
        self,
        freq: str = 'D',
        sentiment_col: str = 'sentiment'
    ) -> pd.DataFrame:
        """Create sentiment composition time series."""
        ts = self.df.set_index('datetime')

        # Pivot sentiment counts
        sentiment_counts = ts.groupby([
            pd.Grouper(freq=freq),
            sentiment_col
        ]).size().unstack(fill_value=0)

        # Add percentages
        total = sentiment_counts.sum(axis=1)
        for col in sentiment_counts.columns:
            sentiment_counts[f'{col}_pct'] = sentiment_counts[col] / total * 100

        return sentiment_counts

# Usage
prep = RedditTimeSeriesPrep(posts_df)
volume_series = prep.create_post_volume_series(freq='D')
engagement_df = prep.create_engagement_series(freq='D')
print(f"Time series length: {len(volume_series)} periods")

Seasonal Decomposition

Decompose Reddit time series into trend, seasonal, and residual components:

from statsmodels.tsa.seasonal import seasonal_decompose, STL
import matplotlib.pyplot as plt

class RedditSeasonalAnalysis:
    """Analyze seasonality in Reddit data."""

    def __init__(self, series: pd.Series):
        self.series = series
        self.decomposition = None

    def classical_decomposition(
        self,
        period: int = None,
        model: str = 'additive'
    ):
        """
        Classical seasonal decomposition.

        Args:
            period: Seasonality period (7 for weekly seasonality in daily data, 24 for daily seasonality in hourly data)
            model: 'additive' or 'multiplicative'
        """
        if period is None:
            # Detect frequency
            freq = self.series.index.freq or pd.infer_freq(self.series.index)
            if 'H' in str(freq):
                period = 24  # Daily seasonality for hourly data
            elif 'D' in str(freq):
                period = 7   # Weekly seasonality for daily data
            else:
                period = 7

        self.decomposition = seasonal_decompose(
            self.series,
            model=model,
            period=period,
            extrapolate_trend='freq'
        )

        return self.decomposition

    def stl_decomposition(self, period: int = 7, robust: bool = True):
        """
        STL decomposition (more robust to outliers).

        Args:
            period: Seasonal period
            robust: Use robust fitting (recommended for Reddit data)
        """
        stl = STL(
            self.series,
            period=period,
            robust=robust
        )

        self.decomposition = stl.fit()
        return self.decomposition

    def get_components(self) -> pd.DataFrame:
        """Get decomposition components as DataFrame."""
        if self.decomposition is None:
            raise ValueError("Run decomposition first")

        return pd.DataFrame({
            'observed': self.series,
            'trend': self.decomposition.trend,
            'seasonal': self.decomposition.seasonal,
            'residual': self.decomposition.resid
        })

    def calculate_seasonality_strength(self) -> float:
        """Calculate strength of seasonality (0-1)."""
        if self.decomposition is None:
            raise ValueError("Run decomposition first")

        # Strength of seasonality: 1 - Var(residual) / Var(seasonal + residual)
        resid = self.decomposition.resid.dropna()
        detrended = (self.decomposition.seasonal + self.decomposition.resid).dropna()

        detrended_var = np.var(detrended)
        if detrended_var == 0:
            return 0.0

        strength = max(0.0, 1 - np.var(resid) / detrended_var)
        return strength

    def plot_decomposition(self) -> plt.Figure:
        """Plot decomposition components."""
        if self.decomposition is None:
            raise ValueError("Run decomposition first")

        fig = self.decomposition.plot()
        fig.set_size_inches(12, 10)
        plt.tight_layout()
        return fig

# Usage
analyzer = RedditSeasonalAnalysis(volume_series)
decomp = analyzer.stl_decomposition(period=7)

print(f"Seasonality strength: {analyzer.calculate_seasonality_strength():.3f}")
analyzer.plot_decomposition()
plt.show()

Anomaly Detection

Detect unusual spikes or drops in Reddit activity:

from scipy import stats
import numpy as np

class RedditAnomalyDetector:
    """Detect anomalies in Reddit time series."""

    def __init__(self, series: pd.Series):
        self.series = series

    def zscore_detection(self, threshold: float = 3.0) -> pd.DataFrame:
        """
        Detect anomalies using Z-score method.

        Args:
            threshold: Z-score threshold (3.0 is common)
        """
        zscores = stats.zscore(self.series)

        anomalies = pd.DataFrame({
            'value': self.series,
            'zscore': zscores,
            'is_anomaly': np.abs(zscores) > threshold,
            'anomaly_type': np.where(
                zscores > threshold, 'spike',
                np.where(zscores < -threshold, 'drop', 'normal')
            )
        })

        return anomalies

    def iqr_detection(self, multiplier: float = 1.5) -> pd.DataFrame:
        """
        Detect anomalies using IQR method (robust to outliers).

        Args:
            multiplier: IQR multiplier (1.5 standard, 3.0 extreme)
        """
        q1 = self.series.quantile(0.25)
        q3 = self.series.quantile(0.75)
        iqr = q3 - q1

        lower_bound = q1 - multiplier * iqr
        upper_bound = q3 + multiplier * iqr

        anomalies = pd.DataFrame({
            'value': self.series,
            'lower_bound': lower_bound,
            'upper_bound': upper_bound,
            'is_anomaly': (self.series < lower_bound) | (self.series > upper_bound),
            'anomaly_type': np.where(
                self.series > upper_bound, 'spike',
                np.where(self.series < lower_bound, 'drop', 'normal')
            )
        })

        return anomalies

    def rolling_zscore_detection(
        self,
        window: int = 7,
        threshold: float = 2.5
    ) -> pd.DataFrame:
        """
        Detect anomalies using rolling window Z-score.
        Better for data with trends.

        Args:
            window: Rolling window size
            threshold: Z-score threshold
        """
        rolling_mean = self.series.rolling(window=window, center=True).mean()
        rolling_std = self.series.rolling(window=window, center=True).std()

        zscores = (self.series - rolling_mean) / rolling_std

        anomalies = pd.DataFrame({
            'value': self.series,
            'rolling_mean': rolling_mean,
            'rolling_std': rolling_std,
            'zscore': zscores,
            'is_anomaly': np.abs(zscores) > threshold,
            'anomaly_type': np.where(
                zscores > threshold, 'spike',
                np.where(zscores < -threshold, 'drop', 'normal')
            )
        })

        return anomalies

    def get_anomaly_summary(self, anomalies_df: pd.DataFrame) -> dict:
        """Summarize detected anomalies."""
        anomaly_mask = anomalies_df['is_anomaly']
        spike_mask = anomalies_df['anomaly_type'] == 'spike'

        summary = {
            'total_anomalies': int(anomaly_mask.sum()),
            'anomaly_rate': anomaly_mask.mean() * 100,
            'spikes': int(spike_mask.sum()),
            'drops': int((anomalies_df['anomaly_type'] == 'drop').sum()),
            'anomaly_dates': anomalies_df[anomaly_mask].index.tolist(),
            'max_spike': anomalies_df.loc[spike_mask, 'value'].max() if spike_mask.any() else None
        }

        return summary

# Usage
detector = RedditAnomalyDetector(volume_series)
anomalies = detector.rolling_zscore_detection(window=7, threshold=2.5)
summary = detector.get_anomaly_summary(anomalies)

print(f"Found {summary['total_anomalies']} anomalies ({summary['anomaly_rate']:.1f}%)")
print(f"Spikes: {summary['spikes']}, Drops: {summary['drops']}")

Anomaly Context

Not all statistical anomalies are meaningful. Cross-reference detected spikes with external events (product launches, news, holidays) to understand causes. A "spike" on Christmas Day might be expected behavior, not an anomaly.
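
One lightweight way to add that context is to join detected anomaly dates against a hand-maintained event calendar. A minimal sketch, assuming the anomalies DataFrame from the detector above; the events listed here are purely illustrative.

import pandas as pd

# Hypothetical event calendar; replace with your own launch/news/holiday dates.
events = pd.DataFrame({
    'date': pd.to_datetime(['2025-11-28', '2025-12-25'], utc=True),
    'event': ['Black Friday', 'Christmas Day']
})

flagged = anomalies[anomalies['is_anomaly']].copy()
flagged['date'] = flagged.index.normalize()  # strip time-of-day for a date-level join

labeled = flagged.merge(events, on='date', how='left')
print(labeled[['date', 'value', 'anomaly_type', 'event']])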

Forecasting with Prophet

Facebook Prophet is excellent for Reddit data with multiple seasonalities:

from prophet import Prophet
from prophet.diagnostics import cross_validation, performance_metrics

class RedditForecaster:
    """Forecast Reddit metrics using Prophet."""

    def __init__(self, series: pd.Series):
        self.series = series
        self.model = None
        self.forecast = None

    def prepare_prophet_data(self) -> pd.DataFrame:
        """Convert series to Prophet format."""
        df = pd.DataFrame({
            'ds': self.series.index.tz_localize(None) if self.series.index.tz else self.series.index,
            'y': self.series.values
        })
        return df

    def train(
        self,
        yearly_seasonality: bool = True,
        weekly_seasonality: bool = True,
        daily_seasonality: bool = False,
        holidays: pd.DataFrame = None
    ):
        """
        Train Prophet model.

        Args:
            yearly_seasonality: Include yearly patterns
            weekly_seasonality: Include weekly patterns
            daily_seasonality: Include daily patterns (for hourly data)
            holidays: DataFrame with holiday effects
        """
        self.model = Prophet(
            yearly_seasonality=yearly_seasonality,
            weekly_seasonality=weekly_seasonality,
            daily_seasonality=daily_seasonality,
            holidays=holidays,             # Custom holiday effects, if provided
            changepoint_prior_scale=0.05,  # Flexibility of trend
            seasonality_prior_scale=10,    # Flexibility of seasonality
            interval_width=0.95            # 95% uncertainty interval
        )

        # Built-in country holidays can be added instead of a custom frame:
        # self.model.add_country_holidays(country_name='US')

        # Add custom seasonality for Reddit
        self.model.add_seasonality(
            name='monthly',
            period=30.5,
            fourier_order=5
        )

        # Fit model
        prophet_df = self.prepare_prophet_data()
        self.model.fit(prophet_df)

        return self.model

    def predict(self, periods: int = 30, freq: str = 'D') -> pd.DataFrame:
        """Generate forecast for future periods."""
        if self.model is None:
            raise ValueError("Train model first")

        future = self.model.make_future_dataframe(periods=periods, freq=freq)
        self.forecast = self.model.predict(future)

        return self.forecast

    def cross_validate(
        self,
        initial: str = '365 days',
        period: str = '30 days',
        horizon: str = '30 days'
    ) -> pd.DataFrame:
        """Perform time series cross-validation."""
        if self.model is None:
            raise ValueError("Train model first")

        cv_results = cross_validation(
            self.model,
            initial=initial,
            period=period,
            horizon=horizon
        )

        metrics = performance_metrics(cv_results)
        return metrics

    def get_forecast_summary(self, periods: int = 7) -> dict:
        """Summarize forecast for specified periods."""
        if self.forecast is None:
            raise ValueError("Generate forecast first")

        future_only = self.forecast.tail(periods)

        return {
            'dates': future_only['ds'].tolist(),
            'predictions': future_only['yhat'].round(2).tolist(),
            'lower_bound': future_only['yhat_lower'].round(2).tolist(),
            'upper_bound': future_only['yhat_upper'].round(2).tolist(),
            'avg_prediction': future_only['yhat'].mean(),
            'trend_direction': 'increasing' if future_only['yhat'].iloc[-1] > future_only['yhat'].iloc[0] else 'decreasing'
        }

    def plot_forecast(self):
        """Plot forecast and components; returns (forecast_fig, components_fig)."""
        if self.forecast is None:
            raise ValueError("Generate forecast first")

        fig1 = self.model.plot(self.forecast)
        fig2 = self.model.plot_components(self.forecast)

        return fig1, fig2

# Usage
forecaster = RedditForecaster(volume_series)
forecaster.train()
forecast = forecaster.predict(periods=14)

summary = forecaster.get_forecast_summary(periods=7)
print(f"Next 7 days: {summary['avg_prediction']:.0f} avg posts/day")
print(f"Trend: {summary['trend_direction']}")

# Cross-validation
metrics = forecaster.cross_validate()
print(f"MAPE: {metrics['mape'].mean():.2%}")

Model Evaluation Metrics

Metric | Formula | Good Value | Use Case
MAE (Mean Absolute Error) | mean(abs(actual - predicted)) | Lower is better | General accuracy, same units as data
RMSE (Root Mean Square Error) | sqrt(mean((actual - predicted)^2)) | Lower is better | Penalizes large errors
MAPE (Mean Absolute Percentage Error) | mean(abs((actual - predicted) / actual)) * 100 | <10% good, <20% acceptable | Scale-independent accuracy
Coverage | % of actuals within the prediction interval | Close to interval width (e.g., 95%) | Uncertainty estimation
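
For reference, here is a small sketch of these metrics in plain NumPy. The actual, predicted, lower, and upper arguments are illustrative placeholders for date-aligned actuals, point forecasts, and interval bounds; they are not defined elsewhere in this guide.

import numpy as np

def forecast_metrics(actual, predicted, lower=None, upper=None):
    """Compute MAE, RMSE, MAPE, and interval coverage for aligned arrays."""
    actual = np.asarray(actual, dtype=float)
    predicted = np.asarray(predicted, dtype=float)

    metrics = {
        'mae': np.mean(np.abs(actual - predicted)),
        'rmse': np.sqrt(np.mean((actual - predicted) ** 2)),
        'mape': np.mean(np.abs((actual - predicted) / actual)) * 100  # undefined when actual == 0
    }

    if lower is not None and upper is not None:
        metrics['coverage'] = np.mean((actual >= lower) & (actual <= upper)) * 100

    return metrics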

Pro Tip: Multiple Baselines

Compare your forecast against simple baselines: naive (yesterday's value), seasonal naive (same day last week), and mean (historical average). Your model should beat these to be valuable.
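
A rough sketch of that comparison for the daily volume series, using an illustrative 14-day holdout:

import numpy as np

# Split off the last 14 days as a holdout (illustrative split).
train, test = volume_series.iloc[:-14], volume_series.iloc[-14:]

naive = np.repeat(train.iloc[-1], len(test))                           # yesterday's value carried forward
seasonal_naive = train.iloc[-7:].to_numpy()[np.arange(len(test)) % 7]  # same weekday last week
mean_baseline = np.repeat(train.mean(), len(test))                     # historical average

for name, baseline in [('naive', naive), ('seasonal naive', seasonal_naive), ('mean', mean_baseline)]:
    mae = np.mean(np.abs(test.to_numpy() - baseline))
    print(f"{name}: MAE = {mae:.1f}")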

Track Reddit Trends Automatically

reddapi.dev provides built-in trend analysis and alerts. Monitor topics, detect emerging discussions, and track sentiment changes without building your own time series pipeline.


Frequently Asked Questions

What time frequency should I use for Reddit data?

It depends on your use case. For real-time monitoring, use hourly data. For trend analysis, daily aggregation works well. For long-term patterns, weekly data reduces noise. Match your forecast horizon to your frequency—don't forecast months ahead with hourly data.

How do I handle missing periods in Reddit data?

Fill missing periods with zeros for count data (post volume) or interpolate for continuous data (sentiment scores). Prophet handles missing data automatically. For significant gaps, consider excluding that period or treating the pre- and post-gap data as separate series.
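
A minimal sketch of both approaches, reusing names from earlier examples (series built with resample are usually already gap-free; asfreq matters when the series was assembled another way):

# Count data: a missing day means zero posts.
daily_volume = volume_series.asfreq('D').fillna(0)

# Continuous data (e.g., a daily average-sentiment series, name illustrative):
# daily_sentiment = sentiment_series.asfreq('D').interpolate(method='time')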

Why does my forecast not capture sudden spikes?

Statistical forecasts predict "normal" behavior, not one-time events. For event-driven spikes, you need: (1) external regressors (e.g., planned product launches), (2) anomaly detection separate from forecasting, or (3) ensemble models that combine statistical forecasts with event calendars.
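
As an illustration of option (1), Prophet supports external regressors via add_regressor. A minimal sketch; the launch dates and the launch_flag column are hypothetical:

from prophet import Prophet
import pandas as pd

df = forecaster.prepare_prophet_data()
df['launch_flag'] = df['ds'].isin(pd.to_datetime(['2025-10-01', '2026-01-15'])).astype(int)

m = Prophet(weekly_seasonality=True)
m.add_regressor('launch_flag')  # the regressor must also be supplied for future dates
m.fit(df)

future = m.make_future_dataframe(periods=14)
future['launch_flag'] = future['ds'].isin(pd.to_datetime(['2026-03-01'])).astype(int)
forecast_with_events = m.predict(future)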

How far ahead can I reliably forecast Reddit metrics?

Forecast reliability decreases rapidly with horizon. For Reddit data, 7-14 days is reasonable for daily data; beyond that, wide confidence intervals make forecasts less actionable. For longer horizons, focus on trends rather than specific values.

Should I forecast each subreddit separately?

Generally yes, if you have enough data per subreddit (3+ months of daily data). Each subreddit has unique posting patterns. However, hierarchical models can share strength across subreddits when individual data is sparse. Start with separate models and combine if needed.
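
A sketch of the separate-models approach, assuming posts_df has a subreddit column and reusing the classes defined above:

# Train one forecaster per subreddit with enough history (sketch only).
forecasts = {}
for name, group in posts_df.groupby('subreddit'):
    series = RedditTimeSeriesPrep(group).create_post_volume_series(freq='D')
    if len(series) < 90:  # skip subreddits with under ~3 months of daily data
        continue
    f = RedditForecaster(series)
    f.train()
    forecasts[name] = f.predict(periods=14)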