Learning Prometheus, Thanos & Loki: Monitoring & Logging Notes from a Beginner
Table of Contents
- What is Prometheus – Overview of its role, components, and how it collects metrics
- What is Thanos – How it adds storage, scalability, and high availability to Prometheus
- What is Loki – How it handles logs and differs from Prometheus
- References – Sources used in this note
What is Prometheus?
Prometheus is an open-source monitoring and alerting toolkit built around a time-series database (TSDB). It collects numeric metrics from various systems at regular intervals and stores them as time series: each series is identified by a metric name plus a set of key-value labels, and each sample carries a timestamp. These metrics can then be queried using PromQL, its query language.
Prometheus works especially well in cloud-native environments, and has strong support for Kubernetes. In many setups, it’s managed using the Prometheus Operator, which simplifies deployment and configuration.
Prometheus Components
Prometheus Server
The core of Prometheus. Responsible for scraping metrics from configured endpoints and storing them in its internal time-series database (TSDB).
PromQL
The query language used to extract and analyze time-series data. It supports filtering by label, aggregation across series, and arithmetic between series, so a single expression can answer questions such as "what is the error rate per service?".
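To make filtering and aggregation more concrete, here is a minimal Python sketch of what an expression like `sum by (job) (http_requests_total{status="500"})` does conceptually. The metric name, labels, and values are made up for illustration, and this is not how the Prometheus query engine is actually implemented.

```python
from collections import defaultdict

# Hypothetical instant vector: one (labels, value) pair per series.
samples = [
    ({"__name__": "http_requests_total", "job": "api", "status": "500"}, 3.0),
    ({"__name__": "http_requests_total", "job": "api", "status": "200"}, 120.0),
    ({"__name__": "http_requests_total", "job": "web", "status": "500"}, 1.0),
    ({"__name__": "http_requests_total", "job": "web", "status": "500", "instance": "b"}, 4.0),
]


def select(samples, name, **matchers):
    """Keep series whose metric name and labels equal the given matchers."""
    return [
        (labels, value)
        for labels, value in samples
        if labels.get("__name__") == name
        and all(labels.get(k) == v for k, v in matchers.items())
    ]


def sum_by(samples, label):
    """Add up values per distinct value of one label."""
    totals = defaultdict(float)
    for labels, value in samples:
        totals[labels.get(label, "")] += value
    return dict(totals)


result = sum_by(select(samples, "http_requests_total", status="500"), "job")
print(result)
```

The selector step narrows the data by labels first, and the aggregation step then collapses the remaining series into one value per `job`.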
Prometheus Scraping
Prometheus collects data by pulling (scraping) metrics from HTTP endpoints at regular intervals. Targets and intervals are defined in a YAML configuration file.
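As a rough illustration of what one scrape returns, the sketch below parses a small response body in the Prometheus text exposition format into `(name, labels, value)` tuples. The sample body is invented, and the parser is deliberately simplified: it assumes no timestamps and no escaped quotes inside label values, which a real parser has to handle.

```python
import re

# Hypothetical response body from a target's /metrics endpoint.
body = """\
# HELP http_requests_total Total HTTP requests.
# TYPE http_requests_total counter
http_requests_total{method="get",code="200"} 1027
http_requests_total{method="post",code="500"} 3
process_open_fds 12
"""

# metric_name, optional {labels}, then a single value token.
LINE = re.compile(r'^([a-zA-Z_:][a-zA-Z0-9_:]*)(?:\{(.*)\})?\s+(\S+)$')
LABEL = re.compile(r'([a-zA-Z_][a-zA-Z0-9_]*)="([^"]*)"')


def parse(body):
    """Turn exposition text into a list of (name, labels, value) tuples."""
    series = []
    for line in body.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blanks and HELP/TYPE comments
        match = LINE.match(line)
        if match is None:
            continue  # skip anything this simplified parser cannot read
        name, raw_labels, value = match.groups()
        labels = dict(LABEL.findall(raw_labels or ""))
        series.append((name, labels, float(value)))
    return series


parsed = parse(body)
for name, labels, value in parsed:
    print(name, labels, value)
```

Prometheus attaches the scrape time as the sample timestamp, which is why the lines themselves usually carry only a name, labels, and a value.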
Alertmanager
Manages alerts triggered by Prometheus rules. It handles deduplication, grouping, silencing, and routes notifications to external systems such as Slack, email, or PagerDuty.
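The deduplication and grouping steps can be sketched in a few lines of Python. The alerts and label names below are invented, and the function is only a toy model of the idea; the `group_by` argument is named after the `group_by` setting in Alertmanager routes, but nothing else here reflects Alertmanager's real implementation.

```python
from collections import defaultdict

# Hypothetical firing alerts, represented by their label sets. The first two
# are identical, e.g. sent by two Prometheus replicas running the same rule.
alerts = [
    {"alertname": "HighLatency", "cluster": "eu", "instance": "a"},
    {"alertname": "HighLatency", "cluster": "eu", "instance": "a"},
    {"alertname": "HighLatency", "cluster": "eu", "instance": "b"},
    {"alertname": "DiskFull", "cluster": "us", "instance": "c"},
]


def group_alerts(alerts, group_by):
    """Drop alerts with identical label sets, then bucket the rest."""
    groups = defaultdict(list)
    seen = set()
    for alert in alerts:
        fingerprint = tuple(sorted(alert.items()))
        if fingerprint in seen:
            continue  # same label set already seen: deduplicate
        seen.add(fingerprint)
        key = tuple(alert.get(label, "") for label in group_by)
        groups[key].append(alert)
    return dict(groups)


groups = group_alerts(alerts, group_by=["alertname", "cluster"])
for key, members in groups.items():
    print(key, len(members))
```

Each group would then become a single notification, so an outage affecting many instances produces one message instead of one per instance.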
Exporters
Software components that expose metrics from third-party services (e.g., databases, hardware), so Prometheus can scrape them.
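Conceptually, an exporter translates a service's own statistics into the text format Prometheus scrapes. The sketch below shows that translation step only, under the assumption of a made-up database that reports two counters; the `mydb` prefix, the stats, and the `render_metrics` helper are all hypothetical, and a real exporter would also serve this text over HTTP.

```python
def render_metrics(prefix, stats, labels):
    """Render a dict of numeric stats as exposition-format lines."""
    label_text = ",".join(f'{k}="{v}"' for k, v in sorted(labels.items()))
    lines = []
    for name, value in sorted(stats.items()):
        lines.append(f"{prefix}_{name}{{{label_text}}} {value}")
    return "\n".join(lines) + "\n"


# Hypothetical numbers read from a third-party service's status API.
db_stats = {"connections": 7, "slow_queries": 2}
text = render_metrics("mydb", db_stats, {"instance": "db-1"})
print(text, end="")
```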
Pushgateway
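Used when jobs can't be scraped directly, typically short-lived batch jobs that exit before Prometheus would reach them. Such jobs push their metrics to the Pushgateway, which holds them until Prometheus scrapes the gateway like any other target.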
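A toy model of this flow is sketched below: jobs push to the gateway, the gateway remembers the latest values, and a scraper later reads them. The `ToyGateway` class and the job and metric names are hypothetical, and the "a push replaces the whole group" behaviour is a simplification chosen for the example.

```python
class ToyGateway:
    """In-memory stand-in for a push gateway (illustration only)."""

    def __init__(self):
        self._groups = {}  # job name -> {metric name: value}

    def push(self, job, metrics):
        # In this toy model a push replaces everything stored for that job.
        self._groups[job] = dict(metrics)

    def scrape(self):
        # What a scraper would see: the last pushed value of every metric.
        return [
            (name, {"job": job}, value)
            for job, metrics in sorted(self._groups.items())
            for name, value in sorted(metrics.items())
        ]


gateway = ToyGateway()
gateway.push("nightly_backup", {"backup_duration_seconds": 42.5})
gateway.push("nightly_backup", {"backup_duration_seconds": 40.0})  # overwrites
gateway.push("db_migration", {"migration_success": 1.0})

scraped = gateway.scrape()
for row in scraped:
    print(row)
```

One consequence visible even in this sketch is that the gateway keeps serving the last pushed value after the job has exited, so stale values stay visible until they are overwritten or deleted.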
Prometheus Operator
A Kubernetes-native tool for automating the deployment and management of Prometheus and Alertmanager instances within Kubernetes environments.
Prometheus Storage
Prometheus stores metrics in its built-in time-series database on local disk. It is optimized for fast writes and queries on a single node, but it is not designed for long-term retention or replication; by default it keeps 15 days of data. For long-term storage, tools like Thanos are commonly used.
What is Thanos?
Thanos is an open-source project that extends Prometheus to address its main limitations: long-term storage, scalability, and high availability.
It’s not a replacement for Prometheus, but more like an add-on layer. Thanos integrates directly with Prometheus by adding a few lightweight components that make the system much more powerful and production-ready.
Benefits of Using Thanos
Problem with Prometheus | How Thanos Solves It |
---|---|
Limited Scalability | Prometheus instances are isolated by default. Thanos Querier provides a global query view by aggregating data from multiple Prometheus servers, making it easier to scale across clusters or regions. |
No Built-in High Availability | If a single Prometheus instance fails, scraping stops and recent local data can be lost. With Thanos you can run two or more identical Prometheus replicas: the Querier merges their overlapping data at query time, and each Sidecar uploads data to remote object storage for durability. |
Short-Term Data Retention | Prometheus stores data locally, which limits how long metrics can be retained. Thanos enables long-term storage by offloading old data to services like AWS S3 or GCS. This supports retention over months or years. |
No Downsampling or Deduplication | As data grows, long-range queries slow down, and replicated Prometheus servers return the same series twice. The Thanos Compactor downsamples old data, and the Querier deduplicates metrics collected from multiple Prometheus replicas at query time. This keeps queries fast and accurate. |
Storage Extension & API Compatibility | Thanos provides a Store Gateway that reads historical data from object storage and exposes it via a Prometheus-compatible API, making it easy to integrate with existing dashboards like Grafana. |
Thanos Components
Thanos Sidecar
Runs alongside each Prometheus instance. It uploads metrics data to remote object storage (like S3 or GCS) and makes the local Prometheus data accessible to the rest of the Thanos system.
Thanos Querier
Provides a unified global query layer across multiple Prometheus instances and remote storage. This is where users send their queries.
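Merging data from replicated Prometheus servers can be pictured as follows. This is a minimal sketch, assuming each replica marks its series with a `replica` label and that both replicas report samples at identical timestamps; real replicas scrape at slightly different times, and Thanos uses a more careful merging strategy than "first value wins".

```python
def deduplicate(series, replica_label="replica"):
    """Merge series that differ only in the replica label."""
    merged = {}
    for labels, samples in series:
        # Series identity without the replica label.
        key = tuple(sorted((k, v) for k, v in labels.items() if k != replica_label))
        by_ts = merged.setdefault(key, {})
        for ts, value in samples:
            by_ts.setdefault(ts, value)  # first replica to report a timestamp wins
    return {key: sorted(by_ts.items()) for key, by_ts in merged.items()}


# Hypothetical data: replica "a" stopped after t=15, replica "b" started at t=15.
series = [
    ({"__name__": "up", "job": "api", "replica": "a"}, [(0, 1.0), (15, 1.0)]),
    ({"__name__": "up", "job": "api", "replica": "b"}, [(15, 1.0), (30, 1.0), (45, 0.0)]),
]

result = deduplicate(series)
for key, samples in result.items():
    print(dict(key), samples)
```

The caller sees one continuous `up{job="api"}` series even though neither replica holds all of the samples.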
Thanos Compactor
Optimizes and manages stored data by compacting time blocks and downsampling older metrics. This reduces storage usage and speeds up long-range queries.
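Downsampling can be sketched as bucketing raw samples into fixed windows and keeping a few aggregates per window instead of every point. The 300-second window and the sample values are assumptions for the example, and the set of aggregates shown here is illustrative rather than a description of the on-disk format Thanos uses.

```python
def downsample(samples, window):
    """Reduce (timestamp, value) samples to one aggregate record per window."""
    buckets = {}
    for ts, value in samples:
        start = ts - ts % window  # start of the window this sample falls into
        agg = buckets.get(start)
        if agg is None:
            buckets[start] = {"count": 1, "sum": value, "min": value, "max": value}
        else:
            agg["count"] += 1
            agg["sum"] += value
            agg["min"] = min(agg["min"], value)
            agg["max"] = max(agg["max"], value)
    return dict(sorted(buckets.items()))


# Hypothetical raw samples: (seconds, value).
raw = [(0, 1.0), (60, 3.0), (120, 2.0), (300, 10.0), (360, 4.0)]
result = downsample(raw, window=300)
for start, agg in result.items():
    print(start, agg)
```

Keeping `count` and `sum` alongside `min` and `max` means averages can still be computed later, while a year-long query touches far fewer points.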
Thanos Store Gateway
Connects to remote object storage and serves historical metric data back to the Querier. Even if local Prometheus no longer stores the data, it can still be queried.
Thanos Query Frontend
Sits in front of the Querier and improves query performance by splitting large range queries into smaller ones, running them in parallel, and caching results. Useful in high-load or multi-user environments.
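The splitting step can be sketched as cutting one long time range into interval-aligned pieces that can be executed independently. The one-day interval and the `split_range` helper are assumptions for illustration, not the component's actual code.

```python
def split_range(start, end, interval):
    """Split [start, end) into sub-ranges aligned to multiples of interval."""
    ranges = []
    cursor = start
    while cursor < end:
        boundary = (cursor // interval + 1) * interval  # next aligned boundary
        ranges.append((cursor, min(boundary, end)))
        cursor = boundary
    return ranges


DAY = 24 * 3600
# A query from midday on day 0 to the end of day 2, in seconds.
parts = split_range(start=DAY // 2, end=3 * DAY, interval=DAY)
for part in parts:
    print(part)
```

Aligning the pieces to fixed boundaries matters for caching: two users asking for overlapping ranges produce some identical sub-queries, whose results can be reused.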
What is Loki?
Loki is a horizontally scalable, highly available, and multi-tenant log aggregation system inspired by Prometheus. While Prometheus focuses on collecting and querying metrics, Loki focuses on logs. Another key difference is how data is collected: Loki uses a push-based model, meaning logs are pushed to it by agents like Promtail or Grafana Alloy, rather than being scraped as in Prometheus.
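To show what "pushing" looks like in practice, the sketch below builds the kind of JSON body an agent sends to Loki's HTTP push endpoint (`/loki/api/v1/push`): a list of streams, each with a label set and a list of `[timestamp in nanoseconds as a string, log line]` pairs. The labels and log lines are invented, the `build_push_payload` helper is hypothetical, and the example only constructs the payload without sending it anywhere.

```python
import json
import time


def build_push_payload(labels, lines, now_ns=None):
    """Build a push request body for one log stream."""
    if now_ns is None:
        now_ns = time.time_ns()
    # Give each line a distinct, increasing nanosecond timestamp.
    values = [[str(now_ns + i), line] for i, line in enumerate(lines)]
    return {"streams": [{"stream": dict(labels), "values": values}]}


payload = build_push_payload(
    {"app": "api", "env": "prod"},
    ["GET /health 200", "POST /orders 500"],
    now_ns=1_700_000_000_000_000_000,  # fixed value so the output is repeatable
)
print(json.dumps(payload, indent=2))
```

An agent would POST this body with a JSON content type; batching many lines per request is what keeps the push model cheap.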
Unlike traditional log systems (e.g., ELK), Loki does not index the full content of logs. Instead, it indexes only a set of labels (metadata) for each log stream. This lightweight approach makes Loki more efficient and easier to operate at scale.
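The label-only indexing idea can be sketched as a two-step lookup: first pick streams by their labels, then scan the text of just those streams. This roughly mirrors what a LogQL query such as `{app="api", env="prod"} |= "500"` expresses. The streams and log lines are made up, and the in-memory dictionary is only a stand-in for Loki's real index and chunk storage.

```python
# Hypothetical log streams. Each stream is identified by its label set,
# and only those labels are "indexed"; the log lines themselves are not.
streams = {
    (("app", "api"), ("env", "prod")): [
        "GET /health 200",
        "POST /orders 500 timeout talking to db",
    ],
    (("app", "api"), ("env", "dev")): ["POST /orders 500 bad fixture"],
    (("app", "web"), ("env", "prod")): ["GET / 200"],
}


def query(streams, selector, contains):
    """Select streams by label, then filter their lines by substring."""
    hits = []
    for labels, lines in streams.items():
        label_map = dict(labels)
        # Step 1: cheap label match, no log content is read.
        if any(label_map.get(k) != v for k, v in selector.items()):
            continue
        # Step 2: scan the content of the matching streams only.
        hits.extend(line for line in lines if contains in line)
    return hits


hits = query(streams, {"app": "api", "env": "prod"}, contains="500")
print(hits)
```

The trade-off is visible here: the index stays tiny, but a query with a broad label selector has to read a lot of log text, so well-chosen labels matter.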
Loki vs. Prometheus
Data Types and Collection Methods
Feature | Prometheus | Loki |
---|---|---|
Data Type | Structured numerical metrics (time-series) | Unstructured log data (text) |
Collection Model | Pull-based (scrapes metrics from targets) | Push-based (agents send logs to Loki) |
Best Use Case | Real-time monitoring of performance and health | Debugging, incident investigation, forensic analysis |
Storage Approach and Efficiency
Feature | Prometheus | Loki |
---|---|---|
Indexing Strategy | Indexes every series by metric name and labels | Indexes only metadata (labels), not full log content |
Storage Method | Compressed time-series database (TSDB) | Compressed chunks stored in object storage (e.g., S3, GCS) |
Cost Model | Efficient for numeric data; storage grows with metric cardinality | Small index and cheap object storage keep ingestion and retention costs low; the trade-off is that queries scan log content at read time |
Retention | Set by local TSDB retention settings (time-based or size-based) | Object storage enables long-term retention at lower cost |