Monitoring Logs with Loki, Promtail & Grafana

By: Aayush Pokharel

CNCF Kathmandu - Dec 28, 2024

About Me

  • Aayush Pokharel
  • DevOps Engineer
  • STARTsmall Pvt. Ltd.

Today's Topic of Discussion

  • Background
  • How does Loki work?
  • What is Promtail? What does it do?
  • Differences between the 3 Loki helm charts
  • Implement Loki-distributed helm chart in minikube cluster
  • Implement Promtail
  • How to configure Loki as a data source in Grafana
  • Visualize Loki Logs in Grafana

What are Logs?

  • A record of events happening in a system.

Types of Logs

  • Structured (e.g., JSON):
{
  "timestamp": "2024-10-27T10:30:00Z",
  "level": "INFO",
  "service": "order-processing",
  "component": "payment-gateway",
  "operation": "authorize-payment",
  "order_id": "12345",
  "customer_id": "67890",
  "amount": 25.99,
  "currency": "USD",
  "status": "SUCCESS",
  "message": "Payment authorization successful.",
  "details": {
    "transaction_id": "ABCDEFG123",
    "payment_method": "CreditCard",
    "auth_code": "XYZ123"
  }
}

Types of Logs

  • Unstructured (plain text):
2024-10-27 10:30:00 INFO: Payment authorization successful for order 12345, customer 67890, amount $25.99. Transaction ID: ABCDEFG123.

Logs in Monolith Systems

  • Logs were stored as simple text files on a single server.

  • Accessed via SSH with commands like tail -f /var/log/app.log

  • Basic tools used for log management:

    • syslog - Centralized logging system
    • logrotate - Automated log file rotation.
    • grep / awk / sed - Manual searching and filtering

Challenges in Modern Times

  • Microservices & Containers: Logs are spread across multiple services and ephemeral containers.
  • Cloud Environment: Logs exist across multiple regions and instances.

What is Grafana Loki?

  • Loki is a log aggregation system designed for scalable and efficient log management.
  • Built by Grafana Labs, it integrates seamlessly with Grafana for log visualization.
  • Inspired by Prometheus, but for logs:
    • Label-based log organization.
    • No full-text indexing, making it cost-efficient.

Key Features

  • Scalability: Handles large volumes of logs.
  • Cost-Efficiency: Avoids full-text indexing.
  • Multi-Tenancy: Supports isolated log streams for different users.
  • Query Language: Uses LogQL, a Prometheus-like language for querying logs.
  • Integration: Works seamlessly with Prometheus and Grafana.

Loki Architecture

Core Components

  • Distributor
  • Ingester
  • Query Frontend
  • Querier
  • Compactor
  • Ruler

Data Flow

Log Ingestion

  • Log ingestion is the entry point where logs are received by Loki.
  • HTTP API endpoint for receiving log streams

Role of Promtail:

  • Promtail is an agent that collects logs from local sources (e.g., log files, systemd journal) and ships them to a Grafana Loki instance for storage and querying.

Log Discovery:

  • Promtail needs to discover which logs to collect. This is done through service discovery.
  • It supports static discovery (manual configuration) and Kubernetes discovery (fetching labels from the Kubernetes API server).

Log File Discovery:

  • Promtail can tail logs from local log files and the systemd journal (on ARM and AMD64 machines).
  • It uses scrape_configs to configure which files to monitor, similar to how Prometheus scrapes metrics.
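
A minimal Promtail configuration sketch showing static file discovery; the paths and Loki URL here are placeholders, not values from the talk:

# Minimal Promtail config sketch -- paths and URLs are placeholders
server:
  http_listen_port: 9080

positions:
  filename: /var/log/positions.yaml   # where Promtail records read offsets

clients:
  - url: http://loki:3100/loki/api/v1/push   # Loki push endpoint

scrape_configs:
  - job_name: system
    static_configs:
      - targets: [localhost]
        labels:
          job: varlogs
          __path__: /var/log/*.log   # glob of files to tail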

Labeling and Metadata:

  • During discovery, Promtail attaches labels (metadata) to logs, such as the pod name, filename, or container name.
  • Labels help organize logs for easier querying in Loki.
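
For Kubernetes discovery, Promtail reuses Prometheus-style service discovery and relabeling. A sketch using the standard __meta_kubernetes_* metadata labels:

scrape_configs:
  - job_name: kubernetes-pods
    kubernetes_sd_configs:
      - role: pod                      # discover pods via the API server
    relabel_configs:
      - source_labels: [__meta_kubernetes_pod_name]
        target_label: pod              # attach pod name as a label
      - source_labels: [__meta_kubernetes_pod_container_name]
        target_label: container
      - source_labels: [__meta_kubernetes_namespace]
        target_label: namespace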

Log Shipping:

  • Once Promtail has discovered the logs and attached labels, it tails the logs, continuously reading them.
  • When enough data is collected, it is flushed as a batch to Loki.
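
Batching is tunable in Promtail's client section; a sketch with the documented batch settings (values here are illustrative):

clients:
  - url: http://loki:3100/loki/api/v1/push
    batchwait: 1s        # flush a batch at least this often
    batchsize: 1048576   # ...or once it reaches ~1 MiB of log data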

Handling Log Offsets:

  • Promtail tracks the last read position using a positions file (e.g., /var/log/positions.yaml).
  • This allows Promtail to resume reading from where it left off if it crashes or restarts.
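
The positions file itself is plain YAML mapping each tailed file to the last byte offset read. An illustrative example (offsets are hypothetical; Promtail maintains this file itself):

# /var/log/positions.yaml
positions:
  /var/log/app.log: "53278"            # byte offset of the last line read
  /var/log/nginx/access.log: "104822"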

Log Processing:

  • Promtail can parse logs and modify their content using pipeline stages.
  • This allows for more advanced operations like correcting timestamps, adding labels, or even rewriting log lines.
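
A sketch of pipeline stages that parse the JSON log shown earlier, promote its level field to a label, and correct the timestamp; the field names assume that JSON shape:

scrape_configs:
  - job_name: app
    pipeline_stages:
      - json:
          expressions:
            level: level       # extract fields from the JSON body
            ts: timestamp
      - labels:
          level:               # promote "level" to an indexed label
      - timestamp:
          source: ts           # use the log's own timestamp
          format: RFC3339
    static_configs:
      - targets: [localhost]
        labels:
          job: app
          __path__: /var/log/app/*.log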

Receiving Logs from Syslog:

  • Promtail can also receive logs from Syslog by listening on a configured port.
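
A sketch of a syslog listener; the port is arbitrary, and Promtail expects RFC5424-formatted syslog by default:

scrape_configs:
  - job_name: syslog
    syslog:
      listen_address: 0.0.0.0:1514   # arbitrary listening port
      labels:
        job: syslog
    relabel_configs:
      - source_labels: [__syslog_message_hostname]
        target_label: host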

Loki Push API:

  • Promtail can be configured to receive logs from other Promtail instances or Loki clients via the Loki Push API.
  • This is useful in complex network setups or for serverless environments.
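
A sketch of the loki_push_api scrape config, which makes this Promtail instance itself accept pushes from other Promtails or Loki clients (port and label are illustrative):

scrape_configs:
  - job_name: push
    loki_push_api:
      server:
        http_listen_port: 3500   # other clients push to this port
      labels:
        source: push_receiver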

Loki Components

Distributor

  • Receives logs from clients.
  • Splits logs into chunks and assigns them to ingesters.
  • Uses consistent hashing for log stream distribution.

Example

Log Collector Agent (Promtail) sends a batch of logs to Loki’s Distributor via an HTTP POST request. Each log stream includes:

  • Tenant ID: dev-team
  • Labels: {app="backend", env="production"}
  • Log Entries: Timestamps and log messages.

Distributor Validates Incoming Logs

  • Check Labels: Ensures that labels conform to Prometheus standards (e.g., no invalid characters or duplicates).
  • Verify Timestamps: Confirms that log timestamps are neither too old nor too far in the future.
  • Rate Limiting: Checks the data ingest rate for the dev-team tenant.
  • Normalize Labels: Sorts labels alphabetically to enable deterministic caching and hashing.
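
These validation and rate-limit checks are driven by Loki's limits_config; a minimal sketch with illustrative values:

limits_config:
  ingestion_rate_mb: 10            # per-tenant ingest rate (MB/s)
  ingestion_burst_size_mb: 20
  reject_old_samples: true         # refuse logs with too-old timestamps
  reject_old_samples_max_age: 168h # one week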

Hashing and Ingester Selection

  • The Distributor uses consistent hashing to determine which Ingesters should handle the log stream:
  • Looks up the hash in the hash ring, which maps hash ranges to Ingesters.
  • Selects n Ingesters (where n is the replication factor, typically 3).
  • The Distributor forwards the log stream to Ingester A, Ingester B, and Ingester C in parallel.

Example:

  • Hash of the stream: 42
  • Replication Factor: 3
  • Hash Ring:
  Token 10 -> Ingester A  
  Token 50 -> Ingester B  
  Token 90 -> Ingester C  
  • Hash 42 falls between Token 10 and Token 50, so Ingester A is the primary recipient.
  • The Distributor also selects the next two clockwise tokens: Ingester B and Ingester C.

Write Quorum & Handling Failures

  • The Distributor waits for acknowledgments from the Ingesters.
  • If one of the Ingesters (e.g., Ingester B) is unreachable, the Distributor still succeeds if Ingester A and Ingester C acknowledge the write.
  • The Collector Agent then receives an acknowledgment that the logs were successfully ingested.
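
The replication factor comes from the ingester ring configuration; a sketch assuming a memberlist-backed ring:

ingester:
  lifecycler:
    ring:
      kvstore:
        store: memberlist    # ring state shared via memberlist gossip
      replication_factor: 3  # quorum = floor(3/2) + 1 = 2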

Ingester

  • The ingester is responsible for persisting logs and shipping them to long-term storage (e.g., Amazon S3, Google Cloud Storage) and for returning recent logs for queries.

Lifecycle Management

Each ingester has a state that governs its behavior:

  • PENDING: Waiting for a handoff of data from a leaving ingester.
  • JOINING: Inserting itself into the hash ring, preparing to receive data.
  • ACTIVE: Fully initialized, able to handle both reads and writes.
  • LEAVING: Shutting down, still serving read requests.
  • UNHEALTHY: Failed to heartbeat, marked as unhealthy by the distributor.

Data Handling

  • Logs are grouped into chunks in memory.
  • Once a chunk reaches its capacity or a certain time interval passes, it is compressed and marked as read-only.
  • A new writable chunk is created to handle incoming logs.
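
Chunk sealing is governed by a few ingester settings; a sketch (defaults vary by Loki version):

ingester:
  chunk_idle_period: 30m      # seal a chunk that has stopped receiving logs
  max_chunk_age: 1h           # seal chunks after this age regardless
  chunk_target_size: 1572864  # target compressed chunk size (~1.5 MB)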

Data Persistence

  • Periodically, ingesters flush data to the backing storage (e.g., S3 or Google Cloud Storage).
  • Chunks are hashed based on tenant, labels, and content to ensure no duplication in storage.
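
Long-term storage is set in storage_config; a sketch for S3, where the region and bucket name are placeholders (a matching schema_config is also required but omitted here):

storage_config:
  aws:
    region: us-east-1         # placeholder
    bucketnames: loki-chunks  # placeholder bucket for chunk objects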

Timestamp Ordering

  • Logs must be ingested in timestamp order by default.
  • If a log arrives out of order, it is rejected unless Loki is configured to accept out-of-order writes.
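
Out-of-order acceptance is a per-tenant limit; in recent Loki versions it is enabled by default:

limits_config:
  unordered_writes: true   # accept out-of-order logs within the active window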

Replication:

  • To prevent data loss, logs are replicated to multiple ingesters (typically 3).
  • This means if one ingester fails, the data can still be retrieved from the others.

Querier

  • Executes LogQL queries.
  • Fetches data from ingesters and long-term storage.

Query Frontend (Optional)

  • Accelerates query execution.
  • Splits large queries into smaller subqueries.
  • Caches query results for efficiency.
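
Splitting and result caching are configurable; a sketch matching the 30-minute splits used in the example below (exact key placement varies across Loki 2.x versions):

query_range:
  cache_results: true               # reuse results of repeated sub-queries
limits_config:
  split_queries_by_interval: 30m    # break long ranges into 30m sub-queries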

Example

{app="backend"} |= "error"

What Happens When the Query Frontend Receives the Query?

  • The query specifies a time range of 6 hours.
  • The Query Frontend splits the query into smaller time-range sub-queries (e.g., 30-minute intervals).
  • These sub-queries are distributed to multiple Queriers for parallel processing.

Sub-Queries Generated:

Fetch logs from 00:00 to 00:30
Fetch logs from 00:30 to 01:00
Fetch logs from 01:00 to 01:30
... (and so on, through the full 6-hour range)

Query Frontend Optimizes Performance

  • Caching:
    The Query Frontend checks its cache for any previously executed sub-queries. If Aayush had run a similar query earlier, some results might already be cached.

  • Batching and Parallelism:
    The sub-queries are sent to Queriers in parallel, reducing the overall query execution time.

Queriers Process Sub-Queries

  • Each Querier fetches logs from:

    • Ingesters: For recent logs still in memory.
    • Object Storage: For older logs stored as compressed chunks.
  • The Queriers process their assigned time ranges and filter logs containing the word "error".

Query Frontend Aggregates Results

  • Once all sub-queries are completed:
    • The Query Frontend aggregates the results from all Queriers.
    • It merges and deduplicates log entries across the time ranges.

Deployment Modes

1. Single Binary Mode

  • All Loki components run in a single process.
  • Suitable for small-scale deployments.

2. Simple Scalable Mode

  • Components run in three logical groups:
    • Read
    • Write
    • Backend
  • Ideal for medium-scale setups.

3. Microservices Mode

  • Fully distributed architecture:
    • Every component runs as an independent service.
  • Best for large-scale, production-grade deployments.

Now Let's Look into the Helm Charts!
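
The three Grafana Helm charts map roughly onto the deployment modes above: grafana/loki (single binary or simple scalable), grafana/loki-distributed (microservices), and the older grafana/loki-stack bundle. A hypothetical minimal values sketch for loki-distributed, with illustrative replica counts:

# values.yaml sketch for grafana/loki-distributed -- counts are illustrative
distributor:
  replicas: 2
ingester:
  replicas: 3    # matches a replication factor of 3
querier:
  replicas: 2
queryFrontend:
  replicas: 1
compactor:
  enabled: true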

Speaker Notes

Promtail (or other agents): Promtail is the most commonly used log collector for Loki. It reads logs from files (e.g., /var/log) or receives them from systemd, Kubernetes, or other sources. Promtail adds metadata to the logs, such as Kubernetes pod labels, hostnames, or custom labels defined in its configuration. Logs are sent to Loki via the HTTP push API, or shipped by other agents such as Fluentd or Logstash.

Ingestion API: Loki exposes an HTTP API endpoint for receiving log streams. Promtail and other log agents send log entries to this endpoint in batches.

Think of Promtail as a delivery driver who collects logs from various locations (servers, containers) and delivers them to a centralized log warehouse (Loki).

Example: Promtail is like a driver who needs to know where to pick up packages (logs) from. It can either be told exactly where to go (static) or automatically figure it out (Kubernetes).

Example: Promtail is like a worker who checks specific files for new log entries to ship to Loki.

Example: Think of Promtail as a labeling machine that tags each log with useful information, like the name of the department (pod name) or item type (log file).

Example: Promtail is like a postal service that picks up logs, holds them until a certain amount is collected, and then sends them to the central warehouse (Loki).

Example: Promtail is like a delivery driver who marks the last stop made so they can pick up where they left off the next time.

Example: Promtail acts like a log editor, adjusting the logs as needed before sending them to the warehouse (Loki).

Example: Promtail can be configured to receive logs from a centralized syslog server and then forward them to Loki.

Example: Promtail can act as a hub that receives logs from multiple remote sources before shipping them to Loki.

The Distributor is the first component in Loki's backend that processes incoming logs.

Responsibility:

  • It validates and deduplicates incoming log data.
  • It assigns the logs to specific tenants (if multi-tenancy is enabled).
  • It hashes the log stream's labels and uses consistent hashing to determine which ingester will process the logs.

Load Balancing: Distributors ensure even distribution of log streams across ingesters using the hash ring.

The Distributor performs the following validation steps:

  • Rate limiting: Aayush has configured a global rate limit of 10 MB/s for this tenant. The cluster has 5 Distributors, so each Distributor enforces a limit of 2 MB/s for dev-team.
  • Label normalization: sorting ensures {app="backend", env="production"} is treated as equivalent to {env="production", app="backend"}.
  • Stream hashing: the Tenant ID (dev-team) and the label set ({app="backend", env="production"}) are combined to create a unique hash.
  • Write quorum: with a replication factor of 3, the quorum is floor(3/2) + 1 = 2. At least two Ingesters must confirm the write for the Distributor to consider it successful.

Key Benefits in This Scenario

  • Scalability: The Distributor is stateless, so Aayush can scale it horizontally to handle increased log traffic.
  • Fault Tolerance: The replication factor ensures that logs are not lost even if one Ingester fails.
  • Rate Limiting: Per-tenant rate limits protect the system from being overwhelmed by a single tenant's log traffic.
  • Efficient Load Distribution: Consistent hashing ensures that logs are evenly distributed across Ingesters.

Role of Ingester:

Think of it like a warehouse that stores products (logs) temporarily before shipping them to a storage facility (cloud storage).

Example: If an ingester is ACTIVE, it can handle incoming logs (write requests) and serve recent logs for queries (read requests).

Example: Imagine a log entry from a web server is stored in a chunk. When the chunk is full, it's sealed (compressed), and a new chunk is created for further logs.

Example: If the same log is received by two ingesters, they will not write the same chunk to storage, avoiding data duplication.

Example: If a log from server1 arrives at 10:05:00 and another log arrives at 10:03:00, the latter will be rejected unless out-of-order writes are allowed.

Example: Think of backups for your logs — if one warehouse (ingester) loses its data, other warehouses (replicas) will have copies of it.

Failure Mitigation: If an ingester crashes, any unflushed data is lost. To mitigate this, Loki uses a Write-Ahead Log (WAL) and replication.

Example: If a warehouse suddenly catches fire (ingester crash), the backup warehouse (replica) can still provide the data.

Filesystem Support: The ingester can write to the filesystem using BoltDB in single-process mode. However, this approach is limited, since multiple processes can't access the same BoltDB instance concurrently.

Example: Think of single-tenant mode where only one process can manage the warehouse's inventory at a time.

This query requests all logs labeled with app="backend" that contain the word "error" in their content.

The Query Frontend intercepts Aayush's query before it reaches the Querier.

For example: If I queried logs for 00:00 to 03:00 an hour ago, those results are fetched directly from the cache, skipping redundant processing.