
Understanding Cosine Similarity for Web Server Log Analysis

This blog focuses on analyzing web server logs by converting log entries into vector embeddings and evaluating their semantic similarity using cosine similarity. The goal is to process input text, generate embeddings, and compare them against logged data to identify patterns or matches based on contextual meaning rather than exact text.

Published Apr 4, 2025

Introduction

In this personal project, I explored how cosine similarity can be used to analyze web server logs by comparing log entries based on their semantic meaning. The goal was to retrieve relevant logs by converting them into embeddings and measuring similarity against a given input query. I used a dataset from Kaggle (https://www.kaggle.com/datasets/eliasdabbas/web-server-access-logs) containing web server logs, processed them into embeddings, and implemented cosine similarity to find the closest matches.

What is Cosine Similarity?

Cosine similarity is a metric that measures how similar two vectors are by calculating the cosine of the angle between them. In natural language processing (NLP), text is converted into numerical vectors (embeddings), and cosine similarity compares their semantic meaning. A value of 1 means the vectors point in the same direction, 0 means they are orthogonal (no similarity), and -1 means they point in opposite directions. The formula is:
cosine_similarity(A, B) = (A · B) / (‖A‖ × ‖B‖)

where A and B are vectors, A · B is their dot product, and ‖A‖ and ‖B‖ are their Euclidean norms.
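As a quick sanity check, here is a minimal NumPy sketch of the formula. The vectors are made-up illustrative values, not real log embeddings.

# Minimal numeric check of the cosine similarity formula.
import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine of the angle between vectors a and b."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

a = np.array([1.0, 2.0, 3.0])
b = np.array([2.0, 4.0, 6.0])     # same direction as a -> similarity 1.0
c = np.array([-1.0, -2.0, -3.0])  # opposite direction -> similarity -1.0

print(cosine_similarity(a, b))  # 1.0
print(cosine_similarity(a, c))  # -1.0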

Dataset and Preprocessing

The Kaggle dataset contains thousands of raw log entries. For testing, I extracted 100 logs and converted them into embeddings using a pre-trained sentence transformer model (all-mpnet-base-v2).
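Below is a rough sketch of this preprocessing step. The file name access.log, the combined-log-format regex, and the compact "method path status" text form are my assumptions for illustration; the actual Kaggle files may need different handling.

# Read the first 100 raw log lines and keep the fields that carry
# semantic signal (method, path, status). Assumes a local "access.log"
# in the common/combined log format.
import re

LOG_PATTERN = re.compile(
    r'(?P<ip>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<path>\S+) [^"]*" (?P<status>\d{3}) \S+'
)

logs = []
with open("access.log", encoding="utf-8", errors="replace") as f:
    for line in f:
        match = LOG_PATTERN.match(line)
        if match:
            # A compact text form works better for sentence embeddings
            # than a raw log line dominated by IPs and timestamps.
            logs.append(f"{match['method']} {match['path']} status {match['status']}")
        if len(logs) >= 100:
            break

print(logs[:3])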

Embedding and Storing Logs

Each log entry was converted into a dense vector representation (embedding) and stored in an array format in server memory. This allowed efficient retrieval when comparing against new input queries. Embedding models capture semantic meaning, so even logs with different wording but similar intent (e.g., "404 Not Found" vs. "Page missing") can be matched effectively.
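A minimal sketch of the embedding step, assuming the sentence-transformers package and the all-mpnet-base-v2 model mentioned above; logs is the list of preprocessed strings from the previous sketch.

# Encode every log entry into a dense vector and keep the whole matrix
# in memory, matching the in-memory array approach described above.
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-mpnet-base-v2")
log_embeddings = model.encode(logs, convert_to_numpy=True)
print(log_embeddings.shape)  # (100, 768) for all-mpnet-base-v2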

Workflow: Input Matching with Cosine Similarity

  1. Input Conversion: A user query (e.g., "server error") is embedded into the same vector space as the logs.
  2. Similarity Calculation: Cosine similarity is computed between the input embedding and every stored log embedding.
  3. Top-K Retrieval: The system returns the top-K most similar logs, ranked by cosine score (see the sketch below).
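Here is a sketch of steps 1-3, reusing model, logs, and log_embeddings from the snippets above. The vectorized scoring and argsort-based top-K selection are one reasonable way to implement the ranking, not necessarily how the original code did it.

# Embed the query, score it against all stored log embeddings, and
# return the top-K matches, best first.
import numpy as np

def top_k_matches(query: str, k: int = 5):
    query_emb = model.encode([query], convert_to_numpy=True)[0]

    # Cosine similarity against all stored embeddings in one vectorized pass.
    scores = log_embeddings @ query_emb / (
        np.linalg.norm(log_embeddings, axis=1) * np.linalg.norm(query_emb)
    )

    # Indices of the k highest scores, in descending order.
    best = np.argsort(scores)[::-1][:k]
    return [(logs[i], float(scores[i])) for i in best]

for text, score in top_k_matches("server error", k=5):
    print(f"{score:.3f}  {text}")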

Results and Observations

The test successfully retrieved relevant logs for different queries. For example:
  • Input: "404 error" → Matched logs with "Not Found" errors.
  • Input: "login failed" → Retrieved authentication-related logs.
The top-K approach (e.g., top 5 or 10) helped filter the most meaningful matches, improving usability.

The Python Code
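The following is a self-contained sketch that combines the steps above into one runnable script. The file name, the parsing regex, and the example query are illustrative assumptions rather than the original implementation.

# End-to-end sketch: parse logs, embed them, and answer a query with the
# top-K most similar entries. Assumes a local "access.log" in combined
# log format plus the sentence-transformers and numpy packages.
import re
import numpy as np
from sentence_transformers import SentenceTransformer

LOG_PATTERN = re.compile(
    r'(?P<ip>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<path>\S+) [^"]*" (?P<status>\d{3}) \S+'
)

def load_logs(path: str, limit: int = 100) -> list:
    """Parse up to `limit` log lines into compact text form."""
    logs = []
    with open(path, encoding="utf-8", errors="replace") as f:
        for line in f:
            m = LOG_PATTERN.match(line)
            if m:
                logs.append(f"{m['method']} {m['path']} status {m['status']}")
            if len(logs) >= limit:
                break
    return logs

def main() -> None:
    logs = load_logs("access.log")
    model = SentenceTransformer("all-mpnet-base-v2")
    embeddings = model.encode(logs, convert_to_numpy=True)

    query = "404 error"
    q = model.encode([query], convert_to_numpy=True)[0]
    scores = embeddings @ q / (np.linalg.norm(embeddings, axis=1) * np.linalg.norm(q))

    # Print the top 5 matches with their cosine scores.
    for i in np.argsort(scores)[::-1][:5]:
        print(f"{scores[i]:.3f}  {logs[i]}")

if __name__ == "__main__":
    main()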

Running the script prints the cosine similarity scores alongside the matched log entries, ranked from most to least similar.

Theory Behind the Approach

This method relies on dense vector representations (embeddings) and cosine similarity, which is widely used in NLP for semantic search. Unlike keyword-based matching, embeddings capture contextual meaning, making them robust for log analysis where logs may vary in phrasing but refer to similar issues (Sidorov et al., 2014).

Conclusion and References

This experiment demonstrated how cosine similarity can enhance log analysis by enabling semantic search over raw logs. Future improvements could include scaling to larger datasets and integrating LLMs.
References:
  • Sidorov, G., Gelbukh, A., Gómez-Adorno, H., & Pinto, D. (2014). Soft Similarity and Soft Cosine Measure: Similarity of Features in Vector Space Model. Computación y Sistemas, 18(3), 491-504.
