Security Data Lakes: A Modern Approach to Cybersecurity

The Ineffectiveness of Current Cybersecurity Spending
Despite spending roughly $18 million a year on average, most corporate security organizations struggle to prevent breaches, intellectual property theft, and data loss. This ineffectiveness stems from the fragmented approach common in security operations centers (SOCs).
A Historical Perspective on Security Operations
To understand the current challenges, it’s important to review the evolution of security operations. Ten years ago, the primary method of protecting applications and websites involved monitoring event logs – comprehensive digital records detailing all activity within a cyber environment.
These logs encompassed a wide range of actions, including user logins, email communications, and system configuration changes. Regular audits were conducted, alerts were triggered by suspicious events, investigations were initiated, and data was archived to meet compliance requirements.
The Rise of SIEM Technology
As cyber threats became more prevalent and attackers’ tactics, techniques, and procedures (TTPs) grew more complex, a more advanced approach emerged: security information and event management (SIEM).
SIEM systems use specialized software to analyze, in real time, the security alerts generated by applications and network infrastructure. This software employs rule-based correlation and analytics to transform raw event data into actionable security intelligence.
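To make "rule-based correlation" concrete, here is a minimal sketch in Python. The rule itself (several failed logins from one IP address inside a short window, followed by a success), the threshold, the window, and the event layout are all assumptions chosen for illustration; they are not taken from any particular SIEM product.

```python
from collections import defaultdict, deque
from datetime import datetime, timedelta

# Hypothetical rule: THRESHOLD failed logins from one source IP within WINDOW,
# followed by a successful login from that same IP, raises an alert.
THRESHOLD = 3
WINDOW = timedelta(minutes=5)


def correlate(events):
    """Return alerts for events: (timestamp, src_ip, outcome) tuples sorted by time."""
    recent_failures = defaultdict(deque)  # src_ip -> timestamps of recent failures
    alerts = []
    for ts, src_ip, outcome in events:
        failures = recent_failures[src_ip]
        # Drop failures that have aged out of the sliding window.
        while failures and ts - failures[0] > WINDOW:
            failures.popleft()
        if outcome == "failure":
            failures.append(ts)
        elif outcome == "success" and len(failures) >= THRESHOLD:
            alerts.append(
                {"src_ip": src_ip, "time": ts, "failed_attempts": len(failures)}
            )
            failures.clear()
    return alerts


# Illustrative events using documentation-range IP addresses.
t0 = datetime(2024, 1, 1, 9, 0, 0)
sample_events = [
    (t0, "203.0.113.7", "failure"),
    (t0 + timedelta(seconds=20), "203.0.113.7", "failure"),
    (t0 + timedelta(seconds=40), "203.0.113.7", "failure"),
    (t0 + timedelta(seconds=50), "198.51.100.4", "success"),
    (t0 + timedelta(seconds=60), "203.0.113.7", "success"),
]
alerts = correlate(sample_events)
```

Only the login from 203.0.113.7 is flagged, because it is the only success preceded by enough recent failures from the same address. Real correlation engines evaluate many such rules across many event types at once.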
While not a perfect solution, the ability to detect attacks in progress – to locate the proverbial “needle in the haystack” – represented a significant advancement in cybersecurity capabilities.
The Current State of SIEM and Evolving Needs
SIEM solutions remain prevalent today, with Splunk and IBM QRadar dominating the market. The technology itself has undergone substantial development, adapting to new use cases and incorporating advancements like cloud-native deployments and machine learning.
However, the number of new enterprise SIEM implementations is declining, implementation costs are increasing, and crucially, the requirements of the Chief Information Security Officer (CISO) and their SOC teams have fundamentally shifted.
Sophisticated behavioral analytics are now being leveraged, but the core issue of a fragmented approach persists, hindering optimal security outcomes.
The Limitations of SIEM in Addressing Modern Security Challenges
Data volumes have increased dramatically, and SIEM systems are proving to be insufficiently broad in their scope. Simply aggregating security events is no longer adequate due to the limited perspective this approach provides. Despite the potential for capturing and analyzing substantial event data, critical information sources are often overlooked.
These overlooked sources include OSINT (open-source intelligence), readily available external threat intelligence feeds, and crucial data like malware and IP reputation databases, alongside insights derived from dark web monitoring. The sheer number of potential intelligence sources exceeds the capabilities of traditional SIEM architectures.
Furthermore, escalating data volumes are driving up costs. The combination of data growth, hardware expenses, and licensing fees results in a rapidly increasing total cost of ownership. The proliferation of both physical and virtual infrastructure has led to an exponential increase in the amount of data being collected.
Machine-generated data has experienced a 50-fold increase, while typical security budgets grow by only 14% annually. Storing this much information makes SIEM solutions financially unsustainable. The average annual cost of a SIEM now approaches $1 million for licensing and hardware alone.
This economic pressure compels SOC teams to limit the amount of data they capture and retain, further diminishing the SIEM’s effectiveness. In a recent conversation, a SOC team reported that querying extensive datasets for fraud detection in Splunk was cumbersome and prohibitively expensive, prompting them to investigate alternatives.
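The gap between those two figures can be sized with simple compound-growth arithmetic. The sketch below uses only the 50-fold and 14% numbers quoted above; the text gives no time frame for the data increase, so the 10-year horizon is an assumption for illustration only.

```python
import math

DATA_GROWTH_FACTOR = 50     # from the text: machine-generated data grew 50-fold
BUDGET_GROWTH_RATE = 0.14   # from the text: budgets grow 14% per year
ASSUMED_HORIZON_YEARS = 10  # assumption: the source states no time frame


def years_to_multiply(annual_rate, factor):
    """Whole years of compound growth at annual_rate needed to reach factor."""
    return math.ceil(math.log(factor) / math.log(1 + annual_rate))


def budget_multiple(annual_rate, years):
    """How many times larger a budget is after compounding for the given years."""
    return (1 + annual_rate) ** years


years_needed = years_to_multiply(BUDGET_GROWTH_RATE, DATA_GROWTH_FACTOR)
multiple_after_horizon = budget_multiple(BUDGET_GROWTH_RATE, ASSUMED_HORIZON_YEARS)
```

At 14% a year, a budget needs about 30 years to grow 50-fold, and it grows only about 3.7-fold over a decade. Whatever period the data growth actually covers, storage priced per unit of data outpaces budgets by a wide margin.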
The inherent weaknesses of the current SIEM-centric security model present significant risks. A recent Ponemon Institute survey, encompassing nearly 600 IT security leaders, revealed a concerning statistic: 53% of respondents were unsure whether their security products were even functioning effectively, despite an average annual expenditure of $18.4 million and the deployment of 47 different products.
The Need for a New Approach
- Current SIEM architectures struggle with the scale and diversity of modern threat data.
- The cost of maintaining a comprehensive SIEM deployment is becoming unsustainable for many organizations.
- A significant percentage of security investments are failing to deliver demonstrable value.
It is evident that a fundamental shift in security strategy is necessary to address these challenges.
The Advancement of Security Architecture: Security Data Lakes
Traditionally, Security Information and Event Management (SIEM) systems have relied on data housed within data warehouses. These warehouses are characterized by isolated compartments of structured data, pre-processed and tailored for specific analytical tasks.
This conventional method of filtering, modeling, and transferring data from its origins into these segregated storage systems is both resource-intensive and costly. Ultimately, it significantly restricts the volume of data available for comprehensive security analytics.
Conversely, the security data lake represents a paradigm shift, focusing on the centralization of all vital threat and event data. This encompasses data from any source and in any format, all within a unified, easily accessible repository.
Key Differences Between Data Lakes and Data Warehouses
Unlike data warehouses, a security data lake preserves data in its original state – whether structured or unstructured – resulting in a dimensional, dynamic, and heterogeneous dataset. This inherent flexibility is what fundamentally differentiates data lakes and provides a distinct advantage.
A data lake enables the continuous ingestion of all security-related data streams. This includes log files, threat intelligence feeds, database tables, text files, and system logs. No data is discarded, ensuring complete retention for future analysis.
The automated data processing that occurs upon ingestion, often referred to as parsing, streamlines workflows. This allows security teams to concentrate their efforts on proactive threat prevention and incident response.
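A minimal Python sketch of this ingest-then-parse idea follows. The in-memory list standing in for the lake, the function names, and the two recognized formats (JSON objects and key=value tokens) are hypothetical simplifications; a real deployment would write to object storage and support far more formats.

```python
import json


def parse_record(raw):
    """Best-effort parse of one raw line; returns a dict of fields (may be empty)."""
    text = raw.strip()
    if text.startswith("{"):
        try:
            parsed = json.loads(text)
            if isinstance(parsed, dict):
                return parsed
        except json.JSONDecodeError:
            pass  # not valid JSON after all; fall through to key=value parsing
    fields = {}
    for token in text.split():
        key, sep, value = token.partition("=")
        if sep and key:
            fields[key] = value
    return fields


def ingest(lake, source, raw):
    """Store the untouched raw line alongside whatever fields could be parsed."""
    entry = {"source": source, "raw": raw, "fields": parse_record(raw)}
    lake.append(entry)
    return entry


lake = []  # hypothetical stand-in for the data lake's storage layer
ingest(lake, "firewall", "action=deny src=203.0.113.7 dst=10.0.0.5 port=22")
ingest(lake, "app", '{"user": "alice", "event": "login", "ok": true}')
ingest(lake, "notes", "analyst free-text note with no structure")
```

The third record yields no parsed fields, yet it is still retained in its original form. That is the point of the design: nothing is discarded at ingestion, and records can be re-parsed later if a better parser becomes available.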
Compared to the constraints of data warehouse solutions, the data lake approach offers a significantly more cost-effective and adaptable solution. It delivers the agility and performance demanded by modern security operations.
- Data Warehouses: Structured, filtered, compartmentalized data.
- Security Data Lakes: Centralized, native format, dynamic, heterogeneous data.
The evolution towards security data lakes marks a substantial improvement in flexibility and effectiveness for security teams, enabling a more holistic and responsive security posture.
The Transformative Power of a Security Data Lake for Your SOC
Let's get straight to the point. Implementing a security data lake empowers your security team to concentrate on higher-level, more strategic initiatives.
Key Benefits of a Security Data Lake
- Proactive Threat Hunting: Advanced attackers are adept at concealing their activities and bypassing standard security defenses. Expert security teams use triggers, such as a questionable IP address or a specific event, to identify and neutralize threats before significant harm is done. The success of threat hunting hinges on the team's expertise, but that team depends heavily on extensive threat intelligence data. With it, analysts can cross-reference internal observations against the newest intelligence, correlate the two, and detect attacks accurately.
- Data-Driven Investigations: When potential security incidents are flagged, analysts initiate investigations, and speed is paramount for effective incident response. The average organization employs 47 security products, which makes it hard to access all the necessary data. A security data lake streamlines this process by centralizing reconnaissance data. This eliminates the laborious task of log collection and lets analysts compare current behavior with historical patterns, potentially spanning a decade. Such extensive analysis would be financially impractical with a traditional SIEM.
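The cross-referencing step in threat hunting can be sketched in a few lines of Python. The reputation feed, its labels, and the event fields below are hypothetical placeholders using documentation-range addresses; only the standard-library `ipaddress` module is real.

```python
import ipaddress

# Hypothetical reputation feed: CIDR block -> label.
REPUTATION_FEED = {
    "203.0.113.0/24": "known botnet C2",
    "198.51.100.17/32": "credential-stuffing source",
}


def build_lookup(feed):
    """Pre-parse the feed's CIDR strings into network objects."""
    return [(ipaddress.ip_network(cidr), label) for cidr, label in feed.items()]


def hunt(events, lookup):
    """Return (event, label) pairs for events whose src_ip is in a listed network."""
    hits = []
    for event in events:
        addr = ipaddress.ip_address(event["src_ip"])
        for network, label in lookup:
            if addr in network:
                hits.append((event, label))
                break  # one label per event is enough for this sketch
    return hits


internal_events = [
    {"src_ip": "203.0.113.45", "user": "svc-backup", "action": "login"},
    {"src_ip": "10.0.0.8", "user": "alice", "action": "login"},
    {"src_ip": "198.51.100.17", "user": "bob", "action": "password_reset"},
]
hits = hunt(internal_events, build_lookup(REPUTATION_FEED))
```

Two of the three events match the feed. With internal events and external intelligence in one repository, this kind of join can run over years of history rather than only the window a SIEM license allows.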
Leveraging Software with Security Data Lakes
Implementing a security data lake necessitates the utilization of specialized software, as fully integrated, ready-to-use solutions are currently unavailable. Here are three innovative companies poised to significantly impact the industry and facilitate your security data lake deployment. (Please note: I have no employment affiliation with these companies, but possess familiarity with their offerings and anticipate substantial contributions to the field.)
Team Cymru: Unparalleled Internet Visibility
Team Cymru represents a leading security intelligence provider, though it remains relatively unknown to many. The organization maintains a comprehensive global network of sensors that monitor IP traffic traversing internet service providers, granting it an exceptional level of visibility – and consequently, knowledge – exceeding that of most security operations centers (SOCs).
Initially, the company’s core business involved selling this valuable data to prominent public security firms like CrowdStrike, FireEye, Microsoft, and, more recently, Palo Alto Networks following its acquisition of Expanse for $800 million. Furthermore, advanced SOC teams at organizations such as JPMC and Walmart are adopting the strategies outlined in this article and using Cymru’s telemetry data feed. Access to this same data is now available to you, offering over 50 data types and a decade of intelligence to enhance your team’s ability to identify adversaries and malicious actors based on characteristics like IP addresses or other identifying signatures.
Varada.io: Accelerating Data Lake Access
The primary benefit of a security data lake lies in its ability to provide easy, rapid, and unrestricted access to extensive datasets. This approach eliminates the need for data movement and duplication, delivering the agility and flexibility that users require. However, as data lakes expand, query performance can degrade, demanding significant data operations to meet evolving business needs. While cloud storage costs may be low, compute expenses can quickly escalate due to the reliance on full data scans by conventional query engines.
Varada addresses this challenge by automatically indexing all critical data across any dimension. Accelerated data is stored closer to the SOC – on solid-state drives – in a granular format, enabling data consumers to execute any query whenever necessary. This results in query response times up to 100 times faster, at a significantly lower cost by circumventing resource-intensive full scans. Varada facilitates the search for attack indicators, post-incident analysis, integrity monitoring, and proactive threat hunting. Essentially, Varada empowers your team with the data access they need, consistent and interactive performance, and relief from managing usage costs and complex data operations.
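The difference between a full scan and an indexed lookup can be illustrated generically. The Python sketch below is not Varada's implementation, whose internals are not described here; it is a plain inverted index over in-memory rows, with hypothetical column names, showing why an index lets a query touch far fewer rows.

```python
from collections import defaultdict


def full_scan(rows, column, value):
    """Check every row; returns (matching row ids, rows examined)."""
    matches = [i for i, row in enumerate(rows) if row.get(column) == value]
    return matches, len(rows)


def build_index(rows, columns):
    """Build an inverted index mapping (column, value) -> list of row ids."""
    index = defaultdict(list)
    for i, row in enumerate(rows):
        for column in columns:
            if column in row:
                index[(column, row[column])].append(i)
    return index


def indexed_lookup(index, column, value):
    """Fetch matching row ids directly; returns (row ids, rows examined)."""
    matches = index.get((column, value), [])
    return matches, len(matches)


# 10,000 benign rows plus one row carrying a hypothetical attack indicator.
rows = [{"src_ip": f"10.0.{i // 256}.{i % 256}", "action": "allow"} for i in range(10_000)]
rows[4242] = {"src_ip": "203.0.113.7", "action": "deny"}

index = build_index(rows, ["src_ip", "action"])
scan_matches, scan_cost = full_scan(rows, "src_ip", "203.0.113.7")
index_matches, index_cost = indexed_lookup(index, "src_ip", "203.0.113.7")
```

Both approaches find the same row, but the scan examines all 10,000 rows while the index examines one. The trade-off is the storage and upkeep of the index itself, which is the operational burden an automatic indexing layer aims to take off the team.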
Panther Labs: Transforming Snowflake into a Security Platform
Snowflake has become a widely adopted data platform, primarily serving departmental needs within mid-sized to large enterprises. It is not, however, a Security Information and Event Management (SIEM) system and lacks inherent security functionalities. Recognizing this gap, experienced security engineers from AWS and Airbnb established Panther Labs, a modern, cloud-native security platform designed to streamline the ingestion of all security data into a centralized data lake, thereby simplifying detection and accelerating incident response investigations.
The company has recently integrated Panther with Snowflake, enabling data joining between the two platforms to transform Snowflake into a “next-generation SIEM” or, alternatively, to evolve Snowflake into a fully functional security data lake. While still a relatively new solution, I have observed substantial migration of Splunk customers to Panther. This represents a promising concept with significant potential for the future of the SOC.
Security teams are increasingly acknowledging their struggle against malicious actors. The trend of reducing reliance on traditional SIEMs is gaining momentum, alongside other significant shifts in the security landscape. The SIEM is not poised for immediate obsolescence, but its role is evolving rapidly, and it now shares the stage with the security data lake.
Although not a simple “off-the-shelf” solution, the security data lake provides a centralized repository for all critical threat and event data, offering simplified access. It can complement existing SIEMs, but the market is also introducing data-lake-native solutions that are more flexible and efficient. The security data lake represents an exciting and worthwhile consideration for your organization.
Dan Schoenbaum has held advisory roles with both Varada and Panther Labs.