Cybersecurity and data management are closely linked. That is why many agencies are refining their strategies for collecting and managing data lakes and other large stores of network and operational data in the service of better cybersecurity. For ways to tackle the cyber data problem, the Federal Drive with Tom Temin spoke with R Street Institute's senior cybersecurity and emerging threats researcher Bryson Bort.
Tom Temin: Since this type of data lake technology, or any type of mass storage in the cloud, is available, do agencies run the danger of collecting so much data that it becomes difficult to sift through and identify the needles you are actually looking for in the haystack?
Bryson Bort: Yeah, I call it the NSA problem: we collect all we can. And then you have the challenge of, with all you've got as it builds up, is it easy for me to answer the questions I want to ask about this data? Part of the challenge, of course, is that I'm not always sure what question I want to ask before I go through it. So you want a structure that helps you get there more quickly. But that's not always realistic. Things change; we learn different things from the data itself. But yeah, that first part is that we start making some really big haystacks. And, Tom, here's the worst part: You talk about finding a needle in a haystack. The worst part is that sometimes you are looking through the haystack, and there is no needle to find.
Tom Temin: Yes, that could be a lot of spinning wheels and hourglasses of death, I guess, as you wait for a response to come back. And what is the best practice, first of all, for the architecture of a data lake today? I don't think anyone wants to invest in the kind of storage hardware infrastructure they might have had in the '80s, '90s, and 2000s.
Bryson Bort: Yes, so first of all, the concept of a data lake is possible because of the cross-platform accessibility we get with the cloud. I don't have to connect to a particular server somewhere in a client-server approach to retrieve a file. Instead, I access one large crater into which I've poured all the data, like a lake, hence the term data lake. So what are some of the challenges? First, it's the same problem we've had since the 1980s: configuration management. What do I have? What is it? How do I categorize it? How do I maintain its state? There's state, and there's version management on top of that. That gets into the problem: if I don't have the ability to maintain that state, I have duplication issues, and I have problems with the currency of the data. I'm looking at two of the same thing that are different; which one came first? I need that so I'm not confused by the history. Then there's being able to assess the current infrastructure: what structure seems best suited to this? In terms of configuration management, the challenge is that we're doing it on something that already exists. There's a big beast of different data in different forms in different silos. And of course, there's no common Rosetta Stone to be able to understand all of it, or even what exists. So a typical approach is usually program-based, sometimes department-based, where you go in and try to encapsulate as much of that as possible, even recognizing that you're not going to catch it all, and you establish that process. So we go back, we find the things that are already there, and we set up the process to identify the new things that are going to be created, so that we fill the lake. And then maintaining the water quality, I guess, is our analogy for the data in the lake.
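[Editor's note: the configuration-management problem Bort describes, duplicated copies of the same content and confusion about which version came first, is often handled by content hashing at ingest time. The following is a minimal Python sketch of that idea; the catalog structure and field names are illustrative assumptions, not any agency's actual tooling.]

```python
import hashlib
import json

def content_hash(record: dict) -> str:
    """Hash a record's canonical JSON form so identical content always collides."""
    canonical = json.dumps(record, sort_keys=True).encode("utf-8")
    return hashlib.sha256(canonical).hexdigest()

catalog = {}  # hash -> metadata for content already in the lake

def ingest(record: dict, source: str, version: int) -> bool:
    """Register a record; return False if its content is already cataloged."""
    digest = content_hash(record)
    if digest in catalog:
        return False  # duplicate content: keep the earlier copy and its history
    catalog[digest] = {"source": source, "version": version}
    return True

assert ingest({"host": "a", "event": "login"}, "siem", 1) is True
# Same content with keys in a different order hashes identically, so the
# duplicate is caught rather than silently becoming a second "version."
assert ingest({"event": "login", "host": "a"}, "backup", 1) is False
```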
Tom Temin: Now, you can swim in a lake and you can’t swim in a haystack. So maybe that’s an advantage.
Bryson Bort: We’re mixing metaphors here, just like real data issues.
Tom Temin: Pretty much, because the data is generated by network sensors, your different kinds of traffic-control gear: routers, switches and so on. But let's also say that for the purpose of detecting fraud, which could be a cybersecurity clue, you have transaction data from systems deployed to the public or to other agencies. So you have many, many different data formats from many different database programs. Or maybe it's not database programs at all, just data thrown off during the operation of a piece of equipment. What's the current best practice to streamline this so that the data is searchable across all of these different sources?
Bryson Bort: So when I think of streamlining, I think of what we can cut, and that's always a challenge; no one ever wants to not have the data. Behind that is data retention: how long do we keep particular data? And there are liability issues that can be linked to that. So it's a question both of rationalization and of standardization. How do I take disparate datasets that have different types? Not everything is simple, like purely numeric values: some things are temporal, some things are geophysical, and there may be others. How do I put them all in one common place where they can work and interact with each other? Then data sources: where do I have visibility, and what is driving the data? Because you spoke of network devices and databases, but there's also the human aspect; people themselves generate data. There are other devices. Just to throw in a comment on things like the [DoD Joint Artificial Intelligence Center]: we're going to have machine learning and artificial intelligence that depends on data from a training-set perspective, with some level of integrity, and the potential bias in that is going to affect it. But it will also create its own data as a result of those operations. So with data, it starts with: what are my sources? What is my visibility into those sources? What is my understanding of those sources, versus the questions I want to ask (what are the missions), and then my ability to standardize and centralize this data for analysis and use?
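[Editor's note: the standardize-and-centralize step Bort describes, mapping temporal, geospatial and numeric fields from disparate sources into one common shape, can be sketched as a set of per-source normalizers feeding a shared schema. The source formats and field names below are invented for illustration.]

```python
from datetime import datetime, timezone

def normalize_firewall(raw: dict) -> dict:
    """A network source: epoch timestamps and IP addresses."""
    return {
        "ts": datetime.fromtimestamp(raw["epoch"], tz=timezone.utc).isoformat(),
        "src_ip": raw["src"],
        "kind": "network",
    }

def normalize_badge_reader(raw: dict) -> dict:
    """A human-generated source: a badge swipe with a geolocation."""
    return {
        "ts": raw["swipe_time"],           # already ISO 8601 in this source
        "location": (raw["lat"], raw["lon"]),
        "kind": "physical",
    }

lake = [
    normalize_firewall({"epoch": 1700000000, "src": "10.0.0.5"}),
    normalize_badge_reader({"swipe_time": "2023-11-14T22:13:20+00:00",
                            "lat": 38.9, "lon": -77.0}),
]
# Every record now shares "ts" and "kind", whatever its origin, so temporal
# correlation across sources becomes a single query over one schema.
assert all("ts" in rec and "kind" in rec for rec in lake)
```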
Tom Temin: Okay, and does that then involve a process of removing some of the formatting, or some of the metadata around the data, and getting to things that are then much more interoperable?
Bryson Bort: Yeah, I mean, so there’s a filter there to put it in a particular format that’s part of that normalization.
Tom Temin: And we mentioned the idea that it's hard to find a needle in a haystack, if there is a needle in the first place. There is an element of time in discovering and mitigating cybersecurity incidents, even for things that have a long lifespan, which they sometimes don't. So how do you quickly analyze a data lake? What are some of the technologies or techniques for sorting through large amounts of data in such a way that you can react quickly to what might be happening?
Bryson Bort: So it's correlation. Think of the data itself as singular atoms. What I want to be able to do is apply structured or unstructured queries in different ways that match questions I already know, or want to ask. The structured ones are where I've identified a pattern: this and that together will always answer that question for me. In simple security terms, let's look at threat hunting. If I see this particular host activity related to this host activity related to this network traffic, it's a common attack chain for this type of Chinese espionage campaign. And so I don't want to keep asking that question myself; I've identified the idea, and it becomes a standing query, where data matching that query will now trigger an alert. Rather than constant human intervention, the data does the work for us, based on the visibility we have, and it brings in the human interaction to go into detection, response and remediation of what we now know is a breach. Then there are the unstructured queries, the questions I don't know yet. Say we want to look at and identify something new, again using the security example. I have this structured set, but what are the variables around it that I might start to look at? Things like, what if the traffic is coming from that particular location, or what if I look at the round trip on that traffic? That's actually how we could have identified SolarWinds, which has been in the news a lot. The traffic for SolarWinds had to come back to someone, and that someone was not in this country. And so the round-trip time of the packet was actually longer than it should have been. That's the kind of rich information this data can give you, where you don't need to know it was three obfuscated hops through the internet to get to Moscow. I just know the first thing doesn't look right.
But that data would have made you question it, because there is a pattern out there that would have revealed it.
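[Editor's note: the two query styles Bort describes, a standing structured query that fires when a known this-and-that pattern co-occurs on a host, and an unstructured check such as an anomalous round-trip time, can be sketched in a few lines of Python. The event shapes, activity names and the latency baseline are illustrative assumptions.]

```python
def correlate(events: list[dict]) -> list[str]:
    """Standing structured query: alert when the known attack-chain pattern
    (suspicious host activity plus matching network traffic) co-occurs."""
    alerts = []
    hosts = {e["host"] for e in events}
    for host in sorted(hosts):
        kinds = {e["kind"] for e in events if e["host"] == host}
        if {"suspicious_process", "beacon_traffic"} <= kinds:
            alerts.append(f"{host}: known attack chain detected")
    return alerts

def rtt_anomaly(rtt_ms: float, baseline_ms: float = 40.0) -> bool:
    """Unstructured-style check: is the round trip longer than it should be,
    suggesting the traffic is 'coming back' from somewhere unexpected?"""
    return rtt_ms > 2 * baseline_ms

events = [
    {"host": "ws-12", "kind": "suspicious_process"},
    {"host": "ws-12", "kind": "beacon_traffic"},
    {"host": "ws-07", "kind": "beacon_traffic"},  # only one half of the pattern
]
assert correlate(events) == ["ws-12: known attack chain detected"]
assert rtt_anomaly(130.0) is True   # far longer round trip than the baseline
assert rtt_anomaly(35.0) is False
```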