Detecting Domain Generating Algorithms – Black Cell Middle East LLC.

How can we prevent malware from communicating with a command-and-control (C&C) server? You might think of using a CTI (Cyber Threat Intelligence) feed with a network blacklisting appliance. You might also think about blocking certain protocols, or even using a Next-Generation Firewall to perform traffic inspection. But malware creators can be quite clever. They write malware that communicates over HTTP(S), hiding among all the other legitimate web requests in your organization. After all, the best place to hide a tree is in a forest. On its own, this still doesn’t get around a CTI blacklist, but there’s more to the story.

Instead of using a single domain that can easily be marked as malicious, malware operators began registering hundreds of domains daily as rendezvous points for their malware. But they didn’t simply hard-code these domains into the malware, as they could easily be extracted from the binary and blacklisted. Instead, they began using domain generating algorithms (DGAs) to create a potentially unlimited number of domains, in such a way that the domains generated on the C&C side match those generated by the malware. The malware simply communicates through a new domain each time, and the previous domains are discarded. This makes it impractical to maintain a CTI feed, as each malware family, or even each instance, may generate up to tens of thousands of domains per day, all of which you would need to collect, store, and blacklist.
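To make the idea concrete, here is a toy sketch in Python. It is not the algorithm of any real malware family: the shared seed, the SHA-256 hashing step, and the function name `generate_domains` are all assumptions made for illustration. The only point is that two parties who share a seed and agree on the date can derive the same list of domains without ever exchanging it.

```python
import hashlib
from datetime import date


def generate_domains(seed: str, day: date, count: int = 5, tld: str = ".com") -> list[str]:
    """Derive `count` pseudo-random domains from a shared seed and a date.

    Illustrative only: the hashing scheme is an assumption, not a real DGA.
    """
    domains = []
    for index in range(count):
        # Seed, date, and a counter are the only inputs, so anyone who
        # knows the seed can recompute the same list for a given day.
        material = f"{seed}|{day.isoformat()}|{index}".encode()
        digest = hashlib.sha256(material).digest()
        # Map each of the first 16 digest bytes onto a lowercase letter.
        label = "".join(chr(ord("a") + byte % 26) for byte in digest[:16])
        domains.append(label + tld)
    return domains


if __name__ == "__main__":
    today = date(2024, 1, 1)
    malware_side = generate_domains("shared-secret", today)
    operator_side = generate_domains("shared-secret", today)
    print(malware_side == operator_side)  # both sides compute the same list
```

Because the list changes whenever the date changes, a defender who blacklists one day's domains gains nothing for the next day.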

So, do we just give up and accept that there’s nothing we can do after a malware infection has occurred? Of course not! Let’s look at one of these domains to see if we can get any clues about how to detect them. Take a look at the domain below. Looks like gibberish, right?

ovyvwnkjserklcrjwwhcpucyurwjaelg.com

A CryptoLocker domain.

That’s because it is. It is simply a random sequence of characters. We don’t need to go into too much detail, but there is a mathematical way to quantify the randomness of a string. So, we could simply flag domains that seem to be completely random, right? Well, this is exactly what we tried to do in the early stages of developing our detection capabilities. But when testing with a real-world dataset, we couldn’t get accuracy above 65%. The problem wasn’t how we were quantifying randomness, but rather that malware creators had thought of this as well. We could have gotten away with it if it weren’t for these meddling malware developers.
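One common way to put a number on randomness is the Shannon entropy of the characters in the domain label. The article does not say which measure was actually used, so treat the following as an assumed stand-in: the function names and the choice of entropy are ours, not the original implementation.

```python
import math
from collections import Counter


def shannon_entropy(text: str) -> float:
    """Return the Shannon entropy of `text` in bits per character."""
    if not text:
        return 0.0
    total = len(text)
    counts = Counter(text)
    # H = sum over characters of p * log2(1 / p)
    return sum((n / total) * math.log2(total / n) for n in counts.values())


def domain_label(domain: str) -> str:
    """Return the left-most label, e.g. 'example' for 'example.com'."""
    return domain.split(".")[0]


if __name__ == "__main__":
    for name in ("ovyvwnkjserklcrjwwhcpucyurwjaelg.com", "google.com"):
        print(name, round(shannon_entropy(domain_label(name)), 2))
```

A detector built on this idea would flag labels whose entropy exceeds some threshold. Choosing that threshold is a trade-off, and no single cut-off separates all generated domains from all legitimate ones.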

Below you can see some domains that were also generated by a DGA. They seem completely normal, right? This is because malware developers began using dictionaries to generate their domains. They simply combine random words, often resulting in completely inconspicuous domains.

journeyready.net

wouldinstead.net

sickhurry.net

darkhope.net

cloudthirteen.net

dutybegan.net

christianaashleigh.net

Example Suppobox domains.
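A dictionary-based generator can be sketched in a few lines. This is a toy illustration and not Suppobox's actual algorithm: the word list is simply our own split of some of the example domains above, and the seeding scheme and the function name `dictionary_domains` are assumptions.

```python
import random
from datetime import date

# Hypothetical word list, taken by splitting the example domains above.
WORDS = ["journey", "ready", "would", "instead", "sick",
         "hurry", "dark", "hope", "duty", "began"]


def dictionary_domains(seed: str, day: date, count: int = 5, tld: str = ".net") -> list[str]:
    """Build `count` domains by joining two distinct dictionary words each."""
    # A string seed makes the generator deterministic, so every party
    # that knows the seed and the date obtains the same domains.
    rng = random.Random(f"{seed}|{day.isoformat()}")
    return ["".join(rng.sample(WORDS, 2)) + tld for _ in range(count)]


if __name__ == "__main__":
    for domain in dictionary_domains("shared-secret", date(2024, 1, 1)):
        print(domain)
```

Every label is made of ordinary words, so its character statistics resemble those of legitimate names, which is exactly why a pure randomness score struggles with this kind of DGA.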

Therefore, we need to come up with something more advanced. There are many approaches you could take, but the one we chose was the use of neural networks. Neural networks are made up of a number of interconnected artificial neurons, modeled on biological neural networks such as those found in animal brains. This sounds quite esoteric, but at its core, it’s really quite simple. Think of a neural network as a system that can infer which aspects of a domain name are important when determining whether or not it was algorithmically generated. It can do all this without us having to dictate exactly what sort of details it should be looking for. These details can become highly intricate and nearly impossible for a human to program. All we have to do is set up a good neural network architecture, collect a good dataset, and mark each domain in the dataset as either algorithmically generated or legitimate. The neural network will do most of the heavy lifting when it is trained. This sounds easy, but there are quite a few quirks to deciding which architecture to use, and the subsequent optimizations can be quite complex. Depending on who you ask, creating a good dataset is the most difficult, the most time-consuming, and also the most important part.
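Before any network can learn from domain names, each name has to be turned into numbers and paired with its label. The article does not describe the actual preprocessing, so the following is only a minimal sketch of one common approach, a fixed-length character-level encoding. The alphabet, the maximum length of 64, the reserved ids, and the sample labels are all assumptions.

```python
import string

# Assumed alphabet: lowercase letters, digits, hyphen, and dot.
ALPHABET = string.ascii_lowercase + string.digits + "-."
PAD_ID = 0      # fills the unused tail of short domains
UNKNOWN_ID = 1  # any character outside ALPHABET
CHAR_TO_ID = {ch: index + 2 for index, ch in enumerate(ALPHABET)}
MAX_LEN = 64    # assumed cap; longer names are truncated


def encode_domain(domain: str, max_len: int = MAX_LEN) -> list[int]:
    """Map a domain to a fixed-length list of integer character ids."""
    ids = [CHAR_TO_ID.get(ch, UNKNOWN_ID) for ch in domain.lower()[:max_len]]
    return ids + [PAD_ID] * (max_len - len(ids))


# Each training sample pairs a domain with a label:
# 1 = algorithmically generated, 0 = legitimate.
SAMPLES = [
    ("ovyvwnkjserklcrjwwhcpucyurwjaelg.com", 1),
    ("journeyready.net", 1),
    ("example.com", 0),
]

if __name__ == "__main__":
    features = [encode_domain(domain) for domain, _ in SAMPLES]
    labels = [label for _, label in SAMPLES]
    print(len(features), len(features[0]), labels)
```

Lists like `features` and `labels` are what a training framework would consume; which features matter is then left for the network to work out during training.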

When setting up our dataset, we gathered many very large collections of domain names, both algorithmically generated and legitimate, totaling tens of gigabytes. We then crafted a neural network that we trained on these datasets to achieve up to 98% accuracy. Below you can see the accuracy we were able to achieve for 92 malware families in our validation dataset.

DGA Family	Accuracy	DGA Family	Accuracy	DGA Family	Accuracy	DGA Family	Accuracy
bamital	100.00%	pandabanker	99.99%	feodo	100.00%	suppobox	99.74%
banjori	99.97%	pitou	65.49%	fobber	98.70%	sutra	99.31%
bedep	99.40%	proslikefan	93.81%	gameover	99.92%	symmi	87.69%
beebone	100.00%	pushdotid	95.98%	gameover_p2p	99.99%	szribi	94.54%
blackhole	100.00%	pushdo	90.12%	gozi	95.86%	tempedrevetdd	96.23%
bobax	98.00%	pykspa2s	99.06%	goznym	91.76%	tempedreve	96.08%
ccleaner	100.00%	pykspa2	99.34%	gspy	100.00%	tinba	99.44%
chinad	99.79%	pykspa	97.47%	hesperbot	94.38%	tinynuke	99.63%
chir	100.00%	qadars	99.68%	infy	99.84%	tofsee	98.40%
conficker	97.10%	qakbot	99.45%	locky	94.11%	torpig	89.89%
corebot	99.64%	qhost	60.87%	madmax	99.74%	tsifiri	100.00%
cryptolocker	99.43%	qsnatch	42.93%	makloader	100.00%	ud2	100.00%
darkshell	87.76%	ramdo	99.98%	matsnu	74.42%	ud3	95.00%
diamondfox	76.96%	ramnit	97.67%	mirai	95.71%	ud4	91.00%
dircrypt	97.83%	ranbyus	99.75%	modpack	86.88%	urlzone	98.67%
dmsniff	91.00%	randomloader	100.00%	monerominer	99.99%	vawtrak	94.85%
dnsbenchmark	100.00%	redyms	100.00%	murofetweekly	99.99%	vidrotid	98.33%
dnschanger	97.20%	rovnix	99.83%	murofet	99.79%	vidro	97.40%
dyre	99.92%	shifu	97.90%	mydoom	93.65%	virut	97.69%
ebury	99.95%	simda	97.49%	necurs	97.39%	volatilecedar	94.18%
ekforward	99.73%	sisron	100.00%	nymaim2	67.74%	wd	100.00%
emotet	99.88%	sphinx	99.73%	nymaim	91.32%	xshellghost	100.00%
omexo	100.00%	padcrypt	99.33%	oderoor	97.92%	xxhex	100.00%
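The article does not state how the per-family figures were computed. One plausible reading, assumed here, is that every validation domain of a family is truly algorithmic, so the family's accuracy is simply the share of its domains that the detector flags. The function name and the sample data below are hypothetical and carry no real results.

```python
from collections import defaultdict
from typing import Iterable


def per_family_accuracy(results: Iterable[tuple[str, bool]]) -> dict[str, float]:
    """Return the share of correctly flagged domains for each DGA family.

    `results` holds (family, flagged_as_dga) pairs for domains that are
    known to be algorithmically generated.
    """
    totals: dict[str, int] = defaultdict(int)
    hits: dict[str, int] = defaultdict(int)
    for family, flagged in results:
        totals[family] += 1
        if flagged:
            hits[family] += 1
    return {family: hits[family] / totals[family] for family in totals}


if __name__ == "__main__":
    # Made-up detector output for two placeholder families.
    sample = [("family_a", True), ("family_a", True),
              ("family_a", False), ("family_b", True)]
    for family, accuracy in per_family_accuracy(sample).items():
        print(f"{family}\t{accuracy:.2%}")
```

Breaking the score down per family matters because a single overall accuracy can hide families on which the detector performs poorly.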

For comparison, the table below shows the accuracies produced by a method that merely looks at the randomness of the domain name. As you can see, it performs a lot worse, especially with DGAs that use dictionaries. It also struggles with very short domains, where there is not enough information to make a good prediction and the accuracy degrades to that of a random guess, or even worse.

DGA Family	Accuracy	DGA Family	Accuracy	DGA Family	Accuracy	DGA Family	Accuracy
bamital	97.40%	pandabanker	38.69%	feodo	89.58%	suppobox	13.37%
banjori	77.43%	pitou	0.01%	fobber	56.20%	sutra	57.33%
bedep	78.26%	proslikefan	5.96%	gameover	99.99%	symmi	43.93%
beebone	40.95%	pushdotid	15.42%	gameover_p2p	99.55%	szribi	4.90%
blackhole	80.02%	pushdo	10.13%	gozi	58.97%	tempedrevetdd	9.20%
bobax	47.67%	pykspa2s	25.92%	goznym	15.11%	tempedreve	20.10%
ccleaner	19.23%	pykspa2	26.37%	gspy	63.27%	tinba	33.36%
chinad	96.40%	pykspa	18.66%	hesperbot	43.82%	tinynuke	99.23%
chir	51.00%	qadars	78.16%	infy	10.35%	tofsee	0.00%
conficker	9.83%	qakbot	72.14%	locky	39.63%	torpig	6.94%
corebot	95.16%	qhost	26.09%	madmax	37.89%	tsifiri	0.00%
cryptolocker	63.56%	qsnatch	0.14%	makloader	100.00%	ud2	93.93%
darkshell	0.00%	ramdo	16.18%	matsnu	46.87%	ud3	88.33%
diamondfox	2.92%	ramnit	54.79%	mirai	67.14%	ud4	4.00%
dircrypt	56.96%	ranbyus	70.03%	modpack	9.38%	urlzone	64.79%
dmsniff	4.00%	randomloader	20.00%	monerominer	78.20%	vawtrak	10.19%
dnsbenchmark	100.00%	redyms	67.65%	murofetweekly	100.00%	vidrotid	32.67%
dnschanger	18.65%	rovnix	97.87%	murofet	68.08%	vidro	37.65%
dyre	98.60%	shifu	7.16%	mydoom	0.64%	virut	0.00%
ebury	85.20%	simda	3.12%	necurs	53.97%	volatilecedar	65.86%
ekforward	0.00%	sisron	11.82%	nymaim2	36.40%	wd	99.79%
emotet	77.75%	sphinx	80.71%	nymaim	12.63%	xshellghost	54.00%
omexo	100.00%	padcrypt	3.77%	oderoor	13.92%	xxhex	0.00%

We often associate machine learning with many graphics cards or even tensor processing units, so you may assume that our detection method consumes a lot of resources to make predictions. However, this is not really the case. We tested the throughput of our implementation and summarized the results below. As you can see, no special hardware is required to run this detection method. Keep in mind that the throughput refers to unique domains. In a real-world scenario, with deduplication and a whitelist, you will struggle to saturate even a single vCPU.

vCPUs	Minimum RAM	Throughput (unique domains)
1	2 GB	~28,000 domains/minute
2	3 GB	~62,000 domains/minute
4	4 GB	~108,000 domains/minute
8	4 GB	~188,000 domains/minute
16	7 GB	~300,000 domains/minute
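The deduplication and whitelisting mentioned above can be sketched as a small filter placed in front of the detector. This is an assumed design, not the actual implementation: the function name, the normalization steps, and the suffix-based whitelist match are illustrative choices.

```python
from typing import Iterable, Iterator, Optional


def filter_for_scoring(
    domains: Iterable[str],
    whitelist: set[str],
    seen: Optional[set[str]] = None,
) -> Iterator[str]:
    """Yield only the domains that still need to be scored by the model."""
    seen = set() if seen is None else seen
    for raw in domains:
        # Normalize case and drop a trailing root dot before comparing.
        domain = raw.strip().lower().rstrip(".")
        if not domain or domain in seen:
            continue  # empty or already handled
        seen.add(domain)
        # Skip whitelisted domains and any of their subdomains.
        if any(domain == entry or domain.endswith("." + entry) for entry in whitelist):
            continue
        yield domain


if __name__ == "__main__":
    queries = ["Example.com", "www.example.com", "sickhurry.net", "sickhurry.net"]
    print(list(filter_for_scoring(queries, {"example.com"})))
```

For scale, the single-vCPU figure of about 28,000 unique domains per minute is roughly 467 per second. Since only domains that are neither repeats nor whitelisted reach the model, the raw query volume a deployment can handle is correspondingly higher.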

We use the previously discussed neural network and many other tools in our Fusion Center to help protect our clients’ infrastructures. If you would like to learn more about how neural networks work and how we can use them to detect DGAs, read our new whitepaper on the topic.

We go into detail about how DGAs work, what neural network architectures we can employ, and how these architectures perform when detecting these domains. If you’re in the mood for something less technical, learn more about the tools, techniques, and philosophies that set our Fusion Center apart from a regular SOC.
