Crawling for cyber zama-zamas

Bitcoin-Spiderman

Wow

So, this post has everything. Crypto-currency. High-speed internet chases. Adventure. Excitement. Government negligence. Life lessons. Spiders. Morphing websites. Inside jobs. Ohhh boy. Crypto-mining: it’s like modern alchemy, turning electricity into gold. Tiny tiny tiny pieces of gold.

The uber-summary of this post is: run a good ad blocker.

Update: The Deen Store got back to me (the only page to get back to me so far), but they said they couldn’t see it on their site. On re-inspection, the implementation had been completely changed – details in their section, More Shopping.

More specifically

More specifically, this post is an experiment that (of course) grew out of control. Inspired by the recent spate of otherwise boring websites running crypto-miners (ol’ coinhive of course), I became curious as to what the South African sneaky crypto-coin mining scene was like. And I found some, which may be what most of you are interested in.

As such, this article will start with what I found. Technical details on lessons learned writing a crawler and using ElasticSearch will follow afterwards, so hang around if that’s what you’re interested in.

If you’re not interested in either, I guess I could include some cute pictures. I’ll try to have something for everyone.

As a preview, the most interesting site that’s running a miner is the (presumably) Eskom-run Operation Khanyisa, which ironically deals with illegal electricity connections.

I’ve intentionally not linked to any of the infected pages, because I don’t want unwary travellers to accidentally stumble onto them. Any good ad blocker is blocking these miners nowadays in any case – I’m using uBlock Origin.

“The fields are alive… with the sounds of spiders.”

More than just radio

The site that kicked all of this off was Jozi FM – a South African radio station with a rather unfortunate tagline.

Oh yes they are.

They are indeed more than just radio, they’re a Monero coinhive implementation as well. The standard questions leap to mind – was it intentional? Was it an “enterprising” developer trying to make some cash on the side? Have they just been compromised, and have a bigger issue on their hands than merely draining fractions of their viewers’ battery lives?

There’s not much you can do about intentional. Cryptomining isn’t illegal (or even necessarily a bad thing, if done correctly) – it’s just a little unethical when you’re doing it unannounced.

In fact, as an informed option, it’s really better than flooding poor users with adverts. Ad networks have a history of delivering intrusive ads at best, and at worst adverts that are actively malicious and could potentially compromise viewers’ computers.

Considering that the cost of the bandwidth to download the adverts will probably outweigh the cost of the battery life used to mine the crypto, sponsoring your site with a cryptominer could be a better option – assuming your users know about it.

Otherwise you run the risk of, for example, an unsuspecting user leaving their device with the page open and having their battery drained – a battery which they no doubt meant to use for other things.

Candy Crush – one hell of a drug.

So, is JoziFM using it as a cunning alternative to flooding users with adverts? Well, they’re trying to load Google adverts as well, so probably not.

So, let’s assume that it isn’t a business decision, leaving us with either the rogue developer or the compromised site theory. Worth mentioning? A tweet to them went unacknowledged.

Perhaps they’re busy, although they’ve posted since the tweet. Perhaps they aren’t concerned, or they have a different definition of the ‘social’ in ‘social media’. I made a casual attempt at looking up the people who made the website, but strangely for an IT company, their Twitter link doesn’t actually work. We tried.

In any case, I found JoziFM via PublicWWW’s search. They, quite reasonably, would like people to pay more money to see their other web crawling results, but I’d never written a web crawler before. As mentioned previously, details on the crawler are later – what follows are the other websites my l’il spider revealed.

Support?

The interwebz are a vast and complicated place, and many people who aren’t deeply involved in IT feel a justified amount of trepidation when dealing with it. Enter: technical support teams. Your office IT, your dial-a-nerds, your local computer store, who handle the day-to-day computer problems for people who don’t have the time to. You should be able to trust them, right?

TechnoSupp seem to sell various electronic goods, from TVs to 3D printers. And, of course, they make additional cash on the side (or, at least, someone tries to), by mining some crypto.

Strangely, for an online-based company, their website has no Twitter and no Facebook. I submitted a query via their page – we’ll see if they get back to us. Crickets so far though – out celebrating their mined gains most likely. (Probably not, you need a lot of traffic to make any money from these miners.)

Web Design?

I mean, I’m willing to give the general sector a bit of leeway. Security is really hard, perhaps too hard for humans. But if you’re a web design company, and “love taking on challenging projects that require full-on content strategy, thoughtful design, demanding development, and ongoing marketing”… I feel like perhaps the bar should be higher.

Or perhaps they just thought it was a good idea?

I guess the Any technically includes coinminers?

They’re also another web design company whose Twitter link doesn’t actually go anywhere. So I dropped them a FB message, but we’re still running at a solid 0% response-rate to these things.

Sensitivity: classifieds.

The classifieds were the internet before the internet, really. So it’s nice they’ve finally picked up on the whole crypto thing.

Another WordPress site compromised though – there may be a trend here. Their Twitter page isn’t super active, but it would be rude to not let them know I guess.

More shopping

The Deen Store, which seems relatively active based on their FB, joins the crowd. No Twitter account, but I messaged them.

They were the only website to get back to me – but with some surprising news. They said they couldn’t see it on their side, and indeed it was gone. And viewing the source, there was no trace of it.

But there was something different.

Here we see some JavaScript being loaded called jquery.js (yes, there is already a jQuery loaded), which then calls a function named Anonymous, passing in 'sup' as an account number (friendly, I like it. Not insanely useful though?), and a throttle value. So at least this implementation is friendlier.

Viewing the fraudulent jquery.js, the Anonymous call is just a wrapped call to kick off the miner.

All of this is being served off a Heroku application – Heroku is just a hosting platform that provides a free option though, so it’s probably just being used to serve the files. I’ve notified them about it.

On refreshing the page, we see a different one being pulled in, with the parameter ‘mine’ and hosted on a different platform.

Curious, I refreshed again. Now we’re back to the original implementation, with a wallet ID again.

Refreshing over and over showed that the wallet and the implementation keep changing, until they loop back around. Clearly something sneakier is happening here – most likely a compromised server, with a random snippet that gets injected.
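If you want to spot this kind of rotating injection without mashing F5, a rough sketch is to fetch the page a few times and report any script URLs that don’t show up on every fetch. Everything here is my own illustration rather than what the crawler actually did: the class name, the regex-based script extraction (crude, a real HTML parser would be better) and the fetchPage supplier, which stands in for whatever downloads the page, are all assumptions.

```java
import java.util.LinkedHashSet;
import java.util.Set;
import java.util.function.Supplier;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class RotatingScriptCheck {
    // Crude match for src="..." or src='...' inside a <script ...> tag.
    private static final Pattern SCRIPT_SRC = Pattern.compile(
            "<script\\b[^>]*?\\bsrc\\s*=\\s*[\"']([^\"']+)[\"']",
            Pattern.CASE_INSENSITIVE);

    /** Collects the src of every external script in the given HTML. */
    static Set<String> scriptSources(String html) {
        Set<String> sources = new LinkedHashSet<>();
        Matcher m = SCRIPT_SRC.matcher(html);
        while (m.find()) {
            sources.add(m.group(1));
        }
        return sources;
    }

    /**
     * Fetches the page several times and returns the script URLs that
     * appeared in at least one fetch but not in all of them.
     * fetchPage is a hypothetical stand-in for the real download step.
     */
    static Set<String> unstableScripts(Supplier<String> fetchPage, int attempts) {
        Set<String> seenAnywhere = new LinkedHashSet<>();
        Set<String> seenEveryTime = null;
        for (int i = 0; i < attempts; i++) {
            Set<String> current = scriptSources(fetchPage.get());
            seenAnywhere.addAll(current);
            if (seenEveryTime == null) {
                seenEveryTime = new LinkedHashSet<>(current);
            } else {
                seenEveryTime.retainAll(current); // keep only scripts present again
            }
        }
        if (seenEveryTime != null) {
            seenAnywhere.removeAll(seenEveryTime); // drop the stable ones
        }
        return seenAnywhere;
    }
}
```

A legitimately dynamic site (rotating ad tags, for instance) would also light this up, so treat the output as a list of things to look at by hand, not as a verdict.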

The website is looking into it.

 

Give me the power!

Leaving the private sector for a moment, let’s have a look at where our tax is going. Everyone knows there’s a budget deficit, and some government departments seem to be coming up with novel ways to raise cash.

Operation Khanyisa, a government project which ironically deals with illegal electricity connections, is happily mining away. Even better, they have a really long website, which should allow the crypto miner to run for longer.

They, being the electricity-producing Eskom, are essentially selling us electricity, then using it to run a crypto miner on their site. I almost hope they’ve put it there intentionally.

No response as of time of writing.

Relax and mine

Showing that cryptominers don’t discriminate, a lodge called the Wolwe Krans Eco Lodge is mining away. Their contact buttons just went to room bookings, so I threw a tweet out in an attempt to see if it would be ignored, like all the other sites. So far, so good.

Mining in Africa

A Nigerian music site is running one as well – no response as of time of writing however.

Crawwwwwling in the weeebbbbbb

This was an interesting project – I wasn’t certain if I’d find anything, so it was nice to find a couple of things. The crawling progressed much more slowly than expected – after several days, my total domain count had only just hit 20 000 (and that includes .com domains). And it had run out of further domains to crawl.

I assumed it was me, but tweak after tweak, and many unsuccessful attempts to add new seed URLs that would lead to the anticipated domain explosion, finally led me to suspect something – the South African internet just isn’t that big.

We hit one million registered .co.za domains last year, but it seems like most of them just aren’t hosting websites. They could be abandoned projects, email accounts, who knows. In an attempt to informally confirm this, I searched shodan.io and PublicWWW for co.za domains. PublicWWW, being a much more mature crawling system than my hacked-together attempt, brought back the most, with 93316.

Out of curiosity, I added their results (well, what I could see on my account) to mine and picked up another 6500ish domains. I’m still crawling on those, at around 40k now, but that will include a lot of duplicates.

But even that means that only 9.3% of those one million registered domains are actually websites – interesting.

And if you did find that interesting, then you may find things that I learnt during the process interesting too. We’ll get more strictly technical from here, so people who were keen on the found miners can leave if they’re bored. People still here for the pics should still be good though.

Crawlers and Discoveries

The code behind a crawler is incredibly basic – you pull a website, parse for links to other websites, add them, and repeat.
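As a minimal sketch of that loop (not my actual crawler, just the shape of it): a queue of URLs still to visit, a set of URLs already seen, and a linksOf function that stands in for “download the page and parse out its links”. The class name, linksOf and the maxPages cap are all made up for illustration.

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;
import java.util.function.Function;

final class TinyCrawler {
    /**
     * Breadth-first crawl starting from seed. linksOf is a hypothetical,
     * caller-supplied function that fetches a page and returns its links.
     * Returns every URL visited, in visit order.
     */
    static Set<String> crawl(String seed,
                             Function<String, List<String>> linksOf,
                             int maxPages) {
        Set<String> visited = new LinkedHashSet<>();
        Deque<String> frontier = new ArrayDeque<>();
        frontier.add(seed);

        while (!frontier.isEmpty() && visited.size() < maxPages) {
            String url = frontier.poll();
            if (!visited.add(url)) {
                continue; // already crawled this one
            }
            List<String> links;
            try {
                links = linksOf.apply(url);
            } catch (RuntimeException e) {
                continue; // broken page: skip it and move on
            }
            for (String link : links) {
                if (!visited.contains(link)) {
                    frontier.add(link);
                }
            }
        }
        return visited;
    }
}
```

Keeping the fetching behind a function like this also makes the loop easy to test against a fake in-memory “web” before pointing it at the real one.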

I tried to limit to .za, but it seems that there are too many people in .com for that to work. Eventually, I allowed a single jump to a .com page, but didn’t crawl other .com pages beyond that.

As an aid to later searching, and out of curiosity as I’d never dealt with it before, I spun up an ElasticSearch instance to store the results. That was a learning experience in itself (which led to a mini-Java framework being written around it, but that’s a post for another day).
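One way to express that scoping rule, as a sketch: always follow links to .za hosts, and follow a link to a .com host only when the page linking to it is itself on a .za host. This is my reading of “a single jump”; the class and method names are invented for the example.

```java
import java.util.Locale;

final class CrawlScope {
    static boolean isZa(String host) {
        return host.toLowerCase(Locale.ROOT).endsWith(".za");
    }

    static boolean isCom(String host) {
        return host.toLowerCase(Locale.ROOT).endsWith(".com");
    }

    /**
     * Decides whether a link from fromHost to toHost is in scope.
     * .za targets are always in scope; .com targets only when reached
     * directly from a .za page, so .com-to-.com links are never followed.
     */
    static boolean shouldFollow(String fromHost, String toHost) {
        if (isZa(toHost)) {
            return true;
        }
        return isCom(toHost) && isZa(fromHost);
    }
}
```

For example, shouldFollow("news.co.za", "shop.com") is true, but shouldFollow("shop.com", "other.com") is false.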

For anyone else looking to write a crawler, here’s some weird stuff I came across:

Redirects and redirects and redirects…

People love their stats pages, which means occasionally the link you follow out from a site hits a server logging stats instead. Remember to follow all the 30x codes, but also remember to store intermediate pages.
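Roughly, that means turning off your HTTP client’s automatic redirect handling and walking the chain yourself, keeping every hop. The sketch below only shows the chain-walking logic: Hop and the fetch function are hypothetical, with fetch standing in for a real request made with auto-redirects disabled (HttpURLConnection has setInstanceFollowRedirects(false) for this).

```java
import java.net.URI;
import java.util.ArrayList;
import java.util.List;
import java.util.function.Function;

final class RedirectChain {
    /** One response, reduced to what matters here. location is null if absent. */
    static final class Hop {
        final String url;
        final int status;
        final String location;

        Hop(String url, int status, String location) {
            this.url = url;
            this.status = status;
            this.location = location;
        }
    }

    /**
     * Follows 3xx responses up to maxHops requests and returns every hop,
     * intermediate pages included. fetch is a caller-supplied stand-in for
     * the real HTTP request.
     */
    static List<Hop> follow(String startUrl, Function<String, Hop> fetch, int maxHops) {
        List<Hop> chain = new ArrayList<>();
        String url = startUrl;
        for (int i = 0; i < maxHops; i++) {
            Hop hop = fetch.apply(url);
            chain.add(hop); // store the intermediate page too
            boolean redirect = hop.status >= 300 && hop.status < 400 && hop.location != null;
            if (!redirect) {
                break;
            }
            // Location headers may be relative, so resolve against the current URL.
            url = URI.create(url).resolve(hop.location).toString();
        }
        return chain;
    }
}
```

The maxHops cap is there because redirect loops do exist in the wild, and URI.create will throw on a malformed Location, so this wants the same catch-and-move-on treatment as everything else.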

Broken pages

People do terrible things on the web – like HTML escaping URLs, which is kinda weird. If you’re scraping, you’ll pick up slightly more pages if you run a decent URL unescaper on your results first.
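As an illustration of what “unescaping” means here, the sketch below turns the handful of entities that tend to end up in URLs (mostly &amp;amp; and its numeric forms) back into plain characters. It’s a deliberately tiny, hand-rolled version with invented names; a proper HTML library will cover far more entities than this does.

```java
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class UrlUnescape {
    // Matches &name; as well as decimal (&#38;) and hex (&#x26;) references.
    private static final Pattern ENTITY =
            Pattern.compile("&(#[xX][0-9a-fA-F]+|#[0-9]+|[a-zA-Z]+);");

    // Only a small, assumed subset of named entities is handled.
    private static final Map<String, String> NAMED = Map.of(
            "amp", "&", "lt", "<", "gt", ">", "quot", "\"", "apos", "'");

    static String unescapeHtml(String s) {
        Matcher m = ENTITY.matcher(s);
        StringBuilder out = new StringBuilder();
        int last = 0;
        while (m.find()) {
            out.append(s, last, m.start());
            out.append(decode(m.group(1), m.group()));
            last = m.end();
        }
        out.append(s, last, s.length());
        return out.toString();
    }

    private static String decode(String body, String original) {
        try {
            if (body.startsWith("#x") || body.startsWith("#X")) {
                return new String(Character.toChars(Integer.parseInt(body.substring(2), 16)));
            }
            if (body.startsWith("#")) {
                return new String(Character.toChars(Integer.parseInt(body.substring(1))));
            }
        } catch (IllegalArgumentException e) {
            return original; // unparsable or out-of-range reference: leave it alone
        }
        return NAMED.getOrDefault(body, original); // unknown names are left as-is
    }
}
```

So a scraped link like http://example.co.za/page?a=1&amp;amp;b=2 becomes http://example.co.za/page?a=1&amp;b=2 before it goes into the queue.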

Sometimes though, you’ll find properly broken pages – don’t forget to always catch errors and move on.

Java throws IllegalArgumentException if you can’t resolve a domain, which is weird. And annoying. So use a blanket catch when you’re connecting to places.

Weird URL stuff

It’s kinda gross, but some pages link to weird ports. Don’t forget, when you’re parsing domain names, that they may include port numbers.
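In Java, one way to avoid tripping over this is to let java.net.URI split the host and port for you rather than chopping strings by hand. A small sketch (the wrapper class and method names are just for illustration):

```java
import java.net.URI;

final class HostAndPort {
    /** Host without the port, e.g. "example.co.za" for "http://example.co.za:8080/x". */
    static String hostOf(String url) {
        return URI.create(url).getHost();
    }

    /** Explicit port, or -1 when the URL doesn't specify one. */
    static int portOf(String url) {
        return URI.create(url).getPort();
    }
}
```

Two caveats: URI.create throws IllegalArgumentException on malformed input, and getHost can return null for URLs it can’t parse as a normal host, so check for both before storing the domain.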

You’ll occasionally run into link tags that point to pdf, mp3, mp4 etc – you may as well skip those, just for speed’s sake.
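A simple extension check on the URL path is enough for this. The sketch below uses just the three extensions mentioned above as its skip list; the class name and the exact list are assumptions, so extend it with whatever your crawl keeps tripping over.

```java
import java.net.URI;
import java.util.Locale;
import java.util.Set;

final class LinkFilter {
    // Assumed skip list; add more file types as needed.
    private static final Set<String> SKIPPED = Set.of("pdf", "mp3", "mp4");

    /** True if the URL's path ends in one of the skipped file extensions. */
    static boolean looksLikeMedia(String url) {
        String path;
        try {
            path = URI.create(url).getPath(); // ignores query string and fragment
        } catch (IllegalArgumentException e) {
            return false; // malformed URL: let the normal error handling deal with it
        }
        if (path == null) {
            return false; // e.g. mailto: links have no path
        }
        int slash = path.lastIndexOf('/');
        int dot = path.lastIndexOf('.');
        if (dot <= slash) {
            return false; // no extension in the last path segment
        }
        String ext = path.substring(dot + 1).toLowerCase(Locale.ROOT);
        return SKIPPED.contains(ext);
    }
}
```

This only looks at the URL, so a PDF served from an extensionless link will still get downloaded; checking the Content-Type header would catch those, at the cost of a request.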

Other findings

The Department of Human Settlements appear to have missed a desperate cry for attention.

KZN DHS hacked

I tried to point it out to them, but they’re not big on the social thing either. One must consider though – if they haven’t updated the page since 24 November 2017, is it that useful? Or perhaps they’ve just been locked out of it.

Conclusions

Let me know if you can think of any reasons why I initially plateaued at 20k domains, or if you have any other insights / questions / hilarious pages running coinminers. All in all, crawling is a classically difficult problem, much like anything that involves pseudo-natural language processing.

Also, it’s really hard to get hold of any websites. As of writing, no one had gotten back to me. And they call this the digital age.

Posted in Front-end, Java, Javascript, Security, Technology