Automate LinkedIn Stalking Your Own Employees So You Can Have Awkward Conversations About Their Profile Updates

Amazing insight:

If your employees are updating their LinkedIn profiles, it’s one of the better indicators that they’ve started looking for other work.

Now, your employees might be savvy enough to first change the setting that shares their profile edits with their network.

If this is the case, you won’t just see their passive-aggressive profile updating in your news feed whilst you’re checking out other roles. You could take some time to stalk their profiles daily in incognito mode, but this could be time-consuming. You might be in charge of a lot of employees! You could automate this. read more
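A rough sketch of what that automation could look like (the profile URL and file path are placeholders, and LinkedIn may well demand a login or block automated requests, so treat this as illustrative rather than a turnkey stalking machine):

# Fetch the public profile, hash it, and compare against the last run's hash.
PROFILE_URL="https://www.linkedin.com/in/example-employee/"
HASH_FILE="$HOME/.profile-hashes/example-employee.sha256"
mkdir -p "$(dirname "$HASH_FILE")"

NEW_HASH=$(curl -sL "$PROFILE_URL" | sha256sum | awk '{print $1}')

if [ -f "$HASH_FILE" ] && [ "$NEW_HASH" != "$(cat "$HASH_FILE")" ]; then
  echo "Profile changed - schedule an awkward conversation."
fi
echo "$NEW_HASH" > "$HASH_FILE"

Run daily from cron, a change in the hash flags a profile edit worth asking about.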

Faster Google Penalty Removal

Your website is only as good as Google’s picture of it. So, if you’re working on a website under penalty and actively trying to get that penalty lifted (or just trying to preempt future updates), you should do everything you can to keep Google up to date with the link profile, so that their picture reflects reality.

I’ve used the following method for just over two years, but I haven’t seen it get any serious coverage (though I’m sure it’s quite widely used). In short – get Googlebot to crawl the links that you’ve removed or disavowed. read more
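As a small preparatory sketch (the filename is an assumption), you can pull the individual URLs back out of a disavow file by stripping the comment, domain: and blank lines it may contain:

# Disavow files mix full URLs with "domain:" entries and "#" comments.
grep -v "^#" disavow.txt | grep -v "^domain:" | grep -v "^$" > disavowed-urls.txt

That leaves a flat list of removed or disavowed URLs to get back in front of Googlebot however you see fit.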

Block Googlebot Crawl by Folder Depth

Some sites have deep, deep, duplicative architecture. Usually this is the result of a faceted navigation. This is especially true for enterprise platforms. And like any healthy relationship, you can’t go in expecting them to change. Sometimes you’ll need to admit defeat and use an appallingly ugly but kind of elegant band-aid.

In short – picking the appropriate robots.txt disallow rule from the following can work:

/
/*/
/*/*/
/*/*/*/
/*/*/*/*/
/*/*/*/*/*/
/*/*/*/*/*/*/
/*/*/*/*/*/*/*/
/*/*/*/*/*/*/*/*/
/*/*/*/*/*/*/*/*/*/
/*/*/*/*/*/*/*/*/*/*/
/*/*/*/*/*/*/*/*/*/*/*/
/*/*/*/*/*/*/*/*/*/*/*/*/
/*/*/*/*/*/*/*/*/*/*/*/*/*/
/*/*/*/*/*/*/*/*/*/*/*/*/*/*/
/*/*/*/*/*/*/*/*/*/*/*/*/*/*/*/
/*/*/*/*/*/*/*/*/*/*/*/*/*/*/*/*/
/*/*/*/*/*/*/*/*/*/*/*/*/*/*/*/*/*/ etc...

This blocks crawl by folder depth. You might be thinking “this is awful, why would you ever want to do this?”
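To make it concrete, a minimal sketch of the file with one of those rules in place (the depth chosen here is arbitrary, and should come from your own crawl data):

User-agent: Googlebot
Disallow: /*/*/*/*/

Because Google treats * as matching any sequence of characters, any URL whose path contains five or more slashes matches the pattern, so on a site using trailing-slash URLs this blocks everything nested more than three folders deep while leaving shallower pages crawlable.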

Example Crawl Distribution

Whether this makes sense for you will depend on Googlebot’s requests by depth. The following crawl distribution is fairly typical for sites where the only landing pages are generated by faceted navigation:

[bar chart: Googlebot requests by folder depth] read more

Blocking and Verifying Applebot

Earlier today Apple confirmed the existence of their web crawler Applebot. This means that we’ll be seeing it crop up a little more in server log analysis.

Filtering Server Logs to Applebot

As anyone crawling the web can spoof their user agent to Applebot, we can use the IP range Apple have given us to weed out these rogue visits. Currently, legitimate Applebot visits will come from an IP between 17.0.0.0 and 17.255.255.255. The actual range is probably substantially smaller than this. We can pull the files we need from our server logs using the following {linux|mac|cygwin} commands in our bash terminal: read more
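A minimal sketch of what such a filter could look like (the filename and log layout are assumptions; the client IP is expected to be the first field of a combined-format log):

# Keep lines claiming to be Applebot, then keep only those from Apple's 17.0.0.0/8 range.
grep "Applebot" access.log | awk '$1 ~ /^17\./' > applebot-verified.log
# Anything left over claims to be Applebot but comes from outside Apple's range.
grep "Applebot" access.log | awk '$1 !~ /^17\./' > applebot-spoofed.log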

Server Logs, Subdomains, ccTLDs

Server logs have a few major drawbacks, one of which I hope to address today. It’s not an elegant solution, but it (more or less) works. Firstly, please read this post for an overview of server logfile analysis for SEO and you’ll hopefully see where I’m coming from. I think access logs are probably the best source of information available for diagnosing onsite SEO issues.

A Problem

If you have a little experience with server logs, you’ve probably encountered the following:

188.65.114.122 - - [30/Sep/2013:08:07:05 -0400] "GET /resources/whitepapers/retail-whitepaper/ HTTP/1.1" 200 "-" "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"
188.65.114.122 - - [30/Sep/2013:08:07:06 -0400] "GET /resources/whitepapers/retail-whitepaper/ HTTP/1.1" 301 "-" "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"

Huh.

Server logfiles provide us with the URI stem (the portion of the URL after the host and any port number), rather than the full URL. For this blog post, that would be:

 /server-logs-subdomains-cctlds/

rather than:

https://ohgm.co.uk/server-logs-subdomains-cctlds/

As logs give us URI references rather than full URLs, you essentially get everything from the third forward slash onwards.

One of my clients has all of their ccTLDs configured so that they share server logs. The logs alone do not let us see which domain serviced the request. If you’re dealing with a site with a blog.domain.com setup, you won’t be able to tell from the URI reference alone whether the main site or the blog serviced the request. The same goes for the http:// and http://www. versions.

Solution

I use the following method to gain some insight.

Firstly, cut your server logs down to size using whatever tools you’re comfortable with. I like grep or the filters in Gamut Log Parser.

grep "Googlebot" filename.log > filteredoutput.log read more