Blog Scraping – How to deal with it?

This blog has been hit by blog scrapers for the past three or four months now. Initially, I did not take it seriously beyond marking the scrapers’ pingbacks as spam or simply deleting them. But Donace’s comment on the last post here was a good wake-up call, and I thought of writing about scraping to educate myself as well as to spread awareness.

What exactly is blog or feed scraping?

Blog scraping (or blog plagiarism) is the process of searching a large number of blogs and copying their content with automated tools. Basically, it is a form of content theft, except that it is not manual copy-paste: the scrapers work from the RSS feeds of target blogs, picking posts by tags and keywords. Usually, you can spot scraped content when a suspicious trackback/pingback shows up in your WordPress console (when you follow the trackback, you see the same content as your original post there).
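One way to confirm that a page behind a suspicious trackback is a copy of your post is to compare the two texts. The following is a minimal Python sketch, assuming you have already fetched both texts as plain strings. The names `similarity` and `looks_scraped` and the 0.8 threshold are illustrative assumptions, not part of any plugin mentioned in this post.

```python
from difflib import SequenceMatcher


def similarity(original: str, suspect: str) -> float:
    """Return a ratio between 0 and 1; 1.0 means the normalised texts match."""
    # Collapse whitespace and case so formatting differences are ignored.
    a = " ".join(original.lower().split())
    b = " ".join(suspect.lower().split())
    return SequenceMatcher(None, a, b).ratio()


def looks_scraped(original: str, suspect: str, threshold: float = 0.8) -> bool:
    """Flag the suspect text when it is close to the original (assumed cutoff)."""
    return similarity(original, suspect) >= threshold
```

Scrapers that republish only a snippet will score lower than full copies, so the threshold would need tuning against your own posts.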

How can scraping affect your blog?

Scraping may be relatively harmless compared to other forms of piracy and hacking. However, the scraper usually gains some advantage at your expense. The following are some of the ways scrapers benefit, and your blog suffers, when your content is copied.

  • They can receive some traffic from your blog (or original post) through the trackbacks, while traffic in the opposite direction is usually nil
  • The scraped copy of the post can gain an SEO advantage as it is in a good neighbourhood
  • Search engines may treat your original content as the duplicate if a smart splogger runs an organized operation
  • Since scraping is usually associated with splogs (spam blogs), it can hurt your blog’s brand value when your name and tags appear on such blogs

How to deal with scrapers?

Though most trackbacks from scrapers end up in Akismet, that alone may not be sufficient to stop scraping. The following are some techniques for dealing with it effectively.

#1 Anti-leech plugin

The anti-leech WordPress plugin works basically the same way as method #3 (described below). It serves a different set of content to sploggers while maintaining the actual feed for everyone else. Sploggers are identified by the IP addresses or user-agent strings that you provide in the settings.

#2 .htaccess ban

You can use the .htaccess entry deny from [IP ADDRESS] to block spam blog bots from accessing your blog. However, please note that it has a blanket effect if the splogger is working from a shared host that serves several hundred blogs (many good blogs as well) from one IP address.
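If you keep a list of offending addresses, the deny entries can be generated rather than typed by hand. This is a rough Python sketch, assuming Apache 2.2-style access directives; on Apache 2.4 these need mod_access_compat, and the newer syntax uses Require not ip instead. The function name `htaccess_deny_block` is hypothetical, and the addresses in the usage note are reserved documentation addresses, not real sploggers.

```python
import ipaddress


def htaccess_deny_block(addresses):
    """Build an Apache 2.2-style block that denies the given IP addresses."""
    # Allow everyone first; the Deny lines below then override for listed IPs.
    lines = ["Order Allow,Deny", "Allow from all"]
    for raw in addresses:
        # ip_address() raises ValueError on malformed input, so a typo
        # is caught before it reaches the live .htaccess file.
        ip = ipaddress.ip_address(raw.strip())
        lines.append(f"Deny from {ip}")
    return "\n".join(lines)
```

Calling `htaccess_deny_block(["192.0.2.10", "203.0.113.5"])` returns text that you could paste into .htaccess after taking a backup of the existing file.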

#3 Feed obfuscation or cloaking

This is a technique that I found on another blog, and I have not tried it myself. Basically, the idea is to fool splog bots by serving them a different version of the feed content from what humans and clean bots see. You may read about this feed cloaking technique here.
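The selection logic behind cloaking can be sketched in a few lines of Python. This is only an illustration of the idea, not the implementation from the linked post or from the anti-leech plugin. The blocklist entries and the names `is_splogger` and `choose_feed` are made up for the example.

```python
BLOCKED_IPS = {"192.0.2.10"}                 # hypothetical splogger address
BLOCKED_AGENT_MARKERS = ("examplescraper",)  # hypothetical user-agent substring


def is_splogger(ip: str, user_agent: str) -> bool:
    """Match the request against the IP and user-agent blocklists."""
    agent = user_agent.lower()
    return ip in BLOCKED_IPS or any(m in agent for m in BLOCKED_AGENT_MARKERS)


def choose_feed(ip: str, user_agent: str, real_feed: str, decoy_feed: str) -> str:
    """Serve the decoy feed to suspected sploggers and the real one otherwise."""
    return decoy_feed if is_splogger(ip, user_agent) else real_feed
```

Because user-agent strings are trivial to fake and IP addresses change, a check like this would only catch scrapers that do not bother to disguise themselves.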

#4 Contact the hosting service of scrapers

Since most splogs do not provide contact information or a contact page, once scraping is discovered it is a good idea to contact the spam blogger’s hosting provider directly. Most hosting services (unless they are blackhats themselves) respond positively to such support queries in an effort to stop plagiarism.

#5 Legal action

If you run a professional blog that has some quality intellectual property, it is a good idea to maintain a copyright notice and also to proceed legally when copyright infringement is discovered. This may be an expensive but effective way to protect yourself against organized splogging. It can also reveal the intentions behind the splogging, for example whether it was done to malign the brand. Legal action is usually associated with big corporate blogs or famous bloggers suffering from scraping.

Over to you

Now, I would like to know if your blog has been suffering at the hands of sploggers. If so, what mechanisms are you using at the moment to prevent these content thieves from scraping? Are there any mechanisms other than those listed above (e.g. Creative Commons?) to effectively deal with plagiarism?

Comments

  1. Raju :

    More often than not, a simple mail to the owner of the blog will suffice. But at times I had to contact the host, giving them the exact details. A couple of times the account got suspended as well. I do not think an IP ban, use of a plugin, etc. are effective and reliable.

    Raju’s last blog post… [TP : TimePass] Winnie the Pooh and Swine Flu

  2. Archie :

    Most scraper bots remove links from the post title to the blogger’s blog. But if you link to another post from your new post, most of the time it doesn’t get removed. You can try linking to another post in your first or second sentence, since most scrapers only use a snippet. This way you have a chance of scoring a link if the feed gets scraped. Not the best link, but hey, a link is a link. You double the fun if your older post was linked to by a keyword. Something I learned from good ol’ Vic from Blogger Unleashed.

  3. Nihar :

    Ajith,

    When I post cricket-related posts, I immediately get trackbacks. I noticed it but never did much. I just don’t approve them.

    Will just leaving it like that affect my blog’s SEO? Which step should I follow?

    Which one should I choose to stop blog scraping…

    Nihar’s last blog post… Cannot dial 000 800 xxx Toll Free number from BSNL Landline Phone – Solution

  4. rads :

    Hmm is there a way to find this out in blogger? Coz I’m using blogger and I’ve no idea if anybody out there is copying and getting away with my content.

    rads’s last blog post… Wuthering Heights

  5. I get tons of these damn scrapes every day. Till now I have only been deleting them. I’m gonna try the plugin you gave above. It is mostly made-for-Google-Trends-only blogs or supposed-to-look-like-news websites that do this. Normally, if someone has used a part of what I have written and added his opinion, then I approve it, but with many of them, when you go and have a look at the site, it’s clear that they are using WP-o-Matic to process the feeds. I simply mark them as spam and then delete them.

    But the anti-leech plugin seems fun. Gonna try. Thanks for sharing.

    Kurt Avish’s last blog post… World Photo Focus Weekly – Episode 2

  6. Hi,

    I have dealt with some of them myself. I sent a simple email to many of them and it worked! That’s the easiest of all.

    If you are going with point 5, (the most interesting of all the above :D) I have some findings and resources here: Sue Copyright Infringers

    Leave them a polite email, else go with point 5! it works really well 😉

    Arun

    Arun Basil Lal’s last blog post… Stay Updated on Your Favorite RSS with Gmail

  7. One more option you’ve missed in your list. This option is not popular in the MMO and blogs-about-blogging world. This may not work for blogs like yours, as your readers are primarily other bloggers, who may get upset and unsubscribe if you change to partial feed.

    Option#6 – Don’t publish full feed. Instead publish summary only in your feed

    Personally, I had a lot of issues with scrapers. I’m seriously considering option#4 and #5 you’ve mentioned above. But it will take some time to really eliminate all those aggressive scrapers. Some splogs are so good that they look very professional, and Google even thinks their content is the original and not mine.

    When I switched to partial feed (recently), I pretty much eliminated all the scraping overnight. However, I’m sure a few regular readers would’ve unsubscribed.

    Down the road, once I take care of all scrapers using option#4 and #5, I will consider switching back to full feed

  8. Though the sad part is that point 4 and point 5 don’t work in most cases because of hosting company TOS and blogging platform rules :|.

    Harsh Agrawal’s last blog post… How to use ultrasurf proxy software to unblock any blocked website

  9. Tech @ InkAPoint :

    Since most sploggers and spammers use random IPs and proxies, it is still hard to block them.

  10. Hi everyone,
    Thanks for your views… In fact, the other option mentioned by Raju and Arun really works if you have contact information. It just proved handy in removing one scraper yesterday :)

    I am yet to try the partial feed option as suggested by Ramesh though I still prefer providing full feed to the subscribers.

    @Rads, for blogger, I do not know any gadget that might show the backlinks but you could always use a backlink checker tool to see if scraping is taking place on your latest posts. I guess, usually it’s less found on atom feeds.

    @Luis (Archies), I didn’t quite understand your method. Could you elaborate?

    PS: I have highlighted the other options mentioned by the commenters so that readers can be benefited

  11. John :

    glad you like the post… I wrote an article about scraping via email. In it, I detailed how spam bloggers, especially those using Blogspot, .

    Exotic limo in FL

  12. Oops!! Initially I was approving these pingbacks on my blog, thinking that someone had linked to my post. But yes, I know now that these are auto-targeted links, scraping your valuable content.

  13. Raju :

    The partial feed option will surely piss off your readers. I personally do not subscribe to any blog which offers a partial feed (it defeats the purpose of a feed, doesn’t it?).

  14. Chetan :

    Can you explain to me how to change the .htaccess file?

    Chetan’s last blog post… (Audi Snook Consept) : 4 wheelers,2 wheelers… Future might be 1 wheeler

  15. @John, thanks for your comment. I shall read your post on email scraping.

    @Shanker, I was doing much the same by just approving those trackbacks thinking that they are genuine.

    @Raju, you are right. I am all for full feed…

    @Chetan, how to edit htaccess is not in my scope :) However, you could use the Deny from ‘IP Address’ commands to deny selected clients.

  16. Jacques Snyman :

    It is a concern that one’s hard work can so easily be plagiarised and actually undo some of the good work you’ve done in the process. It is actually damn unfair, and the perpetrators should be held accountable and stripped of any search rankings they might have gained through indulging in this nonsense.

  17. I read somewhere that if the site contains google ads that you can also report them to google. If they lose their ad revenue, then there really isn’t much point to the scraping.

    I do not think that we should switch to partial feed. I don’t unsubscribe from sites that use them but it really annoys me. I also don’t like taking a service away from the “good” people as a way to combat the “bad” ones.

    Kim Woodbridge’s last blog post… 5 Most Popular (Anti) Social Posts of All Time

  18. Smokey :

    If someone refuses to take your content off their site you can also pursue different copyright/spam reporting avenues

    http://www.search-marketing.info/search-engines/report-spam.htm

Trackbacks

  1. […] of the blog readers asked me this question via email a couple of days back. In your post about feed scraping you talked about contacting anonymous scrapers. How can I really do […]
