Academic Publishers as Spammers
Posted on August 2nd, 2007
The promise of high search engine rankings, and the ensuing traffic, is making some very large academic publishers use a black-hat spamming technique called 'cloaking' to attract visitors to their sites via the search engines. The idea of cloaking is simple: when the search engine indexing crawler requests a page, the website gives the crawler one version of the page. When a human visitor requests the same page, they see a different version. Distinguishing search engine crawlers from human traffic is very easy, and in the case of Google and MSN/Live, it's 100% fool-proof.
The publishers in question all behave in a similar pattern: the search engine results are for a PDF file, presumably a paper that Google thinks is relevant. When a user clicks the link to the PDF, they are instead presented with information about how to purchase the article. So in this case, the cloaked version of the page is the PDF and the real version is the purchase form.
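To make the mechanism concrete, here is a minimal sketch of the decision a cloaking site would have to make on each request. It assumes the crawler is authenticated with a reverse DNS lookup followed by a confirming forward lookup, which is the method Google documents for verifying Googlebot. The function names, the hostname suffixes and the page labels are my own illustrative choices; I have no knowledge of what the publishers actually run.

```python
import socket
from typing import Callable, List

# Assumption: hostname suffixes used by Google's crawler.
# Other search engines publish their own crawler domains.
CRAWLER_SUFFIXES = (".googlebot.com", ".google.com")


def reverse_lookup(ip: str) -> str:
    """Return the PTR hostname for an IP address, or '' if none exists."""
    try:
        return socket.gethostbyaddr(ip)[0]
    except OSError:
        return ""


def forward_lookup(host: str) -> List[str]:
    """Return the IPv4 addresses a hostname resolves to, or [] on failure."""
    try:
        return socket.gethostbyname_ex(host)[2]
    except OSError:
        return []


def is_verified_crawler(
    ip: str,
    reverse: Callable[[str], str] = reverse_lookup,
    forward: Callable[[str], List[str]] = forward_lookup,
) -> bool:
    """Two-step check: the PTR record must sit in a crawler domain,
    and that hostname must resolve back to the same IP address."""
    host = reverse(ip).lower().rstrip(".")
    if not host.endswith(CRAWLER_SUFFIXES):
        return False
    return ip in forward(host)


def choose_page(ip: str, is_crawler: Callable[[str], bool] = is_verified_crawler) -> str:
    """Hypothetical cloaking decision: the crawler gets the PDF,
    everyone else gets the purchase form."""
    return "full_pdf" if is_crawler(ip) else "purchase_form"
```

The forward lookup is the step that matters: anyone who controls the reverse DNS for their own address range can make it claim a crawler hostname, but they cannot make that hostname resolve back to their address. A forged User-Agent header alone never passes this check.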
I've been getting more and more annoyed by these publishers, and two days ago John Baez from the Department of Mathematics, University of California, Riverside, complained about this spam. The comments show just how annoyed people are about this. John's post prompted me to join the naming and shaming of these spammers, and so without further delay...
SpringerLink
They seem to be the best at gaming Google. The papers hosted on springerlink.com rank highly in many fields of knowledge. The simplest way to out them is to search for [site:springerlink.com filetype:pdf], which queries Google for all PDF files hosted on springerlink.com. A screenshot of the results I'm getting is below. Just click any of these purportedly PDF files and see what happens. For the first result in the screenshot, I'm getting this page.

IngentaConnect
Next up is ingentaconnect.com. They don't have much of a search engine presence, but they still cloak. An example is the search in the screenshot below using the search term [site:www.ingentaconnect.com intitle:"journal"]. The third result I get is a PDF that, when I click on it, leads to this page.

Royal Society of Chemistry
Yep, rsc.org. The screenshot below shows how to find cloaked PDFs by searching for [site:rsc.org filetype:pdf "carbon dioxide"]. The very first result leads to a page asking for £22.

Taylor & Francis
T&F host a lot (all?) of their journals on informaworld.com. So what does Google say for [site:www.informaworld.com filetype:pdf "carbon dioxide"]? The third result in the screenshot below takes me to a page asking for £18 this time.

Conclusion
Well, the publishers are clearly cloaking their pages for the Google crawler. This contravenes Google's webmaster guidelines, and lesser websites have been removed from the Google index for similar tactics. What's amusing is that Google keeps making a big fuss about how cloaking is bad but doesn't do anything about these big publishers.
There is a question here we have to discuss: is this really cloaking if paying customers eventually end up seeing the same content that the search engine crawler sees? Some think in these instances it is not cloaking, but I beg to differ: What the average user sees should be identical to what the search engine crawlers see. If a page is in the search engine index it means that anyone can have access to it, without the need to register (even if free) or pay. That's my 2c. Take it or leave it.
August 2nd, 2007 at 5:48 pm
Ah, but what you or I may think “should be” isn’t necessarily the way “it is”. I’m fairly certain this is perfectly acceptable to the search engines, because the content is the same to paying or registered visitors as it is to the search engines.
August 2nd, 2007 at 7:58 pm
Well Donna, there is a key problem with this type of cloaking: When a user goes to a typical web page, they can decide whether to buy/subscribe/click an ad/etc AFTER they read the content. The difference with the publications listed above is that the monetization of the traffic occurs BEFORE the content is seen instead of after. I don’t think that’s a legitimate use of cloaking.
Also, in a way, the publishers are giving their content for free to Google (to index it) but you and I have to pay for it. They can do whatever they want with their content, but I don’t think it’s right for it to be included in the index.
Pierre
August 2nd, 2007 at 8:25 pm
I completely agree with you about whether or not it is “right”. It makes me mad, in fact. However, under the current “state of being” that is Google, it doesn’t constitute spamming. It may be unacceptable to us as users, but not to Google.
August 2nd, 2007 at 8:39 pm
Doesn’t this mean that, if I could disguise myself as a search engine, I should be able to see the original pdf? Unfortunately my feeble attempts to try that got only “Page moved to here” messages.
August 2nd, 2007 at 9:05 pm
Pops,
You can’t disguise yourself as a search engine because you will never be coming from the right IP addresses. The full details are in the authentication tutorial I linked to in the post.
August 2nd, 2007 at 10:34 pm
I’d imagine Google also uses this data to show the ‘scholarly papers’ for the terms.
As far as I remember, showing paid stuff to SEs is cloaking. I see no reason why they don’t make a good landing page on that, instead. Maybe because everyone links to the .PDFs?
August 2nd, 2007 at 11:37 pm
hmmmm - seems more of a paid wall issue than a true full blown balls out cloaking situation. Wall Street Journal makes you register (granted it’s free).
How else would you recommend information services market their wares through natural search?
btw - notice some of them you can view as html and read for free. If they were truly smart enough to be IP cloaking I doubt they’d leave that loose thread out there.
AND cloaking does not equal black hat spamming.
August 3rd, 2007 at 1:40 am
It’s cloaking, yes. But it’s with Google’s full cooperation. These publishers are part of the Google Scholar program that specifically allows them to do this, annoying though it is. They’ve been doing it since 2004.
August 3rd, 2007 at 7:51 am
Thanks for dropping by everyone.
Danny:
I’ve left some comments about this on Sphinn, so I won’t repeat myself here.
Oilman:
You don’t need to fully index the papers to know what they’re talking about. The title and abstract are vetted by peer review and so they are an accurate reflection (and summary) of the paper. On top of that, the search engines can do link analysis by looking at the citations in the literature. Each paper’s references are freely available (on 95% of journals anyway) and so a link graph is doable. No doubt, the SEs will find papers in the near vicinity that are freely available, and so the niche and context of a lot of the citations will be known.
As for some having free HTML: really depends on the CMS they use. It’s rare to be able to see the whole paper as HTML but not PDF. What you can (almost) always get is the abstract in HTML.
Cloaking made its name by being used by black hatters. You’re right they’re not equal, but cloaking is a form of spam IMHO.
Pierre
August 3rd, 2007 at 8:39 am
>Cloaking made its name by being used by black hatters. You’re right they’re not equal, but cloaking is a form of spam IMHO.
I disagree. Cloaking is just a technology, hence not good or bad. I do a lot of search engine friendly cloaking, and technically even geotargeting is cloaking.
August 3rd, 2007 at 8:52 am
True, point taken.
However, the examples I cited above are polluting the SERPs just as email spam is polluting inboxes. Let’s not quibble over semantics.
August 3rd, 2007 at 10:27 am
Right now when I click on these cloaked PDF files put out by Springer, I get a moronic general-purpose portal like this. What does this accomplish, besides annoying me? Do they really think I’m going to take the trouble to search their entire database for the article they’re not going to give me?
Other people, with other IP addresses, get something a bit more helpful, like this: an abstract, with bibliographical information, and a link to click on if you want to buy the article.
All I’m asking for is a bit of honesty. If I type a phrase into Google and Springer has a journal article containing this phrase, which they are willing to sell me, I wouldn’t mind seeing a link to an abstract of that article. What I don’t want is a link to what pretends to be a PDF file of the article, but actually is just an abstract or a moronic general-purpose portal.
On a different note, Sebastian wrote:
Cloaking is just a technology, hence not good or bad.
How about this?
The Virgin of Nuremberg is just a technology, hence not good or bad.
I’m not trying to compare cloaking with this medieval torture device, just questioning the “logic” of Sebastian’s argument. Technology is not always value-neutral!
August 3rd, 2007 at 11:32 am
Trying to end the debate on semantics by not starting a philosophical debate. That instrument is not a technology. Better analogy: one could say frames are an evil technology. Also not true, architectural use of frames in Web design is evil.
Back to the topic. These search results should not exist without a proper label.
August 3rd, 2007 at 11:42 am
[…] Some examples from blogsci […]
August 3rd, 2007 at 2:32 pm
re: How else would you recommend information services market their wares through natural search?
Not sure I agree with Pierre’s response - just indexing the title or abstract won’t give you a hit for related info found in the full text. But either way, I think there is a bigger question here: How can content owners and Search Engines provide access to information that is not completely free? (subscription, registration, verification, advice).
See here for more:
http://www.cre8asiteforums.com/forums/index.php?showtopic=52710&st=0&p=236354&#entry236354
And here for a current example of content cloaking by Science magazine:
http://www.cre8asiteforums.com/forums/index.php?showtopic=52710&st=0&p=236532&#entry236532
August 3rd, 2007 at 2:37 pm
Change My Name to Springer
Dear academic publishing industry: play nice, and we won’t crush you under our advancing wall of ice.
Google Scholar’s publisher policies insist that people searching journal articles through Google “must be offered at least a comple…