Brain Whack

Posted on September 22nd, 2007

Some puzzles and brain teasers are not simple exercises; they border on torture. But that's what makes a good puzzle good: it seems impossible and keeps you hooked until you solve it. So now that the days are getting shorter and the weather colder (at least for us folks in the Northern Hemisphere...), here are some logic puzzles and brain teasers to keep you company.

  • wu: riddles: the more mainstream kind of logic problems classified as easy, medium, and hard. Prepare to waste hours on this collection.
  • Classic Computer Science Puzzles from Coding Horror. If mathematical algorithms are your thing, these should be easy. Ha ha.
  • Project Euler is a collection of math problems that require programming to actually solve.
  • Finally, the shameless plug: blogSci's very own Daily Mental Calisthenics gives you a word of the day, a Stroop effect puzzle and a Sudoku to kick-start your day.

Please let me know your favorite collection of puzzles and brain teasers either in the comments below or by email.

Summary of Academic Publishers Cloaking Discussion

Posted on August 5th, 2007

So the dust seems to have settled a bit about the issue of academic publishers cloaking their pages to Google. This post is a summary of the facts that emerged and the observations made, a quick recap tying it all together, and a suggestion for the next step.

For reference, the links around the web are:

Facts and Observations

There are three facets to this debate:

  1. Technical side re how this cloaking is implemented
  2. Google's policies regarding this issue
  3. User perception of this issue

So with that, the following points have been made:

    • These publishers are part of the Google Scholar program. Google initially contacted a few major publishers to join the program.
    • The cloaking is IP-based, as simply switching the User Agent to Googlebot's doesn't work. That's not surprising to us in the field; I linked to how you can do that (with full code) in my previous post.
    • Definition of cloaking from Google's guidelines:

      Cloaking refers to the practice of presenting different content or URLs to users and search engines. Serving up different results based on user agent may cause your site to be perceived as deceptive and removed from the Google index.

      So there is no doubt this is an example of cloaking. The question is whether this is acceptable or not.
    • A relevant quote from the Google Scholar Publisher Policies:

      Google users must be offered at least a complete abstract. This is a crucial component of our indexing program. For papers with access restrictions, a full author-written abstract will help users choose among the results which paper is the most likely to have the information they are looking for.

      Some people pointed out that this is not always happening with some SpringerLink articles.
    • A lot of academics are annoyed by this cloaking. The sentiments of the comments on John Baez's original post speak loudly. The blogosphere has other posts from annoyed academics.
    • I get the feeling that people would be happy to keep the for-pay results in the Scholar search results but keep them away from the main search results. If that happens (and I think it should), the for-pay results need to be labelled clearly. John Mueller wrote a comment on the Sphinn story about how this already happens with Google News.
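
The user-agent test mentioned in the list above can be sketched in Python. This is a minimal sketch, not the code from the earlier post: the function names and the Googlebot User-Agent string are assumptions for illustration. If both requests come back the same, as reported for these publishers, the cloaking is not keyed on the user agent, which is consistent with it being IP-based.

```python
import urllib.request

# User-Agent string Googlebot has historically sent; treated here as an assumption.
GOOGLEBOT_UA = "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"


def build_request(url, user_agent=None):
    """Build a GET request, optionally overriding the User-Agent header."""
    headers = {"User-Agent": user_agent} if user_agent else {}
    return urllib.request.Request(url, headers=headers)


def content_type(url, user_agent=None, timeout=10):
    """Fetch the URL and return the Content-Type the server reports."""
    request = build_request(url, user_agent)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return response.headers.get("Content-Type", "")


def looks_user_agent_cloaked(url):
    """True if the server answers differently when we claim to be Googlebot."""
    return content_type(url) != content_type(url, GOOGLEBOT_UA)
```

Calling looks_user_agent_cloaked() on the URL of a search result that is labelled as a PDF makes two real HTTP requests, so run it sparingly. Comparing only the Content-Type is a crude signal; comparing the response bodies would be a stricter check.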

So what now?

Some people are clearly upset. Some are upset at expensive publishers in general (so having them in Google's results makes things worse), and some are upset that Google is letting some publishers break its terms of service and policies so obviously, without any perceived reward for the user.

Fundamentally, I believe the question of what's acceptable cloaking and what isn't boils down to user perception and expectation. If users expect to see for-pay content in the search results, they are OK with it, but please label it properly. Pubmed, a major aggregator of bioscience papers, has two icons to depict whether the paper is freely available (via an Open Access license) or only the abstract is freely available. There is no reason why Google shouldn't do this too.

The key question is what happens when cloaked results appear unexpectedly. Clearly people find this (very) annoying. Of course, Google's policy so far has been to ignore the cloaking (and thus allow it), as Google sort of needs it to be able to index the papers for Google Scholar. Well, Google, consider this set of posts as very vocal customer feedback: take for-pay content out of the main search engine results pages. We're OK with keeping it in the Scholar results, but label it.

And academics, you can do something about it! There are three things you can do:

  • In the short term, file a spam report with Google. Very inconveniently, there are two ways to do this. You can use the so-called unauthenticated submission form, and that's publicly accessible. Owners of websites can use the so-called authenticated form using their webmaster central form. More details about spam reporting from the horse's mouth.

    The spam details are as follows: state that you have found evidence for cloaking in the main search engine results pages (SERPs). Submit the full URL of the results page, state the apparent URL of the result (right click and copy the link location - exact wording varies in each browser), state that the result is labelled as a PDF file, and submit the URL you actually end at. This gives the spam team a full audit trail. If you can submit more than one example, do so. And tell them this is annoying you if it is.

  • Stop using Google! If their search results are not useful to you, use another search engine. MSN has a great Academic Search, and for general searches, try Yahoo!. I recommend Hakia as a decent search engine (it's still in beta, so the results can be spammy or a bit irrelevant) and there are hundreds of alternatives. Take your pick and vote with your feet!
  • In the long term, if access is important to you, publish in prestigious journals that have an Open Access policy you agree with. If enough people do that, the Open Access journals will get an increase in their impact factor and the administrators will be happy again. Having a debate about it in the journals themselves is also helpful. This is a question of awareness, and change can happen with time.

So that's it for now. I've already submitted an authenticated spam report to Google. Let's hope there is a response!

Academic Publishers as Spammers

Posted on August 2nd, 2007

The promise of high search engine rankings, and the ensuing traffic, is making some very large academic publishers use a black-hat spamming technique called 'cloaking' to attract visitors to their sites via the search engines. The idea of cloaking is simple: when the search engine indexing crawler requests a page, the website gives the crawler one version of the page. When a human visitor requests the same page, they see a different version. Distinguishing search engine crawlers from human traffic is very easy, and in the case of Google and MSN/Live, it's 100% fool-proof.
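
To make the crawler detection concrete, here is a minimal Python sketch of the reverse-then-forward DNS check that Google has documented for verifying Googlebot. It is an illustration under stated assumptions, not the publishers' actual code: the function name and the injectable lookup parameters are hypothetical, and they exist only so the logic can be exercised without a network.

```python
import socket


def is_verified_googlebot(ip, reverse_lookup=None, forward_lookup=None):
    """Reverse-then-forward DNS check for an address claiming to be Googlebot."""
    # Default to real DNS lookups; callers may inject fakes for offline testing.
    reverse_lookup = reverse_lookup or (lambda addr: socket.gethostbyaddr(addr)[0])
    forward_lookup = forward_lookup or (lambda host: socket.gethostbyname_ex(host)[2])
    try:
        # Step 1: the IP must reverse-resolve to a Google-owned hostname.
        host = reverse_lookup(ip)
        if not host.endswith((".googlebot.com", ".google.com")):
            return False
        # Step 2: that hostname must resolve forward to the same IP,
        # otherwise the reverse record could simply be forged.
        return ip in forward_lookup(host)
    except OSError:
        # Lookup failures (socket.herror, socket.gaierror) mean "not verified".
        return False
```

A cloaking site would serve the PDF when a check like this returns True and the purchase form otherwise. Because the check depends on the requesting IP address rather than on anything the browser sends, a visitor cannot pass it by changing request headers.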

The publishers in question all behave in a similar pattern: the search engine results are for a PDF file, presumably a paper that Google thinks is relevant. When a user clicks the link to the PDF, they are instead presented with information about how to purchase the article. So in this case, the cloaked version of the page is the PDF and the real version is the purchase form.

I've been getting more and more annoyed by these publishers, and two days ago John Baez from the Department of Mathematics, University of California, Riverside, complained about this spam. The comments show just how annoyed people are about this. John's post prompted me to join the naming and shaming of these spammers, and so without further delay...

SpringerLink

They seem to be the best at gaming Google. The papers hosted on springerlink.com rank highly in many fields of knowledge. The simplest way to out them is the search for [site:springerlink.com filetype:pdf], which queries Google for all PDF files hosted on springerlink.com. A screenshot of the results I'm getting is below. Just click any of these purportedly PDF files and see what happens. For the first result in the screen shot, I'm getting this page.

SpringerLink spamming Google

IngentaConnect

Next up is ingentaconnect.com. They don't have much of a search engine presence, but they still cloak. An example is the search in the screenshot below using the search term [site:www.ingentaconnect.com intitle:"journal"]. The third result I get is a PDF that, when clicked, leads to this page.

ingentaconnect.com spamming Google

Royal Society of Chemistry

Yep, rsc.org. The screenshot below shows how to find cloaked PDFs by searching for [site:rsc.org filetype:pdf "carbon dioxide"]. The very first result leads to a page asking for £22.

rsc.org spamming Google

Taylor & Francis

T&F host a lot (all?) of their journals on informaworld.com. So what does Google say for [site:www.informaworld.com filetype:pdf "carbon dioxide"]? The third result in the screenshot below takes me to a page asking for £18 this time.

T&F spamming Google

Conclusion

Well, the publishers are clearly cloaking their pages for the Google crawler. This contravenes Google's webmaster guidelines, and lesser websites have been removed from the Google index for similar tactics. What's amusing is that Google keeps making a big fuss about how cloaking is bad but doesn't do anything about these big publishers.

There is a question here we have to discuss: is this really cloaking if paying customers eventually end up seeing the same content that the search engine crawler sees? Some think that in these instances it is not cloaking, but I beg to differ: what the average user sees should be identical to what the search engine crawlers see. If a page is in the search engine index, it means that anyone can have access to it without needing to register (even if free) or pay. That's my 2c. Take it or leave it.

Science Bloggers Must Not Be Anonymous!

Posted on December 11th, 2006

We must stand behind our words for readers to trust us.

Over at Science Blogs, Steve asks:

Should I become anonymous? Is it going to affect my job search in a few more years?

That's a good question: how does academia perceive bloggers? Are we just a bunch of lazy lab escapees who can type, or are we a more serious bunch bringing a passionate new voice to complex technical issues? I believe we're in the latter group, and here is why.

This very blog started because everyone was asking me questions when I started my PhD. Most of the questions I got were of two types:

  • Someone watched something on TV or read in the newspaper and just didn't get why it's important.
  • Someone developed their own theory to explain something in life... and wanted me to rubber stamp it.

Clearly, scientists in general are really bad at explaining to the layman what science is really about. Why are stem cells important? Just what on Earth is evolution? There is so much sensationalized misinformation out there, it's sickening! Science bloggers should be a voice that everyone can understand and trust.

So should I be anonymous? No way! Quacks hide behind a veil of anonymity. Standing behind what I say in public about important issues adds to my credibility (hopefully). I reference everything I write about, and answer every question. You can hold me accountable in the comments below every post or on your own blog.

The last point is what it's all about. Science bloggers are like teachers: I wouldn't trust an anonymous teacher I couldn't call out.

As for career prospects: sadly, academia hires based on educational pedigrees. Regardless of whether we blog or not, where we did our research and who we know count a lot more than our blogs.

What Will Happen in the Next 50 Years?

Posted on December 4th, 2006

Experts forecast.

In celebration of its 50th anniversary, New Scientist asked 70 world-renowned experts and scientists about their thoughts on the next 50 years. The future-gazing writings make very good reading. Enjoy!

Mental Date Arithmetic

Posted on November 30th, 2006

Ever wondered how to calculate the day of the week for any date in the past or future? Turns out, it's quite easy. Full details over here.
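
For a flavour of what such a calculation looks like, here is a minimal Python sketch using Zeller's congruence for Gregorian dates. This is one well-known formula and not necessarily the mental method described at the link; the function name is just for illustration.

```python
def day_of_week(year, month, day):
    """Return the weekday name for a Gregorian date using Zeller's congruence."""
    # January and February count as months 13 and 14 of the previous year.
    if month < 3:
        month += 12
        year -= 1
    k = year % 100   # year within the century
    j = year // 100  # zero-based century
    h = (day + (13 * (month + 1)) // 5 + k + k // 4 + j // 4 + 5 * j) % 7
    # In Zeller's convention 0 is Saturday, 1 is Sunday, and so on.
    names = ["Saturday", "Sunday", "Monday", "Tuesday",
             "Wednesday", "Thursday", "Friday"]
    return names[h]


print(day_of_week(2006, 11, 30))  # Thursday
```

The trick of treating January and February as the tail end of the previous year keeps the leap day at the end of the counted year, so it never disturbs the month offsets.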

Get Your Questions Answered Online

Posted on November 19th, 2006

This post is a bit of a bookmark for me, but I thought I'd share it. If you have a random burning question you'd like answered, there are some websites that let you post your questions and let (millions of) others answer them. What's nice is that the questions can be very diverse: deep, technical, about everyday life... well, anything!

By far, the best such service out there is Yahoo! Answers. Yahoo promoted this service heavily, and the end result is that millions of people use it, both asking questions and answering them. You can also subscribe to RSS feeds for the categories or even searches. It's all free too. Very cool.

Update: Google Answers is no more. Google has a similar offering, called Google Answers. You have to pay to get answers. Personally, I don't like the interface (typical of the spartan Google designs). It should be your second port of call.

I found another service today, called Quick Answers. They don't have a science category (gasp!), but cover pretty much everything else. You have to pay a minimum of $20 for a question.

MSN Live’s new academic search engine

Posted on April 19th, 2006

Scientific literature search

A very cool new service, part of the upcoming MSN Live website: Windows Live Academic Search. It uses the same functional (if a bit slow) and information-dense interface as the main Live search site, but with a focus on academic papers. Check it out.