Crawling the Best Pages on the Web

12 Nov 2009

Today, we’d like to share some thoughts on one of the major components of any search engine, and one we’ve been focusing on a lot lately: the crawler.

How does a search engine find pages to include in the database? By automatically mass-surfing the web, jumping from page to page and grabbing the information found in the process. This is done by a program called a Crawler, Bot or Spider. Entireweb’s crawler, affectionately called Speedy Spider, simultaneously runs on a large number of machines and visits tens of thousands of pages at the same time, all day, every day.
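
As a rough illustration of "jumping from page to page", here is a minimal breadth-first crawl loop. It is only a sketch and not Speedy Spider's actual code: the `crawl` function, the `fetch_links` callback and the `TOY_WEB` link graph are all hypothetical names invented for this example, and a small in-memory dictionary stands in for real HTTP fetching.

```python
from collections import deque


def crawl(seed_urls, fetch_links, max_pages=1000):
    """Breadth-first crawl: visit each URL at most once, following found links.

    fetch_links(url) is an assumed callback that returns the links on a page.
    """
    frontier = deque(seed_urls)   # URLs waiting to be visited
    seen = set(seed_urls)         # URLs already queued, to avoid repeats
    visited = []                  # URLs in the order they were visited
    while frontier and len(visited) < max_pages:
        url = frontier.popleft()
        visited.append(url)
        for link in fetch_links(url):
            if link not in seen:
                seen.add(link)
                frontier.append(link)
    return visited


# Hypothetical four-page web used instead of real network access.
TOY_WEB = {
    "a.example/": ["a.example/about", "b.example/"],
    "a.example/about": ["a.example/"],
    "b.example/": ["c.example/", "a.example/"],
    "c.example/": [],
}

order = crawl(["a.example/"], lambda url: TOY_WEB.get(url, []))
```

A production crawler would also respect robots.txt, rate-limit requests per host, and spread the frontier across many machines; none of that is shown here.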

"Garbage in, Garbage out" is an old engineering proverb: To build a great search engine, you need to start at the source, namely the way you select pages to include, at what schedule you visit them, and what you do with the stuff you find on the pages (that’s what’s often called Indexing). This is a problem we’re taking very seriously.

The ideal crawler makes sure to grab every page on the net that is of any interest to human searchers, and does so as soon as it has been created or updated. Our goal is to approximate this as closely as we can.
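
One common way to approximate "as soon as it has been updated" is to revisit pages that change often more frequently than pages that rarely change. The sketch below shows that idea under simple assumptions: the `next_interval` function, the halve-or-double rule and the one-hour and 30-day bounds are illustrative choices made for this example, not a description of how Speedy Spider schedules its visits.

```python
def next_interval(current_hours, changed, min_hours=1.0, max_hours=720.0):
    """Return the number of hours to wait before the next visit to a page.

    If the page changed since the last visit, halve the wait; otherwise
    double it. The result is clamped to [min_hours, max_hours].
    """
    factor = 0.5 if changed else 2.0
    return max(min_hours, min(max_hours, current_hours * factor))


# A page first checked daily that changes on two visits, then stays the same.
interval = 24.0
history = []
for changed in [True, True, False]:
    interval = next_interval(interval, changed)
    history.append(interval)
```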

We feel there is no point in including pages that don’t carry any tangible information. For example, many search engines will include pages that contain little more than advertisements, simply because the search engine may benefit financially from showing such ads. These pages, which include things like parked domains, scraper sites and various kinds of web spam, do little more than dilute the search results. We’ve also improved our duplicate content removal, so that the information you find in our search results will be truly unique, diverse and useful.
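
To give a feel for what duplicate content removal can involve, here is a small sketch of one well-known technique: comparing pages by their overlapping word sequences ("shingles") and dropping any page that is too similar to one already kept. This is an assumption chosen for illustration, not Entireweb's actual method; the function names, the shingle size `k=3` and the `0.8` threshold are all hypothetical.

```python
def shingles(text, k=3):
    """Return the set of k-word sequences in text (lowercased)."""
    words = text.lower().split()
    if len(words) < k:
        return {tuple(words)} if words else set()
    return {tuple(words[i:i + k]) for i in range(len(words) - k + 1)}


def jaccard(a, b):
    """Jaccard similarity of two sets: shared items divided by all items."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def drop_near_duplicates(pages, threshold=0.8, k=3):
    """Keep the first page of each group of near-identical pages.

    pages is a list of (url, text) pairs; returns the URLs that were kept.
    """
    kept = []  # (url, shingle set) for every page kept so far
    for url, text in pages:
        s = shingles(text, k)
        if all(jaccard(s, prev) < threshold for _, prev in kept):
            kept.append((url, s))
    return [url for url, _ in kept]


pages = [
    ("u1", "the quick brown fox jumps over the lazy dog"),
    ("u2", "The quick Brown fox jumps over the lazy dog"),
    ("u3", "completely different text about web crawlers and indexing"),
]
unique_urls = drop_near_duplicates(pages)
```

Comparing every page against every kept page does not scale to a full web index; large systems typically use hashing schemes to find candidate duplicates, which this sketch leaves out.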

6 Responses to Crawling the Best Pages on the Web

riju

November 22nd, 2009 at 5:33 pm

I am looking forward to your SE. As the whole net is now dominated by a single SE, it is high time for another competitor to be there. All the other engines copied from the king SE have failed, as each is just a blatant copy of everything. I am waiting for the D-day of your release, and would like to be one of the first few to try a search in your engine. Good luck to all developers, and happy slaying the Goliath :)

Help and Assistance

November 25th, 2009 at 6:13 pm

Hello from the US.

I have tried out many different areas of the entireweb site and services.

I really like the search spy page and I like the targeting ability of speedy ads (though the US presence is still a little skimp)

However, when I use the search I can’t find my own site. To top it off, the speedy spider visits my site more than any other bot, yet every other search engine is really good to me, sending hundreds upon hundreds of visitors my way regularly. But no love on this end. I have submitted my site, but no love.

What can I do to make speedy like me?

Entireweb

December 1st, 2009 at 6:16 pm

@Help and Assistance: We are always interested in receiving feedback about our crawler. Please contact us at speedyspider(at)entireweb.com and let us know which domain you are referring to, and we’ll look into it!

ChrisW

December 11th, 2009 at 4:33 am

I have tried your Express Inclusion for many clients and I have yet to see any traffic in my analytics from EntireWeb or any other partners. It would be great to show how many times our listings appeared in search results and how many clicks you see from your side.

Entireweb

December 11th, 2009 at 5:03 pm

@ChrisW: We are currently rebuilding our Express Inclusion as well, and statistics is one of the new features that will be included!

Help and Assistance

December 20th, 2009 at 3:24 am

I will be emailing you folks and thanks for the response.

I am really pulling for you guys, I have always admired the usability of your back end. It has the potential to catch on in my opinion.

I got to tell you that the task of doing what you folks have set out to do is mind numbing. I wish you folks all the luck in the world.

I am curious if you guys have any sort of internship programs or other means of getting involved. Keep in mind I am from the US, but I would love to make the trip.
