The Internet is big. To give an example, 3 exabytes (one exabyte is 1,073,741,824 gigabytes) of data are generated every day. Recent estimates put the total amount of data in the world at 300 or so exabytes in 2007. At 3 exabytes a day, that total is matched in about 100 days, so every year the world generates more information than had existed in the entirety of history up to 2007. That’s quite a lot. It’s the biggest haystack in the world, and everything we want to know is a needle.
How do you find something on the Internet? Well, most of the time you would use a search engine, which is an intricate combination of computers, software, and math that helps you find what you’re looking for. There are a lot of search engines to choose from, some of the more famous being Google, Bing, and Yahoo. What makes them different? Well, that’s a secret. They have computer programs called algorithms that they use to search the web and give you the thing you’re looking for, or at least what they think you’re looking for. These algorithms are what make each search engine special, and these companies don’t like to tell people how they work. Despite this, you can still try to figure out what makes these search engines different by comparing what they give you when you search for something.
The main thing that determines the usefulness of a search engine is the relevance of the results, meaning, how close are the results to what it is that you wanted to find? What we are interested in is how well search engines take what they know about the internet and direct you to what you’re looking for. This is actually a very hard problem, and many people spend a lot of time working out better ways to solve it.
You have something in your head you want to see or learn more about. So, you type some words to try to communicate that to the search engine. These are called search terms. Inevitably, you have experienced a time when the search engine offered results that weren’t what you wanted. Sometimes the thing you’re looking for is referred to in different terms. Or maybe the word you use has many meanings and contexts, but you only want one of them. Search engines have to look at these nuances and possibilities and use them as a reference to sift through exabytes of data to find sites containing the specific thing you’re seeking.
Now, search engine technology has become very sophisticated, and most of the easy searches, like “cat,” will likely get you exactly what you’re looking for. An easy search term won’t help you figure out how the search engines are different. To find the differences between search engines, you want to try to think of more complicated search terms that could refer to multiple different things. You can try searching for your friends, that one movie you saw with that guy, your favorite spot in the city where you live, etc.
Problem: What makes search engines different?
Materials:
- Internet Access
- Pencil & Paper
Procedure:
- Make a list of search terms to use on all of the search engines you want to test.
- Use each search term on each engine and record the results you get. You can save a copy of the webpage, or just take notes on how relevant the results are to your search.
- Keep trying search terms, especially if you found what you were looking for the first time. What kinds of search terms always find relevant sites? What kinds don’t? Were there any results that were unexpected to you? How did the search engines differ?
Search engines are really good at looking for web pages that include a combination of words. If you come up with enough words related to the same concept, the search engines will often find exactly what you’re looking for. What search engines aren’t so good at is synthesizing information. If you’re looking for a website based on a detail unrelated to the content, like what color the logo is, you’re out of luck. Search terms that indicate something very specific may sometimes yield the desired results, but they can also fail to generate anything related to what you have in mind. The search terms that produce the most consistently relevant results are those that ask for a simple, commonly discussed fact: “the radius of the earth” or “average car length” are easy to find, but something like “how many cars do I have to line up end-to-end to span the radius of the earth” doesn’t work quite as well.
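The cars question is hard for a search engine because it means combining two facts rather than looking one up. Once you have searched for each number yourself, the combining step is simple arithmetic. Here is a minimal Python sketch, assuming a mean Earth radius of about 6,371 km and an average car length of about 4.5 m. The car length is an illustrative guess, not a figure from this article, and the function name is made up for this example.

```python
# Combine two separately searched facts into one answer.
EARTH_RADIUS_M = 6_371_000   # mean radius of the Earth, about 6,371 km
AVERAGE_CAR_LENGTH_M = 4.5   # assumed value, for illustration only


def cars_to_span(distance_m, car_length_m):
    """Return how many cars, parked end-to-end, cover the distance."""
    return distance_m / car_length_m


cars = cars_to_span(EARTH_RADIUS_M, AVERAGE_CAR_LENGTH_M)
print(f"About {cars:,.0f} cars")
```

With these assumed numbers the answer comes out to roughly 1.4 million cars. A different guess for the car length would change the result, which is part of why no single web page is likely to state it.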
For search engines to dig through all the webpages on the internet, they use a special program called a web crawler, also known as a spider. A web crawler goes to a webpage and follows every link it can find, and then searches the next webpage for more links, and on and on. Its goal is to look at all the websites on the Internet and make a map of them. It then saves most of the text it sees as well as most of the pictures and some of the more data-intensive things such as video, depending on how it was programmed. This process is called caching. Once the web pages are cached, another program moves through them and indexes them. Indexing is much like what a librarian does. The program goes through every cached page and links it to any relevant item in the index. Most of the time individual items point to multiple websites. Indexing makes it faster for a computer to look for a specific search term. If the pages weren’t indexed it would take just as long to do your search as it took to crawl all the websites (that could take weeks!).
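The crawl, cache, and index steps can be sketched in a few lines of Python. This is only a toy under stated assumptions: the "web" here is a made-up dictionary of five pages rather than the real Internet, the index is a simple word-to-pages table, and the names TOY_WEB, crawl, and build_index are invented for this example. Real search engines do not publish how theirs work.

```python
from collections import deque

# A made-up miniature "web": each page has some text and links to other pages.
TOY_WEB = {
    "home": {"text": "welcome to the cat and dog site", "links": ["cats", "dogs"]},
    "cats": {"text": "cats like to nap", "links": ["home", "naps"]},
    "dogs": {"text": "dogs like to fetch", "links": ["home"]},
    "naps": {"text": "a nap is a short sleep", "links": []},
    "lost": {"text": "no page links here", "links": []},
}


def crawl(start):
    """Follow links outward from start and cache the text of each page visited."""
    cache = {}
    queue = deque([start])
    while queue:
        page = queue.popleft()
        if page in cache or page not in TOY_WEB:
            continue  # already cached, or a broken link
        cache[page] = TOY_WEB[page]["text"]
        queue.extend(TOY_WEB[page]["links"])
    return cache


def build_index(cache):
    """Map every word to the set of cached pages that contain it."""
    index = {}
    for page, text in cache.items():
        for word in text.split():
            index.setdefault(word, set()).add(page)
    return index


cache = crawl("home")
index = build_index(cache)
print(sorted(cache))          # pages the crawler reached
print(sorted(index["like"]))  # pages containing the word "like"
```

In this toy version the page "lost" is never cached, because no other page links to it and the crawler only moves along links. Looking up a word in the index is a single step, while answering the same question without an index would mean rereading every cached page.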
After everything is indexed, it is stored until a user searches for it. When you click “search,” the engine essentially finds all of the sites that have index terms for the search terms. The engine then has to decide in which order to present these findings to the user. This is where relevancy and many of these secret algorithms come into play. One of the approaches to figuring out the correct order is to sort the matches by which ones people clicked on after typing in certain words and phrases. Google was one of the first companies to use this approach. If people keep clicking the same link halfway down the page for a given search, then it’s likely that link is relevant to that search, and so the engine will move it higher up on the results page. Another approach is to look at things the user has searched for and clicked on before, and to adjust what they see on the page depending on that history. This approach isn’t perfect, however, since it can cause the user to only see things relevant to what they had done before, and keep them from running into anything new. There’s a lot of subtlety to searching the internet, as you noticed from your experiment, and the art of the search engine is continually developing.
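The lookup and the click-based ordering described above can be sketched as follows. This is a simplified illustration, not how any particular search engine actually ranks results: the index, the site names, and the click counts are all made up, and the function search is a hypothetical helper.

```python
# Hypothetical index: word -> set of pages that contain it.
INDEX = {
    "cat": {"pets.example", "zoo.example", "jazz.example"},
    "food": {"pets.example", "recipes.example"},
}

# Hypothetical click log: (search terms, page) -> number of past clicks.
CLICKS = {
    ("cat", "jazz.example"): 2,
    ("cat", "pets.example"): 40,
    ("cat", "zoo.example"): 15,
}


def search(query):
    """Return pages containing every query word, most-clicked first."""
    words = query.lower().split()
    if not words:
        return []
    normalized = " ".join(words)
    # Keep only pages that appear in the index entry for every word.
    matches = set.intersection(*(INDEX.get(word, set()) for word in words))
    # Sort by past clicks for this query (highest first), then by name for ties.
    return sorted(
        matches,
        key=lambda page: (-CLICKS.get((normalized, page), 0), page),
    )


print(search("cat"))
print(search("cat food"))
```

Here a search for "cat" lists pets.example first because it has the most recorded clicks, while "cat food" returns only the one page that contains both words. A personalized engine could use the same idea with a separate click log for each user, which is also how it could end up showing people mostly what they have clicked before.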