Excluding Junk Results From Search

booleanstrings

This is going to be a little bit technical… I am going to talk about pages that often show up in Google search results when I do sourcing, and that I’d rather skip. (If you don’t care to go through the technical stuff, please just skip to the end, or simply try my new Custom Search Engine.)

Many Internet researchers know that there’s the “surface web”, i.e. websites that can be found by crawling, and the “deep web”, i.e. everything else: content that a crawler cannot reach just by following links. The Deep Web is much, much bigger than the surface web.

Well, several years ago Google started putting some “deep web” pages into its index as well! Namely, when Google ran across forms to be filled out, it tried entering random information, as if it were a user, and reviewed the resulting page. If the page “made sense”, it was included in the index. Today, we can still encounter quite a few dynamically generated pages; try a search like inurl:index.php.src (add keywords if you like) and you will see them.

Some time after this clever addition to its index, Google became somewhat dissatisfied with the quality of the dynamic pages, and it now comes back to re-index them less often than it visits static pages. In response, SEO specialists have widely adopted a trick called URL rewriting. It works like this: any (reasonable) search executed on such a site creates a page with a static URL.

Here is an example:

Someone searches on pipl.com for Dave Smith. http://pipl.com/search/?q=Dave+Smith is a dynamic page. A side effect of the search will be a static page with a nice, clean URL:

http://pipl.com/directory/name/Smith/Dave

(I must say that I don’t know for a fact how pipl.com creates its static pages. But I am pretty sure they are generated as I describe.)
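To make the idea concrete, here is a minimal Python sketch of how a search URL could be mapped to a clean, static-looking path. It is only an illustration of the technique: the function name static_path_for_query is made up, the "first name, last name" assumption is mine, and nothing here is taken from how pipl.com actually works.

```python
from urllib.parse import urlsplit, parse_qs


def static_path_for_query(search_url):
    """Turn a dynamic search URL into a hypothetical static directory path.

    Assumes the query lives in a "q" parameter and consists of exactly
    two words, a first name followed by a last name.
    """
    params = parse_qs(urlsplit(search_url).query)
    # parse_qs decodes "+" as a space, so "Dave+Smith" becomes "Dave Smith".
    terms = params.get("q", [""])[0].split()
    if len(terms) != 2:
        return None  # not a simple "first last" search; no static page
    first, last = terms
    return f"/directory/name/{last}/{first}"


print(static_path_for_query("http://pipl.com/search/?q=Dave+Smith"))
```

A site using this kind of scheme would then serve the stored search results at the returned path, so a crawler sees what looks like an ordinary static page.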

In addition, websites whose content is “deep” because of membership and login requirements also expose some of that content to attract new users. They do this by providing static links that are visible on the “surface”. As an example, LinkedIn has recently generated a large number of special static pages like http://www.linkedin.com/title/distributor/in-us-752-savannah that are converted into people search queries if you are logged in! (Try clicking on this link with and without being logged in.)

A seemingly “static” page that leads you to a query is usually an irrelevant search result. Most of the time it will display a list of internal search results which, besides being just a list, may not even contain your search keywords any more. Why rewritten URLs show up so high in search results is a mystery to me!

Here’s my new Custom Search Engine, adjusted to not show junk results. I have removed dynamic pages as well as a list of “content farms” that I partially copied from Blekko’s initial list. (Blekko is now blocking more sites than any CSE would be able to include… but my search engine has a much larger index.) I have also blocked some sites that we have all seen while looking for people, the ones that promise to find “…everything!”.
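If you would rather filter a list of result URLs yourself, the same rules can be sketched in a few lines of Python. This is a rough approximation, not the actual configuration of my search engine: the blocked domains below are placeholders, and the path prefixes are simply the two examples discussed above.

```python
from urllib.parse import urlsplit

# Placeholder domains standing in for content farms and "find everything" sites.
BLOCKED_DOMAINS = {"contentfarm.example", "find-everything.example"}

# Static-looking paths that really lead to internal search results
# (taken from the two examples in this post).
BLOCKED_PATH_PREFIXES = {
    "www.linkedin.com": ("/title/",),
    "pipl.com": ("/directory/",),
}


def is_junk(url):
    """Return True if the URL looks like a junk search result."""
    parts = urlsplit(url)
    host = (parts.hostname or "").lower()
    # Rule 1: dynamic pages, recognized here by a query string.
    if parts.query:
        return True
    # Rule 2: blocked sites, including their subdomains.
    if host in BLOCKED_DOMAINS or any(host.endswith("." + d) for d in BLOCKED_DOMAINS):
        return True
    # Rule 3: rewritten URLs that match a known template on a given host.
    return parts.path.startswith(BLOCKED_PATH_PREFIXES.get(host, ()))


results = [
    "http://pipl.com/search/?q=Dave+Smith",
    "http://www.linkedin.com/title/distributor/in-us-752-savannah",
    "http://example.org/about.html",
]
print([u for u in results if not is_junk(u)])
```

Treating every URL with a query string as dynamic is deliberately crude; it will also drop some perfectly good pages, so the rule is worth tuning for your own searches.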

If you see sites or site templates you think should also be “banned” please let me know.

Happy searching! 🙂