Where the (Wild) Files Are

booleanstrings Boolean

When data is exposed to search engines due to an incorrect site configuration, that data becomes available for Sourcing to anyone who knows how to find it – including you and me.

About ten years ago, Sourcing techniques like “flipping” or “peeling” still worked, providing creative Researchers with the data to find and parse – sometimes, folders full of files with desired professional data. These techniques are not as effective any longer. As of today, we rarely see websites that would show anything like this (an exposed file directory):

We can try to look for those by Googling for something like

intitle:"index of" name "last modified" "size" <add keywords>,

but we won’t find a whole lot, because of the protections built into many modern site-building platforms.
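If you run this kind of search often, the string can be assembled in a few lines of code. The sketch below is only an illustration: the function name `index_of_query` is made up for this example, and the choice to wrap multi-word keywords in quotes is an assumption about how you would want phrases handled.

```python
def index_of_query(*keywords: str) -> str:
    """Build the exposed-directory search string, appending optional keywords.

    `index_of_query` is a hypothetical helper name, not part of any library.
    """
    base = 'intitle:"index of" name "last modified" "size"'
    # Assumption: multi-word keywords should be searched as exact phrases.
    extras = [f'"{k}"' if " " in k else k for k in keywords]
    return " ".join([base, *extras])


if __name__ == "__main__":
    # Paste the printed string into the Google search box.
    print(index_of_query("resume", "software engineer"))
```

Called with no arguments, the function returns just the base string, so you can still add keywords by hand.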

However, the web is full of new data sources that were not available in the past. With the rising popularity of BIG DATA in the CLOUD, we can Google for a different set of files. For example, if a file is stored in the Amazon Cloud and is public (for example, is referenced from a public document), we can locate that file.

Would you like to see some examples? These are resumes stored in the Amazon Cloud. As another example, these are attendee lists.

There are tons more sources beyond the Amazon cloud storage. Here is another source of files:

inurl:wp-content/uploads. (Why? You should be able to explain.) An example Google search:

inurl:wp-content/uploads "member directory" ext:PDF

finds some interesting data!
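To try variations quickly (a different phrase, a different file type), the same pattern can be turned into a clickable link. This is a rough sketch, not a recommended tool: `upload_query` and `google_url` are hypothetical names invented here, and the only assumption about Google is its standard `https://www.google.com/search?q=` address.

```python
from urllib.parse import urlencode


def upload_query(phrase: str, ext: str, path: str = "wp-content/uploads") -> str:
    """Combine inurl:, an exact phrase, and ext: into one search string."""
    return f'inurl:{path} "{phrase}" ext:{ext}'


def google_url(query: str) -> str:
    """Percent-encode the search string into a Google search link."""
    return "https://www.google.com/search?" + urlencode({"q": query})


if __name__ == "__main__":
    query = upload_query("member directory", "PDF")
    print(query)              # the string to paste into the search box
    print(google_url(query))  # the same search as a link
```

Swapping the phrase for, say, "attendee list" or the extension for XLS gives a new search without retyping the operators.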

Learn about other platforms full of uploaded data that may find its way into Google's index (and much more!) at the

Sourcing Methodologies Lecture and Practice.