Tuesday, 4 November 2014

There is plenty of information around but how much of it can we practically find?

'Another haystack' by Maxine, under a CC license

The frank answer is: it depends on many things.

First of all, I'm talking about information that is available on the internet. That excludes books that are not available online, databases that run locally, etc. More specifically, I'm talking about information that has been indexed by at least one search engine, at least at the level of general content description. I'm not differentiating among the different types of information, though.

Estimates of the size of the internet in 2013 spoke of about 759 million websites, of which 510 million were active, which - in turn - hosted some 14.3 trillion webpages. Google had indexed about 48 billion of those and Bing about 14 billion. The amount of accessible data was estimated at about 672 million TB (terabytes), which likely includes the indexed content and part of the deep web content.
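To put those figures in proportion, here is a minimal sketch in Python that works out what share of the estimated 14.3 trillion pages each search engine had indexed. The numbers are the 2013 estimates quoted above; the function and variable names are mine, just for illustration.

```python
# Figures quoted above (2013 estimates); names are illustrative only.
TOTAL_PAGES = 14.3e12                      # estimated webpages
INDEXED = {"Google": 48e9, "Bing": 14e9}   # estimated indexed pages


def index_coverage(indexed_pages, total_pages=TOTAL_PAGES):
    """Return the indexed share of all pages, as a percentage."""
    return 100.0 * indexed_pages / total_pages


for engine, pages in INDEXED.items():
    print(f"{engine}: {index_coverage(pages):.2f}% of estimated pages")
```

Taken at face value, even the larger index covers well under one percent of the estimated total, so "indexed" and "out there" are very different things.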

On top of that, we have the dark internet - but this is a different thing.

So, there is a lot of information indexed (and much more that lies beyond the indexes). Year by year we are getting more and more used to using and relying on the internet. But how much useful information can we normally find?

Assuming we are talking about seeking "general information", the main search tool is a search engine. While common search queries return tens of millions of results, most users tend to focus on the first few hits. SEO experts often talk about users sticking to the first 5 search engine hits or - at most - the results of the first page. Some disagree, but still very few users go through all the results. Of course, persistent people seeking specific information do tend to try different search queries in order to reach reasonably relevant information.

The interesting point regarding search engines and their results is that the results on the first page are very valuable. So the question is: if some invest in placing their content at the top of the search results, how can the user find relevant content that is maintained by people not willing to invest in SEO, e.g. by a non-profit or just enthusiastic individuals?

Of course, search engines use result ranking algorithms that take into consideration a very long list of factors. Content quantity and quality are among those factors; popularity is another. However, those ranking algorithms (the exact formulas are kept secret) may include - e.g., in the case of Google - a ranking bonus for content from the user's Google+ contacts. They may also include a fading mechanism, where very old, possibly unmaintained, information is ranked below more recent information. Websites offering content over a secure connection (via https instead of plain http) get a bonus, too.
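To make the idea of combining such factors a bit more concrete, here is a toy sketch in Python. It is emphatically not how Google or any real search engine ranks results (those formulas are secret); every weight, the half-life, the bonus sizes and the example URLs are made-up assumptions, chosen only to show how quality, popularity, freshness, https and a "contact" bonus could be folded into one score.

```python
from dataclasses import dataclass


@dataclass
class Page:
    url: str
    quality: float        # 0..1, hypothetical content quality score
    popularity: float     # 0..1, hypothetical popularity score
    age_years: float      # time since the page was last updated
    from_contact: bool = False  # published by one of the user's contacts


def toy_score(page, half_life_years=3.0):
    """Toy ranking score; all weights and bonuses are invented."""
    base = 0.6 * page.quality + 0.4 * page.popularity
    # Fading mechanism: the score halves every `half_life_years`.
    freshness = 0.5 ** (page.age_years / half_life_years)
    score = base * freshness
    if page.url.startswith("https://"):
        score *= 1.05     # small, assumed bonus for secure connections
    if page.from_contact:
        score *= 1.10     # small, assumed bonus for contacts' content
    return score


# Hypothetical pages: an old enthusiast site vs. a fresh commercial one.
pages = [
    Page("http://enthusiast.example/guide", 0.9, 0.2, 6.0),
    Page("https://shop.example/guide", 0.6, 0.8, 0.5),
]
ranked = sorted(pages, key=toy_score, reverse=True)
for page in ranked:
    print(f"{toy_score(page):.3f}  {page.url}")
```

With these invented numbers the fresher, more popular commercial page outranks the higher-quality but older enthusiast page, which is the kind of effect discussed above.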

All those twists and fine-tuning are meant to help the "average user" (I guess) reach the content they need, while at the same time giving content providers (including companies investing in advertising and SEO) a chance to reach their target audience. Most of the time, advanced users will employ additional tricks to refine their searches but (I assume) the ranking algorithms work in the same way for them, too.

Needless to say, when search engines (and Google in particular) modify their ranking algorithms, many people worry and many people get busy.

To make things slightly more challenging, content on the internet tends to change with time. Webpages may disappear for technical reasons. Links to content may be hidden in some regions due to the right to be forgotten (a very interesting topic in its own right). Or content may be removed for a variety of reasons, e.g. copyright violations or, even, DMCA takedown notices.

The point is that finding the information one wants takes persistence, intuition, imagination, good knowledge of how search engines work, sufficient time and luck (not necessarily in this order). The problem that remains is that this information is very likely to represent only part of the whole picture.

Some will say that this has always been the case when seeking information. True. But now that accessible information feels "abundant", the temptation to stop looking for new data after the first few relevant search engine hits is really strong.

Unfortunately, the responsibility still falls on the user to be wary of gaps or biases of any kind and to keep looking until the topic in question is properly (or reasonably?) addressed. It's not an easy task. With time, however, it's likely that we'll develop additional practical norms to handle it.

