Version 0.15 October 2012 (working draft)
Intended audience: research managers and researchers, information managers and professionals
Producing digital materials and placing them in a repository or on a website does not ensure ‘visibility’ on the internet. If the research materials are not easily accessible to potential users, their value is not being fully realised. This Pathway explains why pages or sitemap files should be submitted to internet search engines for indexing and how to carry out search engine optimization. Using these techniques enhances the visibility, and therefore the accessibility, of the materials collected in a repository.
What do you need to know?
Search engines collect material by sending out web crawlers, i.e. computer programs that browse the World Wide Web in a systematic way. These programs (sometimes referred to as robots or spiders) mimic the behaviour of a human user who follows all the links on a page in order to crawl further. As a consequence:
- items in databases that can only be retrieved by putting terms into a search box will not be found, because a crawler follows links and does not fill in search forms. To get that content indexed you will need to actively submit pages for indexing or use sitemap files.
- a site should be designed in such a way that it can be easily crawled. This is one of the things that the process of search engine optimization (SEO) attempts to accomplish.
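The link-following behaviour described above can be illustrated with a minimal Python sketch. It is not how any particular search engine works; it only shows that a program which collects the links on a page never sees content that sits behind a search box. The page content and the example.org addresses are made up for illustration.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkCollector(HTMLParser):
    """Collects the targets of all <a href="..."> links on one page."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        # Only ordinary hyperlinks are followed; forms are ignored.
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(urljoin(self.base_url, value))


def extract_links(html, base_url):
    """Return the absolute URLs a simple crawler would visit next."""
    collector = LinkCollector(base_url)
    collector.feed(html)
    return collector.links


# Hypothetical page: one ordinary link and one search form.
PAGE = """
<html><body>
  <a href="/about.html">About the repository</a>
  <form action="/search"><input name="q"></form>
</body></html>
"""

print(extract_links(PAGE, "http://example.org/"))
```

Only the address of the "About" page is collected. The records that a user could reach by typing a query into the form are invisible to this kind of program, which is why they have to be submitted or listed in a sitemap file.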
Queries in the major search engines often return thousands of items, and the search engines do their best to present the most important items at the top of the list. This ranking of results is determined by several factors. Some of them are obvious, e.g. items rank higher if the terms in the user’s query occur frequently in the item. But items also rank higher if many other items link to them. Moreover, not all links count equally: an item that itself receives many links carries more weight, so when it links to a page on your site it passes on more weight to that page than an item that has received few links. This mechanism is referred to as PageRank (named after Larry Page, one of the founders of Google), and achieving a higher PageRank is the other thing that search engine optimization (SEO) attempts to accomplish.
Apart from the general search engines there are several specific search engines for repositories. These search engines use the common protocol for interoperability between repositories, the Open Archives Initiative Protocol for Metadata Harvesting (OAI-PMH), to harvest content from repositories. A repository therefore needs to expose its metadata through an OAI-PMH interface in order to be harvested.
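The idea that links from well-linked items carry more weight can be made concrete with a small Python sketch of the textbook PageRank iteration. This is a simplified illustration, not the ranking actually used by any search engine. The page names and the link graph are invented, and the damping factor of 0.85 is the commonly cited value, used here as an assumption.

```python
def pagerank(links, damping=0.85, iterations=50):
    """Textbook PageRank by repeated redistribution of weight.

    `links` maps each page to the list of pages it links to.
    """
    pages = sorted(set(links) | {t for targets in links.values() for t in targets})
    n = len(pages)
    rank = {page: 1.0 / n for page in pages}
    for _ in range(iterations):
        # Every page starts each round with a small base weight.
        new_rank = {page: (1.0 - damping) / n for page in pages}
        for page in pages:
            targets = links.get(page, [])
            if targets:
                # A page divides its weight over the pages it links to.
                share = damping * rank[page] / len(targets)
                for target in targets:
                    new_rank[target] += share
            else:
                # A page without outgoing links spreads its weight evenly.
                share = damping * rank[page] / n
                for target in pages:
                    new_rank[target] += share
        rank = new_rank
    return rank


# Hypothetical graph: "hub" receives three links and links to "yours";
# "lonely" receives none and links to "other".
EXAMPLE_LINKS = {
    "a": ["hub"],
    "b": ["hub"],
    "c": ["hub"],
    "hub": ["yours"],
    "lonely": ["other"],
}

ranks = pagerank(EXAMPLE_LINKS)
for page in sorted(ranks, key=ranks.get, reverse=True):
    print(page, round(ranks[page], 3))
```

In this toy graph "yours" and "other" each receive exactly one link, but "yours" ends up with the higher score because its single link comes from a page that is itself well linked.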
What do you need to do?
Submitting pages and sitemap files
A sitemap file is a file in a particular (XML-based) format that allows webmasters to inform search engines about the addresses (URLs) of pages on their own site that can be crawled. Different search engines can use the same sitemap file. If the URLs of all the items in your website/repository are in a database, the sitemap file can be generated on a regular basis from that database. Submitting a sitemap file does not guarantee that an item is indexed by the search engines, but it makes it more likely that the item is found.
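Generating a sitemap file from a list of URLs can be sketched in a few lines of Python. The sketch assumes that the item addresses and their last-modified dates have already been read from the database. The example.org addresses are hypothetical, and only the `loc` and optional `lastmod` elements of the sitemap format are used.

```python
import xml.etree.ElementTree as ET

# Namespace defined by the sitemap protocol (sitemaps.org).
SITEMAP_NS = "http://www.sitemaps.org/schemas/sitemap/0.9"


def build_sitemap(entries):
    """Build sitemap XML from (url, last_modified_or_None) pairs."""
    urlset = ET.Element("urlset", xmlns=SITEMAP_NS)
    for loc, lastmod in entries:
        url = ET.SubElement(urlset, "url")
        # ElementTree escapes characters such as "&" in the URL.
        ET.SubElement(url, "loc").text = loc
        if lastmod:
            ET.SubElement(url, "lastmod").text = lastmod
    body = ET.tostring(urlset, encoding="unicode")
    return '<?xml version="1.0" encoding="UTF-8"?>\n' + body


# Hypothetical rows, as they might come from a repository database.
ITEMS = [
    ("http://example.org/items/1", "2012-10-01"),
    ("http://example.org/items/2", None),
]

print(build_sitemap(ITEMS))
```

The resulting text would be saved as a file on the web server, for example sitemap.xml, and its address submitted to the search engines. Running the script on a schedule keeps the file in step with the database.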
Search engine optimization
Search engine optimization is meant to make sure that web crawlers can find the items on your site, and to improve their ranking in the results from the search engine. As this is very important in commercial environments, search engine optimization has become quite an industry. There are two forms:
- White hat optimization, which improves the accessibility of a site for search engines in a ‘transparent’ way.
- Black hat optimization, which tries to trick search engines in various ways. These ‘black hat’ methods may involve ‘cloaking’, which means exposing different content to the web crawler than to human visitors (for example by using specific tags in the source, or by giving text the same colour as the background). They may also involve link farms, i.e. artificial links created only to improve the PageRank of your items.
If you hire a consultant to improve the performance of a site for search engines, it is important to make sure that they do not use these ‘black hat’ methods. A rule of thumb might be that if you understand the trick, the search engine will certainly find out about it. Search engines have penalized sites that used such practices, in some cases by excluding them from the index entirely.
There are a number of recommended approaches for white hat search engine optimization of your site. The most important ones are:
- Think carefully about which words you want the user to enter in order to find your page. Check that these words do in fact lead them to your page.
- Use clear titles for your pages, because that is what the user clicks on in the results from the search engine[1].
- Make sure each item can be reached via a static link[2]. Dynamic links within your site can cause difficulties for crawlers.
- If you choose a Content Management System (CMS) to maintain your website, an important requirement should be that search engines are able to crawl sites which use this CMS.
- Provide the user with sitemap pages. If these pages get too large it is helpful to split them.
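The recommendation on static links can be checked mechanically. The Python sketch below flags links that carry a query string, i.e. the part of a URL after a question mark, which is the usual sign of a dynamic link. The function name and the example addresses are hypothetical.

```python
from urllib.parse import urlsplit


def find_dynamic_links(urls):
    """Return the URLs that contain a query string (the part after '?')."""
    return [url for url in urls if urlsplit(url).query]


# Hypothetical links taken from the pages of a site.
SITE_LINKS = [
    "http://example.org/items/42",
    "http://example.org/view.php?id=42",
    "http://example.org/about.html",
]

for url in find_dynamic_links(SITE_LINKS):
    print("dynamic link:", url)
```

Links reported by such a check are candidates for an additional static address, or for inclusion in a sitemap file.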
To improve the ranking of your pages the best recommendation is to work together with peer sites that work in the same subject area and link to each other’s pages wherever appropriate.
Getting harvested
In the world of document repositories there are two important roles:
- Data providers, i.e. repositories that expose their data to harvesting;
- Service providers, which harvest the repositories, combine the harvested metadata in a database, and develop value-added services using that data. Examples of such services are OAIster (which brings together the world's scholarly repositories), AGRIS and AgOAI (for agricultural publications) or Orgprints (a thematic repository in the field of organic agriculture).
To control the flow of information between data providers and service providers a common protocol is used, OAI-PMH. If a service or a piece of software is said to be OAI-PMH compatible, this can mean that it is a repository that can be harvested, that it can harvest repositories, or both. The M in OAI-PMH stands for metadata: only metadata (information about information objects such as electronic documents) is exchanged, not the objects themselves. There are different formats to exchange metadata, and OAI-PMH can handle any format that can be expressed in XML. Support for unqualified Dublin Core, the most common format for electronic resources, is mandatory, but other formats are used in some OAI-PMH networks. For example, in networks for educational resources the IEEE LOM format may be used, and FAO promotes the use of the AGRIS Application Profile. It is important to note that OAI-PMH is not a search protocol: it only allows selective harvesting based on date ranges or on sets predefined by the data provider.
To be harvested by service providers, all a repository needs to do is submit its address (the base URL of its OAI-PMH interface) to the service. As the protocol has been standardized, the service provider then knows how to harvest the repository.
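As an illustration of selective harvesting, the Python sketch below builds the request URL that a service provider might send to a repository. The repository address and the set name are hypothetical. `verb`, `metadataPrefix`, `from`, `until` and `set` are the argument names defined by the protocol, and `oai_dc` is the prefix for unqualified Dublin Core.

```python
from urllib.parse import urlencode


def list_records_url(base_url, metadata_prefix="oai_dc",
                     from_date=None, until_date=None, set_spec=None):
    """Build an OAI-PMH ListRecords request for selective harvesting."""
    params = {"verb": "ListRecords", "metadataPrefix": metadata_prefix}
    # Optional restrictions: a date range and a predefined set.
    if from_date:
        params["from"] = from_date
    if until_date:
        params["until"] = until_date
    if set_spec:
        params["set"] = set_spec
    return base_url + "?" + urlencode(params)


# Hypothetical repository and set name.
print(list_records_url("http://repository.example.org/oai",
                       from_date="2012-01-01",
                       until_date="2012-10-31",
                       set_spec="agriculture"))
# prints http://repository.example.org/oai?verb=ListRecords&metadataPrefix=oai_dc&from=2012-01-01&until=2012-10-31&set=agriculture
```

The request can only narrow the harvest by date and by set. There is no argument for a search term, which reflects the point that OAI-PMH is a harvesting protocol and not a search protocol.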
References
1. Search Engine Optimization.
2. Search engine ranking factors V2.
3. OAI for Beginners - the Open Archives Forum online tutorial. Includes information on OAI-PMH.
[1] In technical terms, the <title> in the <head> of your web page
[2] In technical terms: dynamic links are usually links to a URL that contains a question mark. One should also be careful with links that are generated by JavaScript on a page.