So, we’ve been talking about search, and people no doubt wonder if their site will be found with OSU’s Google Search.

In most cases the answer is yes, but in some cases it is no.  There are reasons behind each no, and those reasons are what I want to talk about in this post.

The Exceptions

Why are there exceptions?  There are a few reasons.

First is the license limit on our Appliance, which is currently one million documents.  We currently have 549,998 documents indexed, and we are still adding sites as we are made aware of them.  A site with a very large number of documents eats into that limit quickly.  For example, if a site generates an individual page for every entry of a dictionary, each entry counts as a separate document.  Related to this, some exceptions come from the applications that users or departments run.  For example, a Joomla CMS site currently returns a large number of documents because of the way the application works.

Second, some sites are not maintained and get hacked or spammed.  We don’t want to index sites that have had spam inserted into them, since that spam is likely to show up in the search result descriptions.

Third, crawling some sites results in an endless loop, where documents in the site keep referring back to the site itself and the crawler basically gets stuck.  We don’t crawl those.

Fourth, if a site returns a large number of errors, then something is wrong with the site, and crawling it consumes Appliance resources such as CPU and memory.

Fifth, and encompassing all of the above, is administrative overhead.  With all the other functions CWS supports, if a particular aspect of search would create significant administrative overhead, we need to make the decision that best minimizes that overhead.
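To put the license limit in perspective, here is a rough back-of-the-envelope sketch in Python.  The limit and the current count are the figures quoted above, and the per-site numbers are the "over N thousand" figures from the exception list later in this post, so treat them as lower bounds.  The function and variable names are made up for illustration, and this is not how the Appliance itself does its accounting.

```python
LICENSE_LIMIT = 1_000_000   # documents allowed by the Appliance license
INDEXED_NOW = 549_998       # documents currently indexed

# Lower bounds quoted in this post ("over N thousand documents").
LARGE_SITES = {
    "ask-ecampus knowledge base": 250_000,
    "cof.orst.edu/org/iawa": 160_000,
    "oregonstate.edu/tac": 600_000,
    "familybusinessonline.org": 400_000,
}


def remaining_capacity(indexed=INDEXED_NOW, limit=LICENSE_LIMIT):
    """Documents we can still add before hitting the license limit."""
    return limit - indexed


def fits(extra_docs, indexed=INDEXED_NOW, limit=LICENSE_LIMIT):
    """True if adding extra_docs keeps the index within the limit."""
    return indexed + extra_docs <= limit


if __name__ == "__main__":
    print("remaining:", remaining_capacity())
    for name, docs in LARGE_SITES.items():
        print(name, "fits" if fits(docs) else "does not fit")
    print("all four combined:", sum(LARGE_SITES.values()))
```

With these figures there is room for 450,002 more documents.  Only one of the four sites would blow past the limit on its own, but the four together come to at least 1,410,000 documents, which is more than the entire license.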

So what are our current exceptions?

1.  ONID home directories are not searched.  Why?  Mostly because some users do not maintain their sites, which then end up with spam entries, and across twenty thousand or more sites that is too much overhead to manage.  A policy decision was made on this.
2.  http://ecampus.oregonstate.edu/ask-ecampus/knowledge-base/  Why? This site returned over 250 thousand documents.
3.  http://www.cof.orst.edu/org/iawa/  Why?  This site returned over 160 thousand documents.
4.  http://oregonstate.edu/tac/index.php?option=  Why?  This site returned over 600 thousand documents (due to the way the application handles pages).
5.  Group sites at http://oregonstate.edu/groups/ Why?  This is for the same reason as #1.  As part of the move to people.oregonstate.edu for group sites, we will be reevaluating this.
6.  http://oregonstate.edu/webprojects/wiki  Why?  This site returned 2 million errors.
7.  http://www.familybusinessonline.org/index.php? Why?  This site returned over 400 thousand documents (due to the way the application handles pages).
8.  http://oregonstate.edu/cla/anthropology/gallery/kingston/main.php? Why?  This site was caught in a loop.
9.  http://bioe.oregonstate.edu/reservations/ Why?  This site was caught in a loop.
10.  http://oregonstate.edu/aepcore/index? Why?  This site was caught in a loop.
11.  http://hort.oregonstate.edu/event/  Why?  This site was caught in a loop.
12.  http://recycle.oregonstate.edu/EarthDay/eventCalendar.cfm?  Why?  This site was caught in a loop.
13.  http://extension.oregonstate.edu/clackamas/announcement/  Why?  This site was caught in a loop.
14.  http://physics.oregonstate.edu/event/  Why?  The events list returned an excessive number of results.
15.  regexp:http://www\\.osualum\\.com/?.*cid=[0-9]+.*?  This is a regular expression: any URL that matches the form specified is not crawled.  Why?  This site was caught in a loop.
16.  http://oregonstate.edu/sli/aggregator/announcement/  Why?  This site was caught in a loop.
17.  regexp:http://oregonstate\\.edu/womenscenter/library.*browse=*  This is a regular expression: any URL that matches the form specified is not crawled.  Why?  This site was caught in a loop.

If your site is on this list and you want to discuss it, contact us.  We do want to reevaluate sites periodically.
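If you are wondering whether a particular URL would be caught by one of the entries above, here is a minimal Python sketch.  It makes two assumptions that may not match the Appliance exactly: plain entries are treated as URL prefixes, and the doubled backslashes in the regexp: entries are treated as an escaping layer that collapses to a single backslash.  The function name and the sample URLs are made up for illustration, and only a few entries from the list are included.

```python
import re

# A few entries copied from the exclusion list above.
EXCLUSIONS = [
    "http://oregonstate.edu/tac/index.php?option=",
    "http://hort.oregonstate.edu/event/",
    r"regexp:http://www\\.osualum\\.com/?.*cid=[0-9]+.*?",
    r"regexp:http://oregonstate\\.edu/womenscenter/library.*browse=*",
]


def is_excluded(url, patterns=EXCLUSIONS):
    """Return True if url matches any exclusion entry."""
    for pattern in patterns:
        if pattern.startswith("regexp:"):
            # Assumption: "\\." in the list means an escaped dot, so
            # collapse each doubled backslash before compiling.
            expr = pattern[len("regexp:"):].replace("\\\\", "\\")
            if re.search(expr, url):
                return True
        elif url.startswith(pattern):
            # Assumption: plain entries act as URL prefixes.
            return True
    return False


if __name__ == "__main__":
    for url in (
        "http://hort.oregonstate.edu/event/2010",         # hypothetical URL
        "http://www.osualum.com/s/359/index.aspx?cid=41",  # hypothetical URL
        "http://oregonstate.edu/",
    ):
        print(url, "->", "excluded" if is_excluded(url) else "crawled")
```

This only checks the patterns themselves.  It says nothing about the file types we skip or about whether a site has actually been added to the crawl.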

We also do not index every type of file extension.  Image files, media files, and archive or binary files are not crawled.  There would simply be too many of them, and they would push us past our license limit.

So those are the exceptions and the reasons why.  We don’t necessarily expect everyone to be happy with or agree with the exceptions made.  However, we have to make the best decisions to support OSU as a whole while keeping in mind the limitations of our search engine.  That said, we do want to periodically review our decisions and determine whether alternative solutions can be implemented.  So if you have a concern, please contact us.
