So, we’ve been talking about search, and people no doubt wonder if their site will be found with OSU’s Google Search.

In most cases the answer is yes, but in some cases the answer is no. There are reasons for the no’s, and those reasons are what I want to talk about in this post.

The Exceptions

Why are there exceptions? There are a few reasons.

First is the license limit on our Appliance, which is currently one million documents. Our current document count is 549,998, and we are still indexing sites as we are made aware of them. If a site has a large number of documents, for example a dictionary site where each entry is its own page and so counts as a document, it will eat up the million-document limit fairly quickly. Related to that, some exceptions come from the applications that users or departments run. For example, a Joomla CMS site currently returns a large number of documents because of the way the application works.

Second, some sites are not maintained and get hacked or spammed. We don’t want to index sites that have spam inserted into them, since that spam is likely to show up in the search descriptions.

Third, crawling some sites results in an endless loop, where documents in the site keep referring back to the site itself and the crawler basically gets stuck. We don’t crawl those.

Fourth, if a site returns a large number of errors, then something is wrong with the site, and crawling it consumes Appliance resources such as CPU and memory.

Fifth, and encompassing all of the above, is administration overhead. With all the other functions CWS supports, if a particular aspect of search would result in significant administration overhead, we need to make the best decision to minimize that overhead.
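To make the license arithmetic concrete, here is a minimal sketch in Python. The two figures come from this post; the function names and the sample site sizes are only illustrative, and this is not how the Appliance itself accounts for documents.

```python
# Figures quoted in this post.
LICENSE_LIMIT = 1_000_000   # documents allowed by the Appliance license
CURRENT_COUNT = 549_998     # documents indexed at the time of writing


def remaining_capacity(limit=LICENSE_LIMIT, indexed=CURRENT_COUNT):
    """Number of documents that can still be indexed under the license."""
    return limit - indexed


def fits_in_license(site_documents, limit=LICENSE_LIMIT, indexed=CURRENT_COUNT):
    """True if a site of this size would fit in the remaining capacity."""
    return site_documents <= remaining_capacity(limit, indexed)


if __name__ == "__main__":
    print(remaining_capacity())
    # Hypothetical site sizes, chosen only to show both outcomes.
    print(fits_in_license(250_000))
    print(fits_in_license(600_000))
```

A single site that returns several hundred thousand documents can use up most or all of what is left, which is why very large sites end up on the exception list.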

So what are our current exceptions?

1.  ONID home directories are not searched.  Why?  Mostly because some users do not maintain their sites and the sites end up with spam entries, and across twenty thousand or more sites that is too much overhead to manage.  A policy decision was made for this.
2.  http://ecampus.oregonstate.edu/ask-ecampus/knowledge-base/  Why? This site returned over 250 thousand documents.
3.  http://www.cof.orst.edu/org/iawa/  Why?  This site returned over 160 thousand documents.
4.  http://oregonstate.edu/tac/index.php?option=  Why?  This site returned over 600 thousand documents (due to the way the application handles pages)
5.  Group sites at http://oregonstate.edu/groups/ Why?  This is for the same reason as #1.  As part of the move to people.oregonstate.edu for group sites, we will be reevaluating this.
6.  http://oregonstate.edu/webprojects/wiki Why?  This site returned 2 million errors.
7.  http://www.familybusinessonline.org/index.php? Why?  This site returned over 400 thousand documents (due to the way the application handles pages).
8.  http://oregonstate.edu/cla/anthropology/gallery/kingston/main.php? Why?  This site was caught in a loop.
9.  http://bioe.oregonstate.edu/reservations/ Why?  This site was caught in a loop.
10.  http://oregonstate.edu/aepcore/index? Why?  This site was caught in a loop.
11.  http://hort.oregonstate.edu/event/  Why?  This site was caught in a loop.
12.  http://recycle.oregonstate.edu/EarthDay/eventCalendar.cfm?  Why?  This site was caught in a loop.
13.  http://extension.oregonstate.edu/clackamas/announcement/  Why?  This site was caught in a loop.
14.  http://physics.oregonstate.edu/event/  Why?  Events list returning excessive results.
15.  regexp:http://www\\.osualum\\.com/?.*cid=[0-9]+.*?  This is a regular expression; any URL that matches this form is not crawled.  Why?  This site was caught in a loop.
16.  http://oregonstate.edu/sli/aggregator/announcement/  Why?  This site was caught in a loop.
17.  regexp:http://oregonstate\\.edu/womenscenter/library.*browse=*  This is a regular expression; any URL that matches this form is not crawled.  Why?  This site was caught in a loop.
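
For anyone curious how a list like this could be applied to a URL, here is a rough Python sketch. It is not the Appliance's own matching code. It assumes that plain entries are matched as URL prefixes, that entries starting with regexp: are regular expressions searched anywhere in the URL, and that the doubled backslashes are configuration escaping for a single backslash. The function names and the test URLs are made up for illustration.

```python
import re

# A few entries copied from the list above.
EXCLUSIONS = [
    "http://ecampus.oregonstate.edu/ask-ecampus/knowledge-base/",
    "http://hort.oregonstate.edu/event/",
    r"regexp:http://www\\.osualum\\.com/?.*cid=[0-9]+.*?",
    r"regexp:http://oregonstate\\.edu/womenscenter/library.*browse=*",
]


def compile_exclusion(entry):
    """Turn one list entry into a function that tests a URL."""
    if entry.startswith("regexp:"):
        # Assumption: "\\." in the list is an escaped form of the regex "\.".
        source = entry[len("regexp:"):].replace("\\\\", "\\")
        pattern = re.compile(source)
        return lambda url: pattern.search(url) is not None
    # Assumption: plain entries exclude every URL that starts with them.
    return lambda url: url.startswith(entry)


RULES = [compile_exclusion(entry) for entry in EXCLUSIONS]


def is_excluded(url):
    """True if any exclusion rule matches the URL."""
    return any(rule(url) for rule in RULES)


if __name__ == "__main__":
    # Hypothetical URLs, used only to exercise the rules.
    for url in [
        "http://hort.oregonstate.edu/event/2010/01",
        "http://hort.oregonstate.edu/about/",
        "http://www.osualum.com/page?cid=42",
    ]:
        print(url, is_excluded(url))
```

The regular expression entries are useful when only certain URLs on a site cause trouble, such as pages carrying a cid= parameter, while the rest of the site can still be crawled.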

If your site is on this list and you want to discuss it, contact us.  We do want to reevaluate sites periodically.

We also do not index every type of file extension.  Image files, media files, and archive or binary files are not crawled.  There would simply be too many of them, and they would exceed our license limit.
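
As a small illustration of filtering by file extension, here is a Python sketch. The extension list below is a made-up example covering the image, media, archive and binary categories; it is not the Appliance's actual configuration, and the function name is hypothetical.

```python
import posixpath
from urllib.parse import urlparse

# Hypothetical examples of each skipped category, not the real list.
SKIPPED_EXTENSIONS = {
    ".jpg", ".png", ".gif",   # images
    ".mp3", ".mov",           # media
    ".zip", ".tar",           # archives
    ".exe",                   # binaries
}


def is_crawlable(url):
    """True unless the URL path ends in one of the skipped extensions."""
    path = urlparse(url).path            # ignores the query string
    extension = posixpath.splitext(path)[1].lower()
    return extension not in SKIPPED_EXTENSIONS


if __name__ == "__main__":
    print(is_crawlable("http://oregonstate.edu/index.html"))
    print(is_crawlable("http://oregonstate.edu/photos/campus.JPG"))
```

Only the path is checked, so a page such as page.php?file=a.zip is still treated as a page rather than as an archive.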

So those are the exceptions and the reasons why.  We don’t necessarily expect everyone to be happy with or agree with the exceptions made, but we have to make the best decisions to support OSU as a whole while keeping in mind the limitations of our search engine.  That said, we do want to periodically review our decisions and determine whether alternative solutions can be implemented.  So if there is a concern, please contact us.

So Search Beta was released in conjunction with the new top hat design for OSU (one of several upcoming changes).  It was a great effort between Central Web Services (otherwise known as CWS) and Web Communications, the same collaborative group that introduced OSU Mobile.  Don’t know about OSU Mobile?  Well, for that, visit m.oregonstate.edu (iPhone, Palm Pre, Android and some Blackberry).  I’m sure we’ll be talking about it in other OSU CWS blog posts, so stay tuned.

So what is Search Beta?  It’s just that: it really is Beta.  We are transitioning information, crawls, and features from the Google Search Appliance to the User Interface for search.  It’s not perfect, and not everything will be found right now.

So you might be wondering how that affects search on your site pages, which use the central code provided by CWS.  Because we have a front end to search, we are able to make the change as transparent as possible to site owners.  The goal is that sites shouldn’t need to be modified if they use the search module integration CWS previously made available.  Integration with Drupal sites is upcoming, so if your site is not showing results because it has not been indexed, do not worry; we’ll be rolling out the Drupal change soon.

After that, a couple of things need to happen.  First, if you are running what we call a virtual host, like hmsc.oregonstate.edu, or a site in a path under oregonstate.edu/, the Google search has to find your site, possibly through links from other sites, and index it.  This is the engine part of the appliance, and Google does a fairly good job with this.  The process could take anywhere from a few hours to a few days, depending on the algorithms Google uses to find new pages.  Anything in the oregonstate.edu/ area is continuously crawled.  There are exceptions (and reasons for exceptions), which we’ll be noting in the days to come and which we’ll talk more about in another post.

Second, if your site is not found after a reasonable amount of time, then we can look at explicitly crawling your site.  This is more common with virtual hosts.  If that is the case, just contact us using our online contact form, but first read the next post about exceptions to sites being crawled.

We’ll also be looking to get some input from users.  You can comment here, or you can comment on the Web Communications blog, where there will also be information about the new home page that will be introduced this year.  In addition to commenting, there will be some focus groups, which is another avenue to provide feedback.  The focus groups will look at search among other things.

So when it comes to the OSU Search, we say Search Me.

Search Introduction

The OSU search tool has been updated to provide a better long-term web search solution for Oregon State University.

The purpose of the OSU Search Category in the CWS blog is to examine the evolution of OSU’s web search, offer more detailed information about OSU’s current search capabilities, and keep OSU current on the happenings with OSU Search.

Background

History

OSU has seen three search solutions through the course of its web history.

Inktomi – 1998-2002
Inktomi was OSU’s first search engine. Inktomi’s base technology was initially developed at Berkeley, and during the mid-to-late 90’s became the driving force behind the Yahoo and HotBot search engines.

Google – 2002-2004
Google originated at Stanford University as project BackRub, named for its weighting of backlinks in its search algorithm. In a few years, it developed into the most popular search engine in the world. As a natural expansion of the search engine, Google developed standalone search appliances aimed at large organizations with a substantial web presence. Google Search Appliances provide a solution-in-a-box for searching large intranets and offer more specific content filtering than is possible with google.com’s web interface. One of these appliances powered the OSU search for almost two years.

Nutch – 2004-2009
In August of 2004, at the end of the Google contract, Central Web Services evaluated Nutch as a replacement search service. Nutch ran on OSU hardware, with software built, configured, and maintained through close cooperation between CWS and Nutch programmers early in Nutch’s history with OSU, and it powered the search.oregonstate.edu service for many years. During this period Google advanced many features of its appliance and search capabilities. Because of those advanced capabilities and the overhead of dealing with issues in the Nutch release, the decision was made to sunset Nutch at OSU and return to a much improved Google Search Appliance.

Google Search Appliance – 2010-

On January 1st, 2010, Central Web Services unveiled the new Google Search Appliance.

Why Change?

The migration to Nutch in 2004 was initiated to improve flexibility and extensibility, and because Nutch is an open source product, access to the code was available. As other advancements occurred in technology, there was not adequate time or personnel to focus on the code changes needed for the Nutch search engine to reach a stable state and meet the growing needs of OSU. In October of 2009, the decision was made to let the search experts take care of search, while the administration, the front end design, and other enhancements to search management would be maintained by the good people of Central Web Services.

Support

Contact Central Web Services with questions, comments or concerns.