OSU Search is powered by a Google Search Appliance. One of the issues we’ve had to overcome from day one is the relevance of search results. One of the main criteria for relevance is how many other pages link back to a page. This is one of the areas where OSU Search can’t keep up with external search engines like Bing, Yahoo or Google, because OSU Search crawls, and is only aware of, OSU-related websites.

In other words, if a site is being linked to by many external websites or groups this information is not used by OSU Search to improve results.

The good news is that the Google Search Appliance has a feature called Self Scorer. With this functionality turned on, the search appliance can improve the relevance of search results by observing which links users click on after they do a search. We had this feature turned on, but since we don’t use the search appliance directly, we weren’t taking advantage of it. In the latest version of OSU Search, we ported this feature over. Now, whenever you do a search on search.oregonstate.edu, the search appliance will make a note of which search result you clicked on, and if enough people click that search result, it will move it up the list. This should make a difference in the relevance of the search results end users see.
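To make the idea concrete, here is a toy sketch of click-based re-ranking. It is not how the appliance actually works (Google does not publish that); the base scores, the click threshold and the boost value are all made-up assumptions, chosen only to show a result moving up once "enough people" have clicked it.

```python
def rerank(results, clicks, min_clicks=5, boost=0.1):
    """Return result URLs ordered by base score plus a click bonus.

    results: list of (url, base_score) pairs, as a hypothetical engine might score them.
    clicks: dict mapping url -> number of recorded clicks.
    A result only gets a bonus once it has at least min_clicks clicks.
    """
    def adjusted(item):
        url, base_score = item
        count = clicks.get(url, 0)
        bonus = boost * count if count >= min_clicks else 0.0
        return base_score + bonus

    return [url for url, _ in sorted(results, key=adjusted, reverse=True)]


if __name__ == "__main__":
    results = [("a", 1.0), ("b", 0.8), ("c", 0.5)]
    clicks = {"c": 10, "b": 2}  # "b" has too few clicks to count
    print(rerank(results, clicks))
```

In this made-up example, result "c" starts last but moves to the top because it has been clicked ten times, while "b" stays put because two clicks is below the threshold.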

Another advantage of having the Self Scorer enabled is that we can run advanced search reports. What this means is that we’ll now be able to get reports that tell us things such as:

  • The ranking of the search results that people are clicking on, or
  • How often people use the next/prev links to find what they’re looking for instead of finding it on the first page

This extra data will allow us to learn how useful the information that OSU Search returns is for different types of search queries, so that we can improve the results.
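As a rough illustration of the two report questions above, the sketch below summarizes a click log. The log format (query, rank of the clicked result, results page number) is a hypothetical one invented for this example; the appliance’s real report format may look quite different.

```python
def summarize_clicks(log):
    """Summarize a click log of (query, clicked_rank, page_number) tuples.

    Returns the average rank of the results people clicked and the share
    of clicks that happened beyond the first page of results.
    """
    ranks = [rank for _, rank, _ in log]
    beyond_first = sum(1 for _, _, page in log if page > 1)
    return {
        "mean_clicked_rank": sum(ranks) / len(ranks),
        "share_beyond_first_page": beyond_first / len(log),
    }


if __name__ == "__main__":
    # Made-up sample data for illustration only.
    sample = [
        ("calendar", 1, 1),
        ("calendar", 3, 1),
        ("parking", 12, 2),
        ("parking", 4, 1),
    ]
    print(summarize_clicks(sample))
```

A high average rank, or a large share of clicks past the first page, would suggest that the best result for a query is not near the top.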

I’m glad to be writing about some exciting updates to OSU Search. In version 0.4.2, you will find an updated look and feel and some usability features. Among the new things you will find are:

  • The links to different types of search (collections) are located on a left sidebar instead of above the search box.
  • Filter by url – Users can now filter by urls by clicking the domains that we currently crawl located in the left sidebar.
  • Header and footer updates – they now include the same content as the homepage
  • The search box has moved down closer to the results area
  • Faster results!

We think these additions to OSU Search will make finding what you’re looking for easier. We will keep bringing new features to OSU Search to help users explore more advanced search features they may not be aware of. Some of the future improvements will include: people search, location search and speed improvements. Our goal is to let people just type what they are looking for without having to worry about what filters to use or what options they need to select. OSU Search should be doing all the heavy lifting for users.

If you have any questions or comments, feel free to post below.

Search v0.4 was released. In this release, we fixed a few bugs and added a few features that incorporate some of the remaining out-of-the-box Google Search Appliance features, including:

  • Display a link to ‘more results from …’
  • Indent results if they are related to each other
  • Specify the format of the search results (pdf, text file, etc)
  • A close link for the suggestion box

In addition to the out-of-the-box features, we have also provided a method for doing an exact phrase search without having to put quotes around it or go to advanced search. Looking through the search reports, most of our users do not put quotes around their queries. Is this because they do not want an exact match, or because they do not realize that search by default tries to find all of the words on a page, not necessarily the exact phrase? We hope this gives those in the latter group the ability to perform an exact search with a click, rather than typing quotes.
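Behind a one-click exact search there is very little magic: the query just needs quotes around it. A minimal sketch of that step, assuming the front end simply rewrites the query string (the function name is ours, for illustration only):

```python
def as_exact_phrase(query):
    """Wrap a query in double quotes so it is searched as an exact phrase."""
    cleaned = query.strip()
    # Leave queries alone if the user already quoted the whole thing.
    if len(cleaned) >= 2 and cleaned.startswith('"') and cleaned.endswith('"'):
        return cleaned
    # Drop stray quotes inside the query so the result stays well formed.
    return '"' + cleaned.replace('"', "") + '"'


if __name__ == "__main__":
    print(as_exact_phrase("academic calendar"))
```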

As next steps for search, we will be working with Web Communications and University Advancement on some possible UI changes. The goal is to see how to make search more robust and feature rich, and to find the right UI for it. We’ll be looking to hold some focus groups on search to see what people are expecting, and as always, we welcome any feedback.

In addition to UI, we are looking to add more CWS features, such as the ability to see and filter results based on the list of domains we index. This allows individuals to also see what sites we actually are indexing, and if your sites are not there, to let us know. Some sites are excluded based on the number of results, as our license limit is only 1 million docs.  You can read our previous article about this.

Please check this page frequently.  Updates to code for integrating search for sites will be provided here.

OSU Search lets users search for your content across all of OSU. No matter where they type in their query, results from across OSU will be displayed for any visitors.

As of January 1st, 2010, OSU has switched to using the Google Search Appliance.

How can you tell if you are using the old search?

Simple.  If your search results page has search.oregonstate.edu/web in the URL, you are on the old search.  If you are using Drupal (see below), it means custom code was inserted and will need to be changed by you or whoever helped develop your site.  If you need assistance, contact CWS.
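If you maintain many static pages, you could check them for the old search with a short script. The sketch below looks for forms whose action points at search.oregonstate.edu/web; the sample page and its search.jsp path are made up for the example, so treat it as a starting point rather than a complete checker.

```python
from html.parser import HTMLParser

OLD_SEARCH = "search.oregonstate.edu/web"


class FormActionFinder(HTMLParser):
    """Collect the action attribute of every <form> tag in a page."""

    def __init__(self):
        super().__init__()
        self.actions = []

    def handle_starttag(self, tag, attrs):
        if tag == "form":
            action = dict(attrs).get("action")
            if action:
                self.actions.append(action)


def old_search_forms(html):
    """Return the form actions in the page that still point at the old search."""
    finder = FormActionFinder()
    finder.feed(html)
    return [action for action in finder.actions if OLD_SEARCH in action]


if __name__ == "__main__":
    # Hypothetical page content, for illustration only.
    page = (
        '<form action="http://search.oregonstate.edu/web/search.jsp">'
        '<input name="query"></form>'
        '<form action="/contact"></form>'
    )
    print(old_search_forms(page))
```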

What does this mean to you?

If you are a site hosted by Central Web Services using OSU Drupal 5 or OSU Drupal 6 hosted solutions, the search switchover will be transparent and no action needs to be taken.  Please note, if you have installed Drupal or other software solutions yourselves, and are not using the centrally hosted CWS solution, you will be responsible for switching these over.

If you are maintaining static sites, whether hosted with CWS or not, with html code embedded for using the Nutch search engine (as referenced in the form with a url of search.oregonstate.edu/web) then you will need to replace the code with one of the two following options:

Code

We have two methods for you to include a search box in your website. The first and preferred method is using PHP or another programming language. This method ensures that when we add new features to search boxes in the future, your website will get the updates. The second method is only recommended if you have static HTML websites.

Using PHP (preferred)

To search all of OSU, paste the code below in your php file. If you are
using any other programming language, you basically have to create an array
and turn it into a json string. Then request the url over the web.


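// Options sent to OSU Search, encoded as a JSON string.
$options = array(
    'url' => 'http://oregonstate.edu/cws'
);
$json = json_encode($options);

// Build the request URL, passing the URL-encoded JSON in the "o" parameter.
$url = 'http://search.oregonstate.edu/libs.php?q=osusearch_searchbox&o=' . urlencode($json);

// Fetch the search box over HTTP and print it into the page.
echo file_get_contents($url);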
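For sites not written in PHP, the same three steps (build the options, turn them into a JSON string, request the URL) might look like this in Python. This is a sketch that mirrors the PHP above; the function names are ours, and fetching the search box is kept in its own function so nothing goes over the network until you call it.

```python
import json
from urllib.parse import quote_plus
from urllib.request import urlopen

SEARCHBOX_ENDPOINT = "http://search.oregonstate.edu/libs.php?q=osusearch_searchbox&o="


def build_searchbox_url(options):
    """Encode the options as JSON and append them to the search box URL."""
    # Compact separators keep the JSON free of extra spaces.
    encoded = json.dumps(options, separators=(",", ":"))
    # quote_plus plays the role of PHP's urlencode() here.
    return SEARCHBOX_ENDPOINT + quote_plus(encoded)


def fetch_searchbox(options):
    """Request the search box over the web and return it as text."""
    with urlopen(build_searchbox_url(options)) as response:
        return response.read().decode("utf-8")


if __name__ == "__main__":
    print(build_searchbox_url({"url": "http://oregonstate.edu/cws"}))
```

Python’s json.dumps does not escape forward slashes the way PHP’s json_encode does by default, so the encoded string differs slightly from the PHP version, but both are valid JSON for the same options.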

Using HTML

For sites using static HTML, please use the link below to contact us and we will help you set up a search box for your site.

If you have questions regarding this process, use the contact form and select Search to send us your question.

A maintenance release was pushed to production for Search, version 0.3.1.

For those who haven’t noticed the capabilities of search, look at the options.  If a word is typed in, there may be some suggestions offered.  At the bottom are related searches, and if you misspell a word, you may see a “Did you mean…”.  Also, domain search has been added to the Advanced Search.  Did you notice these changes?  We implemented all this in a front end, meaning we look at the features Google has, create an interface to the appliance, and then place things where we need to place them.  The front end also serves a second purpose: if, in the future, the back end (the Google Appliance) is replaced, we simply rewrite the front end to work with the new back end.  More importantly, the front end allows us to do other things, like implement our feedback module.

Now with a front end, it also means we may have minor bugs, which is why we release the maintenance version for minor bugs.  It may simply be things like formatting, or some case we did not handle that the appliance handles.

We’ll be continuing to look at enhancing search, and integrate other aspects with it as we move ahead.  There has been minimal feedback to date, and without feedback, we cannot know how we can improve it.  So if you have a comment let us know.

Thanks!  Your Central Web Services Group

CWS Let Us Know Module

As part of the next search release, we now have a quick feedback module.  If you are looking for a particular search which you know is on a specific site, for example, on forestry.oregonstate.edu, and you do not find that, what do you do?  Well, let us know.

It is entirely possible that sites are not indexed into our search engine, so just send us a quick feedback with the url.  If you want to be contacted, provide us with contact information as well, or instead use our full help form.

If you want to let us know anything else about the layout or design, then just tell us more.  Enter the information and just click Send Feedback, and your feedback will come to us.  We try to be mind readers as best as possible, but there are particulars which we may not be able to pick up on.  If there is something we can do about your feedback, if it helps many students, faculty, and staff, we do want to look at how we could accomplish it.

Thank You, Your Central Web Services Team

Keyword

You’ve probably come here because you clicked on the Keyword link from our search page, or you came across our blog, or you came from blogs.o.e.  But however you arrived here is secondary to the information that you want.

What is a Keyword, and how do I get one?  The first part is easy: a Keyword is a prominent link shown at the top of the results based on, well, keywords.  Google calls it KeyMatch, but we are keeping our prior terminology of Keyword.  So for example, a keyword could be “academic calendar”, which, when you type it in search, displays a link in a shaded area above the result set; clicking that link takes you to the catalog page for the academic calendar.
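Conceptually, a Keyword is just a lookup from a query to a promoted link. The sketch below shows that idea with a tiny table; the table contents and the example.edu URL are placeholders we made up, and the real KeyMatch feature has its own matching rules that this does not try to reproduce.

```python
# Hypothetical keyword table. The URL is a placeholder, not the real catalog link.
KEYWORDS = {
    "academic calendar": "http://example.edu/catalog/academic-calendar",
}


def keyword_match(query, table=KEYWORDS):
    """Return the promoted link for a query, or None if there is no keyword."""
    # Normalize case and whitespace so "Academic  Calendar" still matches.
    normalized = " ".join(query.lower().split())
    return table.get(normalized)


if __name__ == "__main__":
    print(keyword_match("Academic Calendar"))
```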

So how do you get one?  In our previous search engine, we had to enter keywords into a database manually, and there was no policy on establishing keywords.  As the keywords grew and grew, keeping the links fresh became too much of an overhead.

With the new Google Search Appliance, we are operating in a different mode, we have data to look at, and with good search engine optimization for your pages, organic results should be improved.

Keywords were previously necessary for many users, but with a better organic result set, we can minimize the number of keywords we have to maintain.  With the ability now to see the top queries that both get and don’t get results, we can make some intelligent determinations about what should be keywords.  For others, we do recommend that you optimize your page for search engines, and there is information about it on Google’s site.  The basics, if you don’t want to read all about it, are: one, relevant content, and two, other sites linking to your site.
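One way to picture the data-driven approach: take the query log, keep the queries that returned nothing, and see which of those come up most often. The log format and the threshold below are assumptions made for the sake of the example, not a description of the appliance’s reports.

```python
from collections import Counter


def keyword_candidates(log, min_count=2):
    """Return frequent zero-result queries, most common first.

    log: iterable of (query, number_of_results) pairs.
    min_count: how many times a query must fail before it is worth a look.
    """
    misses = Counter(
        " ".join(query.lower().split())
        for query, hits in log
        if hits == 0
    )
    return [query for query, count in misses.most_common() if count >= min_count]


if __name__ == "__main__":
    # Made-up sample log for illustration only.
    sample = [
        ("Parking permit", 0),
        ("parking permit", 0),
        ("academic calendar", 120),
        ("dining hours", 0),
    ]
    print(keyword_candidates(sample))
```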

We’ll be looking at other approaches in the future to build upon the need for additional promotion, but for now, a data-driven approach is what we will be looking at for a fresh approach to searching by keywords.  So for now, we will not be taking user requests for keywords.  Stay tuned to this blog for changes.  In the future we hope to make the search reports accessible via a web interface that any user can visit.  If you have feedback, please contact us, or leave a comment here.

Advanced Search

With the release of the Advanced Search function, we have released Search as production.  Hmm, what does that mean?  Well, now if you go to the original search site, at search.oregonstate.edu, it is the new Google Search Appliance search.  So does that mean that’s it, you ask?  Well, no.  There are still more features coming, like Narrow Your Search and Keywords, which we’ll discuss in the near future.  For now, look at the advanced search.  If you really are looking to search, and you don’t find what you are looking for, don’t give up.  Try the advanced search feature: add more words, exclude words, or pick a specific file format.  There are several file formats you can choose from.
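The advanced search options boil down to a more specific query. As a sketch, and assuming the appliance accepts the familiar Google-style operators (a minus sign to exclude a word, filetype: for a file format, site: for a domain), a query could be assembled like this:

```python
def build_query(words, exclude=(), filetype=None, site=None):
    """Combine advanced search choices into a single query string."""
    parts = list(words)
    # A leading minus sign asks the engine to leave a word out.
    parts += ["-" + word for word in exclude]
    if filetype:
        parts.append("filetype:" + filetype)
    if site:
        parts.append("site:" + site)
    return " ".join(parts)


if __name__ == "__main__":
    print(build_query(["tuition", "fees"], exclude=["graduate"], filetype="pdf"))
```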


Now if you don’t find what you are looking for, it could be that the site you are looking to search is new and not crawled yet, or it might not be hosted with us, so we are not aware it needs to be crawled.  In this case, all you have to do is let us know what the site is, and we’ll look to crawl it.  Read our other post on exceptions to get additional information on what we don’t crawl.

So, we’ve been talking about search, and people no doubt wonder if their site will be found with OSU’s Google Search.

In most cases the answer is yes, but in some cases the answer is no.  For the no’s there are reasons why, and that is what I want to talk about in this post.

The Exceptions

Why are there exceptions?  There are exceptions for a few reasons.  First is the license limit on our Appliance, which is currently one million documents.  Our current document count is 549,998, and we are still indexing sites as we are made aware of them.  So if a site has a large number of documents, for example a dictionary site where each entry is its own page and so counts as a document, then that will eat up the million-document limit fairly quickly.  Related to this, some exceptions are because of the applications users or departments use.  For example, a Joomla CMS currently results in a large number of documents returned because of the way the application works.  Second, some sites are not maintained and get hacked or spammed; we don’t want to index sites that have spam inserted into them, which would likely show up in the search descriptions.  Third, if crawling a site results in an endless loop, where documents in the site refer back to themselves so the crawler basically gets stuck, we don’t crawl it.  Fourth, if a site returns a large number of errors, then there is something wrong with the site, and crawling it consumes Appliance resources such as CPU and memory.  Fifth, and encompassing all of these, is the administration overhead.  With all the other functions CWS supports, if a particular search aspect would result in significant administration overhead, we need to make the best decision to minimize that overhead.
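The license arithmetic is easy to check. The sketch below uses the figures from this post (the one million limit, the current count, and the "over N thousand" sizes from the exception list further down, treated as lower bounds); the short site labels are just shorthand for this example.

```python
LICENSE_LIMIT = 1_000_000
CURRENTLY_INDEXED = 549_998

# Lower-bound document counts for four of the excluded sites listed in this post.
EXCLUDED_SITE_SIZES = {
    "ask-ecampus knowledge base": 250_000,
    "cof iawa": 160_000,
    "tac": 600_000,
    "familybusinessonline": 400_000,
}


def remaining_capacity(limit=LICENSE_LIMIT, indexed=CURRENTLY_INDEXED):
    """Documents that can still be indexed before hitting the license limit."""
    return limit - indexed


def sites_that_fit(sizes=EXCLUDED_SITE_SIZES):
    """Names of sites that would fit, on their own, in the remaining capacity."""
    room = remaining_capacity()
    return [name for name, size in sizes.items() if size <= room]


if __name__ == "__main__":
    print("Remaining capacity:", remaining_capacity())
    print("Would fit individually:", sites_that_fit())
    print("Combined size of excluded sites:", sum(EXCLUDED_SITE_SIZES.values()))
```

With 450,002 documents of headroom, the largest of these sites would not fit even by itself, and the four together (at least 1,410,000 documents) are larger than the entire license.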

So what are our current exceptions?

1.  ONID home directories are not searched.  Why?  Mostly because some users do not maintain their sites, and the sites result in spam entries, and across twenty thousand or more, it’s too much of an overhead to manage.  A policy decision was made for this.
2.  http://ecampus.oregonstate.edu/ask-ecampus/knowledge-base/  Why? This site returned over 250 thousand documents.
3.  http://www.cof.orst.edu/org/iawa/  Why?  This site returned over 160 thousand documents.
4.  http://oregonstate.edu/tac/index.php?option=  Why?  This site returned over 600 thousand documents (due to the way the application handles pages)
5.  Group sites at http://oregonstate.edu/groups/ Why?  This is for the same reason as #1.  As part of the move to people.oregonstate.edu for group sites, we will be reevaluating this.
6.  http://oregonstate.edu/webprojects/wiki Why?  This site has 2 million errors.
7.  http://www.familybusinessonline.org/index.php? Why?  This site returned over 400 thousand documents (due to the way the application handles pages).
8.  http://oregonstate.edu/cla/anthropology/gallery/kingston/main.php? Why?  This site was caught in a loop.
9.  http://bioe.oregonstate.edu/reservations/ Why?  This site was caught in a loop.
10.  http://oregonstate.edu/aepcore/index? Why?  This site was caught in a loop.
11.  http://hort.oregonstate.edu/event/  Why?  This site was caught in a loop.
12.  http://recycle.oregonstate.edu/EarthDay/eventCalendar.cfm?  Why?  This site was caught in a loop.
13.  http://extension.oregonstate.edu/clackamas/announcement/  Why?  This site was caught in a loop.
14.  http://physics.oregonstate.edu/event/  Why?  Events list returning excessive results.
15.  regexp:http://www\.osualum\.com/?.*cid=[0-9]+.*?  This is a regular expression; any URL of the form specified is not crawled.  Why?  This site was caught in a loop.
16.  http://oregonstate.edu/sli/aggregator/announcement/  Why?  This site was caught in a loop.
17.  regexp:http://oregonstate\.edu/womenscenter/library.*browse=*  This is a regular expression; any URL that matches it is not crawled.  Why?  This site was caught in a loop.
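To show how a list like this might be applied, here is a small sketch that checks a URL against a few of the entries. It rests on assumptions: plain entries are treated as substring matches, entries starting with regexp: are treated as regular expressions with single backslash escapes, and Python’s re module stands in for whatever regex engine the appliance uses. The test URLs are made up.

```python
import re

# A few entries from the list above; a real list would include all of them.
EXCLUSIONS = [
    "http://hort.oregonstate.edu/event/",
    "http://bioe.oregonstate.edu/reservations/",
    r"regexp:http://www\.osualum\.com/?.*cid=[0-9]+.*?",
]

REGEXP_PREFIX = "regexp:"


def is_excluded(url, patterns=EXCLUSIONS):
    """Return True if the URL matches any exclusion pattern."""
    for pattern in patterns:
        if pattern.startswith(REGEXP_PREFIX):
            # Regular-expression entry: search for a match anywhere in the URL.
            if re.search(pattern[len(REGEXP_PREFIX):], url):
                return True
        elif pattern in url:
            # Plain entry: excluded if the URL contains it (an assumption).
            return True
    return False


if __name__ == "__main__":
    for url in (
        "http://hort.oregonstate.edu/event/2010-05-01",
        "http://www.osualum.com/page?cid=42",
        "http://oregonstate.edu/admissions/",
    ):
        print(url, "->", "skipped" if is_excluded(url) else "crawled")
```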

If your site is on this list, and you want to discuss this, then contact us.  We do want to reevaluate sites periodically.

We also do not index every type of file extension.  Image files, media files, archive or binary files are not crawled.  There would just be way too many that would exceed our license.

So those are the exceptions and the reasons why.  We don’t necessarily expect everyone to be happy or agree with the exceptions made; however, we have to make the best decisions to support OSU as a whole and keep in mind the limitations of our search engine.  That said, we do want to periodically review our decisions, and also determine if alternative solutions can be implemented.  So if there is a concern, then please contact us.

So Search Beta was released in conjunction with the new top hat design for OSU (another change as part of future upcoming changes).  A great effort between Central Web Services (otherwise known as CWS) and Web Communications.  The same collaborative group that introduced OSU Mobile.  Don’t know about OSU Mobile?  Well for that, visit m.oregonstate.edu (iPhone, Palm Pre, Android and some Blackberry), and I’m sure we’ll be talking about that in other OSU CWS blog posts, so stay tuned.

So what is Search Beta?  It’s just that, it’s really Beta.  We are transitioning information, crawls, features from the Google Search Appliance to the User Interface for search.  It’s not perfect, not everything will be found right now.

So you might be wondering about how that affects search on your site pages, which uses the central code provided by CWS.  Because we have a front end to search, we are able to make it as transparent as possible to site owners.  The goal is sites shouldn’t need to be modified, if they use the search module integration CWS has provided and made available previously.  Integration with Drupal sites will be upcoming, so if your site is not showing results because it has not been indexed, do not worry, we’ll be rolling the Drupal change in soon.  After that there are a couple things that need to happen.  First if you are running what we call a virtual host, like hmsc.oregonstate.edu or in a path in oregonstate.edu/, the Google search has to find your site possibly linked from other sites and index the site.  This is the engine part of the appliance, and Google does a fairly good job with this.  The process could take anywhere from a few hours to a few days, depending on the algorithms Google uses to find new pages.  Anything in the oregonstate.edu/ area is continuously crawled.  There are exceptions (and reasons for exceptions), which we’ll be noting in the days to come and which we’ll talk more about in another post.   Second, if your site is not found after a reasonable amount of time, then we can look at explicitly crawling your site.  This is more common with virtual hosts.  If that is the case just contact us using our online contact form, but first read the next post about exceptions to sites being crawled.

We’ll also be looking to get some input from users.  You can comment here, or you can comment on the Web Communications blog, where there will also be information about the new home page that will be introduced this year.  In addition to commenting, there will be some focus groups, which is another avenue to provide feedback.  The focus groups will look at search among other things.

So when it comes to OSU Search, we say Search Me.