ETD Symposium Notes
July 23-25, 2014
Leicester, UK
Kevin Schurer Keynote
He’s the VC for Research at the University of Leicester and formerly ran the UK thesis union catalog, EThOS: http://ethos.bl.uk/Home.do.
Schurer is a big proponent of OA, as everyone in Europe seems to be, not just librarians, but he was very critical of the benefits of OA as described in the Research Councils UK Open Access Policy (http://www.rcuk.ac.uk/research/openaccess/). Two of the benefits as described by the UK government: 1. Create benefits for economic growth, 2. Increase public understanding of research. He basically says that there is no evidence to back up the claim that open access benefits business or the general public. He suggested that the general public is not intelligent enough to understand published research and that he wouldn’t expect them to bother trying. He suggests that OA is good, but that public access is not a primary benefit. The public would be better served by authors creating a “dumbed down” (my words) version of articles, something he is in fact doing for an article that he just sent to a major science journal (guessing PLOS).
Schurer said the principal benefits of open access are in text mining to make research discoveries. He also mentioned the value of metadata interoperability, presumably to enable better discovery of academic scholarship, and the value of OA for better internal auditing of research. He says open data is the holy grail of openness: when data is more open, more scientific discoveries can be made and big problems solved.
Gabrielle Michaelek (Carnegie Mellon) talked about managing ETDs with associated complex digital objects. CMU uses Digital Commons (BePress), which she says is not robust enough to handle anything but PDFs. She also mentioned that there is no support for ORCIDs or DOIs in Digital Commons (surprising), so they can’t use it as a repository for supplemental data. They use ArchivalWare for research data, although they aren’t capturing much at this point. It supports multiple formats and metadata schemes (especially important for data).
Implications of managing ETD supplemental data:
• changes to workflows
• will probably require a different set of personnel from those currently working with ETDs
• will require a close relationship between ETD staff and data management staff
She sees this as the first wave: capturing research data and making it more widely available. The next wave will be to enable better access to it and interoperability through linked data.
There was discussion of transforming ETDs from PDFs into machine-actionable objects such as XML. PDFs are limited in what you can do with them.
There was a presentation about ETD download statistics at Lingnan University in Hong Kong. 56% of all downloads were ETDs, even though ETDs represent only 10% of total repository holdings. They found that 60% of their usage came from Google, with some referrals from the catalog and a small percentage from Summon.
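To put those two percentages in perspective, here is a rough back-of-the-envelope calculation in Python. It assumes (my assumption, not something stated in the talk) that both figures cover the same repository and time period, so that non-ETD items make up the remaining 90% of holdings and 44% of downloads. The function name is made up for illustration.

```python
def per_item_download_ratio(etd_download_share, etd_holdings_share):
    """How many times more often an average ETD is downloaded than an
    average non-ETD item, given the ETD share of downloads and holdings."""
    # Downloads per unit of holdings for ETDs.
    etd_rate = etd_download_share / etd_holdings_share
    # The same rate for everything else in the repository.
    other_rate = (1 - etd_download_share) / (1 - etd_holdings_share)
    return etd_rate / other_rate


# Figures reported in the presentation: 56% of downloads, 10% of holdings.
ratio = per_item_download_ratio(0.56, 0.10)
print(f"An average ETD is downloaded about {ratio:.1f}x as often as other items")
```

Under those assumptions, an average ETD would be downloaded roughly eleven times as often as an average non-ETD item.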
There were several presentations about national thesis catalogs. Many other countries (Brazil, Peru, UK, France, Czech Republic) either have a national catalog of their nation’s theses or are building one. The closest thing to a U.S. catalog of dissertations is ProQuest, but as more and more libraries drop ProQuest, it is becoming less comprehensive. There is nothing in the U.S. for theses. The NDLTD union catalog harvests metadata from repositories around the world but doesn’t include bibliographic data for T/Ds that are not open access.
Peter Murray-Rust presented about making ETDs more useful. The UK is close to requiring deposit of ODF documents (I think this is Open PDF?). He claims it is easy to make PDFs less accessible for purposes of making money. Great quote: “Publishers forbid access to 99.9% of the world, research that is largely paid for by taxpayers, created and peer-reviewed by us for free.”
“Science is communicated in 19C ways.” It is not taking advantage of 21st-century technologies. He says there needs to be a move toward open notebook science, where lab notebooks are openly available. Elsevier is already looking for ways that it can control open data in order to make money from it. He says that Elsevier acquired Mendeley to acquire its user data and to destroy or co-opt an open science icon that threatens its business model.
New ways for theses:
- Content mining is now legal in the UK, as long as content is readable (open or not, and regardless of licenses associated with it).
- He has developed contentmine.org, which pulls in open content for text, data, and content mining purposes.
- Generating machine-readable XML from PDFs. Converting text and image tables to spreadsheets and XML.
- Extracting nuggets of machine-readable information from previously unformatted PDF text.
- Open content mining of facts from research.
Theses in the U.K. (Sara Gould—British Library)
RIOXX Metadata Profile v.2, which includes guidelines for capturing grant number information, is to be released soon. We need to check to be sure we are capturing this information in the prescribed way. She suggested that thesis identifiers be established. These would provide a match key and reduce duplication in union catalogs and databases, allow for easier citation, and reduce link rot. They are working on automated assignment of LCSH using text mining and heuristics. One advantage of union catalogs is that you can text mine within them.
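To illustrate why a match key matters for union catalogs, here is a minimal Python sketch of deduplicating records when no shared thesis identifier exists. The key recipe (author surname, normalized title, year), the field names, and the sample records are all my own assumptions for illustration, not anything described in the talk. A real persistent identifier would make this kind of fuzzy matching unnecessary.

```python
import re
import unicodedata


def normalize(text):
    """Lowercase, strip accents and punctuation, collapse whitespace."""
    text = unicodedata.normalize("NFKD", text)
    text = "".join(ch for ch in text if not unicodedata.combining(ch))
    text = re.sub(r"[^a-z0-9]+", " ", text.lower())
    return " ".join(text.split())


def match_key(author, title, year):
    """Hypothetical match key: surname + normalized title + year."""
    surname = normalize(author.split(",")[0])
    return f"{surname}|{normalize(title)}|{year}"


def deduplicate(records):
    """Keep the first record seen for each match key."""
    seen = {}
    for rec in records:
        key = match_key(rec["author"], rec["title"], rec["year"])
        seen.setdefault(key, rec)
    return list(seen.values())


# Made-up records, as two repositories might describe the same thesis.
records = [
    {"author": "Müller, Anna", "title": "Open Access: A Study", "year": 2013},
    {"author": "Muller, A.", "title": "Open access - a study", "year": 2013},
    {"author": "Smith, John", "title": "Linked Data for Theses", "year": 2012},
]
unique = deduplicate(records)
print(len(unique), "unique records")
```

A key like this will still miss variant titles and will wrongly merge different theses that share a surname, title, and year, which is part of the argument for assigning identifiers instead.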
PIRUS, which aimed to consolidate usage statistics for articles from publishers and repositories, didn’t move forward because publishers weren’t interested in participating. IRUS-UK is moving forward. The intention is to produce reliable usage stats for items in British repositories. IRUS processes raw usage stats, removes robots, including “bad” robots that don’t announce who they are (good robots like Google announce who they are, so they can be easily removed), runs the stats through COUNTER processing to filter them further, and returns the stats cleaned up.
This is work that individual libraries are doing with differing degrees of success. A small piece of code is added to the repository that sends the stats to the IRUS server for cleanup. Publishers use COUNTER to get reliable usage stats, and pay a lot of money for the service. Libraries could do the same, but it is extremely expensive. This is needed for all libraries. Lots of potential.
They are testing the feasibility of devising a set of algorithms to identify and filter robots and unusual activity.
They have developed a portal that provides statistics for all participating repositories. It allows comparison of usage at different institutions.
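The kind of cleanup described for IRUS can be sketched roughly in Python as follows. This is not IRUS code: the robot marker list, the 30-second double-click window, and the event layout are all assumptions I picked for illustration. It only catches robots that identify themselves in the user agent. The “bad” robots would need behavior-based heuristics like the ones being tested.

```python
from datetime import datetime, timedelta

# Hypothetical markers for self-identifying robots (not an official list).
ROBOT_MARKERS = ("bot", "crawler", "spider")
# Assumed window for treating repeat requests as a single download.
DOUBLE_CLICK_WINDOW = timedelta(seconds=30)


def clean_usage(events):
    """Filter raw download events.

    Each event is a tuple (timestamp, ip, user_agent, item_id).
    Drops self-identified robots and repeat requests for the same item
    from the same IP within the double-click window.
    """
    last_seen = {}
    kept = []
    for ts, ip, agent, item in sorted(events, key=lambda e: e[0]):
        if any(marker in agent.lower() for marker in ROBOT_MARKERS):
            continue  # robot that announces itself
        key = (ip, item)
        previous = last_seen.get(key)
        last_seen[key] = ts
        if previous is not None and ts - previous <= DOUBLE_CLICK_WINDOW:
            continue  # likely a double-click, count it once
        kept.append((ts, ip, agent, item))
    return kept


# Made-up sample events.
t0 = datetime(2014, 7, 23, 9, 0, 0)
events = [
    (t0, "10.0.0.1", "Mozilla/5.0", "etd-1"),
    (t0 + timedelta(seconds=1), "10.0.0.2", "Googlebot/2.1", "etd-1"),
    (t0 + timedelta(seconds=10), "10.0.0.1", "Mozilla/5.0", "etd-1"),
    (t0 + timedelta(minutes=5), "10.0.0.1", "Mozilla/5.0", "etd-1"),
]
cleaned = clean_usage(events)
print(len(cleaned), "downloads counted out of", len(events), "raw hits")
```

In this sample, the Googlebot hit and the repeat request ten seconds after the first are dropped, leaving two counted downloads.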
Texas A&M and a British programmer presented about ORCID. Texas A&M creates ORCIDs for all graduate students and requires them to verify; 21% do so. They have a great libguide that explains ORCIDs that we could borrow from. There was discussion about whether ORCID should be used as a portal for research publications or primarily as an identifier creation tool. There are a plethora of portals out there, such as ResearchGate. We need to explain to faculty what these are and how they differ.
This post written by Michael Boock.