5 DSpace repository usage statistics questions answered

02/03/2020

What is the most valuable piece of research at my institution? Who is the most productive faculty member? These are multifaceted questions that can't be answered by download and pageview figures alone. However, usage statistics can support or undermine a number of hypotheses.

What's wrong with your repository configuration or indexing?

Pageview and download counts are an excellent tool for identifying problems with how your repository is configured for indexing or harvesting by search engines and aggregators.

Search engines like Google and Google Scholar are typically responsible for a sizeable chunk of your repository traffic. So if your numbers are down, or if they have never been "up" to begin with, this may point to a repository configuration issue preventing these systems from sending you traffic, rather than saying anything about the usefulness of your content.
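One quick sanity check is to look at how much of your recorded traffic actually arrives via search engine referrers. The sketch below queries the DSpace SOLR statistics core directly; the SOLR URL and the field names (statistics_type, isBot, referrer, time) are assumptions that vary between DSpace versions, so verify them against your own installation before relying on the numbers.

```python
# Minimal sketch: approximate the share of repository pageviews referred by
# search engines, straight from the DSpace SOLR statistics core.
# ASSUMPTION: the URL and field names below match your DSpace version.
import requests

SOLR_STATISTICS = "http://localhost:8080/solr/statistics/select"  # hypothetical URL
SEARCH_ENGINES = ("google.", "scholar.google.", "bing.", "duckduckgo.")

params = {
    "q": "statistics_type:view AND isBot:false",
    "fq": "time:[NOW-30DAYS TO NOW]",   # last 30 days
    "rows": 0,
    "facet": "true",
    "facet.field": "referrer",
    "facet.limit": 200,                 # top 200 referrer values
    "wt": "json",
}
response = requests.get(SOLR_STATISTICS, params=params).json()

# SOLR returns facets as a flat list: [value, count, value, count, ...]
facets = response["facet_counts"]["facet_fields"]["referrer"]
pairs = list(zip(facets[::2], facets[1::2]))

total = sum(count for _, count in pairs)
from_search = sum(count for ref, count in pairs
                  if any(engine in ref.lower() for engine in SEARCH_ENGINES))

print(f"Views with a known referrer:        {total}")
print(f"Views referred by search engines:   {from_search}")
```

If the search engine share is unexpectedly small, that is a hint to review your sitemaps, robots.txt and metadata exposure rather than your content.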

Are our numbers up or down?

What is "down" or "up" to begin with? The first thing you can do here is establish a baseline level of performance by reviewing your historical usage statistics. If your repository is just getting started, week by week comparisons is something you can do early on. If your repository has been along longer, you can compare on a month by month basis, taking into account typical peaks and drops of academic intrest. For example Christmas is typically not a busy period for repository traffic: almost every repository posts higher numbers in January compared to December.

How does our repository traffic compare to other institutions?

Aside from establishing a baseline against your own repository's past performance, you can also establish a baseline or benchmark against other repositories. This is much harder to do, but also potentially much more valuable. First of all, with the exception of IRUS and RAMP, usage data from other repositories is not easily accessible.

Even if you find yourself ahead of, or in the middle of, the pack of institutions reporting usage data to IRUS, you still need more context to establish the baseline. The most useful comparison is one where you find a match in both the type of institution and the type of repository and its contents.

Two examples illustrate how misleading benchmarking downloads and pageviews can be when these areas don't match:

  • If you compare your repository to an institution or school that is much larger or smaller in size, there is a high chance the volumes of repository content will also be very different to begin with. More content generally drives more traffic.
  • Secondly, let's say you find the perfect companion institution with similar areas of activity and a similar head count, both in terms of staff and students. But your repository caters for both peer-reviewed articles AND student theses, while the other institution only has the former in scope of its repository. Again, more content drives more traffic, and repository services can be very different among institutions that are otherwise very similar (see the normalisation sketch after this list).
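Because content volume drives so much of the difference, a simple way to make such comparisons slightly fairer is to normalise downloads by the number of items in scope. The figures below are invented placeholders, not real IRUS or RAMP data; the point is the calculation, not the numbers.

```python
# Minimal sketch: normalise raw download totals by the number of items in
# scope, so repositories of very different sizes become roughly comparable.
# The figures are hypothetical placeholders.

repositories = {
    # name: (annual downloads, items with at least one open bitstream)
    "Our repository":     (180_000, 6_000),
    "Peer institution A": (950_000, 40_000),
    "Peer institution B": (60_000, 1_500),
}

for name, (downloads, items) in repositories.items():
    per_item = downloads / items
    print(f"{name:<20} downloads per item per year: {per_item:6.1f}")
```

Downloads per item is still a blunt instrument, but it removes the most obvious size effect before you draw any conclusions.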

To conclude and bring this back to the first point: after you find similar institutions with similar repositories, the goal of the comparison is NOT (or at least in our humble opinion SHOULD NOT be) to determine who has the "best" repository. The key added value is detecting whether your baseline is "normal" compared to these others. This comparison enables your team to learn from optimisations others have made, so you can apply them to your own repository.

Where should we focus our efforts to make more materials openly available?

If all of your repository items already contain an openly available associated file (bitstream), congratulations! Skip this hint, as it is not intended for you.

Very often, repositories contain items where the version of record of the article is under embargo. In some cases, this version has already been uploaded to the repository, configured with an embargo, and has the associated "request a copy" feature activated. In other cases, the item only exists as a metadata record with no bitstream associated at all.

It is in this specific case that evaluating the list of most viewed items is particularly useful. If you see that certain items have hundreds or thousands of visits but zero downloads, this data might be worth using to convince the author, or other involved stakeholders, to get at least SOME version of the item openly accessible. The past traffic on the item page is a very convincing data point.
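If you want to compile such a shortlist yourself, the sketch below cross-references item pageviews with bitstream downloads in the SOLR statistics core. The conventions it assumes (type:2 for item views, type:0 for bitstream downloads, fields id, owningItem and isBot) differ between DSpace versions, so treat it as an illustration rather than a drop-in script.

```python
# Minimal sketch: list items with many pageviews but no recorded downloads,
# as candidates for chasing an openly accessible version.
# ASSUMPTION: type:2 = item view, type:0 = bitstream download, and the fields
# id / owningItem / isBot exist in your statistics schema.
import requests

SOLR_STATISTICS = "http://localhost:8080/solr/statistics/select"  # hypothetical URL

def facet_counts(query, field, limit=500):
    """Return {field_value: count} for a SOLR facet query."""
    params = {
        "q": query, "rows": 0, "wt": "json",
        "facet": "true", "facet.field": field, "facet.limit": limit,
    }
    data = requests.get(SOLR_STATISTICS, params=params).json()
    raw = data["facet_counts"]["facet_fields"][field]
    return dict(zip(raw[::2], raw[1::2]))

item_views = facet_counts("type:2 AND isBot:false", "id")
downloads = facet_counts("type:0 AND isBot:false", "owningItem")

# Items that are viewed but never downloaded: likely metadata-only or embargoed.
candidates = [(views, item) for item, views in item_views.items()
              if downloads.get(item, 0) == 0]

for views, item in sorted(candidates, reverse=True)[:20]:
    print(f"item {item}: {views} views, 0 downloads")
```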

Atmire's Content and Usage Analysis module, and in particular the "Most Popular Items" page, supports this use case. Any user can filter this list by item pageview count, bitstream download count, or the sum of the two.

Should we look at DSpace SOLR statistics or Google Analytics?

You benefit from looking at multiple sources of usage data. Both validating findings and evaluating the results of technical improvements become more robust when you cross-reference different sources of data.

While Google Analytics offers one of the most versatile administrator dashboards in the industry, its main goal, and the reason it was created, is driving "conversions" on commercially oriented websites. Out of the box, it is unable to aggregate downloads of specific files with their corresponding item page, author, DSpace collection or community.

As an administrator using Google Analytics, you can't get your hands on the actual "raw" usage data, including IP address information or other elements associated with a single pageview. The practices Google does or doesn't apply to filter out robot or crawler traffic remain undisclosed to this day.

In comparison, DSpace SOLR stats store the detailed usage events. For detecting robot traffic, Atmire and others use the COUNTER Robot user agent definitions, openly shared at https://github.com/atmire/COUNTER-Robots. This doesn't by definition make DSpace repositories more effective at identifying robots.
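To give an idea of how such a list can be applied, the sketch below matches user agent strings against the published patterns. The exact file name, location and JSON structure of the list are assumptions on our part; check the repository itself for the current format.

```python
# Minimal sketch: flag user agents that match COUNTER Robots regex patterns.
# ASSUMPTION: the repository publishes a JSON list of objects with a "pattern"
# key at the path below; verify the actual file name and format first.
import json
import re
import urllib.request

ROBOTS_JSON = ("https://raw.githubusercontent.com/atmire/COUNTER-Robots/"
               "master/COUNTER_Robots_list.json")  # assumed location

with urllib.request.urlopen(ROBOTS_JSON) as response:
    patterns = [re.compile(entry["pattern"], re.IGNORECASE)
                for entry in json.load(response)]

def is_robot(user_agent: str) -> bool:
    """True if the user agent matches any COUNTER robot pattern."""
    return any(p.search(user_agent) for p in patterns)

# Typical crawler agents should match; ordinary browser agents should not.
print(is_robot("Googlebot/2.1 (+http://www.google.com/bot.html)"))
print(is_robot("Mozilla/5.0 (Windows NT 10.0; Win64; x64)"))
```

In DSpace itself this filtering happens server-side when events are logged or when statistics are displayed; the point here is simply that the rule set is open and inspectable, unlike Google's.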

However, the state of the art is such that neither Google, nor any other party on the web, has a foolproof "gold" standard for detecting and eliminating robot traffic, as the robots themselves keep evolving as well.

All in all, even if you are 100% sure that a certain download came from a human user, that doesn't tell you to what extent the user has actually read or used the material. So in that sense, the ability of download counts to completely replace citations or other metrics of human usage is very limited. A file download remains a limited proxy for human usage and scientific impact.

Need help with your baseline or statistics related issues?

For years, Atmire has worked with institutions around the globe on repository usage statistics. Contact us today to learn more about our Content and Usage Analysis module for DSpace, or to receive technical assistance with usage statistics related issues.

If you are not using DSpace as a repository solution yet, usage statistics is only one of the areas where DSpace excels. Contact us for more details.