Preprints and Crossref’s metadata services

We’re putting the final touches on the changes that will allow preprint publishers to register their metadata with Crossref and assign DOIs. These changes support Crossref’s CitedBy linking between the preprint and other scholarly publications (journal articles, books, conference proceedings). Full preprint support will be released over the next few weeks.

I’d like to mention one change that will be immediately visible to Crossref members who use our OAI-based service to retrieve CitedBy links to their content.

This API, shown in the example below, is intended to retrieve large quantities of data detailing all the CitedBy links to a given publication. The example request pulls the data for an IEEE conference proceeding.

example:

http://oai.crossref.org/OAIHandler?verb=ListRecords&usr=***&pwd=****&set=B:10.1109:1070762&metadataPrefix=cr_citedby
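For anyone scripting this request, here is a minimal sketch of how the URL could be assembled. The host, path and parameter names mirror the example request; the path casing (`OAIHandler`), the function name and the placeholder credentials are assumptions for illustration, not an official client.

```python
from urllib.parse import urlencode

# Assumed casing of the handler path; host and parameters follow the example request.
OAI_ENDPOINT = "http://oai.crossref.org/OAIHandler"


def build_citedby_request(user: str, password: str, set_spec: str) -> str:
    """Return a ListRecords URL asking for cited-by data for one set."""
    params = {
        "verb": "ListRecords",
        "usr": user,
        "pwd": password,
        "set": set_spec,
        "metadataPrefix": "cr_citedby",
    }
    # Keep ':' readable in the set value; other reserved characters are encoded.
    return OAI_ENDPOINT + "?" + urlencode(params, safe=":")


# Placeholder credentials: substitute your own member account details.
url = build_citedby_request("USER", "PASSWORD", "B:10.1109:1070762")
print(url)
```

Building the query with `urlencode` avoids the easy mistake of dropping an `&` between parameters when the URL is typed by hand.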

With the new change, results will now identify the type of content that is doing the citing. The example results below show that the DOI 10.1109/CSMR.2012.14 is cited by five other items and display the DOIs of those items and their content type.

[Image: Screen Shot 2016-08-29 at 9.12.24 AM]

When preprint content that cites other scholarly work starts being registered with Crossref, members using this API will start seeing data like the following:

[Image: Screen Shot 2016-08-29 at 9.20.15 AM]

For many users of Crossref metadata the introduction of preprints will be transparent until preprint content starts being registered. However, a few changes like the one above have benefits not limited to just preprints.

The article nexus: linking publications to associated research outputs

Crossref began its service by linking publications to other publications via references. Today, this extends to relationships with associated entities. People (authors, reviewers, editors, other collaborators), funders, and research affiliations are important players in this story. Other metadata figure prominently in it as well: references, licenses and access indicators, publication history (updates, revisions, corrections, retractions, publication dates), clinical trial and study information, etc. The list goes on.

What is less well known (and less utilized) is that Crossref is increasingly linking publications to associated scholarly artifacts. At the bottom of it all, these links can help researchers better understand, reproduce, and build on the results in the paper. Beyond that, associated research objects can enormously bolster the research enterprise in many ways (e.g., discovery, reporting, evaluation).

With all the relationships declared across all 80+ million Crossref metadata records, Crossref creates a global metadata graph across subject areas and disciplines that can be used by all.


Using the Crossref Metadata API. Part 1 (with Authorea)

Did you know that we have a shiny, not so new, API kicking around? If you missed Geoffrey’s post in 2014 (or don’t want a Cyndi Lauper song stuck in your head all day), the short explanation is that the Crossref Metadata API exposes the information that publishers provide Crossref when they register their content with us. And it’s not just the bibliographic metadata either: funding and licensing information, full-text links (useful for text-mining), ORCID iDs and update information (via CrossMark) are all available, if included in the publishers’ metadata.
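To make that concrete, here is a rough sketch of pulling those extra pieces out of a single work record once it has been parsed from JSON. The field names (`funder`, `license`, `link`, `author`/`ORCID`, `update-to`) reflect the API's JSON as I understand it and should be treated as assumptions; the sample record is invented for illustration and is not real API output.

```python
from typing import Any, Dict


def summarise_work(message: Dict[str, Any]) -> Dict[str, Any]:
    """Collect the non-bibliographic metadata from one work record.

    Every field is optional: publishers only supply what they have.
    """
    return {
        "doi": message.get("DOI"),
        "funders": [f.get("name") for f in message.get("funder", [])],
        "licenses": [lic.get("URL") for lic in message.get("license", [])],
        "full_text_links": [ln.get("URL") for ln in message.get("link", [])],
        "orcids": [a["ORCID"] for a in message.get("author", []) if "ORCID" in a],
        "has_updates": bool(message.get("update-to")),
    }


# Invented sample record (hypothetical values, not fetched from the API).
sample = {
    "DOI": "10.5555/12345678",
    "funder": [{"name": "Example Funder"}],
    "license": [{"URL": "https://example.org/license"}],
    "link": [{"URL": "https://example.org/fulltext.pdf"}],
    "author": [
        {"family": "Carberry", "ORCID": "http://orcid.org/0000-0002-1825-0097"},
        {"family": "Nobody"},
    ],
}

summary = summarise_work(sample)
print(summary)
```

Using `.get` with empty defaults matters here, because any of these fields may be missing if the publisher did not deposit them.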

Interested? This is the kickoff of a series of case studies on the innovative and interesting things people are doing with the Metadata API. Welcome to Part 1.

Getting ready to run with preprints, any day now

While preprints have been a formal part of scholarly communications for decades in certain communities, they have not been fully adopted to date across most disciplines or systems. That may be changing very soon and quite rapidly, as new initiatives come thick and fast from researchers, funders, and publishers alike. This flurry of activity points to the realization from these parties of preprints’ potential benefits:

  • Accelerating the sharing of results;
  • Catalyzing research discovery;
  • Establishing priority of discoveries and ideas;
  • Facilitating career advancement; and
  • Improving the culture of communication within the scholarly community.

To acknowledge them as a legitimate part of the research story, we need to fully build preprints into the broader research infrastructure. Preprints need infrastructure support just like journal articles, monographs, and other formal research outputs. Otherwise, we (continue to) have a two-tiered scholarly communications system, unlinked and operating independently.

Using AWS S3 as a large key-value store for Chronograph

One of the cool things about working in Crossref Labs is that interesting experiments come up from time to time. One experiment, entitled “what happens if you plot DOI referral domains on a chart?” turned into the Chronograph project. In case you missed it, Chronograph analyses our DOI resolution logs and shows how many times each DOI link was resolved per month, and also how many times a given domain referred traffic to DOI links per day.

We’ve released a new version of Chronograph. This post explains how it was put together. One for the programmers out there.

Big enough to be annoying

Chronograph sits on the boundary between normal-sized data and large-enough-to-be-annoying-size data. It doesn’t store data for all DOIs (it includes only those that are used on average once a day), but it has information on up to 1 million DOIs per month over about 5 years, and about 500 million data points in total.

Storing 500 million data points is within the capabilities of a well-configured database. In the first iteration of Chronograph a MySQL database was used. But that kind of data starts to get tricky to back up, move around and index.

Every month or two new data comes in for processing, and it needs to be uploaded and merged into the database. Indexes need to be updated. Disk space needs to be monitored. This can be tedious.

Key values

Because the data for a DOI is all retrieved at once, it can be stored together. So instead of a table that looks like

10.5555/12345678 2010-01-01 5
10.5555/12345678 2010-02-01 7
10.5555/12345678 2010-03-01 3

we can store

10.5555/12345678 {"2010-01-01": 5, "2010-02-01": 7, "2010-03-01": 3}

This is much lighter on the indexes and takes much less space to store. However, it means that adding new data is expensive. Every time there’s new data for a month, the structure must be parsed, merged with the new data, serialised and stored again millions of times over.
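The parse, merge, serialise cycle for a single DOI can be sketched like this. It is an illustration of the idea rather than the original Chronograph code; the function name and the new April value are made up for the example.

```python
import json


def merge_counts(stored: str, new_counts: dict) -> str:
    """Parse the stored JSON for one DOI, merge in new data, serialise again."""
    counts = json.loads(stored) if stored else {}
    counts.update(new_counts)  # new months are added, existing months overwritten
    return json.dumps(counts, sort_keys=True)


stored = '{"2010-01-01": 5, "2010-02-01": 7, "2010-03-01": 3}'
# Hypothetical new data point for the next month.
updated = merge_counts(stored, {"2010-04-01": 4})
print(updated)
```

Each call is cheap on its own; the expense comes from repeating it for every DOI that has new data.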

After trials with MySQL, MongoDB and MapDB, this approach was taken with MySQL in the original Chronograph.

Keep it Simple Storage Service Stupid

In the original version of Chronograph the data was processed using Apache Spark. There are various solutions for storing this kind of data, including Cassandra, time-series databases and so on.

The flip side of being able to do interesting experiments is wanting them to stick around without having to bother a sysadmin. The data is important to us, but we’d rather not have to worry about running another server and database if possible.

Chronograph fits into the category of ‘interesting’ rather than ‘mission-critical’ projects, so we’d rather not have to maintain expensive infrastructure if possible.

I decided to look into using Amazon Web Services Simple Storage Service (AWS S3) to store the data. S3 itself is a key-value store, so it seems like a good fit. S3 is a great service because, as the name suggests, it’s a simple service for storing a large number of files. It’s cheap and its capabilities and cost scale well.

However, storing and updating up to 80 million very small keys (one per DOI) isn’t very clever, and certainly isn’t practical. I looked at DynamoDB, but we still face the overhead of making a large number of small updates.

Is it weird?

In these days of plentiful databases with cheap indexes (and by ‘these days’ I mean the 1970s onward) it seems somehow wrong to use plain old text files. However, the whole Hadoop “Big Data” movement was predicated on a return to batch processing files. Commoditisation of services like S3 and the shift to do more in the browser have precipitated a bit of a rethink. The movement to abandon LAMP stacks and use static site generators is picking up pace. The term ‘serverless architecture’ is hard to avoid if you read certain news sites.

Using Apache Spark (with its brilliant RDD concept) was useful for bootstrapping the data processing for Chronograph, but the new code has an entirely flat-file workflow. The simplicity of not having to unnecessarily maintain a Hadoop HDFS instance seems to be the right choice in this case.

Repurposing the Wheel

The solution was to use S3 as a big hash table to store the final data that’s served to users.

The processing pipeline uses flat files all the way through from input log files to projections to aggregations. At the penultimate stage of the pipeline blocks of CSV per DOI are produced that represent date-value pairs.

10.5555/12345678 2010-01 2010-01-01,05
2010-01-02,02
2010-01-03,08
10.5555/12345678 2010-02 2010-02-01,10
2010-02-02,7
2010-02-03,22

At the last stage, these are combined into blocks of all dates for a DOI

10.5555/12345678 2010-01 2010-01-01,05
2010-01-02,02
2010-01-03,08

2010-02-01,10
2010-02-02,7
2010-02-03,22
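As a rough sketch of that last step, the per-month blocks can be grouped by DOI and concatenated in date order. The in-memory tuple representation and the function name are assumptions for illustration (the real pipeline works on flat files), and the sample rows are modelled on the listing above.

```python
from collections import defaultdict


def combine_months(month_blocks):
    """Merge (doi, month, rows) blocks into one date-sorted row list per DOI.

    rows is a list of (iso_date, count) pairs; ISO dates sort correctly as text.
    """
    by_doi = defaultdict(list)
    for doi, _month, rows in month_blocks:
        by_doi[doi].extend(rows)
    return {doi: sorted(rows) for doi, rows in by_doi.items()}


blocks = [
    ("10.5555/12345678", "2010-02", [("2010-02-01", 10), ("2010-02-02", 7), ("2010-02-03", 22)]),
    ("10.5555/12345678", "2010-01", [("2010-01-01", 5), ("2010-01-02", 2), ("2010-01-03", 8)]),
]

combined = combine_months(blocks)
print(combined["10.5555/12345678"])
```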

The DOIs are then hashed into 16 bits and stored as chunks of CSV

day-doi.csv-chunks_8841:

There are 65,536 (0x0000 to 0xFFFF) possible files, each holding about a thousand DOIs’ worth of data.

When the browser requests data for a DOI, it is hashed and then the request for the appropriate file in S3 is made. The browser then has to perform a linear scan of the file to find the DOI it is looking for.

This is the simplest possible form of hash table: simple addressing with separate linear chaining. The hash function is a 16-bit mask of MD5, chosen because of availability in the browser. It does a great job of evenly distributing the DOIs over all 65,536 possible files.
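Here is a small sketch of the lookup in Python (the real thing runs in the browser). Several details are assumptions for illustration: which 16 bits of the MD5 digest are kept, whether the DOI is normalised before hashing, the hex formatting of the file name, and the chunk layout of blank-line-separated blocks whose first line is the DOI.

```python
import hashlib


def bucket_for(doi: str) -> str:
    """Hash a DOI to one of 65,536 buckets, named as four hex digits."""
    digest = hashlib.md5(doi.encode("utf-8")).digest()
    # Assumption: keep the first two bytes of the digest as the 16-bit mask.
    bucket = int.from_bytes(digest[:2], "big") & 0xFFFF
    return format(bucket, "04x")


def chunk_key(doi: str) -> str:
    """Name of the file to request (prefix taken from the example above)."""
    return "day-doi.csv-chunks_" + bucket_for(doi)


def find_in_chunk(chunk_text: str, doi: str):
    """Linear scan of a downloaded chunk; returns (date, value) string pairs or None."""
    for block in chunk_text.strip().split("\n\n"):
        lines = block.splitlines()
        if lines and lines[0] == doi:
            return [tuple(line.split(",")) for line in lines[1:]]
    return None


# Hypothetical chunk holding two DOIs that happen to share a bucket.
sample_chunk = (
    "10.5555/11111111\n2010-01-01,1\n\n"
    "10.5555/12345678\n2010-01-01,5\n2010-01-02,2\n"
)

print(chunk_key("10.5555/12345678"))
print(find_in_chunk(sample_chunk, "10.5555/12345678"))
```

The scan is the "separate linear chaining" part: DOIs that collide on the same 16 bits simply live in the same file and are told apart by comparing the full DOI.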

Striking the balance

In any data structure implementation, there are balances to be struck. Traditionally these concern memory layout, the shape of the data, practicalities of disk access and CPU cost.

In this instance, the factors in play included the number of buckets that need to be uploaded and the cost of the browser downloading an over-large bucket. The size of the bucket doesn’t matter much for CPU (as far as the user is concerned it takes about the same time to scan 10 entries as it does 10,000), but it does make a difference whether you ask a user to download a 10 kB bucket or a 10 MB one.

I struck the balance at 4096 buckets, resulting in files of around 100k, which is the size of a medium-sized image.

It works

The result is a simple system that allows people to look up data for millions of DOIs, without having to look after another server. It’s also portable to any other file storage service.

The approach isn’t groundbreaking, but it works.

2016 upcoming events – we’re out and about!

Check out the events below that Crossref will attend or present at in 2016. We have been busy over the past few months, and we have more planned for the rest of the year. If we are at a place near you, please come see us (and support these organizations and events)!

Upcoming Events
SHARE Community Meeting – July 11-14 – Charlottesville, VA, USA
Crossref Outreach Day – July 19-21 – Seoul, South Korea
CASE 2016 Conference – July 20-22 – Seoul, South Korea
ACSE Annual Meeting 2016 – August 10-11 – Dubai, UAE
Vivo 2016 Conference – August 17-19 – Denver CO, USA
SciDataCon – September 11-17 – Denver CO, USA
ALPSP – September 14-16 – London, UK
OASPA – September 21-22 – Arlington VA, USA
3:AM Conference – September 26-28 – Bucharest, Romania
ORCID Outreach Conference – October 5-6 – Washington DC, USA
Frankfurt Book Fair – October 19-23 – Frankfurt, Germany (Hall 4.2, Stand #4.2 M 85)
**Crossref Annual Community Meeting #Crossref16 – November 1-2 – London, UK**
PIDapalooza – November 9-10 – Reykjavik, Iceland
OpenCon 2016 – November 12-14 – Washington DC, USA
STM Digital Publishing Conference – December 6-8 – London, UK

The Crossref outreach team will host a number of outreach events around the globe. Updates about events are shared through social media so please connect with us via @CrossrefOrg.
 

DOI-like strings and fake DOIs

TL;DR

Crossref discourages our members from using DOI-like strings or fake DOIs.

[Image: discouraged]

Details

Recently we have seen quite a bit of debate around the use of so-called “fake-DOIs.” We have also been quoted as saying that we discourage the use of “fake DOIs” or “DOI-like strings”. This post outlines some of the cases in which we’ve seen fake DOIs used and why we recommend against doing so.

Using DOI-like strings as internal identifiers

Some of our members use DOI-like strings as internal identifiers for their manuscript tracking systems. These only get registered as real DOIs with Crossref once an article is published. This seems relatively harmless, except that, frequently, the unregistered DOI-like strings for unpublished content (e.g. manuscripts that are under review or were rejected) ‘escape’ into the public as well. People attempting to use these DOI-like strings get understandably confused and angry when they don’t resolve or otherwise work as DOIs. After years of experiencing the frustration that these DOI-like things cause, we have taken to recommending that our members not use DOI-like strings as their internal identifiers.

Using DOI-like strings in access control compliance applications

We’ve also had members use DOI-like strings as the basis for systems that they use to detect and block tools designed to bypass the member’s access control system and bulk-download content. The methods employed by our members have fallen into two broad categories:

  • Spider (or robot) traps.
  • Proxy bait.

Spider traps

[Image: spider trap]

A “spider trap” is essentially a tripwire that allows a site owner to detect when a spider/robot is crawling their site to download content. The technique involves embedding a special trigger URL in a public page on a web site. The URL is embedded such that a normal user should not be able to see it or follow it, but an automated bot (aka “spider”) will detect it and follow it. The theory is that when one of these trap URLs is followed, the website owner can then conclude that the IP address from which it was followed harbours a bot and take action. Usually the action is to inform the organisation from which the bot is connecting and to ask them to block it. But sometimes triggering a spider trap has resulted in the IP address associated with it being instantly cut off. This, in turn, can affect an entire university’s access to said member’s content.

When a spider/bot trap includes a DOI-like string, we have seen some particularly pernicious problems, as these traps can trip up legitimate tools and activities as well. For example, a bibliographic management browser plugin might automatically extract DOIs and retrieve metadata on pages visited by a researcher. If the plugin were to pick up one of these spider-trap DOI-like strings, it might inadvertently trigger the researcher being blocked or, worse, the researcher’s entire university being blocked. In the past, this has even been a problem for Crossref itself. We periodically run tools to test DOI resolution and to ensure that our members are properly displaying DOIs, CrossMarks, and metadata as per their member obligations. We’ve occasionally been blocked when we ran across the spider traps as well.

Proxy bait

[Image: proxy bait]

Using proxy bait is similar to using a spider trap, but it has an important difference. It does not involve embedding specially crafted DOI-like strings on the member’s website itself. The DOI-like strings are instead fed directly to tools designed to subvert the member’s access control systems. These tools, in turn, use proxies on a subscriber’s network to retrieve the “bait” DOI-like string. When the member sees one of these special DOI-like strings being requested from a particular institution, they then know that said institution’s network harbours a proxy. In theory this technique never exposes the DOI-like strings to the public and automated tools should not be able to stumble upon them. However, recently one of our members had some of these DOI-like strings “escape” into the public and at least one of them was indexed by Google. The problem was compounded because people clicking on these DOI-like strings sometimes ended up having their university’s IP address banned from the member’s web site. As you can imagine, there has been a lot of gnashing of teeth. We are convinced, in this case, that the member was doing their best to make sure the DOI-like strings never entered the public. But they did nonetheless. We think this just underscores how hard it is to ensure DOI-like strings remain private and why we recommend our members not use them.

Pedantry and terminology

Notice that we have not used the phrase “fake DOI” yet. This is because, internally, at least, we have distinguished between “DOI-like strings” and “fake DOIs.” The terminology might be daft, but it is what we’ve used in the past and some of our members at least will be familiar with it. We don’t expect anybody outside of Crossref to know this.

To us, the following is not a DOI:

10.5454/JPSv1i220161014

It is simply a string of alphanumeric characters that copies the DOI syntax. We call these “DOI-like strings.” It is not registered with any DOI registration agency and one cannot look up metadata for it. If you try to “resolve” it, you will simply get an error. Here, you can try it. Don’t worry: clicking on it will not disable access for your university.

http://doi.org/10.5454/JPSv1i220161014
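The point that syntax alone tells you nothing about registration can be shown with a quick shape check. The pattern below is a loose heuristic chosen for illustration (an assumption, not an official definition of DOI syntax): it accepts the unregistered string above just as happily as a registered DOI.

```python
import re

# Heuristic shape only: "10.", a numeric prefix, a slash, then a non-empty suffix.
DOI_SHAPE = re.compile(r"^10\.\d{4,9}/\S+$")


def looks_like_doi(candidate: str) -> bool:
    """True if the string has the shape of a DOI; says nothing about registration."""
    return bool(DOI_SHAPE.match(candidate.strip()))


print(looks_like_doi("10.5454/JPSv1i220161014"))  # DOI-like string, not registered
print(looks_like_doi("10.5555/12345678"))         # registered with Crossref
print(looks_like_doi("JPSv1i220161014"))          # not even DOI-shaped
```

The only way to tell the first two apart is to try resolving them or to look up their metadata.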

The following is what we have sometimes called a “fake DOI”

10.5555/12345678

It is registered with Crossref, resolves to a fake article in a fake journal called The Journal of Psychoceramics (the study of Cracked Pots) run by a fictitious author (Josiah Carberry) who has a fake ORCID (http://orcid.org/0000-0002-1825-0097) but who is affiliated with a real university (Brown University).

Again, you can try it.

http://doi.org/10.5555/12345678

And you can even look up metadata for it.

http://api.crossref.org/works/10.5555/12345678
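If you would rather script these two lookups than click on them, a minimal sketch follows. The URL patterns are the ones shown above; the function names are made up for the example, and `fetch_metadata` is defined but deliberately not called here because it needs network access.

```python
import json
from urllib.parse import quote
from urllib.request import urlopen


def resolver_url(doi: str) -> str:
    """URL that resolves the DOI to its landing page."""
    return "http://doi.org/" + quote(doi, safe="/")


def metadata_url(doi: str) -> str:
    """URL of the Crossref Metadata API record for the DOI."""
    return "http://api.crossref.org/works/" + quote(doi, safe="/")


def fetch_metadata(doi: str) -> dict:
    """Fetch and parse the metadata record (requires network access)."""
    with urlopen(metadata_url(doi)) as response:
        return json.load(response)


print(resolver_url("10.5555/12345678"))
print(metadata_url("10.5555/12345678"))
```

For a DOI-like string that was never registered, the lookup fails with an HTTP error rather than returning a record.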

Our dirty little secret is that this “fake DOI” was registered and is controlled by Crossref.

Why does this exist? Aren’t we subverting the scholarly record? Isn’t this awful? Aren’t we at the very least hypocrites? And how does a real university feel about having this fake author and journal associated with them?

Well, the DOI is using a prefix that we use for testing. It follows a long tradition of test identifiers starting with “5”. Fake phone numbers in the US start with “555”. Many credit card companies reserve fake numbers starting with “5”. For example, Mastercard’s are “5555555555554444” and “5105105105105100.”

We have created this fake DOI, the fake journal and the fake ORCID so that we can test our systems and demonstrate interoperable features and tools. The fake author, Josiah Carberry, is a long-running joke at Brown University. He even has a Wikipedia entry. There are also a lot of other DOIs under the test prefix “10.5555.”

We acknowledge that the term “fake DOI” might not be the best in this case, but it is a term we’ve used internally at least and it is worth distinguishing it from the case of DOI-like strings mentioned above.

But back to the important stuff...

As far as we know, none of our members has ever registered a “fake DOI” (as defined above) in order to detect and prevent the circumvention of their access control systems. If they had, we would consider it much more serious than the mere creation of DOI-like strings. The information associated with registered DOIs becomes part of the persistent scholarly citation record. Many, many third party systems and tools make use of our API and metadata including bibliographic management tools, TDM tools, CRIS systems, altmetrics services, etc. It would be a very bad thing if people started to worry that the legitimate use of registered DOIs could inadvertently block them from accessing content. Crossref DOIs are designed to encourage discovery and access, not block it.

And again, we have absolutely no evidence that any of our members has registered fake DOIs.

But just in case, we will continue to discourage our members from using DOI-like strings and/or registering fake DOIs.

This has been a public service announcement from the identifier dweebs at Crossref.

Image Credits

Unless otherwise noted, included images purchased from The Noun Project