The Organization Identifier Project: a way forward

The scholarly communications sector has built and adopted a series of open identifier and metadata infrastructure systems to great success. Content identifiers (through Crossref and DataCite) and contributor identifiers (through ORCID) have become foundational infrastructure for the industry. But one piece of the infrastructure still seems to be missing: there is as yet no open, stakeholder-governed infrastructure for organization identifiers and associated metadata.

In order to understand this gap, Crossref, DataCite and ORCID have been collaborating to:

  • Explore the current landscape of organizational identifiers;
  • Collect the use-cases that would benefit our respective stakeholders in the scholarly communications industry;
  • Identify those use-cases that can be more feasibly addressed in the near term; and
  • Explore how the three organizations can collaborate (with each other and with others) to practically address this key missing piece of scholarly infrastructure.

The result of this work is in three related papers being released by Crossref, DataCite and ORCID for community review and feedback. The three papers are:

  • Organization Identifier Project: A Way Forward (PDF; GDoc)
  • Organization Identifier Provider Landscape (PDF; GDoc)
  • Technical Considerations for an Organization Identifier Registry (PDF; GDoc)

We invite the community to comment on these papers, both via email and in person at PIDapalooza on November 9th and 10th and at Crossref LIVE16 on November 1st and 2nd. To move The OI Project forward, we will be forming a Community Working Group with the goal of holding an initial meeting before the end of 2016. The Working Group’s main charge is to develop a plan to launch and sustain an open, independent, non-profit organization identifier registry to facilitate the disambiguation of researcher affiliations.

Crossref Use Cases

Crossref has been discussing the needs of its members over the last year, and there is clear value in focusing on the problem of ambiguous affiliation names attached to research outputs and contributors. In terms of the metadata that Crossref collects, a long-standing gap has been affiliations for the authors of publications. Over the last couple of years, Crossref has been expanding what it collects (for example, funding and licensing data and ORCID iDs), and this enables a fuller picture of what we are calling the “article nexus”. In order to continue to fill out the metadata we collect, and for our publisher members to use it in their own systems and publications, we need an organization identifier.

Another use case for Crossref is identifying funders as part of collecting funder data, which enables connecting funding sources with the published scholarly literature. To enable the reliable identification of funders in the Crossref system, we created the Open Funder Registry, which now has over 13,000 funders available as Open Data under a CC0 waiver. While this has been very successful, it is a very narrowly focused registry and is not suitable as a broad, community-run organization identifier registry that addresses the affiliation use case. In the future, our goal will be to merge the Open Funder Registry into the identifier registry that the Organization Identifier Working Group will work on.

By working collaboratively we can define a pragmatic and cost-effective service that will meet a fundamental need of all scholarly communication stakeholders.

Geoffrey Bilder will be focusing his talk at Crossref LIVE16 this week on this initiative, dubbed The OI Project. The talk is scheduled for 2pm UK time and will be live streamed along with the rest of that day’s program.

Announcing PIDapalooza – a festival of identifiers

The buzz is building around PIDapalooza – the first open festival of scholarly research persistent identifiers (PIDs), to be held at the Radisson Blu Saga Hotel Reykjavik on November 9-10, 2016.

PIDapalooza will bring together creators and users of PIDs from around the world to shape the future PID landscape through the development of tools and services for the research community. PIDs support proper attribution and credit, promote collaboration and reuse, enable reproducibility of findings, foster faster and more efficient progress, and facilitate effective sharing, dissemination, and linking of scholarly works.

Auto-Update Has Arrived! ORCID Records Move to the Next Level

Crossref goes live in tandem with DataCite to push both publication and dataset information to ORCID records automatically. All organisations that deposit ORCID iDs with Crossref and/or DataCite will see this information go further, automatically updating author records.


Many Metrics. Such Data. Wow.

CrossRef Labs loves to be the last to jump on an internet trend, so what better than to combine the Doge meme with altmetrics? Want to know how many times a CrossRef DOI is cited by the Wikipedia?

Or how many times one has been mentioned in Europe PubMed Central?

Or DataCite?


Back in 2011 PLOS released its awesome ALM system as open source software (OSS). At CrossRef Labs, we thought it might be interesting to see what would happen if we ran our own instance of the system and loaded it up with a few CrossRef DOIs. So we did. And the code fell over. Oops. Somehow it didn’t like dealing with 10 million DOIs. Funny that.

But the beauty of OSS is that we were able to work with PLOS to scale the code to handle our volume of data. CrossRef contracted with Cottage Labs and we both worked with PLOS to make changes to the system. These eventually got fed back into the main ALM source on Github. Now everybody benefits from our work. Yay for OSS.

So if you want to know technical details, skip to Details for Propellerheads. But if you want to know why we did this, and what we plan to do with it, read on.


There are (cough) some problems in our industry that we can best solve with shared infrastructure. When publishers first put scholarly content online, they used to make bilateral reference linking agreements. These agreements allowed them to link citations using each other’s proprietary reference linking APIs. But this system didn’t scale. It was too time-consuming to negotiate all the agreements needed to link to other publishers. And linking through many proprietary citation APIs was too complex and too fragile. So the industry founded CrossRef to create a common, cross-publisher citation linking API. CrossRef has since obviated the need for bilateral linking arrangements.

So-called altmetrics look like they might have similar characteristics. You have ~4000 CrossRef member publishers and N sources (e.g. Twitter, Mendeley, Facebook, CiteULike, etc.) where people use (e.g. discuss, bookmark, annotate, etc.) scholarly publications. Publishers could conceivably each choose to run their own system to collect this information. But if they did, they would face the following problems:

  • The N sources will be volatile. New ones will emerge. Old ones will vanish.
  • Each publisher will need to deal with each source’s different APIs, rate limits, T&Cs, data licenses, etc. This is a logistical headache for both the publishers and for the sources.
  • If publishers use different systems which in turn look at different sources, it will be difficult to compare results across publishers.
  • If a journal moves from one publisher to another, then how are the metrics for that journal’s articles going to follow the journal?

This isn’t a complete list, but it shows that there might be some virtue in publishers sharing an infrastructure for collecting this data. But what about commercial providers? Couldn’t they provide these ALM services? Of course – and some of them currently do. But normally they look on the actual collection of this data as a means to an end. The real value they provide is in the analysis, reporting and tools that they build on top of the data. CrossRef has no interest in building front-ends to this data. If there is a role for us to play here, it is simply in the collection and distribution of the data.

No, really, WHY?

Aren’t these altmetrics an ill-conceived and meretricious idea? By providing this kind of information, isn’t CrossRef just encouraging feckless, neoliberal university administrators to hasten academia’s slide into a Stakhanovite dystopia? Can’t these systems be gamed?


takes deep breath. wipes spittle from beard

These are all serious concerns. Goodhart’s Law and all that… If a university’s appointments and promotion committee is largely swayed by Impact Factor, it won’t improve a thing if they substitute or supplement Impact Factor with altmetrics. As Amy Brand has repeatedly pointed out, the best institutions simply don’t use metrics this way at all (PowerPoint presentation). They know better.

But yes, it is still likely that some powerful people will come to lazy conclusions based on altmetrics. And following that, other lazy, unscrupulous and opportunistic people will attempt to game said metrics. We may even see an industry emerge to exploit this mess and provide the scholarly equivalent of SEO. Feh. Now I’m depressed and I need a drink.

So again, why is CrossRef doing this? Though we have our doubts about how effective altmetrics will be in evaluating the quality of content, we do believe that they are a useful tool for understanding how scholarly content is used and interpreted. The most eloquent arguments against altmetrics for measuring quality, inadvertently make the case for altmetrics as a tool for monitoring attention.

Critics of altmetrics point out that much of the attention that research receives outside of formal scholarly communications channels can be ascribed to:

  • Puffery. Researchers and/or university/publisher “PR wonks” over-promoting research results.
  • Innocent misinterpretation. A lay audience simply doesn’t understand the research results.
  • Deliberate misinterpretation. Ideologues misrepresent research results to support their agendas.
  • Salaciousness. The research appears to be about sex, drugs, crime, video games or other popular bogeymen.
  • Neurobollocks. A category unto itself these days.

In short, scholarly research might be misinterpreted. Shock horror. Ban all metrics. Whew. That won’t happen again.

Scholarly research has always been discussed outside of formal scholarly venues. Both by scholars themselves and by interested laity. Sometimes these discussions advance the scientific cause. Sometimes they undermine it. The University of Utah didn’t depend on widespread Internet access or social networks to promote yet-to-be peer-reviewed claims about cold fusion. That was just old-fashioned analogue puffery. And the Internet played no role in the Laetrile or DMSO crazes of the 1980s. You see, there were once these things called “newspapers.” And another thing called “television.” And a sophisticated meatspace-based social network called a “town square.”

But there are critical differences between then and now. As citizens get more access to the scholarly literature, it is far more likely that research is going to be discussed outside of formal scholarly venues. Now we can build tools to help researchers track these discussions. Now researchers can, if they need to, engage in the conversations as well. One would think that conscientious researchers would see it as their responsibility to remain engaged, to know how their research is being used. And especially to know when it is being misused.

That isn’t to say that we expect researchers will welcome this task. We are no Pollyannas. Researchers are already famously overstretched. They barely have time to keep up with the formally published literature. It seems cruel to expect them to keep up with the firehose of the Internet as well.

Which gets us back to the value of altmetrics tools. Our hope is that, as altmetrics tools evolve, they will provide publishers and researchers with an efficient mechanism for monitoring the use of their content in non-traditional venues. Just in the way that citations were used before they were distorted into proxies for credit and kudos.

We don’t think altmetrics are there yet. Partly because some parties are still tantalized by the prospect of substituting one metric for another. But mostly because the entire field is still nascent. People don’t yet know how the information can be combined and used effectively. So we still make naive assumptions such as “link=like” and “more=better.” Surely it will eventually occur to somebody that, instead, there may be a connection between repeated headline-grabbing research and academic fraud. A neuroscientist might be interested in a tool that alerts them if the MRI scans in their research paper are being misinterpreted on the web to promote neurobollocks. An immunologist may want to know if their research is being misused by the anti-vaccination movement. Perhaps the real value in gathering this data will be seen when somebody builds tools to help researchers DETECT puffery, social-citation cabals, and misinterpretation of research results?

But CrossRef won’t be building those tools. What we might be able to do is help others overcome another hurdle that blocks the development of more sophisticated tools: getting hold of the needed data in the first place. This is why we are dabbling in altmetrics.

Wikipedia is already the 8th largest referrer of CrossRef DOIs. Note that this doesn’t just mean that the Wikipedia cites lots of CrossRef DOIs, it means that people actually click on and follow those DOIs to the scholarly literature. As scholarly communication transcends traditional outlets and as the audience for scholarly research broadens, we think that it will be more important for publishers and researchers to be aware of how their research is being discussed and used. They may even need to engage more with non-scholarly audiences. In order to do this, they need to be aware of the conversations. CrossRef is providing this experimental data source in the hope that we can spur the development of more sophisticated tools for detecting and analyzing these conversations. Thankfully, this is an inexpensive experiment to conduct – largely thanks to the decision on the part of PLOS to open source its ALM code.

What Now?

CrossRef’s instance of PLOS’s ALM code is an experiment. We mentioned that we had encountered scalability problems and that we had resolved some of them. But there are still big scalability issues to address. For example, assuming a response time of 1 second, if we wanted to poll the English-language version of the Wikipedia to see what had cited each of the 65 million DOIs held in CrossRef, the process would take years to complete. But this is how the system is designed to work at the moment. It polls various source APIs to see if a particular DOI is “mentioned”. Parallelizing the queries might reduce the amount of time it takes to poll the Wikipedia, but it doesn’t reduce the work. Another obvious way in which we could improve the scalability of the system is to add a push mechanism to supplement the pull mechanism. Instead of going out and polling the Wikipedia 65 million times, we could establish a “scholarly linkback” mechanism that would allow third parties to alert us when DOIs and other scholarly identifiers are referenced (e.g. cited, bookmarked, shared). If the Wikipedia used this, then even in an extreme case scenario (i.e. everything in Wikipedia cites at least one CrossRef DOI), this would mean that we would only need to process ~ 4 million trackbacks.
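To make that scale concrete, a back-of-the-envelope sketch (assuming, as above, one API call per DOI and a one-second response time; the worker count is an arbitrary illustration):

```python
# Rough cost of polling one source (e.g. the Wikipedia) once per DOI,
# assuming ~1 second per API call, as described above.

DOIS = 65_000_000          # DOIs held in CrossRef at the time
SECONDS_PER_CALL = 1

serial_seconds = DOIS * SECONDS_PER_CALL
serial_years = serial_seconds / (365 * 24 * 3600)
print(f"serial sweep: ~{serial_years:.1f} years")   # roughly two years

# Parallelizing shrinks wall-clock time, but not the total work:
WORKERS = 50
parallel_days = serial_seconds / WORKERS / (24 * 3600)
print(f"with {WORKERS} workers: ~{parallel_days:.0f} days, same {DOIS:,} calls")
```

The arithmetic is the point: polling cost grows with the number of DOIs regardless of how many sources actually mention them, which is what motivates the push mechanism discussed next.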

The other significant advantage of adding a push API is that it would take the burden off of CrossRef to know what sources we want to poll. At the moment, if a new source comes online, we’d need to know about it and build a custom plugin to poll their data. This needlessly disadvantages new tools and services as it means that their data will not be gathered until they are big enough for us to pay attention to. If the service in question addresses a niche of the scholarly ecosystem, they may never become big enough. But if we allow sources to push data to us using a common infrastructure, then new sources do not need to wait for us to take notice before they can participate in the system.
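As a sketch of what such a pushed message might look like, a “scholarly linkback” could be a small JSON document that any source can POST to a shared collector. The field names below are purely illustrative assumptions; no such schema existed at the time of writing:

```python
import json

# Hypothetical "scholarly linkback" message a source might push to a
# shared collector, instead of waiting to be polled. Field names are
# illustrative assumptions, not a published schema.
linkback = {
    "source": "wikipedia",                               # who is reporting
    "event": "cited",                                    # cited / bookmarked / shared ...
    "subject": "https://en.wikipedia.org/wiki/Example",  # where it happened
    "object": "doi:10.5555/12345678",                    # the scholarly identifier (example DOI)
    "occurred_at": "2014-05-01T12:00:00Z",
}

REQUIRED = {"source", "event", "subject", "object", "occurred_at"}

def validate(msg: dict) -> bool:
    """Minimal check a collector might run before accepting a linkback."""
    return REQUIRED <= msg.keys() and msg["object"].startswith("doi:")

assert validate(linkback)
print(json.dumps(linkback, indent=2))
```

The design point is that the burden of knowing about new sources moves from the collector to the sources themselves: anything that can emit this message can participate without a custom plugin.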

Supporting (potentially) many new sources will raise another technical issue- tracking and maintaining the provenance of the data that we gather. The current ALM system does a pretty good job of keeping provenance data, but if we ever want third parties to be able to rely on the system, we probably need to extend the provenance information so that the data is cheaply and easily auditable.
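One illustrative shape for an auditable event record: alongside the count itself, keep enough provenance that a third party can re-verify it. All field names here are assumptions for illustration, not the ALM system’s actual schema:

```python
# Hypothetical auditable event record: the count plus the provenance
# needed to re-verify it. Field names are illustrative assumptions.
event = {
    "doi": "10.5555/12345678",        # example DOI, not a real one
    "source": "wikipedia",
    "count": 1,
    # provenance: how and when this number was obtained
    "retrieved_at": "2014-05-01T12:00:00Z",
    "request_url": "https://en.wikipedia.org/w/api.php?action=query&list=search&srsearch=10.5555/12345678",
    "evidence": ["https://en.wikipedia.org/wiki/Example_article"],
}

def auditable(ev: dict) -> bool:
    """An auditor should be able to re-run the recorded request and
    check each piece of evidence against the reported count."""
    return bool(ev.get("request_url")) and len(ev.get("evidence", [])) == ev.get("count", 0)

print(auditable(event))
```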

Perhaps the most important thing we want to learn from running this experimental ALM instance is: what it would take to run the system as a production service? What technical resources would it require? How could they be supported? And from this we hope to gain enough information to decide whether the service is worth running and, if so, by whom. CrossRef is just one of several organizations that could run such a service, but it is not clear if it would be the best one. We hope that as we work with PLOS, our members and the rest of the scholarly community, we’ll get a better idea of how such a service should be governed and sustained.

Details for Propellerheads

Warning, Caveats and Weasel Words

The CrossRef ALM instance is a CrossRef Labs project. It is running on R&D equipment in a non-production environment administered by an orangutan on a diet of Red Bull and vodka.

So what is working?

The system has been initially loaded with 317,500+ CrossRef DOIs representing publications from 2014. We will load more DOIs in reverse chronological order until we get bored or until the system falls over again.

We have activated the following sources:


  • PubMed
  • DataCite
  • PubMedCentral Europe Citations and Usage


We have data from the following sources but will need some work to achieve stability:


  • Facebook
  • Wikipedia
  • CiteULike
  • Twitter
  • Reddit


Some of them are faster than others. Some are more temperamental than others. WordPress, for example, seems to go into a sulk and shut itself off after approximately 1,300 API calls.

In any case, we will be monitoring and tweaking the sources as we gather data. We will also add new sources as we get requested API keys. We will probably even create one or two new sources ourselves. Watch this blog and we’ll update you as we add/tweak sources.

Dammit, shut up already and tell me how to query stuff.

You can log in to the CrossRef ALM instance simply using a Mozilla Persona (yes, we’d eventually like to support ORCID too). Once logged in, your account page will list an API key. Using the API key, you can do things like:

And you will see that (as of this writing), said Nature article has been cited by the Wikipedia article here:


PLOS has provided lovely detailed instructions for using the API. So, please, play with the API and see what you make of it. On our side we will be looking at how we can improve performance and expand coverage. We don’t promise much- the logistics here are formidable. As we said above, once you start working with millions of documents, the polling process starts to hit API walls quickly. But that is all part of the experiment. We appreciate your helping us and would like your feedback. We can be contacted at:
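The worked example from the original post (host and article DOI) is not preserved here, but the general query shape for a PLOS ALM (Lagotto) instance can be sketched as follows. The host is hypothetical, and the `/api/v5/articles` path and parameter names are taken from the ALM documentation of the time, so verify them against the instance you are actually using:

```python
from urllib.parse import urlencode

# Sketch of an ALM query URL. The host is a placeholder; the path and
# parameter names follow the PLOS ALM (Lagotto) docs of the time and
# should be checked against your instance's documentation.
BASE = "http://alm.example.org/api/v5/articles"   # hypothetical host

def alm_query(doi: str, api_key: str) -> str:
    """Build a URL asking the ALM instance for the events for one DOI."""
    return BASE + "?" + urlencode({"ids": doi, "api_key": api_key, "info": "detail"})

url = alm_query("10.5555/12345678", "YOUR_API_KEY")
print(url)
```

The response is JSON, one record per DOI, with per-source counts and events.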


DOIs unambiguously and persistently identify published, trustworthy, citable online scholarly literature. Right?


The South Park movie, “Bigger, Longer & Uncut” has a DOI:


So does the pornographic movie, “Young Sex Crazed Nurses”:


And the following DOI points to a fake article on a “Google-Based Alien Detector”:


And the following DOI refers to an infamous fake article on literary theory:


This scholarly article discusses the entirely fictitious Australian “Drop Bear”:


The following two DOIs point to the same article- the first DOI points to the final author version, and the second DOI points to the final published version:



The following two DOIs point to the same article- there is no apparent difference between the two copies:



Another example where two DOIs point to the same article and there is no apparent difference between the two copies:



These journals assigned DOIs, but not through CrossRef:




These two DOIs are assigned to two different data sets by two different RAs:



This DOI appears to have been published, but was not registered until well after it was published. There were 254 unsuccessful attempts to resolve it in September 2012 alone:


The owner of prefix ‘10.4223’, who is responsible for the above DOI, had 378,790 attempted resolutions in September 2012, of which 377,001 were failures. The top 10 DOI failures for this prefix each garnered over 200 attempted resolutions. As of November 2012 the prefix had only registered 349 DOIs.

Of the above 16 example DOIs, 11 cannot be used for CrossCheck or CrossMark, and 3 cannot be used with content negotiation. To search metadata for the above examples, you need to visit four sites:

The examples come from just 4 of the 8 existing DOI registration agencies (RAs). It is virtually impossible for somebody without specialized knowledge to tell which DOIs are Crossref DOIs and which ones are not.
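For the machine-inclined: the prefix alone does not reveal the registration agency, but the IDF does operate a “Which RA?” lookup keyed on the prefix. The URL shape below reflects the commonly documented `doi.org/ra/` endpoint; treat it as an assumption to verify:

```python
# A DOI is "<prefix>/<suffix>", where the prefix is "10.<registrant code>".
# The prefix alone does not tell you which RA assigned the DOI -- you
# have to ask the IDF's "Which RA?" service (URL shape assumed here).

def doi_prefix(doi: str) -> str:
    """Split off the DOI prefix (everything before the first slash)."""
    return doi.split("/", 1)[0]

def which_ra_url(doi: str) -> str:
    """Build a lookup URL for the IDF's 'Which RA?' service."""
    return "https://doi.org/ra/" + doi_prefix(doi)

print(doi_prefix("10.5555/12345678"))       # 10.5555
print(which_ra_url("10.5555/12345678"))     # https://doi.org/ra/10.5555
```

The service returns a small JSON document naming the RA for each prefix, which is exactly the specialized knowledge an ordinary reader of a DOI does not have.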


So DOIs unambiguously and persistently identify published, trustworthy, citable online scholarly literature. Right? Wrong.

The examples above are useful because they help elucidate some misconceptions about the DOI itself, the nature of the DOI registration agencies and, in particular, issues being raised by new RAs and new DOI allocation models.

DOIs are just identifiers

Crossref’s dominance as the primary DOI registration agency makes it easy to assume that Crossref’s *particular* application of the DOI as a scholarly citation identifier is somehow intrinsic to the DOI. The truth is, the DOI has nothing specifically to do with citation or scholarly publishing. It is simply an identifier that can be used for virtually any application. DOIs could be used as serial numbers on car parts, as supply-chain management identifiers for videos and music, or as cataloguing numbers for museum artifacts. The first two identifiers listed in the examples (a & b) illustrate this. They both belong to MovieLabs and are part of the EIDR (Entertainment Identifier Registry) effort to create a unique identifier for television and movie assets. At the moment, the DOIs that MovieLabs is assigning are B2B-focused and users are unlikely to see them in the wild. But we should recall that Crossref’s application of DOIs was also initially considered a B2B identifier- but it has since become widely recognized and depended on by researchers, librarians and third parties. The visibility of EIDR DOIs could change rapidly as they become more popular.

Multiple DOIs can be assigned to the same object

There is no International DOI Foundation (IDF) prohibition against assigning multiple DOIs to the same object. At most the IDF suggests that RAs might coordinate to avoid duplicate assignments, but it provides no guidelines on how such cross-RA checks would work.

Crossref, in its particular application of the DOI, attempts to ensure that we don’t assign different DOIs to two different copies of the same article, but that is designed to avoid having publishers mistakenly make duplicate submissions. Even then, there are subtle exceptions to this rule- the same article, if legitimately published in two different issues (e.g. a regular issue and a thematic issue), will be assigned different DOIs. This is because, though the actual article content might be identical, the *context* in which it is cited is also important to record and distinguish. Finally, of course, we assign multiple DOIs to the same “object” when we assign book-level and chapter-level DOIs. Or when we assign DOIs to components or reference work entries.
The likelihood of multiple DOIs being assigned to the same object increases as we have multiple RAs. In the future we might legitimately have a monograph that has different Bowker DOIs for different e-book platforms (Kindle, iPad, Kobo), yet all three might share the same Crossref DOI for citation purposes.

Again, the examples show this already happening. The examples f & g are assigned by DataCite (via FigShare) and Crossref respectively. The first identifies the author version and was presumably assigned by said author. The second identifies the publisher version and was assigned by the publisher.

Although Crossref, as a publisher-focused RA, might have historically proscribed the assignment of Crossref DOIs to archive or author versions, there has never been and could never be any such restrictions on other DOI RAs. These are legitimate applications of two citation identifiers to two versions of the same article.

However, the next set of examples, h, i, j and k show what appears to be a slightly different problem. In these cases articles that appear to be in all aspects *identical* have been assigned two separate DOIs by different RAs. In one respect this is a logistical or technical problem- although Crossref can check for such potential duplicate assignments within its own system, there is no way for us to do this across different RAs. But this is also a marketing and education problem- how do RAs with similar constituencies (publishers, researchers, librarians) and application of the DOI (scholarly citation) educate and inform their members about best practice in applying DOIs in that particular RAs context?

DOI registration agencies are not focused on content types, they are focused on constituencies and applications

The examples f through k also illustrate another area of fuzzy thinking about RAs- that they are somehow built around particular content types. We routinely hear people mistakenly explain that the difference between Crossref and DataCite is that “Crossref assigns DOIs to journal articles” and that “DataCite assigns DOIs to data.” Sometimes this is supplemented with “and Bowker assigns DOIs to books.” This is nonsense. CrossRef assigns DOIs to data (example o) as well as conference proceedings, programs, images, tables, books, chapters, reference entries, etc. And DataCite covers a similar breadth of content types, including articles (examples c, h, f, l, m). The difference between Crossref, DataCite and Bowker is their constituencies and applications- not the content types they apply DOIs to. Crossref’s constituency is publishers. DataCite’s constituency is data repositories, archives and national libraries. But even though Crossref and DataCite have different constituencies, they share a similar application of the DOI- that is, the use of DOIs as citation identifiers. This is in contrast to MovieLabs, whose application of the DOI is supply chain management.

DOI registration agency constituencies and applications can overlap *or* be entirely separate

Although Crossref’s constituency is “publishers”, we are catholic in our definition of “publisher” and have several members who run repositories that also “publish” content such as working papers and other grey literature (e.g. Woods Hole Oceanographic Institution, University of Michigan Library, University of Illinois Library). DataCite’s constituency is data repositories, archives and national libraries, but this doesn’t stop DataCite (through CDL/FigShare) from working with the publisher PLoS on their “Reproducibility Initiative”, which requires the archiving of article-related datasets. PLoS has announced that they will host all supplemental data sets on FigShare but will assign DOIs to those items through Crossref.

Crossref’s constituency of publishers overlaps heavily with Airiti, JaLC, mEDRA, ISTIC and Bowker. In the case of all but Bowker we also overlap in our application of the DOI in the service of citation identification. Bowker, though it shares Crossref’s constituency, uses DOIs for supply chain management applications.

Meanwhile, EIDR is an outlier, its constituency does not overlap with Crossref’s *and* its application of the DOI is different as well.

The relationship between RA constituency overlap (e.g. scholarly publishers vs television/movie studios) and application overlap (e.g. citation identification vs. supply chain management) can be visualized as such:

RA Application/Constituency overlap

The differences (subtle or large) between the various RAs are not evident to anybody without a fairly sophisticated understanding of the identifier space and the constituencies represented by the various RAs. To the ordinary person these are all just DOIs, which in turn are described as simply being “persistent interoperable identifiers.”

Which of course begs the question, what do we mean by “persistent” and “interoperable?”

DOIs are only as persistent as the registration agency’s application warrants.

The word “persistent” does not mean “permanent.” Andrew Treloar is known to point out that the primary sense of the word “persistent” in the New Oxford American Dictionary is:

Continuing firmly or obstinately in a course of action in spite of difficulty or opposition

Yet presumably the IDF once chose to use the word “persistent” instead of “perpetual” or “permanent” for other reasons. “Persistence” implies longevity, without committing to “forever.”

It may sound prissy, but it seems reasonable to expect that the useful life expectancy of the identifier used for managing inventory of the movie “Young Sex Crazed Nurses” might be different from the life expectancy of the identifier used to cite Henry Oldenburg’s “Epistle Dedicatory” in the first issue of the Philosophical Transactions. In other words, some RAs have a mandate to be more “obstinate” than others, and so their definitions of “persistence” may vary. Different RAs have different service level agreements.

The problem is that ordinary users of the “persistent” DOI have no way of distinguishing between those DOIs that are expected to have a useful life of 5 years and those DOIs that are expected to have a useful lifespan of 300+ years. Unfortunately, if one of the more than 6 million non-Crossref DOIs breaks today, it will likely be blamed on Crossref.

Similarly, if a DOI doesn’t work with an existing Crossref service, like OpenURL lookup, CrossCheck, CrossMark or Crossref Metadata Search, it will also be laid at the foot of Crossref. This scenario is likely to become even more complex as different RAs provide different specialized services for their constituencies.

Ironically, the converse doesn’t always apply. Crossref oftentimes does not get credit for services that we instigated at the IDF level. For instance, FigShare has been widely praised for implementing content negotiation for DOIs even though this initiative had nothing to do with FigShare; instead, it was implemented by DataCite with the prodding and active help of Crossref (DataCite even used Crossref’s code for a while). To be clear, we don’t begrudge praise for FigShare. We think FigShare is very cool- this just serves as an example of the confusion that is already occurring.


DOIs are only “interoperable” at a least common denominator level of functionality

There is no question that use of Crossref DOIs has enabled the interoperability of citations across scholarly publisher sites. The extra level of indirection built into the DOI means that publishers do not have to worry about negotiating multiple bilateral linking agreements and proprietary APIs. Furthermore, at the mundane technical level of following HTTP links, publishers also don’t have to worry about whether the DOI was registered with mEDRA, DataCite or Crossref as long as the DOI in question was applied with citation linking in mind.

However, what happens if somebody wants to use metadata to search for a particular DOI? What happens if they expect that DOI to work with content negotiation or to enable a CrossCheck analysis or show a CrossMark dialog or carry FundRef data? At this level, the purported interoperability of the DOI system falls apart. A publisher issuing DataCite DOIs cannot use CrossCheck. A user with a mEDRA DOI cannot use it with content negotiation. Somebody searching Crossref Metadata Search or using Crossref’s OpenURL API will not find DataCite records. Somebody depositing metadata in an RA other than Crossref or DataCite will not be able to deposit ORCIDs.
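A prerequisite for any of these services is knowing which RA registered a given DOI in the first place. A sketch of how a client might do this — it assumes the IDF’s RA lookup endpoint at `https://doi.org/ra/`, which returns JSON naming the RA for a DOI or prefix:

```python
import json
import urllib.parse

# "Which RA?" lookup: before deciding which services (CrossCheck,
# content negotiation, metadata search, ...) apply to a DOI, a client
# must first discover its registration agency. The endpoint URL is an
# assumption documented in the lead-in above.
def ra_lookup_url(doi):
    """URL that reports which registration agency a DOI belongs to."""
    return "https://doi.org/ra/" + urllib.parse.quote(doi)

def parse_ra(response_text):
    """Extract the RA name from the lookup service's JSON response."""
    return json.loads(response_text)[0]["RA"]

# A response of the shape the service returns:
sample = '[{"DOI": "10.5555/12345678", "RA": "Crossref"}]'
print(ra_lookup_url("10.5555/12345678"), parse_ra(sample))
```

That this extra dispatch step is needed at all illustrates the problem: the DOI alone tells a user nothing about which services will work with it.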

There are no easy or cheap technical solutions to fix this level of incompatibility, barring the creation of a superset of all RA functionality at the IDF level. But even if we had a technical solution to this problem, it isn’t clear that such a high level of interoperability is warranted across all RAs. The degree of interoperability that is desirable between RAs is only in proportion to the degree that they serve overlapping constituencies (e.g. publishers) or use the DOI for overlapping applications (e.g. citation).

DOI Interoperability matters more for some registration agencies than others

This raises the question of what it even means to be “interoperable” between different RAs that share virtually no overlap in constituencies or applications. In what meaningful sense do you make a DOI used for inventory control “interoperable” with a DOI used for identifying citable scholarly works? Do we want to be able to check “Young Sex Crazed Nurses” for plagiarism? Or let somebody know when the South Park movie has been retracted or updated? Do we need to alert somebody when their inventory of citations falls below a certain threshold? Or let them know how many copies of a PDF are left in the warehouse?

The opposite, but equally vexing, issue arises for RAs that actually do share constituencies and/or applications. Crossref, DataCite and mEDRA have *all* built separate metadata search capabilities, separate deposit APIs, separate OpenURL APIs, and separate stats packages, *all* geared at handling scholarly citation linking.

Finally, it seems a shame that a third party, like ORCID, who wants to enable researchers to add *any* DOI and its associated metadata to their ORCID profile, will end up having to interface with 4-5 different RAs.

Summary and closing thoughts

Crossref was founded by publishers who were prescient in understanding that, as scholarly content moved online, there was the potential to add great value to publications by directly linking citations to the documents cited. However, publishers also realized that many of the architectural attributes that made the WWW so successful (decentralization, simple protocols for markup, linking and display, etc.), also made the web a fragile platform for persistent citation.

The Crossref solution to this dilemma was to introduce the use of the DOI identifier as a level of citation indirection in order to layer a persist-able citation infrastructure onto the web. The success of this mechanism has been evident at a number of levels. A first-order effect of the system is that it has allowed publishers to create reliable and persistent links between copies of publisher content. Indeed, uptake of the Crossref system by scholarly and professional publishers has been rapid and almost all serious scholarly publishers are now Crossref members.

The second order effects of the Crossref system have also been remarkable. Firstly, just as researchers have long expected that any serious paper-based publication would include citations, now researchers expect that serious online scholarly publications will also support robust online citation linking. Secondly, some have adopted a cargo-cult practice of seeing the mere presence of a DOI on a publication as a putative sign of “citability” or “authority.” Thirdly, interest in use of the DOI as a linking mechanism has started to filter out to researchers themselves, thus potentially extending the use of CrossRef DOIs beyond being primarily a B2B citation convention.

The irony is that although the DOI system was almost single-handedly popularized and promoted by Crossref, the DOI brand is better known than Crossref itself. We now find that new RAs like EIDR and DataCite and new services like FigShare are building on the DOI brand and taking it in new directions. As such, the first- and second-order benefits of Crossref’s pioneering work with DOIs are likely to be affected by the increasing activity of the new DOI RAs as well as by the introduction of new models for assigning and maintaining DOIs.

How can you trust that a DOI is persistent if different RAs have different conceptions of persistence? How can you expect the presence of a DOI to indicate “authority” or “scholarliness” if DOIs are being assigned to porn movies? How can you expect a DOI to point to the “published” version of an article when authors can upload and assign DOIs to their own copies of articles?

It is precisely because we think that some of the qualities traditionally (and wrongly) accorded to DOIs (e.g. scholarly, published, stewarded, citable, persistent) are going to be diluted in the long term that we have focused so much of our recent attention on new initiatives that have a more direct and unambiguous connection to assessing the trustworthiness of Crossref members’ content. CrossCheck and the CrossCheck logos are designed to highlight the role that publishers play in detecting and preventing academic fraud. The CrossMark identification service will serve as a signal to researchers that publishers are committed to maintaining their scholarly content, as well as giving scholars the information they need to verify that they are using the most recent and reliable versions of a document. FundRef is designed to make the funding sources for research and articles transparent and easily accessible. And finally, we have been adjusting Crossref’s branding and display guidelines, as well as working with the IDF to refine its own branding and display guidelines, so as to help clearly differentiate different DOI applications and constituencies.

Whilst it might be worrying to some that DOIs are being applied in ways that Crossref has not expected and may not have historically endorsed, we should celebrate that the broader scholarly community is finally recognizing the importance of persist-able citation identifiers.

These developments also serve to reinforce a strong trend that we have encountered in several guises before. That is, the complete scholarly citation record is made up of more than citations to the formally published literature. Our work on ORCID underscored that researchers, funding agencies, institutions and publishers are interested in developing a more holistic view of the manifold contributions that are integral to research. The “C” in ORCID stands for “contributor”, and ORCID profiles are designed to ultimately allow researchers to record “products” which include not only formal publications, but also data sets, patents, software, web pages and other research outputs. Similarly, Crossref’s analysis of CitedBy references revealed that one in fifteen references in the scholarly literature published in 2012 included a plain, ordinary HTTP URI: clear evidence that researchers need to be able to cite informally published content on the web. If the trend in the CitedBy data continues, then in two to three years one in ten citations will be of informally published literature.

The developments that we are seeing are a response to the need that users have to persistently identify and cite the full gamut of content types that make up the scholarly literature. If we cannot persistently cite these content types, the scholarly citation record will grow increasingly porous and structurally unsound. We can either stand back, let these gaps be filled by other players under their terms, and deal reactively with the confusion that is likely to ensue, or we can start working in these areas too and help to make sure that what gets developed interacts with the existing online scholarly citation record in a responsible way.

CrossRef Author ID meeting

February 5, 2007, Washington DC. CrossRef invited a number of people to attend an information-gathering session on the topic of author IDs. The purpose of the meeting was to determine:
* whether there is an industry need for a central or federated contributor id registry;
* whether CrossRef should have a role in creating such a registry;
* how to proceed in a way that builds upon existing systems and standards.

In attendance: Jeff Baer, CSA; Judith Barnsby, IOPP; Geoff Bilder, CrossRef; Amy Brand, CrossRef; David Brown, British Library; Richard Cave, PLoS (remote); Bill Carden, ScholarOne; Gregg Gordon, SSRN; Gerry Grenier, IEEE; Michael Healy, BISG (remote); Helen Henderson, Ringgold; Thomas Hickey, OCLC (remote); Terry Hulburt, IOPP; Tim Ingoldsby, AIP; Ruth Jones, British Library; Marl Land, Parity; Dave Martinson, ACS; Georgios Papadapoulos, Atypon (with two colleagues); Jim Pringle, Thomson; Chris Rosin, Parity; Tim Ryan, Wiley; Philippa Scoones, Blackwell; Chris Shillum, Elsevier; Neil Smalheiser, UIC (remote); Barbara Tillett, LoC; Vetle Torvik, UIC (remote); Charles Trowbridge, ACS; Amanda Ward, Nature (remote); Stu Weibel, OCLC (remote); David Williamson, LoC.

Notes

Amy Brand opened the meeting and welcomed attendees. She said the goal of the meeting was really nothing more than to launch a discussion on the topic of author identifiers and hear from participants re their views and experiences on unique identifiers for individuals — be they authors, contributors, or otherwise. We went around the table and everyone introduced themselves. Amy then introduced Geoff Bilder as moderator of the meeting. Geoff Bilder said that CrossRef’s members had indicated that they would like CrossRef to explore whether it could play a role in creating an author identification system. The members feel that an “author DOI” scheme would help them with production and editorial issues. They also recognize that such a scheme could fuel numerous downstream applications. Geoff apologized for sounding like Rumsfeld and said, we know that there is a lot that we don’t know, but we don’t know exactly what we don’t know. We have just started this project and we wanted to get some feedback from various groups concerned with scholarly publishing in order to understand what people would like to see in regards to author identification schemes and what initiatives/efforts we need to be aware of.
He commented that the currently assembled group failed to include the open web community, and their input would be important too as this project develops. The meeting then turned to short project summaries from others.

Project Summaries

Jim Pringle gave a short PPT presentation (attached) and reported that Thomson first started creating its own author ids in 2000, in relation to the launch of its Highly Cited service. The focus for Thomson in this area has been on author disambiguation. Jim said that the focus for CrossRef in this area would be a system that could respond to the question “who are you and what have you written”; he also raised concern about matters of author privacy.

Michael Healy then discussed the International Standard Party Identifier (ISPI). ISO TC 46/SC 9 is developing ISPI as a new international identification system for the parties (persons and corporate bodies) involved in the creation and production of content entities. Work on the ISPI project began in August 2006, when the New Work Item proposal was approved by the member bodies of ISO TC 46/SC 9. The first meeting of the ISPI project group was held at CISAC’s offices in Paris on September 12, 2006. The project has strong representation from the library sector, with RROs, booksellers, and the music and film/TV industries represented as well. Mr. René Lloret Linares from CISAC (International Confederation of Societies of Authors and Composers) chairs the group; until now CISAC has been using a proprietary id scheme and would like to move to an open standard to identify all contributors and creators. Michael was asked whether membership in the project group was open, and he replied that anyone can attend meetings as an observer but that voting is restricted to those nominated by their own national standards organization. Chris Shillum then asked the group to think about use cases developed for the publishing industry, and how they differ from potential ISPI applications.
Helen Henderson reported on the Journals Supply Chain project, a pilot that aims to discover whether the creation of a standard, commonly used identifier for institutions (customer ids) will be beneficial to parties involved in the journal supply chain. The pilot models interactions between each party — library, publisher, agent. 35 publishers are participating thus far. Helen also said there is a clear need for sub-institutional-level ids, and pointed out the value of associating author and institutional ids. On the topic of institutions, Tim Ingoldsby pointed out that both academic and corporate institutions are important.

Chris Rosin said Parity is working on author merger and disambiguation as core use cases of author ids for its publisher clients. In particular, they have developed automated merging of instances into profiles, proceeding with a conservative bias on what constitutes a match/merge. Parity is also looking at attaching author CVs to profiles. This will require contributors to participate, and they will need to make it as easy as possible for contributors. Chris said that authentication, trust, and privacy are key considerations; even collecting public information in one place raises privacy issues. Judith Barnsby pointed out that the UK has stronger data protection rules than the US re privacy.

Discussion among the group at this point in the meeting identified two different areas in author id assignment: (1) ongoing assignment, and (2) retroactive assignment. Geoff said this distinction was useful for CrossRef, which could more easily address ongoing assignment via publishers working directly with authors.

Neil Smalheiser, a neuroscientist at UIC, reported on the Arrowsmith Project, a statistical model based on multiple features of the Medline database. The goal of the model is to predict the probability that any two papers are written by the same person.
The project’s “Authority” tool weighs criteria such as researcher affiliation, co-author names, journal title, and medical subject headings to identify the papers most likely written by a target author. For details:

David Williamson of LoC said he was working on name authority files, using ONIX metadata. Barbara Tillett of LoC spoke about authority files and related efforts in the library world, which uses the control number, one type of unique id. She reported that IFLA (International Federation of Library Associations) has a group working on how to share authority numbers, which has actually been in discussion since the 1970s; there is to be an IFLA-IPA meeting in April 2007. The library community is eager to share what it knows and what it has developed thus far. Barbara suggested that use of the Dublin Core format here may be the best way to go. Different communities will no doubt need different ids. What is needed in the library community is an international, multi-lingual solution, based on Unicode, connecting regional authority files. Publishers will want to take advantage of library authority files for retrospective identifications. Thomas Hickey of OCLC mentioned the WorldCat Identities service, which summarizes information for 20 million authors searchable in WorldCat. Gerry Grenier reported that IEEE was about to implement its own author disambiguation and id system, and he offered that this metadata could be fed into a CrossRef system.

Different participants had different views on whether the goal here should be a “light and non-centralized” (or federated) approach, versus a centralized registry with one place to link authors across all publishers, versus a hybrid — a centralized source to hand out unique ids, with publisher data distributed. There could also be a network of registration agencies working in a federated system. Different participants also had different views on CrossRef’s role.
Several publishers at the meeting supported CrossRef’s role, especially in the STM space, whereas some parties raised concern about whether CrossRef was an appropriate choice for a system that will need to be “available everywhere to everybody”, and others reiterated the importance of giving the academic community a voice in the development of such a service. Discussion then turned to use cases — the question being, what problems would having an author id help you solve in your organization? USE CASES ARTICULATED AT MEETING:
* for RROs, to facilitate distribution of monies owed to authors;
* for booksellers, disambiguation in search;
* to understand the provenance of documents;
* search — to find works for particular person; self presentation — how can I effectively present myself and my work to the world?;
* cross-walks — associating various life sciences ids, such as PubChem;
* identity of society members;
* identity of research funding institutions;
* disambiguation and attribution;
* linking authors and institutions;
* for enhancing peer review system — need unique ids to share information with various departments;
* to better know the value of our authors — for activities such as peer review, tracking stats on authors, article downloads, and individualized or personalized services;
* with a central registry, author only has one place they have to update their information;
* authors will want the information to be portable when they move from one institution to another — “where is Jeff Smith now?” is one such question;
* to associate connected authors with one another;
* to aggregate info on where (what institution) research is being done on a particular topic;
* privacy can be enhanced with author DOIs;
* sharing info from library to library;
* cluster all the works of a particular person for search purposes;
* stats about authors — “how many times has this author tried and been rejected from Nature?” for instance.

**NEXT STEPS:** Please watch the CrossTech blog for ongoing discussion.