What is the Open Syllabus Project?

The OSP is an effort to make the intellectual judgment embedded in syllabi relevant to broader explorations of teaching, publishing, and intellectual history.  The project has collected over 1 million syllabi, has extracted citations and other metadata from them, and is now pleased to make the Syllabus Explorer publicly available as a means of exploring this corpus.  Looking ahead, the OSP’s goal is to expand the collection and make it more useful to authors, teachers, administrators, and students.

How does the OSP get its syllabi?

Primarily through crawling and scraping publicly accessible university websites.  We have also rescraped the links in Dan Cohen’s ‘Million Syllabi’ database from 2005-2006 using the Internet Archive’s Wayback Machine.  We plan to continue our scraping efforts in 2016.

Over time, the project will also need individual faculty donations and access to institutional syllabus archives.  At present, we have around 1.1 million syllabi, drawn predominantly from the past decade of teaching in the US.  We think the total number of US, UK, Canadian, and Australian syllabi produced over the past 15 years is in the range of 80-100 million.

What is Teaching Score?

‘Teaching Score’ (TS) is a numerical indicator of the frequency with which a particular work is taught.  The overall Teaching Score is based on the rank of a text among the citations in the total collection; the Field Teaching Score is based on the rank of a text among the citations in syllabi identified as belonging to a particular field.  Both are rendered on a 1-100 scale.  Is this a perfect way to express frequency?  No, but we tried quite a few formulas and thought this one worked best.
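For illustration, a rank-to-scale mapping of this general kind might look like the following Python sketch.  The OSP has not published its exact formula, so the percentile-style mapping and the names here are stand-ins, not the real method.

```python
# Illustrative only: the OSP's actual Teaching Score formula is not published.
# This shows one plausible way to map a citation-frequency rank onto 1-100.

def teaching_score(rank, total_titles):
    """Map a title's rank (1 = most frequently taught) onto a 1-100 scale."""
    percentile = 1 - (rank - 1) / total_titles  # fraction of titles at or below this rank
    return max(1, round(100 * percentile))

# e.g. the 50th most-taught title among 10,000 cataloged titles
print(teaching_score(50, 10_000))  # -> 100 (top half-percent of titles)
```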

What counts as ‘taught’?

If a work appears on a syllabus, it counts for the purposes of Teaching Score and other indicators of frequency.  If a work appears 10 times on a syllabus, it counts only once.  If it appears in ‘suggested reading’ or some other secondary list, it still counts. Our methods can’t reliably distinguish primary from secondary reading (yet).
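The counting rule above is simple enough to state in code.  This is a toy sketch of that rule (the work IDs and data layout are hypothetical), not the OSP pipeline itself:

```python
# Toy sketch of the counting rule: a work counts at most once per syllabus,
# no matter how many times it appears, and secondary lists still count.

from collections import Counter

def count_appearances(syllabi):
    """syllabi: list of lists of work IDs extracted from each syllabus."""
    counts = Counter()
    for works in syllabi:
        counts.update(set(works))  # set() collapses repeats within one syllabus
    return counts

syllabi = [["plato-republic", "plato-republic", "hobbes-leviathan"],
           ["plato-republic"]]
print(count_appearances(syllabi))
# Counter({'plato-republic': 2, 'hobbes-leviathan': 1})
```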

Should I fear Teaching Score as another step in the reduction of intellectual life to crude metrics?

We think Teaching Score captures a different set of judgments about the value of a work than conventional citation analysis does, and so diversifies the range of things that can be measured and valued.  Given the current reliance on narrow, mostly journal-based citation analysis and rankings, we see diversification as an improvement.  By capturing what people teach, we think Teaching Score will privilege more synthetic, accessible, and public-facing work than traditional citation analysis based on research articles does.

Why doesn’t the Syllabus Explorer show results for my article on X (or other variations on this question)?

There are several possible answers:

  1. Your article is not taught in the syllabi currently in the collection.   How representative is our collection?  We don’t know yet.
  2. Your article is not in the citation catalogs that we use to identify texts in the collection: Harvard Library Open Metadata and JSTOR.  The Syllabus Explorer identifies citations by looking for matches between the titles in these catalogs and text strings in the 1.1 million syllabi.  Matches are confirmed by looking for other relevant citation information surrounding the text string: author, publisher, year, and so on (see the sketch below).  This method has some limitations: the Harvard catalog contains 11 million titles but no articles; the JSTOR catalog contains 9 million articles but catalogs material only after a 3-5 year delay and provides very little coverage of the sciences.  Neither does well with periodical, open-access, and gray literature.  We will address these limitations by adding other catalogs in the future (and updating the current ones).
  3. Our algorithms just didn’t properly identify your piece.  The matching method is imperfect, and citation formatting on syllabi is often inconsistent.  We have been able to partially correct for certain categories of problems in the matching algorithms, but these improvements are iterative and ongoing.

Our matching algorithms also have some difficulty with short titles built from commonly used words, and more so when these lack an author.  This affects a very small number of works, but some significant ones, like the Bible and the Constitution.
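Here is the simplified sketch promised above.  It is a hypothetical, stripped-down version of title matching with metadata confirmation; the real pipeline, its thresholds, and its function names are certainly more involved:

```python
# Hypothetical sketch of the matching step: find a catalog title in the
# syllabus text, then confirm with nearby metadata (author, year).

import re

def confirm_citation(syllabus_text, title, author=None, year=None):
    """Return True if `title` appears and nearby metadata confirms it."""
    match = re.search(re.escape(title), syllabus_text, re.IGNORECASE)
    if not match:
        return False
    # Look for confirming metadata within a window around the matched title.
    start, end = match.span()
    window = syllabus_text[max(0, start - 200):end + 200].lower()
    confirmations = 0
    if author and author.lower() in window:
        confirmations += 1
    if year and str(year) in window:
        confirmations += 1
    return confirmations >= 1

print(confirm_citation(
    "Week 3: Deitel & Deitel, C++ How to Program (2011), ch. 1-4",
    "C++ How to Program", author="Deitel", year=2011))  # True
```

A rule like “title string plus at least one confirming field” is one way to keep false positives down when titles are short or generic.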

Is that also how you get date, location, and field information?

Yes.  It’s all based on brute-force matching of syllabus text and source URLs against lists of dates, university names, and fields (in this case, a simplified version of the CIP 2000 list).  All the same caveats apply: the catalogs, algorithms, and syllabi can all introduce ambiguity or error.  Our best estimate is that our current approaches successfully map 70% of dates and 60% of fields.  We have also developed tools for differentiating syllabi from non-syllabi in the collection process, with an accuracy of around 90%.
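As a rough illustration of what brute-force matching can mean here, the toy Python below matches a source URL against known university domains, scans the text for field keywords, and pulls a plausible year with a regular expression.  The lists, names, and rules are invented for the example:

```python
# Toy illustration of brute-force metadata extraction; lists and rules invented.

import re

UNIVERSITIES = {"columbia.edu": "Columbia University",
                "unc.edu": "University of North Carolina at Chapel Hill"}
FIELD_KEYWORDS = {"political science": "Political Science",
                  "computer science": "Computer Science"}

def extract_metadata(text, source_url):
    """Look up institution by URL domain, field by keyword, year by regex."""
    host = re.sub(r"^https?://(www\.)?", "", source_url).split("/")[0]
    university = next((name for domain, name in UNIVERSITIES.items()
                       if host.endswith(domain)), None)
    field = next((label for kw, label in FIELD_KEYWORDS.items()
                  if kw in text.lower()), None)
    year_match = re.search(r"\b(19|20)\d{2}\b", text)
    return {"university": university,
            "field": field,
            "year": int(year_match.group()) if year_match else None}

print(extract_metadata("POLS 101: Political Science, Fall 2014",
                       "http://www.columbia.edu/~prof/pols101.html"))
# -> {'university': 'Columbia University', 'field': 'Political Science', 'year': 2014}
```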

This is the low-hanging fruit in these documents.  Our current catalogs offer many more variables to explore, such as differences between types of school.  More refined machine learning and natural language processing techniques will probably be able to extract more information from the syllabi, such as chapter assignments, common sequences of assignments, and schedules.

Why don’t the numbers always add up?

The methods used to extract citations, locations, and fields from the syllabi have different degrees of accuracy and so produce different counts.  In practice, many syllabi go unidentified in one or more of the core categories.  So, for example, Harvey Deitel’s C++: How to Program is identified 1,629 times in the whole collection but only 308 times in syllabi identified as belonging to either Computer Science or Engineering; the other roughly 1,300 appearances are on syllabi whose field we could not identify.  That puts a lot of Deitel in the unidentified column.  The most reliable numbers will always come from the full, unfiltered collection.
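The effect is easy to see in miniature.  In this toy example (data invented), one of three syllabi citing the same text has no field label, so a field-filtered count drops it:

```python
# Toy example: field-filtered counts undercount when field labels are missing.

syllabi = [
    {"field": "Computer Science", "cites": ["deitel-cpp"]},
    {"field": None,               "cites": ["deitel-cpp"]},  # field unidentified
    {"field": "Engineering",      "cites": ["deitel-cpp"]},
]

total = sum("deitel-cpp" in s["cites"] for s in syllabi)
cs_or_eng = sum("deitel-cpp" in s["cites"] for s in syllabi
                if s["field"] in ("Computer Science", "Engineering"))
print(total, cs_or_eng)  # 3 2 -- the unlabeled syllabus drops out of the filter
```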

Are you exposing my syllabi to the world?

No.  We provide no access to the underlying documents at this time, and will only do so in the future with clear permission to (re)publish.  All of the analytical tools available through the Explorer are limited to statistical aggregation and filtering of metadata from the syllabi.  These tools expose no personally identifying information, and the composition and use of the collection remain conservatively within US fair use principles.

In general, we would like to see syllabi governed by more open norms within the academy and hope that the OSP can provide a good reason for moving in that direction.

How did this get started?

The OSP pulls together several research groups interested in the question of what one can learn from large numbers of syllabi, including researchers at Harvard, the University of North Carolina at Chapel Hill, and Swarthmore College.  The progenitor of much of this work was the “Million Syllabi” database created by Dan Cohen between 2002 and 2009.  The OSP is housed at The American Assembly at Columbia University, with support from the Columbia University Library and Department of English.  The project has been funded by the Sloan Foundation.

Who built the Syllabus Explorer?

David McClure has been the lead developer on the project since 2014 and deserves most of the credit for the OSP architecture and the Syllabus Explorer app.  McClure was supported by Jonathan Stray and the team at Overview, which helped the OSP solve many of the problems involved in working with a collection of more than a million documents.  Alexander Duryee was responsible for much of the scraping and management of the growing document collection.  Sam Zhang contributed a classifier tool that permitted us to distinguish syllabi from non-syllabi with around 90% accuracy.  Many others contributed advice, time, and expertise, including the team at Citeseer.

What’s Next?

We think the Open Syllabus Project has the potential to become a very valuable resource for the academy.  In the coming months we will:

  • Integrate maps and other visualizations into the Explorer.
  • Expose date information for syllabi, which will enable the tracing of the frequency with which texts are taught over time.
  • Improve the process for identifying fields.
  • Release a preliminary API for accessing the metadata, and make the metadata available for download.

In the next year, we want to:

  • Create a new demand metric for Open Access materials (by extending Teaching Score), and potentially through that process address the incentives problem that hinders the cataloging of OA materials.
  • Enable cross-referencing of other data that we have (like university size and type) and some that we don’t (yet), like the gender of authors.  Top and bottom 10 political science departments in terms of number of women authors taught, anyone?
  • Implement a reliable system for deduping author names and other hard categories.
  • Explore the extension of the OSP into Spanish and other languages.
  • Triple the number of syllabi in the collection.
  • Improve the extraction algorithms for citations, fields, dates, and so on, and explore the development of some harder ones (like sequencing of assignments, portions of texts, and primary/secondary reading distinctions).
  • Routinize licensing arrangements with universities for access to syllabus archives.