Tools

A full range of tools for data-driven research is available. In addition to many free and open-source services, there are also proprietary platforms. UB Bern develops its own tools as necessary, and offers licenses and guidance concerning text and data mining platforms.

UB Bern’s DS Digital Toolbox offers Jupyter Notebooks for an easy introduction to the typical tasks involved in data work, including:

  • Using the APIs of publishers, databases and data aggregators (see the sketch after this list)
  • Cleaning up spreadsheet data
  • Extracting text from PDFs and using text recognition (OCR)
  • Segmenting documents in preparation for OCR
  • Natural Language Processing (NLP)
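
As a minimal illustration of the first item, using an API, the following sketch queries the public Crossref REST API for a few records. It is not one of the Toolbox Notebooks; the keyword and the number of results are arbitrary.

    import requests

    # Query the public Crossref REST API for works matching a keyword and print
    # the DOI and title of the first few hits.
    BASE_URL = "https://api.crossref.org/works"

    def search_crossref(keyword, rows=5):
        """Return (DOI, title) pairs for works matching the keyword."""
        response = requests.get(BASE_URL, params={"query": keyword, "rows": rows}, timeout=30)
        response.raise_for_status()
        results = []
        for item in response.json()["message"]["items"]:
            titles = item.get("title") or ["(no title)"]
            results.append((item.get("DOI", ""), titles[0]))
        return results

    if __name__ == "__main__":
        for doi, title in search_crossref("text and data mining"):
            print(f"{doi}: {title}")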

We also offer Notebooks for querying the metadata and full texts of the national Swiss cultural heritage platforms e-rara, e-manuscripta and e-periodica.
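
A rough sketch of how metadata can be harvested over the standard OAI-PMH protocol, which platforms of this kind commonly offer, is shown below; the endpoint URL is a placeholder and should be replaced with the address given in the respective platform's documentation.

    import requests
    import xml.etree.ElementTree as ET

    # Placeholder endpoint (an assumption): look up the platform's real OAI-PMH base URL.
    OAI_ENDPOINT = "https://www.example.org/oai"

    # Standard Dublin Core namespace used by the oai_dc metadata format.
    DC = {"dc": "http://purl.org/dc/elements/1.1/"}

    def harvest_titles(endpoint=OAI_ENDPOINT, limit=10):
        """Fetch one page of ListRecords results and return the Dublin Core titles."""
        params = {"verb": "ListRecords", "metadataPrefix": "oai_dc"}
        response = requests.get(endpoint, params=params, timeout=60)
        response.raise_for_status()
        root = ET.fromstring(response.content)
        titles = [element.text for element in root.findall(".//dc:title", DC)]
        return titles[:limit]

    if __name__ == "__main__":
        for title in harvest_titles():
            print(title)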

The Nexis Data Lab is a platform that makes LexisNexis media content available for text and data mining (TDM). The text database comprises current and historical news reports from 200,000 sources in over 100 countries. Up to 100,000 documents can be analyzed together as a corpus. For this, the platform offers an online Jupyter Notebook environment with Python and R kernels, along with simple starter scripts for getting into TDM.
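
The starter scripts themselves are only available inside the Data Lab. As a generic illustration of the kind of analysis involved, and not of the platform's own data access, the sketch below counts the most frequent terms in a folder of plain-text documents; the folder name is a placeholder.

    from collections import Counter
    from pathlib import Path
    import re

    def term_frequencies(corpus_dir, top_n=20):
        """Count the most frequent lowercased word tokens across all .txt files in a folder."""
        counts = Counter()
        for path in Path(corpus_dir).glob("*.txt"):
            text = path.read_text(encoding="utf-8", errors="ignore").lower()
            counts.update(re.findall(r"[a-zäöüéèà]+", text))
        return counts.most_common(top_n)

    if __name__ == "__main__":
        for term, freq in term_frequencies("corpus"):
            print(f"{freq:6d}  {term}")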

A single-user license for Nexis Data Lab is available. Please contact ds.ub@unibe.ch if you are interested.

Constellate is the text analysis platform of the provider Ithaka. The text collection provided includes the archives of JSTOR and Chronicling America. Users can compile large corpora and download them in the form of metadata, full texts and n-grams. Constellate offers a series of tutorials as an introduction to Python and Natural Language Processing (NLP), which are also available as Jupyter Notebooks.
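
Downloaded Constellate datasets are typically delivered as compressed JSON Lines files, one document per line. The sketch below aggregates per-document unigram counts into corpus-wide term frequencies; the file name and the unigramCount field are assumptions to be checked against the dataset documentation.

    import gzip
    import json
    from collections import Counter

    # Path and field name are assumptions: adjust them to the actual downloaded dataset.
    DATASET_PATH = "constellate-dataset.jsonl.gz"

    def top_unigrams(path=DATASET_PATH, top_n=25):
        """Aggregate per-document unigram counts into corpus-wide term frequencies."""
        totals = Counter()
        with gzip.open(path, "rt", encoding="utf-8") as handle:
            for line in handle:
                document = json.loads(line)
                totals.update(document.get("unigramCount", {}))
        return totals.most_common(top_n)

    if __name__ == "__main__":
        for term, count in top_unigrams():
            print(f"{count:8d}  {term}")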

To use Constellate you need to access it from the University of Bern’s network or VPN and also set up a personal account.

The HathiTrust Research Center (HTRC) enables TDM methods to be applied to the contents of the HathiTrust Digital Library, which comprises over 17 million digitized volumes dating back to 1700. Users can create corpora according to their own criteria and process them with the text analysis routines provided, or apply their own algorithms to them. Various tools and comprehensive documentation are available for this.
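
One common entry point is HathiTrust's Extracted Features data, which can be processed with the open-source htrc-feature-reader package. The sketch below, which assumes that package and a locally downloaded features file, prints the most frequent tokens of a single volume; the file name is a placeholder.

    # Requires the open-source htrc-feature-reader package (pip install htrc-feature-reader).
    from htrc_features import Volume

    # Locally downloaded HathiTrust Extracted Features file; the file name is a placeholder.
    vol = Volume("extracted_features_example.json.bz2")

    print(vol.title)  # bibliographic metadata embedded in the features file

    # Volume-level token counts as a pandas DataFrame (case-folded, no part-of-speech split).
    tokens = vol.tokenlist(pages=False, case=False, pos=False)
    print(tokens.sort_values("count", ascending=False).head(20))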

To use the HTRC, you must be authenticated by SWITCH edu-ID and set up a personal account with HathiTrust/HTRC.

OpenRefine is open-source software with an intuitive user interface for straightforward manipulation of spreadsheet data. OpenRefine provides many data cleaning and transformation functions which, thanks to the processing history, are easy to document and reproduce. One notable feature is the “Reconciliation” function, which allows users to check and enrich their own data against external data providers (e.g. Wikidata, the GND authority file, Crossref).

OpenRefine is available for multiple operating systems and can be tried out online without requiring installation.

Jupyter is an open-source, interactive development environment for various data science programming languages. Jupyter takes a literate programming approach, combining code and documentation in one document (a Jupyter Notebook). This allows analysis steps to be explained in detail, visualizations to be integrated directly into the file, and content to be exported to a variety of formats.
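
As a small illustration of the export step, the Jupyter ecosystem's nbformat and nbconvert packages can render a notebook to a standalone HTML file programmatically; the notebook file name below is a placeholder.

    import nbformat
    from nbconvert import HTMLExporter

    # Read an existing notebook (placeholder file name) and export it to a standalone HTML report.
    notebook = nbformat.read("analysis.ipynb", as_version=4)
    body, resources = HTMLExporter().from_notebook_node(notebook)

    with open("analysis.html", "w", encoding="utf-8") as handle:
        handle.write(body)

The same conversion is also available on the command line via jupyter nbconvert.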

Jupyter can be tried out online with a variety of kernels. EPFL provides an online JupyterHub environment for affiliated Swiss universities and research institutes.

Digital Scholarship tool collections

  • Text analysis, Natural Language Processing (NLP), literature analysis
  • Digital Humanities