Data sources for text and data mining

Text and data mining (TDM) refers to algorithm-based processes to automatically extract information from unstructured or semi-structured text data (text mining) and structured data (data mining). 
On this page you will find text and data mining resources – ordered by content category – which are either freely available on the web or through UB Bern’s licenses.

Unless other contact details are provided, please contact UB Bern if you are interested in obtaining data.

Licensed data, text and image collections
Resource Contents Detailed information
Swiss Media content: 

Swissdox@LiRI (general information on the Swissdox database)

  • Bulk download of full texts from Swissdox (Swiss media database): TSV, XML
  • Approx. 23 million articles from 250 Swiss newspapers and (online) media products
  • Dating from 1910, updated daily
  • Access via SWITCH edu-ID
  • Use also possible via API
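
Once a TSV export has been downloaded, a first sanity check is often to count articles per year. The following is a minimal sketch only: the column names (`id`, `pubtime`, `head`) and the date format are assumptions made for illustration and should be checked against the header of the actual export.

```python
import csv
import io
from collections import Counter


def count_articles_per_year(tsv_file, date_column="pubtime"):
    """Count rows per publication year in a tab-separated export.

    `date_column` is a hypothetical column name; adjust it to the real header.
    """
    reader = csv.DictReader(tsv_file, delimiter="\t")
    counts = Counter()
    for row in reader:
        # Assumes the date value starts with a four-digit year, e.g. "2021-03-04".
        year = row[date_column][:4]
        counts[year] += 1
    return counts


# Tiny in-memory sample standing in for a downloaded file (invented data).
sample = io.StringIO(
    "id\tpubtime\thead\n"
    "1\t2021-03-04\tFirst headline\n"
    "2\t2021-07-19\tSecond headline\n"
    "3\t2022-01-02\tThird headline\n"
)
counts = count_articles_per_year(sample)
print(dict(counts))
```

For a real file, pass an open file handle instead of the sample, for example `open(path, newline="", encoding="utf-8")`. If rows contain very long full texts, Python's default CSV field size limit may need to be raised with `csv.field_size_limit`.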

International Media content: Nexis Data Lab
  • media content from 20,000 sources from over 100 countries
  • convenient corpus creation
  • analysis in JupyterHub environment (Python, R)
  • Jupyter Notebooks for getting started with TDM
  • own code and packages uploadable, results and code downloadable
  • no download of raw data
  • Website, FAQ, Help Guide
  • Single user license: Please contact stefan.ittner@unibe.ch if you are interested.
Books International: HathiTrust Research Center
  • 17 million digitized volumes from US libraries (from 1700)
  • create your own corpus and download it in preprocessed form (Derived Datasets)
  • simple built-in text analysis routines and visualizations
  • virtual machines for data analysis 
  • preprocessed datasets for English-language literature
WBIS Online (DeGruyter) (general information about the database)
  • Biographical records on over six million historic and contemporary persons
  • Updated continuously. Includes 8.5 million digital facsimile articles from biographical reference works.
  • Multilingual 
Germanistik Online (DeGruyter) (general information about the database)
  • 400,000 bibliographical entries, updated continuously 
Romance Studies Bibliography (DeGruyter) (general information about the database)
  • 400,000 bibliographical entries, updated continuously 
English-language periodicals (Gale Cengage)
  • The Times Digital Archive 1785-2014, general information about the database
  • International Herald Tribune 1887-2013, general information about the database
  • The Economist Historical Archive 1843-2015, general information about the database
English-language periodicals (ProQuest)
  • British Periodicals: 491 newspapers/magazines from the UK, Ireland, India, 1681-2007, 6.7 million articles, JPEG, PDF, OCR/XML, general information about the database
  • American Periodicals: 1,509 newspapers/magazines and scientific journals, North America, 1741-1988, 11.5 million articles, PDF, OCR/XML, general information about the database
English-language monographs (Gale Cengage)
  • Eighteenth Century Collections Online (ECCO), general information about the database
  • Nineteenth Century Collections Online (NCCO): British Theatre, Music and Literature, general information about the database
  • Nineteenth Century Collections Online (NCCO): Europe and Africa, general information about the database

 

UK Parliamentary Papers (ProQuest)
  • UK Parliamentary Papers from the 18th to the 20th century
  • XML, PDF
  • General information about the database
Cambridge Histories (CUP)
  • Over 400 volumes of international history (English)
  • PDF (download), XML (on request)
  • IP-driven access (University network/VPN)
  • General information about the database
Freely accessible text collections
Platform Contents Detailed information
CLARIN Resource Families

Website

  • Overview and, in some cases, access to language corpora in all subject areas and many languages
Partly available for free, various licenses
e-rara 
  • 100,000 historic and rare printed publications from Swiss institutions
  • Full texts: PDF, some TXT
  • Jupyter Notebook for bulk downloads of metadata and full texts
Overview of data interfaces and terms
e-manuscripta 
  • 150,000 manuscript materials from Swiss institutions
  • Full texts: PDF
  • Jupyter Notebook for bulk downloads of metadata and full texts
Overview of data interfaces and terms
e-periodica
  • 900 journals from Switzerland
  • Full texts: PDF
  • Jupyter Notebook for bulk downloads of metadata and full texts including text parsing
Overview of data interfaces and terms
GLAM Workbench

Website

  • Comprehensive datasets from Australian and New Zealand heritage institutions, web archives and government documents
  • API documentation, bulk downloads and Jupyter notebooks
Freely accessible, various licenses
Chronicling America
Freely accessible, public domain
Internet Archive

Documentation

  • 37 million books and texts in a variety of genres, languages and data formats
  • Bulk download with command line tool and Python wrapper
Freely accessible, various licenses, sometimes not specified
Project Gutenberg

Documentation

  • 70,000 books in a variety of genres, languages and data formats
Freely accessible, public domain
OpenGLAM Survey

Overview

  • Overview of open data sources (digital reproductions, texts, metadata) of 1,600 cultural heritage institutions worldwide, with details of licenses and APIs
Freely accessible, public domain or open licenses
Text Creation Partnership
  • 73,000 public domain transcribed full texts (SGML/XML/TEI) of printed works from the 15th to the 18th century as bulk downloads (single files also in the Oxford Text Archive: EPUB, HTML, XML, partly also POS-annotated as TSV)
  • Early English Books Online (EEBO, 60,000 transcribed full texts, 1473-1700)
  • Eighteenth Century Collections Online (ECCO, 3,000 transcribed full texts, 1700-1800)
  • Evans Early American Imprints (Evans, 5,000 transcribed full texts, 1640-1800)
Freely accessible, public domain
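
TEI-encoded full texts such as these can be reduced to plain text with Python's standard library. The sketch below assumes files in TEI P5 XML that use the standard TEI namespace; older SGML or un-namespaced files would need different handling. The function name and the sample document are invented for illustration.

```python
import xml.etree.ElementTree as ET

# Standard TEI P5 namespace; files without it will not match the paths below.
TEI_NS = {"tei": "http://www.tei-c.org/ns/1.0"}


def extract_title_and_text(xml_string):
    """Return (title, plain text) from a TEI P5 document given as a string."""
    root = ET.fromstring(xml_string)

    # The first title in the header's title statement, if present.
    title_el = root.find(".//tei:titleStmt/tei:title", TEI_NS)
    title = "".join(title_el.itertext()).strip() if title_el is not None else ""

    # All text below <text>, with runs of whitespace collapsed to single spaces.
    text_el = root.find(".//tei:text", TEI_NS)
    text = " ".join("".join(text_el.itertext()).split()) if text_el is not None else ""
    return title, text


# Invented miniature document standing in for a downloaded file.
sample = """<TEI xmlns="http://www.tei-c.org/ns/1.0">
  <teiHeader>
    <fileDesc><titleStmt><title>A sample title</title></titleStmt></fileDesc>
  </teiHeader>
  <text>
    <body>
      <p>First   paragraph.</p>
      <p>Second paragraph.</p>
    </body>
  </text>
</TEI>"""

title, text = extract_title_and_text(sample)
print(title)
print(text)
```

This approach discards all markup, including notes, page breaks and gap indicators for illegible passages. Depending on the research question, these elements may need to be filtered or kept separately.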

The resources and their interfaces are subject to various legal and technical terms of use. Please consult these before any automated access. In particular, automated access is often excluded for licensed content that is not listed here and may cause the provider to block access to the database. Please contact us to check the legality of access if you are in any doubt.

Under the Swiss Federal Act on Copyright and Related Rights, reproducing and storing legally accessible content for scientific purposes, such as TDM, is permitted.