Data sources for text and data mining

Text and data mining (TDM) refers to algorithm-based processes that automatically extract information from unstructured or semi-structured text (text mining) and from structured data (data mining).
On this page you will find text and data mining resources – ordered by content category – that are available either freely on the web or through UB Bern’s licenses.

Unless other contact details are provided, please contact UB Bern if you are interested in obtaining data.

Resource | Contents | Detailed information
Swiss periodicals: 

Swissdox@LiRI (general information on the Swissdox database)

  • Bulk download of full texts from Swissdox (Swiss media database): TSV, XML
  • Approx. 23 million articles from 250 Swiss newspapers and (online) media products
  • Dating from 1910, updated daily
  • Access via SWITCH edu-ID
  • API will follow in mid-2022

WBIS Online (De Gruyter) (general information about the database)
  • Biographical records on over six million historic and contemporary persons
  • Updated continuously. Includes 8.5 million digital facsimile articles from biographical reference works.
  • Multilingual 
Germanistik Online (De Gruyter) (general information about the database)
  • 400,000 bibliographical entries, updated continuously
Romance Studies Bibliography (De Gruyter) (general information about the database)
  • 400,000 bibliographical entries, updated continuously
English-language periodicals (Gale Cengage)
  • The Times Digital Archive 1785-2014, general information about the database
  • International Herald Tribune 1887-2013, general information about the database
  • The Economist Historical Archive 1843-2015, general information about the database
English-language periodicals (ProQuest)
  • British Periodicals: 491 newspapers/magazines from the UK, Ireland, India, 1681-2007, 6.7 million articles, JPEG, PDF, OCR/XML, general information about the database
  • American Periodicals: 1,509 newspapers/magazines and scientific journals, North America, 1741-1988, 11.5 million articles, PDF, OCR/XML, general information about the database
English-language monographs (Gale Cengage)
  • Eighteenth Century Collections Online (ECCO), general information about the database
  • Nineteenth Century Collections Online (NCCO): British Theatre, Music and Literature, general information about the database
  • Nineteenth Century Collections Online (NCCO): Europe and Africa, general information about the database

 

UK Parliamentary Papers (ProQuest)
  • UK Parliamentary Papers from the 18th to the 20th century
  • XML, PDF
  • General information about the database
Cambridge Histories (CUP)
  • Over 400 volumes of international history (English)
  • PDF (download), XML (on request)
  • IP-driven access (University network/VPN)
  • General information about the database
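Several of the resources above, such as the Swissdox bulk download, deliver their data as TSV files. The following is a minimal sketch of a first inspection step with Python's standard library. The column names (`id`, `medium`, `head`) and the sample rows are hypothetical placeholders and not the actual Swissdox schema, so check the header of the file you receive before adapting it.

```python
import csv
import io
from collections import Counter


def count_rows_by_column(tsv_text, column):
    """Count how often each value occurs in one column of a TSV export."""
    # QUOTE_NONE treats quotation marks inside article text as ordinary characters.
    reader = csv.DictReader(
        io.StringIO(tsv_text), delimiter="\t", quoting=csv.QUOTE_NONE
    )
    return Counter(row[column] for row in reader)


# Hypothetical sample data; a real export has its own columns and values.
sample = (
    "id\tmedium\thead\n"
    "1\tpaper_a\tFirst headline\n"
    "2\tpaper_b\tSecond headline\n"
    "3\tpaper_a\tThird headline\n"
)

counts = count_rows_by_column(sample, "medium")
for medium, number in counts.most_common():
    print(medium, number)
```

For a real file, the text would come from `open(path, encoding="utf-8", newline="")` instead of the inline sample. Very long full-text fields may also require raising the limit with `csv.field_size_limit`.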
Platform | Contents | Detailed information
CLARIN Resource Families

Website

  • Overview and, in some cases, access to language corpora in all subject areas and many languages
Partly available for free, various licenses
e-rara 
  • 90,000 historic and rare printed publications from Swiss institutions
  • Full texts: PDF, some TXT
  • Jupyter Notebook for bulk downloads of metadata and full texts
Overview of data interfaces and terms
e-manuscripta 
  • 100,000 manuscript materials from Swiss institutions
  • Full texts: PDF
  • Jupyter Notebook for bulk downloads of metadata and full texts
Overview of data interfaces and terms
e-periodica
  • 840 journals from Switzerland
  • Full texts: PDF
  • Jupyter Notebook for bulk downloads of metadata and full texts including text parsing
Overview of data interfaces and terms
Chronicling America
Freely accessible, public domain
Internet Archive

Documentation

  • 34 million books and texts in a variety of genres, languages and data formats
  • Bulk download with command line tool and Python wrapper
Freely accessible, various licenses, sometimes not specified
Project Gutenberg

Documentation

  • 60,000 books in a variety of genres, languages and data formats
Freely accessible, public domain
OpenGLAM Survey

Overview

  • Overview of open data sources (digital reproductions, texts, metadata) of 1,400 cultural heritage institutions worldwide, with details of licenses and APIs
Freely accessible, public domain or open licenses

The resources and their interfaces are subject to various legal and technical terms of use. Please consult these before any automated access. In particular, licensed content that is not listed here often excludes automated access, and attempting it may cause the provider to block access to the database. If you are in any doubt, please contact us to check whether access is legal.
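One technical signal that can be checked before automated access is a site's robots.txt file. The sketch below uses Python's standard `urllib.robotparser` to test a URL against a set of rules. The rules, the user agent name `tdm-bot` and the `example.org` URLs are made-up illustrations. A permissive robots.txt does not replace the provider's license or terms of use, which still have to be read separately.

```python
from urllib.robotparser import RobotFileParser


def is_path_allowed(robots_txt, user_agent, url):
    """Return True if the given robots.txt rules allow user_agent to fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


# Hypothetical rules for illustration; real rules come from <site>/robots.txt.
sample_rules = """User-agent: *
Disallow: /private/
Crawl-delay: 5
"""

allowed = is_path_allowed(
    sample_rules, "tdm-bot", "https://example.org/public/page.html"
)
blocked = is_path_allowed(
    sample_rules, "tdm-bot", "https://example.org/private/data.xml"
)
print(allowed, blocked)
```

To check a live site, `RobotFileParser.set_url()` followed by `read()` fetches the file over the network instead of parsing an inline string.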

According to the Swiss Federal Act on Copyright and Related Rights, the duplication and storage of legally accessible content for scientific purposes, as in the context of TDM, is permitted.