Vaclav Kosar
Software, Machine Learning, & Business

MassiveText Dataset introduced for pre-training of DeepMind's Gopher

Private, diverse, 10-lingual textual dataset composed of web, GitHub, news, Wikipedia, Books, and C4.
MassiveText dataset composition table

MassiveText Contents

  • Dataset language composition: 99% English, then 10 other languages.
  • Content is filtered with the Google SafeSearch filter
  • Dataset is composed of subsets: web, GitHub, news, Wikipedia, Books, C4 (web-text)
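
To illustrate how a dataset built from several subsets can be mixed during training, here is a minimal sketch of weighted subset sampling. The subset names come from the list above. The weights, the `SUBSET_WEIGHTS` name, and the `sample_subsets` helper are illustrative placeholders, not the proportions or the pipeline used for Gopher.

```python
import random
from collections import Counter

# Subset names follow the list above. The weights are illustrative
# placeholders (an assumption), NOT the proportions used for Gopher.
SUBSET_WEIGHTS = {
    "web": 0.5,
    "books": 0.2,
    "c4": 0.1,
    "news": 0.1,
    "github": 0.05,
    "wikipedia": 0.05,
}


def sample_subsets(n, weights=SUBSET_WEIGHTS, seed=0):
    """Draw n subset names, each with probability proportional to its weight."""
    rng = random.Random(seed)  # seeded so the draw is reproducible
    names = list(weights)
    return rng.choices(names, weights=[weights[k] for k in names], k=n)


# Count how often each subset is picked across 10,000 draws.
counts = Counter(sample_subsets(10_000))
```

In a real data loader, each drawn name would select which subset the next document is read from, so that small, high-quality subsets can be over- or under-sampled relative to their raw size.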

MassiveText non-English composition

Papers Using MassiveText

Don’t Confuse MassiveText with Amazon Massive Dataset

  • Amazon Massive is not MassiveText
  • Amazon Massive is:
    • released by Amazon on 2022-04-20
    • multilingual natural-language understanding
    • 51-language dataset

Created on 20 Mar 2022. Updated on: 14 May 2022.
