MassiveText Dataset introduced for pre-training of DeepMind's Gopher
Private, diverse text dataset (99% English plus 10 other languages) composed of web, GitHub, news, Wikipedia, Books, and C4.
MassiveText dataset composition table
MassiveText Contents
- Language composition: 99% English; the remainder is spread across 10 other languages.
- Content filtering: Google SafeSearch filter.
- Composed of subsets: web, GitHub, news, Wikipedia, Books, and C4 (web text).

Papers Using MassiveText
Don’t Confuse MassiveText with Amazon Massive Dataset
- Amazon Massive is not MassiveText
- Amazon Massive is:
  - released by Amazon on 2022-04-20
  - a multilingual natural-language understanding dataset
  - a 51-language dataset
Created on 20 Mar 2022.
Updated on: 14 May 2022.