- Introduced in DeepMind’s paper “Scaling Language Models: Methods, Analysis & Insights from Training Gopher” (the Gopher paper)
- Disk size is 10.5 TB.
- Contains around 5T tokens.
- Document count is 2.32B with average 2k tokens per document.
- Dataset is private - not released under an open-source license as of 2022-03-20.
- A similar but public dataset is The Pile (a diverse 800 GB text dataset).
- Dataset language composition: 99% English, with the remaining 1% spread across 10 other languages.
- Documents are filtered for explicit content with Google’s SafeSearch filter.
- Dataset is composed of subsets: web (MassiveWeb), GitHub, news, Wikipedia, books, C4 (web text).
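A quick back-of-the-envelope check shows the figures above are mutually consistent. This sketch uses only the numbers quoted in the list; the variable names are ours:

```python
# Sanity-check the MassiveText figures quoted above.
docs = 2.32e9            # document count from the list
tokens_per_doc = 2_000   # stated average tokens per document
disk_bytes = 10.5e12     # 10.5 TB disk size

# 2.32B docs x 2k tokens/doc ~= 4.64T tokens, consistent with "around 5T"
approx_tokens = docs * tokens_per_doc

# Implied storage density: roughly 2.3 bytes per token
bytes_per_token = disk_bytes / approx_tokens

print(f"{approx_tokens:.2e} tokens, {bytes_per_token:.1f} bytes/token")
```

The ~2.3 bytes/token figure is plausible for mostly-English text with a subword tokenizer, which lends some confidence to the quoted totals.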
Papers Using MassiveText
- Since it is private, only DeepMind has published work using the dataset.
- DeepMind’s RETRO Transformer used the multilingual version of the dataset.
- DeepMind’s Gopher used only the English version of the dataset.
Don’t Confuse MassiveText with Amazon Massive Dataset
- Amazon Massive is not MassiveText.
- Amazon Massive is:
  - released by Amazon on 2022-04-20
  - a multilingual natural-language understanding dataset
  - covering 51 languages