MassiveText Dataset introduced for pre-training of DeepMind's Gopher

MassiveText is a private, diverse, 10-lingual text dataset composed of web, GitHub, news, Wikipedia, Books, and C4 subsets.
MassiveText dataset composition table

MassiveText Contents

  • Dataset language composition: 99% English, with the remainder spread across 10 other languages.
  • Content is filtered with the Google SafeSearch filter.
  • The dataset is composed of these subsets: web, GitHub, news, Wikipedia, Books, and C4 (web text).
MassiveText non-English composition
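A dataset built from several subsets is typically consumed by drawing each training document from a subset chosen at random according to a sampling weight. The sketch below only illustrates that idea. The weights and the `sample_subsets` helper are hypothetical and are not the proportions or code used for MassiveText.

```python
import random
from collections import Counter

# Hypothetical sampling weights, for illustration only.
# They are NOT the proportions used for MassiveText.
SUBSET_WEIGHTS = {
    "web": 0.4,
    "books": 0.2,
    "news": 0.1,
    "github": 0.1,
    "wikipedia": 0.1,
    "c4": 0.1,
}


def sample_subsets(weights, n, seed=0):
    """Draw n subset names, each with probability proportional to its weight."""
    rng = random.Random(seed)  # fixed seed keeps the draw reproducible
    names = list(weights)
    return rng.choices(names, weights=[weights[k] for k in names], k=n)


# Count how often each subset is picked across 10,000 draws.
draws = sample_subsets(SUBSET_WEIGHTS, 10_000)
counts = Counter(draws)
```

With enough draws, the share of each subset in `counts` approaches its weight, so changing a weight changes how often that subset is seen during training regardless of its raw size.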

Papers Using MassiveText

Don’t Confuse MassiveText with Amazon Massive Dataset

  • Amazon Massive is not MassiveText
  • Amazon Massive is:
    • released by Amazon on 2022-04-20
    • multilingual natural-language understanding
    • 51-language dataset

Created on 20 Mar 2022. Updated on 14 May 2022.
Copyright © Vaclav Kosar. All rights reserved. Not investment, financial, medical, or any other advice. No guarantee of information accuracy.