Models are trained on very large collections of text: books, articles, documentation, and other public sources. This data supplies the statistical patterns the model learns, including language structure, factual associations, and typical reasoning. When datasets are prepared, care is taken to respect licenses and privacy.
Raw text is tokenized, that is, split into units that map to numerical token IDs, and the model is trained to predict the next token given the preceding context. Higher-quality and domain-specific data improve performance for particular use cases such as finance.
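To make this concrete, the sketch below uses a toy whitespace tokenizer and a hand-built vocabulary (illustrative stand-ins, not a real subword tokenizer) to show how text becomes token IDs and how (context, next-token) training pairs are formed from them.

```python
# A minimal sketch of how training pairs are formed from tokenized text.
# The whitespace tokenizer and tiny vocabulary here are illustrative only;
# production models use subword tokenizers such as byte-pair encoding.

def build_vocab(corpus):
    """Assign an integer ID to each unique whitespace-separated word."""
    words = sorted(set(corpus.split()))
    return {word: idx for idx, word in enumerate(words)}

def tokenize(text, vocab):
    """Convert text into a list of numerical token IDs."""
    return [vocab[word] for word in text.split()]

def next_token_pairs(token_ids, context_size=3):
    """Yield (context, target) pairs: the model learns to predict
    the target token from the tokens that precede it."""
    for i in range(1, len(token_ids)):
        context = token_ids[max(0, i - context_size):i]
        yield context, token_ids[i]

corpus = "interest rates rose and bond prices fell as interest rates rose"
vocab = build_vocab(corpus)
ids = tokenize(corpus, vocab)

print("vocab:", vocab)
print("token IDs:", ids)
for context, target in next_token_pairs(ids):
    print(f"context {context} -> predict {target}")
```

In practice the vocabulary holds tens of thousands of subword units and the context window spans thousands of tokens, but the training objective is the same: given the preceding IDs, predict the next one.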