Filter Bubble Bursts: When Common Crawl Beats Clean Data
Cleaning is comforting. Every serious AI team has some version of the same ritual. Remove spam. Remove repetition. Remove bad language detection. Remove low-quality pages. Remove documents that look too weird, too short, too duplicated, too uneducational, too internet. Then hope the model learns from the respectable leftovers. That instinct is not foolish. In small or compute-constrained training runs, filtering often helps. The expensive mistake is treating that local truth as a permanent law. ...