COW (COrpora from the Web) is a collection of linguistically processed gigatoken web corpora. We have corpora in European and non-European languages (Dutch, English, French, German, Malay, Spanish, Swedish and others). The corpora currently range in size from 1 billion to 10 billion tokens. We prioritize corpus quality at every stage (data collection as well as non-linguistic and linguistic post-processing) over larger corpus sizes.
To avoid legal problems with copyright claims, the published corpora contain only single sentences in randomized order ("shuffled", similar to the Leipzig Corpora). However, we plan to release alternative shuffle versions that retain more context, intended especially for computational applications (within-document shuffle, window shuffle, etc.).
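
The shuffle variants named above are not specified in detail here, so the following Python sketch is only one possible reading of them. It assumes a document is a list of sentence strings, that "within-document shuffle" permutes sentences inside each document while keeping documents separate, and that "window shuffle" permutes sentences inside consecutive fixed-size blocks. All function names and the `window` parameter are hypothetical and are not part of any COW tool.

```python
import random
from typing import List, Optional

Document = List[str]  # assumption: a document is a list of sentence strings


def global_shuffle(docs: List[Document], seed: Optional[int] = None) -> List[str]:
    """Pool the sentences of all documents and shuffle them together."""
    rng = random.Random(seed)
    sentences = [sentence for doc in docs for sentence in doc]
    rng.shuffle(sentences)
    return sentences


def within_document_shuffle(docs: List[Document], seed: Optional[int] = None) -> List[Document]:
    """Shuffle sentences inside each document; document boundaries are kept."""
    rng = random.Random(seed)
    shuffled_docs = []
    for doc in docs:
        sentences = list(doc)  # copy so the input is left untouched
        rng.shuffle(sentences)
        shuffled_docs.append(sentences)
    return shuffled_docs


def window_shuffle(docs: List[Document], window: int = 3, seed: Optional[int] = None) -> List[Document]:
    """Shuffle sentences inside consecutive blocks of `window` sentences.

    Assumed interpretation: blocks do not overlap, and the last block of a
    document may be shorter than `window`.
    """
    if window < 1:
        raise ValueError("window must be at least 1")
    rng = random.Random(seed)
    shuffled_docs = []
    for doc in docs:
        sentences = []
        for start in range(0, len(doc), window):
            block = doc[start:start + window]  # slicing already makes a copy
            rng.shuffle(block)
            sentences.extend(block)
        shuffled_docs.append(sentences)
    return shuffled_docs


if __name__ == "__main__":
    # Toy data for illustration only, not taken from any corpus.
    docs = [["a1", "a2", "a3", "a4", "a5"], ["b1", "b2"]]
    print(global_shuffle(docs, seed=1))
    print(within_document_shuffle(docs, seed=1))
    print(window_shuffle(docs, window=2, seed=1))
```

Under this reading, the global shuffle discards all document context, the within-document shuffle keeps document membership but not sentence order, and the window shuffle additionally keeps each sentence close to its original position.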