If we do choose to use the raw data, even in truncated form, we still have many features to deal with. For example, consider a domain with 10 classes, 10 channels and an average stream length of 50 frames. ``Flattening'' each stream would then yield approximately 500 features (10 channels times 50 frames).
Quinlan [personal discussion] points out a rule of thumb: as a bare minimum for most learning tasks, there should be at least as many training instances per class as there are features. Thus we would need at least 5000 training streams for the simple example above (500 features times 10 classes). This is only a simple empirical heuristic, but it highlights the problems that a large number of features introduces.
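The arithmetic behind this estimate can be written out explicitly. The symbols below are introduced here purely for illustration and are not notation used elsewhere: $C$ is the number of classes, $K$ the number of channels, $T$ the average stream length in frames, $F$ the number of flattened features and $N_{\min}$ the minimum number of training streams suggested by the rule of thumb. This is a sketch of the heuristic, not a formal bound:
\[
F \approx K \times T = 10 \times 50 = 500,
\qquad
N_{\min} \approx C \times F = 10 \times 500 = 5000 .
\]
Under this heuristic the required amount of training data grows linearly in each of $C$, $K$ and $T$, so doubling the average stream length alone would double the estimate to about 10000 streams.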