After excluding English-language /int/ and /international/, we have 815 682 messages of which

  • For now, threads are scored by simply averaging the scores of their messages.
  • Each thread ranking page lists the top 30 threads for per-board rankings (if that many are available) and the top 60 for aggregate rankings.
  • The kielipankki corpus doesn't have proper thread IDs. The (truncated) thread name + board name has been used instead. This means that multiple threads with the same name on the same board can get merged together.
  • Messages from the kielipankki corpus are in tokenized form (e.g. commas are separated by spaces) and have lost their paragraph structure. This is partly fixable, but perhaps acceptable as is.
  • Only threads with at least 3 messages appear in these rankings.
  • The final rankings may change as the data cleaning and models are tinkered with more. If something seems like it might be wrong, it probably is wrong.
  • Ideas to implement later: might also show per-message scores; there are obviously many statistics that could be computed.
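
As a rough illustration of the scoring rules listed above, the following Python sketch averages per-message scores for each thread, drops threads with fewer than 3 messages, and keeps the top 30. The function names, the (key, score) input layout and the truncation length of 40 characters are assumptions made for this example, not the actual implementation.

```python
from collections import defaultdict


def thread_key(board, thread_name, max_len=40):
    """Build a stand-in thread identifier from board name + truncated thread name.

    max_len is a hypothetical truncation length. Threads that share a board and
    a (truncated) name end up with the same key and are therefore merged.
    """
    return (board, thread_name[:max_len])


def rank_threads(messages, min_messages=3, top_n=30):
    """Rank threads by the plain average of their message scores.

    messages: iterable of (key, score) pairs, one pair per message.
    Returns a list of (key, average_score), best first, at most top_n long.
    """
    scores_by_thread = defaultdict(list)
    for key, score in messages:
        scores_by_thread[key].append(score)

    # Only threads with at least min_messages messages are ranked.
    averages = {
        key: sum(scores) / len(scores)
        for key, scores in scores_by_thread.items()
        if len(scores) >= min_messages
    }
    return sorted(averages.items(), key=lambda item: item[1], reverse=True)[:top_n]
```

For an aggregate ranking the same function could be called with top_n=60 on the messages of all boards together.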

...according to the joint model. This may be problematic because the joint model has seen more messages from the larger corpus (kielipankki).

It may be interesting to compare this with the previous one.

In these, only messages with a minimum score of 1.9 in the target model were included; otherwise the results include messages that have a large negative score from the opposite model but a near-zero score in the target one.
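
A minimal sketch of this filter in Python, assuming each message is a dict that carries one score per model under a "scores" key. That layout and the model names "model_a" and "model_b" are placeholders for this example; only the 1.9 threshold comes from the text.

```python
def filter_by_target_score(messages, target_model, min_score=1.9):
    """Keep only messages whose score in the target model is at least min_score.

    The score from the opposite model is deliberately ignored, so a message
    cannot get in merely by scoring very low in the opposite model.
    """
    return [m for m in messages if m["scores"][target_model] >= min_score]


example = [
    {"id": 1, "scores": {"model_a": 2.4, "model_b": -0.3}},
    {"id": 2, "scores": {"model_a": 0.1, "model_b": -3.0}},  # near zero in target
    {"id": 3, "scores": {"model_a": 1.9, "model_b": 0.5}},
]
kept = filter_by_target_score(example, "model_a")
```

Message 2 is the case described above: it looks extreme only because of the opposite model, so it is dropped.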

Here the average score of each thread had to be at least 1.5.

Topic model for joint corpus, politicalness determined by model "joint"

Cutoff criterion was score > 2.9; includes info on topic distribution between corpora
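
One way such a per-corpus topic breakdown could be tabulated is sketched below in Python. It assumes that every message has already been assigned a single dominant topic and is stored as a dict with "score", "topic" and "corpus" keys; these field names, the single-topic assumption and the corpus label "other" are illustrative and not taken from the actual pipeline.

```python
from collections import Counter, defaultdict


def topic_counts_by_corpus(messages, cutoff=2.9):
    """Count messages per topic and corpus, using only messages with score > cutoff.

    Returns {topic: Counter({corpus: n_messages})}.
    """
    counts = defaultdict(Counter)
    for m in messages:
        if m["score"] > cutoff:  # strict inequality, as in the cutoff criterion
            counts[m["topic"]][m["corpus"]] += 1
    return {topic: dict(per_corpus) for topic, per_corpus in counts.items()}


example = [
    {"score": 3.2, "topic": 0, "corpus": "kielipankki"},
    {"score": 3.0, "topic": 0, "corpus": "other"},
    {"score": 2.9, "topic": 1, "corpus": "kielipankki"},  # not above the cutoff
    {"score": 4.1, "topic": 1, "corpus": "kielipankki"},
]
table = topic_counts_by_corpus(example)
```

Raw counts are returned rather than proportions, so the unequal corpus sizes remain visible to the reader.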