Greatest hits

Posted by – June 24, 2008

I had some Internetless time with a laptop over the previous couple of days. Normally time + computer = wasted time, but on this occasion the non-connectedness prompted me to do some simple text analysis on the complete text of this blog. During the time I have been writing my blog I have used:

12242 different words
32755 different word-pairs
37428 different word-triples
35174 different word-quadruples
32160 different word-quintuples
29231 different word-sextuples
26492 different word-septuples
23967 different word-octuples
21646 different word-nonuples
19522 different word-decuples
17528 different word-undecuples
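
Counts like these could be produced roughly as follows. This is a minimal sketch and not the script actually used for the numbers above. It assumes that sentences end at ".", "!" or "?", that a word is a run of letters with optional internal apostrophes, that case is kept as written, and that tuples never span a sentence boundary. The names `WORD`, `sentences` and `tuples` are made up for illustration.

```python
import re
from collections import Counter

# Assumption: a word is a run of letters, optionally joined by apostrophes
# (so "don’t" is one word). Digits and hyphens split words.
WORD = re.compile(r"[^\W\d_]+(?:['’][^\W\d_]+)*")


def sentences(text):
    """Split text into sentences on '.', '!' and '?' (a crude assumption)."""
    return [s for s in re.split(r"[.!?]+", text) if s.strip()]


def tuples(text, n):
    """Count every run of n consecutive words, one sentence at a time."""
    counts = Counter()
    for sentence in sentences(text):
        words = WORD.findall(sentence)
        # Slide a window of length n over the words of this sentence only,
        # so no tuple ever crosses a sentence boundary.
        for i in range(len(words) - n + 1):
            counts[tuple(words[i:i + n])] += 1
    return counts


sample = "Tuples only count inside sentences. So this pair is fine."
pairs = tuples(sample, 2)
print(len(tuples(sample, 1)), "different words")
print(len(pairs), "different word-pairs")
```

The number of different n-tuples is then `len(tuples(text, n))`, and `tuples(text, n).most_common(10)` lists the most frequent ones.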

Word-tuples only count inside sentence boundaries. So this paragraph contains the pair “only count” but not “boundaries so” (or didn’t until I wrote it just now). The distribution probably follows some Zipf-type law I don’t know about. It looks like this:

Some of my (apparently) favourite tuples (frequency in parentheses):

words: about (181), like (158), people (158), my (148), me (121), things (109), think (95), something (75) (top-3: the (1466), to (957), of (754))

pairs: I don’t (44), I think (43), kind of (31), people who (25), have to (21), en ole (19; Finnish for “I am not”) (top-3: in the (117), to be (113), of the (106)); honourable mention: child porn (18)

triples: I don’t know (13), I want to (10), it would be (9), in the future (8), I’m going to (7), ei ole mitään (6; Finnish for “there is nothing”), don’t want to (6)

quadruples: whether I want to (5), profits as a percentage (4), it’s okay to be (3), I don’t know why (3), going to have to (3), I think this is (3), how I love you (3), this sort of thing (3), a friend of mine (3)
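
The Zipf guess from earlier can be poked at with the numbers above. Under a Zipf-type law, frequency falls off roughly as a power of rank, so log(frequency) plotted against log(rank) is close to a straight line, with a slope near -1 in the classic case. The sketch below fits that slope by least squares. The only real data in it are the three top word counts quoted above, which is far too few points to conclude anything; `zipf_slope` is a made-up helper name.

```python
import math


def zipf_slope(frequencies):
    """Least-squares slope of log(frequency) against log(rank).

    Ranks are assigned by sorting the frequencies in descending order.
    """
    freqs = sorted(frequencies, reverse=True)
    xs = [math.log(rank) for rank in range(1, len(freqs) + 1)]
    ys = [math.log(freq) for freq in freqs]
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    numerator = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    denominator = sum((x - mean_x) ** 2 for x in xs)
    return numerator / denominator


# The three most frequent words and their counts, as quoted in the post.
top_words = {"the": 1466, "to": 957, "of": 754}
slope = zipf_slope(top_words.values())
print(f"fitted slope over {len(top_words)} points: {slope:.2f}")
```

Fed the full word counts rather than just the top three, the same function would give a much more meaningful estimate.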

I should design a way to generate poetry out of this stuff.
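
One possible way to do it, offered as a sketch rather than a finished design: treat the word-triples as a Markov chain, where each pair of words points to the words that have followed it, and walk the chain until it runs out. The function names and the 8-word line limit are arbitrary choices for illustration; the example phrases are taken from the lists above.

```python
import random
from collections import defaultdict


def build_chain(phrases):
    """Map each pair of consecutive words to the words seen right after it."""
    chain = defaultdict(list)
    for phrase in phrases:
        words = phrase.split()
        for a, b, c in zip(words, words[1:], words[2:]):
            chain[(a, b)].append(c)
    return chain


def poem_line(chain, start, max_words=8, rng=random):
    """Extend a two-word start by repeatedly picking a recorded follower."""
    words = list(start)
    while len(words) < max_words:
        followers = chain.get(tuple(words[-2:]))
        if not followers:
            break  # nothing ever followed this pair, so the line ends here
        words.append(rng.choice(followers))
    return " ".join(words)


# Two of the frequent phrases listed above; a real run would use every
# sentence of the blog, giving most pairs several possible followers.
chain = build_chain(["I’m going to", "going to have to"])
print(poem_line(chain, ("I’m", "going")))
```

With only two phrases every pair has a single follower, so the result is fixed. Frequent followers are picked more often because they appear more often in each list, which keeps the output sounding like the source.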
