I.S.P. Nation (the P is for Paul), a New Zealand SLA researcher specializing in vocabulary acquisition, gives the following interesting and useful figures and facts:
Reading newspapers requires a vocabulary of 8 thousand words.
Reading novels requires a vocabulary of 9 thousand words.
The first thousand most frequently occurring words get you to 75% of a normal text; the second thousand gets you 8.5% more, the third thousand 3% more, the fourth thousand 2.5% more, and the fifth thousand 1.3% more.
Five thousand words plus proper nouns get you 96% coverage, while 9 thousand + proper nouns get you 98% coverage. Interesting, huh?
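To see how those per-thousand figures stack up, here is a quick sketch that just adds them. The names (band_coverage, cumulative_coverage, gap_to_96) are my own, and comparing the running total to the 96% figure assumes both sets of numbers come from the same count, which I can't confirm.

```python
# Coverage added by each successive band of 1,000 words, in percent
# (the figures quoted above for the 1st through 5th thousand).
band_coverage = [75.0, 8.5, 3.0, 2.5, 1.3]


def cumulative_coverage(bands):
    """Return the running coverage total after each 1,000-word band."""
    totals = []
    running = 0.0
    for share in bands:
        running += share
        totals.append(round(running, 1))
    return totals


totals = cumulative_coverage(band_coverage)
for k, total in enumerate(totals, start=1):
    print(f"first {k},000 words: {total}%")

# Assumption: the 96% figure (5,000 words plus proper nouns) is comparable
# to the band figures, so the difference is roughly what proper nouns add.
gap_to_96 = round(96.0 - totals[-1], 1)
print(f"gap between the five bands and 96%: {gap_to_96} points")
```

If the two sets of figures really are comparable, the five bands alone come to about 90%, and the remaining few points would be what the proper nouns contribute.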
With this vocabulary, there are three kinds of word combinations, and only one of those is impossible to figure out from its component parts. Interestingly, he finds idioms not entirely opaque but does consider what most of us call expressions as impossible to figure out, an example being “as well as”. Believe it or not, there are only about 100 of those in English.
Some idioms are just not that hard to get, so he calls the “as well as” type “core idioms”. That’s just his own special term. It makes sense to me, because a Spanish idiom like pulling one’s hair or English pulling one’s leg can be seen in a figurative way, but “as well as” just cannot be deciphered that way. I don’t mean to suggest that you don’t need help in learning what an idiom means, but once you are told, you can “kind of see it”. Personally, the “as well as” core idioms are exactly the ones I have the most trouble with in Russian.
Oct. 27 addendum. The Nation figures mention proper nouns. I just did a count of the words glossed for a chapter I’m reading. The total number of words glossed was 136. About one third of them, around 46, were proper nouns. In the language I’m working in, it is especially important to know you are dealing with a name, a proper noun, because the script is difficult to read (the Arabic script of Urdu called nastaliq), and it doesn’t mark proper nouns in any way, e.g. by capitalization.