Introducing the Leiden Weibo Corpus

My inner linguistics nerd is going crazy right now. A post published today by Daan, a member of Sinoglot, reveals that he has created a large Weibo corpus. For those not familiar with the term, a corpus is a large body or collection of text, often used in linguistic research to study trends, word frequency, discourse patterns and other interesting data.

The LWC is an annotated 100-million-word linguistic corpus containing 5.1 million messages from Sina Weibo, China’s Twitter-like microblogging service. It’s freely available online at http://lwc.daanvanesch.nl/.

I’m a sucker for context lately. Whenever I want to see how a word is used, I go straight to Jukuu.com instead of trying to find the boundaries of its meaning with a dictionary. This tool by Daan will be an awesome new addition to my arsenal for contextual information, especially for colloquial online writing. There are some useful stat pages as well, like a frequency list. It will be interesting to see how this differs from other frequency lists. I already see a word, 分享, popping in at number 50, which you won’t find that high up in other corpora, like Jun Da’s fiction corpus. 分享 means "to share." You’ll see it everywhere on Weibo and other social networking websites. It’s almost like a retweet on Twitter.
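If you want to try that kind of comparison yourself, here is a minimal Python sketch of one way to do it. It assumes you already have two frequency lists as plain lists of words, most frequent first. The function names and the tiny word lists are made up for illustration; they are not real data from the LWC or from Jun Da’s lists.

```python
def rank_map(words):
    """Map each word to its 1-based rank in a frequency-ordered list."""
    return {word: rank for rank, word in enumerate(words, start=1)}


def rank_shifts(list_a, list_b):
    """Compare where each word in list_a sits in list_b.

    Returns (word, rank_in_a, rank_in_b) tuples, sorted so that words ranking
    much higher in list_a than in list_b come first. rank_in_b is None when
    the word does not appear in list_b at all.
    """
    ranks_a = rank_map(list_a)
    ranks_b = rank_map(list_b)
    # Treat a missing word as if it ranked just past the end of list_b.
    missing = len(list_b) + 1
    rows = [(word, rank_a, ranks_b.get(word)) for word, rank_a in ranks_a.items()]
    rows.sort(
        key=lambda row: (row[2] if row[2] is not None else missing) - row[1],
        reverse=True,
    )
    return rows


# Hypothetical toy lists, NOT real corpus data.
weibo_like = ["的", "我", "分享", "了"]
fiction_like = ["的", "了", "我", "他", "分享"]

for word, rank_a, rank_b in rank_shifts(weibo_like, fiction_like):
    print(word, rank_a, rank_b)
```

With real lists loaded from files, the words at the top of the output would be the ones that are noticeably more prominent in the first corpus than in the second.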

Have a look around and start exploring. I’ll bet you’ll find something interesting.