Much of the hype around big data originates from the proliferation of Web data from social networks, discussion boards, email and other messaging-heavy methods of freeform communication.

When one seeks to capture and utilize data from Facebook or Twitter feeds, it’s easy to bite off more than one can chew. With Facebook, there’s at least a certain degree of data field formatting necessitated by the site’s layout. Twitter, on the other hand, with its streaming 140-character messages and no inherent organizational theme, is a much more difficult area to mine. While many Twitter users employ hashtags and “@” symbols to define the content or purpose of their tweets, navigating the barrage of nested replies and retweets in the site’s current design is downright painful.

Before you can extract data, you need to know what you’re looking to capture in the first place. A lot of tweets are fluff, so separating the worthy from the useless is an important skill to master. Separating the Twitter wheat from the substantial chaff is a four-step process:

  1. Define the attributes of your brand and purpose of your research;
  2. Consider time of tweets as a relevance factor;
  3. Track the right people;
  4. Don’t give too much weight to retweets.

First, define the attributes of your brand and the purpose of your research. Are you an online hotel-comparison site looking for endorsements or pings of certain hotels? Are you a baker examining trending hashtags for unsolicited product suggestions such as “#iwantchocolatechipcookies”? Or are you a technology vendor looking for positive and negative mentions of your product? Twitter already has a powerful reputation for bringing companies with poor customer service practices to their knees, giving a small guppy the voice of a marlin through the magic of retweeting and other people’s influence.
If your goals are predictive or aspirational, however, what you track is a little more nebulous. Let’s say you’re a mobile application developer for Google Android-powered smartphones. You’re on the lookout for the types of services and conveniences people crave so you can capitalize on that demand. Start by establishing your imperative: “I want to know what young working professionals like to do after work.”

Next, use the factor of time to narrow your focus. It’s pretty unlikely that people will tweet their after-work plans at 8 a.m. on a weekday. At 4 p.m., though, it’s a possibility. The popularity of social check-in services makes it easy to tell where some users are flocking (mostly restaurants and bars), and it can also reveal some interesting exceptions (“I’m at so-and-so bowling alley with my entire staff”). If you notice a trend in these exceptions, especially among people who have a large audience of followers, you can identify a new fad before it fizzles. Conversely, if your company makes online security freeware and your customers consistently tweet complaints that their computers are slow at 3 p.m., maybe tweak your product’s automatic update settings.
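As a rough illustration of the time filter described above, the sketch below keeps only tweets posted during a chosen weekday window. The record layout (a dict with `text` and `posted_at` keys), the function name and the sample tweets are all hypothetical; real Twitter data would need its timestamps converted to each user's local time first.

```python
from datetime import datetime


def filter_by_local_hour(tweets, start_hour, end_hour, weekdays_only=True):
    """Keep tweets posted in the half-open window [start_hour, end_hour).

    Assumption: each tweet is a dict whose "posted_at" value is a datetime
    already expressed in the poster's local time.
    """
    kept = []
    for tweet in tweets:
        posted = tweet["posted_at"]
        # weekday() returns 0-4 for Monday-Friday, 5-6 for the weekend.
        if weekdays_only and posted.weekday() >= 5:
            continue
        if start_hour <= posted.hour < end_hour:
            kept.append(tweet)
    return kept


# Hypothetical sample data, invented for this example.
sample = [
    {"text": "Coffee first.", "posted_at": datetime(2012, 3, 5, 8, 0)},
    {"text": "Bowling with the whole staff tonight", "posted_at": datetime(2012, 3, 5, 16, 10)},
    {"text": "Brunch!", "posted_at": datetime(2012, 3, 10, 16, 30)},
]

# Late-afternoon weekday window for the "after work" question.
after_work = filter_by_local_hour(sample, start_hour=16, end_hour=20)
```

Here only the bowling tweet survives: the first was posted on a weekday morning and the third on a Saturday.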

Once you know the type of data you’re pursuing, track the right people. “Real-time” messaging services that are especially widespread are bound to attract scammers and spammers. On a case-by-case basis it’s easy to weed out these accounts (e.g., by checking their feeds and seeing the same link for “meet cute girls here!” posted in a thousand different ways). On a macro scale, however, looking through every search result’s account is impractical. Spam accounts tend to have certain commonalities in their handles that are usually grounds for ignoring or deleting them. A first name or first name/last name followed by a series of numbers or letters is easy to create within seconds and unlikely to be used by someone else, so this handle format is ideal for spammers. Additionally, a profile that lacks an identifying descriptor or tagline, a picture, or both should raise red flags. You’ll know it when you see it – there’s an intuitive difference between @samsmith and @rebeccak23423972983. If you automatically ignore these types of profiles you run the risk of discounting some legitimate users (with an unfortunate naming sense), but that risk may be worth it when you’re scanning thousands of tweets.
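The handle and profile checks above can be approximated with a simple score. This is only a sketch under stated assumptions: the five-digit cutoff, the threshold of two red flags and the function names are invented for illustration, and it tests only for trailing digits, not the trailing letters the paragraph also mentions.

```python
import re

# Assumption: a handle ending in five or more digits looks machine-generated.
LONG_DIGIT_TAIL = re.compile(r"\d{5,}$")


def spam_score(handle, has_bio, has_picture):
    """Count red flags: numeric tail on the handle, no bio, no picture."""
    score = 0
    if LONG_DIGIT_TAIL.search(handle.lstrip("@")):
        score += 1
    if not has_bio:
        score += 1
    if not has_picture:
        score += 1
    return score


def is_probable_spam(handle, has_bio, has_picture, threshold=2):
    """Flag a profile once it collects `threshold` or more red flags."""
    return spam_score(handle, has_bio, has_picture) >= threshold


suspicious = is_probable_spam("@rebeccak23423972983", has_bio=False, has_picture=False)
legitimate = is_probable_spam("@samsmith", has_bio=True, has_picture=True)
```

Requiring two flags instead of one is a deliberate choice: a legitimate user with an unfortunate handle but a filled-in profile scores only one and is kept.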

Finally, take heavily retweeted but irrelevant tweets with a grain of salt. If you come across a tweet by an influential user along the lines of “Just tried out the new #WindowsPhone and it rocks!” and it has been retweeted 30 times, that’s a trend worthy of consideration. If the tweet is “hahahaha whats goin on playaz” and it has been retweeted a dozen times, well, zero times 12 is still zero. Ultimately it’s the content you’re pursuing – not sheer personality – that will turn unstructured fluff into valuable business insight.
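The “zero times 12 is still zero” idea can be written as a multiplication of relevance by reach. The sketch below is one hypothetical way to do that: relevance is a naive count of tracked terms found in the text, and the reach factor counts the original post plus its retweets. Both choices, along with the function names, are assumptions, not an established scoring method.

```python
def relevance(text, tracked_terms):
    """Naive relevance: how many tracked terms appear in the tweet text."""
    lowered = text.lower()
    return sum(1 for term in tracked_terms if term.lower() in lowered)


def weighted_score(text, retweets, tracked_terms):
    """Multiply relevance by reach, so irrelevant tweets always score zero.

    Assumption: reach is the original post plus its retweets.
    """
    return relevance(text, tracked_terms) * (1 + retweets)


terms = ["#windowsphone"]  # hypothetical term list for a technology vendor
praise = weighted_score("Just tried out the new #WindowsPhone and it rocks!", 30, terms)
fluff = weighted_score("hahahaha whats goin on playaz", 12, terms)
```

Because the retweet count is only a multiplier, no amount of retweeting can lift a tweet that matches none of the tracked terms.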

Despite the challenges Twitter presents, a pragmatic, repeatable process lets organizations use it as a critical source, adding breadth and validity to an enterprise’s big data program. Your brand’s perception and awareness can be damaged or evangelized freely on Twitter, which makes it a platform that any modern-era company must acknowledge. Automated tools with text analysis capabilities make this entire process of subjective determination significantly easier, but even if you decide to aggregate Twitter data manually, there’s no reason why it should become all-consuming. There are better ways for you to spend your time than trying to make sense of potentially random social networking messages.
