When the World Series approaches, I usually find a few hours during the week to watch a game or two on TV. On one such Sunday not long ago, I found myself caught up in the dialog between the two sports announcers. It went something like this:

"Smith is hitting 296 against Rivera (translation: In the times that Smith has been at bat against Rivera this season, he has gotten a hit on 29.6 percent of those occasions 1) ... but, with men in scoring position, he’s got Rivera’s number  he’s hitting 383, lifetime" (translation: For his entire professional baseball career  when Smith has batted against Rivera, in those situations when there were runners on second, third or both, he got a hit 38.3 percent of the time).

I began to envision a data warehouse with dimensions, fact tables and a query tool providing the program’s statistician with instant access to the data. The announcers continued:

"Smith already has a run-batted-in (RBI) in the third inning. Not since Bill Weaver did it in 1956 did someone have two or more RBIs in 12 consecutive games" (translation: If Smith gets a hit and a runner scores, he will be credited with his second RBI  the 12th consecutive game that he has had two or more RBIs. The last person to do that in major league baseball was Bill Weaver in 1956). "Rivera is tough on right-handers, however, especially here in Yankee Stadium. He’s got a lifetime 2.33 ERA in this park facing right-handed batters." (translation: During his professional career, when Rivera has pitched against right-handed batters in Yankee stadium, teams have scored runs against him at an average of 2.33 earned runs per nine-inning game.)

No one doubts these numbers for an instant.

I concluded last month’s column by stating that we were going to examine an organization that, for the better part of the 20th century, maintained perfect data quality. You would have to look far and wide to find an organization that manages its data as well as Major League Baseball (MLB) has done since it began keeping records in the 1880s. Why is that? Why would an organization filled predominantly with men (let’s not forget the women’s teams of the 1940s) playing a game with a bat, ball and four bases go to such lengths to manage its data? What can we learn from MLB that can be carried into the corporate world of data quality management?

#### "Baseball is 90 percent mental. The other half is physical." [2]

I had some interesting conversations with a few of my colleagues as we hypothesized about why and how data is maintained so precisely in baseball. Allow me to give you my conclusions. First, from the game’s inception, strategy has played an important role in how it is played. Managers felt that the one who kept the best statistics about his players and those on the other teams had a competitive advantage. Second, baseball has always maintained a good rule book on how to interpret events that take place during the game. Third, baseball games are played with an official scorer to interpret the rules, a type of data steward. Finally, baseball has made use of the latest technology over the years to analyze and explore its data. Let’s take a closer look at these characteristics.

#### Good Managers "Play the Percentages"

With a few exceptions, the image most people get when they think of professional baseball players and managers is one of a guy who chews tobacco [3], spits and scratches himself with no regard for who is watching. Don’t be fooled, though. These tough guys, particularly the managers, are really quick-witted strategists who come prepared with key statistics that will influence their decision making in the heat of battle.

At the major league level, baseball is really a game of strategy: pitcher versus hitter and manager versus manager. Hitters study the tendencies of the pitchers, knowing what pitches they throw and when they throw them. Conversely, professional pitchers know the hitters: what pitches they like to hit and where in the strike zone they are most likely to hit well. Managers, too, know the strengths and weaknesses of the pitchers and hitters on their own and the opposing team in an effort to give their team an edge. This data helps a manager decide who to play at key points of the game and how to best position them offensively and defensively. Ultimately, this intelligence is derived from the granular data that is collected and stored from each pitch that is thrown during the season.
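To make the idea of "granular data" concrete, here is a minimal Python sketch of how a situational statistic like the ones the announcers quoted could be derived from individual at-bat records. Everything in it is an assumption made for illustration: the `AtBat` record layout, the helper names and the made-up rows (including a hypothetical second pitcher, "Jones") are not MLB's actual data model. The two formulas are the standard ones: batting average is hits divided by at-bats, and ERA is earned runs per nine innings pitched.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, Optional


@dataclass(frozen=True)
class AtBat:
    """One row of a hypothetical at-bat fact table."""
    batter: str
    pitcher: str
    risp: bool  # runners in scoring position (second, third or both)
    hit: bool   # True if the at-bat ended in a hit


def batting_average(rows: Iterable[AtBat],
                    keep: Callable[[AtBat], bool]) -> Optional[float]:
    """Hits divided by at-bats for the rows selected by `keep`."""
    selected = [r for r in rows if keep(r)]
    if not selected:
        return None  # no at-bats in this situation, so no average
    return sum(r.hit for r in selected) / len(selected)


def earned_run_average(earned_runs: int, innings_pitched: float) -> float:
    """Earned runs allowed per nine innings pitched."""
    return 9 * earned_runs / innings_pitched


# Made-up records, for illustration only.
AT_BATS = [
    AtBat("Smith", "Rivera", risp=True, hit=True),
    AtBat("Smith", "Rivera", risp=True, hit=False),
    AtBat("Smith", "Rivera", risp=False, hit=False),
    AtBat("Smith", "Rivera", risp=False, hit=False),
    AtBat("Smith", "Jones", risp=True, hit=True),
]

# Smith against Rivera, all situations.
overall = batting_average(
    AT_BATS, lambda r: r.batter == "Smith" and r.pitcher == "Rivera")

# Smith against Rivera with men in scoring position.
with_risp = batting_average(
    AT_BATS,
    lambda r: r.batter == "Smith" and r.pitcher == "Rivera" and r.risp)

print(f"overall: {overall:.3f}, with RISP: {with_risp:.3f}")
```

The point of the sketch is that each split is just a different filter over the same detailed records, which is why the quality of every individual record matters so much.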

#### MLB Provides an Exhaustive Repository of (Business) Rules

During the 1999 National League playoffs, with a tie game and a man on third base, Robin Ventura of the New York Mets hit a ball over the fence to end the game, giving the Mets a dramatic victory. Ventura rounded first base where he was engulfed by his teammates, the bases were taken in and an interesting conversation immediately ensued. In the baseball archives, would Ventura be given credit for hitting a home run? The answer, as Mets fans know, is that Ventura received a single because, as MLB rules point out, he never touched all the bases, including home plate. A technicality, maybe, but baseball closely follows an extensive book of rules and it is unwavering when asked to deviate.

This adherence to the rules is part of the culture of baseball. Fans, players and officials confidently know that these standards are universally practiced. As a result, the events of each game are interpreted and documented in a consistent manner from game to game, from team to team and from year to year. Additionally, the rules of organized baseball at all levels dictate that there be a rule enforcer and a scribe. To that end, all games are played with an umpire who ensures that the rules are strictly followed and, unbeknownst to many, there is an official scorekeeper who interprets and records each game’s activities.

#### The Official Scorekeeper: Baseball’s Data Steward

In the previous example, it was the official scorekeeper assigned to the game who made the ruling that the hit was, in fact, a single and not a home run. According to MLB’s official rule book:

"The league president shall appoint an official scorer for each league championship game. The official scorer shall observe the game from a position in the press box. The scorer shall have sole authority to make all decisions involving judgment, such as whether a batter's advance to first base is the result of a hit or an error. He shall communicate such decisions to the press box and broadcasting booths... After each game the scorer shall prepare a report, on a form prescribed by the league president, listing... the full score of the game, and all records of individual players compiled according to the system specified in these Official Scoring Rules." 4

Here, I believe, lies the most important factor explaining baseball’s superior data quality. In short, the official scorer is baseball’s data steward. It is the scorer’s responsibility to enter the data into baseball’s archives, and that data is closely scrutinized by the masses. The data the scorer enters must be complete, and it must be accurate. Subjective as they are, the scorer’s decisions will be assessed by thousands or millions of fans (not to mention the players and coaches) and heard about through newspapers, radio and TV. In fact, the scorer’s job performance will be evaluated based on the quality of the data he or she provides.

#### MLB Maintains an Effective Data Repository

The analysis of baseball data starts at an early age. Just ask little leaguers about their batting average. The demand for good data grows proportionately as you approach the professional level. As a result, MLB has a history of exploiting the latest in techniques and technology to obtain, analyze and distribute data. For example:

• MLB exploits the latest technology to create data. MLB was among the first users of the radar gun to measure the speed of a pitch.
• MLB has been using statisticians to compile and analyze data since the 1940s and 1950s.
• MLB has had to integrate data from multiple sources over the years in an effort to create a comprehensive historical data repository. During the 20th century, it had to merge and compile data from several leagues, most notably the National and American leagues.

#### "This is like deja vu all over again." [5]

At the Information Quality 2000 Conference, held in Anaheim last month, it was clear that the corporate need for information quality continues to evolve. It is not just ERP and business intelligence any longer. Much of the focus is on e-business and the opportunities and risks the Internet provides. Data quality issues are no longer contained by the walls of the corporation. Business-to-business (B2B) communications will now expose our data flaws to our business partners, jeopardizing the efficiencies we seek to build into our supply chain.

Improved data quality can be nurtured on an enterprise level or on a local level via tactical projects. Ultimately, however, data must be entered correctly at its source by individuals who take responsibility for its quality. MLB provides a model of an organization that has maintained a culture promoting good data quality. It has users who require good data in order to define a strategy that will provide their team with a competitive advantage (or, at least, minimize its risks). It has well-defined business rules that are interpreted at each game by individuals who are accountable for the creation of good data. Finally, efforts are continually made to make its data accessible for operational and strategic use by a vast number of interested parties.

Many sports journalists refer to these baseball facts and figures as baseball "history," not as baseball trivia. Surely, America’s pastime would not be the game we know if the quality of its data were anything less.

Footnotes:
1. The players and dialog used are for illustrative purposes. While the dialog was based on an actual broadcast, authentic statistics were not available.
2. Yogi Berra, former MLB player and manager
3. MLB rules now prohibit the public use of chewing tobacco.
4. Article 10.01 of the rule book of the MLB.
5. Yogi Berra