This paper explores the challenge of scaling up language processing algorithms to increasingly large datasets. While cluster computing has been available in industrial environments for several years, academic researchers have fallen behind in their ability to work on large datasets. We discuss two challenges contributing to this problem: lack of a suitable programming model for managing concurrency and difficulty in obtaining access to hardware. Hadoop, an open-source implementation of Google's MapReduce framework, provides a compelling solution to both issues. Its simple programming model hides system-level details from the developer, and its ability to run on commodity hardware puts cluster computing within reach of many academic research groups. This paper illustrates these points with a case study on building word co-occurrence matrices from large corpora. We conclude with an analysis of an alternative computing model based on renting instead of buying computer clusters.
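
To make the programming model concrete, the following is a minimal sketch of the "pairs" formulation of co-occurrence counting, written against Hadoop's org.apache.hadoop.mapreduce API. The window size, tokenization, and class names here are illustrative assumptions, not the paper's actual implementation: the mapper emits one ((term, neighbor), 1) record per observed co-occurrence, and the reducer sums these partial counts into matrix cells.

    import java.io.IOException;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.Reducer;

    // Mapper: for each term, emit ((term, neighbor), 1) for every other
    // term that falls within a fixed-size window around it.
    public class PairsMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
      private static final int WINDOW = 2;              // assumed window size
      private static final IntWritable ONE = new IntWritable(1);
      private final Text pair = new Text();

      @Override
      protected void map(LongWritable key, Text value, Context context)
          throws IOException, InterruptedException {
        // Assumed tokenization: split each input line on whitespace.
        String[] terms = value.toString().trim().split("\\s+");
        for (int i = 0; i < terms.length; i++) {
          if (terms[i].isEmpty()) continue;
          for (int j = Math.max(0, i - WINDOW);
               j <= Math.min(terms.length - 1, i + WINDOW); j++) {
            if (j == i) continue;
            pair.set(terms[i] + "," + terms[j]);
            context.write(pair, ONE);                   // one observed co-occurrence
          }
        }
      }
    }

    // Reducer: sum the partial counts for each pair; each output record is
    // one cell of the co-occurrence matrix, keyed by (term, neighbor).
    class PairsReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
      @Override
      protected void reduce(Text key, Iterable<IntWritable> values, Context context)
          throws IOException, InterruptedException {
        int sum = 0;
        for (IntWritable v : values) sum += v.get();
        context.write(key, new IntWritable(sum));
      }
    }

In practice, a combiner (often the reducer class itself, registered via job.setCombinerClass) would aggregate partial counts map-side before the shuffle, reducing the amount of intermediate data moved across the network.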