A guide to the history, uses, and opportunities of natural language processing


Natural language processing is one of the fastest-growing fields in tech right now. Tech giants such as Google, Amazon and Facebook invest in NLP solutions to build tools that assist users in many ways: from virtual assistants that rely on advanced speech recognition to chatbots able to deliver customer service 24/7 by processing user queries.

Artificial intelligence (AI) researchers from institutions such as OpenAI and IBM's Thomas J. Watson Research Center are busy building increasingly sophisticated models based on machine learning. OpenAI recently published parts of GPT-2, a powerful text-prediction model that surpassed anything seen before it, producing startlingly coherent text samples from a wide range of prompts.

But what does natural language processing stand for? What's the role of NLP in AI research? And how can you become part of this rapidly evolving data science field?

This article offers a solid introduction to natural language processing, written for software developers looking to enter a field with a bright future ahead of it.

1. What is Natural Language Processing (NLP)?

Natural language processing is a field of artificial intelligence (AI) that focuses on machine understanding and processing of natural languages used by humans. The ultimate goal of NLP is to read, understand and make sense of human languages to deliver value. Dating back to the 1950s (a great time for the development of AI in general), NLP research played a crucial role in topic modeling, document indexing, translation, as well as information retrieval/extraction.

Following the deep learning breakthroughs around 2012, NLP development gained new steam. Machine learning allowed engineers to build models that were no longer purely knowledge-based (following a set of hard-coded rules) but instead embraced the power of pattern recognition through training on datasets.

Today, the methods of natural language processing extract valuable insights from data, power recommendation engines in online stores, filter spam out of our inboxes, and translate our texts. Did you know that Google Translate switched to a neural machine translation engine in 2016?

Since the field is so popular today, it's surrounded by a rich ecosystem of tools and technologies. And that's great news for anyone who wants to get started.

2. What is NLP used for?

Natural language processing finds a wide range of common applications today, standing behind some of the most popular digital products.

Here are a few examples of what Natural Language Processing is used for today:

  • Word processing applications like Microsoft Word or Grammarly that use NLP to check the grammatical accuracy of texts and offer style suggestions.
  • Language translation apps like Google Translate.
  • Dictation software like Dragon NaturallySpeaking that relies on speech recognition.
  • Interactive Voice Response (IVR) apps used by call centers for responding to specific user requests.
  • Virtual assistants like Google Assistant, Amazon's Alexa, Apple's Siri, and Microsoft's Cortana.

3. How does Natural Language Processing work?

NLP engineers apply algorithms that identify and extract the rules of human language, converting unstructured language data into a form that machines can understand. For example, once we provide the computer with a text, it uses these algorithms to extract the meaning of every sentence and then collect the relevant data from that analysis.

How does NLP work in practice? Here's a speech recognition scenario we all know from interactions with virtual assistants (a minimal code sketch follows the list):

  • A human talks to the machine
  • The machine captures the audio content
  • It converts audio to text
  • Then text data processing takes place
  • The machine converts data into audio
  • And then it responds to the human by playing the audio file
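
Here's what that loop might look like in Python. This is a minimal sketch, assuming the third-party SpeechRecognition and gTTS packages (and the free Google web APIs behind them); the process_query function is a hypothetical placeholder for the text-processing step, which in a real assistant would involve intent detection and response generation.

```python
# Sketch of the assistant loop: capture audio -> text -> processing -> audio response.
# Assumes: pip install SpeechRecognition gTTS (plus PyAudio for microphone access).
import speech_recognition as sr
from gtts import gTTS

def process_query(text: str) -> str:
    # Hypothetical placeholder for the "text data processing" step.
    return f"You said: {text}"

recognizer = sr.Recognizer()
with sr.Microphone() as source:              # steps 1-2: human talks, machine captures audio
    audio = recognizer.listen(source)

text = recognizer.recognize_google(audio)    # step 3: audio to text (Google Web Speech API)
response = process_query(text)               # step 4: text data processing
gTTS(response).save("response.mp3")          # step 5: data back to audio
# Step 6: respond to the human by playing response.mp3 with any audio player.
```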

As expected, machines sometimes fail to understand the meaning of a sentence, leading to some really surprising results. In fact, the early attempts at machine translation of human language made the field the subject of jokes like this one:

The CIA invested millions of dollars into developing a translation tool that would enable machine translation between English and Russian. The computer translated the biblical sentence, “The spirit is willing, but the flesh is weak,” into Russian and then back to English, coming up with: “The vodka is good, but the meat is rotten.”

The story is pure myth, but it shows how the unrealistic expectations for NLP collided with reality during the early years of research.

4. Why is Natural Language Processing difficult?

Natural language processing aims to solve one of the most difficult problems in computer science: making machines understand the natural language used by humans. And it's precisely the nature of the languages we use that makes NLP such a challenging and fascinating field.

Here's the gist of the problem:

Sometimes machines find the rules governing natural languages hard to apply because they're too abstract or high-level. For example, how can we encode the sarcasm or irony that humans use so often when conveying information?

Natural language understanding is a complex problem. Understanding a human language requires understanding not only the individual words but also how the concepts are connected in a broader context to deliver the message intended by the sender.

For humans, mastering a language is easy. We all acquire languages seamlessly when we're kids. We somehow learn to navigate the ambiguity and imprecise characteristics of natural language. As it turns out, machines have a much harder time dealing with that.

5. What are the techniques used in NLP?

The main techniques engineers use to complete tasks in this field are syntactic and semantic analysis.

Syntactic analysis - Syntax is the arrangement of words and phrases so that they create a well-formed, grammatically correct sentence. In NLP, engineers use syntactic analysis to assess how natural language aligns with grammatical rules. For example, machines apply such rules to a group of words and then derive meaning from them.

Syntactic techniques used in NLP (a short NLTK sketch follows this list):

  • Lemmatization – reducing various inflected forms of a word into one form to make analysis easier.
  • Morphological segmentation – this technique divides words into smaller individual units called morphemes (meaningful units that can't be divided any further – for English, that would be “in,” “come,” and “ing” for “incoming”).
  • Parsing – this technique is based on grammatical analysis of a sentence.
  • Sentence breaking – also known as sentence boundary detection, this technique involves placing sentence boundaries in a longer text.
  • Stemming – this is where we cut inflected words down to their root form or word stem (for example, the words “consulting” and “consultant” can both be reduced to “consult”).
  • Text segmentation – this technique involves dividing a piece of continuous text into distinct units such as topics, sentences, and words.
  • Part-of-speech tagging – identifying the part of speech for every word.
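
Here's a minimal sketch of a few of these techniques using NLTK. It assumes the relevant NLTK models and corpora have been downloaded; the exact resource names can vary between NLTK versions.

```python
# Sentence breaking, tokenization, POS tagging, stemming, and lemmatization with NLTK.
# Assumes: pip install nltk, plus the downloads below (names may vary by NLTK version).
import nltk
from nltk.stem import PorterStemmer, WordNetLemmatizer

nltk.download("punkt")
nltk.download("wordnet")
nltk.download("averaged_perceptron_tagger")

text = "The consultants were consulting. The clients listened carefully."

sentences = nltk.sent_tokenize(text)        # sentence breaking
tokens = nltk.word_tokenize(sentences[0])   # text segmentation into words

print(nltk.pos_tag(tokens))                 # part-of-speech tagging

stemmer = PorterStemmer()
lemmatizer = WordNetLemmatizer()
print(stemmer.stem("consulting"))           # stemming -> 'consult'
print(lemmatizer.lemmatize("consultants"))  # lemmatization -> 'consultant'
```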

Semantic analysis - Semantics is the area of linguistics concerned with the meaning conveyed by a text. Semantic analysis is one of the most challenging aspects of NLP. The idea is to apply algorithms that grasp the meaning of words, the relations between them, and how entire sentences are structured.

Semantic analysis techniques used in NLP (a short spaCy sketch follows this list):

  • Named entity recognition (NER) – identifying and classifying entity mentions in unstructured text (for example, names of people or places).
  • Word sense disambiguation – this technique focuses on giving meanings to words based on the context.
  • Natural language generation – transforming structured data into natural language to produce long-form content like reports or custom content for applications.
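
For instance, here's a minimal named entity recognition sketch using spaCy (introduced in more detail below). It assumes the small English model has been installed beforehand.

```python
# Minimal named entity recognition (NER) with spaCy.
# Assumes: pip install spacy && python -m spacy download en_core_web_sm
import spacy

nlp = spacy.load("en_core_web_sm")
doc = nlp("Apple is looking at buying a U.K. startup for $1 billion.")

for ent in doc.ents:
    print(ent.text, ent.label_)
# Expected output along the lines of:
#   Apple       ORG
#   U.K.        GPE
#   $1 billion  MONEY
```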

6. Natural Language Processing in Python

Python is one of the most widespread languages for NLP projects thanks to its transparent syntax and semantics. The language offers outstanding support for integration with other tools and languages, as well as access to an extensive ecosystem of NLP tools and libraries for tasks such as sentiment analysis, document classification, part-of-speech (POS) tagging, and topic modeling.

Here are six Python libraries that make this language such a great tool for NLP and data science projects.

1. Natural Language Toolkit (NLTK)

This library supports fundamental tasks like classification, tokenization, stemming, tagging, parsing, and semantic reasoning. It's an educational foundation for Python developers interested in giving NLP a go. Developed by two researchers from the University of Pennsylvania, NLTK plays a key role in breakthrough NLP research all over the world. Note that NLTK is difficult to use and rather slow – it's not a great match for the demands of production usage.

2. TextBlob

Great for those starting their journey with NLP in Python. The tool provides beginners with an easy interface for learning basic NLP tasks like POS tagging, sentiment analysis, or noun phrase extraction. It's great for designing prototypes. But it also inherited the main flaws of NLTK and may be too slow to match the requirements of NLP production usage.
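
Here's a quick taste of that interface; a minimal sketch assuming TextBlob and its corpora are installed.

```python
# Beginner-friendly NLP tasks with TextBlob.
# Assumes: pip install textblob && python -m textblob.download_corpora
from textblob import TextBlob

blob = TextBlob("TextBlob makes basic NLP tasks surprisingly pleasant.")

print(blob.tags)          # POS tagging: [('TextBlob', 'NNP'), ...]
print(blob.noun_phrases)  # noun phrase extraction
print(blob.sentiment)     # Sentiment(polarity=..., subjectivity=...)
```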

3. CoreNLP

This library was developed at Stanford University and includes wrappers for many different languages, including Python. It's a great fit for fast-paced product development environments. Some of the CoreNLP components can be integrated with NLTK for an efficiency boost.

4. Gensim

This handy library identifies the semantic similarity between documents through vector space modeling and a topic modeling toolkit. It's great for working with large text collections – it uses data streaming and incremental algorithms, unlike packages that only target batch, in-memory processing. Its processing speed and memory optimizations make it a great tool for NLP projects.
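
Here's a minimal document-similarity sketch in Gensim's vector space model, loosely following the library's own tutorial pattern (the toy corpus is made up for illustration).

```python
# TF-IDF document similarity with Gensim's vector space model.
# Assumes: pip install gensim
from gensim import corpora, models, similarities

docs = [
    "human machine interface for computer applications",
    "a survey of user opinion of computer system response time",
    "the generation of random binary trees",
]
tokenized = [doc.lower().split() for doc in docs]

dictionary = corpora.Dictionary(tokenized)                 # token -> id mapping
corpus = [dictionary.doc2bow(text) for text in tokenized]  # bag-of-words vectors
tfidf = models.TfidfModel(corpus)                          # reweight by TF-IDF
index = similarities.MatrixSimilarity(tfidf[corpus],
                                      num_features=len(dictionary))

query = dictionary.doc2bow("human computer interaction".split())
print(list(index[tfidf[query]]))  # similarity of the query to each document
```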

5. spaCy

Designed for production usage, spaCy offers one of the fastest syntactic parsers available. It's very fast because it was written in Cython. However, it supports the smallest number of languages of the tools listed here (seven at the time of writing).
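
Since parsing is spaCy's headline feature, here's a minimal dependency parsing sketch, reusing the en_core_web_sm model from the NER example above.

```python
# Minimal dependency parsing with spaCy.
import spacy

nlp = spacy.load("en_core_web_sm")
doc = nlp("spaCy parses sentences remarkably quickly.")

for token in doc:
    print(token.text, token.pos_, token.dep_, token.head.text)
# e.g. 'spaCy PROPN nsubj parses', 'parses VERB ROOT parses', ...
```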

6. polyglot

This library offers a broad range of analyses and impressive language coverage. It works very fast thanks to NumPy. It's a great pick for projects that involve a language spaCy doesn't support. The tool stands out because it exposes its analyses through a dedicated command-line pipeline mechanism, with models fetched via a separate download command.

Conclusions

Natural language processing is a field that has been experiencing immense growth together with other data science areas such as artificial intelligence, machine learning and deep learning. An increasing number of organizations will turn to the commercial applications of NLP, such as chatbots that will be able to address more complex requests in real time. Invisible user interfaces that rely on the direct interaction between users and machines will become more widespread too. All in all, machines will get better at understanding users and their intentions.

I hope this article helps you understand the basics of this amazing field and take the first steps towards experimenting with NLP techniques in your own projects.
