To give this blog post a little background, I decided to work on the Text-Based Adventure Game. I was incredibly into the idea of creating a game and learning more about processing string inputs. Unfortunately, what I didn’t realize was that I would be taking a dive into a rabbit hole of information. Before starting my research, I didn’t know that Natural Language Processing (NLP) falls under the umbrella of Machine Learning, since we need to “teach” computers how to process natural language, and that was only one of a whole slew of new things I had to learn. The purpose of this post is to give my perspective on NLP and restate what I learned in my own words.
To start off, let’s define what a natural language is. Think about the languages we speak day-to-day, such as English. English is a language that has developed naturally over the course of time. In contrast, think about the programming languages we use to code, such as Python. Python is a language that was created artificially, with specific rules for how we can use it properly. Now that we know the basic definition of natural language, let’s discuss natural language on a deeper level.
Continuing on with English as an example, English has a set of grammatical rules and semantics that allow what we say to be coherent to another English-speaker. For example, the sentence “The dog jumped on the bench” makes sense grammatically, as it contains a noun phrase (“The dog”) and a verb phrase (“jumped on the bench”), and makes sense semantically and logically (a dog definitely can jump on a bench). Meanwhile, the sentence “Plain flowers hike up a gray church” makes sense grammatically, but does not make sense semantically (flowers cannot normally hike).
Noam Chomsky, a prominent linguist, postulated that there are syntactic rules that are universal to all languages, and established the idea of context-free grammar (CFG). Simply put, CFG expresses that a sentence can be broken down into parts, in what is known as a parse tree (more on CFG here: https://en.wikipedia.org/wiki/Context-free_grammar).
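To make the idea of a parse tree a bit more concrete, here is a minimal sketch in Python. The rule names (`S`, `NP`, `VP`, `PP`, and so on) and the helper functions `leaves` and `follows_grammar` are my own illustrative choices, not part of any library, and the grammar only covers the one example sentence from earlier.

```python
# A toy context-free grammar: each non-terminal maps to its possible expansions.
# The rule names are illustrative assumptions, not an official grammar of English.
GRAMMAR = {
    "S": [["NP", "VP"]],      # sentence -> noun phrase + verb phrase
    "NP": [["Det", "N"]],     # noun phrase -> determiner + noun
    "VP": [["V", "PP"]],      # verb phrase -> verb + prepositional phrase
    "PP": [["P", "NP"]],      # prepositional phrase -> preposition + noun phrase
    "Det": [["the"]],
    "N": [["dog"], ["bench"]],
    "V": [["jumped"]],
    "P": [["on"]],
}

# A parse tree for "the dog jumped on the bench" as nested tuples:
# (label, child, child, ...), where a leaf is a plain string.
TREE = (
    "S",
    ("NP", ("Det", "the"), ("N", "dog")),
    ("VP", ("V", "jumped"),
           ("PP", ("P", "on"),
                  ("NP", ("Det", "the"), ("N", "bench")))),
)


def leaves(tree):
    """Collect the words at the bottom of the tree, left to right."""
    if isinstance(tree, str):
        return [tree]
    words = []
    for child in tree[1:]:
        words.extend(leaves(child))
    return words


def follows_grammar(tree):
    """Check that every node in the tree uses a rule listed in GRAMMAR."""
    if isinstance(tree, str):
        return True
    label, children = tree[0], tree[1:]
    expansion = [c if isinstance(c, str) else c[0] for c in children]
    return expansion in GRAMMAR.get(label, []) and all(
        follows_grammar(c) for c in children
    )


print(" ".join(leaves(TREE)))   # the words of the sentence, in order
print(follows_grammar(TREE))    # whether each node matches a grammar rule
```

Reading the tree from the top down shows how the sentence splits into a noun phrase and a verb phrase, and reading the leaves from left to right gives the sentence back.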
Now that we know what a natural language is and the parts it is composed of, let’s take a look at how we can “teach” computers to process it. Say that we want our program to process a simple sentence. In this case, we need to perform both lexical analysis and parsing on the sentence.
To start off, doing lexical analysis on a sentence simply means breaking it down into its base parts:
- Tokenization –> Turn the sentence into a list of words using the white space between words in the sentence
  - Example: “An apple falls from the tree”
  - [“An”, “apple”, “falls”, “from”, “the”, “tree”]
- Parts of Speech Tagging –> Tag each word in the list with its part of speech
  - [“An”: article, “apple”: noun, “falls”: verb, “from”: preposition, “the”: article, “tree”: noun]
- Stemming & Lemmatization –> Revert each word to its base form
  - [“An”, “apple”, “falls” –> “fall”, “from”, “the”, “tree”]
- Stop Word Removal –> Remove prepositions and articles
  - [“An”: article, “apple”: noun, “falls”: verb, “from”: preposition, “the”: article, “tree”: noun] –> [“apple”: noun, “falls”: verb, “tree”: noun]
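The four steps above can be sketched in plain Python. This is only a toy version under some loud assumptions: the `LEXICON` dictionary is a tiny hand-written lookup that I made up for the example sentence (a real tagger would be learned from data), and the `stem` function only knows how to strip a trailing “s” from verbs.

```python
# Hypothetical hand-written lexicon covering only the example sentence.
LEXICON = {
    "an": "article", "the": "article",
    "apple": "noun", "tree": "noun",
    "falls": "verb", "from": "preposition",
}
# Parts of speech that this sketch treats as stop words.
STOP_TAGS = {"article", "preposition"}


def tokenize(sentence):
    """Split the sentence into words on white space."""
    return sentence.split()


def tag(tokens):
    """Pair each word with its part of speech ("unknown" if not in LEXICON)."""
    return [(word, LEXICON.get(word.lower(), "unknown")) for word in tokens]


def stem(word, pos):
    """Very naive stemming: drop a trailing "s" from verbs only."""
    if pos == "verb" and word.endswith("s"):
        return word[:-1]
    return word


def remove_stop_words(tagged):
    """Drop every word whose tag is an article or a preposition."""
    return [(word, pos) for word, pos in tagged if pos not in STOP_TAGS]


def lexical_analysis(sentence):
    """Run tokenization, tagging, stemming, and stop word removal in order."""
    tagged = tag(tokenize(sentence))
    stemmed = [(stem(word, pos), pos) for word, pos in tagged]
    return remove_stop_words(stemmed)


print(lexical_analysis("An apple falls from the tree"))
```

For a text adventure, what is left after these steps (roughly a verb and the nouns it applies to) is usually the part of the command the game actually cares about.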
Parsing a sentence, in turn, means confirming that the sentence is grammatically (syntactically) correct and that it makes sense semantically. To be grammatically correct, the sentence needs to follow a formal grammar structure, such as the context-free grammar described by Noam Chomsky. To be semantically correct, the sentence must be logically sound (i.e. you cannot use a door on a key, but you could use a key on a door).
Parsing can be done in multiple ways: you could choose top-down or bottom-up parsing, and left-most or right-most derivation.
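The “logically sound” check from the key-and-door example could look something like the sketch below in a text adventure. The `USABLE_ON` table and the `is_sensible` function are hypothetical names I made up for illustration; a real game would define its own rules.

```python
# Hypothetical game rules: which item can be used on which target.
USABLE_ON = {
    "key": {"door", "chest"},
    "torch": {"cobweb"},
}


def is_sensible(tool, target):
    """Return True if using `tool` on `target` is allowed by the game rules."""
    return target in USABLE_ON.get(tool, set())


print(is_sensible("key", "door"))   # key on door
print(is_sensible("door", "key"))   # door on key
```

Both commands have the same grammatical shape ("use X on Y"), so only a lookup like this one can tell them apart.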
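As a rough illustration of the top-down approach, here is a small recursive-descent parser sketch. It starts from the sentence symbol `S` and always expands the left-most symbol first. The grammar, the word list in `LEXICON`, and the function names are all assumptions I made for this example, and the parser commits to the first expansion that matches, so it is not a general-purpose CFG parser.

```python
# Toy grammar and word list; both are illustrative assumptions.
GRAMMAR = {
    "S": [["NP", "VP"]],
    "NP": [["Det", "N"]],
    "VP": [["V", "PP"], ["V"]],
    "PP": [["P", "NP"]],
}
LEXICON = {
    "the": "Det", "an": "Det",
    "dog": "N", "bench": "N", "apple": "N", "tree": "N",
    "jumped": "V", "falls": "V",
    "on": "P", "from": "P",
}


def parse(symbol, tokens, pos):
    """Try to derive `symbol` starting at tokens[pos].

    Returns (tree, next_pos) on success, or None on failure.
    """
    if symbol not in GRAMMAR:
        # Word-level symbol (Det, N, V, P): it must match the next token.
        if pos < len(tokens) and LEXICON.get(tokens[pos]) == symbol:
            return (symbol, tokens[pos]), pos + 1
        return None
    for expansion in GRAMMAR[symbol]:
        children, cur = [], pos
        for part in expansion:
            result = parse(part, tokens, cur)
            if result is None:
                break  # this expansion failed, try the next one
            child, cur = result
            children.append(child)
        else:
            # Every part matched; commit to this expansion.
            return (symbol, *children), cur
    return None


def parse_sentence(sentence):
    """Return a parse tree if the whole sentence fits the grammar, else None."""
    tokens = sentence.lower().split()
    result = parse("S", tokens, 0)
    if result is not None and result[1] == len(tokens):
        return result[0]
    return None


print(parse_sentence("The dog jumped on the bench"))
print(parse_sentence("dog the jumped"))
```

A bottom-up parser would work in the opposite direction, starting from the words and combining them into larger phrases until it reaches `S`.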
As we can see, Natural Language Processing is an extensive topic that cannot be covered in one blog post. However, I hope that this post has given you insight into what NLP is and how it is done.
If you want to find out more about Natural Language Processing, I highly recommend checking out these links:
- https://www.tutorialspoint.com/natural_language_processing/natural_language_processing_syntactic_analysis.htm
- https://medium.com/mlearning-ai/nlp-tokenization-stemming-lemmatization-and-part-of-speech-tagging-9088ac068768
- https://www.youtube.com/watch?v=bxpc9Pp5pZM
- https://www.youtube.com/watch?v=2tqdBC2weCo
- https://towardsdatascience.com/your-guide-to-natural-language-processing-nlp-48ea2511f6e1