43 Cards in this Set
- Front
- Back
study of language at the level of sounds |
Phonetics |
|
study of the combination of sounds |
phonology |
|
study of the patterns of word formation by the combination of sounds into minimal units of meaning (morphemes) |
morphology |
|
study of how words combine to form phrases, phrases combine to form clauses and clauses join to make sentences |
syntax/syntactic knowledge |
|
concerns the meaning of the words and sentences |
semantic knowledge |
|
extends semantics to how meaning depends on the context in which language is used |
pragmatic knowledge |
|
concerns connected sentences |
Discourse knowledge |
|
everyday knowledge about the world that all speakers share |
world knowledge |
|
applications of NLP |
text analytics, smart assistants, predictive text, machine translation |
|
converts unstructured text data into meaningful data for analysis |
text analytics |
|
recognize patterns in speech thanks to voice recognition, then infer meaning and provide a useful response |
smart assistants |
|
predict things to say based on what you type, finishing the word or suggesting a relevant one |
predictive text |
|
generally translating phrases from one language to another |
machine translation |
|
used to specify strings we might want to extract from a document |
regular expressions |
|
putting characters in sequence |
concatenation |
|
used to specify a set of characters for a single position; with a leading caret (^), specify what that single character cannot be |
square brackets |
|
Operator, written as the asterisk (*), that lets us say things like "some number of a's": zero or more occurrences of the immediately preceding character or regex |
Kleene * (pronounced "cleany star") |
|
one or more occurrences of the immediately preceding character or regex |
Kleene+ |
|
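The regular-expression cards above (concatenation, square brackets, the caret, Kleene * and Kleene +) can be tried out with Python's standard `re` module. This is only a small sketch; the sample strings and variable names are made up for illustration.

```python
import re

text = "cat bat rat"

# Concatenation: the characters c, a, t must appear in that order.
concatenation = re.findall(r"cat", text)

# Square brackets: exactly one character taken from the set {c, b}.
bracket_set = re.findall(r"[cb]at", text)

# Caret as the first character inside brackets: one character NOT in the set.
negated_set = re.findall(r"[^cb]at", text)

# Kleene star: zero or more of the immediately preceding character.
kleene_star = re.findall(r"ba*", "b ba baaa")

# Kleene plus: one or more of the immediately preceding character.
kleene_plus = re.findall(r"ba+", "b ba baaa")

print(concatenation, bracket_set, negated_set)
print(kleene_star, kleene_plus)
```

Note that `ba*` matches a lone `b` (zero a's), while `ba+` does not.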
Process of cleaning your corpus is called |
Text cleaning |
|
Diacritics, often loosely called |
accents |
|
Regex method that can be used to remove punctuation from corpora |
sub() — substitution |
|
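As a rough illustration of the `sub()` card, the sketch below strips punctuation with `re.sub`. It assumes "punctuation" means the ASCII characters in `string.punctuation`; the helper name `remove_punctuation` is hypothetical.

```python
import re
import string

def remove_punctuation(text: str) -> str:
    # Build a character class from the ASCII punctuation characters,
    # escaping the ones that are special inside a regex.
    pattern = "[" + re.escape(string.punctuation) + "]"
    # Substitute every match with the empty string.
    return re.sub(pattern, "", text)

cleaned = remove_punctuation("Hello, world! It's 9:30.")
print(cleaned)
```

Non-ASCII punctuation (curly quotes, for example) is left untouched by this version.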
An international encoding standard for use with different languages and scripts, in which each letter, digit, or symbol is assigned a unique numeric value |
unicode |
|
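The diacritics and Unicode cards can be connected with a short sketch using the standard `unicodedata` module: each character has a numeric code point, and accents can be removed by decomposing characters and dropping the combining marks. The helper name `strip_accents` is an assumption for illustration, and this approach is one common way to do it rather than the only one.

```python
import unicodedata

def strip_accents(text: str) -> str:
    # NFD normalization splits an accented letter into a base letter
    # followed by one or more combining marks.
    decomposed = unicodedata.normalize("NFD", text)
    # Keep only the characters that are not combining marks.
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))

word = "caf\u00e9"             # "café", written with an escape for clarity
code_point = ord("\u00e9")     # the unique numeric value Unicode assigns to "é"
plain = strip_accents(word)
print(code_point, plain)
```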
Combinations of words shortened by dropping letters and replacing them with apostrophes |
Contractions |
|
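Text cleaning often expands contractions back into full words. The sketch below uses a tiny hand-written lookup table; the table, the helper name `expand_contractions`, and the choice to lowercase the input are all assumptions for illustration (a real table would be much larger, and forms like "it's" are ambiguous).

```python
import re

# Hypothetical, deliberately tiny lookup table.
CONTRACTIONS = {
    "don't": "do not",
    "can't": "cannot",
    "it's": "it is",   # could also mean "it has"; simplified here
    "i'm": "i am",
}

# One alternation pattern that matches any key as a whole word.
_PATTERN = re.compile(
    r"\b(" + "|".join(re.escape(key) for key in CONTRACTIONS) + r")\b"
)

def expand_contractions(text: str) -> str:
    # Lowercase first so the lookup keys match regardless of input case.
    return _PATTERN.sub(lambda m: CONTRACTIONS[m.group(0)], text.lower())

expanded = expand_contractions("I'm sure it's fine, don't worry")
print(expanded)
```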
Refers to the process of converting a sequence of text into smaller parts known as tokens |
Tokenization |
|
Breaks up text into smaller chunks or segments with more focused information content |
Segmentation |
|
Builds a vocabulary containing words, but its tokens are limited to |
Words, punctuation marks, numbers |
|
Carry sentiment and meaning |
Graphemes |
|
Parts of words that contain meaning in and of themselves |
Morphemes |
|
Word extraction includes |
Single words, pairs, triplets, quadruplets |
|
Enables your machine to know about "ice cream" as well as the "ice" and "cream" that comprise it |
n-grams |
|
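A minimal sketch of n-gram extraction, using the "ice cream" example from the card above. The function name `ngrams` is chosen for illustration, and tokens are produced with a plain whitespace split.

```python
def ngrams(tokens, n):
    # Slide a window of length n over the token list.
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

tokens = "i like ice cream".split()
unigrams = ngrams(tokens, 1)
bigrams = ngrams(tokens, 2)
print(unigrams)
print(bigrams)
```

With both lists in the vocabulary, a model sees "ice" and "cream" individually as well as the pair ("ice", "cream").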
Simplest way to tokenize a sentence is to use white space within a string as the |
Delimiter |
|
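Whitespace tokenization, as described in the card above, is a one-liner in Python with `str.split`. The sample sentence is made up for illustration.

```python
sentence = "The quick brown fox jumps over the lazy dog."

# With no argument, split() uses any run of whitespace as the delimiter.
tokens = sentence.split()
print(tokens)
```

Note the limitation: the final token is `"dog."` with the period still attached, which is why punctuation is usually handled separately.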
Occurrences of tokens in the sentence/paragraph/corpora |
One-hot vector |
|
Word frequency |
Frequency vector |
|
Presence or absence of a particular word in a particular sentence |
Binary vector |
|
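The three vector cards above can be compared side by side. This is a sketch on a made-up two-sentence corpus; the variable names and the alphabetical vocabulary ordering are assumptions for illustration.

```python
from collections import Counter

corpus = ["the cat sat on the mat", "the dog sat"]
tokenized = [sentence.split() for sentence in corpus]

# Shared vocabulary: every distinct token in the corpus, in sorted order.
vocab = sorted({token for tokens in tokenized for token in tokens})

def one_hot(tokens):
    # One row per token occurrence, with a single 1 in that token's column.
    return [[1 if token == word else 0 for word in vocab] for token in tokens]

def frequency_vector(tokens):
    # How many times each vocabulary word occurs in the sentence.
    counts = Counter(tokens)
    return [counts[word] for word in vocab]

def binary_vector(tokens):
    # 1 if the vocabulary word is present in the sentence, else 0.
    present = set(tokens)
    return [1 if word in present else 0 for word in vocab]

print(vocab)
print(one_hot(tokenized[1]))
print(frequency_vector(tokenized[0]))
print(binary_vector(tokenized[0]))
```

The frequency and binary vectors differ only where a word repeats, as "the" does in the first sentence.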
When a sequence of tokens is vectorized it loses a lot of the meaning inherent in the order of words; these retain some of it |
N-grams |
|
Common words in any language that occur with high frequency but carry less substantive information about the meaning of a phrase |
Stop words |
|
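Stop-word removal is usually a simple filter over the tokens. The stop-word set below is a hypothetical, very short list for illustration; NLP libraries ship much longer ones.

```python
# Hypothetical mini stop-word list.
STOP_WORDS = {"a", "an", "the", "of", "on", "in", "and", "is"}

def remove_stop_words(tokens):
    # Compare in lowercase so "The" and "the" are treated alike.
    return [token for token in tokens if token.lower() not in STOP_WORDS]

filtered = remove_stop_words("The cat is on the mat".split())
print(filtered)
```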
Removes suffixes from words in an attempt to combine words with similar meanings together under their common stem |
Stemming |
|
One of the most popular stemming methods, proposed in 1980 by a British computer scientist named Martin F. Porter |
Porter stemming |
|
Multilingual as it can handle non-english words |
Snowball stemmer |
|
More aggressive and dynamic compared to the other two stemmers |
Lancaster stemmer |
|
Saves its rules externally and uses an iterative algorithm |
Lancaster stemmer |
|
More aggressive than the Porter stemmer and also referred to as the Porter2 stemmer |
Snowball stemmer |
|
Stemming algorithm that uses regular expressions to identify and remove suffixes from words |
regexp stemmer |
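
The idea behind a regexp stemmer can be sketched with the standard `re` module alone. This is not any library's implementation; the suffix list, the `min_length` parameter, and the function name `regexp_stem` are assumptions chosen for illustration.

```python
import re

# Suffixes to strip, anchored at the end of the word.
SUFFIX_PATTERN = re.compile(r"(ing|ed|es|s)$")

def regexp_stem(word: str, min_length: int = 4) -> str:
    # Leave very short words alone to avoid stemming them to nothing.
    if len(word) < min_length:
        return word
    return SUFFIX_PATTERN.sub("", word)

stems = [regexp_stem(w) for w in ["running", "jumped", "boxes", "cats", "is"]]
print(stems)
```

The result for "running" is "runn", which shows how crude pure suffix stripping is compared with rule-based stemmers such as Porter's.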