43 Cards in this Set

  • Front
  • Back

study of language at the level of sounds

Phonetics

study of the combination of sounds

phonology

study of the patterns of formation of words by the combination of sounds

morphology

study of how words combine to form phrases, phrases combine to form clauses and clauses join to make sentences

syntax/syntactic knowledge

concerns the meaning of the words and sentences

semantic knowledge

extension of meaning (semantics) to how language is used in context

pragmatic knowledge

concerns connected sentences

Discourse knowledge

everyday knowledge about the world that all speakers share

world knowledge

applications of NLP

text analytics


smart assistants


predictive text


machine translation

converts unstructured text data into meaningful data for analysis

text analytics

recognize patterns in speech thanks to voice recognition, then infer meaning and provide a useful response

smart assistants

predict things to say based on what you type, finishing the word or suggesting a relevant one

predictive text

generally translating phrases from one language to another

machine translation

used to specify strings we might want to extract from a document

regular expressions

putting characters in sequence

concatenation

used to specify what a single character cannot be, by placing a caret (^) as the first symbol inside them

square brackets

Operator, based on the asterisk (*), that lets us say things like "some number of a's": zero or more occurrences of the immediately preceding character or regex

Kleene * (pronounced "cleany star")

one or more occurrences of the immediately preceding character or regex

Kleene +
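
Example: the regex cards above (concatenation, negated square brackets, Kleene * and Kleene +) can be tried out in a few lines. This is a minimal sketch that assumes Python's standard `re` module; the sample string is made up for illustration.

```python
import re

text = "b ba baa baaa cat cut c9t"

# Concatenation: characters written in sequence match that exact sequence.
concat = re.findall(r"cat", text)

# Negated square brackets: [^a] matches any single character except 'a'.
negated = re.findall(r"c[^a]t", text)

# Kleene *: zero or more occurrences of the preceding character ('a').
star = re.findall(r"\bba*\b", text)

# Kleene +: one or more occurrences of the preceding character ('a').
plus = re.findall(r"\bba+\b", text)

print(concat, negated, star, plus)
```

Note that the Kleene * pattern also matches the bare "b" (zero a's), while the Kleene + pattern does not.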

Process of cleaning your corpus is called

Text cleaning

Diacritics, often loosely called

accents

the regex module (Python's re) has another function, used to remove punctuation from corpora

sub() — substitution
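
Example: a minimal sketch of removing punctuation with `re.sub`, assuming Python's standard `re` and `string` modules. The helper name `remove_punctuation` is hypothetical, and only ASCII punctuation is handled.

```python
import re
import string

def remove_punctuation(text: str) -> str:
    # Build a character class from ASCII punctuation, escaped so that
    # characters such as ']' and '\' are treated literally.
    pattern = "[" + re.escape(string.punctuation) + "]"
    # Substitute every punctuation character with the empty string.
    return re.sub(pattern, "", text)

cleaned = remove_punctuation("Hello, world! It's 9 a.m.")
print(cleaned)
```

Because the apostrophe is also removed, "It's" becomes "Its"; expanding contractions first avoids that.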

An international encoding standard for use with different languages and scripts, by which each letter, digit, or symbol is assigned a unique numeric value

Unicode
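
Example: a minimal sketch of looking up a character's Unicode numeric value and of stripping diacritics (accents), assuming Python's standard `unicodedata` module. The helper name `strip_diacritics` is hypothetical, and this approach only covers accents that decompose into a base letter plus combining marks.

```python
import unicodedata

def strip_diacritics(text: str) -> str:
    # NFD splits each accented letter into a base letter plus combining marks.
    decomposed = unicodedata.normalize("NFD", text)
    # Keep everything except the combining marks.
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))

# Each character has a unique numeric value (code point).
code_point = ord("\u00e9")  # "\u00e9" is e with an acute accent

plain = strip_diacritics("caf\u00e9 na\u00efve r\u00e9sum\u00e9")
print(code_point, plain)
```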

Combinations of words that are shortened by dropping letters and replacing them with apostrophes

Contractions
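
Example: a minimal sketch of expanding contractions during text cleaning, assuming Python's standard `re` module. The `CONTRACTIONS` table and the helper `expand_contractions` are hypothetical and deliberately tiny; a real table would be much larger, and some contractions are ambiguous ("it's" can also mean "it has").

```python
import re

# Hypothetical, illustrative mapping; not an exhaustive list.
CONTRACTIONS = {
    "can't": "cannot",
    "won't": "will not",
    "it's": "it is",
    "don't": "do not",
}

# Match any listed contraction as a whole word, ignoring case.
_PATTERN = re.compile(
    r"\b(?:" + "|".join(re.escape(c) for c in CONTRACTIONS) + r")\b",
    re.IGNORECASE,
)

def expand_contractions(text: str) -> str:
    # Look up the lowercase form so "Don't" and "don't" both expand.
    return _PATTERN.sub(lambda m: CONTRACTIONS[m.group(0).lower()], text)

expanded = expand_contractions("I can't go, it's late")
print(expanded)
```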

Refers to the process of converting a sequence of text into smaller parts known as tokens

Tokenization

Breaks up text into smaller chunks or segments with more focused information content

Segmentation

Builds a vocabulary containing words, but tokens are limited to

Words


punctuation marks


numbers

Carry sentiment and meaning

Graphemes

Parts of words that contain meaning in and of themselves

Morphemes

Word extraction includes

One word


pair


triplets


quadruplets

Enables your machine to know about "ice cream" as well as the "ice" and "cream" that comprise it

n-grams
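
Example: a minimal sketch of extracting n-grams (pairs, triplets, and so on) from a token list in plain Python. The helper name `ngrams` is hypothetical.

```python
def ngrams(tokens, n):
    # Slide a window of width n over the token list.
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

tokens = "I like ice cream".split()
bigrams = ngrams(tokens, 2)
trigrams = ngrams(tokens, 3)
print(bigrams)
print(trigrams)
```

The bigram ("ice", "cream") keeps the two words together as one unit, alongside the single tokens "ice" and "cream".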

Simplest way to tokenize a sentence is to use white space within a string as the

Delimiter
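
Example: a minimal sketch of whitespace tokenization using Python's built-in `str.split`. The sample sentence is made up for illustration.

```python
sentence = "Tokenization breaks text into smaller parts."

# With no argument, split() uses any run of whitespace as the delimiter.
tokens = sentence.split()
print(tokens)
```

Punctuation stays attached to the neighboring word ("parts."), which is one reason punctuation is often cleaned or tokenized separately.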

Occurrences of tokens in the sentence/paragraph/corpora

One hot vector

Word frequency

Frequency vector

Presence or absence of a particular word in a particular sentence

Binary vector
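
Example: a minimal sketch contrasting one-hot, frequency, and binary vectors over a shared vocabulary, using only Python's standard library. The two-sentence corpus and the helper names are hypothetical.

```python
from collections import Counter

corpus = ["the cat sat on the mat", "the dog sat"]
tokenized = [sentence.split() for sentence in corpus]

# Sorted vocabulary over the whole corpus fixes the vector positions.
vocab = sorted({tok for sent in tokenized for tok in sent})

def frequency_vector(tokens):
    # How many times each vocabulary word occurs in the sentence.
    counts = Counter(tokens)
    return [counts[word] for word in vocab]

def binary_vector(tokens):
    # 1 if the vocabulary word is present in the sentence, else 0.
    present = set(tokens)
    return [1 if word in present else 0 for word in vocab]

def one_hot_vectors(tokens):
    # One row per token, with a single 1 at that token's vocabulary position.
    return [[1 if word == tok else 0 for word in vocab] for tok in tokens]

print(vocab)
print(frequency_vector(tokenized[0]))
print(binary_vector(tokenized[0]))
print(one_hot_vectors(tokenized[1]))
```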

When a sequence of tokens is vectorized, it loses a lot of the meaning inherent in the order of the words; these retain some of it

n-grams

Common words in any language that occur with high frequency but carry less substantive information about the meaning of a phrase

Stop words
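
Example: a minimal sketch of filtering stop words from a token list in plain Python. The `STOP_WORDS` set is a tiny hypothetical list for illustration, not a standard one.

```python
# Hypothetical, illustrative stop-word list.
STOP_WORDS = {"a", "an", "the", "on", "of", "and", "is"}

def remove_stop_words(tokens):
    # Compare in lowercase so "The" and "the" are both filtered.
    return [tok for tok in tokens if tok.lower() not in STOP_WORDS]

filtered = remove_stop_words("The cat sat on the mat".split())
print(filtered)
```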

Removes suffixes from words in an attempt to combine words with similar meanings together under their common stem

Stemming

One of the most popular stemming methods, proposed in 1980 by the British computer scientist Martin F. Porter

Porter stemming

Multilingual, as it can handle non-English words

Snowball stemmer

More aggressive and dynamic compared to the other two stemmers

Lancaster stemmer

Saves its rules externally and uses an iterative algorithm

Lancaster stemmer

More aggressive than the Porter stemmer; also referred to as the Porter2 stemmer

Snowball stemmer

Stemming algorithm that uses regular expressions to identify and remove suffixes from words

Regexp stemmer
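
Example: a minimal sketch of the regular-expression approach to stemming, assuming Python's standard `re` module. The suffix list, the `min_length` cutoff, and the helper name `regexp_stem` are assumptions for illustration; this is not the Porter, Snowball, or Lancaster algorithm, and it is not any library's stemmer API.

```python
import re

# Hypothetical suffix list; the pattern is anchored at the end of the word.
SUFFIXES = re.compile(r"(ing|ed|es|s)$")

def regexp_stem(word: str, min_length: int = 4) -> str:
    # Leave very short words alone to avoid over-stemming.
    if len(word) < min_length:
        return word
    # Remove the first listed suffix found at the end of the word.
    return SUFFIXES.sub("", word)

stems = [regexp_stem(w) for w in ["running", "jumped", "boxes", "cats", "is"]]
print(stems)
```

The result "runn" for "running" shows how crude pure suffix stripping is; rule-based stemmers such as Porter's add further steps to handle cases like doubled consonants.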