85 Cards in this Set

  • Front
  • Back
  • 3rd side (hint)
Why is it hard to find information?
1) information overload
2) Not all information is accessible from one source
3) Lack of systematic organization
4) We don't know what we don't know.
Why this class is important
1. The way knowledge is stored, shared and preserved is changing rapidly

2. Professional opportunities
-Information management
-Web design / information architect
-Electronic commerce
-Search engine business

3. Career advancement
-Job searching
-Reporting to superiors
-Management skills

4. Life…
Why we need to develop techniques for thinking about whether information is reliable
1. How do we know who to trust?

2) Can we tell different kinds of information apart?
Facts
Opinions
Misinformation (everyone makes mistakes sometimes)
Disinformation (deceptive or downright false)
Different points-of-view

3) Things change…
Critical thinking
The use of logical thinking and reliable information to recognize the true (or possibly true) from the false (or possibly false).
Components of Google Search page
Ads
“Organic” results
Snippets/Surrogates
Location based results
Other options
Help
“Organic” Results
Appear because they are determined by the search engine’s algorithm to be relevant to the query
Anatomy of Organic Results
Title
Snippet
URL
Site navigation links
Cache and other options
Snippets/Surrogates
Can’t show the entire page of results
Snippets allow for quick review of whether there is relevance
An entire “science” of how to generate them
Mostly based on your keywords and structure of pages
Different kinds of search results
Search results now include results for news, videos, and other sources
Different than “organic results”
E.g., news may be ranked by time
Dede (2008)
Read this!!!! Dede, C. (2008). A Seismic Shift in Epistemology. Educause Review, (May/June 2008). http://www.educause.edu/EDUCAUSE+Review/EDUCAUSEReviewMagazineVolume43/ASeismicShiftinEpistemology/162892
Quickly write 5-10 words or phrases that describe some of our ITI 220 students
Can use what people are wearing, features such as hair color or length, etc.
The components of search
Crawling, Indexing, (Ranking and Matching)
Web Crawler
Aka “spider”
A program that…

Visits web pages
(usually) Downloads the content
Saves the content or extracted information in a database
If a page is not “crawled”, it will never appear in search results!
Simple Crawling Strategies Questions
How does the crawler know which pages to go to?
How are starting pages chosen?
In what order does it follow the links on those pages?
Simple Crawling Strategies
Breadth-first search. (2010). Wikipedia, the free encyclopedia. Retrieved 2/16/10 from http://en.wikipedia.org/wiki/Breadth-first_search.

Depth-first search. (2010). Wikipedia, the free encyclopedia. Retrieved 2/16/10 from http://en.wikipedia.org/wiki/Depth-first_search.
The Link Graph
See slides for BFS and DFS.
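A minimal sketch of how the two strategies visit the same link graph in different orders. The graph below is a made-up toy example (pages A to F are hypothetical), not the one from the slides.

```python
from collections import deque

# Hypothetical link graph: each page maps to the pages it links to.
LINKS = {
    "A": ["B", "C"],
    "B": ["D", "E"],
    "C": ["F"],
    "D": [],
    "E": [],
    "F": [],
}

def bfs_order(start):
    """Breadth-first: visit every link on a page before going deeper."""
    seen, order, queue = {start}, [], deque([start])
    while queue:
        page = queue.popleft()
        order.append(page)
        for nxt in LINKS.get(page, []):
            if nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    return order

def dfs_order(start):
    """Depth-first: follow the first link as deep as possible, then back up."""
    seen, order, stack = set(), [], [start]
    while stack:
        page = stack.pop()
        if page in seen:
            continue
        seen.add(page)
        order.append(page)
        # Push links in reverse so the first link is explored first.
        stack.extend(reversed(LINKS.get(page, [])))
    return order

print(bfs_order("A"))
print(dfs_order("A"))
```

Both functions reach the same pages here; only the order differs, which matters once a crawl is cut short.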
Crawl Limits
On the web, you can’t crawl forever
Limitations could be duration (time limits), computing, storage…
Any limit affects which pages are available, and the choice of limit strategy matters too:
E.g. “only 5 steps”
E.g., “only depth of 1 in each site”
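A rough sketch of how the kind of limit changes what ends up in the database. The site graph and the parameter names `max_pages` and `max_depth` are assumptions made for illustration.

```python
from collections import deque

# Hypothetical link graph used only for illustration.
LINKS = {
    "home": ["news", "about"],
    "news": ["story1", "story2"],
    "about": ["team"],
    "story1": [],
    "story2": [],
    "team": [],
}

def crawl(start, max_pages=None, max_depth=None):
    """Breadth-first crawl that stops at a page budget and/or a link depth."""
    seen = {start}
    crawled = []
    queue = deque([(start, 0)])
    while queue:
        if max_pages is not None and len(crawled) >= max_pages:
            break  # page budget used up
        page, depth = queue.popleft()
        crawled.append(page)
        if max_depth is not None and depth >= max_depth:
            continue  # do not follow links beyond the depth limit
        for nxt in LINKS.get(page, []):
            if nxt not in seen:
                seen.add(nxt)
                queue.append((nxt, depth + 1))
    return crawled

print(crawl("home", max_depth=1))
print(crawl("home", max_pages=5))
```

With a depth limit of 1 the story pages are never stored; with a budget of 5 pages they are stored but "team" is missed.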
Crawling results do vary!
Different crawling strategies mean different pages added to the engine database
I.e., there could be differences in what’s available in Yahoo/Google/MS/…
Other reasons for differences: size, duration, and time of crawl
Revisiting <> Crawling
“Crawling” includes re-visiting of sites that the engine thinks may change
Or maybe are not likely to change but if they do, it is important
This continuous crawl may follow only previously-known pages (i.e., no new links)
Why no search engine indexes all webpages
Quantity of information
Law of diminishing returns – more information is not always better
Web is changing all the time
Web sites inaccessible to search engines
Other hurdles for crawling
Need a login? Google doesn’t have one…
Databases and other special-access sites
Robots.txt: excluding crawlers from your site
http://www.robotstxt.org/robotstxt.html
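A small sketch of how a well-behaved crawler might check robots.txt before fetching a page, using Python's standard `urllib.robotparser`. The robots.txt content, the crawler name "ExampleBot", and the example.com URLs are all hypothetical.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content: block every crawler from /private/.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

# Ask whether a crawler named "ExampleBot" may fetch each URL.
allowed = parser.can_fetch("ExampleBot", "http://example.com/index.html")
blocked = parser.can_fetch("ExampleBot", "http://example.com/private/data.html")
print(allowed, blocked)
```

Note that robots.txt is a request, not an enforcement mechanism: it only works for crawlers that choose to honor it.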
The Web – Visible and invisible
Visible web: Web pages accessible to search engines
Invisible web (sometimes called “deep web”): Web pages that cannot be accessed by search engines
Why many web pages are “invisible”
Dynamically created information
Non-text format
Tables and databases, Statistical data, Images, …
Proprietary or non-indexed file format
Owner deliberately blocks search engine access
Passwords and fees (research, journals, jit info)
Software designed to block search engines
Firewalls

Fail to satisfy requirements of a specific search engine
http://www.google.com/webmasters/guidelines.html
Webpages missed because of crawling strategy
Webpage not well linked to other webpages
Result of search engine crawling strategy
Visible to Google, not to You
Google may get special access to crawl a site, even if it’s not publicly accessible
You may see results that you cannot access without login
E.g., portal.acm.org
“Invisible” sites and services
WayBack Machine http://webdev.archive.org/ (me/CNN)
Specialized databases (free)
White pages, yellow pages
Articles – http://www.findarticles.com
US Trademarks - http://www.uspto.gov/
Specialized databases (paid, but you have access to)
Most of the journal articles/databases available from:
http://www.libraries.rutgers.edu/rul/indexes/indexes.shtml
http://www.jerseyclicks.org
CAPTCHAS AND RECAPTCHA
CAPTCHA: "Completely Automated Public Turing test to tell Computers and Humans Apart."
Examples
http://en.wikipedia.org/wiki/CAPTCHA
RECAPTCHA
http://recaptcha.net/learnmore.html
What and Why? Indexing
Indexing is needed to find query matches in fastest time possible
How to find all* web pages with specific terms in them?
Not unlike a book index
Book Index
How is it similar to search engine indexing?
How is it different?
Search Engine Indexes
Textword, and which documents it appears in
E.g.:
SC&I:
http://www.comminfo.rutgers.edu/
http://library.sccsc.edu/comminfo/
http://cissl.comminfo.rutgers.edu/
http://www.ocnj.org/pages/comss/scils/comminfo.htm
http://www3.lehigh.edu/engineering/cse/news/seminars/fall08.asp
Inverted Index
A list of terms, and for each term, the pages it appears on (represented here by numbers)
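A minimal sketch of building such an index. The three pages and their text are made up for illustration, and real engines do far more text processing than splitting on spaces.

```python
from collections import defaultdict

# Hypothetical pages, identified by number.
PAGES = {
    1: "cats and dogs",
    2: "dogs chase cats",
    3: "birds sing",
}

def build_inverted_index(pages):
    """Map each term to the sorted list of page numbers it appears on."""
    index = defaultdict(set)
    for page_id, text in pages.items():
        for term in text.lower().split():
            index[term].add(page_id)
    return {term: sorted(ids) for term, ids in index.items()}

index = build_inverted_index(PAGES)
print(index["cats"])   # pages containing "cats"
print(index["birds"])  # pages containing "birds"
```

Looking up a term is then a single dictionary access instead of a scan over every page.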
Index is huge – what do you do?
Finding the pages that have the terms “cats” and “dogs”:
Finding “cats” and “dogs” amongst many other terms in index
Getting the very long list of pages for each of the terms, comparing them
Many tricks, e.g. sorting
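One way the sorting trick can help, as a sketch: if each term's page list is kept sorted, the two lists can be compared in a single pass. The page numbers below are hypothetical.

```python
def intersect_sorted(a, b):
    """Return page numbers present in both sorted lists, in one pass."""
    i, j, both = 0, 0, []
    while i < len(a) and j < len(b):
        if a[i] == b[j]:
            both.append(a[i])
            i += 1
            j += 1
        elif a[i] < b[j]:
            i += 1  # a[i] cannot match anything later in b
        else:
            j += 1  # b[j] cannot match anything later in a
    return both

# Hypothetical page-number lists for the two terms.
cats_pages = [2, 5, 8, 13, 21]
dogs_pages = [3, 5, 13, 34]
print(intersect_sorted(cats_pages, dogs_pages))
```

Each list is read once, so the work grows with the combined length of the lists rather than their product.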
A more complicated inverted index
A list of terms, and for each term, the pages it appears on and the location (e.g., word number) on that page
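A sketch of this richer index, assuming word positions are counted from 0. The two pages are invented for illustration.

```python
from collections import defaultdict

# Hypothetical pages used only for illustration.
PAGES = {
    1: "cats chase dogs and dogs chase cats",
    2: "dogs sleep",
}

def build_positional_index(pages):
    """Map term -> {page number: [word positions on that page]}."""
    index = defaultdict(lambda: defaultdict(list))
    for page_id, text in pages.items():
        for position, term in enumerate(text.lower().split()):
            index[term][page_id].append(position)
    return {term: dict(postings) for term, postings in index.items()}

pos_index = build_positional_index(PAGES)
print(pos_index["dogs"])
```

Storing positions is what makes phrase queries and "position on page" relevance signals possible.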
Curating the index terms
Stemming
Multiple words can be stemmed to one
cats=>cat etc
Stop words
Words that are too popular are sometimes not added to the index (e.g. A, An, Any, Be, Is, No, Of, …)
One reason why engines ignore certain words
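A toy sketch of stemming and stop-word removal applied before indexing. The stop-word list is just the card's examples, and `naive_stem` is a deliberately crude, hypothetical stemmer that only strips a trailing "s"; real stemmers use much more careful rules.

```python
# Tiny stop-word list taken from the examples on the card.
STOP_WORDS = {"a", "an", "any", "be", "is", "no", "of"}

def naive_stem(word):
    """Crude illustration only: strip a trailing 's' (cats -> cat)."""
    if len(word) > 3 and word.endswith("s") and not word.endswith("ss"):
        return word[:-1]
    return word

def index_terms(text):
    """Lowercase the text, drop stop words, and stem what remains."""
    words = text.lower().split()
    return [naive_stem(w) for w in words if w not in STOP_WORDS]

print(index_terms("A photo of cats is no match of any dogs"))
```

After this step "cats" and "cat" share one index entry, and the stop words take up no space in the index at all.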
Words that are not visible on Page
Engine may add hidden text (not visible in article) to the index
E.g., “meta tags”
Often-used “spamming” trick
Words that are not on Page
Anchor text: on and around the link to the page
E.g., “approachable” could be added to my page’s index!
If such terms exist, they add to the term-relevance for my page (“mor naaman”)
Google-bombing
“Miserable Failure”
How Google-bombing works
http://google.about.com/gi/o.htm?zi=1/XJ&zTi=1&sdn=google&cdn=compute&tm=14&f=10&su=p504.1.336.ip_&tt=2&bt=1&bts=1&zu=http%3A//oldfashionedpatriot.blogspot.com/2003_10_01_oldfashionedpatriot_archive.html%23106727416975886151
Google on Google Bombing
Google: “People have asked about how we feel about Googlebombs, ... Because these pranks are normally for phrases that are well off the beaten path, they haven’t been a very high priority for us. But over time, we’ve seen more people assume that they are Google’s opinion, or that Google has hand-coded the results for these Googlebombed queries. That’s not true, and… trying to correct that misperception… [we] came up with an algorithm that minimizes the impact of many Googlebombs”
http://googlewebmastercentral.blogspot.com/2007/01/quick-word-about-googlebombs.html
Index: temporary summary
The indexing technique allows a search engine to determine very quickly which web pages a certain term (or terms) appears on
Indexing may include terms that do not appear on the page!
Remember, engines only index the content they crawl
Index and Relevance
The inverted term index can help the engine determine not only whether a term appears, but also judge relevancy
Determining page relevance
Position and prominence on page
Frequency on page
Term frequency
Intuitively
In any given document, a term that occurs many times is likely to be more important than a term that occurs fewer times
Often used in comparison with “document frequency”
E.g. words like “the” appear a lot on any page in English…
The text-based ranking techniques were all we had until…Google
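A minimal sketch of counting term frequency in one document. The sentence is made up for illustration.

```python
from collections import Counter

def term_frequencies(text):
    """Count how many times each term occurs in one document."""
    return Counter(text.lower().split())

# Hypothetical document text.
doc = "the cat sat on the mat and the cat slept"
tf = term_frequencies(doc)
print(tf["cat"], tf["mat"], tf["the"])
```

Notice that "the" gets the highest count even though it says nothing about the topic, which is why term frequency is usually weighed against document frequency.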
HTML Tags
<Title>

Meta tags: Provide information about the document
<META NAME="KEYWORDS" CONTENT="Homepage, laptop">
<META NAME="TITLE" CONTENT="Dell - Official Website - Learn about Dell's laptops, desktops, monitors, printers plus computer electronics & accessories. ">
<Meta Keywords = …>
Document frequency
For a given collection of documents, the number of documents that the term occurs in
Intuition – if one term occurs in very few documents, its presence in those few documents is likely to be important
Prefer that term
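A small sketch of computing document frequency over a collection. The three documents are invented for illustration.

```python
# Hypothetical collection of three short documents.
DOCS = [
    "the cat sat on the mat",
    "the dog chased the cat",
    "the puppy slept",
]

def document_frequency(term, docs):
    """Number of documents in the collection that contain the term."""
    return sum(1 for d in docs if term in d.lower().split())

print(document_frequency("the", DOCS))
print(document_frequency("cat", DOCS))
print(document_frequency("puppy", DOCS))
```

A term is counted once per document, no matter how often it repeats inside that document.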
Ranking
Can be:
Query independent: “scils.rutgers.edu should be ranked higher than mornaaman.com in search results”
Query-dependent: “mornaaman.com should be ranked higher than scils.rutgers.edu for the query ‘mor naaman’”
In reality, a mix
Information available for ranking
Words in search query
Statistics about string occurrences (words, URLs, etc.)
Popularity and authority (Page rank, HITS)
Information about the site/domain
Commercial factors
External information
Relevance ranking by document frequency
Term occurs in
1 out of 100 documents (highest)
5 out of 100 documents (higher)
100 out of 100 documents (high)
E.g. query for Shitzu Puppy
Shitzu : 1M pages
Puppy: 13M pages
Give “shitzu” more weight in ranking
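One common way to turn these page counts into weights is inverse document frequency (log of total pages divided by pages containing the term). This is a textbook formulation, not necessarily what any particular engine uses. The page counts come from the card; `TOTAL_PAGES` is an assumed collection size chosen only to make the arithmetic concrete.

```python
import math

# Page counts from the card; TOTAL_PAGES is an assumed collection size.
TOTAL_PAGES = 1_000_000_000
PAGE_COUNTS = {"shitzu": 1_000_000, "puppy": 13_000_000}

def idf_weight(term):
    """Inverse document frequency: rarer terms get a larger weight."""
    return math.log(TOTAL_PAGES / PAGE_COUNTS[term])

for term in PAGE_COUNTS:
    print(term, round(idf_weight(term), 2))
```

Whatever total is assumed, the rarer term always comes out with the larger weight, which matches the card's conclusion.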
And then… Google
Google introduced the insight that web pages can have an “importance” metric based on the importance of links to it
(Yes, recursive definition)
E.g., the White House page is more “important” than other pages, even if another page mentions the phrase 4000 times
Google Page Rank
Uses the structure of the web to “objectively” determine importance of pages
Page Rank
Underlying intuition
-The more times that a document is linked to by other pages, the more likely it is to be the one that the user wants
-But it matters which pages link to it
Page ranking concept
Given a web page, A,
Inlinks – links from other docs pointing to page A
Outlinks- links from A pointing to other docs
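A short sketch of deriving inlinks from outlinks. A crawler sees outlinks directly when it reads a page, while inlinks have to be collected across all crawled pages. The three-page graph is hypothetical.

```python
from collections import defaultdict

# Hypothetical outlinks: page -> pages it points to.
OUTLINKS = {
    "A": ["B", "C"],
    "B": ["A"],
    "C": ["A"],
}

def invert_links(outlinks):
    """Build the inlink table: page -> pages that point to it."""
    inlinks = defaultdict(list)
    for source, targets in outlinks.items():
        for target in targets:
            inlinks[target].append(source)
    return dict(inlinks)

INLINKS = invert_links(OUTLINKS)
print(INLINKS["A"])  # pages linking to A
```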
Why Does it Matter Who Links?
Page A about Afghanistan was linked from:
cnn.com, nytimes.com, scils.rutgers.edu, washingtonpost.com,… 100 links in total
Page B about Afghanistan was linked from:
jonahlevypersonalwebsite.com, justawebsitenobodylooksat.com, anotherpagesomewhere.com,.. 1000 links in total
Which page is more authoritative?
It’s not just how many links…
It’s how many pages link to page A... and how many pages link to the pages that link to page A… and how many pages link to those pages… etc.
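A simplified sketch of this recursive idea, computed by repeating the same update until the scores settle. This is a textbook-style approximation, not Google's production algorithm. The link graph is made up, and the damping value 0.85 is a commonly quoted default assumed here.

```python
# Hypothetical link graph: page -> pages it links to.
LINKS = {
    "hub": ["a", "b"],
    "a": ["target"],
    "b": ["target"],
    "target": ["hub"],
    "lonely": ["target"],
}

def pagerank(links, damping=0.85, iterations=50):
    """Simplified PageRank: each page repeatedly passes its score to its outlinks."""
    pages = list(links)
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}
    for _ in range(iterations):
        # Every page starts each round with a small baseline score.
        new_rank = {p: (1.0 - damping) / n for p in pages}
        for page, outlinks in links.items():
            if not outlinks:
                # A page with no outlinks spreads its score over all pages.
                for p in pages:
                    new_rank[p] += damping * rank[page] / n
            else:
                share = damping * rank[page] / len(outlinks)
                for dest in outlinks:
                    new_rank[dest] += share
        rank = new_rank
    return rank

ranks = pagerank(LINKS)
for page, score in sorted(ranks.items(), key=lambda kv: -kv[1]):
    print(f"{page}: {score:.3f}")
```

In this toy graph "hub" has only one inlink yet scores well above "a" and "b", because that single link comes from the highest-ranked page.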
Other Ranking Factors
Popularity and relevance feedback
Websites that get the most hits in a certain period of time
Search engines monitor popular searches and downgrade pages that don’t often get clicked
Recency
Page Rank as “Voting”
Web site creators, site administrators, and other publishers can “vote” for a page by linking to it
Oops… “Voting” is not necessarily “democratic”!
Not everyone can vote
More about that when we discuss bias
Other Factors (Ranking)
Priority to in-house documents or documents in in-house database
Payment
Paid placement: advertiser pays for a higher ranking
Paid inclusion: advertiser pays for inclusion in database
Doesn’t exist in popular engines
External information
Offensive sites
Domain names
Balance and diversity
Personalized Search
Google pulls from Web History, Google+, and other information it has gathered about you.
Personalized Search
Personalized Search means other factors (e.g., my search and click history) could influence the ranking as well
Even more crudely:
Location (country, city, …) of searcher
Previous queries
Each search engine uses its own method for ranking web pages based on:
User query
Content indexing
Popularity
Metadata
External information
Recency
Paid placement
Ranking
Can be:
-Query independent: “comminfo.rutgers.edu should be ranked higher than mornaaman.com in search results”
-Query-dependent: “mornaaman.com should be ranked higher than comminfo.rutgers.edu for the query ‘mor naaman’”
In reality, a mix
Is Page Ranking Democratic?
How does Page Ranking work?
In what ways is Page Ranking
democratic?
not democratic?
In what ways does page ranking suppress controversial or minority positions?
Google Flu Trends
http://www.google.org/flutrends/us/#US

Ginsberg, J., Mohebbi, M. H., Patel, R. S., Brammer, L., Smolinski, M. S., & Brilliant, L. (2008). Detecting influenza epidemics using search engine query data. Nature, 457(7232), 1012–1014. doi:10.1038/nature07634
Ranking
Can be:
Same for all users
Individualized
In reality, a mix
Information available for ranking
Words in search query
Statistics about string occurrences (words, URLs, etc.)
Popularity and authority of pages (Page rank, HITS)
Attributes of the site/domain
User attributes:
Identity (and search history), location, previous queries
Page Rank
Underlying intuition
The more times that a document is linked to by other pages, the more likely it is to be the one that the user wants
But it matters which pages link to it
Page Rank is Google’s technology – other engines now use comparable algorithms
Can libraries use ranking like Page Rank?

And in any case, it’s a mix – Page Rank is just a small part of ranking

Libraries: mostly not (no links!)
Academic Papers have links (references) so more likely
SEO – Search Engine Optimization
SEO is the practice of making a web site and web pages friendly to search engines
So they get indexed properly
So that desired terms are deemed relevant
Get pages to the top of the ranked lists
Why?
You want to be seen.
You want to make money from advertising, sales.
Tips from Google
Yes, it’s legit – you can help the search engine help you…
http://www.google.com/webmasters/docs/search-engine-optimization-starter-guide.pdf
Create unique, accurate page titles
Accurately describe the page's content
Unique titles for each page
Brief descriptive titles
Make use of the "description" meta tag
Accurately summarize the page's content
URLs should describe content
E.g., http://flickr.com/groups/landscape/ and the URL above…
Content analysis and sitemaps
http://googlewebmastercentral.blogspot.com/2007/12/new-content-analysis-and-sitemap.html
Query-Independent Rank
How to boost your site’s Page Rank?
Get many pages to link to you
Better be an “important” (high page rank) page!
Illegitimate techniques (against search engine guidelines):
Link Spamming, Link Exchange and Link farms
E.g., blog comment spamming
Query-Independent Rank (comment)
Is “many pages” enough? No, many IMPORTANT pages better.

Penenberg reading: "The search engines live in a fantasy world," Boser said. "Every link is a vote. But people buy and sell links."
The JCPenney case:
Segal, D. (2011, February 12). Search Optimization and Its Dirty Little Secrets. The New York Times. Retrieved from http://www.nytimes.com/2011/02/13/business/13search.html?src=me&ref=general
Google is adjusting its algorithm to lower rankings for:
Content farms
Sites with low levels of content
Google’s response to Content Farms
Miller, C. C. (2011, February 25). New Google Search System Seeks to Weed Out Useless Results. The New York Times. Retrieved from http://www.nytimes.com/2011/02/26/technology/internet/26google.html?ref=technology
The Google Ad Auction
Introduction to the Google Ad Auction
http://www.youtube.com/watch?v=a8qQXLby4PY
Google Adwords Cost. (n.d.).Search Engine Journal. Retrieved February 23, 2012, from
http://www.searchenginejournal.com/google-adwords-cost/24536/
Google Analytics
Offers information about clicks on your site. Who links to it?

http://www.google.com/analytics/
Keyword tool
https://adwords.google.com/o/Targeting/Explorer?__u=1000000000&__c=1000000000&ideaRequestType=KEYWORD_IDEAS#search.none
Cohen (2009)
Full Boolean logic with use of logical operators

Implied Boolean logic with keyword searching

Boolean logic using search form terminology

http://www.internettutorials.net/boolean.asp
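A minimal sketch of how the Boolean operators AND, OR, and NOT map onto set operations over the pages that contain each term. The terms and page numbers are hypothetical.

```python
# Hypothetical sets of page numbers containing each term.
PAGES_WITH = {
    "cats": {1, 2, 5},
    "dogs": {2, 3, 5, 8},
    "birds": {5, 9},
}

cats_and_dogs = PAGES_WITH["cats"] & PAGES_WITH["dogs"]    # cats AND dogs
cats_or_birds = PAGES_WITH["cats"] | PAGES_WITH["birds"]   # cats OR birds
dogs_not_birds = PAGES_WITH["dogs"] - PAGES_WITH["birds"]  # dogs NOT birds

print(sorted(cats_and_dogs), sorted(cats_or_birds), sorted(dogs_not_birds))
```

AND narrows a search, OR broadens it, and NOT removes pages from the result.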
Factors affecting human information search process
Information seeker – individual’s personality, experience and world knowledge

Who are you/Where are you in life?
How do you learn? (read, listen, do)?
How do you communicate? (verbal, electronic, nonverbal)
Are you law abiding? Are you for the “underdog”? Are you “balanced”?
How do you evaluate? How does something not seem right? Do you question?
How do you research?
Task – the information need in the context of the (larger) goal of information seeker.

The task determines the information seeker’s actions
A complex task may have many subtasks
Tasks change over the course of the search process
If the outcome isn’t satisfactory, the task may need to be redefined or repeated

What is the task at hand?
What are the steps/subtasks you have to take to do this task?
Factors affecting human information search process
Domain – the subject field or type of content in which the information need can best be situated

What are the domains on the issue of File Sharing?
What are the domains of information reliability?
What are the domains of this assignment?
Setting – the environment, either cultural, virtual, or physical, in which the search is conducted

What’s the setting where you’ll do your research? Why?
Why does setting matter?
Factors affecting human information search process
5. Search system – the resources or tools that the information seeker uses.

What Search systems will you/should you use for this task? What are the differences between the types of systems?
6. Outcome – the results of the search, as indicated by the information that is found and by the change in the mental state of the information seeker.
The search process
Identify the information need(s)
Design a search strategy
Execute the search
Assess results
Organize and present results
Repeat?