85 Cards in this Set
Why is it hard to find information?
|
1) Information overload
2) Not all information is accessible from one source
3) Lack of systematic organization
4) We don't know what we don't know. |
|
|
Why this class is important
|
1. The way knowledge is stored, shared, and preserved is changing rapidly
2. Professional opportunities
   - Information management
   - Web design / information architect
   - Electronic commerce
   - Search engine business
3. Career advancement
   - Job searching
   - Reporting to superiors
   - Management skills
4. Life… |
|
|
Why we need to develop techniques for thinking about whether information is reliable
|
1) How do we know whom to trust?
2) Can we tell different kinds of information apart?
   - Facts
   - Opinions
   - Misinformation (everyone makes mistakes sometimes)
   - Disinformation (deceptive or downright false)
   - Different points of view
3) Things change… |
|
|
Critical thinking
|
The use of logical thinking and reliable information to recognize the true (or possibly true) from the false (or possibly false).
|
|
|
Components of Google Search page
|
Ads
“Organic” results
Snippets/Surrogates
Location-based results
Other options
Help |
|
|
“Organic” Results
|
Appear because they are determined by the search engine’s algorithm to be relevant to the query
|
|
|
Anatomy of Organic Results
|
Title
Snippet
URL
Site navigation links
Cache and other options |
|
|
Snippets/Surrogates
|
A results page can’t show each matching page in full
Snippets allow a quick review of whether a result is relevant
There is an entire “science” of how to generate them
Mostly based on your keywords and the structure of the pages |
|
|
Different kinds of search results
|
Search results now include results for news, videos, and other sources
These are different from “organic” results
E.g., news may be ranked by time |
|
|
Dede (2008)
|
Read this!!!! Dede, C. (2008). A Seismic Shift in Epistemology. Educause Review, (May/June 2008). http://www.educause.edu/EDUCAUSE+Review/EDUCAUSEReviewMagazineVolume43/ASeismicShiftinEpistemology/162892
|
|
|
Quickly write 5-10 words or phrases that describe some of our ITI 220 students
|
Can use what people are wearing, features such as hair color or length, etc.
|
|
|
The components of search
|
Crawling, Indexing, (Ranking and Matching)
|
|
|
Web Crawler
|
Aka “spider”
A program that:
- Visits web pages
- (Usually) downloads the content
- Saves the content or extracted information in a database
If a page is not “crawled”, it will never appear in search results! |
|
|
Simple Crawling Strategies Questions
|
How does the crawler know which pages to go to?
How are starting pages chosen? In what order does it follow the links on those pages? |
|
|
Simple Crawling Strategies
|
Breadth-first search. (2010). Wikipedia, the free encyclopedia. Retrieved 2/16/10 from http://en.wikipedia.org/wiki/Breadth-first_search.
Depth-first search. (2010). Wikipedia, the free encyclopedia. Retrieved 2/16/10 from http://en.wikipedia.org/wiki/Depth-first_search. |
|
|
The Link Graph
|
See slides for BFS and DFS.
|
|
|
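The difference between the two strategies can be sketched on a toy link graph. This is a minimal illustration, not a real crawler: the graph `LINKS` and the function name `crawl_order` are made up for this example, and no pages are actually downloaded.

```python
from collections import deque

# Hypothetical link graph: each page maps to the pages it links to.
LINKS = {
    "A": ["B", "C"],
    "B": ["D", "E"],
    "C": ["F"],
    "D": [],
    "E": [],
    "F": [],
}

def crawl_order(links, start, strategy="bfs"):
    """Return the order in which pages are visited, starting from `start`."""
    frontier = deque([start])
    seen = set()
    order = []
    while frontier:
        # Breadth-first takes the oldest page found; depth-first the newest.
        page = frontier.popleft() if strategy == "bfs" else frontier.pop()
        if page in seen:
            continue
        seen.add(page)
        order.append(page)
        outlinks = links.get(page, [])
        # For depth-first, push in reverse so the first link is followed first.
        frontier.extend(outlinks if strategy == "bfs" else reversed(outlinks))
    return order

print(crawl_order(LINKS, "A", "bfs"))  # level by level
print(crawl_order(LINKS, "A", "dfs"))  # one branch at a time
```

Breadth-first visits all pages one link away before going deeper; depth-first follows one chain of links to its end before backing up.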
Crawl Limits
|
On the web, you can’t crawl forever
Limitations could be duration (time limits), computing, storage…
The limit affects which pages are available, and so does the strategy used to apply it:
- E.g., “only 5 steps”
- E.g., “only depth of 1 in each site” |
|
|
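A rough sketch of how a limit changes what gets crawled, assuming a breadth-first crawl over a made-up link graph. The names `limited_crawl`, `max_depth`, and `max_pages` are hypothetical, chosen only for this illustration.

```python
from collections import deque

# Hypothetical chain-like site: A links to B and C, B to D, D to E.
SITE = {"A": ["B", "C"], "B": ["D"], "C": [], "D": ["E"], "E": []}

def limited_crawl(links, start, max_depth=None, max_pages=None):
    """Breadth-first crawl that stops at a depth limit and/or a page budget."""
    seen = {start}
    frontier = deque([(start, 0)])
    crawled = []
    while frontier:
        if max_pages is not None and len(crawled) >= max_pages:
            break  # page budget used up
        page, depth = frontier.popleft()
        crawled.append(page)
        if max_depth is not None and depth >= max_depth:
            continue  # do not follow links past the depth limit
        for target in links.get(page, []):
            if target not in seen:
                seen.add(target)
                frontier.append((target, depth + 1))
    return crawled

print(limited_crawl(SITE, "A", max_depth=1))
print(limited_crawl(SITE, "A", max_pages=2))
```

Pages D and E exist in both runs, but neither limited crawl reaches them, so they would be missing from the engine's database.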
Crawling results do vary!
|
Different crawling strategies mean different pages are added to the engine’s database
I.e., there could be differences in what’s available in Yahoo/Google/MS/…
Other reasons for differences: size, duration, and time of crawl |
|
|
Revisiting <> Crawling
|
“Crawling” includes re-visiting sites that the engine thinks may change
Or sites that are unlikely to change but whose changes would be important
This continuous crawl may follow only previously known pages (i.e., no new links) |
|
|
Why no search engine indexes all webpages
|
Quantity of information
Law of diminishing returns – more information is not always better
The web is changing all the time
Some web sites are inaccessible to search engines |
|
|
Other hurdles for crawling
|
Need a login? Google doesn’t have one…
Databases and other special-access sites
Robots.txt: excluding crawlers from your site
http://www.robotstxt.org/robotstxt.html |
|
|
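A small sketch of how a well-behaved crawler might check robots.txt rules, using Python's standard `urllib.robotparser`. The rules in `ROBOTS_TXT`, the `example.com` URLs, and the crawler name `ExampleBot` are invented for illustration; a real crawler would fetch the file from the site itself.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content: all crawlers are kept out of /private/.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
"""

def allowed(url, user_agent="ExampleBot"):
    """Return True if the rules above let `user_agent` fetch `url`."""
    parser = RobotFileParser()
    parser.parse(ROBOTS_TXT.splitlines())
    return parser.can_fetch(user_agent, url)

print(allowed("http://example.com/index.html"))
print(allowed("http://example.com/private/report.html"))
```

Pages under an excluded path are never downloaded by a compliant crawler, so their content does not reach the index.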
The Web – Visible and invisible
|
Visible web: Web pages accessible to search engines
Invisible web (sometimes called “deep web”): Web pages that cannot be accessed by search engines |
|
|
Why many web pages are “invisible”
|
Dynamically created information
Non-text format
Tables and databases, statistical data, images, …
Proprietary or non-indexed file format
Owner deliberately blocks search engine access
Passwords and fees (research, journals, jit info)
Software designed to block search engines
Firewalls
…
Fail to satisfy requirements of a specific search engine
http://www.google.com/webmasters/guidelines.html |
|
|
Webpages missed because of crawling strategy
|
Webpage not well linked to other webpages
Result of search engine crawling strategy |
|
|
Visible to Google, not to You
|
Google may get special access to crawl a site, even if it’s not publicly accessible
You may see results that you cannot access without a login
E.g., portal.acm.org |
|
|
“Invisible” sites and services
|
WayBack Machine http://webdev.archive.org/ (me/CNN)
Specialized databases (free)
- White pages, yellow pages
- Articles – http://www.findarticles.com
- US Trademarks – http://www.uspto.gov/
Specialized databases (paid, but you have access to)
- Most of the journal articles/databases available from:
  http://www.libraries.rutgers.edu/rul/indexes/indexes.shtml
  http://www.jerseyclicks.org |
|
|
CAPTCHAs and reCAPTCHA
|
CAPTCHA: "Completely Automated Public Turing test to tell Computers and Humans Apart."
Examples: http://en.wikipedia.org/wiki/CAPTCHA
reCAPTCHA: http://recaptcha.net/learnmore.html |
|
|
What and Why? Indexing
|
Indexing is needed to find query matches in the fastest time possible
How to find all* web pages with specific terms in them?
Not unlike a book index |
|
|
Book Index
|
How is it similar to search engine indexing?
How is it different? |
|
|
Search Engines Indexes
|
A text word, and which documents it appears in
E.g.: SC&I:
http://www.comminfo.rutgers.edu/
http://library.sccsc.edu/comminfo/
http://cissl.comminfo.rutgers.edu/
http://www.ocnj.org/pages/comss/scils/comminfo.htm
http://www3.lehigh.edu/engineering/cse/news/seminars/fall08.asp |
|
|
Inverted Index
|
A list of terms, and for each term, the pages it appears on (represented here by numbers)
|
|
|
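As a minimal sketch, an inverted index can be built by scanning each page and recording the page number under every word it contains. The three tiny "pages" and the function name `build_inverted_index` are invented for this illustration; real engines also handle punctuation, stemming, and much more.

```python
from collections import defaultdict

# Hypothetical crawled pages: page number -> text.
PAGES = {
    1: "cats and dogs",
    2: "dogs chase cats",
    3: "birds sing",
}

def build_inverted_index(pages):
    """Map each term to the sorted list of page numbers it appears on."""
    index = defaultdict(set)
    for page_id, text in pages.items():
        for term in text.lower().split():
            index[term].add(page_id)
    return {term: sorted(ids) for term, ids in index.items()}

INDEX = build_inverted_index(PAGES)
print(INDEX["cats"])
print(INDEX["birds"])
```

Looking up a term is then a single dictionary access rather than a scan of every page.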
Index is huge – what do you do?
|
Finding the pages that have the terms “cats” and “dogs”:
Finding “cats” and “dogs” amongst many other terms in the index
Getting the very long list of pages for each of the terms and comparing them
Many tricks, e.g., sorting |
|
|
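One common form of the sorting trick: if each term's page list is kept in sorted order, the two lists can be compared in a single pass. The sketch below assumes sorted lists of page numbers; the lists and the name `intersect` are made up for illustration.

```python
def intersect(pages_a, pages_b):
    """Return page numbers present in both sorted lists, in one pass."""
    i = j = 0
    both = []
    while i < len(pages_a) and j < len(pages_b):
        if pages_a[i] == pages_b[j]:
            both.append(pages_a[i])
            i += 1
            j += 1
        elif pages_a[i] < pages_b[j]:
            i += 1  # advance whichever list is behind
        else:
            j += 1
    return both

cats = [1, 4, 7, 9, 12]        # hypothetical pages containing "cats"
dogs = [2, 4, 9, 10, 12, 15]   # hypothetical pages containing "dogs"
print(intersect(cats, dogs))
```

Each list is walked only once, so the work grows with the combined length of the two lists rather than their product.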
A more complicated inverted index
|
A list of terms, and for each term, the pages it appears on and the location (e.g., word number) on that page
|
|
|
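A sketch of such an index, with word numbers stored per page. Keeping positions makes it possible to check whether two words appear next to each other, which is one way phrase matching could work. The sample pages and the names `build_positional_index` and `phrase_pages` are hypothetical.

```python
from collections import defaultdict

# Hypothetical pages: page number -> text.
PAGES = {
    1: "new york city",
    2: "york is new",
    3: "visit new york",
}

def build_positional_index(pages):
    """Map term -> {page number: [word positions on that page]}."""
    index = defaultdict(dict)
    for page_id, text in pages.items():
        for position, term in enumerate(text.lower().split()):
            index[term].setdefault(page_id, []).append(position)
    return dict(index)

def phrase_pages(index, first, second):
    """Pages where `second` appears immediately after `first`."""
    hits = []
    for page_id, positions in index.get(first, {}).items():
        following = set(index.get(second, {}).get(page_id, []))
        if any(p + 1 in following for p in positions):
            hits.append(page_id)
    return sorted(hits)

INDEX = build_positional_index(PAGES)
print(INDEX["new"])
print(phrase_pages(INDEX, "new", "york"))
```

Page 2 contains both words but not as the phrase "new york", which a plain inverted index could not tell apart.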
Curating the index terms
|
Stemming
Multiple words can be stemmed to one: cats => cat, etc.
Stop words
Words that are too common are sometimes not added to the index (e.g., A, An, Any, Be, Is, No, Of, …)
One reason why engines ignore certain words |
|
|
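A deliberately crude sketch of both ideas. The stop-word list reuses the examples from this card; the plural-stripping rule in `naive_stem` is an assumption made for illustration and is far simpler than the stemmers real engines use.

```python
# Stop words taken from the examples on this card.
STOP_WORDS = {"a", "an", "any", "be", "is", "no", "of"}

def naive_stem(word):
    """Strip a plural "s" (illustrative only; real stemmers have many rules)."""
    if len(word) > 3 and word.endswith("s") and not word.endswith("ss"):
        return word[:-1]
    return word

def index_terms(text):
    """Lowercase the text, drop stop words, and stem what remains."""
    words = text.lower().split()
    return [naive_stem(w) for w in words if w not in STOP_WORDS]

print(index_terms("A photo of cats"))
```

Because "cats" is stored as "cat", a query for either form can match the same index entry, and the stop words take up no space in the index.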
Words that are not visible on Page
|
Engine may add hidden text (not visible in article) to the index
E.g., “meta tags”
An often-used “spamming” trick |
|
|
Words that are not on Page
|
Anchor text: on and around the link to the page
E.g., “approachable” could be added to my page’s index!
If they exist, the terms add to the term relevance for my page (“mor naaman”) |
|
|
Google-bombing
|
“Miserable Failure”
How Google-bombing works http://google.about.com/gi/o.htm?zi=1/XJ&zTi=1&sdn=google&cdn=compute&tm=14&f=10&su=p504.1.336.ip_&tt=2&bt=1&bts=1&zu=http%3A//oldfashionedpatriot.blogspot.com/2003_10_01_oldfashionedpatriot_archive.html%23106727416975886151 |
|
|
Google on Google Bombing
|
Google: “People have asked about how we feel about Googlebombs, ... Because these pranks are normally for phrases that are well off the beaten path, they haven’t been a very high priority for us. But over time, we’ve seen more people assume that they are Google’s opinion, or that Google has hand-coded the results for these Googlebombed queries. That’s not true, and… trying to correct that misperception… [we] came up with an algorithm that minimizes the impact of many Googlebombs”
http://googlewebmastercentral.blogspot.com/2007/01/quick-word-about-googlebombs.html |
|
|
Index: temporary summary
|
The indexing technique allows a search engine to determine very quickly which web pages a certain term (or terms) appears on
Indexing may include terms that do not appear on the page!
Remember, engines only index the content they crawl |
|
|
Index and Relevance
|
The inverted term index can help the engine determine not only whether a term appears, but also judge relevancy
|
|
|
Determining page relevance
|
Position and prominence on page
Frequency on page |
|
|
Term frequency
|
Intuitively:
In any given document, a term that occurs many times is likely to be more important than a term that occurs fewer times
Often used in comparison with “document frequency”
E.g., words like “the” appear a lot on any page in English…
The text-based ranking techniques were all we had until… Google |
|
|
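Term frequency is just a word count within one document. A minimal sketch, with an invented sentence standing in for a page and a hypothetical function name:

```python
from collections import Counter

def term_frequencies(text):
    """Count how many times each word occurs in one document."""
    return Counter(text.lower().split())

counts = term_frequencies("the cat chased the other cat up the tree")
print(counts["the"], counts["cat"], counts["tree"])
```

Note that "the" scores highest here even though it says little about the page, which is why term frequency is usually weighed against document frequency.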
HTML Tags
|
<Title>
Meta tags: provide information about the document
<META NAME="KEYWORDS" CONTENT="Homepage, laptop">
<META NAME="TITLE" CONTENT="Dell - Official Website - Learn about Dell's laptops, desktops, monitors, printers plus computer electronics & accessories. ">
<Meta Keywords = …> |
|
|
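A sketch of how an indexer might read the title and meta tags from a page, using Python's standard `html.parser`. The sample page in `PAGE` and the class name `HeadInfo` are invented for this illustration.

```python
from html.parser import HTMLParser

class HeadInfo(HTMLParser):
    """Collect the <title> text and named <meta> tags from a page."""

    def __init__(self):
        super().__init__()
        self.title = ""
        self.meta = {}
        self._in_title = False

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)  # the parser lowercases tag and attribute names
        if tag == "title":
            self._in_title = True
        elif tag == "meta" and "name" in attrs:
            self.meta[attrs["name"].lower()] = attrs.get("content", "")

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False

    def handle_data(self, data):
        if self._in_title:
            self.title += data

# Hypothetical page source.
PAGE = ('<html><head><title>Example Laptops</title>'
        '<META NAME="KEYWORDS" CONTENT="Homepage, laptop"></head>'
        '<body>Welcome</body></html>')

info = HeadInfo()
info.feed(PAGE)
print(info.title)
print(info.meta["keywords"])
```

The keywords never appear in the visible body text, yet an engine that reads meta tags could still add them to the index.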
Document frequency
|
For a given collection of documents, the number of documents that the term occurs in
Intuition – if a term occurs in very few documents, its presence in those few documents is likely to be important
Prefer that term |
|
|
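A sketch of document frequency and one common way to turn it into a weight. The card gives only the intuition; the logarithmic "inverse document frequency" formula log(N / df) used below is a widely used choice, assumed here for illustration. The four sample documents and the function names are invented.

```python
import math

# Hypothetical document collection.
DOCS = [
    "the shih tzu puppy",
    "the puppy sleeps",
    "the cat",
    "the dog barks",
]

def document_frequency(term, documents):
    """Number of documents in the collection that contain the term."""
    return sum(1 for doc in documents if term in doc.lower().split())

def idf(term, documents):
    """Rarer terms get a larger weight (assumed formula: log(N / df))."""
    df = document_frequency(term, documents)
    if df == 0:
        return 0.0
    return math.log(len(documents) / df)

for term in ("the", "puppy", "shih"):
    print(term, document_frequency(term, DOCS), round(idf(term, DOCS), 3))
```

A term that occurs in every document gets a weight of zero under this formula, while the rarest term gets the largest weight.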
Ranking
|
Can be:
Query-independent: “scils.rutgers.edu should be ranked higher than mornaaman.com in search results”
Query-dependent: “mornaaman.com should be ranked higher than scils.rutgers.edu for the query ‘mor naaman’”
In reality, a mix |
|
|
Information available for ranking
|
Words in search query
Statistics about string occurrences (words, URLs, etc.)
Popularity and authority (Page Rank, HITS)
Information about the site/domain
Commercial factors
External information |
|
|
Relevance ranking by document frequency
|
Term occurs in:
1 out of 100 documents (highest)
5 out of 100 documents (higher)
100 out of 100 documents (high)
E.g., query for “Shih Tzu puppy”
“Shih Tzu”: 1M pages
“puppy”: 13M pages
Give “Shih Tzu” more weight in ranking |
|
|
And then… Google
|
Google introduced the insight that a web page can have an “importance” metric based on the importance of the pages that link to it
(Yes, a recursive definition)
E.g., the White House page is more “important” than other pages, even if another page mentions the phrase 4000 times |
|
|
Google Page Rank
|
Uses the structure of the web to “objectively” determine importance of pages
|
|
|
Page Rank
|
Underlying intuition
- The more times that a document is linked to by other pages, the more likely it is to be the one that the user wants
- But it matters which pages link to it |
|
|
Page ranking concept
|
Given a web page, A:
Inlinks – links from other docs pointing to page A
Outlinks – links from A pointing to other docs |
|
|
Why Does it Matter Who Links?
|
Page A about Afghanistan was linked from:
cnn.com, nytimes.com, scils.rutgers.edu, washingtonpost.com, …
100 links in total
Page B about Afghanistan was linked from:
jonahlevypersonalwebsite.com, justawebsitenobodylooksat.com, anotherpagesomewhere.com, …
1000 links in total
Which page is more authoritative? |
|
|
It’s not just how many links…
|
It’s how many pages link to page A… and how many pages link to the pages that link to page A… and how many pages link to those pages… etc.
|
|
|
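The recursive idea can be sketched as a repeated calculation: every page starts with an equal score, then repeatedly passes its score along its outlinks until the numbers settle. This is a simplified textbook-style version, not Google's actual implementation; the four-page graph, the function name, and the damping value of 0.85 are assumptions for illustration.

```python
def pagerank(links, damping=0.85, iterations=50):
    """Simplified PageRank by repeated score passing (illustrative sketch)."""
    pages = list(links)
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}
    for _ in range(iterations):
        # Every page keeps a small baseline score.
        new_rank = {p: (1.0 - damping) / n for p in pages}
        for page, outlinks in links.items():
            if outlinks:
                # A page splits its score evenly among the pages it links to.
                share = damping * rank[page] / len(outlinks)
                for target in outlinks:
                    new_rank[target] += share
            else:
                # A page with no outlinks spreads its score over all pages.
                for target in pages:
                    new_rank[target] += damping * rank[page] / n
        rank = new_rank
    return rank

# Hypothetical web: A, B, and D all link to C; C links back to A only.
WEB = {"A": ["C"], "B": ["C"], "C": ["A"], "D": ["C"]}
RANKS = pagerank(WEB)
for page in sorted(RANKS, key=RANKS.get, reverse=True):
    print(page, round(RANKS[page], 3))
```

Page A has only one inlink, but it comes from the highly ranked page C, so A ends up well above B and D, which have no inlinks at all.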
Other Ranking Factors
|
Popularity and relevance feedback
Websites that get the most hits in a certain period of time
Search engines monitor popular searches and downgrade pages that don’t often get clicked
Recency |
|
|
Page Rank as “Voting”
|
Web site creators, site administrators, and other publishers can “vote” for a page by linking to it
Oops… “Voting” is not necessarily “democratic”!
Not everyone can vote
More about that when we discuss bias |
|
|
Other Factors (Ranking)
|
Priority to in-house documents or documents in in-house database
Payment
Paid placement: advertiser pays for a higher ranking
Paid inclusion: advertiser pays for inclusion in database
Doesn’t exist in popular engines |
|
|
External information
|
Offensive sites
Domain names
Balance and diversity |
|
|
Personalized Search
|
Google will pull from Web History, Google+, and other information that it has gathered about you.
|
|
|
Personalized Search
|
Personalized Search means other factors (e.g., my search and click history) could influence the ranking as well
Even more crudely:
Location (country, city, …) of searcher
Previous queries |
|
|
Each search engine uses its own method for ranking web pages based on:
|
User query
Content indexing
Popularity
Metadata
External information
Recency
Paid placement |
|
|
Ranking
|
Can be:
- Query-independent: “comminfo.rutgers.edu should be ranked higher than mornaaman.com in search results”
- Query-dependent: “mornaaman.com should be ranked higher than comminfo.rutgers.edu for the query ‘mor naaman’”
In reality, a mix |
|
|
Is Page Ranking Democratic?
|
How does Page Ranking work?
In what ways is Page Ranking democratic? Not democratic?
In what ways does Page Ranking suppress controversial or minority positions? |
|
|
Google Flu Trends
|
http://www.google.org/flutrends/us/#US
Ginsberg, J., Mohebbi, M. H., Patel, R. S., Brammer, L., Smolinski, M. S., & Brilliant, L. (2008). Detecting influenza epidemics using search engine query data. Nature, 457(7232), 1012–1014. doi:10.1038/nature07634 |
|
|
Ranking
|
Can be:
Same for all users
Individualized
In reality, a mix |
|
|
Information available for ranking
|
Words in search query
Statistics about string occurrences (words, URLs, etc.)
Popularity and authority of pages (Page Rank, HITS)
Attributes of the site/domain
User attributes: identity (and search history), location, previous queries |
|
|
Page Rank
|
Underlying intuition
The more times that a document is linked to by other pages, the more likely it is to be the one that the user wants
But it matters which pages link to it
Page Rank is Google’s technology – other engines now use comparable algorithms
And in any case, it’s a mix – Page Rank is just a small part of ranking
Can libraries use ranking like Page Rank?
Libraries: mostly not (no links!)
Academic papers have links (references), so more likely |
|
|
SEO – Search Engine Optimization
|
SEO is the practice of making a web site and its pages friendly to search engines
So that they get indexed properly
So that desired terms are deemed relevant
So that pages get to the top of the ranked lists
Why? You want to be seen. You want to make money from advertising and sales. |
|
|
Tips from Google
|
Yes, it’s legit – you can help the search engine help you…
http://www.google.com/webmasters/docs/search-engine-optimization-starter-guide.pdf
Create unique, accurate page titles
- Accurately describe the page's content
- Unique titles for each page
- Brief descriptive titles
Make use of the "description" meta tag
- Accurately summarize the page's content
URLs should describe content
- E.g., http://flickr.com/groups/landscape/ and the URL above… |
|
|
Content analysis and sitemaps
|
http://googlewebmastercentral.blogspot.com/2007/12/new-content-analysis-and-sitemap.html
|
|
|
Query-Independent Rank
|
How to boost your site’s Page Rank?
Get many pages to link to you
Better if the linking page is an “important” (high Page Rank) page!
Illegal techniques: link spamming, link exchanges, and link farms
E.g., blog comment spamming |
|
|
Query-Independent Rank (comment)
|
Is “many pages” enough? No, many IMPORTANT pages are better.
Penenberg reading: “The search engines live in a fantasy world,” Boser said. “Every link is a vote. But people buy and sell links.” |
|
|
The JCPenney case:
|
Segal, D. (2011, February 12). Search Optimization and Its Dirty Little Secrets. The New York Times. Retrieved from http://www.nytimes.com/2011/02/13/business/13search.html?src=me&ref=general
|
|
|
Google is adjusting its algorithm to lower rankings for:
|
Content farms
Sites with low levels of content |
|
|
Google’s response to Content Farms
|
Miller, C. C. (2011, February 25). New Google Search System Seeks to Weed Out Useless Results. The New York Times. Retrieved from http://www.nytimes.com/2011/02/26/technology/internet/26google.html?ref=technology
|
|
|
The Google Ad Auction
|
Introduction to the Google Ad Auction
http://www.youtube.com/watch?v=a8qQXLby4PY |
|
|
Google Adwords Cost. (n.d.). Search Engine Journal. Retrieved February 23, 2012, from
|
http://www.searchenginejournal.com/google-adwords-cost/24536/
|
|
|
Google Analytics
|
Offers information about clicks on your site. Who links to it?
http://www.google.com/analytics/
Keyword tool: https://adwords.google.com/o/Targeting/Explorer?__u=1000000000&__c=1000000000&ideaRequestType=KEYWORD_IDEAS#search.none |
|
|
Cohen (2009)
|
Full Boolean logic with use of logical operators
Implied Boolean logic with keyword searching
Boolean logic using search form terminology
http://www.internettutorials.net/boolean.asp |
|
|
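Boolean operators map naturally onto set operations over an index. The sketch below assumes a tiny hand-made index of page numbers; the data and the function names `boolean_and`, `boolean_or`, and `boolean_not` are invented for illustration and are not taken from the Cohen tutorial.

```python
# Hypothetical index: term -> set of page numbers containing it.
INDEX = {
    "cats": {1, 2, 4},
    "dogs": {2, 3, 4},
    "birds": {4, 5},
}

def pages_for(term):
    """Pages containing the term (empty set if the term is not indexed)."""
    return INDEX.get(term, set())

def boolean_and(a, b):
    """Pages containing both terms."""
    return pages_for(a) & pages_for(b)

def boolean_or(a, b):
    """Pages containing at least one of the terms."""
    return pages_for(a) | pages_for(b)

def boolean_not(a, b):
    """Pages containing the first term but not the second."""
    return pages_for(a) - pages_for(b)

print(sorted(boolean_and("cats", "dogs")))
print(sorted(boolean_or("cats", "birds")))
print(sorted(boolean_not("cats", "dogs")))
```

AND narrows a search, OR broadens it, and NOT excludes pages, which matches how these operators behave in a search form.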
Factors affecting human information search process
|
1. Information seeker – the individual’s personality, experience, and world knowledge
Who are you? Where are you in life?
How do you learn (read, listen, do)?
How do you communicate (verbal, electronic, nonverbal)?
Are you law abiding? Are you for the “underdog”? Are you “balanced”?
How do you evaluate? When does something not seem right? Do you question?
How do you research? |
2. Task – the information need in the context of the (larger) goal of the information seeker.
The task determines the information seeker’s actions
A complex task may have many subtasks
Tasks change over the course of the search process
If the outcome isn’t satisfactory, the task may need to be redefined or repeated
What is the task at hand?
What are the steps/subtasks you have to take to do this task? |
|
Factors affecting human information search process
|
3. Domain – the subject field or type of content in which the information need can best be situated
What are the domains on the issue of File Sharing?
What are the domains of information reliability?
What are the domains of this assignment? |
4. Setting – the environment, either cultural, virtual, or physical, in which the search is conducted
What’s the setting where you’ll do your research? Why?
Why does setting matter? |
|
Factors affecting human information search process
|
5. Search system – the resources or tools that the information seeker uses.
What search systems will you/should you use for this task?
What are the differences between the types of systems? |
6. Outcome – the results of the search, as indicated by the information that is found and by the change in the mental state of the information seeker.
|
|
The search process
|
Identify the information need(s)
Design a search strategy
Execute the search
Assess results
Organize and present results
Repeat? |
|