Retrieving and Evaluating Electronic Information

by steve-rutgers, Feb. 2012

Subjects: and electronic evaluating information retrieving rutgers university

85 Cards in this Set

Front
Back
3rd side (hint)

Why is it hard to find information?	1) information overload 2) Not all information is accessible from one source 3) Lack of systematic organization 4) We don't know what we don't know.
Why this class is important	1. The way knowledge is stored, shared and preserved is changing rapidly 2. Professional opportunities -Information management -Web design / information architect -Electronic commerce -Search engine business 3. Career advancement -Job searching -Reporting to superiors -Management skills 4. Life…
Why we need to develop techniques for thinking about whether information is reliable	1) How do we know who to trust? 2) Can we tell different kinds of information apart? Facts; opinions; misinformation (everyone makes mistakes sometimes); disinformation (deceptive or downright false); different points of view 3) Things change…
Critical thinking	The use of logical thinking and reliable information to recognize the true (or possibly true) from the false (or possibly false).
Components of Google Search page	Ads “Organic” results Snippets/Surrogates Location based results Other options Help
“Organic” Results	Appear because they are determined by the search engine’s algorithm to be relevant to the query
Anatomy of Organic Results	Title Snippet URL Site navigation links Cache and other options
Snippets/Surrogates	Can’t show the entire page of results Snippets allow for quick review of whether there is relevance An entire “science” of how to generate them Mostly based on your keywords and structure of pages
Different kinds of search results	Search results now include results for news, videos, and other sources Different than “organic results” E.g., news may be ranked by time
Dede (2008)	Read this!!!! Dede, C. (2008). A Seismic Shift in Epistemology. Educause Review, (May/June 2008). http://www.educause.edu/EDUCAUSE+Review/EDUCAUSEReviewMagazineVolume43/ASeismicShiftinEpistemology/162892
Quickly write 5-10 words or phrases that describe some of our ITI 220 students	Can use what people are wearing, features such as hair color or length, etc.
The components of search	Crawling, Indexing, (Ranking and Matching)
Web Crawler	Aka “spider” A program that… Visits web pages (usually) Downloads the content Saves the content or extracted information in a database If a page is not “crawled”, it will never appear in search results!
Simple Crawling Strategies Questions	How does the crawler know which pages to go to? How are starting pages chosen? In what order does it follow the links on those pages?
Simple Crawling Strategies	Breadth-first search. (2010). Wikipedia, the free encyclopedia. Retrieved 2/16/10 from http://en.wikipedia.org/wiki/Breadth-first_search. Depth-first search. (2010). Wikipedia, the free encyclopedia. Retrieved 2/16/10 from http://en.wikipedia.org/wiki/Depth-first_search.
The Link Graph	See slides for BFS and DFS.
Crawl Limits	On the web, you can’t crawl forever Limitations could be duration (time limits), computing, storage… Limit has an effect on the availability, but even the limit strategy has an effect: E.g. “only 5 steps” E.g., “only depth of 1 in each site”
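The two crawling strategies and the effect of a crawl limit can be sketched on a toy link graph. This is a minimal illustration, not the course's slides: the `LINKS` graph, the `crawl` function, and the `max_depth` parameter are all hypothetical names chosen for the example.

```python
from collections import deque

# Hypothetical link graph: page -> pages it links to.
LINKS = {
    "A": ["B", "C"],
    "B": ["D", "E"],
    "C": ["F"],
    "D": [],
    "E": ["F"],
    "F": [],
}

def crawl(start, strategy="bfs", max_depth=None):
    """Return pages in the order a simple crawler would visit them."""
    frontier = deque([(start, 0)])
    visited = []
    while frontier:
        # BFS takes the oldest discovered link, DFS the newest.
        page, depth = frontier.popleft() if strategy == "bfs" else frontier.pop()
        if page in visited:
            continue
        visited.append(page)
        if max_depth is not None and depth >= max_depth:
            continue  # crawl limit reached: do not follow this page's links
        links = LINKS.get(page, [])
        if strategy == "dfs":
            links = reversed(links)  # so the first link on the page is explored first
        frontier.extend((nxt, depth + 1) for nxt in links)
    return visited

print(crawl("A", "bfs"))               # breadth-first order
print(crawl("A", "dfs"))               # depth-first order
print(crawl("A", "bfs", max_depth=1))  # "only 1 step" limit
```

With the one-step limit, pages D, E, and F are never reached, which is one way a limit strategy changes what ends up in the engine's database.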
Crawling results do vary!	Different crawling strategies mean different pages added to the engine database I.e., there could be differences in what’s available in Yahoo/Google/MS/… Other reasons for differences: size, duration, and time of crawl
Revisiting <> Crawling	“Crawling” includes re-visiting of sites that the engine thinks may change Or maybe are not likely to change but if they do, it is important This continuous crawl may follow only previously-known pages (i.e., no new links)
Why no search engine indexes all web pages	Quantity of information Law of diminishing returns – more information is not always better Web is changing all the time Web sites inaccessible to search engines
Other hurdles for crawling	Need a login? Google doesn’t have one… Databases and other special-access sites Robots.txt: excluding crawlers from your site http://www.robotstxt.org/robotstxt.html
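A well-behaved crawler checks robots.txt before fetching a page. The sketch below uses Python's standard `urllib.robotparser` module; the robots.txt contents, the `ExampleBot` agent name, and the example.com URLs are made up for illustration.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt contents; a real crawler would download the
# file from the site's /robots.txt URL instead.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

def allowed(url, agent="ExampleBot"):
    """True if the rules above let this (hypothetical) crawler fetch the URL."""
    return parser.can_fetch(agent, url)

print(allowed("http://example.com/index.html"))
print(allowed("http://example.com/private/report.html"))
```

Note that robots.txt is a request to crawlers, not an access control: pages it excludes are still reachable by anyone with the URL.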
The Web – Visible and invisible	Visible web: Web pages accessible to search engines Invisible web (sometimes called “deep web”): Web pages that cannot be accessed by search engines
Why many web pages are “invisible”	Dynamically created information Non-text format Tables and databases, Statistical data, Images, … Proprietary or non-indexed file format Owner deliberately blocks search engine access Passwords and fees (research, journals, jit info) Software designed to block search engines Firewalls … Fail to satisfy requirements of a specific search engine http://www.google.com/webmasters/guidelines.html
Webpages missed because of crawling strategy	Webpage not well linked to other webpages Result of search engine crawling strategy
Visible to Google, not to You	Google may get special access to crawl a site, even if it’s not publicly accessible You may see results that you cannot access without login E.g., portal.acm.org
“Invisible” sites and services	WayBack Machine http://webdev.archive.org/ (me/CNN) Specialized databases (free) White pages, yellow pages Articles – http://www.findarticles.com US Trademarks - http://www.uspto.gov/ Specialized databases (paid, but you have access to) Most of the journal articles/databases available from: http://www.libraries.rutgers.edu/rul/indexes/indexes.shtml http://www.jerseyclicks.org
CAPTCHAS AND RECAPTCHA	CAPTCHA: "Completely Automated Public Turing test to tell Computers and Humans Apart." Examples: http://en.wikipedia.org/wiki/CAPTCHA reCAPTCHA: http://recaptcha.net/learnmore.html
What and Why? Indexing	Indexing is needed to find query matches in fastest time possible How to find all* web pages with specific terms in them? Not unlike a book index
Book Index	How is it similar to search engine indexing? How is it different?
Search Engine Indexes	Text word, and which documents it appears in E.g.: SC&I: http://www.comminfo.rutgers.edu/ http://library.sccsc.edu/comminfo/ http://cissl.comminfo.rutgers.edu/ http://www.ocnj.org/pages/comss/scils/comminfo.htm http://www3.lehigh.edu/engineering/cse/news/seminars/fall08.asp
Inverted Index	A list of terms, and for each term, the pages it appears on (represented here by numbers)
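A minimal sketch of building such an index, assuming three made-up pages numbered 1 to 3 (`PAGES` and `build_inverted_index` are illustrative names, not from the course):

```python
from collections import defaultdict

# Hypothetical pages, keyed by page number as on the card.
PAGES = {
    1: "cats and dogs",
    2: "dogs chase cats",
    3: "birds sing",
}

def build_inverted_index(pages):
    """Map each term to the sorted list of page numbers it appears on."""
    index = defaultdict(list)
    for page_id in sorted(pages):
        # set() so a page is listed once even if the term repeats on it
        for term in set(pages[page_id].lower().split()):
            index[term].append(page_id)
    return dict(index)

INDEX = build_inverted_index(PAGES)
print(INDEX["cats"])
print(INDEX["birds"])
```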
Index is huge – what do you do?	Finding the pages that have the terms “cats” and “dogs”: Finding “cats” and “dogs” amongst many other terms in index Getting the very long list of pages for each of the terms, comparing them Many tricks, e.g. sorting
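One of the sorting tricks can be sketched as follows: if each term's page list is kept sorted, the two lists can be compared in a single pass instead of checking every pair. The page numbers in `CATS` and `DOGS` are invented for the example.

```python
def intersect(postings_a, postings_b):
    """Merge-style intersection of two sorted lists of page numbers."""
    i = j = 0
    both = []
    while i < len(postings_a) and j < len(postings_b):
        if postings_a[i] == postings_b[j]:
            both.append(postings_a[i])
            i += 1
            j += 1
        elif postings_a[i] < postings_b[j]:
            i += 1  # advance whichever list is behind
        else:
            j += 1
    return both

# Hypothetical sorted page lists for the two terms.
CATS = [1, 4, 7, 9, 12]
DOGS = [2, 4, 9, 10, 12, 15]

print(intersect(CATS, DOGS))
```

Each list is walked at most once, so the cost grows with the combined length of the two lists rather than their product.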
A more complicated inverted index	A list of terms, and for each term, the pages it appears on and the location (e.g., word number) on that page
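Storing word positions lets an engine check whether two terms sit next to each other, for example for a phrase query. A rough sketch, with made-up pages and hypothetical helper names:

```python
from collections import defaultdict

# Hypothetical pages; positions are word numbers starting at 0.
PAGES = {
    1: "new york city",
    2: "york is not new",
    3: "a new day in new york",
}

def build_positional_index(pages):
    """term -> {page number -> [word positions on that page]}"""
    index = defaultdict(lambda: defaultdict(list))
    for page_id, text in pages.items():
        for position, term in enumerate(text.lower().split()):
            index[term][page_id].append(position)
    return index

def phrase_pages(index, first, second):
    """Pages where `second` appears immediately after `first`."""
    hits = []
    for page_id, positions in index.get(first, {}).items():
        following = set(index.get(second, {}).get(page_id, []))
        if any(p + 1 in following for p in positions):
            hits.append(page_id)
    return sorted(hits)

POSITIONAL = build_positional_index(PAGES)
print(phrase_pages(POSITIONAL, "new", "york"))
```

Page 2 contains both words but not as an adjacent pair, so a plain inverted index would match it while the positional check does not.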
Curating the index terms	Stemming Multiple words can be stemmed to one cats=>cat etc Stop words Words that are too popular sometimes not added to index (e.g. A, An, Any, Be, Is, No, Of, …) One reason why engines ignore certain words
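A toy illustration of both steps. The stop-word set contains only the words listed on the card, and `naive_stem` is a deliberately crude stand-in (it only strips a plural "s"); real stemmers such as the Porter stemmer apply many more rules.

```python
# Stop words taken from the card's examples; real lists are longer.
STOP_WORDS = {"a", "an", "any", "be", "is", "no", "of"}

def naive_stem(word):
    """Toy stemmer: strips a trailing plural 's' (cats -> cat)."""
    if len(word) > 3 and word.endswith("s") and not word.endswith("ss"):
        return word[:-1]
    return word

def index_terms(text):
    """Lowercase, drop stop words, then stem what is left."""
    words = text.lower().split()
    return [naive_stem(w) for w in words if w not in STOP_WORDS]

print(index_terms("Any cats of a glass house"))
```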
Words that are not visible on Page	Engine may add hidden text (not visible in article) to the index E.g., “meta tags” Often-used “spamming” trick
Words that are not on Page	Anchor text: on and around the link to the page E.g., “approachable” could be added to my page’s index! If they exist, these terms add to the term relevance for my page (“mor naaman”)
Google-bombing	“Miserable Failure” How Google-bombing works http://google.about.com/gi/o.htm?zi=1/XJ&zTi=1&sdn=google&cdn=compute&tm=14&f=10&su=p504.1.336.ip_&tt=2&bt=1&bts=1&zu=http%3A//oldfashionedpatriot.blogspot.com/2003_10_01_oldfashionedpatriot_archive.html%23106727416975886151
Google on Google Bombing	Google: “People have asked about how we feel about Googlebombs, ... Because these pranks are normally for phrases that are well off the beaten path, they haven’t been a very high priority for us. But over time, we’ve seen more people assume that they are Google’s opinion, or that Google has hand-coded the results for these Googlebombed queries. That’s not true, and… trying to correct that misperception… [we] came up with an algorithm that minimizes the impact of many Googlebombs” http://googlewebmastercentral.blogspot.com/2007/01/quick-word-about-googlebombs.html
Index: temporary summary	The indexing technique allows a search engine to determine very quickly which web pages a certain term (or terms) appears on Indexing may include terms that do not appear on the page! Remember, engines only index the content they crawl
Index and Relevance	The inverted term index can help the engine determine not only whether a term appears, but also judge relevancy
Determining page relevance	Position and prominence on page Frequency on page
Term frequency	Intuitively: in any given document, a term that occurs many times is likely to be more important than a term that occurs fewer times Often used in comparison with “document frequency” E.g. words like “the” appear a lot on any page in English… The text-based ranking techniques were all we had until…Google
HTML Tags	<Title> Meta tags: Provide information about the document <META NAME="KEYWORDS" CONTENT="Homepage, laptop"> <META NAME="TITLE" CONTENT="Dell - Official Website - Learn about Dell's laptops, desktops, monitors, printers plus computer electronics & accessories. "> <Meta Keywords = …>
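How an indexer might pull the title and meta tags out of a page can be sketched with Python's standard `html.parser` module. The `MetaCollector` class and the sample page are hypothetical; this is not how any particular engine does it.

```python
from html.parser import HTMLParser

class MetaCollector(HTMLParser):
    """Collects the <title> text and name/content pairs from <meta> tags."""

    def __init__(self):
        super().__init__()
        self.meta = {}
        self.title = ""
        self._in_title = False

    def handle_starttag(self, tag, attrs):
        # HTMLParser lowercases tag and attribute names for us.
        attrs = dict(attrs)
        if tag == "title":
            self._in_title = True
        elif tag == "meta" and "name" in attrs:
            self.meta[attrs["name"].lower()] = attrs.get("content", "")

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False

    def handle_data(self, data):
        if self._in_title:
            self.title += data

# Hypothetical page for illustration.
SAMPLE = (
    "<html><head><title>Example Store</title>"
    '<META NAME="KEYWORDS" CONTENT="Homepage, laptop">'
    "</head><body>Welcome</body></html>"
)

collector = MetaCollector()
collector.feed(SAMPLE)
print(collector.title)
print(collector.meta)
```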
Index and Relevance	The inverted term index can help the engine determine not only whether a term appears, but also judge relevance
Document frequency	For a given collection of documents, the number of documents that the term occurs in Intuition – if one term occurs in very few documents, its presence in those few documents is likely to be important Prefer that term
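Term frequency and document frequency can be computed side by side on a tiny collection. The three documents in `DOCS` are invented; the function names are illustrative.

```python
from collections import Counter

# Hypothetical three-document collection.
DOCS = {
    "d1": "the cat sat on the mat",
    "d2": "the dog chased the cat",
    "d3": "the bird sang",
}

def term_frequency(term, doc_id):
    """How many times the term occurs in one document."""
    return Counter(DOCS[doc_id].split())[term]

def document_frequency(term):
    """How many documents in the collection contain the term."""
    return sum(1 for text in DOCS.values() if term in text.split())

print(term_frequency("the", "d1"), document_frequency("the"))
print(term_frequency("bird", "d3"), document_frequency("bird"))
```

"the" has a high term frequency in d1 but appears in every document, so it says little about d1; "bird" occurs in only one document, so its presence there is informative.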
Ranking	Can be: Query independent: “scils.rutgers.edu should be ranked higher than mornaaman.com in search results” Query-dependent: “mornaaman.com should be ranked higher than scils.rutgers.edu for the query ‘mor naaman’” In reality, a mix
Information available for ranking	Words in search query Statistics about string occurrences (words, URLs, etc.) Popularity and authority (Page rank, HITS) Information about the site/domain Commercial factors External information
Relevance ranking by document frequency	Term occurs in 1 out of 100 documents (highest) 5 out of 100 documents (higher) 100 out of 100 documents (high) E.g. query for Shitzu Puppy Shitzu : 1M pages Puppy: 13M pages Give “shitzu” more weight in ranking
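One common way to turn this intuition into a number is inverse document frequency, log(N / df). The cards do not name a formula, so treat this as an assumption; the total of one billion pages in `TOTAL_PAGES` is also made up, since the card gives only the per-term counts.

```python
import math

def idf(doc_freq, total_docs):
    """Inverse document frequency: rarer terms get a larger weight."""
    return math.log(total_docs / doc_freq)

# The 100-document example from the card.
print(idf(1, 100), idf(5, 100), idf(100, 100))

# The puppy query, assuming a hypothetical collection size.
TOTAL_PAGES = 1_000_000_000
weight_shitzu = idf(1_000_000, TOTAL_PAGES)
weight_puppy = idf(13_000_000, TOTAL_PAGES)
print(weight_shitzu > weight_puppy)
```

Under this formula a term that appears in every document gets weight zero, which is stricter than the card's "high" label for the 100-out-of-100 case.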
And then… Google	Google introduced the insight that a web page can have an “importance” metric based on the importance of the pages that link to it (Yes, recursive definition) E.g., White House page more “important” than other pages, even if another page mentions the phrase 4000 times
Google Page Rank	Uses the structure of the web to “objectively” determine importance of pages
Page Rank	Underlying intuition -The more times that a document is linked to by other pages, the more likely it is to be the one that the user wants -But it matters which pages link to it
Page ranking concept	Given a web page, A, Inlinks – links from other docs pointing to page A Outlinks- links from A pointing to other docs
Why Does it Matter Who Links?	Page A about Afghanistan was linked from: cnn.com, nytimes.com, scils.rutgers.edu, washingtonpost.com,… 100 links in total Page B about Afghanistan was linked from: jonahlevypersonalwebsite.com, justawebsitenobodylooksat.com, anotherpagesomewhere.com,.. 1000 links in total Which page is more authoritative?
It’s not just how many links…	It’s how many pages link to page A... and how many pages link to the pages that link to page A… and how many pages link to those pages… etc.
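The recursive idea can be sketched as a repeated calculation: every page passes a share of its current score to the pages it links to, and the process is repeated until the scores settle. This is a simplified textbook-style version, not Google's actual ranking; the link graph, the `damping` value of 0.85, and the iteration count are assumptions for the example.

```python
# Hypothetical link graph: page -> pages it links to.
LINKS = {
    "A": ["B", "C"],
    "B": ["C"],
    "C": ["A"],
    "D": ["C"],
}

def page_rank(links, damping=0.85, iterations=50):
    """Simplified Page Rank by repeated score passing."""
    pages = list(links)
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}
    for _ in range(iterations):
        # Every page keeps a small baseline score.
        new_rank = {p: (1.0 - damping) / n for p in pages}
        for page, outlinks in links.items():
            if not outlinks:
                # A page with no outlinks spreads its score evenly.
                for p in pages:
                    new_rank[p] += damping * rank[page] / n
            else:
                share = damping * rank[page] / len(outlinks)
                for target in outlinks:
                    new_rank[target] += share
        rank = new_rank
    return rank

RANKS = page_rank(LINKS)
for page, score in sorted(RANKS.items(), key=lambda item: -item[1]):
    print(page, round(score, 3))
```

Page D links out but nothing links to it, so it ends up with only the baseline score; page A has a single inlink, but that link comes from the highest-scoring page, which lifts A above B.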
Other Ranking Factors	Popularity and relevance feedback Websites that get the most hits in a certain period of time Search engines monitor popular searches and downgrade pages that don’t often get clicked Recency
Page Rank as “Voting”	Web site creators, site administrators, and other publishers can “vote” for a page by linking to it Oops… “Voting” is not necessarily “democratic”! Not everyone can vote More about that when we discuss bias
Other Factors (Ranking)	Priority to in-house documents or documents in in-house database Payment Paid placement: advertiser pays for a higher ranking Paid inclusion: advertiser pays for inclusion in database Doesn’t exist in popular engines
External information	Offensive sites Domain names Balance and diversity
Personalized Search	Google will pull from Web History, Google +, and other information that they gathered about you.
Personalized Search	Personalized Search means other factors (e.g., my search and click history) could influence the ranking as well Even more crudely: Location (country, city, …) of searcher Previous queries
Each search engine uses its own method for ranking web pages based on:	User query Content indexing Popularity Metadata External information Recency Paid placement
Ranking	Can be: -Query independent: “comminfo.rutgers.edu should be ranked higher than mornaaman.com in search results” -Query-dependent: “mornaaman.com should be ranked higher than comminfo.rutgers.edu for the query ‘mor naaman’” In reality, a mix
Other Ranking Factors	Popularity and relevance feedback Websites that get the most hits in a certain period of time Search engines monitor popular searches and downgrade pages that don’t often get clicked Recency
Other Factors (ranking)	Priority to in-house documents or documents in in-house database Payment Paid placement: advertiser pays for a higher ranking Paid inclusion: advertiser pays for inclusion in database Doesn’t exist in popular engines
Is Page Ranking Democratic?	How does Page Ranking work? In what ways is Page Ranking democratic? not democratic? In what ways does page ranking suppress controversial or minority positions?
Google Flu Trends	http://www.google.org/flutrends/us/#US Ginsberg, J., Mohebbi, M. H., Patel, R. S., Brammer, L., Smolinski, M. S., & Brilliant, L. (2008). Detecting influenza epidemics using search engine query data. Nature, 457(7232), 1012–1014. doi:10.1038/nature07634
Ranking	Can be: Same for all users Individualized In reality, a mix
Information available for ranking	Words in search query Statistics about string occurrences (words, URLs, etc.) Popularity and authority of pages (Page rank, HITS) Attributes of the site/domain User attributes: Identity (and search history), location, previous queries
Page Rank	Underlying intuition The more times that a document is linked to by other pages, the more likely it is to be the one that the user wants But it matters which pages link to it Page Rank is Google’s technology – other engines now use comparable algorithms Can libraries use ranking like Page Rank? And in any case, it’s a mix – Page Rank is just a small part of ranking Libraries: mostly not (no links!) Academic Papers have links (references) so more likely
SEO – Search Engine Optimization	SEO is the practice of making a web site and web pages Friendly to search engines So they get indexed properly So that desired terms are deemed relevant Get pages to the top of the ranked lists Why? You want to be seen. You want to make money from advertising, sales.
Tips from Google	Yes, it’s legit – you can help the search engine help you… http://www.google.com/webmasters/docs/search-engine-optimization-starter-guide.pdf Create unique, accurate page titles Accurately describe the page's content Unique titles for each page Brief descriptive titles Make use of the "description" meta tag Accurately summarize the page's content URLs should describe content E.g, http://flickr.com/groups/landscape/ and the URL above…
Content analysis and sitemaps	http://googlewebmastercentral.blogspot.com/2007/12/new-content-analysis-and-sitemap.html
Query-Independent Rank	How to boost your site’s Page Rank? Get many pages to link to you Better be an “important” (high page rank) page! Illegal techniques: Link Spamming, Link Exchange and Link farms E.g., blog comment spamming
Query-Independent Rank (comment)	Is “many pages” enough? No, many IMPORTANT pages better. Penenberg reading: "The search engines live in a fantasy world," Boser said. "Every link is a vote. But people buy and sell links."
The JCPenney case:	Segal, D. (2011, February 12). Search Optimization and Its Dirty Little Secrets. The New York Times. Retrieved from http://www.nytimes.com/2011/02/13/business/13search.html?src=me&ref=general
Google is adjusting its algorithm to lower rankings for:	Content farms Sites with low levels of content
Google’s response to Content Farms	Miller, C. C. (2011, February 25). New Google Search System Seeks to Weed Out Useless Results. The New York Times. Retrieved from http://www.nytimes.com/2011/02/26/technology/internet/26google.html?ref=technology
The Google Ad Auction	Introduction to the Google Ad Auction http://www.youtube.com/watch?v=a8qQXLby4PY
Google Adwords Cost. (n.d.). Search Engine Journal. Retrieved February 23, 2012, from	http://www.searchenginejournal.com/google-adwords-cost/24536/
Google Analytics	Offers information about clicks on your site. Who links to it? http://www.google.com/analytics/ Keyword tool https://adwords.google.com/o/Targeting/Explorer?__u=1000000000&__c=1000000000&ideaRequestType=KEYWORD_IDEAS#search.none
Cohen (2009)	Full Boolean logic with use of logical operators Implied Boolean logic with keyword searching Boolean logic using search form terminology http://www.internettutorials.net/boolean.asp
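The three Boolean operators map directly onto set operations over the pages each term appears on. A small sketch with an invented index (`INDEX` and the page numbers are hypothetical):

```python
# Hypothetical index: term -> set of page numbers containing it.
INDEX = {
    "cats": {1, 2, 5},
    "dogs": {2, 3, 5},
    "birds": {4, 5},
}

cats_and_dogs = INDEX["cats"] & INDEX["dogs"]     # cats AND dogs
cats_or_dogs = INDEX["cats"] | INDEX["dogs"]      # cats OR dogs
cats_not_birds = INDEX["cats"] - INDEX["birds"]   # cats NOT birds

print(sorted(cats_and_dogs))
print(sorted(cats_or_dogs))
print(sorted(cats_not_birds))
```

AND narrows a search, OR broadens it, and NOT excludes pages; implied Boolean in a keyword search box usually behaves like AND between the words.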
Factors affecting human information search process	Information seeker – individual’s personality, experience and world knowledge Who are you/Where are you in life? How do you learn? (read, listen, do)? How do you communicate? (verbal, electronic, nonverbal) Are you law abiding? Are you for the “underdog”? Are you “balanced”? How do you evaluate? How does something not seem right? Do you question? How do you research?	Task – the information need in the context of the (larger) goal of information seeker. The task determines the information seeker’s actions A complex task may have many subtasks Tasks change over the course of the search process If the outcome isn’t satisfactory, the task may need to be redefined or repeated What is the task at hand? What are the steps/subtasks you have to take to do this task?
Factors affecting human information search process	Domain – the subject field or type of content in which the information need can best be situated What are the domains on the issue of File Sharing? What are the domains of information reliability? What are the domains of this assignment?	Setting – the environment, either cultural, virtual, or physical, in which the search is conducted What’s the setting where you’ll do your research? Why? Why does setting matter?
Factors affecting human information search process	5. Search system – the resources or tools that the information seeker uses. What Search systems will you/should you use for this task? What are the differences between the types of systems?	6. Outcome – the results of the search, as indicated by the information that is found and by the change in the mental state of the information seeker.
The search process	Identify the information need(s) Design a search strategy Execute the search Assess results Organize and present results Repeat?
