Wordsmith.org: the magic of words

Wordsmith Talk

#162187 09/27/06 02:59 AM
Joined: Apr 2000
Posts: 10,542
tsuwm Offline OP
Carpal Tunnel
It is said that if you place a million monkeys in front of a million keyboards, they will eventually produce
the works of Shakespeare. This is simply not true.
They cannot even produce an encyclopedia.

scroogle.org

#162188 09/27/06 11:57 AM
Joined: Mar 2000
Posts: 6,511
Carpal Tunnel
Great find, tsuwm.

#162189 09/27/06 11:58 AM
Joined: Mar 2000
Posts: 11,613
Carpal Tunnel
Okay--I didn't understand any kind of point that site was trying to make, but I did latch on to a question I can ask: what's a scraper, as in Google/Yahoo scraper?

#162190 09/27/06 11:59 AM
Joined: Mar 2000
Posts: 11,613
Carpal Tunnel
"Great find, tsuwm." Why, please?

#162191 09/27/06 01:05 PM
Joined: Apr 2000
Posts: 10,542
tsuwm Offline OP
Carpal Tunnel
as I understand it (not too well), Scroogle puts a level of abstraction between you and Google, which is deemed to be a good thing mostly due to Ads by Google.

"Once upon a time — actually it was a few short years ago — Google decided to sell ads. They called it "contextual advertising," and today it accounts for 99 percent of their revenue. But all is not well in Googleland. Thousands upon thousands of sites, with millions and millions of pages, had been trying to rank well in Google's search engine even before Google sold ads. Now these same spammers can get paid for their spam by slapping Google's ads on them. The entire web went downhill so fast that Google lost its ability to sort out the spammers from the real content. But Google didn't care, because their cut for all of those ads came right off the top. The more spam there was, the richer Google got. One year after going public, their market capitalization is neck-to-neck with Time Warner."

Google scraping, then, does a Google search with:
no cookies | no search-term records | access log deleted within 48 hours

#162192 09/27/06 01:19 PM
Joined: Apr 2000
Posts: 10,542
tsuwm Offline OP
Carpal Tunnel
I tried to link to the following, but it comes up on their Google Scraper page only randomly:

There are two reasons why an ad-free scraper of Google's main search results is important. One reason is personal, and the other is political.

On a personal level, your support for Scroogle says that search engines should not be tracking you and retaining this information indefinitely. Not only does Google scrape much of the web, but they keep records of who searches for what. If information about your searching is accessible by cookie ID or by your IP address, it is subject to subpoena. This is a violation of your privacy. Someday Google's data retention practices will be regulated, because Google is too arrogant to do the right thing voluntarily. In the meantime, you should not be leaving your fingerprints in Google's databases.

There are other proxies that can protect your privacy on the web. Almost all are general-purpose proxies that cloak all of your web activity behind an IP address that is not easily traced to your service provider. One is Anonymizer.com. A possible problem with this one is that the founder, Lance Cottrell, has connections with the FBI and the Voice of America. It also costs money for a reasonable level of service. Another is Tor, which is much more secure. But it is also slow, because Tor is a complicated system that needs networks of volunteers to run server software. Juvenile surfers from video pirates to rogue Wikipedia editors tend to clog free services such as Tor, which slows them down even more.

Since Scroogle does just one thing, it is fairly fast and simple. But because it does only one thing, it is vulnerable to action by Google. They could block our IP address, which would require that we relay requests to other servers that are more difficult for them to locate. They could also centralize their system more in order to better detect and throttle any outside address that does too many searches per minute. Finally, they could make minor changes in their output format on a regular basis, which would break our scraper and require frequent reprogramming. Any of the above might quickly get too complex and expensive for us, and that would be the end of Scroogle.

One action that Google is less likely to take is to serve Scroogle with a cease and desist letter. This introduces the second reason why Scroogle deserves support. As a nonprofit with a history of activism on privacy issues, it would be difficult for Google to sue us on the grounds that their search results and rankings are copyrighted. The main reason for this is that we are noncommercial. None of our sites has ever carried ads, we have zero employees, and our gross annual income is about $10,000. Our lack of commercial intent strengthens our claim that we have the right to scrape Google. It's obvious that we are doing it in the public interest.

Goobage in, Goobage out

Showing Google's results without their ads is another political statement. About 99 percent of Google's total revenue comes from ads, and these are ruining the web. Thousands of "Made for AdSense" domains are spewing garbage. Since these sites need content to trigger Google's ads, they steal it by scraping legitimate sites, or generate their own by purchasing junk from bulk writers. Meanwhile, click fraud is rampant. Zombie botnets are used to click on ads. If you cannot afford to buy a botnet from some shady character, then you can contract with someone in a country where labor is cheap. They will hire people to click on ads all day at below-minimum wage.

It's time to stop pretending that Google's revenue model is anything more than a temporary bubble, and it's time for Google to start developing more socially-responsible sources of income. Showing Google's results without the ads amounts to more public-interest advocacy. It says that the web spam situation is intolerable.

We remain vulnerable to blocking, throttling, or breaking by Google, which unfortunately is legal if they decide to stop us. But the longer Scroogle exists and the more our traffic grows, the stronger our statements become. We cannot survive many more months without at least one more server, even if Google leaves us alone. While we could apply for foundation grants, our experience tells us that foundations are about ten years behind on Internet and other high-tech issues. Any funding proposals we send out would strike them as bizarre and incomprehensible. It's not worth our time to send out proposals to foundations.

That leaves us asking lots of Scroogle users for small contributions. Searchers who prefer Scroogle are making a unique statement about important issues. Nothing else we know of is making the same points as effectively.

#162193 09/27/06 02:37 PM
Joined: Jul 2005
Posts: 1,773
Pooh-Bah
tsu: Thank you for that def, but much of the above discussion is over my head. Jackie asks, "what is scraping?" as applied to web adverts, and I also could use a nice, concise def that could be understood by the average clod (me). Is scraping Google ("scroogle"?), for instance, different from scraping any other site?


dalehileman
#162194 09/27/06 02:46 PM
Joined: Aug 2005
Posts: 3,290
Carpal Tunnel
Web scraping "refers to an application that processes the HTML of a Web page to extract data for manipulation such as converting the Web page to another format (i.e. HTML to WML). Web Scraping scripts and applications will simulate a person viewing a Web site with a browser. With these scripts you can connect to a Web page and request a page, exactly as a browser would do. The Web server will send back the page which you can then manipulate or extract specific information from." And Wikipedia's article on same.

Go figure. Removing the double quotes in the first URL fixed both.
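
For anyone who wants to see what that definition amounts to in practice, here is a minimal sketch in Python. It is only an illustration under stated assumptions: the sample page, the `LinkExtractor` class, and the `scrape_links` function are made up for this example and have nothing to do with how Scroogle or any real scraper is written. A real scraper would first fetch the page over the network the way a browser does; this sketch starts from HTML already in hand and pulls out just the links, ignoring everything else on the page.

```python
from html.parser import HTMLParser  # standard library HTML tokenizer


class LinkExtractor(HTMLParser):
    """Collects (address, link text) pairs from the <a> tags of a page."""

    def __init__(self):
        super().__init__()
        self.links = []      # finished (href, text) pairs
        self._href = None    # href of the link currently being read, if any
        self._text = []      # pieces of text seen inside that link

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self._href = dict(attrs).get("href")
            self._text = []

    def handle_data(self, data):
        # Only keep text that appears inside a link with an address.
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            self.links.append((self._href, "".join(self._text).strip()))
            self._href = None


def scrape_links(html):
    """Return every (address, link text) pair found in the HTML string."""
    parser = LinkExtractor()
    parser.feed(html)
    parser.close()
    return parser.links


# Hypothetical page standing in for what a web server would send back.
SAMPLE_PAGE = """<html><body>
<h1>Results</h1>
<a href="http://example.com/a">First result</a>
<p>Some ad text</p>
<a href="http://example.com/b">Second <b>result</b></a>
</body></html>"""

if __name__ == "__main__":
    for address, text in scrape_links(SAMPLE_PAGE):
        print(address, "|", text)
```

The point of the sketch is the filtering step: the "ad text" paragraph is on the page but never makes it into the output, which is roughly what showing search results without the ads comes down to.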

Last edited by zmjezhd; 09/29/06 01:21 AM.
#162195 09/27/06 02:54 PM
Joined: Oct 2005
Posts: 557
addict
Way back in the DOS days, we used to do "screen scraping". By monitoring the video memory while another program was running, you could "see" what characters the program was putting on the screen. If you defined what the screen looked like, e.g. customer name is on line 3 from column 5 to column 63, you could then capture data from the other program without having to read their file format.
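
A rough sketch of that idea in Python, for the curious. The DOS-era programs read video memory directly; here the screen is simply assumed to be a list of 80-character text rows, and the names `FIELDS`, `scrape_screen`, and `make_screen` are invented for the illustration. The layout table uses the same example as above: customer name on line 3, columns 5 through 63, counting from 1.

```python
# Screen layout: field name -> (line, first column, last column),
# all counted from 1, with the last column included.
FIELDS = {
    "customer_name": (3, 5, 63),
}


def scrape_screen(screen_lines, fields):
    """Pull each defined field out of a captured text screen."""
    result = {}
    for name, (line, first_col, last_col) in fields.items():
        row = screen_lines[line - 1]
        # Convert 1-based inclusive columns to a Python slice,
        # then drop the padding spaces on the right.
        result[name] = row[first_col - 1:last_col].rstrip()
    return result


def make_screen():
    """Build a hypothetical 25 x 80 screen as another program might draw it."""
    rows = [" " * 80 for _ in range(25)]
    rows[0] = "CUSTOMER RECORD".ljust(80)
    rows[2] = ("    " + "Jane Q. Example").ljust(80)  # name starts at column 5
    return rows


if __name__ == "__main__":
    print(scrape_screen(make_screen(), FIELDS))
```

The weakness is the same one the Scroogle text worries about: if the other program moves the name to a different line or column, the layout table is wrong and the scraper quietly captures garbage.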

#162196 09/27/06 05:55 PM
Joined: Jul 2005
Posts: 1,773
Pooh-Bah
zm: Thank you for those links, though on account of my unfamiliarity with the Intricacies of the Url, I'm not having much luck with the first one. If it's no trouble you might transmit it as a clickable link (lacking the vocabulary, my term for the address usually found in blue with an underline). Sorry for being so dense in these matters, and thanks again.


dalehileman

Moderated by  Jackie 
