
Your own search engine!


turbohans
Recently I found some awesome open-source software that works as a peer-to-peer search engine. You can download and try this software from YaCy.net, and installing it on an Ubuntu system is very easy. To install on an Ubuntu server or desktop, you just need to add their repository line to your sources list and install it with an "apt-get" command.

This software can be configured to run in a private or public cluster, and it can even run on a federated index using Apache Solr. So you can choose whether to serve results from only your own private websites or contribute to a global distributed search engine.
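
For anyone wondering where that choice lives: the network mode is picked during setup and can be changed later from the admin pages at http://localhost:8090, or by editing the peer's config file. At least on my install it looks like the snippet below; double-check the key and file names in your own yacy.conf before copying it.

Code:
# in DATA/SETTINGS/yacy.conf -- restart YaCy after changing
# public peer-to-peer "freeworld" network:
network.unit.definition=defaults/yacy.network.freeworld.unit
# or a private intranet-only index instead:
#network.unit.definition=defaults/yacy.network.intranet.unit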

Pretty cool, eh?
 
Pretty cool I guess. But we don't need any more crawlers on the web, thank you very much. :)
 
Interesting. I just installed it remotely on my Debian desktop; I'll play with it when I get back.

Add the following to your /etc/apt/sources.list:


Code:
deb http://debian.yacy.net ./
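
Then refresh apt and pull in the package. It's named "yacy" in their repo if I remember right, and you may also need to import YaCy's signing key from their site first:

Code:
$ sudo apt-get update
$ sudo apt-get install yacy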
 
I guess I should have mentioned that YaCy is not necessarily just a web crawler. I found the most use in its ability to catalog local resources where needed. With that you can search through shared files, FTP, and other resources on a local network. (As an example, it would be very useful at an institutional level, replacing the front end most colleges use for their Apache Solr index.)

If you were to use it as a (private) web crawler, though, it could be integrated into websites to bridge the gap between searches done on separate platforms. For example, Overclockers here gives you separate search results depending on whether you use the search on the front page or the forum; YaCy would allow results from both sides while still looking the same as it does now.
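
To show what I mean about integration: YaCy answers searches over plain HTTP, so a site could pull results server-side and render them in its own theme. Something like the following against a local peer should work; I'm going off the yacysearch.json API here (resource=local keeps results to your own index), so double-check the parameter names on the YaCy wiki.

Code:
$ curl "http://localhost:8090/yacysearch.json?query=overclocking&resource=local&maximumRecords=10"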

There are also a few other features that are not even available with Google. So far I am impressed; I'm not looking to replace Google, just to do things a little differently. :)
 
It would be interesting to see this become a distributed computing problem, actually. Lend your resources as a web crawler to some centralized open source service and see how good the search results end up.
 
That is exactly how it works if you run it in public mode. The only difference in that respect is that they have no teams set up yet. :) Also, there is no "main" server; the 'nodes' are designed to load-balance themselves and display results as a collective while running publicly.

There is, however, a stats site. The stats page is currently in German, but you can find me on page #1. :comp:
http://www.yacystats.de/peers24.html < pg. #1 at the very top LMAO!


;)


I was also going to mention: if you guys don't like using port 8090 to access it, you can set up a proxy pass with Apache very easily.

Code:
# proxy_http is needed for http:// ProxyPass targets
$ sudo a2enmod proxy proxy_http

Create a file in "/etc/apache2/sites-available/" named something like "yacy.example.com.conf" or "search.example.com.conf", using the modified contents below. (Change the ServerName to match the domain you want it to answer on.)
Code:
<VirtualHost _default_:80>

	ServerAdmin [email protected]
	# This is what you want to change to your domain-name
	ServerName search.example.com 

	# Put this in the main section of your configuration (or desired virtual host, if using Apache virtual hosts)
	ProxyPreserveHost On
 
	<Proxy *>
	    Order deny,allow
	    Allow from all
	</Proxy>
 
	ProxyPass / http://localhost:8090/
	ProxyPassReverse / http://localhost:8090/
	
	<Location />
    	Order allow,deny
    	Allow from all
	</Location>
	
	ErrorLog ${APACHE_LOG_DIR}/error.log

	# Possible values include: debug, info, notice, warn, error, crit,
	# alert, emerg.
	LogLevel warn

	CustomLog ${APACHE_LOG_DIR}/access.log combined

	Alias /doc/ "/usr/share/doc/"
    	<Directory "/usr/share/doc/">
        Options Indexes MultiViews FollowSymLinks
        AllowOverride All
        Order deny,allow
        Deny from all
        Allow from 127.0.0.0/255.0.0.0 ::1/128
   	 </Directory>

</VirtualHost>

Enable it:
Code:
$ sudo a2ensite yacy.example.com.conf

Then restart Apache:
Code:
$ sudo service apache2 restart
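
If everything took, a request to the new name should come back from YaCy instead of the Apache default page (swap in your own ServerName):

Code:
$ curl -I http://search.example.com/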
 
Wow. This is really neat. Imagine a world in which search engines were just a minuscule burden shared between all websites. You could actually have searchability without all the pressure from advertisers to track every last little thing.
 
I agree, and YaCy does a good job of doing just that. As far as YaCy is concerned, all pages are created equal, and no paid results will show up. YaCy really does not use much bandwidth either; it is robots.txt-compliant and will generally look to other peers before indexing a website. Another cool thing, I guess, is that somebody could actually do offline searches with a well-established YaCy peer.
 