Looking for Search Engine / Crawler Script.

I have tried searching for “Search Engine Script”, but I can find nothing. I am aware of things like Azizi and others, but I just want a “ready to go” system (other than waiting for the database to build).

There seem to be scripts for almost everything else - I even found a reference to a PHP search engine which linked back to CodeCanyon - but the page said it had been removed.

Does anyone on here sell a SaaS crawler with a search front end, please?

A search engine for what exactly?

If you mean the entire world wide web, stop right there. Google’s search crawlers process hundreds of petabytes of data each day. There’s no ready-to-go script in existence that can operate at multi-data-center scales like that, and your credit card certainly won’t pay for that amount of storage.

There are also a few problems with expecting a web crawler in a CodeCanyon script. The most pressing one is that PHP is one of the worst possible language choices for such a system. Yet with any other language, your requirement of “ready to go” will not be satisfied.

Another major issue is that crawling the web is extremely complicated and requires a tremendous amount of work and effort to do correctly. If you ever manage to find this effort consolidated into a $35 stock item, then I would be convinced that AI has completely replaced enterprise software development.

Now, if you just want to crawl a specific list of websites, I could see a use case for that, but you really should use the Google Search API with the site: operator instead. There are plenty of scripts (including Azizi) that put your own search interface on top of Google Search in that way.
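For example, something along these lines (a rough sketch using the Custom Search JSON API; the API key and search engine ID are placeholders you would have to create yourself):

```python
import requests

# Placeholders: substitute your own API key and Programmable Search Engine ID
API_KEY = "YOUR_API_KEY"
CX = "YOUR_SEARCH_ENGINE_ID"

def site_search(query, site):
    # Restrict results to a single site by prefixing the query with site:
    params = {"key": API_KEY, "cx": CX, "q": f"site:{site} {query}"}
    resp = requests.get("https://www.googleapis.com/customsearch/v1", params=params)
    resp.raise_for_status()
    return [item["link"] for item in resp.json().get("items", [])]

print(site_search("crawler", "example.com"))
```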


No, I am not trying to index the World - but you might be staggered to know that Google did not start out with even a single 40U rack of servers, let alone a petabyte-handling data centre.

In fact, with a handy 2.5Gbit connection to play with, a Kemp LoadMaster and some low-end servers with 40TB of storage in RAID 10, I might even have slightly better kit to start off with than they did.

My aims are rather more modest, though I do want people to be able to “add URL” and for the engine to slowly build up its own database. Ideally each server would give priority to client searches and then use reserve capacity to “trawl” for new pages and to index specific targets.

I am sure your knowledge of code is vastly superior to mine - but even I have managed to write a very crude crawler which can read pages, extract text and then find more links to follow. It is just horribly crude, slow and clunky, and I am certain that other people have done a vastly better job than me. It is complicated, but I think if a coder told me it was “extremely complicated”, I would just wonder if I had the right person.
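For what it is worth, the loop I am describing boils down to roughly this - a simplified sketch (using Python’s requests and BeautifulSoup), not my actual code:

```python
import requests
from bs4 import BeautifulSoup
from urllib.parse import urljoin

def crawl(seed, limit=50):
    # Very crude breadth-first crawl: fetch a page, keep its text, queue its links
    queue, seen, pages = [seed], {seed}, {}
    while queue and len(pages) < limit:
        url = queue.pop(0)
        try:
            resp = requests.get(url, timeout=10)
        except requests.RequestException:
            continue
        soup = BeautifulSoup(resp.text, "html.parser")
        pages[url] = soup.get_text(" ", strip=True)      # the text to index later
        for a in soup.find_all("a", href=True):          # find more links to follow
            link = urljoin(url, a["href"])
            if link.startswith("http") and link not in seen:
                seen.add(link)
                queue.append(link)
    return pages
```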

I am not hooked on PHP - I simply said that was the only script I had found, though it no longer seemed to be for sale. As for being “ready to go”, that is after I have paid someone to install it for me. I just don’t want to pay for a “half finished job”.

I am also not saying my upper limit is $35. Quite why you are making so many assumptions is strange. My last purchase of code was around $1800 for the “base pack” plus around another $2500 worth of customisation - an ongoing project with, I expect, another $2000 - $4000 of work needed before it is as well integrated into another project as I want.

I don’t want to just use Google - or Bing - I want a small search engine that can crawl for itself, hence not wanting the generic “use an API and stick your badge on the front” solution.

As I stated, I know of Azizi - but I want to build my own database as well as being able to use an API in the early days.

Still I am sure you thought you were being constructive in pointing out that Google are huge and that storing petabytes of data is not cheap, so thank you.

You seem very set on this goal, so I won’t try to dissuade you any further. In my opinion, it’s not a feasible project except in the short term. Nothing is impossible though.

I wouldn’t expect to see something like this on a stock marketplace for any price point. If you ever do, I would urge extreme caution, as you will require frequent updates and probably support as well. New scripts will be extremely buggy, because the number of edge cases when crawling across the web (particularly for search engine indexing purposes) is staggering.

Let me explain some potential hurdles.

First, the reason I say PHP is a horrible language choice for this kind of crawling is that it can only do one thing at a time. There are ways around this, but it’s still not great. Something like Node.js is simple to program with and can crawl hundreds of pages simultaneously, though even that is far from the most efficient option.
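To illustrate the concurrency point - the same idea Node.js gives you, sketched here in Python with asyncio and aiohttp rather than in Node:

```python
import asyncio
import aiohttp

async def fetch(session, url):
    # Each fetch yields control while waiting on the network,
    # so many downloads can be in flight at the same time
    async with session.get(url, timeout=aiohttp.ClientTimeout(total=15)) as resp:
        return url, await resp.text()

async def crawl_batch(urls):
    async with aiohttp.ClientSession() as session:
        return await asyncio.gather(*(fetch(session, u) for u in urls),
                                    return_exceptions=True)

pages = asyncio.run(crawl_batch(["https://example.com/"] * 5))
```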

You should know that many websites require a web browser to render correctly. Google has switched to crawling with something like Puppeteer (essentially a headless Chromium browser) so that it can see client-side rendered websites. It is possible for a script to do this, but you’ll exhaust your server’s resources quickly as you begin to scale up.
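Puppeteer is a Node library; Playwright is a rough equivalent if you end up in Python. A minimal sketch of rendering a page before extracting its HTML (assuming Playwright and its browser binaries are installed):

```python
from playwright.sync_api import sync_playwright

def render(url):
    # Launch headless Chromium, let the page's JavaScript run, then grab the HTML
    with sync_playwright() as p:
        browser = p.chromium.launch(headless=True)
        page = browser.new_page()
        page.goto(url, wait_until="networkidle")
        html = page.content()
        browser.close()
    return html

print(len(render("https://example.com/")))
```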

Additionally, you will quickly run into issues with your user agent. Cloudflare may start to block you. You will need to apply with them to be treated as a trusted bot, which is a fairly simple process, but you’ll need to get off the ground and build a reputation first.
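At the very least, identify your crawler with a descriptive user agent and honour robots.txt. A sketch using Python’s standard library - the bot name and contact URL are placeholders:

```python
import requests
from urllib.parse import urlparse
from urllib.robotparser import RobotFileParser

# Placeholder identity: use your real bot name and a page explaining what it does
USER_AGENT = "MySearchBot/0.1 (+https://example.com/bot-info)"

def polite_get(url):
    # Check the host's robots.txt before fetching the page
    root = "{0.scheme}://{0.netloc}".format(urlparse(url))
    rp = RobotFileParser(root + "/robots.txt")
    rp.read()
    if not rp.can_fetch(USER_AGENT, url):
        return None  # disallowed by robots.txt
    return requests.get(url, headers={"User-Agent": USER_AGENT}, timeout=10)
```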

A database for this kind of project can get extremely demanding. It’s not as simple as storing a copy of each page’s text: indices must be built from fragments of the text, that data adds up quickly, and it can require significant compute power to both insert and query.
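The usual structure behind this is an inverted index, mapping each term to the pages that contain it. A toy version, just to show the shape of the data:

```python
from collections import defaultdict

index = defaultdict(set)   # term -> set of page URLs containing it

def add_page(url, text):
    # Tokenise crudely and record which pages each term appears on
    for term in text.lower().split():
        index[term].add(url)

def search(query):
    # Intersect the posting sets of every query term
    postings = [index.get(t, set()) for t in query.lower().split()]
    return set.intersection(*postings) if postings else set()

add_page("https://example.com/a", "fresh bread recipes")
add_page("https://example.com/b", "bread making machines")
print(search("bread recipes"))   # {'https://example.com/a'}
```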

If your search engine ever does maintain a large number of pages, it will most likely be crawling 24/7. Remember that you need to periodically re-fetch the pages already in your index as well. This means constant inserts and updates to the database.

The problem with constant “upserts” into the database is that they often block queries. This can get quite serious quite quickly, and it is a fundamental reason that search engines don’t update their user-facing databases in real time. In the beginning, one strategy may be to maintain two or three databases: one for reads, one for writes, and one to buffer and sync between them nightly.
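As a very rough sketch of that buffer-and-sync idea (SQLite here purely for illustration; a real deployment would use something heavier): the crawler only ever writes to one database, and a nightly job copies what has accumulated into the database that serves searches.

```python
import sqlite3

SCHEMA = "CREATE TABLE IF NOT EXISTS pages (url TEXT PRIMARY KEY, body TEXT)"

def record_crawl(url, body):
    # The crawler only ever touches the write-side database
    with sqlite3.connect("write.db") as conn:
        conn.execute(SCHEMA)
        conn.execute("INSERT OR REPLACE INTO pages VALUES (?, ?)", (url, body))

def nightly_sync():
    # Copy the day's crawl results into the read-side database in one pass,
    # then clear the write side so it only ever holds the latest delta
    src = sqlite3.connect("write.db")
    src.execute(SCHEMA)
    rows = src.execute("SELECT url, body FROM pages").fetchall()
    with sqlite3.connect("read.db") as dst:
        dst.execute(SCHEMA)
        dst.executemany("INSERT OR REPLACE INTO pages VALUES (?, ?)", rows)
    with src:
        src.execute("DELETE FROM pages")
    src.close()
```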

If I imagine something like this being sold as a stock script, I wouldn’t expect it to scale very far. It would most likely not account for these hurdles, which you can face fairly early on. It may be worthwhile consulting with some freelancers, but oversight will be required to make sure they build something that can scale well, depending on how much you intend to grow.

In other words, the platform you require is ultimately going to depend on how big you want to be able to scale. I’m not aware of any scripts on CodeCanyon that fit this at any scale, but regardless of what you find, you should analyze it carefully with your future in mind.


Thank you. There are some really useful comments and thoughts there; I appreciate it.

The refreshing had occurred to me, though I admit I was thinking that “low ranking pages” would probably be re-indexed fairly infrequently - possibly just using “spare capacity” on one of the client servers. So its main task is to take search requests and fetch data - but if there is plenty of capacity, then it can do some crawling duty while traffic is light.
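Roughly what I have in mind - a back-of-envelope sketch, with the intervals being nothing more than made-up numbers: the higher a page ranks, the shorter its recrawl interval, and the crawler only picks up whatever is most overdue when the server is otherwise idle.

```python
import time

def recrawl_interval(rank_score):
    # Made-up thresholds: popular pages refresh daily, middling weekly, obscure monthly
    if rank_score > 0.8:
        return 86_400          # one day
    if rank_score > 0.3:
        return 604_800         # one week
    return 2_592_000           # thirty days

def most_overdue(pages, now=None):
    # pages: list of dicts with "url", "rank_score" and "last_crawled" (unix time)
    now = now or time.time()
    overdue = [p for p in pages
               if now - p["last_crawled"] > recrawl_interval(p["rank_score"])]
    return sorted(overdue, key=lambda p: now - p["last_crawled"], reverse=True)
```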

I am not expecting hundreds of thousands of searches per hour - let alone millions per second - so most of the time, I think the server(s) are going to be idling along.

The comments about updates to the database are interesting. I had thought record locking would have been sufficient, but it seems I might have greatly oversimplified things in my mind here. The idea of having multiple databases is certainly a thought which had occurred to me - e.g. sports, music, science, patents, how, why, recipes, images, etc.

Thank you for your thoughts and comments, I appreciate it.


I’ve been looking into this also. I’ve found one service, and another person I have just emailed. Otherwise, you’ll need to pay to have it programmed. There are free templates, but plenty of programming/developer knowledge is required. In total I have found two ad-ready scripts, but whether they reply is another matter. Let me know if you have any luck, and I’ll do the same.

Use Scrapy - it is a Python crawling framework used by many large companies, and plenty of site SEO crawlers are built with it.
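A minimal spider is something like this - just a sketch, and your own parsing and follow rules will differ:

```python
import scrapy

class PageSpider(scrapy.Spider):
    name = "pages"
    start_urls = ["https://example.com/"]  # placeholder seed URL

    def parse(self, response):
        # Store the bits of this page you want to index
        yield {
            "url": response.url,
            "title": response.css("title::text").get(),
        }
        # Queue every link on the page for crawling
        for href in response.css("a::attr(href)").getall():
            yield response.follow(href, callback=self.parse)
```

Save it as pages_spider.py and run it with scrapy runspider pages_spider.py -o pages.json.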

Good tips - they’ll be helpful to me.