How to build your own (topic-specific) search engine: Web Crawler, Meta Search Engine
Q: "I need to explore multiple search engines for information and URLs on relevant medical equipment such as chemistry analyzers and surgical tables."
A: "This topic is more complex than it looks, and Manuel's recommendation needs some clarification."
So, please let me share my thoughts about this topic.
The horribly simple fact is: I cannot suggest a "compact solution" or package, but I would like to do some advertising (sorry). For further development on this, I simply need money ;-)
For now, here is some deeper guidance. If you follow it, I believe your problem is easy to resolve.
First, what you are actually looking for is a "Meta Search Engine" or a "Subject-Specific Web Crawler" (for your damned pharmacy topic, in this case).
If you just use the Google API as suggested by Manuel, you will run into restrictions and privacy issues. First, Google will track all activity of your client (this will probably also happen if you use the examples I suggest below and stay legal, but never mind for now). Second, Google will personalize results for your client, which will affect your SERP.
Finally, a more complex web crawler will enable you, first, to include sources from more than one provider in your SERPs; second, to fake user agents, user agent sessions, IPs, local settings, and proxies; and third, to process the fetched search results further: spider the web pages listed on the search engines' result pages, parse and analyze them against your own meta search engine's topic-specific criteria, follow links, and finally build your own pharma database based on your web crawl.
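As a minimal sketch of the first point (results from more than one provider in one SERP), the following Python merges per-provider result lists and removes duplicates by normalized URL. The provider names, the result fields ("url", "title"), and the ranking rule are assumptions made for illustration; this is not how any particular meta search engine does it.

```python
from urllib.parse import urlsplit, urlunsplit


def normalize_url(url):
    """Make URLs comparable: lowercase scheme and host, drop fragment and trailing slash."""
    parts = urlsplit(url.strip())
    path = parts.path.rstrip("/") or "/"
    return urlunsplit((parts.scheme.lower(), parts.netloc.lower(), path, parts.query, ""))


def merge_results(results_by_provider):
    """Merge {provider: [result, ...]} into one list with one entry per distinct URL.

    Each merged entry records which providers returned the URL and the best
    (lowest) rank it reached in any provider's list.
    """
    merged = {}
    for provider, results in results_by_provider.items():
        for rank, item in enumerate(results, start=1):
            key = normalize_url(item["url"])
            entry = merged.setdefault(
                key,
                {"url": key, "title": item["title"], "providers": [], "best_rank": rank},
            )
            entry["providers"].append(provider)
            entry["best_rank"] = min(entry["best_rank"], rank)
    # Assumed ranking rule: URLs found by more providers first, then by best rank.
    return sorted(merged.values(), key=lambda e: (-len(e["providers"]), e["best_rank"]))


if __name__ == "__main__":
    # Hypothetical sample data; real results would come from the provider requests.
    sample = {
        "provider_a": [
            {"url": "https://Example.com/analyzers/", "title": "Chemistry analyzers"},
            {"url": "https://example.org/tables", "title": "Surgical tables"},
        ],
        "provider_b": [
            {"url": "https://example.org/tables#top", "title": "Surgical tables"},
            {"url": "https://example.com/analyzers", "title": "Chemistry analyzers"},
        ],
    }
    for entry in merge_results(sample):
        print(entry["url"], entry["providers"])
```

Recording the providers per URL keeps the option open to weight trusted sources more heavily later.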
Also remember that there are many sources, such as topic-specific RSS feeds, that you will want to spider, moderate, and add to your search engine manually!
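Reading such a feed could look roughly like the following Python sketch, which uses only the standard library. The feed content, the URLs, and the keyword list are made up for illustration, and a real feed would be fetched over HTTP first.

```python
import xml.etree.ElementTree as ET

# Hypothetical RSS 2.0 document standing in for a fetched feed.
SAMPLE_FEED = """<?xml version="1.0"?>
<rss version="2.0">
  <channel>
    <title>Hypothetical medical equipment news</title>
    <item>
      <title>New chemistry analyzer announced</title>
      <link>https://example.com/news/analyzer</link>
    </item>
    <item>
      <title>Hospital cafeteria reopens</title>
      <link>https://example.com/news/cafeteria</link>
    </item>
  </channel>
</rss>"""


def parse_rss_items(xml_text):
    """Return a list of {"title", "link"} dicts for every <item> that has a link."""
    # Note: for untrusted feeds a hardened XML parser is advisable.
    root = ET.fromstring(xml_text)
    items = []
    for item in root.iter("item"):
        title = (item.findtext("title") or "").strip()
        link = (item.findtext("link") or "").strip()
        if link:
            items.append({"title": title, "link": link})
    return items


def filter_by_keywords(items, keywords):
    """Keep only items whose title contains at least one keyword (case-insensitive)."""
    wanted = [k.lower() for k in keywords]
    return [i for i in items if any(k in i["title"].lower() for k in wanted)]


if __name__ == "__main__":
    relevant = filter_by_keywords(parse_rss_items(SAMPLE_FEED), ["analyzer", "surgical"])
    for entry in relevant:
        print(entry["title"], entry["link"])
```

The keyword filter is only a crude stand-in for the manual moderation step mentioned above.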
Once you have combined a number of search engines, found websites and ways to access them (by meta search engine, fake client, or just the common APIs), you will want to process all that data, analyze text and meta tags, and provide prepared views and search results to your end user.
On the one hand, several common solutions are available now in 2016 to do so. On the other hand, I have to report that our conjecture about the SERPs being delivered asynchronously IS NOT ACTUALLY TRUE: I HAVE a meta search engine in use and it CAN fetch the HTML by requesting ?q=queryparameters, without any asynchronous client-side /#!HASHBANGS needed!
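To illustrate the plain ?q= request style, here is a small Python sketch that builds such a query URL and extracts the links from a returned HTML page. The host search.example and the sample markup are placeholders, not a real search engine, and no network request is made here.

```python
from html.parser import HTMLParser
from urllib.parse import urlencode, urljoin


def build_search_url(base_url, query):
    """Build a plain GET search URL of the form <base_url>?q=<query>."""
    return base_url + "?" + urlencode({"q": query})


class LinkExtractor(HTMLParser):
    """Collect absolute URLs from all <a href="..."> tags in a page."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        href = dict(attrs).get("href")
        # Skip in-page anchors and script pseudo-links.
        if href and not href.startswith(("#", "javascript:")):
            self.links.append(urljoin(self.base_url, href))


def extract_links(html, base_url):
    """Return the absolute link targets found in an HTML string."""
    parser = LinkExtractor(base_url)
    parser.feed(html)
    parser.close()
    return parser.links


if __name__ == "__main__":
    url = build_search_url("https://search.example/", "chemistry analyzer")
    # Hypothetical result page; a real one would be the response body for `url`.
    page = '<a href="/result/1">One</a> <a href="https://example.org/tables">Two</a> <a href="#top">Top</a>'
    print(url)
    print(extract_links(page, url))
```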
REQUESTS (e.g. when running a search):
- A "normal" request to Google search, like /?q=searchterm
- A request to each of the links found in the result page of the previous request to Google (and processing its links later...)
- Cached requests to the official Bing API for text results, images, videos, ...
- Periodic automated requests to a few selected breaking news portals
- Periodic automated requests to a few selected RSS feeds
- Query the web crawler's result DB
- Query a lot of internal DBs and tables, e.g. domain-specific and user-generated data stores
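The "cached requests" item in the list above can be sketched as a small time-to-live cache in Python. The class name, the 60-second lifetime, and the fetch function are assumptions for illustration; a real setup would more likely keep the cache in a database or on disk.

```python
import time


class TTLCache:
    """Cache fetched values per key and refetch only after ttl_seconds have passed."""

    def __init__(self, ttl_seconds, clock=time.monotonic):
        self.ttl = ttl_seconds
        self.clock = clock  # injectable so the expiry logic can be tested without waiting
        self._store = {}    # key -> (timestamp, value)

    def get_or_fetch(self, key, fetch):
        """Return the cached value for key, calling fetch(key) if it is missing or expired."""
        now = self.clock()
        hit = self._store.get(key)
        if hit is not None and now - hit[0] < self.ttl:
            return hit[1]
        value = fetch(key)
        self._store[key] = (now, value)
        return value


if __name__ == "__main__":
    def fake_api_call(query):
        # Placeholder for a real API request.
        print("fetching", query)
        return ["result for " + query]

    cache = TTLCache(ttl_seconds=60)
    cache.get_or_fetch("surgical tables", fake_api_call)  # triggers a fetch
    cache.get_or_fetch("surgical tables", fake_api_call)  # served from the cache
```

Caching like this also helps to stay within the request limits that official APIs usually impose.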
PREPROCESSING of the results:
- Split the results into topics/modules
- Perform SQL FULLTEXT search on the results
- Parse and analyze text and HTML/meta tags of the found pages
- Configurable ranking of the results based on the criteria
- Calculate the "SEO performance" of a few selected sites
- Compute the most important buzzwords of the breaking news headlines of the moment
- Extract links for later crawling
- Search form autocomplete
- OpenSearchDescription.xml (to register the search engine in the browser)
- No "faking clients" or "special tricks": when requesting sites, the user agent and the IP identify my crawler as my metacrawler. Everything is legal and fair!
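Two of the steps in the list above, the buzzword computation and the configurable ranking, can be sketched in a few lines of Python. The stopword list, the field names ("title", "body"), and the weights are assumptions chosen for illustration; the actual criteria are not described here.

```python
import re
from collections import Counter

# Tiny illustrative stopword list; a real one would be much longer.
STOPWORDS = {"the", "a", "an", "of", "for", "and", "in", "to", "on", "with", "is"}


def buzzwords(headlines, top_n=5):
    """Return the top_n most frequent non-stopword terms as (word, count) pairs."""
    words = []
    for headline in headlines:
        for word in re.findall(r"[a-z0-9]+", headline.lower()):
            if len(word) > 2 and word not in STOPWORDS:
                words.append(word)
    return Counter(words).most_common(top_n)


def score(page, query_terms, weights):
    """Weighted count of query-term occurrences across the configured page fields."""
    total = 0.0
    for field, weight in weights.items():
        text = page.get(field, "").lower()
        total += weight * sum(text.count(term.lower()) for term in query_terms)
    return total


def rank(pages, query_terms, weights):
    """Sort pages by descending score; the weights dict is the configurable part."""
    return sorted(pages, key=lambda p: score(p, query_terms, weights), reverse=True)


if __name__ == "__main__":
    headlines = [
        "Analyzer recall announced",
        "New analyzer for hospitals",
        "Hospitals order surgical tables",
    ]
    print(buzzwords(headlines, top_n=2))

    pages = [
        {"url": "a", "title": "Surgical tables", "body": "tables and more tables"},
        {"url": "b", "title": "Chemistry analyzer", "body": "an analyzer for labs"},
    ]
    ranked = rank(pages, ["analyzer"], {"title": 3.0, "body": 1.0})
    print([p["url"] for p in ranked])
```

Keeping the weights in a plain dictionary means the ranking can be tuned per topic without touching the scoring code.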
Although your request to build a search for "medical, chemistry, surgical products" sounds a bit scary to me, never mind: I am qualified and currently available for hire.
To get started building your own Web-Crawler, I recommend the following package to you:
Using the Http Client Class by Manuel Lemos you will be able to develop a search engine crawler client or any other kind of bot or proxy in PHP with ease.
You will find many other helpful classes for querying websites, SEO, and using APIs (e.g. Google APIs) on phpclasses.org.
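The recommended class is a PHP package and its API is not shown here. Independently of it, the idea of a crawler client that openly identifies itself (no "faking clients", as stated in the list above) can be sketched with Python's standard library. The user agent string, the bot-info URL, and the sample robots.txt are hypothetical, and checking robots.txt is an addition of this sketch; nothing is sent over the network.

```python
from urllib.request import Request
from urllib.robotparser import RobotFileParser

# Hypothetical, honest user agent that tells site owners who is crawling.
USER_AGENT = "ExampleMetaCrawler/0.1 (+https://example.com/bot-info)"

# Hypothetical robots.txt content; a real crawler would download it from the site.
SAMPLE_ROBOTS_TXT = """User-agent: *
Disallow: /private/
"""


def load_robots(robots_txt):
    """Parse robots.txt text into a RobotFileParser."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


def build_request(url, robots):
    """Return a Request carrying the honest user agent, or None if robots.txt disallows the URL."""
    if not robots.can_fetch(USER_AGENT, url):
        return None
    return Request(url, headers={"User-Agent": USER_AGENT})


if __name__ == "__main__":
    robots = load_robots(SAMPLE_ROBOTS_TXT)
    for url in ("https://example.org/products", "https://example.org/private/data"):
        request = build_request(url, robots)
        print(url, "allowed" if request is not None else "disallowed")
    # An allowed request could then be sent with urllib.request.urlopen(request).
```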
Created by WEBFAN (Monday 1st of August 2016 03:29:24 PM)
in the category Knowledge Base as a static page
Published/approved: Monday 1st of August 2016 04:56:17 PM by WEBFAN
Last modified: Monday 1st of August 2016 04:56:17 PM by WEBFAN