Automating Google searches

I don’t usually write about OSINT techniques, because I think that ethics and respect should be discussed before approaching OSINT. That said, some OSINT automation techniques are harmless: if you don’t know what to search for, you won’t find anything, whether you automate the search or not.

Google Dorks

You probably already know Google Dorks: in short, they are search operators that narrow down the results. There are a lot of them, but the ones I use most are:

intitle: limits the search to pages whose title (the HTML title tag) contains a specific pattern.
inurl: limits the search to URLs that contain a specific pattern.
site: limits the search to a specific domain or TLD.
filetype: limits the search to a specific file type.

For example, site:adainese.it filetype:pdf will list all the PDF files on my website that Google has indexed.

Be aware that you should also check my robots.txt, because paths it disallows may not appear in Google’s results.
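A minimal sketch of such a check, using the urllib.robotparser module from the Python standard library. The rules below are made-up sample lines, not the real robots.txt of adainese.it, and is_allowed is a hypothetical helper name; in practice you would download the file from the target first.

```python
from urllib.robotparser import RobotFileParser

# Sample rules for illustration only (assumption: not the real file)
SAMPLE_ROBOTS = [
    "User-agent: *",
    "Disallow: /private/",
]

def is_allowed(robots_lines, url, user_agent="*"):
    """Return True if the robots.txt rules allow user_agent to fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_lines)
    return parser.can_fetch(user_agent, url)

print(is_allowed(SAMPLE_ROBOTS, "https://example.com/files/slides.pdf"))
print(is_allowed(SAMPLE_ROBOTS, "https://example.com/private/report.pdf"))
```

With these sample rules the first URL is allowed and the second is not, so a crawler that honors robots.txt would never fetch the second one.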

Manual Google searches

If you run many Google searches in a short time, you will notice that Google tries to stop you (typically with a CAPTCHA), especially if you are not signed in.

During an OSINT analysis, however, you probably have a set of queries that you want to run against each target.
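Such a set can be kept as a list of templates and expanded for every target. The following is only a sketch: the templates and the build_queries helper are hypothetical examples, to be replaced with whatever you actually look for.

```python
# Hypothetical dork templates; {target} is replaced with each domain
DORK_TEMPLATES = [
    "site:{target} filetype:pdf",
    "site:{target} inurl:login",
    "site:{target} intitle:index",
]

def build_queries(targets, templates=DORK_TEMPLATES):
    """Return one query string per (target, template) pair."""
    return [
        template.format(target=target)
        for target in targets
        for template in templates
    ]

for query in build_queries(["adainese.it"]):
    print(query)
```

Keeping the templates in one place means every target gets exactly the same checks, and the number of requests is simply the number of targets multiplied by the number of templates.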

Finally, the standard Google results page is not the most convenient data format to work with.

SerpApi

We are using SerpApi, which runs each request for us, solves CAPTCHAs when needed, and parses (scrapes) the Google results into JSON. SerpApi supports Google Search, Maps, Jobs, Product, Play Store, YouTube, and more, as well as other engines such as DuckDuckGo, Yandex, eBay, and Yahoo.

We need to create an account; the free plan gives us 100 requests per month.
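The account comes with an API key that every request must carry. A minimal sketch, assuming the key has been exported in an environment variable called SERPAPI_API_KEY (a name chosen here for illustration, not something SerpApi requires), keeps it out of scripts and shell history:

```python
import os

def load_api_key(env_var="SERPAPI_API_KEY"):
    """Read the API key from the environment, failing loudly if it is missing."""
    key = os.environ.get(env_var, "").strip()
    if not key:
        raise RuntimeError(f"Set the {env_var} environment variable first")
    return key
```

A script can then use load_api_key() wherever it would otherwise contain the key as a literal string.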

Automating Google searches

Let’s try to automate the search mentioned before: site:adainese.it filetype:pdf
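Before the query can travel inside a URL, the colons and spaces have to be encoded. The sketch below builds the request URL with the Python standard library only and sends nothing; build_search_url is a hypothetical helper, and YOUR_API_KEY is a placeholder.

```python
from urllib.parse import urlencode

def build_search_url(query, api_key, engine="google"):
    """Return the SerpApi search URL for a query (no request is sent)."""
    params = {"engine": engine, "q": query, "api_key": api_key}
    return "https://serpapi.com/search.json?" + urlencode(params)

print(build_search_url("site:adainese.it filetype:pdf", "YOUR_API_KEY"))
# https://serpapi.com/search.json?engine=google&q=site%3Aadainese.it+filetype%3Apdf&api_key=YOUR_API_KEY
```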

We can approach the problem by calling the REST API directly or by using the Python module. Let’s see both:

wget -q -O- "https://serpapi.com/search.json?engine=google&q=site%3Aadainese.it+filetype%3Apdf&api_key=25c8c8ad0cdc8f34e197327cce506d30843530b54361c16fb22ee37869acba3c" | jq
{
  "search_metadata": {
    "id": "62cc1d3fe93ff4ebb8b22f80",
    "status": "Success",
    "json_endpoint": "https://serpapi.com/searches/c6980a3075b990cb/62cc1d3fe93ff4ebb8b22f80.json",
    "created_at": "2022-07-11 12:53:19 UTC",
    "processed_at": "2022-07-11 12:53:19 UTC",
    "google_url": "https://www.google.com/search?q=site%3Aadainese.it+filetype%3Apdf&oq=site%3Aadainese.it+filetype%3Apdf&sourceid=chrome&ie=UTF-8",
    "raw_html_file": "https://serpapi.com/searches/c6980a3075b990cb/62cc1d3fe93ff4ebb8b22f80.html",
    "total_time_taken": 4.62
  },
  "search_parameters": {
    "engine": "google",
    "q": "site:adainese.it filetype:pdf",
    "google_domain": "google.com",
    "device": "desktop"
  },
  "search_information": {
    "organic_results_state": "Results for exact spelling",
    "query_displayed": "site:adainese.it filetype:pdf",
    "total_results": 8,
    "time_taken_displayed": 0.22
  },
  "organic_results": [
    {
      "position": 1,
      "title": "PoC Cyber Range Platform - Andrea Dainese",
      "link": "https://www.adainese.it/files/slides/20200922-cyberrange.pdf",
      "displayed_link": "https://www.adainese.it › 20200922-cyberrange",
      "date": "Sep 22, 2020",
      "snippet": "Rosario is a Cybersecurity professional with around 20 years of experience. Rosario is the. South Europe Cyberbit Sales Engineer and.",
      "about_this_result": {
        "source": {
          "description": "adainese.it was first indexed by Google in January 2021"
        },
        "languages": [
          "English"
        ],
        "regions": [
          "the United States"
        ]
      },
      "about_page_link": "https://www.google.com/search?q=About+https://www.adainese.it/files/slides/20200922-cyberrange.pdf&tbm=ilp&ilps=ADNMCi11pvVrwLJY0LqGhLcTwhibCc9pIw",
      "cached_page_link": "https://webcache.googleusercontent.com/search?q=cache:EhJFMGqqonQJ:https://www.adainese.it/files/slides/20200922-cyberrange.pdf+&cd=1&hl=en&ct=clnk&gl=us"
    },
[...]
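Most of what we need sits in the organic_results list. As a small sketch, the code below parses a trimmed copy of the response above (only a few fields are kept) and extracts position, title, and link; summarize is a hypothetical helper name.

```python
import json

# Trimmed copy of the response shown above (only the fields used here)
SAMPLE = """
{
  "search_information": {"total_results": 8},
  "organic_results": [
    {
      "position": 1,
      "title": "PoC Cyber Range Platform - Andrea Dainese",
      "link": "https://www.adainese.it/files/slides/20200922-cyberrange.pdf"
    }
  ]
}
"""

def summarize(result):
    """Return (position, title, link) for each organic result."""
    return [
        (item.get("position"), item.get("title"), item.get("link"))
        for item in result.get("organic_results", [])
    ]

result = json.loads(SAMPLE)
for position, title, link in summarize(result):
    print(position, title, link)
```

Using .get() with a default means a response without organic_results yields an empty list instead of raising a KeyError.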

That works, but we can do better with a script: maybe we want to download all the PDF files listed in the results. First, remember to install the Python module:

pip3 install google-search-results

Finally, customize the following script:

#!/usr/bin/env python3
import os
import json
import requests
from serpapi import GoogleSearch

params = {
  'engine': 'google',
  'q': 'site:adainese.it filetype:pdf',
  'api_key': '25c8c8ad0cdc8f34e197327cce506d30843530b54361c16fb22ee37869acba3c'
}

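# Run the search through SerpApi; get_json() returns the parsed response as a dict
search = GoogleSearch(params)
result = search.get_json()

# Keep a copy of the full response for later analysis
with open('search_result.json', 'w') as outfile:
    json.dump(result, outfile)

# Download every linked file; 'organic_results' may be missing when there are no hits
for organic_result in result.get('organic_results', []):
    link = organic_result['link']
    filename = os.path.basename(link)
    r = requests.get(link, allow_redirects=True, timeout=30)
    r.raise_for_status()  # stop on HTTP errors instead of saving an error page
    with open(filename, 'wb') as f:
        f.write(r.content)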
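The script derives the local file name with os.path.basename(link), which works for the links above but would keep a query string or percent-encoded characters if a result had them. A slightly more defensive sketch, using only the standard library (filename_from_url and the download.bin fallback are hypothetical names):

```python
import os
from urllib.parse import urlparse, unquote

def filename_from_url(url, default="download.bin"):
    """Derive a local file name from the path component of a URL."""
    # Drop the query string and fragment, then decode %XX sequences
    path = unquote(urlparse(url).path)
    name = os.path.basename(path)
    # Fall back when the URL has no usable last path segment
    if name in ("", ".", ".."):
        return default
    return name

print(filename_from_url("https://www.adainese.it/files/slides/20200922-cyberrange.pdf?v=2"))
# 20200922-cyberrange.pdf
```

Decoding before taking the base name also means an encoded slash cannot smuggle a directory into the file name.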