Automating Google searches

Andrea Dainese
July 17, 2022

I usually don’t write about OSINT techniques, because I think that before approaching OSINT we should talk about ethics and respect. That said, some OSINT automation techniques are harmless: if you don’t know what to search for, you won’t find anything, regardless of whether you automate or not.

Google Dorks

We probably already know Google Dorks: in short, they are search operators that narrow down the results. There are a lot of them, but the ones I use most are:

  • intitle: limit the search to pages that contain a specific pattern in the title (the HTML title tag).
  • inurl: limit the search to URLs that contain a specific pattern.
  • site: limit the search to a specific domain or TLD.
  • filetype: limit the search to a specific file type.

For example: site:adainese.it filetype:pdf will display all the PDF files on my website that Google has indexed.
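Since a dork is just a string of operator:value pairs, it can be assembled programmatically. The following is a minimal sketch; build_dork is a hypothetical helper name, not part of any library.

```python
def build_dork(terms="", **operators):
    """Join free-text terms and operator:value pairs into one Google query."""
    parts = [terms] if terms else []
    for name, value in operators.items():
        value = str(value)
        # Quote values containing spaces so they are searched as a phrase.
        if " " in value:
            value = f'"{value}"'
        parts.append(f"{name}:{value}")
    return " ".join(parts)


query = build_dork(site="adainese.it", filetype="pdf")
print(query)  # site:adainese.it filetype:pdf
```

Keyword arguments keep their order, so the operators appear in the query in the order they were passed.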

Be aware that you should also check my robots.txt: paths excluded there may not show up in Google results.

Manual Google searches

If you run many Google searches in a short timeframe, you will see that Google tries to interrupt you (typically with a CAPTCHA), especially if you are not authenticated.

However, during OSINT analysis you probably have a set of requests that you want to run on each target.

Finally, the standard Google output is probably not the most usable data format in the world.
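The idea of a fixed set of requests to run against each target can be sketched as a list of query templates expanded per domain. The templates and the second target below are illustrative assumptions, not a recommended checklist.

```python
from itertools import product

# Hypothetical query templates; {target} is replaced with each domain.
TEMPLATES = [
    "site:{target} filetype:pdf",
    "site:{target} inurl:login",
    "site:{target} intitle:index",
]


def queries_for(targets, templates=TEMPLATES):
    """Return (target, query) pairs: every template applied to every target."""
    return [(t, tpl.format(target=t)) for t, tpl in product(targets, templates)]


for target, query in queries_for(["adainese.it", "example.com"]):
    print(target, "->", query)
```

Each pair can then be passed to whatever search client is used, which keeps the list of checks separate from the code that executes them.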

SerpApi

We are using SerpApi, which basically executes each request on our behalf, solves CAPTCHAs when needed, and parses (scrapes) the Google results into JSON. SerpApi supports Google Search, Maps, Jobs, Product, Play Store, YouTube… It also supports DuckDuckGo, Yandex, eBay, Yahoo…

We need to create an account; the free plan includes 100 requests per month.
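With a limited monthly quota, it may be worth not paying twice for the same query. The sketch below stores each result on disk, keyed by a hash of the query; cached_search and fake_fetch are hypothetical names, and fake_fetch only stands in for a real API call.

```python
import hashlib
import json
import tempfile
from pathlib import Path


def cached_search(query, fetch, cache_dir):
    """Return the cached result for query, calling fetch(query) only on a miss."""
    cache_dir = Path(cache_dir)
    cache_dir.mkdir(parents=True, exist_ok=True)
    key = hashlib.sha256(query.encode("utf-8")).hexdigest()
    path = cache_dir / f"{key}.json"
    if path.exists():
        return json.loads(path.read_text(encoding="utf-8"))
    result = fetch(query)
    path.write_text(json.dumps(result), encoding="utf-8")
    return result


# Stand-in for the real API call, used only to demonstrate the cache.
calls = []


def fake_fetch(query):
    calls.append(query)
    return {"q": query, "organic_results": []}


with tempfile.TemporaryDirectory() as tmp:
    first = cached_search("site:adainese.it filetype:pdf", fake_fetch, tmp)
    second = cached_search("site:adainese.it filetype:pdf", fake_fetch, tmp)

print(len(calls))  # 1
```

The second call is served from disk, so only one request would count against the quota.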

Automating Google searches

Let’s try to automate the search mentioned before: site:adainese.it filetype:pdf
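Whichever client is used, the query has to be percent-encoded before it goes into a URL. Here is a small sketch using only the standard library; build_search_url is a hypothetical helper and YOUR_API_KEY is a placeholder.

```python
from urllib.parse import urlencode

ENDPOINT = "https://serpapi.com/search.json"


def build_search_url(query, api_key, engine="google"):
    """Build the request URL; urlencode turns ':' into %3A and spaces into '+'."""
    params = {"engine": engine, "q": query, "api_key": api_key}
    return f"{ENDPOINT}?{urlencode(params)}"


url = build_search_url("site:adainese.it filetype:pdf", "YOUR_API_KEY")
print(url)
# https://serpapi.com/search.json?engine=google&q=site%3Aadainese.it+filetype%3Apdf&api_key=YOUR_API_KEY
```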

We can approach the problem by calling the REST API directly or by using the Python module. Let’s see both:

wget -q -O- "https://serpapi.com/search.json?engine=google&q=site%3Aadainese.it+filetype%3Apdf&api_key=25c8c8ad0cdc8f34e197327cce506d30843530b54361c16fb22ee37869acba3c" | jq
{
  "search_metadata": {
    "id": "62cc1d3fe93ff4ebb8b22f80",
    "status": "Success",
    "json_endpoint": "https://serpapi.com/searches/c6980a3075b990cb/62cc1d3fe93ff4ebb8b22f80.json",
    "created_at": "2022-07-11 12:53:19 UTC",
    "processed_at": "2022-07-11 12:53:19 UTC",
    "google_url": "https://www.google.com/search?q=site%3Aadainese.it+filetype%3Apdf&oq=site%3Aadainese.it+filetype%3Apdf&sourceid=chrome&ie=UTF-8",
    "raw_html_file": "https://serpapi.com/searches/c6980a3075b990cb/62cc1d3fe93ff4ebb8b22f80.html",
    "total_time_taken": 4.62
  },
  "search_parameters": {
    "engine": "google",
    "q": "site:adainese.it filetype:pdf",
    "google_domain": "google.com",
    "device": "desktop"
  },
  "search_information": {
    "organic_results_state": "Results for exact spelling",
    "query_displayed": "site:adainese.it filetype:pdf",
    "total_results": 8,
    "time_taken_displayed": 0.22
  },
  "organic_results": [
    {
      "position": 1,
      "title": "PoC Cyber Range Platform - Andrea Dainese",
      "link": "https://www.adainese.it/files/slides/20200922-cyberrange.pdf",
      "displayed_link": "https://www.adainese.it › 20200922-cyberrange",
      "date": "Sep 22, 2020",
      "snippet": "Rosario is a Cybersecurity professional with around 20 years of experience. Rosario is the. South Europe Cyberbit Sales Engineer and.",
      "about_this_result": {
        "source": {
          "description": "adainese.it was first indexed by Google in January 2021"
        },
        "languages": [
          "English"
        ],
        "regions": [
          "the United States"
        ]
      },
      "about_page_link": "https://www.google.com/search?q=About+https://www.adainese.it/files/slides/20200922-cyberrange.pdf&tbm=ilp&ilps=ADNMCi11pvVrwLJY0LqGhLcTwhibCc9pIw",
      "cached_page_link": "https://webcache.googleusercontent.com/search?q=cache:EhJFMGqqonQJ:https://www.adainese.it/files/slides/20200922-cyberrange.pdf+&cd=1&hl=en&ct=clnk&gl=us"
    },
[...]
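The fields we usually care about sit under organic_results. As a rough sketch, the snippet below pulls position, title and link out of a response; the sample is a trimmed copy of the output above, and summarize is a hypothetical helper.

```python
import json

# Trimmed copy of the response shown above, kept inline so the sketch runs offline.
sample = json.loads("""
{
  "search_information": {"total_results": 8},
  "organic_results": [
    {
      "position": 1,
      "title": "PoC Cyber Range Platform - Andrea Dainese",
      "link": "https://www.adainese.it/files/slides/20200922-cyberrange.pdf",
      "date": "Sep 22, 2020"
    }
  ]
}
""")


def summarize(result):
    """Return (position, title, link) tuples; tolerate missing keys."""
    rows = []
    for item in result.get("organic_results", []):
        rows.append((item.get("position"), item.get("title"), item.get("link")))
    return rows


for position, title, link in summarize(sample):
    print(f"{position}\t{title}\t{link}")
```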

Good, but we can do better with a script. Maybe we want to download all the PDF files included in the result. First, remember to install the Python module:

pip3 install google-search-results

Finally, customize the following script:

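#!/usr/bin/env python3
"""Run a Google dork through SerpApi and download every file in the results."""
import json
import os
from urllib.parse import urlparse

import requests
from serpapi import GoogleSearch

params = {
    'engine': 'google',
    'q': 'site:adainese.it filetype:pdf',
    # Replace with your own SerpApi key.
    'api_key': '25c8c8ad0cdc8f34e197327cce506d30843530b54361c16fb22ee37869acba3c',
}

# Execute the search and get the parsed response.
search = GoogleSearch(params)
result = search.get_json()

# Save the raw response for future reference.
with open('search_result.json', 'w') as outfile:
    json.dump(result, outfile)

# Download each result; 'organic_results' may be absent if nothing was found.
for organic_result in result.get('organic_results', []):
    link = organic_result['link']
    # Use the URL path only, so a query string does not end up in the filename.
    filename = os.path.basename(urlparse(link).path)
    if not filename:
        continue
    r = requests.get(link, allow_redirects=True, timeout=30)
    r.raise_for_status()
    with open(filename, 'wb') as pdf_file:
        pdf_file.write(r.content)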

We will get the JSON (for future reference) and all the PDF files downloaded into the current directory.
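Because the files land in the current directory under a name taken from the URL, it helps to derive that name carefully. This is a minimal sketch, assuming we only want the last path segment; filename_from_url and the download.bin fallback are hypothetical choices.

```python
import posixpath
from urllib.parse import unquote, urlparse


def filename_from_url(url, default="download.bin"):
    """Return the last path segment of url, ignoring query string and fragment."""
    # Decode first, so an encoded slash cannot smuggle a directory into the name.
    path = unquote(urlparse(url).path)
    name = posixpath.basename(path)
    return name or default


print(filename_from_url(
    "https://www.adainese.it/files/slides/20200922-cyberrange.pdf"))
# 20200922-cyberrange.pdf
```

URLs that end in a slash have no last segment, which is why the fallback name exists.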

You probably know that acting this way is pretty… noisy: every download is a request from your IP address that ends up in the target’s web server logs. You should evaluate countermeasures to remain hidden.
