Using BigQuery CLI for Honeypot Log Analysis
Last updated: 2021-09-20
Curious about BigQuery? This quick tutorial will help you get started. BigQuery is a Google Cloud Platform (GCP) service that provides fast, serverless, scalable data analysis. This tutorial covers the basics of working with BigQuery: creating datasets and tables, uploading data to them, and running queries with the BigQuery command-line interface tool, bq.
Prerequisites
To follow along, you’ll need the following:
- The Google Cloud SDK components installed
- An authenticated connection to your GCP account (see the command after this list)
- Familiarity with SQL statements
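If you haven’t authenticated yet, one way to do so is with gcloud auth login, which opens a browser window for you to sign in to your Google account:
gcloud auth login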
Create a Temporary Project
Create a temporary project while you get familiar with BigQuery. Think of a project as a container object that holds all the GCP resources assigned to it. The PROJECT_ID is the name of the project, and it must be unique across all existing GCP projects. For example, I found the PROJECT_ID “sample” was already taken but “sample01” was available.
gcloud projects list
gcloud projects create PROJECT_ID
Once the project creation is successful, verify by listing your current working projects. Then set the current working project to your new PROJECT_ID.
gcloud projects list
gcloud config set project PROJECT_ID
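To confirm the switch took effect, you can print the active project:
gcloud config get-value project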
Create a new Dataset
First, we’ll create a new dataset named sample.
$ bq mk sample
Dataset 'PROJECT_ID:sample' successfully created.
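You can also confirm the dataset exists by listing the datasets in the current project:
$ bq ls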
Upload a Table
Now it’s time to upload data to a table for analysis. In a previous post I mentioned parsing the Docker logs with egrep and sed to pull out source IP addresses and the username and password attempts made by attackers; a rough sketch of that kind of extraction is shown next.
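This is a minimal, illustrative sketch only: the container name honeypot is a placeholder, and it assumes Cowrie-style log lines such as login attempt [root/123456] failed. Adjust the pattern to your honeypot’s actual log format.
# Assumes log lines like: ... login attempt [USER/PASSWORD] failed
# "honeypot" is a placeholder container name.
docker logs honeypot 2>&1 \
  | egrep "login attempt" \
  | sed -E 's|.*login attempt \[[^/]*/([^]]*)\].*|\1|' > passwords.txt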
For this exercise, we’ll use my public gist. Download this file and upload the data to a table in your dataset using the command pattern bq load DATASET_ID.TABLE_ID FILENAME FIELD1:DATA_TYPE, as shown below.
$ bq load sample.passwords passwords.txt password:string
Upload complete.
Waiting on bqjob_r4c9403a495b4f4ad_00000177757b1dc7_1 ... (0s) Current status: DONE
View the available tables in the dataset sample.
$ bq ls sample
tableId Type Labels Time Partitioning Clustered Fields
----------- ------- -------- ------------------- ------------------
passwords TABLE
And view the table schema you created, represented earlier as FIELD1:DATA_TYPE.
$ bq show sample.passwords
Table malwaremily:sample.passwords
Last modified Schema Total Rows Total Bytes Expiration Time Partitioning Clustered Fields Labels
----------------- --------------------- ------------ ------------- ------------ ------------------- ------------------ --------
05 Feb 22:55:41 |- password: string 537 3329
Run Queries
Woo! Now we can run some queries to pull information. Queries follow the general pattern bq query "SQL_STATEMENT".
Top 10 most common password attempts
We can count the top 10 password occurrences with COUNT.
$ bq query "SELECT COUNT(password) AS PASS_COUNT, password FROM sample.passwords GROUP BY password ORDER BY PASS_COUNT DESC LIMIT 10"
Waiting on bqjob_r21d3499c472ace56_0000017782fd4af8_1 ... (0s) Current status: DONE
+------------+-----------+
| PASS_COUNT | password |
+------------+-----------+
| 61 | password |
| 49 | 1 |
| 36 | P |
| 30 | root |
| 26 | A |
| 22 | admin |
| 20 | 123456 |
| 14 | p |
| 14 | 123 |
| 8 | raspberry |
+------------+-----------+
We pulled the top 10 most common password attempts using the COUNT, GROUP BY, ORDER BY, DESC, and LIMIT keywords. COUNT tallies the occurrences of each password and stores the result in the column PASS_COUNT. The GROUP BY clause groups the rows by password, ORDER BY sorts the results by PASS_COUNT in descending order with DESC, and LIMIT caps the number of results returned.
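As a quick variation on the same pattern, you can also count the total number of attempts and the number of distinct passwords in the table (output omitted here):
$ bq query "SELECT COUNT(password) AS TOTAL_ATTEMPTS, COUNT(DISTINCT password) AS UNIQUE_PASSWORDS FROM sample.passwords"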
You may be wondering why we group the results by the column password and not by the custom aggregate PASS_COUNT. The reason is that the default SQL dialect used by bq does not allow grouping by an aggregate, as the query below demonstrates.
bq query "SELECT COUNT(password) AS PASS_COUNT, password FROM sample.passwords GROUP BY PASS_COUNT ORDER BY password DESC LIMIT 10"
The error reads, “Cannot group by an aggregate. Consider using SQL, which allows grouping by any expression.”
But aren’t we using SQL already? Well, yes, but BigQuery differentiates between the standard SQL and legacy SQL dialects. Which one is the default? Right now, it depends on how you access BigQuery.
The GCP documentation reads, “In the Cloud Console and the client libraries, standard SQL is the default. In the bq command-line tool and the REST API, legacy SQL is the default.” Some searching on public forums like Stack Overflow suggests that standard SQL offers a better experience with fewer ‘gotchas’ than legacy SQL. While this post uses the default legacy SQL, standard SQL can be enabled for the BigQuery command-line interface; to do so, read “Setting standard SQL as the default for the command-line tool” in the GCP documentation.
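You can also opt into standard SQL per invocation with the --use_legacy_sql=false flag. For example, the top-10 query from earlier runs unchanged under standard SQL:
$ bq query --use_legacy_sql=false "SELECT COUNT(password) AS PASS_COUNT, password FROM sample.passwords GROUP BY password ORDER BY PASS_COUNT DESC LIMIT 10"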
Delete your test environment
Once you’re done with the project, delete the environment.
gcloud projects delete PROJECT_ID
If you would like to delete only the dataset, you can do so with the following command:
bq rm -r sample
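Similarly, bq rm works at the table level if you want to remove just the passwords table and keep the dataset:
bq rm sample.passwords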
Next Steps
For more information about getting started with the bq CLI, check out this Google tutorial, where you will query a sample Shakespeare dataset provided by Google. If you’d like to automate this behavior, you might be interested in learning more about the C#, Go, Java, Node.js, PHP, Python, and/or Ruby client libraries.