Integrate with AWS S3 to perform automated content classification on your data buckets
About the Amazon AWS S3 Integration
What it does:
- Performs content scanning on objects in an S3 bucket to detect and map the types of data stored inside it.
- The integration can scan and identify data types inside many different file and document types; see the Supported File Types section below.
Before setting up this integration:
- Be sure to add Amazon S3 to your Inventory first. To learn how to add systems to your Inventory, see the Data Inventory documentation.
- Make sure your MineOS plan supports automatic integrations.
How to set up
On the system side:
- Log into your AWS account
- Go to IAM -> Users -> Add New Users
- Enter "mine-os" as the user name (or any other name you prefer), select "Access Key", and click Next
- Select "Attach existing policies directly" and type "s3" in the search box
- Select AmazonS3ReadOnlyAccess and click Next
- Leave the tags page empty and click Next
- Click Create User
- Copy the Access Key ID and Secret access key from this page to MineOS
- The AWS Multiple buckets integration (using regex) requires the s3:ListAllMyBuckets permission.
- Encrypted buckets require the kms:Decrypt permission.
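Instead of the managed AmazonS3ReadOnlyAccess policy, some teams prefer a custom policy scoped to exactly what the integration needs. The sketch below is illustrative only, assuming object-read, bucket-list, bucket-enumeration, and KMS-decrypt access are sufficient; it is not the actual contents of AmazonS3ReadOnlyAccess, and you should tighten the `Resource` ARNs to your own buckets and keys:

```python
import json

# Illustrative minimal policy for this integration -- NOT the exact
# contents of the AmazonS3ReadOnlyAccess managed policy.
policy = {
    "Version": "2012-10-17",
    "Statement": [
        {
            # Read objects and list the contents of buckets
            "Effect": "Allow",
            "Action": ["s3:GetObject", "s3:ListBucket"],
            "Resource": ["arn:aws:s3:::*"],
        },
        {
            # Required by the multiple-buckets (regex) integration
            "Effect": "Allow",
            "Action": ["s3:ListAllMyBuckets"],
            "Resource": ["*"],
        },
        {
            # Required only for buckets encrypted with a customer-managed KMS key
            "Effect": "Allow",
            "Action": ["kms:Decrypt"],
            "Resource": ["*"],
        },
    ],
}

print(json.dumps(policy, indent=2))
```

You can paste the resulting JSON into the IAM policy editor; wildcard resources are used here only to keep the sketch short.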
On your Privacy Portal:
- Head to your Data Inventory and select Amazon AWS S3
- Scroll down to the component titled “Request handling”
- Select “Handle this data source in privacy requests”
- Select “Integration” as the handling style (see image below).
- Paste the Access Key ID and Secret Access Key into the designated fields
- Under Bucket Regex, type a bucket name or a regular expression matching the bucket names you want to scan. A few examples of regular expressions for scanning multiple buckets:
- .+ will scan all the buckets the user account has access to.
- prod-.+ will scan all the buckets that start with "prod-".
- .+-data will scan all the buckets that end with "-data".
- Click "Test your integration" so Mine can verify and save your settings. The bucket names that match the regex will then appear; verify that they are the buckets you intended:
- If successful, click "Test & save" to enable the integration.
If you would like to add more buckets or regexes, click the "+ Create Instance" link at the bottom and type in another bucket name or regex. You can reuse the same Access Key ID & Secret.
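The regex examples above can be sketched in a few lines of Python, assuming the pattern must match the whole bucket name (the bucket names and the `matching_buckets` helper are hypothetical, for illustration only):

```python
import re

def matching_buckets(pattern: str, bucket_names: list[str]) -> list[str]:
    """Return the bucket names fully matched by the given regex,
    mirroring how a Bucket Regex field would select buckets."""
    compiled = re.compile(pattern)
    return [name for name in bucket_names if compiled.fullmatch(name)]

buckets = ["prod-users", "prod-logs", "staging-data", "backups"]

print(matching_buckets(r".+", buckets))       # every bucket
print(matching_buckets(r"prod-.+", buckets))  # only the "prod-" buckets
print(matching_buckets(r".+-data", buckets))  # only buckets ending in "-data"
```

Testing patterns this way before saving them can help confirm a regex selects exactly the buckets you expect.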
Supported File Types
Mine's content classification supports the following file types by extracting text from the files and performing classification:
- Apache Avro (.avro) - subject to limits on maximum block size, file size, and number of columns
- Apache Parquet (.parquet)
- CSV/TSV (.csv, .tsv)
- PDF - File size limit: 30MB
- Plain-text files
- Microsoft Word - File size limit: 30MB
- Microsoft Excel - File size limit: 30MB
- Microsoft PowerPoint - File size limit: 30MB
Not currently supported:
- Archives
- Image files (OCR support is planned but not yet available)
- "Requester pays" buckets
- Buckets using the Glacier storage class
- Compressed objects (e.g., gzip)
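If you want to estimate which of your objects are eligible for scanning before enabling the integration, the limits above can be approximated with a small pre-check. The helper below is hypothetical (the extension sets and the assumption that the 30MB cap applies only to PDF/Office files are illustrative, not part of the product):

```python
# Illustrative pre-check mirroring the documented limits: a 30 MB cap on
# PDF/Office files, a set of supported extensions, and unsupported gzip objects.
SIZE_LIMITED = {".pdf", ".doc", ".docx", ".xls", ".xlsx", ".ppt", ".pptx"}
SUPPORTED = SIZE_LIMITED | {".avro", ".parquet", ".csv", ".tsv", ".txt"}
MAX_BYTES = 30 * 1024 * 1024  # 30 MB

def is_scannable(key: str, size_bytes: int) -> bool:
    """Rough eligibility check for an S3 object, by key suffix and size."""
    suffix = "." + key.rsplit(".", 1)[-1].lower() if "." in key else ""
    if suffix not in SUPPORTED:
        return False  # archives, images, gzip, unknown types
    if suffix in SIZE_LIMITED and size_bytes > MAX_BYTES:
        return False  # over the documented size limit
    return True

print(is_scannable("reports/q3.pdf", 5 * 1024 * 1024))   # True
print(is_scannable("reports/q3.pdf", 40 * 1024 * 1024))  # False: over 30 MB
print(is_scannable("logs/app.log.gz", 1024))             # False: gzip
```

Running a check like this over an object listing gives a quick sense of how much of a bucket the scan would actually cover.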
Talk to us if you need any help with integrations via our chat or at email@example.com, and we'll be happy to assist! 🙂