Integrate with AWS S3 to perform automated content classification on your data buckets
About the Amazon AWS S3 Integration
What it does:
- Performs content scanning on objects in an S3 bucket to detect and map the types of data stored inside it.
- The integration can scan and identify data types inside many different file and document types; see the Supported File Types section below.
Before setting up this integration:
- Be sure to add Amazon S3 to your Inventory first. To learn how to add systems to your Inventory, see the Data Inventory documentation.
- Make sure your MineOS plan supports automatic integrations.
How to set up
On the system side:
- Log into your AWS account
- Go to IAM -> Users -> Add New Users
- Enter "mine-os" as the user name (or any other name you prefer), select "Access Key", and click Next
- Select "Attach existing policies directly" and type "s3" in the search box
- Select AmazonS3ReadOnlyAccess and click Next
- Leave the tags page empty and click Next
- Click Create User
- Copy the Access Key ID and Secret access key from this page to MineOS
- The AWS Multiple buckets integration (using regex) requires the s3:ListAllMyBuckets permission.
- Encrypted buckets require the kms:Decrypt permission.
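Instead of the managed AmazonS3ReadOnlyAccess policy, some teams prefer a custom policy scoped to exactly what the integration needs. The sketch below is illustrative only, assuming object-read, bucket-list, bucket-enumeration, and KMS-decrypt access are sufficient; it is not the actual contents of AmazonS3ReadOnlyAccess, and you should tighten the `Resource` ARNs to your own buckets and keys:

```python
import json

# Illustrative minimal policy for this integration -- NOT the exact
# contents of the AmazonS3ReadOnlyAccess managed policy.
policy = {
    "Version": "2012-10-17",
    "Statement": [
        {
            # Read objects and list the contents of buckets
            "Effect": "Allow",
            "Action": ["s3:GetObject", "s3:ListBucket"],
            "Resource": ["arn:aws:s3:::*"],
        },
        {
            # Required by the multiple-buckets (regex) integration
            "Effect": "Allow",
            "Action": ["s3:ListAllMyBuckets"],
            "Resource": ["*"],
        },
        {
            # Required only for buckets encrypted with a customer-managed KMS key
            "Effect": "Allow",
            "Action": ["kms:Decrypt"],
            "Resource": ["*"],
        },
    ],
}

print(json.dumps(policy, indent=2))
```

You can paste the resulting JSON into the IAM policy editor; wildcard resources are used here only to keep the sketch short.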
On your Privacy Portal:
- Head to your Data Inventory and select Amazon AWS S3
- Scroll down to the component titled “Request handling”
- Select “Handle this data source in privacy requests”
- Select “Integration” as the handling style (see image below).
- Paste the Access Key ID and Secret Access Key into the designated fields
- Under Bucket Regex, type a bucket name or a regular expression matching the bucket names you want to scan. A few examples of regular expressions for scanning multiple buckets:
- .+ will scan all the buckets the user account has access to.
- prod-.+ will scan all the buckets that start with "prod-".
- .+-data will scan all the buckets that end with "-data".
- Click "Test your integration" so Mine can verify and save your settings. The bucket names that match the regex will then appear; verify that they are the buckets you intended:
- If successful, click "Test & save" to enable the integration.
If you would like to add more buckets or regexes, click the "+ Create Instance" link at the bottom and type in another bucket name or regex. You can reuse the same Access Key ID & Secret.
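The regex examples above can be sketched in a few lines of Python, assuming the pattern must match the whole bucket name (the bucket names and the `matching_buckets` helper are hypothetical, for illustration only):

```python
import re

def matching_buckets(pattern: str, bucket_names: list[str]) -> list[str]:
    """Return the bucket names fully matched by the given regex,
    mirroring how a Bucket Regex field would select buckets."""
    compiled = re.compile(pattern)
    return [name for name in bucket_names if compiled.fullmatch(name)]

buckets = ["prod-users", "prod-logs", "staging-data", "backups"]

print(matching_buckets(r".+", buckets))       # every bucket
print(matching_buckets(r"prod-.+", buckets))  # only the "prod-" buckets
print(matching_buckets(r".+-data", buckets))  # only buckets ending in "-data"
```

Testing patterns this way before saving them can help confirm a regex selects exactly the buckets you expect.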
Supported File Types
Mine's content classification supports the following file types by extracting text from the files and performing classification:
- Apache Avro (.avro) - subject to limits on maximum block size, file size, and number of columns
- Apache Parquet (.parquet)
- CSV/TSV (.csv, .tsv)
- PDF - File size limit: 30MB
- Plain-text files
- Microsoft Word - File size limit: 30MB
- Microsoft Excel - File size limit: 30MB
- Microsoft PowerPoint - File size limit: 30MB
Not currently supported:
- Archives
- Image files (OCR support is planned but not yet available)
- "Requester pays" buckets
- Buckets using the Glacier storage class
- Compressed objects (e.g., gzip)
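If you want to estimate which of your objects are eligible for scanning before enabling the integration, the limits above can be approximated with a small pre-check. The helper below is hypothetical (the extension sets and the assumption that the 30MB cap applies only to PDF/Office files are illustrative, not part of the product):

```python
# Illustrative pre-check mirroring the documented limits: a 30 MB cap on
# PDF/Office files, a set of supported extensions, and unsupported gzip objects.
SIZE_LIMITED = {".pdf", ".doc", ".docx", ".xls", ".xlsx", ".ppt", ".pptx"}
SUPPORTED = SIZE_LIMITED | {".avro", ".parquet", ".csv", ".tsv", ".txt"}
MAX_BYTES = 30 * 1024 * 1024  # 30 MB

def is_scannable(key: str, size_bytes: int) -> bool:
    """Rough eligibility check for an S3 object, by key suffix and size."""
    suffix = "." + key.rsplit(".", 1)[-1].lower() if "." in key else ""
    if suffix not in SUPPORTED:
        return False  # archives, images, gzip, unknown types
    if suffix in SIZE_LIMITED and size_bytes > MAX_BYTES:
        return False  # over the documented size limit
    return True

print(is_scannable("reports/q3.pdf", 5 * 1024 * 1024))   # True
print(is_scannable("reports/q3.pdf", 40 * 1024 * 1024))  # False: over 30 MB
print(is_scannable("logs/app.log.gz", 1024))             # False: gzip
```

Running a check like this over an object listing gives a quick sense of how much of a bucket the scan would actually cover.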
Talk to us if you need any help with integrations via our chat or at email@example.com, and we'll be happy to assist! 🙂