August 9th, 2019 by Michael Rink
Amazon Releases AWS Lake Formation
On August 8th, Amazon Web Services released AWS Lake Formation, a data lake service. Many customers were already using Amazon S3 (Simple Storage Service) for their data lakes, so Lake Formation might best be viewed as a set of tools to make an Amazon data lake less expensive and more user-friendly.
Amazon lists five key tools that Amazon Web Services Lake Formation provides: source crawlers, ETL and data prep, data catalog, security settings, and access control. All of these tools are managed through a central Lake Formation Console. Other AWS services like Athena, Redshift, and EMR will still be able to access data once it has been moved over.
AWS Lake Formation source crawlers are aimed at reducing the overhead of getting data from wherever it currently lives into your data lake. Customers with existing S3 buckets just need to point Lake Formation at the buckets they want to pull in. The process is slightly more involved for new customers or those looking to add new data sources. AWS Lake Formation can pull in entire databases, or do incremental updates based on user-defined tables and keys.
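To make the idea of an incremental update keyed on a user-chosen column more concrete, here is a minimal Python sketch. It is not Lake Formation code: the function name `incremental_rows`, the `order_id` key, and the sample rows are all hypothetical, and the sketch assumes the key only ever increases for newly added rows.

```python
def incremental_rows(rows, key, bookmark):
    """Return the rows whose `key` is greater than `bookmark`, plus the new bookmark.

    A bookmark of None means nothing has been loaded yet, so every row is returned.
    """
    new_rows = [r for r in rows if bookmark is None or r[key] > bookmark]
    # If nothing new arrived, keep the old bookmark unchanged.
    new_bookmark = max((r[key] for r in new_rows), default=bookmark)
    return new_rows, new_bookmark


# Hypothetical source table.
source = [
    {"order_id": 1, "total": 20.0},
    {"order_id": 2, "total": 35.5},
    {"order_id": 3, "total": 12.25},
]

# First run: no bookmark yet, so the whole table is pulled in.
first_batch, bookmark = incremental_rows(source, "order_id", None)

# A new row arrives in the source; the second run picks up only that row.
source.append({"order_id": 4, "total": 8.0})
second_batch, bookmark = incremental_rows(source, "order_id", bookmark)
```

The trade-off is the usual one: a full load is simpler but repeats work on every run, while an incremental load depends on a key that reliably identifies new rows.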
AWS Lake Formation uses AWS Glue to provide extract, transform, load (ETL) and data preparation services. Lake Formation also provides a built-in machine learning service to deduplicate data as it is brought in. This should help keep the size of the data lake, and thus costs, down.
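For readers unfamiliar with deduplication on ingest, the sketch below shows the general shape of the problem in Python. It is only a stand-in: it uses simple string similarity from the standard library rather than machine learning, and it does not reflect how Lake Formation's service works internally. The function names, the sample records, and the 0.9 threshold are all assumptions made for illustration.

```python
from difflib import SequenceMatcher


def normalize(record, fields):
    """Build a comparison key: lowercase the chosen fields and collapse whitespace."""
    return " ".join(" ".join(str(record[f]).lower().split()) for f in fields)


def deduplicate(records, fields, threshold=0.9):
    """Keep the first record of each group of near-identical records."""
    kept = []
    kept_keys = []
    for record in records:
        key = normalize(record, fields)
        # Skip the record if it is close enough to one we already kept.
        if any(SequenceMatcher(None, key, k).ratio() >= threshold for k in kept_keys):
            continue
        kept.append(record)
        kept_keys.append(key)
    return kept


# Hypothetical incoming records; the first two describe the same customer.
customers = [
    {"name": "Jane Smith", "city": "Seattle"},
    {"name": "jane  smith", "city": "Seattle"},
    {"name": "Robert Jones", "city": "Austin"},
]

unique = deduplicate(customers, ["name", "city"])
```

Fewer stored records means less storage to pay for, which is the cost argument made above.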
One of the toughest hurdles for data lakes is keeping track of everything that is in the lake. AWS Lake Formation provides a data catalog that describes the different data sets that are available, along with which groups of users have access to each. This should make the process of finding relevant data sets more user-friendly.
The last two tools are really a single set of tools for security and access control. The toolset includes services like AWS Identity and Access Management (IAM) and AWS Key Management Service (KMS). AWS Lake Formation allows customers to set data lake-wide policies through the central console. If more granular control is needed, it also supports per-data-set security settings.
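The relationship between lake-wide policies and per-data-set settings can be sketched in a few lines of Python. This is a conceptual model only, not the Lake Formation or IAM API: the group names, data set names, and actions are hypothetical, and the rule that a per-data-set entry fully replaces the lake-wide entry for that group is an assumption made for the sake of the example.

```python
# Hypothetical lake-wide defaults: what each group may do on any data set.
LAKE_WIDE_POLICY = {
    "analysts": {"SELECT"},
    "engineers": {"SELECT", "INSERT"},
}

# Hypothetical per-data-set overrides for cases that need finer control.
DATASET_OVERRIDES = {
    "payroll": {
        "analysts": set(),   # analysts lose access to payroll
        "hr": {"SELECT"},    # hr gains read access to payroll only
    },
}


def allowed(group, dataset, action):
    """Check an action, letting a per-data-set entry take precedence over the default."""
    overrides = DATASET_OVERRIDES.get(dataset, {})
    if group in overrides:
        return action in overrides[group]
    return action in LAKE_WIDE_POLICY.get(group, set())
```

In this model `allowed("analysts", "sales", "SELECT")` is permitted by the lake-wide default, while `allowed("analysts", "payroll", "SELECT")` is refused because the payroll override takes precedence.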