All technological notes.
Structured
Unstructured
Semi-Structured
Data that is not as organized as structured data but has some level of structure in the form of tags, hierarchies, or other patterns.
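As a rough illustration of "some level of structure", the sketch below parses a small JSON record and walks its hierarchy. The record and its field names are made up for this example, and the `flatten` helper is hypothetical, not part of any library.

```python
import json

# Hypothetical record: the field names are illustrative, not from the notes.
raw = '{"user": {"id": 7, "name": "Ada"}, "tags": ["admin", "beta"], "note": "free text here"}'

record = json.loads(raw)


def flatten(obj, prefix=""):
    """Walk nested dicts and yield (dotted_path, value) pairs."""
    for key, value in obj.items():
        path = f"{prefix}.{key}" if prefix else key
        if isinstance(value, dict):
            yield from flatten(value, path)
        else:
            yield path, value


# The hierarchy is discoverable from the data itself, without a fixed schema.
flat = dict(flatten(record))
for path, value in flat.items():
    print(path, "=", value)
```

The keys and nesting give the data a shape that can be discovered at read time, while the "note" value remains free text.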
Characteristics:
e.g.:
Volume
Velocity
Variety
Data Warehouse
A centralized repository optimized for analysis where data from different sources is stored in a structured format.
Characteristics:
e.g.:
Data Lake
A storage repository that holds vast amounts of raw data in its native format, including structured, semi-structured, and unstructured data.
Choosing a Warehouse vs. a Lake
Use a Data Warehouse when:
Use a Data Lake when:
Often, organizations use a combination of both, ingesting raw data into a data lake and then processing and moving refined data into a data warehouse for analysis.
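The combined pattern can be sketched locally, assuming a temporary directory stands in for the lake and an in-memory SQLite database stands in for the warehouse. The events, file name, and table layout are hypothetical; a real setup would use object storage and a warehouse engine instead.

```python
import json
import sqlite3
import tempfile
from pathlib import Path

# Hypothetical raw events; names and values are made up for illustration.
raw_events = [
    {"order_id": 1, "amount": "19.99", "customer": "ada"},
    {"order_id": 2, "amount": "5.00", "customer": "bob", "coupon": "SPRING"},
    {"order_id": 3, "amount": None, "customer": "cy"},  # incomplete record
]

# 1. "Lake": land the raw data unchanged, one JSON object per line.
lake_dir = Path(tempfile.mkdtemp())
raw_file = lake_dir / "orders.jsonl"
raw_file.write_text("\n".join(json.dumps(e) for e in raw_events))

# 2. "Warehouse": a fixed schema, populated only with refined rows.
warehouse = sqlite3.connect(":memory:")
warehouse.execute(
    "CREATE TABLE orders ("
    "order_id INTEGER PRIMARY KEY, customer TEXT NOT NULL, amount REAL NOT NULL)"
)

for line in raw_file.read_text().splitlines():
    event = json.loads(line)
    if event["amount"] is None:
        continue  # stays in the lake; not refined enough for the warehouse
    warehouse.execute(
        "INSERT INTO orders VALUES (?, ?, ?)",
        (event["order_id"], event["customer"], float(event["amount"])),
    )

row_count = warehouse.execute("SELECT COUNT(*) FROM orders").fetchone()[0]
total = warehouse.execute("SELECT SUM(amount) FROM orders").fetchone()[0]
```

All three events remain available in their raw form, while only the rows that fit the warehouse schema are loaded for analysis.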
Data Lakehouse
A hybrid data architecture that combines the best features of data lakes and data warehouses, aiming to provide the performance, reliability, and capabilities of a data warehouse while maintaining the flexibility, scale, and low-cost storage of data lakes.
Characteristics:
e.g.:
Data Mesh
Individual teams own “data products” within a given domain
These data products serve various “use cases” around the organization
“Domain-based data management”
Federated governance with central standards
ETL:
Extract:
Transform
Data cleansing (e.g., removing duplicates, fixing errors)
Data enrichment (e.g., adding additional data from other sources)
Format changes (e.g., date formatting, string manipulation)
Load:
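The extract, transform, and load stages above can be sketched in a few functions. This is a minimal sketch under stated assumptions: the CSV text, the `COUNTRY_NAMES` lookup, and the `users` table are all hypothetical, and SQLite stands in for the real target system.

```python
import csv
import io
import sqlite3
from datetime import datetime

# Hypothetical source data: column names and values are illustrative only.
SOURCE_CSV = """id,name,signup_date,country_code
1, Ada ,03/01/2024,GB
2,Bob,15/02/2024,US
1, Ada ,03/01/2024,GB
"""
COUNTRY_NAMES = {"GB": "United Kingdom", "US": "United States"}  # enrichment source


def extract(text):
    """Read rows from the source as dictionaries."""
    return list(csv.DictReader(io.StringIO(text)))


def transform(rows):
    seen, out = set(), []
    for row in rows:
        if row["id"] in seen:  # cleansing: drop duplicate ids
            continue
        seen.add(row["id"])
        parsed = datetime.strptime(row["signup_date"], "%d/%m/%Y")
        out.append({
            "id": int(row["id"]),
            "name": row["name"].strip(),                 # string manipulation
            "signup_date": parsed.date().isoformat(),    # date formatting
            "country": COUNTRY_NAMES.get(row["country_code"], "Unknown"),  # enrichment
        })
    return out


def load(rows, conn):
    conn.execute(
        "CREATE TABLE users (id INTEGER PRIMARY KEY, name TEXT, signup_date TEXT, country TEXT)"
    )
    conn.executemany("INSERT INTO users VALUES (:id, :name, :signup_date, :country)", rows)


conn = sqlite3.connect(":memory:")
load(transform(extract(SOURCE_CSV)), conn)
loaded = conn.execute("SELECT id, name, signup_date, country FROM users ORDER BY id").fetchall()
```

Keeping each stage as a separate function makes it easier to test the transform logic without touching the source or the target.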
ETL Pipelines
This process must be automated in some reliable way
AWS Glue
EventBridge
Amazon MWAA
AWS Step Functions
Lambda
Glue Workflows
JDBC
ODBC
CSV
Comma-Separated Values
Text-based format that represents data in a tabular form where each line corresponds to a row and values within a row are separated by commas.
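A short sketch of reading and writing this format with Python's standard `csv` module follows. The table contents are hypothetical. Note that a value containing a comma has to be quoted, which is why going through a CSV library is usually safer than splitting lines by hand.

```python
import csv
import io

# Hypothetical table: the column names and values are made up.
text = 'item,price,note\nwidget,2.50,"small, blue"\ngadget,10.00,\n'

# Each line becomes one row; every value is read as a string.
rows = list(csv.DictReader(io.StringIO(text)))
prices = [float(r["price"]) for r in rows]

# Writing goes through csv.writer so fields containing commas get quoted.
out = io.StringIO()
writer = csv.writer(out, lineterminator="\n")
writer.writerow(["item", "note"])
writer.writerow([rows[0]["item"], rows[0]["note"]])
```

The format carries no type information, so numeric columns such as `price` have to be converted explicitly after reading.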
When to Use:
JSON
JavaScript Object Notation
Avro
Parquet
Star schema
Fact tables
A fact is an event that is counted or measured, such as a sale or a login.
Dimension tables
A dimension includes reference data about the fact, such as date, item, or customer.
Primary / foreign keys:
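One way to picture the relationship between fact and dimension tables is the small SQLite sketch below. The table and column names (`fact_sale`, `dim_date`, and so on) and the sample rows are hypothetical. Each dimension table has a primary key, and the fact table refers to those keys through foreign keys.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite enforces foreign keys only when enabled

conn.executescript("""
CREATE TABLE dim_date     (date_id     INTEGER PRIMARY KEY, iso_date TEXT);
CREATE TABLE dim_item     (item_id     INTEGER PRIMARY KEY, name TEXT);
CREATE TABLE dim_customer (customer_id INTEGER PRIMARY KEY, name TEXT);

-- The fact table holds the measured event plus a foreign key per dimension.
CREATE TABLE fact_sale (
    sale_id     INTEGER PRIMARY KEY,
    date_id     INTEGER NOT NULL REFERENCES dim_date(date_id),
    item_id     INTEGER NOT NULL REFERENCES dim_item(item_id),
    customer_id INTEGER NOT NULL REFERENCES dim_customer(customer_id),
    amount      REAL NOT NULL
);

INSERT INTO dim_date VALUES (1, '2024-01-03');
INSERT INTO dim_item VALUES (1, 'widget'), (2, 'gadget');
INSERT INTO dim_customer VALUES (1, 'Ada'), (2, 'Bob');
INSERT INTO fact_sale VALUES (1, 1, 1, 1, 2.5), (2, 1, 2, 2, 10.0);
""")

# Analysis joins the central fact table out to each dimension.
rows = conn.execute("""
    SELECT c.name, i.name, d.iso_date, f.amount
    FROM fact_sale f
    JOIN dim_customer c ON c.customer_id = f.customer_id
    JOIN dim_item i     ON i.item_id = f.item_id
    JOIN dim_date d     ON d.date_id = f.date_id
    ORDER BY f.sale_id
""").fetchall()

# A fact row pointing at a missing dimension row is rejected.
try:
    conn.execute("INSERT INTO fact_sale VALUES (99, 1, 1, 404, 1.0)")
    rejected = False
except sqlite3.IntegrityError:
    rejected = True
```

The fact table sits at the center and every join goes directly from it to one dimension, which is the layout that gives the star schema its name.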
Entity Relationship Diagram (ERD):

Data Lineage
Importance:
e.g.:

Schema Evolution
Importance:
Glue Schema Registry
Indexing
Partitioning
Compression
Columnar compression