[Table of Contents]
Data Preparation
- Extract, Transform, Load (ETL)
- Extract data from the source
- Transform data at the source, sink, or in a staging area
- Load data into the sink
- Sources: file, DB, event log, ...
- Sinks: Python, R, RDBMS, Data warehouse, ...
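
The three ETL steps above can be sketched in a few lines of Python. This is a minimal illustration rather than a reference pipeline: the CSV text `RAW`, the `scores` table, and the function names `extract`, `transform`, and `load` are all hypothetical, an in-memory string stands in for the source file, and an in-memory SQLite database stands in for the sink. The transform here runs in a staging step between source and sink.

```python
import csv
import io
import sqlite3

# Hypothetical source data: CSV text standing in for a real source file.
RAW = "name,score\nalice,90\nbob,\ncarol,75\n"


def extract(text):
    """Extract: read raw rows from the source as dictionaries."""
    return list(csv.DictReader(io.StringIO(text)))


def transform(rows):
    """Transform: drop rows with a missing score and cast types."""
    return [(r["name"].title(), int(r["score"])) for r in rows if r["score"]]


def load(rows, conn):
    """Load: write the cleaned rows into the sink (here, an RDBMS table)."""
    conn.execute("CREATE TABLE IF NOT EXISTS scores (name TEXT, score INTEGER)")
    conn.executemany("INSERT INTO scores VALUES (?, ?)", rows)
    conn.commit()


conn = sqlite3.connect(":memory:")  # assumed sink: in-memory SQLite database
load(transform(extract(RAW)), conn)

loaded = conn.execute("SELECT name, score FROM scores ORDER BY name").fetchall()
print(loaded)
```

The row for `bob` is dropped because its score is empty. In a real pipeline the same cleaning could instead happen at the source or inside the sink, as the outline notes.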
Files
- A file is a named sequence of bytes
- stored as a collection of pages (blocks)
- A file system is a collection of files organized in a hierarchical namespace
- operations
- open() close()
- seek()
- read() write()
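
The operations listed above map directly onto Python's built-in file objects. The sketch below is only an illustration. The file name `demo.bin` and the temporary directory are assumptions made for the example.

```python
import os
import tempfile

# Hypothetical file path inside a fresh temporary directory.
path = os.path.join(tempfile.mkdtemp(), "demo.bin")

# open() + write() + close(): create the file and store a byte sequence.
f = open(path, "wb")
f.write(b"hello world")
f.close()

# open() + seek() + read() + close(): jump to byte offset 6, read the rest.
f = open(path, "rb")
f.seek(6)
tail = f.read()
f.close()

print(tail)
```

Everyday Python code would usually wrap each `open()` in a `with` statement so that `close()` is called automatically. The explicit calls are kept here to mirror the operation list.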
Log Files
- Processes, usually daemons, create logs.
- Syslog
- port 514 using UDP
- /var/log/messages (default)
- Extensions: syslog-ng and rsyslog
- complex message formatting
- content-based filtering
- TCP
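
To make the UDP transport concrete, here is a small sketch that uses Python's standard `logging.handlers.SysLogHandler`. It assumes no syslog daemon is available, so a local UDP socket on an ephemeral port plays the role of the receiver. A real daemon would listen on port 514. The logger name `demo` and the message text are made up for the example.

```python
import logging
import logging.handlers
import socket

# Stand-in for a syslog daemon: a UDP socket on an ephemeral local port.
# (A real syslog daemon would listen on UDP port 514.)
receiver = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
receiver.bind(("127.0.0.1", 0))
receiver.settimeout(2.0)
port = receiver.getsockname()[1]

# SysLogHandler sends each log record as one UDP datagram by default.
handler = logging.handlers.SysLogHandler(address=("127.0.0.1", port))
logger = logging.getLogger("demo")  # hypothetical logger name
logger.setLevel(logging.INFO)
logger.propagate = False
logger.addHandler(handler)

logger.info("disk usage high")

# Receive the raw datagram the way a syslog daemon would.
datagram, _ = receiver.recvfrom(1024)
print(datagram)

handler.close()
receiver.close()
```

The datagram starts with a priority field in angle brackets, followed by the message text. Passing `socktype=socket.SOCK_STREAM` to `SysLogHandler` switches the transport to TCP, which is one of the features that syslog-ng and rsyslog add on the server side.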
Tabular Data
- A table is a collection of rows and columns
- row → index
- column → name
- cell: (index, name) pair
- Excel files, CSV files
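
The (index, name) addressing of a cell can be illustrated with the standard `csv` module. The CSV content in `CSV_TEXT` and the helper `cell` are hypothetical and exist only for this sketch. Rows are addressed by integer position and columns by header name.

```python
import csv
import io

# Hypothetical CSV content: the first line holds the column names.
CSV_TEXT = "name,age,city\nalice,30,Seoul\nbob,25,Busan\n"

# Each row becomes a dict keyed by column name; list position is the row index.
rows = list(csv.DictReader(io.StringIO(CSV_TEXT)))


def cell(table, index, name):
    """Return the cell addressed by the (index, name) pair."""
    return table[index][name]


value = cell(rows, 1, "city")
print(value)
```

Note that the `csv` module returns every cell as a string, so `cell(rows, 0, "age")` yields `"30"` rather than the integer 30. Type conversion is left to the transform step.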
Extensible Markup Language (XML)
- Standard for data representation and exchange