Prepared by : Christian Robles
March 16, 2025
Prepared by : Christian Robles
March 16, 2025
Conduct interviews with stakeholders.
Initiate call with Data SMEs
Distribute surveys to gather additional insights if needed.
Document functional and non-functional requirements.
Develop a timeline with milestones (Estimated Duration: 3 days).
Create Kanban board or similar project management board
Assign roles and responsibilities.
Identify resources required for the project.
List all potential data sources (databases, APIs, spreadsheets).
Once the project is completed, client will need to dump weekly or monthly updates to folder 'C:\\PBI\BGH\OPD' for main OPD MS Excel data source
As long as the original table format and is in 'Sheet1', data refresh will not encounter any issue. This is expected as files will be utilized as is after data extraction from DOH's tool.
PDF file extracts for Referral-related tabs. Automation using Power automate will be created to automatically extract data from 'C:\\PBI\BGH\Referral'
If Geojson mapping is available, will try to create a visual map upto baranggay level.
Assess data availability and quality.
Client have 100% access to DOH's tool for data extraction.
Extract data from identified sources.
This has been tested
Ensure data is up-to-date and relevant.
A simple data extract will enable client to add updated data.
Resolve Data Type Issues
Remove duplicates and irrelevant data.
No duplicate for original OPD main data (fact table)
Each row is valid consultation. a few missing Patient ID but after clarification, this is the same patient. A patient ID will be assigned depending on available information
Check for missing data
Created 'Raw OPD' as main table reference/duplication source. This will not be loaded to the PBI UI. From
Incorrect values
Incorrect data type
Mix data types in a single column (i.e. "N/A" value in a numeric column)
Categorical variables misclassified (i.e. {0,1,2} for {"low", "medium", "high")
Numeric Data with Leading/Trailing Characters (extra space characters is the best example)
Precision and Overflow Issues
Missing Values Misrepresented ("N/A" instead of null or [1,2,"NA",4"] NA as 0 or null)
Boolean Misrepresentation (yes/no, 1/0, true/false as strings)
Date and Time Format inconsistencies. Or Dates that needs to use locale to correctly converts to date
Prepare Summary Statistic results for each column
For numeric columns
Count and Distinct Count
Mean, Median, Mode
Maximum, Minimum, Range
Standard Deviation, Sample Variance
Standard Deviation, Standard Error
Kurtosis, Skewness
For Date and Time
Count, Distinct count
Date Range in days, Days between average
Earliest date, Latest date, Median date
most Frequent date
Weekend count
for String
Trim, clean values
Perform required word/phrase case
Check for dependencies
Check for IDs related to column content
Spelling
single cell multiple content (i.e. products ordered)
Normalize data formats (1NF, 2NF, etc. using Power Query).
Avoid over-normalization
Denormalize for simplified star or snowflake schema (flat dimension tables and centralized fact table).
Create calculated fields if necessary.
Sub-steps:
Analyze data distributions (mean, median, mode).
Identify outliers and anomalies.
Visualize data relationships (scatter plots, correlation matrices).
Check for missing values and assess their impact.
Tips:
For categorical data, look for frequency counts.
For numerical data, assess skewness and kurtosis.