Canadian Beef Cattle Survey Production Lot Uploads: General Data Preparation Information

Site Data Information Reminder - Variations of geometry or the shape of the data

There are 5 sites where source data from the project comes from.
The database for the project has a "standard" shape or model how the data is stored.
Data from the 5 sites does not arrive to the project managers in the standard format or "shape".
When you prepare a workbook spreadsheet for upload, it is important to keep in mind the "shape" the source data.

Source Excel Data - General Information

Excel Workbook (wb) is referred to as a workbook, or a spreadsheet, and represents a site, year and possibly a herd.
Excel Worksheet (ws) is a worksheet within a workbook and usually represents a herd or a drug.
Excel Data Row (row) is a row within a worksheet usually representing a feedlot denoted by a feedlot code.
Excel Data Column (col) is a column within a worksheet usually representing a database field in the main survey data tables.
Excel Data includes Survey fields, associated Production Lot Core Fields and Drug fields.
Sometimes associated Drug fields are provided as columns with each column representing a specific drug.
Sometimes associated Drug fields are provided in rows with each column representing an trade name or active ingredient field, with each row representing a different drug route.

Prepare working version of spreadsheet:
Working Copy of Workbook

Find the file/s provided to you to prepare to upload. I am certain you remembered where you saved it.
Make a copy of the spreadsheet. We have established rules for naming your new working Excel workbook.

Guidelines to name your copied spreadsheet:
The first 5 characters are your first, last initial then wb_.
The next set of characters are the date. The format is 4 digit year, 2 digit month and 2 digit day and end with _.
I use this format so it sorts nicely on the file system.
The next 5 characters are the 4 digit project year of the data and _.
The next characters are shorthand for the name of the participating veterinary clinic and _. For example Alberta Health Solutions is abhs_
The next characters are the herd number or herd range.
The next characters are version numbers from v1 thru v#.
The final characters are the extension for Excel spreadsheets .xlsx
An example for a workbook name is: yrwb_20210316_2019_abhs_v1_CSBH00029.xlsx

Prepare working version of spreadsheet:
Working Copy of Worksheets within Workbook

Within the work workbook file, you may have one or more worksheets. Make a copy of the worksheet you are focussed on and rename.
The new worksheet name will begin with 'To Upload-' then indicate year, or herd or type of data of some sort.
The new working spreadsheet will be the actual data you will upload.
Copy the area of data of interest from the original worksheet.
Move to the working version of the worksheet and paste the data into it. Make sure to paste by values!
The new worksheet data has to be cleaned up. There are steps to take to ensure you have what you want.

Check for hidden columns. If there are hidden columns unhide them.
To unhide columns using Excel, select all data, right-click in the data area, a pop-up menu appears.
Click on "Unhide". Any hidden columns will now be visible

There is an optional step you may consider. You may want to remove all formatting from the spreadsheet.
It may mean you to re-format some fields afterwards.

Here are a few specific column formatting changes to make which makes conversion a little easier:
Go to date columns in the spreadsheet. You should format them as date. It will alleviate a lot of problems during the upload.
Go to the numeric columns and make sure they are formatted as numeric and not text!
Go to the last column in the worksheet. Starting from the first empty column to the right of last column, highlight 10 or more columns, then delete all the columns.
Go to the last row in the worksheet. Starting at the first empty row, highlight 10 or 20 or more rows, then delete the rows.
It is a good idea to delete a few extra rows and columns outside the data area to remove any unwanted hidden data.
Check to make sure all the data in your rectangle of data is there. If not, you may not have pasted by values!

Once you have done all of this and you are satisfied with what you have, check to make sure you have enough columns for the core fields and drug fields.

There are 5 core fields which are categorical. The fields are sex, age category, origin, brd risk, and quarter of the year closeout.
Each of these 5 categorical variables have standard categories.
As you prepare the spreadsheet take note of the category values in the source spreadsheet.
Most of the time these category values do not match the database.
Once in the application you can add these "alternate" category values for each categorical variable.
In this way we keep our standard in place and allow the source data to arrive in whatever format the data source providers transferred it.

Prepare working version of spreadsheet:
Data Cell Values within a Worksheet within a Workbook

Each cell in a spreadsheet contains data values.
The situation arises where a column of data is set to 0 or is blank.
It happens often in the drug section of a spreadsheet.
Let us assume we are working within a column of drug data where each cell value is a measure of how much drug was used.
For example, assume we have injectable data. Also, assume one of the drugs has all zeros in the column.
It has been decided if the column is in the spreadsheet then the drug data will be uplodaed into the database.
The values uploaded will all be zero just as they are in the spreadsheet.
There are situations where a drug is listed in a column in a spreadsheet. The cell data is all blank or has some zeroes and has some values > 0.
The 0 values and values > 0 will be uploaded into the database. What happens to the blank cells?
It has been decided, if the field is in the spreadsheet, and some or all data values are blank or missing or null, then the missing data value is converted into a 0,
and the data will be uploaded into the database as a 0.

The final situation which happens is when one herd has 10 drugs listed and the next herd has 15 drugs listed.
What do we do in this situation?
It has been decided if a drug exists in a spreadsheet then it will be uploaded, missing as 0, 0's as 0 and valid data.
If another worksheet does not have that drug in it then it will not be uploaded with value 0.
It will upload only what exits!
At report time we can include 0 drug counts, 0 drug quantities in however a report will be requested.

General Data Table Information

When spreadsheets arrive, they arrive in batches of one or more. The number of spreadsheets depends on the number of enrolled herds per veterinary clinic and how they are organized. Make sure to open all spreadsheets to make sure
they are in a similar format.
Check the columns in the spreadsheet and make sure to match to core and drug columns.
Do not delete spreadsheet columns if they are not being used. The main reason is to reduce errors.
You do not need to move the columns. This can cause errors as well.
Whatever the user feels is comfortable doing to reduce errors is important!
The data tables are stored in a PostGreSQL Database

Survey Fields: Database Model

The uploaded data is converted, checked and integrated into a standard "database model".
Dr. Sheryl Gow has modelled a survey to collect data specific to her questions of interest.
Dr. Sheryl Gow and I worked together to "data model" her survey. The data model represents
the Beef Cattle Survey in a way which makes it easy to work with once uploaded.
Below you will see a short version of the data model in a picture format.
A more advanced picture can be provided but for now a simple picture will suffice.

For each survey, there are one or more production lot records.
For each production lot record, there are 1 or more drug records for a drug route.

Survey Table Fields: Table Name is survey_cattle_beef.csb_survey_main

csb_survey_main_id bigint,
csb_record_group_id bigint,
csb_vet_id bigint,
csb_herd_id bigint,
csb_project_year integer,
csb_project_code varchar(128),
csb_date_created timestamp without time zone,

(The following new fields were added November 2021)
csb_survey_notes text

Production Lot Table Fields (Core): Table Name is survey_cattle_beef.csb_survey_production_lots

csb_survey_production_lot_id bigint,
csb_survey_main_id bigint,
csb_record_group_id bigint,
csb_account_id bigint,
csb_survey_production_lot_upload_id bigint,
csb_pl_template_id integer,
csb_pl_production_lot_code varchar(128),
csb_pl_sex_id integer,
csb_pl_animal_age_category_id integer,
csb_pl_percent_calves numeric,
csb_pl_percent_yearlings numeric,
csb_pl_origin_id integer,
csb_pl_brd_risk_id integer,
csb_pl_n_animals_in integer,
csb_pl_avg_date_in date,
csb_pl_source_weight_in_units_id integer,
csb_pl_source_weight_in_value numeric,
csb_pl_avg_weight_in_kgs numeric,
csb_pl_avg_date_out date,
csb_pl_quarter_of_year_closeout_id integer,
csb_pl_source_weight_out_units_id integer,
csb_pl_source_weight_out_value numeric,
csb_pl_avg_weight_out_kgs numeric,
csb_pl_n_animals_dead integer,
csb_pl_n_animals_slaughtered integer,
csb_pl_total_days_on_feed numeric,
csb_pl_total_head_days numeric,

(The following new fields were added November 2021)
csb_pl_n_animals_railed integer,
csb_pl_percent_adults numeric DEFAULT 0,
csb_pl_origin_percent_auction numeric DEFAULT 0,
csb_pl_origin_percent_backgrounded numeric DEFAULT 0,
csb_pl_origin_percent_ranch_direct numeric DEFAULT 0,
csb_pl_brd_risk_percent_low numeric DEFAULT 0,
csb_pl_brd_risk_percent_medium numeric DEFAULT 0,
csb_pl_sex_percent_steers numeric DEFAULT 0,
csb_pl_sex_percent_cows numeric DEFAULT 0,
csb_pl_sex_percent_heifers numeric DEFAULT 0,
csb_pl_sex_percent_bulls numeric DEFAULT 0,
csb_pl_notes text

Production Lot Table Fields (Drugs): Table Name is survey_cattle_beef.csb_survey_production_lot_drugs

csb_survey_production_lot_drug_id bigint,
csb_survey_production_lot_id bigint,
csb_survey_main_id bigint,
csb_record_group_id bigint,
csb_account_id bigint,
csb_survey_production_lot_upload_id bigint,
csb_pl_template_id integer,
csb_pl_drug_ordinal integer,
csb_pl_drug_route_id integer,
csb_pl_source_drug_name varchar(192),
csb_pl_source_drug_value numeric,
csb_pl_source_drug_value_units_id integer,
csb_pl_drug_trade_name_id integer,
csb_pl_drug_active_ingredient_id integer,
csb_pl_drug_total_mg numeric,
csb_pl_drug_total_ml numeric,
csb_pl_drug_primary_reason_for_use integer,
csb_pl_drug_secondary_reason_respiratory integer DEFAULT 0,
csb_pl_drug_secondary_reason_digestive integer DEFAULT 0,
csb_pl_drug_secondary_reason_lameness integer DEFAULT 0,
csb_pl_drug_secondary_reason_neurologic integer DEFAULT 0,
csb_pl_drug_secondary_reason_mixed integer DEFAULT 0,
csb_pl_drug_secondary_reason_mixed_description varchar(128),

(The following new fields were added November 2021)
csb_pl_drug_regimen_id bigint,
csb_pl_drug_regimen_detail_id bigint,
csb_pl_drug_primary_reason_for_use_pct_gp numeric DEFAULT 0,
csb_pl_drug_primary_reason_for_use_pct_dp numeric DEFAULT 0,
csb_pl_drug_primary_reason_for_use_pct_dt numeric DEFAULT 0,
csb_regimen_secondary_reason_liver_abscess integer DEFAULT 0,
csb_regimen_secondary_reason_coccidiosis integer DEFAULT 0,
csb_regimen_secondary_reason_enteritis integer DEFAULT 0,
csb_regimen_secondary_reason_other integer DEFAULT 0,
csb_regimen_secondary_reason_other_description text,
csb_pl_drug_secondary_reason_source text,
csb_pl_drug_notes text

Reminders:

Press the save button often! If you save a mistake you can go back to the original data and start again.
Each sheet or worksheet within a workbook, can represent one drug, one herd or combination of data.
The data's format and shape and structure depends a lot on the source site.
The data's format and shape and structure determine the types of templates to create to upload the data.
The data's first row is considered the spreadsheets column name.
Column names in a spreadsheet should be restricted to be 55 characters or less in length.
The column name will be used as a field name when uploaded into a PostgreSQL data table minus any invalid PG field name characters.
Reporting of the data will be by active ingredient and/or active ingredient class as well as by regimen.


Appendix 1: The #1 thing preparing Excel Data - Documentation

Documentation is the most important aspect of preparing any dataset from any source into any other target dataset.
For over 35 years dataset preparation is top of mind for any project.
What was learned up to now is "Documentation" can not be taken lightly. Documentation needs to be done carefully.
You need to know how many others will need access to the documentation to append comments, or consume what was documented.
Sharing a document with others about how the data was prepared has become a problem.
Here are several suggestions how to improve sharing and documenting data preparation of any project.

Keep your documentation close to the source dataset. What I mean is I have found if working in a spreadsheet,
add a worksheet called "Notes" and place your important documentation right close to the data.
You can always copy this and share it elsewhere.
Often times the documentation is prepared in a word processor the assumption is documents are processed in a word processor.
I suggest to change your assumptions and place the documentation beside the data whatever "beside" can mean.

We chose to take these notes and prepare a website of data preparation information. It is easily shared with others.

Another reason to document the preparation process is the notes can be used to build a data model, prepare an application
and document what you are doing in one central database. It is the stage we are at now and will be kept in mind for further development.
It will get sources of data to engage more in the preparing of the data and promote great value to any project.

Appendix 2: Help Exporting Excel Data into HTML

To prepare field names and other information in HTML format from an Excel spreadsheet it was discovered it was done in a clumsy way.
I discovered when in any Excel spreadsheet you can export the whole workbook as an HTML formatted webpage including worksheets at the bottom.
Do the following after opening any Excel spreadsheet:
Enable editting if not enabled
Click on File | Save As
Chose file type as Web page (*.htm, *.html)
Click on the Save button
You will find a file saved in the folder where the Excel file is located called:
   excel_file_name.htm (or .html)
Double click on the file.
The default browser will open with your data.

A folder is created with the same name as the excel file with extra "_files".
For example for Excel file excel_file_name.xlsx a folder you will find a folder called excel_file_name_files.
Go into that folder. You will see sheet files numbered sheet001.htm through sheet###.htm.
Search in the appropriate sheet file for the data you would like to publish online on a website.
You can copy the sheet data
and paste into your webpage for your webstie.

Appendix 3: Development, Testing and Production

There are three phases or stages to getting the survey database functioning properly.

Phase 1: Development
The development phase includes survey development changes.
Development phase includes database development, administration, restructuring, function development etc.
The development phase includes any changes made to a Windows application accessing the database survey data.
All changes are affected by users advice and suggestions, changes to the survey changes the db and application, and so on.

The new item I have added to the development phase is the uploading and integrating of the source data into the database.
We do not want to add data directly into production while we are still building database and application infrastructure to do the uploading.
A lot of bugs in development exist, especially developing something new.
It turns out there are a lot of "bugs" in the data as well. The development phase for I have heard some call the "Staging Area".
If this is what staging means it has been rediscovered.
Other names for stagings could be scaffolding, demonstrating, rendering, translation or characterization of the data.

Phase 2: Testing
The testing phase takes any or all of the work from the development phase and is tested to see if it works!
The same goes for the data. Most of the time, not all of the time, but most of the time, everything works as it should as defined at the time.
When it does not work here then we have dodged a bullet uploading into production. We have an opportunity to make more fixes before putting into production.

A major aspect of the testing phase is to improve documentation.
If something is not working when put into testing, the biggest culprit for this happening is the documentation was insufficient.
By this I mean if I would have documented better at development moving into testing would have been correct almost all of the time.

Phase 3: Production
After one or more iterations of placing the work into Testing with no problems, we are ready to put the data and associated db changes into production.
Placing the work into production has been the simplest phase to work through. So far, up to now, only 1 attempt at uploading is required once
the testing phase is completed properly. Data is uploaded as we understand it at the time. It may be other changes are required later.
We do that once the requirements are better known and scripts have been development in development phase,
tested in the testing phase then ready for production.
And the circle continues!

Summary of Phases:
In the past I would have skipped steps believing it would speed things up. I was wrong! Going through each step is actually quicker overall.
Documenting what is done at development phase is so important. Documentation is a must and speeds up the process.
After documentation has been written out, improved, editted and so on patterns emerge. The patterns provide insight to improvements how to develop
and automate the process moving from Development to Testing and into Production.