It has been a while since I posted, but here goes. I have been experimenting with Splunk to make one of my old processes better and easier.
In the past I have done file system assessments for customers to provide capacity planning information as well as usage patterns. I have collected the data in a number of ways depending on whether the platform is Windows, Linux, or a NAS device. In this post I will focus on collecting file system metadata from a Windows file server, using a PowerShell script to collect the data and dump it to a CSV file. Historically, the first step was to load the CSV output into SQL Server; I would then use the SQL connectivity and pivot charting functionality in Excel for the reporting. As I have been working with Splunk, it occurred to me that I could improve this process.
Another thought occurred to me as well: this process could be performed by Splunk owners with no impact on licensing costs. Splunk is licensed on daily ingest volume, and that volume can be exceeded 3 times per month without penalty. File system assessment data is typically collected only on a periodic basis, so it could be ingested without increasing the daily license capacity. Using the methods shown below, organizations that own Splunk could easily perform periodic file system assessments at no additional cost.
The first step is to collect the file system metadata using the following PowerShell script.
<#
.SYNOPSIS
    Collect file system metadata information to CSV file
.DESCRIPTION
    Uses configuration information from an XML file to scan file systems
.EXAMPLE
    StartXml-FileSystemMetadata.ps1 -XMLConfigurationFile "C:\Temp\Get-FileSystemMetadata.xml"
.NOTES
    Author: David Muegge
    Requirements: PowerShell Version 4

    Disclaimer
    ****************************************************************
    * DO NOT USE IN A PRODUCTION ENVIRONMENT UNTIL YOU HAVE TESTED *
    * THOROUGHLY IN A LAB ENVIRONMENT. USE AT YOUR OWN RISK. IF    *
    * YOU DO NOT UNDERSTAND WHAT THIS SCRIPT DOES OR HOW IT WORKS, *
    * DO NOT USE IT OUTSIDE OF A SECURE, TEST SETTING.             *
    ****************************************************************
#>
[CmdletBinding()]
Param(
    [Parameter(Mandatory=$false,Position=1)][String]$XMLConfigurationFile = "C:\Temp\FileSystemMetadata_Config.xml"
)

#region Functions

function Start-ScanFileStructure{
    [CmdletBinding()]
    Param(
        [Parameter(Mandatory=$true)][String]$ScanBasePath,
        [Parameter(Mandatory=$true)][String]$OutputCSVFile,
        [Parameter(Mandatory=$false)][String]$ShareName = $null,
        [Parameter(Mandatory=$false)][String]$KeywordSearch = $null,
        [Parameter(Mandatory=$false)][Int]$KeywordDirPart = 0,
        [Parameter(Mandatory=$false)][Switch]$MD5
    )

    Try{
        # Base property set for the CSV output; the MD5 and Keyword columns are added below
        $properties = @(
            @{Name="FileName";Expression={$_.Name}},
            "DirectoryName",
            @{Name="Extension";Expression={$_.Extension.Replace(".","")}},
            "Mode","CreationTime","LastAccessTime","LastWriteTime",
            @{Name="Bytes";Expression={$_.Length}},
            @{Name="Share";Expression={$ShareName}}
        )

        # MD5 hash of file contents - slow, but useful for identifying duplicate files
        if($MD5){
            $properties += @{Name="MD5";Expression={(Get-FileHash -Path $_.FullName -Algorithm MD5).Hash}}
        }else{
            $properties += @{Name="MD5";Expression={$null}}
        }

        # Keyword column: from a regex capture group, from a directory path element, or left empty
        if($KeywordSearch){
            $properties += @{Name="Keyword";Expression={$_.FullName -match $KeywordSearch | Out-Null; $matches[1]}}
        }elseif($KeywordDirPart){
            $properties += @{Name="Keyword";Expression={($_.FullName.Split("\"))[$KeywordDirPart]}}
        }else{
            $properties += @{Name="Keyword";Expression={$null}}
        }

        Get-ChildItem -Path $ScanBasePath -File -Recurse |
            Select-Object -Property $properties |
            Export-Csv -Path $OutputCSVFile -NoTypeInformation
    }Catch{
        Write-Error $_
    }
}

#endregion

#region MAIN

[Int64]$ErrorCount = 0

# Check that an XML configuration file was provided
if(-not $XMLConfigurationFile){
    Write-Error -Message "XMLConfigurationFile parameter not provided"
}else{
    Try{
        # Read configuration file
        [xml]$XMLConfig = Get-Content $XMLConfigurationFile
        $ErrorLogPath = $XMLConfig.config.ErrorLogPath
        $FSInstances  = $XMLConfig.config.FileSystem

        # Set up error log and start transcript
        $errorfile = "{0}\{1}{2}.{3}" -f $ErrorLogPath,"Get-FileSystemMetadata_Error_",(Get-Date -Format yyyyMMdd-HHmmss),"txt"
        if($host.name -eq 'ConsoleHost'){Start-Transcript $errorfile}
        Write-Verbose -Message ((Get-Date -Format yyyyMMdd-HHmmss) + " : Error log file created " + $errorfile)

        foreach($fs in $FSInstances){
            Write-Verbose -Message ((Get-Date -Format yyyyMMdd-HHmmss) + " : Scanning " + ($fs.scanpath))

            # Build the parameter set for this file system. Note the MD5 attribute arrives
            # from the XML as a string, so it must be compared to "True" rather than
            # treated as a boolean (the string "False" would otherwise evaluate as true)
            $scanParams = @{
                ScanBasePath  = $fs.scanpath
                OutputCSVFile = $fs.outputfile
                ShareName     = $fs.ShareName
            }
            if($fs.KeywordSearch){
                $scanParams.KeywordSearch = $fs.KeywordSearch
            }elseif($fs.KeywordDirPart){
                $scanParams.KeywordDirPart = $fs.KeywordDirPart
            }
            if($fs.MD5 -eq "True"){ $scanParams.MD5 = $true }

            Start-ScanFileStructure @scanParams
        }
    }Catch{
        Write-Error $_
        $ErrorCount++
    }Finally{
        # Stop transcript
        Write-Host ''
        Write-Host "-- Error Count: $ErrorCount ------ "
        if($host.name -eq 'ConsoleHost'){Stop-Transcript}
    }
}

#endregion
The script takes a single parameter that specifies the path to an XML configuration file (a default path is used if the parameter is omitted). Here is an example of how you would call the script.
StartXml-FileSystemMetadata.ps1 -XMLConfigurationFile "C:\Temp\Get-FileSystemMetadata.xml"
The XML configuration file defines the file system(s) to scan. Here is an example:
<config>
  <ErrorLogPath>C:\Temp</ErrorLogPath>
  <!--
  Notes

  XML properties for each file system to be scanned:

  scanpath       = UNC or local file path to be scanned
                   Note: the account executing the script must have full access to the entire file system to scan metadata successfully
  KeywordSearch  = optional regex used to search the path and populate the Keyword column
  KeywordDirPart = optional int used to populate the Keyword column with a specific directory part
  ShareName      = populates the Share field in the output file
  MD5            = (True or False) if True an MD5 hash is calculated to populate the MD5 output field
                   Note: this can take an extremely long time to complete; use with caution and test on a small set of data first.
                   Useful for identifying duplicate files.

  Examples:

  Simple scan - Keyword column is not populated
  <FileSystem scanpath="" outputfile="" />

  Directory scan capturing a directory part - Keyword column is populated with the 3rd element (0-based array) of the directory path
  <FileSystem scanpath="" outputfile="" KeywordDirPart="" />

  Directory scan capturing a regex match - Keyword column is populated with the regex match of the directory path
  <FileSystem scanpath="" outputfile="" KeywordSearch="" />

  Regex examples
  "C:\\Temp\\Users\\(.*)\\.*|C:\\Temp\\Users\\(.*)$"
  "C:\\Temp\\Users\\(.*[^\\])\\.*"
  "C:\\Temp\\Users\\([^\\]+)\\.*"
  -->
  <FileSystem scanpath="\\isilon\home" outputfile="C:\Data\Analysis\RTT_Lab\Collected_Data\FileSystem\homedata.csv" KeywordDirPart="4" ShareName="home" MD5="False"/>
  <FileSystem scanpath="\\isilon\cch" outputfile="C:\Data\Analysis\RTT_Lab\Collected_Data\FileSystem\cch.csv" ShareName="cch" MD5="False"/>
  <FileSystem scanpath="\\isilon\DataProtection" outputfile="C:\Data\Analysis\RTT_Lab\Collected_Data\FileSystem\DataProtection.csv" ShareName="DataProtection" MD5="False"/>
  <FileSystem scanpath="\\isilon\QHome" outputfile="C:\Data\Analysis\RTT_Lab\Collected_Data\FileSystem\QHome.csv" ShareName="QHome" MD5="False"/>
  <FileSystem scanpath="\\isilon\SmartQuotas" outputfile="C:\Data\Analysis\RTT_Lab\Collected_Data\FileSystem\SmartQuotas.csv" ShareName="SmartQuotas" MD5="False"/>
  <FileSystem scanpath="\\isilon\StorageTeam" outputfile="C:\Data\Analysis\RTT_Lab\Collected_Data\FileSystem\StorageTeam.csv" ShareName="StorageTeam" MD5="False"/>
</config>

And here is a sample of the CSV output produced by the scan:

"FileName","DirectoryName","Extension","Mode","CreationTime","LastAccessTime","LastWriteTime","Bytes","Share","MD5","Keyword"
"AppEvent.cls","\\FS01\Archive_Data\Code Library\!DAM Libraries\Classes\VB6\WriteNTEventLogs","cls","-----","8/17/2011 2:45:58 AM","1/7/2015 11:12:52 AM","7/20/2001 1:20:14 PM","19212","Archive_Data",,"Code Library"
"Common.bas","\\FS01\Archive_Data\Code Library\!DAM Libraries\Classes\VB6\WriteNTEventLogs","bas","-----","8/17/2011 2:45:58 AM","1/7/2015 11:12:52 AM","5/6/2001 4:25:58 PM","9335","Archive_Data",,"Code Library"
"FormatMsg.bas","\\FS01\Archive_Data\Code Library\!DAM Libraries\Classes\VB6\WriteNTEventLogs","bas","-----","8/17/2011 2:45:58 AM","1/7/2015 11:12:52 AM","5/6/2001 4:57:50 PM","10240","Archive_Data",,"Code Library"
"MSSCCPRJ.SCC","\\FS01\Archive_Data\Code Library\!DAM Libraries\Classes\VB6\WriteNTEventLogs","SCC","--r--","8/17/2011 2:45:58 AM","1/7/2015 11:12:52 AM","12/30/2005 2:46:09 PM","192","Archive_Data",,"Code Library"
"modError.vb","\\FS01\Archive_Data\Code Library\!DAM Libraries\Modules\VB.NET","vb","-----","8/17/2011 2:45:58 AM","1/7/2015 11:12:41 AM","6/24/2005 11:40:35 AM","3762","Archive_Data",,"Code Library"
The example includes an explanation of the optional attributes for the FileSystem element(s). These attributes control how the data is organized and tagged, which provides more useful reporting options. The sample also shows several configuration examples and some example output. Once the metadata is collected into CSV files, it can easily be loaded into Splunk using the ad-hoc data upload feature or a file monitor input on a forwarder. Here is an example of a file metadata record/event in Splunk.
One thing to note here is the event timestamp in Splunk. The _time field is derived from the modified time of the file. This was done on purpose: in my experience doing file system assessments, the modified time is the only timestamp that is generally accurate. I have found many cases where the last accessed time is unavailable or incorrect, and many where the create date is wrong; sometimes the create date is more recent than the modified date, and occasionally it is even in the future. Here is the Splunk props.conf sourcetype stanza I defined for the data. It uses TIMESTAMP_FIELDS and TIME_FORMAT to map the modified date to the _time field, and it sets MAX_DAYS_AGO high since the modified date can go back many years.
[posh_file_metadata]
DATETIME_CONFIG =
INDEXED_EXTRACTIONS = csv
KV_MODE = none
MAX_DAYS_AGO = 100000
NO_BINARY_CHECK = true
SHOULD_LINEMERGE = false
TIMESTAMP_FIELDS = LastWriteTime
TIME_FORMAT = %m/%d/%Y %I:%M:%S %p
category = Structured
description = File System MetaData
disabled = false
pulldown_type = true
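If the CSV files are picked up by a forwarder rather than uploaded ad hoc, the input side is a simple monitor stanza in inputs.conf. This is a minimal sketch, pointing at the output directory from the XML example above; the dedicated index name fs_assessment is a placeholder of my own choosing:

[monitor://C:\Data\Analysis\RTT_Lab\Collected_Data\FileSystem\*.csv]
sourcetype = posh_file_metadata
index = fs_assessment
disabled = false

Because the sourcetype uses INDEXED_EXTRACTIONS, the props.conf stanza above needs to be deployed on the forwarder as well, since structured data is parsed there.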
Once the data is loaded into Splunk, here is the type of information we can easily find.
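Charts like these are simple aggregations over the extracted fields. As an illustrative example (not the exact searches behind the screenshots), a search along these lines, using the field names from the CSV output above, drives a capacity-by-share view:

sourcetype=posh_file_metadata
| stats count AS FileCount sum(Bytes) AS TotalBytes by Share
| eval TotalGB=round(TotalBytes/1024/1024/1024,2)
| sort - TotalGB

Swapping Share for Extension reports usage by file type, and since _time carries the modified date, a simple "| timechart span=1mon sum(Bytes)" shows how much of the capacity has not been modified in months or years.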
These are just some simple charts, but the metadata supports many more reporting options. There are also benefits to using Splunk beyond the fact that it can be done at no additional license cost.
- This eliminates the need to create tables and/or ETL processes for a relational database
- Loading the data is much easier than loading it into a relational database
- The dashboards can easily be reused for new assessments. Simply use a dedicated index and clean it, or delete and recreate it, as needed for updated reporting (see the sketch after this list)
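That reset step is a one-liner with the Splunk CLI. A minimal sketch, again assuming a hypothetical dedicated index named fs_assessment; note that clean is destructive and requires splunkd to be stopped:

splunk stop
splunk clean eventdata -index fs_assessment -f
splunk start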
If I were doing this in an environment I managed on a day-to-day basis, I would send the data directly to Splunk via the HTTP Event Collector. I'll need to modify the collection code a bit to provide an example, but I'll try to post a follow-up with that info.
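As a rough sketch of that idea (not the follow-up itself), the per-file metadata could be posted as JSON events to a HEC endpoint. Everything environment-specific below is a placeholder: the endpoint URL, the token, and the sourcetype name, which would be a new JSON-friendly sourcetype rather than the CSV sourcetype defined above.

# Rough sketch only -- illustrative, not the follow-up code.
# Posts each file's metadata to a Splunk HTTP Event Collector endpoint.
# $HecUri, $HecToken, and the sourcetype name are placeholders.
$HecUri    = "https://splunk.example.com:8088/services/collector/event"
$HecToken  = "00000000-0000-0000-0000-000000000000"
$UnixEpoch = New-Object DateTime 1970,1,1,0,0,0,([DateTimeKind]::Utc)

Get-ChildItem -Path "\\isilon\home" -File -Recurse | ForEach-Object {
    $body = @{
        sourcetype = "posh_file_metadata_json"   # hypothetical JSON sourcetype
        # Use the file's modified time as the event time, matching the CSV approach
        time       = [int64]($_.LastWriteTime.ToUniversalTime() - $UnixEpoch).TotalSeconds
        event      = @{
            FileName      = $_.Name
            DirectoryName = $_.DirectoryName
            Extension     = $_.Extension.Replace(".","")
            LastWriteTime = $_.LastWriteTime.ToString("M/d/yyyy h:mm:ss tt")
            Bytes         = $_.Length
            Share         = "home"
        }
    } | ConvertTo-Json
    Invoke-RestMethod -Uri $HecUri -Method Post -Headers @{ Authorization = "Splunk $HecToken" } -Body $body
}

In practice you would batch many events into each POST; one HTTP request per file would be far too slow for a large file system.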
I hope some folks find this useful.
Regards,
Dave