This guide explains Summation Pro’s Predictive Coding feature and suggests a workflow for using it. It was written using version 6.3.
Cluster Analysis
The Predictive Coding algorithm is built on our Cluster Analysis processing feature. During processing, Cluster Analysis examines each document and builds a list of KeyWord pairs for it. When Predictive Coding is applied to the database, it compares each document’s KeyWord pairs against the KeyWord pairs from the documents you marked as Responsive. If a document shares a certain number of KeyWord pairs with your Responsive documents, the Predictive Coding algorithm determines that document to be Relevant.
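To make the matching idea concrete, here is a minimal Python sketch. It is not Summation’s actual implementation: the function names, the treatment of a KeyWord pair as an unordered pair of keywords, and the min_matches threshold are all assumptions made for illustration only.

```python
from itertools import combinations

def keyword_pairs(keywords):
    """Return the set of unordered keyword pairs for one document (assumed representation)."""
    unique = sorted(set(k.lower() for k in keywords))
    return set(combinations(unique, 2))

def predict_relevant(doc_keywords, responsive_docs_keywords, min_matches=2):
    """Flag a document when enough of its pairs also occur in Responsive documents.

    min_matches is a hypothetical threshold; the real product's value is not documented here.
    """
    responsive_pairs = set()
    for kws in responsive_docs_keywords:
        responsive_pairs |= keyword_pairs(kws)
    matches = keyword_pairs(doc_keywords) & responsive_pairs
    return len(matches) >= min_matches

# Keywords from two documents a reviewer marked as Responsive (made-up data).
responsive = [["merger", "pricing", "contract"], ["pricing", "rebate"]]

print(predict_relevant(["contract", "pricing", "merger", "lunch"], responsive))
print(predict_relevant(["lunch", "parking"], responsive))
```

The first document shares several pairs with the Responsive documents and is flagged. The second shares none and is not.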
This matters because Cluster Analysis may not cover your entire data set in one processing session, so you may have to run it multiple times. The “ClusterID” column will tell you how many of your items have been analyzed and how many have not. Re-run Cluster Analysis until every document has been assigned a ClusterID in this column. Any document that does not have a ClusterID will not be considered by the predictive coding engine.
To run Cluster Analysis again, click the green + sign for “Add Data to the Project”, then choose “Cluster Analysis”.
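If you keep a copy of your item list outside Summation, a quick count of blank ClusterID values tells you whether another Cluster Analysis pass is needed. The sketch below assumes a CSV copy of the item list that includes the ClusterID column. The CSV export itself, the DocID column, and the sample rows are assumptions for illustration.

```python
import csv
import io

def cluster_coverage(rows, column="ClusterID"):
    """Count rows with and without a value in the ClusterID column."""
    analyzed = sum(1 for row in rows if (row.get(column) or "").strip())
    return analyzed, len(rows) - analyzed

# Hypothetical export; in practice you would open your own CSV file instead.
sample_export = io.StringIO(
    "DocID,ClusterID\n"
    "1001,17\n"
    "1002,\n"
    "1003,4\n"
    "1004,\n"
)
rows = list(csv.DictReader(sample_export))
analyzed, pending = cluster_coverage(rows)
print(f"{analyzed} analyzed, {pending} still need Cluster Analysis")
```

When the pending count reaches zero, every document is available to the predictive coding engine.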
Seed Set
A Seed Set is a random subset of documents that is representative of the entire collection under consideration. When using predictive coding, you will need to manually review approximately 10% of your entire data set. This 10% is the Seed Set that trains the predictive coding algorithm. Be sure that your 10% adequately reflects the entire collection; do not just review the first 10% in the Item List. If you have multiple sources of data (email, computers, network shares, phones, etc.), include a random sampling from each source in your subset. If the 10% you reviewed all came from the same source, you will not adequately train the predictive coding algorithm to find relevant data in the other sources. My workflow suggestion is to create a Label specifically for your Seed Set. It can be a single Label covering all sources, or a Label Group that separates the sources.
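One way to pick a Seed Set that covers every source is to draw the random 10% separately from each source. The following Python sketch shows the idea outside of Summation. The (document ID, source) input format and the function name are assumptions, and you would still need to apply your Seed Set Label to the chosen documents yourself.

```python
import random
from collections import defaultdict

def stratified_seed_set(items, percent=10, seed=None):
    """Pick a random sample of about `percent` of the documents from each source."""
    rng = random.Random(seed)
    by_source = defaultdict(list)
    for doc_id, source in items:
        by_source[source].append(doc_id)
    seed_set = []
    for source, doc_ids in by_source.items():
        # Ceiling division, with at least one document per source.
        k = max(1, (len(doc_ids) * percent + 99) // 100)
        seed_set.extend(rng.sample(doc_ids, k))
    return seed_set

# Made-up collection: 200 email items and 50 phone items.
items = [(i, "Email") for i in range(200)] + [(i, "Phone") for i in range(200, 250)]
seed_set = stratified_seed_set(items, seed=42)
print(len(seed_set), "documents selected for the Seed Set")
```

Sampling per source keeps a small source, such as the phone data here, from being skipped by chance, which a single random draw over the whole collection could do.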
Coding the Seed Set
The ReviewResponsiveness field is the field used to train the predictive coding algorithm. Ideally, your seed set would be roughly 50/50 Responsive/Not Responsive. You do not have to add any additional KeyWords; the Cluster Analysis feature determines the keywords that will be used. The ReviewResponsiveness field and the Predictive Coding tagging layout were made intentionally cumbersome to ensure that each and every document is considered for relevancy when coding to train the algorithm. Therefore, you cannot bulk code this field, you cannot add it to another tagging layout, and you cannot use the “Apply Previous” button.
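To see how close your coded seed set is to the 50/50 ideal, you can tally the ReviewResponsiveness values. This is a small sketch over a plain list of coded values. How you get those values out of Summation is not covered here, and the sample numbers are made up.

```python
from collections import Counter

def responsiveness_balance(codes):
    """Return the share of coded documents marked Responsive, or None if nothing is coded."""
    counts = Counter(codes)
    coded = counts["Responsive"] + counts["Not Responsive"]
    if coded == 0:
        return None
    return counts["Responsive"] / coded

# Made-up seed set: 30 Responsive and 70 Not Responsive documents.
codes = ["Responsive"] * 30 + ["Not Responsive"] * 70
share = responsiveness_balance(codes)
print(f"{share:.0%} of coded documents are Responsive")
```

A share far from 50% suggests adding more documents of the under-represented kind to the seed set before training.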
Confidence Score
Once you think you have coded enough documents to train the system, you can test whether the system is ready by calculating a Confidence Score. You can do this from the “Confidence” panel within Summation.
In most uses of Predictive Coding in the legal world, both parties agree to a confidence score before predictive coding is applied. This could be as low as around 80%, while some parties may not agree to anything less than 90-95%. What score is acceptable is up to you and the parties involved in the suit. If you are happy with the Confidence Score, you can move on to applying predictive coding. If the score is lower than you would like, you need to code more documents and then calculate the score again.
Applying Predictive Coding
If you are satisfied with your Confidence Score, choose “Predictive Coding” from the Actions menu on the Confidence panel, then click “Go”.
You can follow the progress and see the final outcome of applying predictive coding in the WorkList.
Viewing Results
To view the documents that are now Responsive, filter on the ReviewResponsiveness column or the SetBy column. The SetBy column will say Predictively Coded for all items reviewed by the algorithm, even the ones you manually coded.
The Tagging Layout can tell you whether the document you are currently looking at was Manually Coded or Predictively Coded.